Towards a computation of relevance
- ML: relevance distinguishes data from noise
- ML: what is relevant is what is worth learning/memorizing
- ML: an outlier is not relevant...
- ... or is highly relevant, as an anomaly
- XAI: an explanation is relevant if it can convince
- NLG: an utterance is relevant if it is appropriate
- AI: by definition (following A. Turing), an intelligent machine must be relevant.
- → A fundamental problem for AI!
- → A fundamental problem for science!
- → The most fundamental scientific problem!
What do they have in common?
- noise
- useful data
- outliers
- anomalies
- explanation
- intelligence
The answer lies in Algorithmic Information Theory.
Algorithmic Information Theory (AIT)
Intuition: What is relevant is what brings information.
- Information is what remains once all redundancy has been removed
- Information is measured in bits
- Shannon information is a statistical average over repeated events.
- It is therefore not suited to measuring relevance.
Towards a computation of relevance
- Relevance & Machine Learning
- Relevance & AIQ (artificial IQ)
- Relevance of events
Algorithmic Information Theory (AIT)
Intuition: What is relevant is what brings information.
Representing data in a relevant way.
Example: the relevant clusters are those that "bring" the most information.
An outlier is either noise, because it artificially inflates the information content,
or an anomaly, because characterizing it requires little information.
Algorithmic Information Theory (AIT)
Definition of algorithmic information
(Ray Solomonoff 1964, Andrei Kolmogorov 1965, Gregory Chaitin 1966):
$$C(x) = \min\{|p| \,;\, U(p) = x\}.$$
The size of the shortest program
that generates \(x\) on the machine \(U\).
Thesis:
Relevance corresponds to a drop in complexity
= compression
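Since \(C(x)\) is not computable, a common practical stand-in (an assumption added here, not part of the slides) is the length of a losslessly compressed encoding, which upper-bounds \(C(x)\) up to an additive constant. A minimal Python sketch:

```python
import os
import zlib

def complexity_proxy(data: bytes) -> int:
    """Crude upper bound on C(data), in bits: the length of a lossless compression.
    C(x) itself is uncomputable; a compressor only gives an upper bound (up to a constant)."""
    return 8 * len(zlib.compress(data, 9))

redundant = b"ab" * 500            # 1000 bytes with almost no information
incompressible = os.urandom(1000)  # 1000 random bytes, near-maximal information

print(complexity_proxy(redundant))       # small: the redundancy is squeezed out
print(complexity_proxy(incompressible))  # close to 8000 bits (slightly above, due to overhead)
```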
AIT & Machine learning
Prototype-based models
A prototype is a point of the input space used to provide information about the distribution of data points.
Example: the K-means algorithm.
AIT & Machine learning
Compression
Prototype-based models can be seen as compression methods: instead of describing data points by specifying their absolute coordinates, one considers their coordinates relative to the closest prototype. We get smaller numbers, hence the compression.
AIT & Machine learning
Compression
The complexity of a point \(x_j\) in the dataset \(X\) is bounded by the complexity of its coordinates \((x_{j_1}, \dotsc, x_{j_d})\):
$$ C(x_j) \leq C(x_{j_1}) + \dotsc + C(x_{j_d}),$$
where the complexity of coordinates depends on the precision \(\delta\) used to convert them into integers.
We get a "naive" upper bound on the complexity of the whole dataset \(X\):
$$C(X) \leq \sum_j C(x_j).$$
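A rough illustration of this naive bound, as a sketch under two assumptions not prescribed by the slides: coordinates are quantized with precision \(\delta\), and each integer is coded with a simple log-magnitude-plus-sign code.

```python
import math
import numpy as np

def int_cost_bits(n: int) -> float:
    """Toy code length for one integer: log2 of its magnitude plus one sign bit."""
    return math.log2(abs(n) + 1) + 1

def naive_bound_bits(X, delta: float = 0.01) -> float:
    """Naive upper bound on C(X): every coordinate of every point is coded on its own,
    after quantization with precision delta."""
    q = np.rint(np.asarray(X, dtype=float) / delta).astype(int)
    return float(sum(int_cost_bits(int(v)) for v in q.ravel()))

X = np.random.default_rng(0).normal(size=(100, 2))  # illustrative data
print(naive_bound_bits(X))
```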
AIT & Machine learning
Compression
If prototypes \(\{p_k\}\) are introduced,
the complexity of \(X\) is now:
$$ C(X) \leq \sum_k C(p_k) + \sum_j C(x_j|\{p_k\}),$$
where:
\(C(x_j|\{p_k\}) = \min_k C(x_j-p_k).\)
If the dataset exhibits some structure, suitable prototypes can be found and this estimate of \(C(X)\) will turn out to be significantly smaller than the "naive" one (hence the compression).
Conversely, and quite remarkably, this expression can be used to find the best prototype set \(\{p_k\}\)
(Cilibrasi & Vitányi, 2005).
AIT & Machine learning
Compression
Prototypes \(\{p_k\}\) are relevant inasmuch as they lead to compression.
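A sketch of this two-part bound, reusing `naive_bound_bits` from the previous sketch. K-means (via scikit-learn) is used here merely as a convenient way to obtain candidate prototypes; the slides' point is that the bound itself can serve as the selection criterion.

```python
import numpy as np
from sklearn.cluster import KMeans

def prototype_bound_bits(X, K: int, delta: float = 0.01) -> float:
    """Two-part bound on C(X): cost of the K prototypes
    plus cost of each point's coordinates relative to its closest prototype."""
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
    protos = km.cluster_centers_
    bits = naive_bound_bits(protos, delta)        # sum_k C(p_k)
    offsets = X - protos[km.labels_]              # x_j relative to its assigned (closest) prototype
    bits += naive_bound_bits(offsets, delta)      # sum_j C(x_j | {p_k})
    return bits
```

On clustered data the offsets are small numbers, so this estimate falls well below the naive bound; on structureless data it does not, and the prototypes bring no compression, hence no relevance.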
AIT & Machine learning
Classification
Now consider a classification task as an example of supervised learning.
The problem is to associate a label (e.g. blue/red/dark) with each data point.
Prototypes can be used here as well; in doing so, they offer a compressed version of the classification.
AIT & Machine learning
Classification
In a classification task, we consider a dataset \(X\) of \(|X|\) points \(x_j\). Each \(x_j\) is supposed to belong to a class labelled as \(y_j\). The problem is to learn which label \(y_j\) taken from the set of labels \(Y\) corresponds to \(x_j\).
The complexity of the classification is bounded by the amount of information required by rote learning:
$$C(\{(x_j, y_j)\} | X) \leq \sum_j C(y_j) \approx |X| \times \log(|Y|).$$
AIT & Machine learning
Classification
A prototype-based model \(\mathcal{M}\) deduces the labels of all points from those of its \(K\) prototypes \(\{p_k\}\), in the hope that for every point:
$$\beta(x_j) = y_j.$$
The decision function \(\beta\) simply returns the label of the closest prototype.
AIT & Machine learning
Classification
The complexity of the model is given by the complexity of its prototypes and their labels:
$$C(\mathcal{M}) \leq \sum_k C(p_k) + K \times \log (|Y|).$$
Now:
$$C(\{(x_j, y_j)\} | X) \leq C(\mathcal{M}) + \sum_{j} {C((x_j, y_j) | X, \mathcal{M})}.$$
Since the decision function \(\beta\) is just a loop over the prototypes, \(C((x_j, y_j) | X, \mathcal{M}) \approx 0\), so that:
$$C(\{(x_j, y_j)\} | X) \leq C(\mathcal{M}) \ll |X| \times \log(|Y|) \quad \text{when } K \ll |X|.$$
We only have to remember the prototypes and their labels to retrieve all the labels, hence the compression.
AIT & Machine learning
Classification
This is only true, however, if the model \(\mathcal{M}\) is perfect. If not,
\(C((x_j, y_j) | X, \mathcal{M}) \neq 0\).
For each misclassified point, we may just specify its index and the correct value of its label:
$$\sum_{j}{C((x_j, y_j) | X, \mathcal{M})} \leq \sum_{i=1}^{|X|} (C(i) + C(y_i)) \times \mathbb{I}(y_i \neq \beta(x_i)),$$
where \(\mathbb{I}\) is the indicator function (1 when the condition holds, 0 otherwise).
More simply, if there are \(n\) misclassified points, we get:
$$\sum_{j}{C((x_j, y_j) | X, \mathcal{M})} \leq n \times (\log(|X|) + \log(|Y|)).$$
AIT & Machine learning
Classification
To sum up:
$$C(\{(x_j, y_j)\} | X) \leq \sum_k C(p_k) + K \times \log (|Y|) + n \times (\log(|X|) + \log(|Y|)).$$
If \(n\) is small, this may still represent a huge compression compared with rote learning.
The right-hand side of this inequality can be used as a loss function to determine the best prototype-based classification, as sketched below.
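A hedged sketch of that loss, under two assumptions the slides do not fix: `naive_bound_bits` from the earlier sketch approximates \(\sum_k C(p_k)\), and logarithms are base 2.

```python
import math
import numpy as np

def classification_dl_bits(X, y, protos, proto_labels, n_labels, delta: float = 0.01) -> float:
    """Right-hand side of the inequality above, used as a loss:
    sum_k C(p_k) + K*log|Y| + n*(log|X| + log|Y|)."""
    K = len(protos)
    model_bits = naive_bound_bits(protos, delta) + K * math.log2(n_labels)
    # beta: the nearest-prototype decision function
    dists = np.linalg.norm(np.asarray(X)[:, None, :] - np.asarray(protos)[None, :, :], axis=2)
    beta = np.asarray(proto_labels)[np.argmin(dists, axis=1)]
    n_errors = int(np.sum(beta != np.asarray(y)))
    # each exception is corrected by giving its index and its true label
    error_bits = n_errors * (math.log2(len(X)) + math.log2(n_labels))
    return model_bits + error_bits
```

Minimizing this quantity over candidate prototype sets trades model complexity against the cost of exceptions, which is exactly the MDL tradeoff of the next slide.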
Minimum description length (MDL)
Machine learning: find a model \(\mathcal{M}\) to compress data \(\mathcal{D} = \{d_j\}\).
$$C(\mathcal{D}) \le C(\mathcal{M}) + \sum_j{C(d_j | \mathcal{M})}.$$
The most relevant model achieves the best tradeoff between its own complexity and the complexity of exceptions.
We may then keep only the model
(= lossy compression if the model isn’t perfect)
to predict future data.
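For instance, the number of prototypes can itself be chosen by this tradeoff. A minimal sketch reusing `prototype_bound_bits` from the earlier sketch (the scan over \(K\) and its range are illustrative assumptions):

```python
# X is assumed to be a NumPy array of data points (n_points x n_dims).
best_K, best_bits = None, float("inf")
for K in range(1, 21):
    bits = prototype_bound_bits(X, K)   # model cost + cost of the residuals
    if bits < best_bits:
        best_K, best_bits = K, bits
print(best_K, best_bits)  # the K that compresses best, i.e. the most relevant model size
```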
Towards a computation of relevance
- Relevance & Machine Learning
- Relevance & AIQ (artificial IQ)
- Relevance of events
Artificial IQ
Guess what comes after:
1,2,2,3,3,3,4,4,4,4... ?
Can you figure out a relevant answer?
Two candidate continuations:
1,2,2,3,3,3,4,4,4,4,5,5,5,5,5
1,2,2,3,3,3,4,4,4,4,4,4,3,4
The most relevant answer achieves the greatest compression: \(n\) repeated \(n\) times (\(n^{*n}\)).
The most relevant answer minimizes the overall complexity of the sequence.
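A toy illustration of "minimizing the overall complexity": the relevant continuation is the one reproduced by a very short generator, so the whole sequence needs almost no information beyond that generator. The generator below is, of course, just one hand-written candidate program.

```python
def n_times_n(length):
    """A tiny 'program' for the sequence: each integer n repeated n times."""
    out, n = [], 1
    while len(out) < length:
        out.extend([n] * n)
        n += 1
    return out[:length]

prefix = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
relevant = prefix + [5, 5, 5, 5, 5]
arbitrary = prefix + [4, 4, 3, 4]

print(n_times_n(len(relevant)) == relevant)    # True: a short program explains the whole sequence
print(n_times_n(len(arbitrary)) == arbitrary)  # False: this continuation must be listed verbatim
```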
Artificial IQ
Analogical equation
\(A\) → \(B\)
\(C\) → \(x\)
"\(A\) is to \(B\) as \(C\) is to \(x\)"
Doug Hofstadter is well known for his original study of analogy. Let us consider one of his favourite examples.
\(abc\) → \(abd\)
\(ppqqrr\) → \(x\)
Have you come up with a relevant solution?
Most people would instantly answer
\(ppqqrr\) → \(ppqqss\)
Why should we consider it a more relevant solution than, say,
\(ppqqrr\) → \(ppqqrd\)?
We have to compare two rules:
1. increment the last item
2. replace the last letter by d
Rule 2 has two drawbacks in terms of complexity:
- it introduces a constant, d;
- if we do not want to lose the structure of \(ppqqrr\), the rule should read: "break the last item back into letters, and replace the last letter by d".
This may explain why \(x = ppqqss\) appears more relevant, and thus smarter.
Artificial IQ
Analogical equation -- Theory
\(A\) → \(B\)
\(C\) → \(x\)
"\(A\) is to \(B\) as \(C\) is to \(x\)"
The most relevant answer minimizes the overall complexity:
$$ x = \arg\min_x C(A, B, C, x).$$
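A deliberately crude sketch of this argmin on the Hofstadter example: candidate transformation rules are enumerated and given hand-assigned description costs (both the rule set and the bit costs are assumptions made for illustration, not the actual complexity measure behind the slides). The selected \(x\) is produced by the cheapest rule that also explains \(A \to B\).

```python
def increment_last_group(s: str) -> str:
    """'Increment the last item': bump the final run of identical letters."""
    i = len(s) - 1
    while i > 0 and s[i - 1] == s[i]:
        i -= 1
    return s[:i] + chr(ord(s[i]) + 1) * (len(s) - i)

def replace_last_letter_with_d(s: str) -> str:
    """'Replace the last letter by d': carries the extra constant 'd'."""
    return s[:-1] + "d"

RULES = [
    (increment_last_group, 10.0),        # illustrative cost in bits
    (replace_last_letter_with_d, 14.0),  # higher: it introduces a constant
]

A, B, C = "abc", "abd", "ppqqrr"
candidates = [(rule(C), cost) for rule, cost in RULES if rule(A) == B]
print(min(candidates, key=lambda t: t[1]))   # ('ppqqss', 10.0)
```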
Artificial IQ
Analogical equation -- Application
\(rosa\) → \(rosam\)
\(vita\) → \(x\)
Answer: \(vita\) → \(vitam\).
Children learn their mother tongue from limited evidence. Analogy may play an important role.
A system based on the complexity principle performed better than the state of the art on lexical analogies.
Towards a computation of relevance
- Relevance & Machine Learning
- Relevance & AIQ (artificial IQ)
- Relevance of events
Relevance of events
What makes an event relevant?
Applications:
- Personal news feed, chatbots
- Security (e.g. nuclear plant)
- Smart homes
- Diagnosis
- Abduction, CBR
- Anomaly detection
- Fraud detection
- . . .
Anomaly detection
What counts as an anomaly?
- outlier
- impossibility
- unanticipated events
- coincidences
Anomaly detection
A plagiarism story
$$ U = C(N) - C(A) $$
Unexpected events
Simple events seem unexpected, and thus relevant.
This effect can be quantified.
Unexpected events
Fortuitous encounter
$$ U(s) = C(L) - C(p).$$
Subjective (im)probability
relevance = complexity drop
$$ U(s) = C_w(s) - C(s) $$
$$ p = 2^{-U} $$
where \(C_w(s)\) is the generation complexity of the situation \(s\) and \(C(s)\) its description complexity.
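A toy numerical sketch of \(U(s) = C_w(s) - C(s)\) for lottery-like digit strings, under two stated assumptions: description complexity is approximated by zlib-compressed length (a loose upper bound, so \(U\) for a typical draw comes out around or below zero, i.e. "not unexpected"), and generation complexity is the log-number of equiprobable outcomes.

```python
import math
import random
import zlib

def descr_bits(s: str) -> int:
    """Proxy for description complexity C(s): compressed length in bits."""
    return 8 * len(zlib.compress(s.encode(), 9))

def generation_bits(n_digits: int) -> float:
    """Generation complexity C_w(s) of a uniformly drawn digit string: log2 of the number of outcomes."""
    return n_digits * math.log2(10)

random.seed(0)
typical = "".join(random.choice("0123456789") for _ in range(300))  # an ordinary draw
striking = "123456" * 50                                            # same length, conspicuously regular

for s in (typical, striking):
    U = generation_bits(len(s)) - descr_bits(s)   # U(s) = C_w(s) - C(s)
    print(round(U, 1), "bits")                    # subjective probability p = 2**(-U), negligible when U is large
```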
Towards a computation of relevance
Conclusion
Relevance
- in the context of machine learning
- in the context of induction (IQ tests)
- for events (anomalies, unexpectedness, ...)
- (and for actions)
is characterized by a drop in complexity
(in Kolmogorov's sense).
relevance = complexity drop
This observation opens the way to a computation of relevance.
Towards a computation of relevance
To find out more...