Data&IA Webinar @ IMT - https://mintel.imt.fr/voir.php?id=5212
Towards a computation of relevance
Jean-Louis Dessalles
www.dessalles.fr/slideshows
Télécom Paris

Representing data in a relevant way.
Example: the relevant clusters are those that "bring in" the most information.
An outlier is noise because it artificially inflates the information content.
Definition of algorithmic information: the complexity \(C(s)\) of an object \(s\) is the length of the shortest description (the smallest program) that generates \(s\).
Thesis: relevance corresponds to a drop in complexity, i.e. compression.
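This thesis can be given a rough operational flavour: any off-the-shelf compressor yields a crude upper bound on \(C(s)\). The sketch below is my own illustration, not part of the slides (zlib is an arbitrary choice of compressor); it shows that a structured string gets a much lower complexity estimate than a random one of the same length.

```python
import random
import string
import zlib

def complexity_proxy(s: str) -> int:
    """Crude upper bound on C(s): size in bytes of the zlib-compressed string."""
    return len(zlib.compress(s.encode()))

structured = "ab" * 500                                               # highly regular
random_str = "".join(random.choices(string.ascii_lowercase, k=1000))  # no structure

print(complexity_proxy(structured))   # small: the regularity allows compression
print(complexity_proxy(random_str))   # much larger: hardly any structure to exploit
```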
Prototype-based models
A prototype is a point of the input space used to provide information about the distribution of data points.
Example: K-Means algorithm.
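As an illustration of a prototype-based model (a sketch of my own, not taken from the slides), here is a minimal K-Means in Python/NumPy; the prototypes are the cluster centroids.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal K-Means: returns K prototypes (centroids) and point assignments."""
    rng = np.random.default_rng(seed)
    prototypes = X[rng.choice(len(X), size=K, replace=False)]   # initial prototypes
    for _ in range(n_iter):
        # Assign each point to its closest prototype.
        distances = np.linalg.norm(X[:, None, :] - prototypes[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each prototype to the mean of the points assigned to it.
        new_prototypes = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                   else prototypes[k] for k in range(K)])
        if np.allclose(new_prototypes, prototypes):
            break
        prototypes = new_prototypes
    return prototypes, labels
```

Usage, e.g.: `prototypes, labels = kmeans(np.random.rand(200, 2), K=3)`.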
Compression
Prototype-based models can be seen as compression methods: instead of describing data points by specifying their absolute coordinates, each point can be described by its (usually shorter) offset from the nearest prototype.
The complexity of a point \(x_j\) in the dataset \(X\) is bounded by the complexity of its coordinates \((x_{j_1}, \dotsc, x_{j_d})\):
$$ C(x_j) \leq C(x_{j_1}) + \dotsc + C(x_{j_d}),$$
where the complexity of coordinates depends on the precision \(\delta\) used to convert them into integers.
If prototypes \(\{p_k\}\) are introduced, the complexity of \(X\) is now:
$$ C(X) \leq \sum_k C(p_k) + \sum_j C(x_j|\{p_k\}),$$
where:
\(C(x_j|\{p_k\}) = \min_k C(x_j-p_k).\)
If the dataset exhibits some structure, suitable prototypes can be found and this estimate of \(C(X)\) will turn out to be significantly smaller than the "naive" one (hence the compression).
Conversely, and quite remarkably, this expression can be used to find the best prototype set \(\{p_k\}\)
(Cilibrasi & Vitányi, 2005).
Prototypes \(\{p_k\}\) are relevant inasmuch as they lead to compression.
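The two-part estimate above can be turned into a concrete (if crude) score. In the sketch below (my own illustration; the bit-count proxy for complexity and the helper names are assumptions, not from the slides), coordinates are quantized at precision \(\delta\) and their complexity is approximated by the number of bits of the resulting integers.

```python
import numpy as np

def int_bits(v):
    """Crude complexity proxy: bits needed to write the integer |v|, plus a sign bit."""
    return 1 + int(np.ceil(np.log2(abs(v) + 1)))

def naive_code_length(X, delta=0.01):
    """Bound on C(X) without prototypes: encode absolute coordinates at precision delta."""
    Q = np.round(X / delta).astype(int)
    return sum(int_bits(v) for v in Q.ravel())

def prototype_code_length(X, prototypes, delta=0.01):
    """Two-part bound: encode the prototypes, then each point as an offset
    from its closest prototype (C(x_j | {p_k}) = min_k C(x_j - p_k))."""
    cost = naive_code_length(prototypes, delta)          # sum_k C(p_k)
    for x in X:
        offsets = np.round((x - prototypes) / delta).astype(int)
        cost += min(sum(int_bits(v) for v in off) for off in offsets)
    return cost
```

Comparing `prototype_code_length(X, prototypes)` for prototype sets of various sizes (for instance the centroids returned by the K-Means sketch above, for different values of K) gives a rough way to select the set that compresses \(X\) best, in the spirit of (Cilibrasi & Vitányi, 2005).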
Classification
Now consider a classification task as an example of supervised learning.
In a classification task, we consider a dataset \(X\) of \(|X|\) points \(x_j\). Each \(x_j\) is supposed to belong to a class labelled \(y_j\). The problem is to learn which label \(y_j\), taken from the set of labels \(Y\), corresponds to each \(x_j\).
The complexity of the classification is bounded by the amount of information required for rote learning:
$$C(\{(x_j, y_j)\} | X) \leq \sum_j C(y_j) \approx |X| \times \log(|Y|).$$
A prototype-based model \(\mathcal{M}\) would deduce the labels of all points from those of its \(K\) prototypes \(\{p_k\}\), in the hope that for every \(j\):
$$\beta(x_j) = y_j.$$
The decision function \(\beta\) just returns the label of the closest prototype.
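A direct transcription of the decision function \(\beta\) (illustrative sketch; the function and argument names are mine):

```python
import numpy as np

def beta(x, prototypes, prototype_labels):
    """Decision function: return the label of the prototype closest to x."""
    distances = np.linalg.norm(prototypes - x, axis=1)
    return prototype_labels[distances.argmin()]
```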
The complexity of the model is given by the complexity of its prototypes and their labels:
$$C(\mathcal{M}) \leq \sum_k C(p_k) + K \times \log (|Y|).$$
Now:
$$C(\{(x_j, y_j)\} | X) \leq C(\mathcal{M}) + \sum_{j} {C((x_j, y_j) | X, \mathcal{M})}.$$
Since the decision function \(\beta\) is just a loop over the prototypes,
\(C((x_j, y_j) | X, \mathcal{M}) = 0\). So \(C(\{(x_j, y_j)\} | X)\) is found to be much smaller if \(K \ll |X|\) than if we only rely on rote learning.
Thus:
$$C(\{(x_j, y_j)\} | X) \leq C(\mathcal{M}) \ll |X| \times \log(|Y|).$$
We just have to remember the prototypes and their labels to retrieve all the labels, hence the compression.
This is only true, however, if the model \(\mathcal{M}\) is perfect. If not:
\(C((x_j, y_j) | X, \mathcal{M}) \neq 0\).
For each misclassified point, we may just specify its index and the correct value of its label:
$$\sum_{j}{C((x_j, y_j) | X, \mathcal{M})} \leq \sum_{j=1}^{|X|} (C(j) + C(y_j)) \times \mathbb{I}(y_j \neq \beta(x_j)),$$
where \(\mathbb{I}\) is the function that converts a Boolean value into 0 or 1.
More simply, if there are \(n\) misclassified points, we get:
$$\sum_{j}{C((x_j, y_j) | X, \mathcal{M})} \leq n \times (\log(|X|) + \log(|Y|)).$$
To sum up:
$$C(\{(x_j, y_j)\} | X) \leq \sum_k C(p_k) + K \times \log (|Y|) + n \times (\log(|X|) +\log(|Y|)).$$
If \(n\) is small, this may still represent a huge compression as compared to rote learning.
The right-hand side of this inequality can be used as a loss function to determine the best prototype-based classification.
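A sketch of that loss (my illustration, reusing the hypothetical `naive_code_length` and `beta` helpers defined in the earlier sketches; the complexity of the prototypes is again approximated by a bit count):

```python
import numpy as np

def classification_loss(X, y, prototypes, prototype_labels, n_labels, delta=0.01):
    """Right-hand side of the inequality, used as a loss:
    sum_k C(p_k) + K * log|Y| + n * (log|X| + log|Y|),
    where n is the number of points misclassified by the nearest-prototype rule."""
    K = len(prototypes)
    model_cost = naive_code_length(prototypes, delta) + K * np.log2(n_labels)
    predictions = np.array([beta(x, prototypes, prototype_labels) for x in X])
    n_errors = int(np.sum(predictions != y))
    exception_cost = n_errors * (np.log2(len(X)) + np.log2(n_labels))
    return model_cost + exception_cost
```

For instance, with \(|X| = 1000\) points and \(|Y| = 10\) labels, rote learning costs about \(1000 \times \log_2(10) \approx 3322\) bits, whereas a model with \(K = 10\) prototypes and \(n = 20\) errors costs roughly \(\sum_k C(p_k) + 33 + 266 \approx \sum_k C(p_k) + 300\) bits.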


We may then keep only the model (plus the list of exceptions) instead of the full set of labels.
Analogy
\(A\) → \(B\)
\(C\) → \(x\)
"\(A\) is to \(B\) as \(C\) is to \(x\)"
Doug Hofstadter is well-known for his original study of analogy. Let’s consider one of his favourite examples:
\(abc\) → \(abd\)
\(ppqqrr\) → \(x\)
The most relevant answer minimizes the overall complexity:
$$ x = \arg\min_x C(A, B, C, x). $$
| Cornuéjols, A. (1996). Analogie, principe d’économie et complexité algorithmique. Actes des 11èmes Journées Françaises de l’Apprentissage. Sète, France. |
| Murena, P.-A., Al-Ghossein, M., Dessalles, J.-L. & Cornuéjols, A. (2020). Solving analogies on words based on minimal complexity transformation. International Joint Conference on Artificial Intelligence (IJCAI), 1848-1854. |
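The cited papers compute genuine minimal-complexity transformations; as a much cruder illustration of the \(\arg\min\) structure only (not the authors' method), one can score candidate answers with a compressor-based proxy for complexity:

```python
import zlib

def complexity_proxy(s: str) -> int:
    """Very crude stand-in for C(.): size of the zlib-compressed string."""
    return len(zlib.compress(s.encode()))

def solve_analogy(A, B, C, candidates):
    """Return the candidate x minimizing the complexity of the whole analogy,
    approximated by compressing the concatenation 'A->B;C->x'."""
    scores = {x: complexity_proxy(f"{A}->{B};{C}->{x}") for x in candidates}
    return min(scores, key=scores.get)

# "abc is to abd as ppqqrr is to x" -- score a few hand-picked candidates:
print(solve_analogy("abc", "abd", "ppqqrr", ["ppqqrs", "ppqqss", "ppqqrd", "ppqqrr"]))
```

Such a proxy is far too weak to reproduce the results of the cited work; it merely illustrates how the relevance of an answer can be cast as a complexity minimization.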




| Dessalles, J.-L. (2006). A structural model of intuitive probability. ICCM-2006, 86-91. Trieste, IT: Edizioni Goliardiche. |
relevance = complexity drop
Unexpectedness \(U\) measures such a drop: the difference between the generation complexity \(C_w(s)\) of a situation \(s\) (its complexity from the point of view of the "world" that produces it) and its description complexity \(C(s)\). Subjective probability derives from it:
$$ U(s) = C_w(s) - C(s) $$
$$ p = 2^{-U} $$
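A toy numerical illustration (the complexity values are hand-set assumptions, not taken from the slides): in a 6-number lottery every draw has the same generation complexity, but a "remarkable" draw such as 1-2-3-4-5-6 has a much shorter description, hence a high unexpectedness and a low subjective probability.

```python
import math

C_w = 6 * math.log2(49)      # generation complexity of any draw of 6 numbers in 1..49

# Description complexity: a remarkable draw is simple to describe,
# a typical draw has to be spelled out in full.
draws = {"1-2-3-4-5-6": 5.0,          # rough guess: "the first six integers"
         "typical draw": C_w}          # no description shorter than the draw itself

for name, C in draws.items():
    U = C_w - C                        # unexpectedness U(s) = C_w(s) - C(s)
    p = 2 ** (-U)                      # subjective probability p = 2^(-U)
    print(f"{name}: U = {U:.1f} bits, p = {p:.1e}")
```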
Conclusion
relevance = complexity drop
This observation opens the way to a computation of relevance.
| Dessalles, J.-L. (2008). La pertinence et ses origines cognitives - Nouvelles théories. Paris: Hermes Science. |
| Dessalles, J.-L. (2019). Des intelligences très artificielles. Paris: Odile Jacob. |
| Dessalles, J.-L. (2013). Algorithmic simplicity and relevance. In Algorithmic probability and friends - LNAI 7070, 119-130. Springer. |