• News
  • Automatic generation of multiple labeled test data sets

News of the Institute of Computer Science Polish Academy of Sciences

Automatic generation of multiple labeled test data sets

The article "On the Discrepancy between Kleinberg's Clustering Axioms and k-Means Clustering Algorithm Behavior" has been published in the journal "Machine Learning". The article was co-authored by Professor Mieczysław Kłopotek from the Institute of Computer Science of the Polish Academy of Sciences.

This paper investigates why the k-means algorithm violates the Kleinberg's axioms for clustering functions. The reason is a mismatch between informal intuitions and formal formulations of these axioms. We show a way to reconcile k-means with Kleinberg's consistency requirement via introduction of centric consistency which is neither a subset nor superset of Kleinberg's consistency, but rather a k-means clustering model specific adaptation of the general idea of shrinking the cluster. What is more, if the axioms are to be applicable continuously, then the centric consistency is a much less restrictive requirement for clustering algorithms than Kleinberg's consistency.

The contribution of the paper from theoretical point of view is to create a sound axiomatic system for k-means and its derivatives (including ones with adaptive k). From practical point of view, a foundation for a testbed for such algorithms is created in that many test data sets can be derived from one labeled dataset for such algorithms via the centric consistency transformation.

The article is available under the Open Access license from the Springer publishing house: doi: 10.1007/s10994-023-06308-x.

© 2021 INSTITUTE OF COMPUTER SCIENCE POLISH ACADEMY OF SCIENCES | Privacy policy | Accessibility declaration