Data

Small data sets for binary supervised classification:

The table below contains the data sets used in the joint project of the University of Cologne and the Hochschule Merseburg, “Classifying real-world data with the DDα-procedure”. A comprehensive description of the methodology, the experimental settings, and the results of the study is given in the following work (please cite it if you find the data useful):

Mozharovskyi, P., Mosler, K., and Lange, T. (2015): Classifying real-world data with the DDα-procedure. Advances in Data Analysis and Classification, 9(3), 287–314. [arXiv:1407.5185]

The 50 binary classification tasks were obtained by partitioning 33 freely accessible data sets. Multiclass problems were reasonably split into binary classification problems, and some of the data sets were lightly processed by removing objects or attributes and by selecting the prevailing classes. Each data set comes with a short description and brief descriptive statistics. The name reflects the origin of the data. A letter after the name is a property filter; letters (or combinations of letters) in brackets separated by “vs” are the two classes being opposed. The letters, combinations, or words stand for class labels (or property names) and are meant to be intuitive. Each description contains a link to the original data.
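The naming convention described above can be read mechanically. The following is a minimal sketch, not part of the original data distribution: the names `ParsedName` and `parse_name` and the regular expression are hypothetical, and the pattern assumes that a property filter is a single capital letter directly before the bracket and that only brackets containing “vs” denote classes.

```python
import re
from typing import NamedTuple, Optional, Tuple


class ParsedName(NamedTuple):
    origin: str                          # where the data come from, e.g. "Crabs"
    property_filter: Optional[str]       # single letter after the name, if any
    classes: Optional[Tuple[str, str]]   # the two opposed class labels, if given


# Assumed layout: "<origin> [<filter letter>] (<class A> vs <class B>)".
# Brackets without "vs", e.g. "PIMA (training)", are kept as part of the origin.
_PATTERN = re.compile(
    r"(?P<origin>.+?)"
    r"(?: (?P<filter>[A-Z]))?"
    r"(?: \((?P<a>[^()]+?) vs (?P<b>[^()]+)\))?"
)


def parse_name(name: str) -> ParsedName:
    """Split a data set name into origin, property filter and class labels."""
    match = _PATTERN.fullmatch(name.strip())
    if match is None:
        raise ValueError(f"cannot parse data set name: {name!r}")
    classes = (match["a"], match["b"]) if match["a"] is not None else None
    return ParsedName(match["origin"], match["filter"], classes)


if __name__ == "__main__":
    for example in ("Crabs B (M vs F)", "Iris Plants (SET vs VER)", "Heart"):
        print(parse_name(example))
```

For “Crabs B (M vs F)” this yields the origin “Crabs”, the property filter “B”, and the classes “M” and “F”; names without a filter or without opposed classes leave the corresponding fields empty.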

The data were collected as open source data in January 2013. The owner of this web page declines any responsibility for their correctness or for the consequences of their use. If you publish material based on these data, please cite the original source. Special requests regarding citations can be found on each data set’s web page.

Download all the data sets as a single *.zip: zipAll

Data table:

# Dataset n1 n2 n1+n2 d ln(n1/n2) (n1+n2)/d ties Download
1. Baby 161 86 247 5 0,626 49,4 0 dat zip
2. Banknoten 100 100 200 6 0 33,3 0 dat zip
3. Biomedical 67 127 194 4 -0,635 48,5 0 dat zip
4. Blood Transfusion 178 570 748 3 -1,171 249,3 246 dat zip
5. Breast Cancer Wisconsin 458 241 699 9 0,642 77,7 236 dat zip
6. Bupa Liver Disorder 145 200 345 6 -0,329 57,5 4 dat zip
7. Chemical Diabetes (C vs N) 36 76 112 5 -0,755 22,4 0 dat zip
8. Chemical Diabetes (C vs O) 36 33 69 5 0,086 13,8 0 dat zip
9. Chemical Diabetes (N vs O) 76 33 109 5 0,833 21,8 0 dat zip
10. Cloud 54 54 108 7 0 15,4 0 dat zip
11. Crabs (B vs O) 100 100 200 5 0 40,0 0 dat zip
12. Crabs (M vs F) 100 100 200 5 0 40,0 0 dat zip
13. Crabs B (M vs F) 50 50 100 5 0 20,0 0 dat zip
14. Crabs F (B vs O) 50 50 100 5 0 20,0 0 dat zip
15. Crabs M (B vs O) 50 50 100 5 0 20,0 0 dat zip
16. Crabs O (M vs F) 50 50 100 5 0 20,0 0 dat zip
17. Cricket (C vs P) 78 78 156 4 0 39,0 7 dat zip
18. Diabetes (of Pima Indians) 268 500 768 8 -0,616 96,0 0 dat zip
19. Ecoli (CP vs IM) 143 77 220 5 0,621 44,0 0 dat zip
20. Ecoli (CP vs PP) 143 52 195 5 1,012 39,0 0 dat zip
21. Ecoli (IM vs PP) 77 52 129 5 0,392 25,8 0 dat zip
22. Gemsen (M vs F) 796 553 1349 6 0,365 224,8 27 dat zip
23. Glass (F vs NF) 70 76 146 9 -0,083 16,2 1 dat zip
24. Groessen (M vs F) 116 114 230 3 0,020 76,7 0 dat zip
25. Haberman’s Survival 225 81 306 3 1,022 102,0 23 dat zip
26. Heart 120 150 270 13 -0,223 20,8 0 dat zip
27. Hemophilia 30 45 75 2 -0,400 37,5 0 dat zip
28. Indian Liver Patient (1 vs 2) 414 165 579 10 0,920 57,9 13 dat zip
29. Indian Liver Patient (M vs F) 140 439 579 9 -1,139 64,3 13 dat zip
30. Iris Plants (SET vs VER) 50 50 100 4 0 25,0 2 dat zip
31. Iris Plants (SET vs VIR) 50 50 100 4 0 25,0 3 dat zip
32. Iris Plants (VER vs VIR) 50 50 100 4 0 25,0 1 dat zip
33. Irish Educational Transitions (M vs F) 250 250 500 5 0 100,0 44 dat zip
34. Kidney (M vs F) 20 56 76 5 -1,022 15,2 0 dat zip
35. PIMA (training) 132 68 200 7 0,663 28,6 0 dat zip
36. Plasma Retinol and Beta-Carotene Levels (M vs F) 273 42 315 13 1,872 24,2 0 dat zip
37. Segmentation (C vs W) 330 330 660 10 0 66,0 62 dat zip
38. Social Mobility (I vs NI) 578 578 1156 5 0 231,2 45 dat zip
39. Social Mobility (W vs B) 578 578 1156 5 0 231,2 8 dat zip
40. Teaching Assistant Evaluation (E vs NE) 29 122 151 5 -1,427 30,2 43 dat zip
41. Tennis (M vs F) 42 45 87 15 -0,073 5,8 0 dat zip
42. Tips (D vs N) 176 68 244 6 0,952 40,7 1 dat zip
43. Tips (M vs F) 87 157 244 6 -0,598 40,7 1 dat zip
44. US Crime (S vs N) 16 31 47 13 -0,654 3,6 0 dat zip
45. Vertebral Column 210 100 310 6 0,742 51,7 0 dat zip
46. Veteran Lung Cancer (S vs T) 69 68 137 7 0,010 19,6 0 dat zip
47. Vowel (M vs F) 528 462 990 13 0,131 76,2 0 dat zip
48. Wine (1 vs 2) 59 71 130 13 -0,186 10,0 0 dat zip
49. Wine (1 vs 3) 59 48 107 13 0,207 8,2 0 dat zip
50. Wine (2 vs 3) 71 48 119 13 0,392 9,2 0 dat zip
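
The rows of the table use a decimal comma and names that may contain spaces, so they are easiest to read from the right. The sketch below is an illustration only: `TableRow`, `parse_row` and `recompute` are hypothetical names, and the column order is assumed to be exactly that of the header line. It parses one row and recomputes the derived columns from n1, n2 and d. Recomputed values of ln(n1/n2) may differ slightly from the listed ones, so only the count-based columns are compared here.

```python
import math
from typing import NamedTuple, Tuple


class TableRow(NamedTuple):
    name: str
    n1: int            # size of the first class
    n2: int            # size of the second class
    total: int         # n1 + n2
    d: int             # number of attributes (dimension)
    log_ratio: float   # ln(n1/n2) as listed
    per_dim: float     # (n1+n2)/d as listed
    ties: int


def _to_float(text: str) -> float:
    """Convert a number written with a decimal comma, e.g. '49,4'."""
    return float(text.replace(",", "."))


def parse_row(line: str) -> TableRow:
    """Parse one table row; the last nine tokens are fixed columns."""
    tokens = line.split()
    # tokens[0] is the running index, tokens[-2:] are the download links.
    name = " ".join(tokens[1:-9])
    n1, n2, total, d = (int(t) for t in tokens[-9:-5])
    return TableRow(name, n1, n2, total, d,
                    _to_float(tokens[-5]), _to_float(tokens[-4]),
                    int(tokens[-3]))


def recompute(row: TableRow) -> Tuple[float, float]:
    """Return (ln(n1/n2), (n1+n2)/d) computed from the class sizes."""
    return math.log(row.n1 / row.n2), (row.n1 + row.n2) / row.d


if __name__ == "__main__":
    row = parse_row("1. Baby 161 86 247 5 0,626 49,4 0 dat zip")
    log_ratio, per_dim = recompute(row)
    print(row.name, round(log_ratio, 3), round(per_dim, 1))
```

A positive ln(n1/n2) means the first class is the larger one, zero means balanced classes, and (n1+n2)/d gives the number of observations available per attribute.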

Links to other web pages containing data sets: