This repository collects chemometric datasets for use in teaching and research.

Datasets can be submitted for publication through the following link. Each dataset must be submitted as a rar or zip file (maximum size 10 MB). An example of the file structure can be downloaded at this link. Each compressed file must include:

  • a readme.txt file containing the following information: dataset title, reference, contact, brief explanation of the dataset, dataset typology (exploratory analysis, classification, regression), dimensions of matrices, presence of missing values, explanation of columns with labels
  • the dataset, preferably in csv format, with column labels in the first row and, optionally, sample labels in the first column
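
Before submitting, the requirements above can be checked locally. The following is a minimal sketch, not an official tool of this repository: the function name check_submission, the reading of "10 MB" as 10 × 1024 × 1024 bytes, the UTF-8 encoding, and the heuristic that an all-numeric first row means missing column labels are all assumptions. It handles zip files only, because the Python standard library cannot read rar archives.

```python
import csv
import io
import os
import zipfile

MAX_BYTES = 10 * 1024 * 1024  # assumption: "10 MB" interpreted as 10 MiB


def _is_number(cell):
    """Return True if the cell parses as a float."""
    try:
        float(cell)
        return True
    except ValueError:
        return False


def check_submission(zip_path):
    """Return a list of problems found in a zip submission (empty if none)."""
    problems = []
    if os.path.getsize(zip_path) > MAX_BYTES:
        problems.append("archive exceeds 10 MB")
    with zipfile.ZipFile(zip_path) as archive:
        # Skip directory entries; zip member names always use "/" separators.
        names = [n for n in archive.namelist() if not n.endswith("/")]
        base_names = [n.rsplit("/", 1)[-1].lower() for n in names]
        if "readme.txt" not in base_names:
            problems.append("readme.txt is missing")
        csv_names = [n for n in names if n.lower().endswith(".csv")]
        if not csv_names:
            problems.append("no csv file found (csv is preferred, not required)")
        for name in csv_names:
            # Assumption: the csv is UTF-8 encoded and comma separated.
            text = archive.read(name).decode("utf-8", errors="replace")
            header = next(csv.reader(io.StringIO(text)), [])
            cells = [c.strip() for c in header if c.strip()]
            # Heuristic: an empty or purely numeric first row suggests
            # that the column labels were left out.
            if not cells or all(_is_number(c) for c in cells):
                problems.append(f"{name}: first row has no column labels")
    return problems
```

Calling check_submission("my_dataset.zip") (a hypothetical file name) returns an empty list when none of these checks fail. The content of readme.txt is not inspected, so the required fields still have to be verified by hand.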

Other dataset repositories exist; here are some links:

QSAR biodegradation

Author: Davide Ballabio (University of Milano Bicocca)

Description: The dataset contains the values of 41 variables (molecular descriptors) used to classify the samples (molecules) into two classes (356 biodegradable and 699 non-biodegradable molecules). The data were used to develop Quantitative Structure-Activity Relationship (QSAR) models for studying the relationships between chemical structure and biodegradation of molecules.

Reference: Mansouri, K., Ringsted, T., Ballabio, D., Todeschini, R., Consonni, V. (2013). Quantitative Structure-Activity Relationship models for ready biodegradability of chemicals. Journal of Chemical Information and Modeling, 53, 867-878 [link]

Typology: exploratory analysis, classification

Dimensions: 1055 samples, 42 variables

Download: download the dataset

EVOO dataset

Author: Eugenio Alladio (University of Turin)

Description: The dataset contains the values of 79 untargeted and targeted (UT) features used to classify 137 extra-virgin olive oil (EVOO) samples into two classes (Country A and B). The data were used to develop supervised classification models for studying the relationships between EVOO samples and the evaluated features. Samples were analysed using a GC×GC-MS/FID methodology. Several feature selection approaches were also applied to identify the predictors that best discriminate EVOO samples from countries A and B. The dataset also contains the class information for the 10 different regions where the EVOO samples were collected.

Reference: Stilo, F., Alladio, E., Squara, S., Bicchi, C., Vincenti, M., Reichenbach, S.E., Cordero, C., Ribeiro Bizzo, H. (2023). Delineating unique and discriminant chemical traits in Brazilian and Italian extra-virgin olive oils (EVOO) by quantitative 2D-fingerprinting and pattern recognition algorithms. Journal of Food Composition and Analysis, 115, 104899 [link]

Typology: exploratory analysis, classification

Dimensions: 137 samples, 79 variables

Download: download the dataset