This repository collects chemometric datasets to be used for teaching and research purposes.

Datasets can be submitted for publication through the following link. Each dataset must be submitted as a rar or zip file (maximum size 10 MB). An example of the file structure can be downloaded at this link. Each compressed file must include:

  • a readme.txt text file containing the following information: dataset title, reference, contact, brief explanation of the dataset, dataset typology (exploratory analysis, classification, regression), dimensions of matrices, presence of missing values, explanation of columns with labels
  • the dataset, preferably in csv format, with column labels given in the first row and, optionally, sample labels given in the first column
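
The requirements above can be checked before submitting. The following is a minimal sketch, not an official validator of this repository: the function names `check_submission` and `read_labelled_csv` are hypothetical, "10 MB" is assumed to mean 10 × 1024 × 1024 bytes, only zip archives are handled (Python's standard library does not read rar files), and empty csv cells are assumed to mark missing values.

```python
import csv
import io
import zipfile
from pathlib import Path

MAX_BYTES = 10 * 1024 * 1024  # assumption: "10 MB" read as 10 MiB


def check_submission(archive_path):
    """Return a list of problems found in a zip submission (empty if none)."""
    problems = []
    path = Path(archive_path)
    if path.stat().st_size > MAX_BYTES:
        problems.append("archive is larger than 10 MB")
    with zipfile.ZipFile(path) as archive:
        # Compare base names only, so files inside a top-level folder count.
        names = [Path(name).name.lower() for name in archive.namelist()]
    if "readme.txt" not in names:
        problems.append("readme.txt is missing")
    if not any(name.endswith(".csv") for name in names):
        problems.append("no csv file found (csv is the preferred format)")
    return problems


def read_labelled_csv(text, sample_labels=True):
    """Split csv text into column labels, sample labels and numeric rows."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    first = 1 if sample_labels else 0  # skip the sample-label column if present
    columns = header[first:]
    samples = [row[0] for row in body] if sample_labels else []
    # Assumption: an empty cell marks a missing value and is stored as None.
    values = [
        [float(cell) if cell.strip() else None for cell in row[first:]]
        for row in body
    ]
    return columns, samples, values
```

For example, `check_submission("my_dataset.zip")` (a hypothetical file name) returns an empty list when the archive is small enough and contains both a readme.txt and a csv file.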

Other dataset repositories exist; here are some links:

QSAR biodegradation

Author: Davide Ballabio (University of Milano Bicocca)

Description: the dataset contains the values of 41 variables (molecular descriptors) used to classify the samples (molecules) into two classes (356 biodegradable and 699 non-biodegradable molecules). The data were used to develop Quantitative Structure Activity Relationships (QSAR) models for studying the relationships between chemical structure and biodegradation of molecules.

Reference: Mansouri, K., Ringsted, T., Ballabio, D., Todeschini, R., Consonni, V. (2013). Quantitative Structure - Activity Relationship models for ready biodegradability of chemicals. Journal of Chemical Information and Modeling, 53, 867-878 [link]

Typology: exploratory analysis, classification

Dimensions: 1055 samples, 42 variables

Download: download the dataset


EVOO dataset

Author: Eugenio Alladio (University of Turin)

Description: The dataset contains the values of 79 untargeted and targeted (UT) features used to classify samples (137 EVOO samples) into two classes (Country A and B). The data were used to develop supervised classification models for studying the relationships between EVOO samples and the evaluated features. Samples were analysed using a GC×GC-MS/FID methodology. Moreover, several feature selection approaches were implemented to identify the predictors that best discriminate EVOO samples from countries A and B. The dataset also contains the class information of the 10 different regions where the EVOO samples were collected.

Reference: Stilo, F., Alladio, E., Squara, S., Bicchi, C., Vincenti, M., Reichenbach, S.E., Cordero, C., Ribeiro Bizzo, H. (2023). Delineating unique and discriminant chemical traits in Brazilian and Italian extra-virgin olive oils (EVOO) by quantitative 2D-fingerprinting and pattern recognition algorithms. Journal of Food Composition and Analysis, 115, 104899 [link]

Typology: exploratory analysis, classification

Dimensions: 137 samples, 79 variables

Download: download the dataset


Coffee barley NIR

Authors: Heshmatollah Ebrahimi-Najafabadi, Riccardo Leardi, Paolo Oliveri, Maria Chiara Casolino, Mehdi Jalali-Heravi, Silvia Lanteri (University of Genova)

Description: Nine different types of coffee, including pure Arabica, Robusta and mixtures of them at different roasting degrees, were blended with four types of barley. The blending degrees were between 2 and 20 wt% of barley. Pure samples and samples with added milled roasted barley powder were analysed by transmission FT-NIR. Aims of the study: 1) build a class model for authenticity verification of coffee and 2) quantify the added barley (% w/w). Samples are divided into three sets:
Training: 117 samples (9 types of coffee and 4 types of barley, at different % levels)
Internal test: 30 samples (same 9 types of coffee and 4 types of barley, at different % levels)
External test: 11 samples (a different coffee and a different barley, at 11 % levels)

Reference: Heshmatollah Ebrahimi-Najafabadi, Riccardo Leardi, Paolo Oliveri, Maria Chiara Casolino, Mehdi Jalali-Heravi, Silvia Lanteri, Detection of addition of barley to coffee using near infrared spectroscopy and chemometric techniques, Talanta 2012 99, 175-179, doi: 10.1016/j.talanta.2012.05.036 [link]

Typology: regression, classification

Dimensions: 158 samples, 1501 variables

Download: download the dataset


Mushrooms NIR

Authors: Monica Casale, Lucia Bagnasco, Mirca Zotti, Simone Di Piazza, Nicola Sitta, Paolo Oliveri (University of Genova)

Description: Near-infrared spectroscopy (NIRS) was used to identify extraneous species within dried porcini batches and detect related commercial frauds. To this end, 80 dried fungi including BEAS, Tylopilus spp., and Boletus violaceofuscus were analysed by reflection FT-NIR. For each sample, 3 different parts of the pileus (pileipellis, flesh and hymenium) were analysed. Aim of the study: build a class model for authenticity verification of Boletus edulis and allied species (BEAS) mushrooms. Samples were thus divided into three classes: Class 1 = Boletus edulis and allied species (BEAS), Class 2 = Tylopilus spp., Class 3 = Boletus violaceofuscus. Class 1 is the class to be modelled. Class 3 is the most similar to the target class (not only from a spectroscopic viewpoint, but also for taxonomic reasons).

Reference: Monica Casale, Lucia Bagnasco, Mirca Zotti, Simone Di Piazza, Nicola Sitta, Paolo Oliveri, A NIR spectroscopy-based efficient approach to detect fraudulent additions within mixtures of dried porcini mushrooms, Talanta 2016 160, 729-734, doi: 10.1016/j.talanta.2016.08.004 [link]

Typology: classification, data fusion

Dimensions: 80 samples, 1250 variables

Download: download the dataset


Wines NIR UV-vis

Authors: M. Casale, P. Oliveri, C. Armanino, S. Lanteri, M. Forina (University of Genova)

Description: NIR and UV–vis spectroscopy, together with chemometric pattern recognition techniques, were applied in addressing a food authentication problem: the distinction between wine samples from the same Italian oenological region, according to the grape variety. 59 certified samples belonging to the Barbera d’Alba and Dolcetto d’Alba appellations and collected from the same vintage (2007) were analysed. Aim of the study: build a class model for Barbera d’Alba and Dolcetto d’Alba. Class 1 = Barbera d’Alba (23 samples), Class 2 = Dolcetto d’Alba (36 samples).

Reference: M. Casale, P. Oliveri, C. Armanino, S. Lanteri, M. Forina, NIR and UV–vis spectroscopy, artificial nose and tongue: Comparison of four fingerprinting techniques for the characterisation of Italian red wines, Analytica Chimica Acta, 2010, 668, 143-148, doi: 10.1016/j.aca.2010.04.021 [link]

Typology: classification, data fusion

Dimensions: 59 samples, 911 (UV–vis) and 1501 (NIR) variables

Download: download the dataset


Olive oils multi-block

Authors: Paolo Oliveri, Monica Casale, M. Chiara Casolino, M. Antonietta Baldo, Fiammetta Nizzi Grifi, Michele Forina (University of Genova)

Description: An authentication study of the Italian PDO (protected designation of origin) olive oil Chianti Classico was performed, based on near-infrared and UV–visible spectroscopy, an artificial nose and an artificial tongue. It used a set of samples representative of the whole Chianti Classico production and a considerable number of samples from a nearby production area (Maremma). Aim of the study: build a class model for authenticity verification of Chianti Classico PDO olive oil. Class 1 = Chianti Classico (23 samples); Class 2 = Maremma (34 samples). Class 1 is the class to be modelled; Class 2 is included to assess specificity (i.e., type II or β error) of Class 1 models.

Reference: Paolo Oliveri, Monica Casale, M. Chiara Casolino, M. Antonietta Baldo, Fiammetta Nizzi Grifi, Michele Forina, Comparison between classical and innovative class-modelling techniques for the characterisation of a PDO olive oil, Analytical and Bioanalytical Chemistry, 2011, 399, 2105–2113, doi: 10.1007/s00216-010-4377-1 [link]

Typology: classification, data fusion

Dimensions: 57 samples, 441 (UV–vis) + 1126 (NIR) + 3945 (e-tongue) + 46 (e-nose) variables

Download: download the dataset
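
For the multi-block entry above, the four blocks can be joined sample by sample into a single matrix. The following is a minimal sketch of low-level data fusion (plain concatenation of the blocks), which is only one possible strategy and is not necessarily the one used in the referenced study. The block sizes come from the Dimensions line; the function name `fuse_blocks` is hypothetical, and zeros are used as placeholders instead of real measurements.

```python
# Block sizes taken from the "Dimensions" line of the entry above.
BLOCK_SIZES = {"UV-vis": 441, "NIR": 1126, "e-tongue": 3945, "e-nose": 46}
N_SAMPLES = 57


def fuse_blocks(blocks):
    """Concatenate blocks sample by sample (low-level data fusion)."""
    row_counts = {len(block) for block in blocks.values()}
    if len(row_counts) != 1:
        raise ValueError("all blocks must contain the same samples")
    fused = []
    # zip(*...) walks the blocks in parallel, one sample at a time.
    for rows in zip(*blocks.values()):
        fused.append([value for row in rows for value in row])
    return fused


# Placeholder zeros stand in for the real measurements (not actual data).
blocks = {
    name: [[0.0] * size for _ in range(N_SAMPLES)]
    for name, size in BLOCK_SIZES.items()
}
fused = fuse_blocks(blocks)
print(len(fused), len(fused[0]))  # 57 5558
```

In practice, blocks with such different numbers of variables are usually scaled before concatenation so that the largest block does not dominate the fused matrix; that step is omitted here.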


Olives in brine NIR

Authors: Paolo Oliveri, M. Isabel López, M. Chiara Casolino, Itziar Ruisánchez, M. Pilar Callao, Luca Medini, Silvia Lanteri (University of Genova)

Description: Samples of olives in brine from different harvest years were analysed by reflection FT-NIR spectroscopy. Aim of the study: build a class model for authenticity verification of Taggiasca cultivar olives. Class 1 = Taggiasca, Class 2 = Leccino, Class 3 = Coquillo. Class 1 is the class to be modelled. Classes 2 and 3 are included to assess specificity (i.e., type II or β error) of Class 1 models. Samples are divided into training (olives from harvests 2010-11 and 2011-12), internal test (olives from harvests 2010-11 and 2011-12) and external test (olives from harvest 2012-13) sets.

Reference: Paolo Oliveri, M. Isabel López, M. Chiara Casolino, Itziar Ruisánchez, M. Pilar Callao, Luca Medini, Silvia Lanteri, Partial least squares density modeling (PLS-DM) – A new class-modeling strategy applied to the authentication of olives in brine by near-infrared spectroscopy, Analytica Chimica Acta 851 (2014) 30–36, doi: 10.1016/j.aca.2014.09.013 [link]

Typology: classification

Dimensions: 233 samples, 1243 variables

Download: download the dataset


Protein foods freshness

Author: Lisa Rita Magnaghi (University of Pavia)

Description: RGB triplets describing the colour evolution of a colorimetric sensor array for monitoring the freshness of protein foods during chilled storage.

Reference: Magnaghi, L. R.; Capone, F.; Zanoni, C.; Alberti, G.; Quadrelli, P.; Biesuz, R., Colorimetric sensor array for monitoring, modelling and comparing spoilage processes of different meat and fish foods. Foods, 2020, 9(5), 684 [link]

Typology: exploratory analysis

Dimensions: 120 samples, 18 variables

Download: download the dataset


VOCs waste dataset

Author: Caterina Durante (University of Modena and Reggio Emilia, Italy)

Description: Gas chromatographic characterization of the volatile compounds of food waste samples. The dataset contains the area values of the 162 compounds resolved by the PARADISe approach applied to gas chromatography-mass spectrometry signals obtained from the analysis of volatile compounds (VOCs) of waste samples coming from the production of pasta condiments. The data were used to characterize the samples and to study the variability of VOCs as a function of storage time.

Reference: Strani, L., Farioli, G., Cocchi, M., Durante, C., Olarini, A. (2024). Chemical Characterization and Temporal Variability of Pasta Condiment By-Products for Sustainable Waste Management. Foods, 13(18), 3018 [link]

Typology: exploratory analysis

Dimensions: 16 samples, 162 variables

Download: download the dataset


Milk freshness

Author: Lisa Rita Magnaghi (University of Pavia)

Description: RGB triplets describing the colour evolution of a colorimetric sensor array for monitoring milk freshness during chilled storage.

Reference: Magnaghi, L. R.; Zanoni, C.; Alberti, G.; Quadrelli, P.; Biesuz, R. Towards intelligent packaging: BCP-EVOH@ optode for milk freshness measurement, Talanta, 2022, 241, 123230 [link]

Typology: classification

Dimensions: 279 samples, 3 variables

Download: download the dataset


AgNPs@OPE

Author: Lisa Rita Magnaghi (University of Pavia)

Description: UV-Vis spectra of AgNPs@OPE after the addition of increasing concentrations of Cd2+ or Pb2+.

Reference: Zannotti, M.; Piras, S.; Magnaghi, L. R.; Biesuz, R.; Giovannetti, R. Silver nanoparticles from orange peel extract: Colorimetric detection of Pb2+ and Cd2+ ions with a chemometric approach, Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy, 2024, 323, 124881 [link]

Typology: classification

Dimensions: 85 samples, 450 variables

Download: download the dataset


Melanin synthesis

Author: Lisa Rita Magnaghi (University of Pavia)

Description: Kinetics of melanin synthesis monitored by UV-Vis spectra. The formation of small particles can be investigated by spectral analysis, and dopamine consumption by first-derivative modelling. Both PCA and 3-way PCA can be applied.

Reference: Schifano, F.; Magnaghi, L. R.; Monzani, E.; Casella, L.; Biesuz, R. Exploiting Principal Component Analysis (PCA) to reveal temperature, buffer and metal ions’ role in neuromelanin (NM) synthesis by dopamine (DA) oxidative polymerization, Journal of Inorganic Biochemistry, 2024, 256, 112548 [link]

Typology: exploratory analysis

Dimensions: 221 samples, 351 variables

Download: download the dataset