Let's find your dataset! There's no machine learning or deep learning without data. It's the unavoidable starting point for the current progress in visual recognition and object detection. Fortunately, the data science community is good at sharing, and the Internet offers a large number of open datasets that are very useful for testing your algorithms. But how do you choose the right training set for your project?

Here's an overview of the best free datasets available online:


MNIST is often considered the most accessible source of data for training algorithms, with 70,000 images (60,000 training images and 10,000 test images) of handwritten digits in about 250 different styles of handwriting, in black and white, standardised at 28 x 28 pixels. With a minimum of preprocessing and reformatting, this dataset lets you get familiar with pattern recognition methods using real data.

A dataset similar to MNIST, called EMNIST, was published in 2017 and offers 280,000 images of the same type (240,000 training images and 40,000 test images).
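
To give an idea of what the "minimum of preprocessing" mentioned for MNIST can look like, here is a small sketch. The original MNIST image files use the IDX format: a 16-byte big-endian header (magic number 2051, image count, rows, columns) followed by one byte per pixel. The function names are hypothetical, and the buffer is synthetic so the example runs without downloading anything; it is an illustration, not an official loader.

```python
import struct

IDX_IMAGE_MAGIC = 2051  # magic number of IDX files holding unsigned-byte images


def parse_idx_images(buf: bytes):
    """Parse an IDX image buffer into nested lists of floats scaled to [0, 1]."""
    magic, count, rows, cols = struct.unpack(">IIII", buf[:16])
    if magic != IDX_IMAGE_MAGIC:
        raise ValueError(f"unexpected magic number: {magic}")
    if len(buf) != 16 + count * rows * cols:
        raise ValueError("buffer length does not match the header")

    images = []
    offset = 16
    for _ in range(count):
        pixels = buf[offset:offset + rows * cols]
        offset += rows * cols
        # Rebuild the 2-D grid and scale each byte from 0..255 to 0.0..1.0.
        images.append(
            [[pixels[r * cols + c] / 255.0 for c in range(cols)] for r in range(rows)]
        )
    return images


def make_fake_idx(count: int, rows: int = 28, cols: int = 28) -> bytes:
    """Build a synthetic IDX buffer (hypothetical helper, for demonstration only)."""
    header = struct.pack(">IIII", IDX_IMAGE_MAGIC, count, rows, cols)
    body = bytes(i % 256 for i in range(count * rows * cols))
    return header + body


images = parse_idx_images(make_fake_idx(2))
print(len(images), len(images[0]), len(images[0][0]))
```

With a real file, you would read the (decompressed) bytes from disk and pass them to the same function.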



CIFAR-10 contains 60,000 colour images, standardised at 32 x 32 pixels and divided into 10 different categories (planes, cars, birds, cats, deer, dogs, frogs, horses, ships and trucks). CIFAR-10 is also among the most used datasets for training machine learning and computer vision algorithms. For the more ambitious, CIFAR-100 offers 100 different classes, each with 600 images.
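
As an illustration of the reformatting such images usually need, here is a minimal sketch. In the Python version of CIFAR-10, each 32 x 32 image is stored as a flat row of 3,072 bytes: 1,024 red values, then 1,024 green, then 1,024 blue, each in row-major order. The function name `row_to_image` is hypothetical and the row below is synthetic.

```python
def row_to_image(row, size=32):
    """Turn a flat channel-first row into a size x size grid of (r, g, b) tuples."""
    plane = size * size
    if len(row) != 3 * plane:
        raise ValueError(f"expected {3 * plane} values, got {len(row)}")
    red, green, blue = row[:plane], row[plane:2 * plane], row[2 * plane:]
    return [
        [(red[r * size + c], green[r * size + c], blue[r * size + c]) for c in range(size)]
        for r in range(size)
    ]


# Synthetic row: a uniformly orange image (red=255, green=128, blue=0).
fake_row = bytes([255] * 1024 + [128] * 1024 + [0] * 1024)
image = row_to_image(fake_row)
print(len(image), len(image[0]), image[0][0])
```

Most image libraries expect this pixel-by-pixel layout rather than three separate colour planes, which is why this step tends to come first.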



Inspired by CIFAR-10, the STL-10 dataset aims to be an improved version for image recognition, with a corpus of 500 colour training images per category, at a higher resolution of 96 x 96 pixels, in 10 categories. An additional 100,000 unlabelled images are also available.


Google Open Images

Published in 2016, Google Open Images groups together over 9 million links to manually pre-annotated images, with labels covering over 6,000 categories. Each image contains an average of 8.4 objects. This dataset is also subdivided into three sets: training, validation (about 41,000 images) and test (about 125,500 images).
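
Open Images ships with its splits already defined. If you assemble your own data, a comparable training/validation/test split can be sketched as follows. The function name, the 10% fractions and the image identifiers are assumptions chosen for the example, not anything prescribed by the dataset.

```python
import random


def split_dataset(items, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle items reproducibly and cut them into train, validation and test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed so the split can be reproduced
    n_val = int(len(items) * val_fraction)
    n_test = int(len(items) * test_fraction)
    val = items[:n_val]
    test = items[n_val:n_val + n_test]
    train = items[n_val + n_test:]
    return train, val, test


# Hypothetical image identifiers standing in for real files.
image_ids = [f"img_{i:04d}" for i in range(1000)]
train, val, test = split_dataset(image_ids)
print(len(train), len(val), len(test))
```

Fixing the seed keeps the three sets stable from one run to the next, so test images never leak into training.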



ImageNet is undoubtedly one of the biggest datasets, with over 14 million links to manually pre-annotated images following the WordNet object categorisation model, divided into over 20,000 categories, each containing several hundred images. ImageNet stands out by calling upon the community of researchers and data scientists to contribute to the manual annotation of its datasets.



MS COCO (for Common Objects in Context) is designed for the detection and segmentation of objects and people, and for caption generation. The MS COCO dataset contains over 330,000 generic images, of which over 200,000 are annotated, in 80 categories. Like most of its peers, MS COCO provides training, validation and test datasets.
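
To make the notion of annotated images more concrete, here is a small sketch that reads a COCO-style annotation file and counts objects per image and per category. COCO detection annotations are JSON documents with `images`, `annotations` and `categories` lists; the sample content below (file names, ids, boxes) is invented for illustration.

```python
import json
from collections import Counter

# Tiny hand-written sample in the COCO annotation layout (values are made up).
SAMPLE = """
{
  "images": [
    {"id": 1, "file_name": "street.jpg", "width": 640, "height": 480},
    {"id": 2, "file_name": "park.jpg", "width": 640, "height": 480}
  ],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1, "bbox": [12, 30, 80, 200]},
    {"id": 11, "image_id": 1, "category_id": 2, "bbox": [150, 220, 90, 60]},
    {"id": 12, "image_id": 2, "category_id": 1, "bbox": [300, 40, 70, 190]}
  ],
  "categories": [
    {"id": 1, "name": "person"},
    {"id": 2, "name": "dog"}
  ]
}
"""

data = json.loads(SAMPLE)
category_names = {c["id"]: c["name"] for c in data["categories"]}

# Each annotation points back to its image and its category by id.
objects_per_image = Counter(a["image_id"] for a in data["annotations"])
objects_per_category = Counter(category_names[a["category_id"]] for a in data["annotations"])
average_objects = len(data["annotations"]) / len(data["images"])

print(dict(objects_per_image))
print(dict(objects_per_category))
print(average_objects)
```

The same counting approach gives a quick feel for how densely annotated a dataset is before you commit to it.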


Facial recognition

Labeled Faces in the Wild

Labeled Faces in the Wild is a reference for training unconstrained facial recognition algorithms, with 13,000 close-up facial images of 5,750 different people (including 1,680 who appear in at least two images), detected by the Viola-Jones detection framework and annotated with the person's name.



UMDFaces contains two types of data: still images and frames taken from videos. It provides 367,888 annotated still images of faces of 8,277 subjects, and 3.75 million annotated video frames extracted from 22,075 videos of 3,107 subjects.



Google once again, with YouTube-8M, a dataset of annotated videos drawn from over 6 million public YouTube videos, representing nearly 350,000 hours of footage.

Each video carries an average of 3 automatically generated labels, drawn from a vocabulary of over 3,800 Knowledge Graph entities.




OpenStreetMap is an international project created in 2004. OSM, as insiders call it, provides free geographical data and maps of the entire globe, derived from satellite photos, GPS receiver data or data from government bodies (TIGER data provided by the US Census Bureau, the French land registry or National Geographical Institute, etc.). Voluntary contributions from internet users play a major role in the development of this dataset.



Landsat 8 offers free downloads of images covering the entire surface of the Earth, updated roughly every two weeks. To date, Landsat 8 (named after the source satellite) has collected millions of images, which are used, among other applications, to train models in the fields of climate, agriculture, habitat management, etc.


And if you wish to move on from generic datasets to sets specifically annotated for your project, Ingedata builds datasets every day to train, validate and optimise machine learning algorithms.

Originally published on July 20, 2018 Topics: Machine Learning Data Science Datasets

