It can never be said often enough: the reliability and accuracy of your datasets are as important as the performance and quality of your algorithms. Thousands of dataset directories, in every possible speciality, are available in open source. Let’s find yours.
Do you want to develop a machine learning application? Whether you are a student, a researcher, or working on a business solution, you probably have a good idea of the algorithmic challenges facing you. In AI projects, however, the real difficulty often lies with the data.
The ideal solution is obviously to have enough datasets adapted to your own specific problem. But creating your own datasets can quickly become time-consuming and costly.
There is an alternative, at least to start with: open source. Thousands of dataset directories, in every possible speciality, are available free of charge online. You just need to know where to find them! Here’s a brief overview of the best non-specialised open-source dataset directories.
Microsoft Research Open Data
Launched in June 2018, Microsoft Research Open Data organises the sharing of datasets produced by researchers. This new directory covers computer vision as well as natural language processing. It contains data in fields ranging from data processing to the social sciences, by way of physics and biology.
Kaggle Datasets
Kaggle Datasets is no doubt one of the most popular directories, with no fewer than 13,321 datasets organised by subject and validated by data scientists for data scientists. Kaggle offers search filters (by format, size, tag and type of licence), which let you quickly find a dataset suitable for your project.
GitHub Awesome Public Datasets
As well known as Kaggle, GitHub Awesome Public Datasets is an aggregator of dataset sources, classified under 30 subjects and recommended by the GitHub developer community. Most of the sources are free. The obligation to comply with format guidelines guarantees both quality and uniformity.
UCI Machine Learning Repository
The oldest of the directories. Today the UCI repository offers 438 datasets, which can be accessed free of charge without prior registration. UCI also allows you to filter your search by type of task, type and number of attributes, number of instances, type of data, research field or format type. The datasets are considered to be clean enough not to require preprocessing prior to use.
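To give an idea of what this kind of metadata filtering amounts to, here is a minimal Python sketch. It is not the UCI site's API: the `DatasetEntry` record, the `filter_catalog` helper and the three catalog entries are all hypothetical, made up purely for illustration. It simply assumes you have a small list of dataset descriptions in memory and want to narrow it down by task, size and format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """Hypothetical description of one dataset in a directory."""
    name: str          # placeholder name, not a real dataset
    task: str          # e.g. "classification" or "regression"
    n_instances: int   # number of rows
    n_attributes: int  # number of columns / features
    data_format: str   # e.g. "csv" or "json"


def filter_catalog(entries, task=None, min_instances=0,
                   max_attributes=None, data_format=None):
    """Return the entries that satisfy every criterion supplied.

    A criterion left at None (or 0 for min_instances) is not applied.
    """
    selected = []
    for entry in entries:
        if task is not None and entry.task != task:
            continue
        if entry.n_instances < min_instances:
            continue
        if max_attributes is not None and entry.n_attributes > max_attributes:
            continue
        if data_format is not None and entry.data_format != data_format:
            continue
        selected.append(entry)
    return selected


# Made-up catalog entries, for illustration only.
CATALOG = [
    DatasetEntry("example-a", "classification", 150, 4, "csv"),
    DatasetEntry("example-b", "regression", 5000, 12, "csv"),
    DatasetEntry("example-c", "classification", 20000, 60, "json"),
]

# Keep only classification datasets with at least 1,000 instances.
matches = filter_catalog(CATALOG, task="classification", min_instances=1000)
print([entry.name for entry in matches])  # prints ['example-c']
```

Combining several criteria, as in the call above, is what lets you move from hundreds of candidates to a shortlist you can inspect by hand.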
The public datasets
Driven as much by the need for transparency as by a determination to encourage data science projects, several governments have progressively made a wide range of datasets public.
Since 2009, the United States has supplied Data.gov with datasets from American government agencies. In France, the data.gouv.fr project was recently boosted by opendatafrance.net, which aggregates the data provided by local authorities. The British government has set up The UK Data Centre.
Practically every country in the world could be included here, but mention must be made of the Indian demographic data and the Open Government Data Platform.
Finally, if none of these standard options meet your needs... think of us. Every day, Ingedata creates datasets which train, validate and optimise machine learning algorithms.