Do you remember the web before Google? People called it unusable. Too much information in too many different places. Too much noise. Too many URLs to remember. Then the search giant showed up and did a stellar job at indexing the world’s information.
Well, since September 5th, 2018, Google has been doing the same thing with scientific datasets. Through Dataset Search, anyone can openly access datasets in a variety of fields, anywhere there’s Internet coverage. It’s kind of a big deal.
Laying the Library’s Foundation
At the time of writing, Dataset Search is still in beta. It focuses on datasets in fields such as the environmental and social sciences, along with government data and data provided by news organizations like ProPublica.
Still, this will be a true goldmine of information for data scientists, journalists, and anyone interested in powering deep learning or machine learning engines. It’s also likely that the service’s popularity will trigger an avalanche of submissions from scientists and institutions whose goal is to share their findings publicly.
Aggregating Done the Google Way
Datasets have always been notoriously hard to find. The information is at times locked behind expensive paywalls, other times hiding in plain sight in multiple repositories. As The Verge reports, a scientist couldn’t find data about ocean temperatures, not because it was private but because the search function in the repo performed so poorly.
But if there’s one thing Google knows how to do, it’s build a robust search engine: one that is user-friendly, powerful, and good at understanding what people are really looking for. Autocomplete? Google has it. Search syntax? Yes, sir. Semantic search? Google Knowledge Graph does it best. In short, things would have to go very wrong for Google not to become the de facto solution for retrieving datasets.
A Step Towards Standardisation
Now, indexing is only possible with the right standardisation and data labeling. Will Google be able to nudge researchers towards a uniform way of writing metadata? It’s quite likely. If the service reaches critical mass, there is no reason why the search engine’s preferred formatting couldn’t become the norm. Google already lets datasets surface in its standard search results when their pages support the schema.org Dataset markup.
So yes, you can already submit your own dataset, provided you describe it using the right format. That goes for tables and CSVs and, for the deep learning geeks, things like trained parameters or neural structure definitions.
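To make the markup idea concrete, here is a minimal sketch of what describing a dataset with schema.org's Dataset type might look like, written in Python. The dataset name, description, and example.org URLs are invented for illustration, and the helper functions are hypothetical, not part of any Google tooling. The property names (name, description, url, distribution, encodingFormat, contentUrl) come from the schema.org vocabulary; which of them a search engine actually requires is not covered here.

```python
import json


def build_dataset_jsonld(name, description, url, download_url,
                         encoding_format="text/csv"):
    """Return a schema.org Dataset description as a plain dict.

    Hypothetical helper: the fields chosen here are a small subset
    of the schema.org Dataset vocabulary, picked for illustration.
    """
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        # One downloadable file that holds the actual data.
        "distribution": [
            {
                "@type": "DataDownload",
                "encodingFormat": encoding_format,
                "contentUrl": download_url,
            }
        ],
    }


def to_script_tag(metadata):
    """Wrap the metadata in a JSON-LD script tag for an HTML page."""
    body = json.dumps(metadata, indent=2)
    return '<script type="application/ld+json">\n' + body + '\n</script>'


# All values below are made up for the example.
example = build_dataset_jsonld(
    name="Hypothetical ocean temperature readings",
    description="Daily sea surface temperatures from an imaginary buoy.",
    url="https://example.org/datasets/ocean-temperature",
    download_url="https://example.org/datasets/ocean-temperature.csv",
)

print(to_script_tag(example))
```

The printed tag is meant to be pasted into the HTML of the page that presents the dataset, which is one common way of publishing schema.org markup so that crawlers can read it.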
Giving Open Data A New Home
Finally, the release of Google Dataset Search comes at an opportune time. The web’s democratisation of information has already inspired a plethora of open data initiatives, and it was only a matter of time before someone gathered them all in one spot.
Let's not forget Google has already done wonders with Google Scholar, its search tool dedicated to academic papers.