Why and how does data preparation reduce productivity and motivation?
We expect our Machine Learning (ML) and artificial intelligence algorithms to understand information, act, and interact with our environment in the most natural, human way possible. The Google Assistant demonstration at I/O 2018 is a striking example of this.
But the performance of these models depends not only on the computing power allocated and the algorithms chosen, but also on the quality of the data fed into them.
Machine Learning: why relevant data sets are essential to performance
Your ML algorithms need to be trained with good data, that is, data optimized for the problem you are dealing with. The data needs to be sourced, formatted, cleaned, sampled, aggregated and so on: data preparation is the essential first step in creating high-quality data sets.
You also need to carry out data annotation, which involves tagging each piece of data with its corresponding attribute in order to create data sets that are labeled well enough to be useful to your algorithms.
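As a rough illustration of the formatting, cleaning, sampling and aggregation steps mentioned above, here is a minimal Python sketch. The records, field names and helper functions are all hypothetical and invented for this example; a real pipeline would depend on your data and tooling.

```python
import random
from collections import defaultdict

# Hypothetical raw records; the field names are assumptions for illustration.
RAW = [
    {"text": "  Great product ", "label": "positive"},
    {"text": "great product", "label": "positive"},   # duplicate once formatted
    {"text": "Terrible support", "label": "negative"},
    {"text": "", "label": "negative"},                 # empty text
    {"text": "Works as expected", "label": None},      # missing label
    {"text": "Would buy again", "label": "positive"},
]

def format_record(rec):
    """Normalize whitespace and case so equivalent rows compare equal."""
    text = " ".join((rec.get("text") or "").split()).lower()
    return {"text": text, "label": rec.get("label")}

def clean(records):
    """Drop rows with empty text or a missing label, then remove duplicates."""
    seen, out = set(), []
    for rec in map(format_record, records):
        key = (rec["text"], rec["label"])
        if not rec["text"] or rec["label"] is None or key in seen:
            continue
        seen.add(key)
        out.append(rec)
    return out

def sample(records, k, seed=0):
    """Draw a reproducible random sample of at most k rows."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))

def aggregate(records):
    """Count rows per label, a quick check on class balance."""
    counts = defaultdict(int)
    for rec in records:
        counts[rec["label"]] += 1
    return dict(counts)

prepared = clean(RAW)
print(aggregate(prepared))  # {'positive': 2, 'negative': 1}
```

Even on six toy rows, half the input is discarded, which hints at why these steps become expensive at scale.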
Of course, the problem is not creating or obtaining data sets. Not only are there public data sets available (such as those offered by AWS or Kaggle) but there are also outsourcing solutions, and even crowdsourcing solutions (like Amazon Mechanical Turk) where you can get the data sets you need. The problem is having reliable data sets suitable for building a high-performance model.
This often creates the temptation to bring the preparation and annotation of data in-house, to ensure complete control over the quality, confidentiality and relevance of the data you use. After all, the process often requires very specific professional skills and in-depth knowledge of your business sector.
Should data preparation really be the job of data scientists?
The risk to quality posed by in-house annotation
Bringing your data preparation in-house very often delivers far lower quality than you might imagine. From data collection to cleaning, sampling and annotation, you are probably aware that data preparation demands a huge amount of human and financial resources, and requires particularly complex workflows to automate tasks that are often repetitive.
By asking your data scientists to dissect, sort and classify increasingly large volumes of data, you increase the risk of error.
And errors can be costly. In a report published in 2016, Gartner estimated that poor-quality data could cost an organization up to 13.5 million dollars a year, in addition to compromising the relevance and performance of an algorithm.
Annotation: a laborious task
According to CrowdFlower (now Figure Eight) in 2016, data preparation represented between 50% and 80% of the work of data scientists, 60% of which was dedicated entirely to cleaning and organizing raw data. According to the same study, 76% of data scientists also considered data preparation the most laborious part of their work.
An often inefficient allocation of your resources
The time your data scientists spend preparing data is obviously time that could have been spent on tasks and projects with a higher added value, not only for themselves, but also for the entire organization. Handling the data preparation and annotation process in-house slows your growth by diverting brain power away from its core expertise: improving the performance and precision of the model, refining your algorithm, and assessing the relevance and quality of your results to identify opportunities for improving your data sets. Given that the average annual salary of a data scientist this year was 123,000 dollars in the United States and 65,000 euros in France, it makes sense to consider outsourcing data set production.
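To put a rough number on this, the short Python sketch below combines the salary figures above with the 50% to 80% share of time cited from CrowdFlower. It assumes, purely for illustration, that salary cost is spread evenly over working time; the function name and parameters are hypothetical, and overheads such as benefits are ignored.

```python
def prep_cost_range(annual_salary, low_pct=50, high_pct=80):
    """Return the (low, high) share of a salary spent on data preparation.

    Assumes cost is proportional to time spent, which is a simplification.
    """
    return annual_salary * low_pct / 100, annual_salary * high_pct / 100

us_low, us_high = prep_cost_range(123_000)  # dollars, United States
fr_low, fr_high = prep_cost_range(65_000)   # euros, France

print(f"US: {us_low:,.0f} to {us_high:,.0f} dollars per data scientist per year")
print(f"FR: {fr_low:,.0f} to {fr_high:,.0f} euros per data scientist per year")
```

Under these assumptions, data preparation would absorb roughly 61,500 to 98,400 dollars of a United States salary, or 32,500 to 52,000 euros of a French one, per data scientist per year.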
Outsourcing for better control: what are your options?
Offshoring or crowdsourcing
As mentioned above, there are various options for outsourcing your data preparation: offshoring, which offers high processing capacity at a low cost, but with limited scalability and a heavy coordination burden; and crowdsourcing, which offers high scalability and quick delivery, but often at the cost of lower-quality data sets and no guarantee of confidentiality.
The third way: smart outsourcing
The third alternative is smart outsourcing, which means turning to a workforce that specializes in the field of the data being processed and has been trained on the specific features of your data sets. Because that workforce is focused completely on your project, there are no hidden costs when the bill arrives. This allows you to concentrate your resources, and those of your data scientists, on the heart of your business, without having to worry about the data used to train your ML models or the costs incurred in preparing it.