Are those dear data scientists in the best position to clean databases?

Do you know the true cost of your data? Gartner, which analyzed the cost of using poor-quality data, found that dirty data was responsible for the loss of $15 billion annually. Machine learning and deep learning projects, which rely directly on data, are the most vulnerable. Garbage in, garbage out: if the underlying data is poor, the result will be mediocre at best. The cost of creating a dataset depends on many variables, but one constant remains: there can be no development without reliable data. So whose task is it to clean data? It often falls to the data scientist, but this can be shown to be counterproductive.

Data cleansing in detail

Data preparation, which takes up 80% of a data scientist’s time, is an arduous process. Most of the work consists of cleansing existing databases so that algorithms can interpret them. The key types of error concern syntax, semantics and coverage. Syntax errors are lexical, format and irregularity faults. Semantic errors include integrity violations, contradictions, and duplicate or invalid data. Coverage errors concern missing values, the headache of incomplete data.
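To make the three error families concrete, here is a minimal sketch of an audit over a handful of records. The sample data, the field names (`email`, `signup`, `last_order`) and the `audit` function are all hypothetical and chosen only for illustration; a real cleansing pipeline would need rules specific to its own data.

```python
import re
from collections import Counter

# Hypothetical sample records; the field names are assumptions for illustration.
RECORDS = [
    {"id": 1, "email": "ana@example.com", "signup": "2018-03-01", "last_order": "2018-04-10"},
    {"id": 2, "email": "bob(at)example.com", "signup": "2018-05-12", "last_order": "2018-06-01"},
    {"id": 3, "email": "cy@example.com", "signup": "2018-07-20", "last_order": "2018-01-05"},
    {"id": 1, "email": "ana@example.com", "signup": "2018-03-01", "last_order": "2018-04-10"},
    {"id": 4, "email": None, "signup": "2018-08-02", "last_order": None},
]

# Deliberately simple patterns: good enough for a sketch, not for production.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def audit(records):
    """Count syntax, semantic and coverage problems in a list of dict records."""
    issues = Counter(syntax=0, semantic=0, coverage=0)
    seen = set()
    for rec in records:
        # Coverage: fields that have no value at all.
        issues["coverage"] += sum(1 for v in rec.values() if v is None or v == "")

        # Syntax: values that are present but badly formatted.
        if rec["email"] and not EMAIL_RE.match(rec["email"]):
            issues["syntax"] += 1
        dates_ok = True
        for field in ("signup", "last_order"):
            if not rec[field]:
                dates_ok = False
            elif not DATE_RE.match(rec[field]):
                issues["syntax"] += 1
                dates_ok = False

        # Semantics (duplication): an identical record was already seen.
        key = tuple(sorted(rec.items()))
        if key in seen:
            issues["semantic"] += 1
        seen.add(key)

        # Semantics (contradiction): an order placed before signup.
        # ISO dates in YYYY-MM-DD form compare correctly as strings.
        if dates_ok and rec["last_order"] < rec["signup"]:
            issues["semantic"] += 1
    return issues


print(audit(RECORDS))
```

On these sample records the audit reports one syntax issue (the malformed email), two semantic issues (one contradiction, one duplicate) and two missing values.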

Several tools on the market make the task easier, such as WinPure, Data Ladder, TIBCO Clarity and Trifacta Wrangler.

Data science vs. data cleansing

In 2018, the average salary of a data scientist is estimated at €65,000 in France and $123,000 in the USA. Their involvement therefore carries a cost: entrusting data scientists with data cleansing makes datasets exorbitantly expensive to acquire.
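As a rough back-of-the-envelope sketch, the salary figures can be combined with the 80% preparation share quoted earlier. The function name `annual_cleansing_cost` is hypothetical, and the calculation assumes that the 80% share applies uniformly to salary alone, ignoring employer charges and tooling.

```python
def annual_cleansing_cost(salary, prep_share_percent=80):
    """Portion of a yearly salary spent on data preparation (assumed share)."""
    return salary * prep_share_percent / 100


france = annual_cleansing_cost(65_000)   # euros, using the 2018 estimate above
usa = annual_cleansing_cost(123_000)     # dollars, using the 2018 estimate above

print(f"France: €{france:,.0f} of salary per year goes to data preparation")
print(f"USA: ${usa:,.0f} of salary per year goes to data preparation")
```

Under these assumptions, about €52,000 of a French salary and $98,400 of a US salary would go to preparation rather than modelling each year.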

Another issue is development time. Where would artificial intelligence be if data scientists worked five times as quickly? That is indeed the ratio at stake: if preparation takes up 80% of their time, only 20% is left for building models. While data scientists are cleansing, they aren’t building anything. At a time when business services based on machine learning are spreading into every sector, delays in releasing an application can be fatal for a company.

If data scientist is to remain the “sexiest” job of the 21st century, as the Harvard Business Review put it, we need to separate the tasks and leave data scientists to do what they do best: drive innovation.

Originally published on September 07, 2018. Topics: Data Science, Datasets

