Dirty data delays innovation. Before it can be used in training algorithms for machine learning applications, data must be collected, formatted, cleaned, combined and annotated.

Most companies turn to their own AI team for this task, trusting they’ll do a better job of turning coffee into organised training data than outsiders. The problem with bringing this extensive process in-house is that it’s neither as low-cost nor as effective as it seems.

Sourcing and labelling data is a tedious and time-sapping process that requires a lot of human and financial resources. It’s also profoundly complex and involves increasingly large volumes of data, which invites human error and sabotages dataset quality.

So before handing disorderly data over to your AI team in an effort to save costs and ensure quality, you should first understand why in-house data labelling can actually stymie your ML projects.

It’s expensive

Data scientists and AI engineers aren’t cheap. Their skills are among the ten most in demand, and they command salaries well over $120,000 a year.

‘You can’t expect people who have such high salaries to do this labor-intensive work’, says Zhou Junkai, a data labeller in China’s booming AI industry—a hub for companies looking to affordably outsource their repetitive data labelling.

There are various solutions to this expensive problem, such as outsourcing and crowdsourcing, that are much more wallet-friendly and allow your team to focus on the higher-skilled tasks they’re trained and paid to do. (We’ll talk more about these solutions later.)

It slows down progress

According to the IBM Cloud Blog, most companies continue to entertain the so-called ‘80/20’ rule, which states that ‘80% of a data scientist’s valuable time is spent simply finding, cleaning, and organising data, leaving only 20% to actually perform analysis’.

Diverting your team’s brain power to repetitive and time-consuming tasks is a bad use of their precious time. If they could dedicate their work hours to analysing data and refining the algorithm for a higher-performing model, you’d get more value a lot sooner.
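To put a rough number on it, here is a back-of-the-envelope sketch that combines the two figures quoted above: a $120,000 salary and the 80% share of time spent on data preparation. The function name `data_prep_cost` and the team size of five are assumptions made purely for illustration, and the calculation ignores benefits, overheads and the value of the delayed work.

```python
def data_prep_cost(annual_salary: float, prep_share: float, team_size: int = 1) -> float:
    """Yearly salary spend that goes to data preparation instead of analysis."""
    if not 0.0 <= prep_share <= 1.0:
        raise ValueError("prep_share must be between 0 and 1")
    return annual_salary * prep_share * team_size


# Figures quoted in this article.
SALARY = 120_000      # dollars per year
PREP_SHARE = 0.80     # the '80' in the 80/20 rule

# Hypothetical team size, chosen only for illustration.
TEAM_SIZE = 5

per_person = data_prep_cost(SALARY, PREP_SHARE)
per_team = data_prep_cost(SALARY, PREP_SHARE, TEAM_SIZE)

print(f"Per data scientist: ${per_person:,.0f} a year")
print(f"Team of {TEAM_SIZE}: ${per_team:,.0f} a year")
```

Under these assumptions, roughly $96,000 of each salary pays for finding, cleaning and organising data every year. Swap in your own salary, share and headcount to see what the figure looks like for your team.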

Data scientists are essential for pushing your business forward in the AI race. You don’t want your business falling behind because your data science team is too busy fiddling with low-value tasks.

It negatively affects dataset quality

You know the motto: a model is only as good as the data used to build it.

Companies often assume that they’ll have more control over quality if their expert in-house team is in charge of organising the training data. But this isn’t always true.

Considering there’s a little something called deadlines, data scientists flitting between multiple projects may compromise the quality of the data to get the job done quickly and move on to the next one. Cutting corners in data preparation or annotation can result in an unusable model when it’s put into production, which is a multi-million dollar problem for many businesses.

Additionally, data scientists aren’t equipped for the intricate process that consistent data labelling requires. This skill gap often invites human error, which ultimately affects the performance and accuracy of predictive models.

It’s not scalable

Enhancing a model means feeding it more high-quality data. If you have a small AI team dissecting and sorting training data for each and every model, your projects will only get so far.

Given that businesses thrive on growth and innovation, unless you shift the manual and repetitive work elsewhere, your team will always be bogged down with endless labelling and your business will never find the time or resources to expand its data science efforts.

It distracts the team from innovation

Remember Google Duplex, the AI assistant that makes phone calls to businesses with uncanny human-like quality? Imagine if Google had forced its brilliant Duplex AI team to focus on data labelling rather than on advancing an AI that some observers claimed had passed the Turing test.

You don’t have to be a tech giant to realise that data preparation and annotation are a distraction for data scientists who are eager to innovate. The reason you hire data scientists in the first place is to develop algorithms and build machine learning models that further business goals, so let them do it.

It drives data scientists to quit

‘Data munging’. That’s the term data scientist Mike Driscoll used to describe the ‘painful process of cleaning, parsing and proofing one’s data’ in his iconic post on the three sexy skills of data geeks.

The reality is that data scientists despise preparing and organising data. In fact, 57% of data scientists consider it the least enjoyable part of their work.

In more concerning news, in 2017 the Financial Times reported that most data scientists spend 1-2 hours a week looking for a new job—in part due to the underwhelming tasks given to them by clueless employers.

The bottom line: if you want to keep your data scientists on board, don’t force them to hunch over messy data for hours on end. That’s not what they signed up for.


When it comes to data labelling, the saying ‘if you want it done right, do it yourself’ isn’t always true.

While the idea of incurring lower costs and having more control over the process is tempting, it’s also misguided. Businesses can take other approaches to their data labelling needs, the most common being crowdsourcing and outsourcing.

In short, crowdsourcing relies on a freelancing platform where multiple contractors are assigned to data labelling projects for small rewards. Crowdsourcing is faster, cheaper and more easily scalable than in-house data labelling. But the low price comes at the cost of reduced dataset quality, consistency and confidentiality.

On the other hand, outsourcing means hiring an external team that specialises in training data preparation. This option is more costly than crowdsourcing but ensures higher-quality datasets. Think of the external team as an extension of your AI team, but one that requires much less supervision. The main advantage of outsourcing is that your team is relieved of tedious tasks and can finally focus on its core expertise, which is what you hired it for.

To learn more about how you can break the 80/20 rule and leverage your AI team’s skill set to its fullest potential, read our post on Crowdsourcing Platforms vs Outsourcing Companies.

Originally published on March 19, 2019

