You know your algorithms will only be as good as your training data, but just the thought of manually labelling data sets for weeks on end is already exhausting. Data scientists and AI engineers feel the same way. Labelling data is a tremendously tedious task that only distracts them from their core expertise: analysing data, refining algorithms and improving machine learning models.
So how can you create reliably annotated data without encumbering your in-house data science team? Two common solutions to this conundrum are crowdsourcing platforms and outsourcing companies. Both can be very effective for your data science needs, but only if you choose the right one, which is precisely the problem.
To help you adopt the best approach for your business, we’ll briefly cover the benefits and pitfalls of each, then wrap up with a no-nonsense recommendation on when it makes sense to choose crowdsourcing or outsourcing.
In a nutshell, crowdsourcing platforms assign freelancers from around the world to annotate your data. Most platforms break down a large project into microtasks which are then distributed among dozens of freelancers (or microworkers) to complete in tandem. Naturally, the more people work on a task, the faster it’ll be completed.
To gain access to these magical human workers willing to perform the necessary evil that is data labelling, there are various platforms to choose from, like Amazon Mechanical Turk. The way these platforms usually work is you place an order, define the specifications, and the platform’s team will post the work for a select number of microworkers to pick up. Voilà.
Now let’s get into the pros and cons of crowdsourcing your annotated data.
If budget is an issue, then crowdsourcing solutions offer a wide range of prices. Like everything else in the gig economy, crowdsourcing offers a free marketplace where freelancers can snatch up short-term engagements in exchange for (usually low) wages.
Using a low-paid virtual crowd also means reduced fees for companies seeking data labelling services. Naturally, this apparent affordability makes crowdsourcing an attractive solution for most short term projects. But even though you pay per individual task, low costs add up quickly. Not to mention there are additional fees if you want to ensure a certain level of quality.
There is a fair amount of project management involved when turning over your data science needs to a virtual crowd. Communication is essential but can also get difficult when you’re managing multiple people. Most importantly, you have to be very specific in your instructions and outline exactly what constitutes right or wrong submissions (illustrations included).
Furthermore, if quality is top priority (as it should be), you want to set up thorough screening strategies to vet your workers before they get started. All of this consumes even more of your time (and money).
However, if you have a small project and time to spare, this extra attention on your part will be a small price to pay in exchange for quick and easy results.
Confidentiality is a major concern in crowdsourcing. On some platforms, microworkers are typically unvetted people sitting at home with no special security measures in place to protect your data. Not to mention they have access to continuous streams of private information.
Rochelle LaPlante, an MTurk worker, showcased the risks of crowdsourcing in a concerning tweet where she revealed how she could see people’s Uber receipts, complete with their full names and pick-up and drop-off addresses.
Since then, MTurk has enabled better security measures to protect its clients’ data from leakage, and most crowdsourcing platforms have their own screening process to weed out scammers.
Workers are also bound by confidentiality agreements, but there’s really no telling what the wrong person might do with the data they receive. After all, as Jeffrey Bigham, a crowdsource researcher at Carnegie Mellon University, told Wired, “Every product that uses AI also uses people”.
As with most services, the quality depends on how much you’re willing to pay. Most microworkers simply want to complete as many tasks as possible in a day to collect their pay. It’s a side job, and labelling data from a smartphone on the way to work doesn’t exactly scream accuracy. They’re not particularly invested in quality assurance and won’t lose sleep over mislabelling your data.
If you’re paying for the most basic data labelling service, you can expect quality issues and irregular response rates from your workers. So it’s worth paying a bit more to get the quality you’re aiming for.
While most crowdsourcing platforms have their own quality management measures to guarantee effective services, they can only do so much. It’s mostly up to you to screen your workers beforehand and ensure everything is up to par before releasing any payments.
The good thing about working with a virtual crowd is you can have five or 500 workers ploughing away at your project in tandem. This makes your data science efforts easily scalable without troubling yourself with hiring and training new employees.
The scalability factor is what makes crowdsourcing a popular approach. In just a few clicks, you can immediately access a pool of microworkers ready to jump into your project. This makes crowdsourcing the best option if, for example, you need basic image annotation for thousands of images in a short amount of time.
Crowdsourcing has a bad reputation for its low wages and unstable income. One study found that workers performing a combined 3.8 million tasks on MTurk earned a median wage of only $2 an hour.
With such underwhelming wages, freelancers are often forced to take on a significant amount of work to make ends meet. In turn, the constant chase for work leaves them no time to develop higher-value skills, making them dependent on crowdsourcing platforms for income. (That is, until their skills are phased out there too.)
Time to dig into the second approach. Outsourcing basically consists of contracting an external team to work on your AI projects. But instead of summoning freelancers with dubious qualifications, these companies screen and hire highly skilled workers to handle your data preparation and annotation.
The first step to outsourcing data labelling is defining the work and outlining the project’s specifications. Your request is then assumed by a trained team that continually works with you to ensure accurate and consistent data sets. They also manage their own quality assurance and take a rigorous approach to data privacy and security.
Each company offers a variety of services and some focus on different areas of artificial intelligence—so be sure to do your research when deciding on the right company for your business.
Now that you understand the gist of outsourcing companies, let’s move onto the pros and cons of this approach and touch on how it compares to crowdsourcing.
The cost of outsourcing data annotation depends on the company you choose to partner with. Most offer packages or plans where you define the number of working hours, the services you need, and the size of your dataset in order to calculate the team’s hourly or monthly fee.
Outsourcing charges may seem high, but it’s a flat fee that typically covers quality verification, employee screening, project management and what you’d usually pay extra for on a crowdsourcing platform. Not to mention it’s still significantly cheaper than in-house data labelling and undoubtedly more effective.
An outsourced team essentially works as an extension of your company, just without the need for micro-management. That means no outlining step-by-step instructions or breathing down their necks to make sure they’re following the rules.
When you hire a professional team to curate your datasets, you can expect them to possess the necessary expertise, follow established methodologies, perform their own quality assurance tests and refine their processes to increase efficiency.
And because you pay by the hour and not by the task, the team is far more likely to work by your side in pursuing those productivity increases.
As an added benefit, working with the same team for the duration of your project means they’ll become more familiar with your work—requiring less of your attention and improving overall productivity.
Outsourcing companies often position themselves as a safer alternative to crowdsourcing. To deliver on this promise, they meticulously screen their employees and take strict measures to guarantee a suitable level of security.
Moreover, outsourced employees work from company offices using official equipment and following established security practices. You can rest assured that your data will be treated carefully and restricted to only two cloud platforms: yours and the outsourcer’s.
To verify if your partner can be trusted with your sensitive data, check their security policies and have a discussion with their IT teams.
Thanks to expert teams armed with substantial experience solving data science challenges, the quality of results delivered by outsourcing companies is, in the end, considerably higher than that of any crowdsourcing platform.
Besides, as you continuously work with the same team and project managers, you can come to expect a consistent level of quality in each dataset. This consistency allows you to maintain specific quality-check procedures instead of setting up new ones with every iteration.
You do, of course, have to select your partner with caution to secure such benefits. Take the time to research their expertise, processes and business model to avoid quality issues down the line.
Most outsourcing companies provide cloud-based scaling to match your data science ambitions. However, when it comes to human-led annotation, you can’t multiply your team into thousands of workers as you could with a virtual crowd.
For this reason, outsourcing is more appropriate for projects that don’t demand copious amounts of annotation before your coffee gets cold.
Although in defence of outsourcing companies, the rush to complete tasks on crowdsourcing platforms is what commonly leads to mistakes in data labelling. As a post on ZeroFOX explains: ‘Good requests take effort to create’.
Unlike the gig economy, outsourcing companies employ long-term workers and continually invest in them to keep their skills up-to-date. After all, it’s in the company’s best interest to grow their employees’ competencies to effectively tackle future projects.
As for wages, they’re actually liveable. Outsourcing companies compensate their employees fairly in exchange for proficient work. (Plus, they want to avoid the high costs of staff turnover.) Each company has its own policies, though, so make sure to check them before entrusting a partner with your business.
As with most business conundrums, the answer to which approach is most suitable is, ‘it depends’. How you should source clean, labelled data depends almost entirely on what your project requires and the resources you’re willing to invest.
For example, if you need a large number of basic tasks completed in the least amount of time, then crowdsourcing is your best bet, although to make it work you’ll need to maintain strong communication with your workers.
On the other hand, if your project requires specialised processes or experts on board and your budget is flexible, it’s worth outsourcing your dataset to a qualified team. This is also recommended if your data is sensitive in nature and you’re too busy to babysit each project.
While both approaches certainly offer powerful solutions to complex, data-driven challenges, there are clear benefits of partnering with an outsourcing company rather than with a crowdsourcing platform.
For one, you can let a dedicated team fuss over your messy data while your own AI team focuses on growth and innovation. You can also be confident that you’re sourcing accurately annotated data. In essence, outsourcing companies provide you with the quality and consistency required to build high-performing models that actually work in production.
To get started with a smart outsourcing agency, contact us at Ingedata to discuss the best human solution to your machine learning challenges.