Here at Tasq.ai, we create labeled datasets for a variety of business use-cases. We use an army of Tasqers to get this done. But how can we be sure of the quality of our labels? And what do we do if there is a disagreement between labelers?
There are several methods we employ. A crucial one is that we use adaptive sampling.
Adaptive sampling is a statistical technique that dramatically improves the speed, accuracy, and quality of your results when collecting answers from a population.
This article will explain what adaptive sampling is and why it is beneficial to choose a data labeling service that uses it.
Adaptive Sampling – An Example
Let’s say you work in a hospital with 10 doctors who see 100 patients in total each day. How do you diagnose each patient correctly, as fast as possible, while ensuring maximum quality for each diagnosis? You must choose a sample of doctors to see each patient.
Sample size = 1
Sending one doctor to each patient is the fastest method, but it compromises quality. If a doctor sees an illness they are uncertain about, they cannot get a second (or third) opinion, and the quality of that diagnosis will suffer.
Sample size = 10
Alternatively, you could send every doctor to every patient. This method ensures maximum quality for each diagnosis, but it is slow and wasteful. You don’t need 10 doctors to tell you that you have a broken arm that needs to be put in a cast.
Sample size = 3 (fixed sampling)
Send a small, fixed number of doctors to each patient, e.g., 3. Choose a number small enough to be more efficient than sending every doctor, and let majority voting decide the result. This method is OK but still wasteful: even if a diagnosis is simple and the first doctor is 100% certain of it, they will still go and get two more points of view. You still have a sub-optimal allocation of resources.
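To make the fixed-sampling idea concrete, here is a minimal sketch of majority voting in Python (the function name and example diagnoses are ours, purely for illustration):

```python
from collections import Counter

def majority_vote(votes):
    """Return the answer given by the largest share of a fixed sample of votes."""
    winner, _ = Counter(votes).most_common(1)[0]
    return winner

# Three doctors, two say flu: majority voting settles on flu.
print(majority_vote(["flu", "flu", "cold"]))  # → flu
```

With a fixed sample size, this vote always costs three opinions, no matter how obvious the case.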
Sample size = varied (adaptive sampling)
A final option is to vary the number of doctors you send to each patient based on the difficulty of the diagnosis. This method is called adaptive sampling. If a diagnosis is simple and the doctor is confident in the result, they can treat the patient themselves and don’t need to get another opinion. However, if the diagnosis is difficult, they can get more points of view from other doctors. This method leads to a much more efficient allocation of resources than fixed sampling in terms of time and quality. If the hospital gets many easy diagnoses in a row, only a few doctors need to see each patient. If the hospital receives many complex cases in a row, they can send more doctors to see the patients and ensure top-quality diagnoses.
The goal is to give top-quality diagnoses as fast as possible. Adaptive sampling ensures that simple illnesses are diagnosed rapidly and focuses more resources on the more difficult diseases to provide correct diagnoses.
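As a rough sketch, an adaptive version asks for opinions one at a time and stops early once the leading answer reaches an agreement threshold. The threshold and minimum/maximum vote counts below are illustrative assumptions, not a real clinical or Tasq.ai rule:

```python
from collections import Counter

def adaptive_opinions(opinions, agree_share=0.75, min_votes=2, max_votes=10):
    """Collect opinions one at a time; stop early once the leading
    diagnosis holds at least agree_share of the votes (after min_votes).
    Thresholds here are illustrative assumptions only."""
    seen = []
    for opinion in opinions[:max_votes]:
        seen.append(opinion)
        top, count = Counter(seen).most_common(1)[0]
        if len(seen) >= min_votes and count / len(seen) >= agree_share:
            return top, len(seen)  # confident early stop
    top, _ = Counter(seen).most_common(1)[0]
    return top, len(seen)  # budget exhausted: fall back to the majority

# Easy case: two agreeing doctors suffice.
print(adaptive_opinions(["flu", "flu", "flu", "flu"]))  # → ('flu', 2)
# Hard case: disagreement keeps pulling in more doctors.
print(adaptive_opinions(["flu", "cold", "flu", "cold", "flu", "flu"]))  # → ('flu', 6)
```

Easy cases cost two opinions instead of a fixed three or ten, while hard cases automatically draw on more doctors.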
Adaptive Sampling and Data Labeling
Now that you have an intuitive understanding of adaptive sampling and its utility, let’s explain how this relates to data labeling.
Note: if you swap out ‘doctors’ for ‘data labelers’ and ‘patients’ for ‘images’ in the above example, you should be able to see the similarities.
Let’s say you want to label a collection of satellite images and classify the buildings within them as residential or business. For each photo, traditional data-labeling companies would take 10 votes from 10 labelers and use majority voting to decide the result. If 6 labelers think one section is residential and 4 think it’s business, it’s labeled as residential.
We believe this is a flawed approach. If a model had 60% accuracy, it would be marked for further training. Likewise, if 60% of your labelers think a section of an image is residential, it is essential to obtain additional votes and increase the confidence in the label.
As the saying goes, garbage in, garbage out. Your datasets are the foundation of your machine learning models. Your models will never perform well if you have poor quality data, no matter how much you tune them.
Let’s see how we can improve this result using adaptive sampling.
If you want your model to be 99% accurate when classifying these images, you’d want your labels to be at least 99% accurate as well.
With adaptive sampling, we keep sampling from the population until we are sure of the quality of the label. In the satellite-image example, this means collecting more votes for images where labelers disagree, until the label reaches the required confidence.
In other situations, adaptive sampling means we can collect fewer labels from fewer labelers. For example, if the first N labelers all mark a section of an image as residential, we can stop sampling once we reach the 99% confidence threshold. If there were real doubt that the section was residential, at least one labeler would likely have marked it as business.
Overall, on our platform, we determine label confidence using advanced statistics, combining a variation of the Dirichlet distribution with each worker’s labeling ability. This reduces the number of labels needed for more straightforward images and increases it for more challenging ones. Moreover, since we know the confidence level, we can label incredibly complex images just as easily as simple ones. No image is too difficult for our labelers.
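Our exact model is proprietary, but the flavor of the idea can be sketched in the binary (residential/business) case: treat the unknown fraction of ‘residential’ answers as Beta-distributed (the two-category special case of the Dirichlet), update it after each vote, and stop once the posterior probability of the majority label reaches 99%. Everything below, including the uniform prior and ignoring per-worker ability, is a simplification for illustration:

```python
import math

def beta_pdf(p, a, b):
    """Density of a Beta(a, b) distribution at p."""
    coef = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return coef * p ** (a - 1) * (1 - p) ** (b - 1)

def prob_residential_majority(k, n, steps=10_000):
    """P(true residential fraction > 0.5) under a Beta(k+1, n-k+1)
    posterior (uniform prior), via midpoint-rule integration."""
    a, b = k + 1, n - k + 1
    dx = 0.5 / steps
    return sum(beta_pdf(0.5 + (i + 0.5) * dx, a, b) for i in range(steps)) * dx

def adaptive_label(votes, confidence=0.99):
    """Consume votes one at a time; stop as soon as either label
    is a majority with posterior probability >= confidence."""
    n = k = 0
    for vote in votes:
        n += 1
        k += vote == "residential"
        p = prob_residential_majority(k, n)
        if p >= confidence or (1 - p) >= confidence:
            break
    return ("residential" if 2 * k > n else "business"), n

# Under this toy model, six unanimous votes are enough to clear 99%.
print(adaptive_label(["residential"] * 10))  # → ('residential', 6)
```

A contested image (alternating votes, say) would keep pulling in labelers, which is exactly the behavior we want: spend effort where the label is genuinely uncertain.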
It is a waste of money to have too many people label a simple image. Thanks to adaptive sampling, we maximize the capabilities of our labelers. Moreover, it ensures a more efficient allocation of your resources and a rapid delivery of high-quality datasets to you and your team.
Adaptive sampling is a faster and more efficient way to sample a population than traditional, fixed sampling. Accurately labeled data is the foundation for your machine learning models. A component this fundamental should be created with the highest quality processes.
In this article, we’ve explained why adaptive sampling is essential and why you would greatly benefit from using a data labeling service that uses it.
Book a demo with Tasq.ai today to get high-quality data at your fingertips rapidly.