Cut Down on Labeling Costs with DISTIL

DECILE
5 min read · May 3, 2021

In this Article

  1. Introduction
  2. Reducing Labeling Costs
  3. DISTIL
  4. Robustness against Redundancy
  5. Video Explanation
  6. Conclusion
  7. Resources

1. Introduction

Much of deep learning owes its success to the staggering amount of data used in model training. While throwing data at these deep models has been shown to improve their accuracy time and time again, it comes at the great expense of data labeling. Indeed, a mid-size dataset of tens of thousands of points may cost anywhere from a couple thousand USD to a couple hundred thousand USD to label.

For example, if your labeling task does not require specialist knowledge, Google’s AI Platform Data Labeling Service can be used to procure labeled data. In that instance, labeling 50,000 units (see Google’s per-1000-unit pricing chart) could cost up to $43,500 for one annotation per data point. Typically, multiple people annotate each data point to ensure label quality, so this already large cost is multiplied by the number of annotations required: with, say, three annotations per point, the same 50,000 units would run roughly 3 × $43,500 = $130,500. This example also assumes that your data does not need specialist knowledge to label; a medical image dataset, for instance, often requires expert annotators for most labeling tasks. If you end up needing to label a very large dataset with difficult labels, well… I hope you have some spare pallets of cash lying around.

2. Reducing Labeling Costs

If you are like most people, you do not have a couple hundred thousand USD to shell out on labeling. A natural question to ask is how you can alleviate your labeling costs. A promising area of machine learning is active learning, which answers the following question: based on my model’s performance so far, which data should I get labeled so that my model’s performance is maximized once it is trained on the new collection of labeled data? The answer effectively allows you to cut to the chase: instead of labeling all your data, you label only the most informative points needed to achieve good model performance. In essence, active learning aims to distil the large amount of unlabeled data at your disposal so that you get the best labeling efficiency allowable under current methods.
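
To make this concrete, the following is a toy, self-contained sketch of the cycle that active learning formalizes: train on what is labeled so far, pick the next batch to label, and repeat. It is written in plain PyTorch on synthetic data with random selection as the baseline; it does not use DISTIL’s API, and every name in it is a placeholder chosen for illustration.

```python
# Toy active learning loop on synthetic data (illustrative only; this does not
# use DISTIL's API, and every name here is a placeholder).
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Synthetic two-class problem: two 2-D Gaussian blobs, 500 points each.
X = torch.cat([torch.randn(500, 2) + 2, torch.randn(500, 2) - 2])
y = torch.cat([torch.zeros(500, dtype=torch.long), torch.ones(500, dtype=torch.long)])

labeled = list(range(0, 1000, 50))      # a small labeled seed set (20 points)
unlabeled = [i for i in range(1000) if i not in labeled]
budget = 20                             # points we can afford to label per round

def train(idx):
    model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))
    opt = torch.optim.Adam(model.parameters(), lr=0.05)
    for _ in range(200):
        opt.zero_grad()
        F.cross_entropy(model(X[idx]), y[idx]).backward()
        opt.step()
    return model

def random_select(model, pool, k):
    # Baseline: ignore the model and pick k pool positions uniformly at random.
    return torch.randperm(len(pool))[:k].tolist()

for rnd in range(5):
    model = train(labeled)
    picks = [unlabeled[i] for i in random_select(model, unlabeled, budget)]
    labeled += picks                    # "send these to the annotators"
    unlabeled = [i for i in unlabeled if i not in picks]
    with torch.no_grad():
        acc = (model(X).argmax(dim=1) == y).float().mean().item()
    print(f"round {rnd}: {len(labeled)} labels, accuracy {acc:.3f}")
```

The whole point of an active learning strategy is to replace random_select with something smarter, and the library described next provides many such strategies out of the box.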

3. Deep dIverSified inTeractIve Learning (DISTIL)

Luckily, an open-source Python library exists to make active learning easy and accessible! DISTIL is a library that features many state-of-the-art active learning algorithms. Implemented in PyTorch, it provides fast and efficient implementations of these algorithms and lets users modularly insert active learning selection into their pre-existing training loops with minimal change. Most importantly, it shows promising results in achieving high model performance with less labeled data. Comparing these active learning algorithms against the baseline of randomly selecting points to label makes their labeling efficiency clear. Here are some of the results obtained on common benchmark datasets using some of the active learning algorithms in DISTIL:

  - The best strategies show a 2x labeling efficiency compared to random sampling. BADGE does better than entropy sampling with a larger budget, and all strategies do better than random sampling.
  - All strategies exhibit a gain over random sampling, but the per-batch version of BADGE performs similarly to random sampling. (Regular BADGE does not scale to CIFAR-100!)
  - All strategies exhibit a gain over random sampling, and both entropy sampling and BADGE achieve a 4x labeling efficiency compared to random sampling.
  - All strategies exhibit a gain over random sampling, and both entropy sampling and BADGE achieve a 4x labeling efficiency compared to random sampling.
  - All strategies exhibit a gain over random sampling, and both entropy sampling and BADGE achieve a 3x labeling efficiency compared to random sampling.
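
The entropy sampling that appears in these results is conceptually simple: score each unlabeled point by the entropy of the model’s predictive distribution and request labels for the k most uncertain ones. Below is a minimal sketch of that idea (illustrative only, not DISTIL’s implementation); it reuses X from the toy loop above and could be dropped in where random_select sits.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def entropy_select(model, pool, k):
    # Predictive class distribution for every point still in the unlabeled pool.
    probs = F.softmax(model(X[pool]), dim=1)
    # Shannon entropy of that distribution: high entropy = the model is unsure.
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=1)
    # The k most uncertain points are the ones worth paying to label.
    return torch.topk(entropy, k).indices.tolist()
```

Entropy sampling tends to pick points near the model’s decision boundary, which is exactly where an extra label changes the model the most.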

4. Robustness against Redundancy

A valid criticism of the above results might be that these benchmark datasets are not representative of real-world datasets. Indeed, many datasets used in industry feature an astronomical amount of data, and it is often the case that much of that data is redundant. A natural question to ask, then, is whether active learning is robust against redundancy. An answer to this question would give some evidence of the effectiveness of active learning on real-world datasets.

Very large datasets have many redundant data instances.

Luckily, DISTIL offers a wide repertoire of active learning algorithms, and some of them are robust against redundancy. In particular, we can examine how entropy sampling and BADGE perform on redundant data versus random sampling. The following shows some results on a modified CIFAR-10 dataset, where only a few unique points are drawn and increasingly duplicated:

  - BADGE and entropy sampling perform better than random sampling, though the labeling efficiency gain is not pronounced with so few unique points to select.
  - With more redundancy, BADGE begins to perform better than before, while entropy sampling begins to perform worse than random sampling.
  - With even more redundancy, BADGE continues to do better than random sampling, while entropy sampling continues to do worse.

Takeaway: Compared to random sampling, entropy sampling handles redundant data poorly while BADGE handles it proficiently.
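
The intuition behind this difference is that entropy sampling scores every point independently, so a thousand copies of the same confusing point all look equally attractive, whereas BADGE also selects for diversity: each candidate is represented by the gradient its predicted (hypothesized) label would induce on the last layer, and a k-means++-style seeding over those gradient embeddings favors points that are far apart, so exact duplicates are rarely picked twice. Here is a rough sketch of that embedding-and-seeding step (illustrative only, not DISTIL’s implementation):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def badge_select(penultimate, logits, k):
    # penultimate: (N, d) features from the layer feeding the classifier head
    # logits:      (N, C) class scores for the same N unlabeled points
    probs = F.softmax(logits, dim=1)
    yhat = probs.argmax(dim=1)
    # Gradient of the cross-entropy loss w.r.t. the last linear layer, using the
    # predicted label as a stand-in for the true one: (probs - onehot) x features.
    err = probs - F.one_hot(yhat, probs.size(1)).float()              # (N, C)
    g = (err.unsqueeze(2) * penultimate.unsqueeze(1)).flatten(1)      # (N, C*d)

    # k-means++-style seeding: repeatedly favor points far (in gradient space)
    # from everything chosen so far. Exact duplicates sit at distance zero from
    # an already-chosen copy, so they get essentially no chance of being picked.
    chosen = [int(g.norm(dim=1).argmax())]
    d2 = ((g - g[chosen[0]]) ** 2).sum(dim=1)
    while len(chosen) < k:
        nxt = int(torch.multinomial(d2 / d2.sum(), 1))
        chosen.append(nxt)
        d2 = torch.minimum(d2, ((g - g[nxt]) ** 2).sum(dim=1))
    return chosen                       # positions in the unlabeled pool
```

Because duplicated points add nothing in gradient space once one copy is chosen, piling on redundancy barely hurts BADGE, while it steadily drags an uncertainty-only strategy like entropy sampling below random sampling.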

5. Video Explanation

6. Conclusion

As you can see, the active learning algorithms in DISTIL show promise in greatly reducing the number of labeled data points required for your model, and DISTIL offers a wide enough range of active learning algorithms to handle your problem instance. Hence, DISTIL can save you the cost of labeling significant portions of your data, letting you deploy your final models sooner and saving development costs as well! Better yet, DISTIL is actively expanding its repertoire of active learning algorithms to ensure state-of-the-art performance. As such, if you are looking to cut down on labeling costs, DISTIL should be your go-to for getting the most out of your data.

7. Resources

More about Active Learning & DISTIL:

YouTube Playlist:

Author:

Nathan Beck

DECILE Research Group
