Active Learning Strategies & DISTIL

DECILE
7 min read · May 5, 2021

In this Article

  1. Introduction
  2. Deep dIverSified inTeractIve Learning (DISTIL)
  3. Various Active Learning Strategies
  • Uncertainty Sampling
  • Coreset
  • FASS
  • BADGE
  • GLISTER-ACTIVE
  • Adversarial Techniques
  • BALD

  4. Video Explanation
  5. Resources

1. INTRODUCTION

Deep learning models, often considered the state of the art, are especially good at finding hidden patterns in large datasets because they learn to craft their own features. However, training these models is very demanding, both in computational resources and in the amount of training data. The deeper the model, the more parameters there are to learn, which makes models increasingly data-hungry to achieve good generalization. This begs the question: what is the cost of acquiring the data? Are datasets always labelled, and if not, what is the cost incurred in getting unlabeled datasets labelled?

Price for labelling 1000 data points. (Source: https://cloud.google.com/ai-platform/data-labeling/pricing). Similar rates can be found at https://aws.amazon.com/sagemaker/groundtruth/pricing/

Though the cost of labelling varies with the underlying task, it is clear that even for the simplest tasks the labelling cost can be staggering if one wants enough labelled points to train modern-day deep models. It is also important to note that these prices are for a single annotator, and for tasks that do not require a domain expert. Often, more annotators are needed to improve reliability.

Can something be done to reduce this staggering labelling cost when a labelled dataset is unavailable? Are all data points needed to achieve good performance?

It turns out that large datasets often contain a lot of redundancy. Therefore, if the data points are chosen carefully, a model can achieve good accuracy even with only a few of them. This is where active learning comes into play. Active learning allows machine learning algorithms to achieve greater accuracy with fewer training labels: the algorithm chooses the data from which it wants to learn and gets it labelled by an oracle (e.g., a human annotator). Active learning is useful when unlabeled data is abundant or easily obtained, but labels are difficult, time-consuming, or expensive to acquire.

2. Deep dIverSified inTeractIve Learning (DISTIL)

Active learning can be easily incorporated with the new DISTIL toolkit. DISTIL is a library, implemented in PyTorch, that provides fast and efficient implementations of many state-of-the-art active learning algorithms. DISTIL employs mini-batch adaptive active learning, which is more appropriate for deep neural networks: in each of n rounds, a strategy selects a mini-batch of k points to be labelled. Now let's look at the various strategies present in DISTIL.

3. Various Active Learning Strategies

1. Uncertainty Sampling

One way to reduce labelling cost is to identify the data points that the underlying model finds most difficult to classify and provide labels only for those. We score a data point as simple or complex based on the softmax output for that point. Suppose the model has ncl output nodes, each denoted Z_j with j ∈ [1, ncl]. Then for an output node Z_i, the corresponding softmax is

σ(Z_i) = exp(Z_i) / Σ_j exp(Z_j)

A. Least Confidence
Least confidence sampling uses the softmax to pick the k elements for which the model has the lowest confidence in its most likely label:

select the k points x ∈ U with the smallest max_j σ(Z_j(x)),

where U denotes the data without labels.

B. Margin Sampling
Margin sampling uses the softmax to pick the k elements with the smallest gap between the two most likely labels:

select the k points x ∈ U with the smallest σ(Z_(1)(x)) − σ(Z_(2)(x)),

where Z_(1) and Z_(2) denote the highest- and second-highest-scoring output nodes, and U denotes the data without labels.

C. Entropy
Entropy sampling uses the softmax to pick the k elements whose predictive distribution has the highest entropy:

select the k points x ∈ U with the largest −Σ_j σ(Z_j(x)) log σ(Z_j(x)),

where U denotes the data without labels.

Interestingly, both least confidence sampling and margin sampling tend to pick data points with pairwise confusion between the top labels, whereas entropy focuses on data points that are confused among most of the labels.
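The three scoring rules above can be sketched in a few lines of numpy. The logits below are hypothetical, chosen so that each rule prefers a different point; this is an illustrative sketch, not DISTIL's implementation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def least_confidence(probs):
    # Low probability of the most likely label => low confidence.
    return 1.0 - probs.max(axis=1)

def margin(probs):
    # Small gap between the top two labels => pairwise confusion.
    part = np.sort(probs, axis=1)
    return part[:, -1] - part[:, -2]

def entropy(probs):
    # High entropy => confusion spread over many labels.
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

# Hypothetical logits for 3 unlabeled points over 3 classes.
logits = np.array([[2.0, 1.99, -3.0],   # two labels nearly tied
                   [0.1, 0.0,  0.05],   # confusion among all labels
                   [5.0, -2.0, -2.0]])  # confident prediction
probs = softmax(logits)

k = 1
pick_lc = np.argsort(-least_confidence(probs))[:k]  # largest scores first
pick_margin = np.argsort(margin(probs))[:k]         # smallest margins first
pick_entropy = np.argsort(-entropy(probs))[:k]
```

Note how margin sampling picks the point with two nearly tied labels, while entropy (and least confidence) prefer the point that is confused among all three.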

2. Coreset

This technique tries to find data points that can represent the entire dataset. For this, it solves a k-Center Problem on the set of points represented by the embeddings obtained from the penultimate layer of the model. Embeddings from the penultimate layer can be thought of as the extracted features, so solving the k-Center Problem in this feature space helps us get representative points. The idea behind the Coreset strategy is that if those representative points are labelled, the model will have enough information. For example, given the union of red and blue points as input and a budget of 4, the Coreset strategy would select the blue points.

4 centres chosen. Source: Core-Set Paper
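The k-Center Problem is NP-hard, so the Core-Set paper uses the classic greedy 2-approximation: repeatedly pick the unlabeled point farthest from the current centers. A minimal sketch on toy 2-D embeddings (in practice the embeddings come from the penultimate layer):

```python
import numpy as np

def k_center_greedy(embeddings, labeled_idx, budget):
    """Greedy 2-approximation to the k-Center Problem: repeatedly
    select the point farthest from the current set of centers.
    Assumes at least one labeled point to seed the center set."""
    centers = list(labeled_idx)
    # distance of every point to its nearest current center
    dist = np.min(
        np.linalg.norm(embeddings[:, None] - embeddings[centers], axis=2),
        axis=1)
    selected = []
    for _ in range(budget):
        i = int(np.argmax(dist))         # farthest point becomes a center
        selected.append(i)
        dist = np.minimum(
            dist, np.linalg.norm(embeddings - embeddings[i], axis=1))
    return selected

# Two tight clusters plus an outlier; point 0 is already labeled.
emb = np.array([[0.0, 0.0], [0.1, 0.0],
                [5.0, 5.0], [5.1, 5.0],
                [10.0, 0.0]])
picked = k_center_greedy(emb, labeled_idx=[0], budget=2)
```

The greedy picks cover the regions farthest from the labeled point (the outlier first, then one of the distant cluster), rather than near-duplicates of what is already labeled.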

3. FASS

Filtered Active Submodular Selection (FASS) combines the uncertainty sampling idea with the Coreset idea to select the most representative points, using a submodular data subset selection framework.
First, a filtered subset F of size β (with β ≥ k) is selected based on uncertainty sampling.
Then, using one of the submodular functions ('facility location', 'graph cut', 'saturated coverage', 'sum redundancy', 'feature based'), a subset S of size k is selected from F.

Submodular functions are often used to get the most representative or diverse subsets.
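As an illustration of the second step, here is a sketch of greedy maximization of one of those functions, facility location, f(S) = Σ_i max_{j∈S} sim(i, j), over a hypothetical similarity matrix for the filtered set F. The greedy algorithm enjoys the standard (1 − 1/e) guarantee for monotone submodular functions; this is not DISTIL's own implementation.

```python
import numpy as np

def facility_location_greedy(sim, k):
    """Greedily maximize f(S) = sum_i max_{j in S} sim[i, j].
    sim is an (n, n) pairwise similarity matrix over candidates."""
    n = sim.shape[0]
    selected, best = [], np.zeros(n)  # best[i] = max similarity of i to S
    for _ in range(k):
        # marginal gain of adding each candidate j to S
        gains = np.maximum(sim, best[:, None]).sum(axis=0) - best.sum()
        gains[selected] = -np.inf      # never re-select
        j = int(np.argmax(gains))
        selected.append(j)
        best = np.maximum(best, sim[:, j])
    return selected

# Hypothetical similarities: two clusters {0, 1} and {2, 3}.
sim = np.array([[1.0, 0.9, 0.1, 0.1],
                [0.9, 1.0, 0.1, 0.1],
                [0.1, 0.1, 1.0, 0.8],
                [0.1, 0.1, 0.8, 1.0]])
S = facility_location_greedy(sim, k=2)
```

With budget 2, the greedy picks one representative from each cluster instead of two near-duplicates, which is exactly the diversity behaviour FASS relies on.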

4. BADGE

Batch Active learning by Diverse Gradient Embeddings (BADGE) samples groups of points that are disparate and of high magnitude when represented in a hallucinated gradient space, a strategy designed to incorporate both predictive uncertainty and sample diversity into every selected batch. This allows it to trade off between uncertainty and diversity without requiring any hand-tuned hyperparameters. At each round of selection, loss gradients are computed using the hypothesized (most likely) labels.
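A sketch of the two BADGE ingredients: the hallucinated last-layer gradient embedding (the outer product of the softmax error with the penultimate-layer features, using the argmax label), and k-means++-style sampling in that gradient space. The data below is random and illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def badge_embeddings(hidden, probs):
    """Hallucinated gradient of the cross-entropy loss w.r.t. the last
    layer: (p - one_hot(yhat)) outer-product penultimate features."""
    n, c = probs.shape
    yhat = probs.argmax(axis=1)          # hypothesized labels
    scale = probs.copy()
    scale[np.arange(n), yhat] -= 1.0     # p - one_hot(yhat)
    return (scale[:, :, None] * hidden[:, None, :]).reshape(n, -1)

def kmeans_pp_seed(g, k):
    """k-means++ seeding in gradient space: sampling probability is
    proportional to squared distance from already-chosen points, so the
    batch is simultaneously diverse and of high gradient magnitude."""
    idx = [int(np.argmax(np.linalg.norm(g, axis=1)))]  # largest gradient
    for _ in range(k - 1):
        d2 = np.min(((g[:, None] - g[idx]) ** 2).sum(-1), axis=1)
        idx.append(int(rng.choice(len(g), p=d2 / d2.sum())))
    return idx

hidden = rng.normal(size=(10, 4))        # toy penultimate features
logits = rng.normal(size=(10, 3))
probs = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
g = badge_embeddings(hidden, probs)      # shape (10, 3 * 4)
batch = kmeans_pp_seed(g, k=3)
```

Points the model is confident about have softmax error near zero, hence small gradient embeddings, so they are both unlikely seeds and unlikely samples.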

5. GLISTER-ACTIVE

Glister-Active performs data selection jointly with parameter learning by solving a bi-level optimization problem:

Inner level optimization: This is very similar to the problem encountered while training a model, except that the data points used come from a subset S. It maximizes the training log-likelihood LL_T on the given subset.
Outer level optimization: This is also a log-likelihood maximization problem. The objective here is to select the subset S that maximizes the log-likelihood of the validation set, given the model parameters obtained from the inner level.

This bi-level optimization is often expensive or impractical to solve for general loss functions, especially when the inner optimization problem cannot be solved in closed form. Therefore, instead of solving the inner optimization problem completely, a one-step gradient approximation is made,

θ(S) ≈ θ_t + η ∇_θ LL_T(θ_t, S)

while solving the outer optimization.
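The one-step idea can be illustrated on a toy logistic-regression model: score each candidate (with its hypothesized label) by the validation log-likelihood after a single gradient step on that candidate. This is a simplified sketch of the principle, not GLISTER's actual greedy implementation.

```python
import numpy as np

def glister_one_step_scores(theta, X_u, y_u, X_val, y_val, eta=0.1):
    """One-step approximation: instead of retraining on each candidate
    subset, take a single gradient step on the candidate and evaluate
    the validation log-likelihood at the updated parameters."""
    def val_loglik(t):
        p = 1.0 / (1.0 + np.exp(-(X_val @ t)))
        return np.sum(y_val * np.log(p + 1e-12)
                      + (1 - y_val) * np.log(1 - p + 1e-12))
    scores = []
    for x, y in zip(X_u, y_u):
        p = 1.0 / (1.0 + np.exp(-(x @ theta)))
        grad = (y - p) * x               # gradient of the log-likelihood
        scores.append(val_loglik(theta + eta * grad))
    return np.array(scores)

theta = np.zeros(1)                      # toy 1-D logistic model
X_u = np.array([[1.0], [1.0]])           # two candidates, same features
y_u = np.array([1, 0])                   # but opposite hypothesized labels
X_val = np.array([[1.0], [2.0]])         # validation set is all positive
y_val = np.array([1, 1])
scores = glister_one_step_scores(theta, X_u, y_u, X_val, y_val)
```

The candidate whose one-step update moves the parameters toward the validation data scores higher, so the greedy outer loop would pick it first.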

6. Adversarial Techniques

These techniques are motivated by the fact that computing the distance to the decision boundary is often difficult or intractable for margin-based methods. Adversarial techniques such as DeepFool and BIM (Basic Iterative Method) have been tried in the active learning setting to estimate how much adversarial perturbation is required to cross the boundary. The smaller the required perturbation, the closer the point is to the boundary.

Choosing samples in adversarial Setting. Paper: Adversarial active learning for deep networks: a margin based approach
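A BIM-flavoured sketch of the idea on a linear binary classifier, where the gradient of the margin with respect to the input is simply the weight vector: take small signed-gradient steps until the predicted label flips, and use the accumulated perturbation norm as a proxy for the distance to the boundary. Real methods apply the same iterative probe to a deep network's loss gradient.

```python
import numpy as np

def perturbation_to_flip(x, w, b, eps=0.05, max_steps=200):
    """Iteratively perturb x against the classifier sign(w.x + b) until
    its label flips; the total perturbation norm approximates the
    distance to the decision boundary (BIM-style probe)."""
    x_adv = x.copy()
    orig = np.sign(w @ x + b)
    for _ in range(max_steps):
        if np.sign(w @ x_adv + b) != orig:
            break                        # crossed the boundary
        # for a linear model, the input gradient of the margin is w
        x_adv -= eps * orig * np.sign(w)
    return np.linalg.norm(x_adv - x)

w = np.array([1.0, 1.0])
b = 0.0
d_near = perturbation_to_flip(np.array([0.1, 0.1]), w, b)  # near boundary
d_far = perturbation_to_flip(np.array([2.0, 2.0]), w, b)   # far from it
```

The near-boundary point needs a much smaller perturbation to flip, so an adversarial active learner would prioritize it for labelling.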

7. BALD

Bayesian Active Learning by Disagreement (BALD) assumes a Bayesian setting, so the parameters are probability distributions. This allows the model to quantify its beliefs: a wide distribution for a parameter means that the model is uncertain about its true value, whereas a narrow one indicates high certainty. BALD scores a data point x based on how well the model's prediction y informs us about the model parameters θ. For this, it uses the mutual information

I(y; θ | x, D)

which can be re-written as

I(y; θ | x, D) = H(y | x, D) − E_{θ∼p(θ|D)}[H(y | x, θ)]

Looking at the two terms on the right, for the mutual information to be high, the left term has to be high and the right term low. The left term is the entropy of the model prediction, which is high when the prediction is uncertain. The right term is the expected entropy of the prediction over the posterior of the model parameters, and is low when the model is certain for each individual draw of parameters from the posterior. Both can only happen when the model has many possible ways to explain the data, which means the posterior draws disagree among themselves. Therefore, in each round, the k points with the highest mutual information are selected.

The intuition behind BALD. Areas in grey contribute to the BALD score. Paper: Efficient and diverse batch acquisition for deep Bayesian active learning
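In practice the posterior expectation is approximated with Monte Carlo samples, e.g. multiple stochastic forward passes with dropout enabled. A sketch, assuming the per-pass predictive probabilities are already computed:

```python
import numpy as np

def bald_scores(mc_probs):
    """BALD score from Monte Carlo samples of the predictive
    distribution: predictive entropy minus expected entropy.
    mc_probs has shape (T, N, C): T stochastic passes, N points,
    C classes."""
    def H(p):
        return -(p * np.log(p + 1e-12)).sum(axis=-1)
    mean_p = mc_probs.mean(axis=0)       # marginal prediction per point
    return H(mean_p) - H(mc_probs).mean(axis=0)

# Two points, two dropout passes:
#   point 0: the passes disagree strongly (posterior disagreement)
#   point 1: every pass is uniformly uncertain, but they all agree
mc = np.array([[[0.9, 0.1], [0.5, 0.5]],
               [[0.1, 0.9], [0.5, 0.5]]])
scores = bald_scores(mc)
```

Only the first point gets a high BALD score: its marginal prediction is uncertain *and* the individual draws are confident but contradictory. The second point is uncertain in every draw, so the two entropy terms cancel and its score is zero, matching the disagreement intuition above.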

4. Video Explanation

5. Resources

More about Active Learning & DISTIL:

YouTube Playlist:

Author:

Durga Subramanian

DECILE Research Group
