Long Tailed Visual Recognition
As part of an undergrad course on PRML (Pattern Recognition and Machine Learning), I had the opportunity to present this topic to the class. The original presentation may be unavailable, as it is linked to a university account.
What are Long Tailed Distributions?
We have a long tail when the frequency for a large number of classes is low and the data is concentrated only in a few classes. Data in the tail classes is often insufficient to represent the true distribution. When a class is severely underrepresented, it becomes difficult to determine the decision boundary in our parameter/search space.

Data quality affects the performance of our learner, and because the data is dominated by a few classes, the learning of tail classes remains severely underdeveloped.

How to measure the long-tailedness of a distribution?
- Imbalance factor: ratio of the maximum number of samples in a class to the minimum number of samples in a class
- Standard deviation: difficult to express objectively, as it is relative
- Mean / median ratio: reflects skew
- Gini coefficient: a measure of inequality; 0 means perfect equality, 1 means one class has all the samples. It can be written as
  \[ G = \frac{\sum_{i}\sum_{j} |x_i - x_j|}{2 n^2 \mu} \]
  where the sums run over the differences between pairs i and j, n is the total number of samples and $\mu$ is the mean of the distribution.
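As a quick illustration, here is a minimal NumPy sketch of two of these measures, the imbalance factor and the Gini coefficient, applied to a made-up vector of per-class sample counts:

```python
import numpy as np

# Hypothetical per-class sample counts of a long tailed dataset
counts = np.array([5000, 2000, 800, 300, 90, 40, 15, 5], dtype=float)

# Imbalance factor: largest class size divided by smallest class size
imbalance_factor = counts.max() / counts.min()

# Gini coefficient: mean absolute difference over all pairs of values,
# normalised by 2 * n^2 * mu (0 = perfectly balanced, close to 1 = extreme)
n, mu = len(counts), counts.mean()
gini = np.abs(counts[:, None] - counts[None, :]).sum() / (2 * n**2 * mu)

print(f"imbalance factor = {imbalance_factor:.1f}, Gini = {gini:.3f}")
```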
Using conventional methods to learn such distributions often results in poor performance, as these methods assume that the data (both train and test) satisfy the i.i.d. (independent and identically distributed) condition.
Naive methods to treat Long Tailed Distributions
These methods are simple and data-processing based; more of them are explored in the paper "A Survey on Long-Tailed Visual Recognition" from the Beijing University of Posts and Telecommunications.
The aim of these methods is to eliminate or minimize the imbalance between head and tail classes.
- Over sampling: increase the instance number of the tail classes.
  Class-aware sampling ensures that the probability of occurrence of each class is the same in each training batch; the probability of each class is 1/C, where C is the number of classes in our entire set.
- Under sampling: decrease the instance number of the head classes.
  Random undersampling randomly removes instances of the head classes. For long tailed distributions, we lose a lot of information because of the large difference between the head and tail classes.
  Drawbacks:
  a. May lead to overfitting, as we reduce the overall size of our set
  b. Effects of noise and other defects are exaggerated
  c. If we remove too many instances, we risk not learning anything, due to under-learning of the head classes as well
- Data Augmentation
Generate and synthesize new samples from the tail classes.
Some common ways to augment image data are image flipping, scaling (zoom), rotation and cropping.
SMOTE (Synthetic Minority Over-sampling Technique) is a common method for data synthesis for the minority class.
- For each sample in the minority class, identify its k nearest neighbours within the same class, based on distance.
- Randomly choose a neighbour and interpolate a new data point using $x_{\text{new}} = x + \lambda\,(x_{\text{neighbour}} - x)$, where $\lambda$ is between 0 and 1.
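A minimal NumPy sketch of this interpolation step, with a brute-force neighbour search and a made-up toy feature matrix for the minority class:

```python
import numpy as np

def smote_sample(X_minority, k=5, seed=0):
    """Generate one synthetic sample per minority instance via SMOTE-style
    interpolation: x_new = x + lambda * (x_neighbour - x), lambda in [0, 1]."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for x in X_minority:
        # k nearest neighbours within the minority class (index 0 is x itself)
        dists = np.linalg.norm(X_minority - x, axis=1)
        neighbours = np.argsort(dists)[1:k + 1]
        x_nb = X_minority[rng.choice(neighbours)]   # pick a random neighbour
        lam = rng.uniform(0.0, 1.0)                 # interpolation factor
        synthetic.append(x + lam * (x_nb - x))
    return np.array(synthetic)

# Toy minority-class feature matrix: 10 samples with 4 features each
X_tail = np.random.default_rng(1).normal(size=(10, 4))
print(smote_sample(X_tail, k=3).shape)   # (10, 4)
```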
What are Contrastive Pairs?
Samples used to help the learner decide what is similar and what is different.
There can be positive samples, i.e. samples similar to the class we are talking about. The learner tries to pull these closer together in the feature space.
Negative samples are ones different from the current class, and the learner tries to push these apart in the feature space.
For Supervised Learning approaches,
- Positive pairs: Any two samples from the same class label
- Negative pairs: Any two samples from different class labels
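As a small illustration of how such pairs are usually formed within a training batch, here is a sketch that builds positive and negative masks from a made-up label vector:

```python
import numpy as np

labels = np.array([0, 2, 0, 1, 2, 2])        # toy batch of class labels

# positive_mask[i, j] is True when samples i and j share a label (and i != j)
same_label = labels[:, None] == labels[None, :]
not_self = ~np.eye(len(labels), dtype=bool)
positive_mask = same_label & not_self        # pairs pulled together
negative_mask = ~same_label                  # pairs pushed apart

print(positive_mask.astype(int))
print(negative_mask.astype(int))
```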
Logit Adjustment
Modify the log loss function to account for the frequency of each class as well.
In the usual log loss we treat all classes as equal, so the learner is biased towards reducing the loss mainly for the dominant classes. Factoring in the frequency of the classes helps the learner also focus on the rarer classes (present in the tail).
For class i, if the score at the output layer is z_i and the class prior (frequency) is $\pi_i$, the adjusted class probability is given by:
\[ p_i = \frac{e^{z_i + \log \pi_i}}{\sum_j e^{z_j + \log \pi_j}} \]
We normalize all attributes by dividing each value by the norm of the attribute.
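A minimal sketch of a logit-adjusted cross-entropy along these lines, assuming the adjustment is $\log \pi_i$ scaled by a temperature `tau`; the class counts and logits below are made up:

```python
import numpy as np

def logit_adjusted_ce(logits, label, class_freq, tau=1.0):
    """Cross-entropy where each logit z_i is shifted by tau * log(pi_i),
    pi_i being the empirical frequency (prior) of class i."""
    prior = class_freq / class_freq.sum()
    adjusted = logits + tau * np.log(prior)                 # z_i + tau * log(pi_i)
    log_probs = adjusted - np.log(np.exp(adjusted).sum())   # log-softmax
    return -log_probs[label]

class_freq = np.array([5000.0, 500.0, 20.0])   # head, mid and tail class counts
logits = np.array([2.0, 1.5, 1.0])             # scores z_i from the output layer
print(logit_adjusted_ce(logits, label=2, class_freq=class_freq))
```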
von Mises-Fisher (vMF) Distribution
The equivalent of a Gaussian distribution for data on a sphere/hypersphere, where all points lie on the surface of a unit sphere (length = 1).
Defined by two parameters:
- $\mu$: mean direction
- $\kappa$ (kappa): concentration parameter, i.e. how tightly clustered the data is
Imagine spreading butter on a sphere:
$\mu$ tells you the centre point of where you're spreading
$\kappa$ tells you how widely you spread it
Higher $\kappa$ = more concentrated (like cold butter)
Lower $\kappa$ = more spread out (like melted butter)
The density is
\[ f_p(x; \mu, \kappa) = C_p(\kappa)\, e^{\kappa \mu^{T} x}, \qquad C_p(\kappa) = \frac{\kappa^{p/2 - 1}}{(2\pi)^{p/2}\, I_{p/2 - 1}(\kappa)} \]
Notice the Bessel function $I_{p/2-1}$ term in $C_p(\kappa)$.
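To make the density concrete in code, here is a small sketch of the vMF log-density for unit-norm vectors, using `scipy.special.ive` (the exponentially scaled Bessel function) to keep the Bessel term numerically stable; the dimension and parameter values are arbitrary:

```python
import numpy as np
from scipy.special import ive   # exponentially scaled modified Bessel function I_v

def vmf_log_pdf(x, mu, kappa):
    """log f_p(x; mu, kappa) for unit vectors x and mu in R^p."""
    p = x.shape[-1]
    v = p / 2.0 - 1.0
    # iv(v, k) = ive(v, k) * exp(k), so log iv = log(ive) + k (stable for large k)
    log_bessel = np.log(ive(v, kappa)) + kappa
    log_cp = v * np.log(kappa) - (p / 2.0) * np.log(2 * np.pi) - log_bessel
    return log_cp + kappa * np.dot(mu, x)

mu = np.array([1.0, 0.0, 0.0])                     # mean direction on the sphere
x = np.array([0.8, 0.6, 0.0]); x /= np.linalg.norm(x)
print(vmf_log_pdf(x, mu, kappa=10.0))
```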
(Figure: visual analogy.)
ProCo: the Proposed Solution

Issues with Supervised Contrastive Learning
- Long tailed distributions make it even harder to obtain positive samples, as for rarer classes we might need very large batches to get an adequate number of samples.
- If the batch is too small, we might
  - not have examples with the same labels, and
  - miss out on important negative examples
- Having a large batch causes
  - longer training time
  - higher memory usage
  - higher computational cost
Parameter Estimation
The feature distribution is modelled as a mixture of vMF distributions,
\[ p(z) = \sum_{y} \pi_y\, f_p(z; \mu_y, \kappa_y) \]
Here $f_p$ is the probability density function of a vMF and $\pi_y$ is the prior probability of class y.
We estimate the mean direction $\mu$ and concentration $\kappa$ parameters of the feature distribution for each class.
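As an illustration (not necessarily the exact estimator used in the ProCo paper), the mean direction and concentration of one class can be estimated from its unit-norm features with the common closed-form approximation for $\kappa$:

```python
import numpy as np

def estimate_vmf_params(Z):
    """Estimate (mu, kappa) of a vMF from unit-norm feature vectors Z (n x p),
    using the approximation kappa ~= R * (p - R^2) / (1 - R^2), where R is the
    norm of the mean feature vector."""
    _, p = Z.shape
    mean = Z.mean(axis=0)
    R = np.linalg.norm(mean)
    mu = mean / R                             # mean direction
    kappa = R * (p - R**2) / (1 - R**2)       # concentration
    return mu, kappa

# Toy unit-norm features clustered around one direction
rng = np.random.default_rng(0)
Z = rng.normal(loc=[3.0, 0.0, 0.0, 0.0], size=(200, 4))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)
print(estimate_vmf_params(Z))
```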
Loss Function
To compute higher-order Bessel functions, we can make use of the recurrence relation below; however, it is numerically unstable for small values of $\kappa$, the concentration parameter:
\[ I_{v+1}(\kappa) = \frac{2v}{\kappa}\, I_v(\kappa) - I_{v-1}(\kappa) \]
Two Branch Design
Imagine a tree, with the root feeding into two branches
- Classification branch: uses a linear classifier, optimized with the adjusted logit loss; learns to classify.
- Representation branch: uses a neural network, optimized with the ProCo loss, where $\alpha$ is the strength parameter; learns what the good features are.
The two branches handle rare cases without sacrificing performance on examples from common classes.
Final loss function: $ L = L_{\text{adjusted logit}} + \alpha\, L_{\text{ProCo}} $
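A rough structural sketch of this design (not the paper's actual implementation): one shared backbone feature feeds a linear classification head and a projection head, and the two losses are combined with weight $\alpha$; all shapes and loss values below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 16))        # shared backbone output for a toy batch

# Classification branch: linear classifier -> logits for the adjusted-logit loss
logits = features @ rng.normal(size=(16, 3))

# Representation branch: projection head -> unit-norm features for the ProCo loss
z = features @ rng.normal(size=(16, 8))
z /= np.linalg.norm(z, axis=1, keepdims=True)
print(logits.shape, z.shape)               # (8, 3) (8, 8)

# Final objective: L = L_adjusted_logit + alpha * L_ProCo
L_adjusted_logit, L_proco = 0.73, 1.21     # stand-in per-batch loss values
alpha = 0.5
print(L_adjusted_logit + alpha * L_proco)
```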
Some relevant details
- I got to know about Long Tailed Distributions from an introductory video of the Vision and AI Lab at IISc
- The ProCo paper is from Tsinghua University, which is the Chinese equivalent of IISc
- While reading through a literature review paper, I identified a small error in it and even emailed the authors about it.

References
- Probabilistic Contrastive Learning for Long-Tailed Visual Recognition by Tsinghua University