
ADDI Alzheimers Detection Challenge

Dealing with Class Imbalance

Looking at different ways to address class imbalance in this dataset

Johnowhitaker

In this notebook I take a close look at some different ways we can address the difference in class balance between the train and validation sets (and presumably the test set).

We look at changing sample weights, over- and under-sampling, SMOTE and some other tips and tricks.

I hope you find it helpful :)

Introduction

So you've made your first model for this challenge and it's getting a log loss of ~0.9 - a ways behind the leaders on 0.6X. You're starting to think about feature engineering, adding more models to your ensemble, maybe trying one of those tabular deep learning models the cool kids are talking about. STOP! Before any of that, there is one BIG issue we need to deal with: class (im)balance.

[Figure: bar charts of the class distribution (%) in the training set vs the validation set - see the Appendix for the plotting code.]

The validation set (and presumably the test set) has a different class distribution to the training data. In this notebook we will look at many different ways we can correct for this class imbalance - picking one of these will boost your score tremendously (we're talking ~0.66 with a single simple random forest model). So, let's dive in.

Setup

Importing the libraries we'll be using, loading the data and getting ready to run our experiments.

In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from catboost import CatBoostClassifier, Pool
from sklearn.metrics import f1_score, log_loss
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
In [2]:
#The training data
df = pd.read_csv('ds_shared_drive/train.csv')
print(df.shape)
df.head(2)
(32777, 122)
Out[2]:
row_id number_of_digits missing_digit_1 missing_digit_2 missing_digit_3 missing_digit_4 missing_digit_5 missing_digit_6 missing_digit_7 missing_digit_8 ... bottom_area_perc left_area_perc right_area_perc hor_count vert_count eleven_ten_error other_error time_diff centre_dot_detect diagnosis
0 S0CIXBKIUEOUBNURP 12.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.52617 0.524975 0.474667 0 0 0 1 -105.0 0.0 normal
1 IW1Z4Z3H720OPW8LL 12.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.00081 0.516212 0.483330 0 1 0 1 NaN NaN normal

2 rows × 122 columns

In [3]:
# The validation data (we merge in the labels for convenience)
val = pd.read_csv('ds_shared_drive/validation.csv')
val = pd.merge(val, pd.read_csv('ds_shared_drive/validation_ground_truth.csv'), 
               how='left', on='row_id')
print(val.shape)
val.head()
(362, 122)
Out[3]:
row_id number_of_digits missing_digit_1 missing_digit_2 missing_digit_3 missing_digit_4 missing_digit_5 missing_digit_6 missing_digit_7 missing_digit_8 ... bottom_area_perc left_area_perc right_area_perc hor_count vert_count eleven_ten_error other_error time_diff centre_dot_detect diagnosis
0 LA9JQ1JZMJ9D2MBZV 11.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.499368 0.553194 0.446447 0 0 0 1 NaN NaN post_alzheimer
1 PSSRCWAPTAG72A1NT 6.0 1.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 ... 0.427196 0.496352 0.503273 0 1 0 1 NaN NaN normal
2 GCTODIZJB42VCBZRZ 11.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.505583 0.503047 0.496615 1 0 0 0 0.0 0.0 normal
3 7YMVQGV1CDB1WZFNE 3.0 1.0 0.0 1.0 0.0 1.0 1.0 1.0 1.0 ... 0.444633 0.580023 0.419575 0 1 0 1 NaN NaN post_alzheimer
4 PHEQC6DV3LTFJYIJU 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 ... 0.395976 0.494990 0.504604 0 0 0 1 150.0 0.0 normal

5 rows × 122 columns

In [4]:
# We'll keep track of how different approaches perform
results = []
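
Before trying any fixes, it helps to quantify the imbalance we're up against. A minimal sketch (assuming df and val are loaded as above) that prints the two class distributions side by side:

# Compare the class distribution (as percentages) of train vs validation
dist = pd.DataFrame({
    'train (%)': df['diagnosis'].value_counts(normalize=True) * 100,
    'val (%)': val['diagnosis'].value_counts(normalize=True) * 100,
})
print(dist.round(1))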

Baseline #1 - Training on all data

This is a case where we don't do any correction for the class imbalance. Some models will do better than others - tree-based models like CatBoost will be less sensitive than some other model types, but they will still over-estimate the probability that a given sample falls into the majority class when making predictions on the validation set (since the 'normal' class is so much more common in the training data).

In [5]:
# Prep the data
X = df.drop(['row_id', 'diagnosis'], axis=1).fillna(0)
y = df['diagnosis']
X_val = val.drop(['row_id', 'diagnosis'], axis=1).fillna(0)
y_val = val['diagnosis']

# Train the model
model = CatBoostClassifier(verbose=False, cat_features=['intersection_pos_rel_centre'])

# Evaluate on val set
model.fit(X, y, eval_set = (X_val, y_val), early_stopping_rounds = 30)

# Store results
r = {'Approach':'No modifications', 
     'Log Loss':log_loss(y_val, model.predict_proba(X_val)),
     'F1':f1_score(y_val, model.predict(X_val), average='macro') 
    }
results.append(r)

print(r) # Show results
{'Approach': 'No modifications', 'Log Loss': 0.6745549053231596, 'F1': 0.2848101265822785}

A log loss of 0.67 on the validation set isn't terrible. We are using the validation set for early stopping - without that in place we get a log loss of 0.78 on our validation set and 0.8X on the leaderboard. So in a way, by using the validation set for early stopping we are already starting to combat our class balance problem... but we can do much better!
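
For reference, the no-early-stopping variant mentioned above is just the same fit without the eval_set - a quick sketch (the exact 0.78 figure will vary a little from run to run):

# Same baseline, but without using the validation set for early stopping
model_no_es = CatBoostClassifier(verbose=False, cat_features=['intersection_pos_rel_centre'])
model_no_es.fit(X, y)
print(log_loss(y_val, model_no_es.predict_proba(X_val)))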

Adjusting Sample Weights

Models like CatBoost allow us to assign more weight to specific samples. In this case, we use this to place less weight on samples in the over-represented classes, combating the bias introduced by the imbalance:

In [6]:
# Prep the data
X = df.drop(['row_id', 'diagnosis'], axis=1).fillna(0)
y = df['diagnosis']
X_val = val.drop(['row_id', 'diagnosis'], axis=1).fillna(0)
y_val = val['diagnosis']

#Our class weights
weights = {
    'normal':9/74, # Chosen based on some quick mental maths comparing the distribution of train vs val
    'post_alzheimer':0.85,
    'pre_alzheimer':1
}

# Applying these weights as sample weights by using Pool to wrap our training data
train_data = Pool(
    data = X,
    label = y,
    weight = y.map(weights), # << The important bit
    cat_features = [list(X.columns).index('intersection_pos_rel_centre')]
)

eval_data = Pool(
    data = X_val,
    label = y_val,
    weight = y_val.map(lambda x: 1.0), # all validation samples get a weight of 1
    cat_features = [list(X.columns).index('intersection_pos_rel_centre')]
)

# Train the model
model = CatBoostClassifier(verbose=False)

# Evaluate on val set
model.fit(train_data, eval_set = eval_data, early_stopping_rounds = 30)

# Store results
r = {'Approach':'Modifying Sample Weights', 
     'Log Loss':log_loss(y_val, model.predict_proba(X_val)),
     'F1':f1_score(y_val, model.predict(X_val), average='macro') 
    }
results.append(r)

print(r) # Show results
{'Approach': 'Modifying Sample Weights', 'Log Loss': 0.5593949556085255, 'F1': 0.4727603953181218}

As you can see, we're now getting a log loss of 0.56 on our validation set - a significant improvement! This approach is especially appealing since we don't throw out any data. However, not every model supports sample weights, and the weights above were picked somewhat by hand - see the short aside below before we move on to over- and under-sampling.
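
A rough sketch of two related ideas (hypothetical, not a reproduction of the hand-picked weights above): deriving weights from the train/val label distributions, and passing them to a scikit-learn model via class_weight for estimators that support it:

# Derive per-class weights from how much more (or less) common each class is in val vs train
train_freq = df['diagnosis'].value_counts(normalize=True)
val_freq = val['diagnosis'].value_counts(normalize=True)
derived_weights = (val_freq / train_freq).to_dict()
print(derived_weights)

# Many scikit-learn estimators accept a {class: weight} dict directly; here with the
# random forest we imported earlier (dropping the one string-valued categorical column)
rf = RandomForestClassifier(n_estimators=500, class_weight=derived_weights, n_jobs=-1, random_state=0)
rf.fit(X.drop('intersection_pos_rel_centre', axis=1), y)
rf_probs = rf.predict_proba(X_val.drop('intersection_pos_rel_centre', axis=1))
print(log_loss(y_val, rf_probs))

With that aside out of the way, let's look at over- and under-sampling techniques.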

SMOTE

Look for resources on training with imbalanced data and odds are high you will encounter the Synthetic Minority Oversampling Technique (SMOTE). This is a technique for synthesizing additional samples for the under-represented classes. You can check out this link for more info.
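
If you haven't used it before, the core API is just fit_resample. A tiny self-contained sketch on toy data (nothing to do with this dataset) to show what it does to the class counts:

from collections import Counter
from sklearn.datasets import make_classification

# A deliberately imbalanced two-class toy problem: roughly 95% / 5%
X_toy, y_toy = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
print('before:', Counter(y_toy))

# By default SMOTE synthesizes minority samples until the classes are balanced
X_res, y_res = SMOTE(random_state=0).fit_resample(X_toy, y_toy)
print('after: ', Counter(y_res))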

Applying it is fairly simple. In this case, we run an experiment using SMOTE (from the imblearn library) to generate larger and larger numbers of synthetic examples, controlled by the factor parameter.

In [7]:
for factor in [2, 4, 6, 10, 15, 20, 30, 40]:
    # Prep the data
    X = df.drop(['row_id', 'diagnosis', 'intersection_pos_rel_centre'], axis=1).fillna(0)
    y = df['diagnosis']
    X_val = val.drop(['row_id', 'diagnosis', 'intersection_pos_rel_centre'], axis=1).fillna(0)
    y_val = val['diagnosis']

    # SMOTE on the training data
    oversample = SMOTE(sampling_strategy={
        'normal':31208,
        'post_alzheimer':1149*factor,
        'pre_alzheimer':420*factor
    })
    X, y = oversample.fit_resample(X, y)

    # Train the model
    model = CatBoostClassifier(verbose=False)

    # Evaluate on val set
    model.fit(X, y, eval_set = (X_val, y_val), early_stopping_rounds = 30)

    # Store results
    r = {'Approach':f'SMOTE (factor of {factor})', 
         'Log Loss':log_loss(y_val, model.predict_proba(X_val)),
         'F1':f1_score(y_val, model.predict(X_val), average='macro') 
        }
    results.append(r)

    print(r) # Show results
{'Approach': 'SMOTE (factor of 2)', 'Log Loss': 0.6521451124388101, 'F1': 0.29584350028090123}
{'Approach': 'SMOTE (factor of 4)', 'Log Loss': 0.6371260967092448, 'F1': 0.3169377522318699}
{'Approach': 'SMOTE (factor of 6)', 'Log Loss': 0.6272541505942832, 'F1': 0.3445360195360196}
{'Approach': 'SMOTE (factor of 10)', 'Log Loss': 0.6178523829621976, 'F1': 0.346937773193851}
{'Approach': 'SMOTE (factor of 15)', 'Log Loss': 0.6120448842672219, 'F1': 0.3520166536559979}
{'Approach': 'SMOTE (factor of 20)', 'Log Loss': 0.6000621132749585, 'F1': 0.3755389762582076}
/home/jonathan/anaconda3/lib/python3.7/site-packages/imblearn/utils/_validation.py:326: UserWarning: After over-sampling, the number of samples (34470) in class post_alzheimer will be larger than the number of samples in the majority class (class #normal -> 31208)
  n_samples_majority,
{'Approach': 'SMOTE (factor of 30)', 'Log Loss': 0.6031935307693084, 'F1': 0.40465015210777927}
/home/jonathan/anaconda3/lib/python3.7/site-packages/imblearn/utils/_validation.py:326: UserWarning: After over-sampling, the number of samples (45960) in class post_alzheimer will be larger than the number of samples in the majority class (class #normal -> 31208)
  n_samples_majority,
{'Approach': 'SMOTE (factor of 40)', 'Log Loss': 0.606239826680783, 'F1': 0.41754719279015956}

At some point we start getting warnings that we've generated so many synthetic examples that one of our minority classes now outnumbers the original majority class. Up to a point, generating more synthetic samples does increase performance, with our best log loss being ~0.60 - an improvement over our baseline but not as good as the previous approach.

SMOTE isn't often used in isolation - instead, we typically see it in conjunction with some undersampling. Let's try that next:

Oversampling + Undersampling

imblearn's Pipeline makes it easy to chain together multiple stages. First, we use SMOTE to synthetically 'oversample' our smaller classes, then we use the RandomUnderSampler to undersample the 'normal' class while keeping all the synthetic samples of the smaller classes. You can tweak the amounts - in this example we create 5x the examples from the smaller classes and cut the number of 'normal' samples down to 25%.

In [8]:
# Prep the data
X = df.drop(['row_id', 'diagnosis', 'intersection_pos_rel_centre'], axis=1).fillna(0)
y = df['diagnosis']
X_val = val.drop(['row_id', 'diagnosis', 'intersection_pos_rel_centre'], axis=1).fillna(0)
y_val = val['diagnosis']

# SMOTE on the training data
over = SMOTE(sampling_strategy={
    'normal':31208,
    'post_alzheimer':1149*5,
    'pre_alzheimer':420*5
})
under = RandomUnderSampler(sampling_strategy={
    'normal':int(31208/4), # keeping 25%
    'post_alzheimer':1149*5, # Keeping all of the samples we generated in the previous step
    'pre_alzheimer':420*5
})
steps = [('o', over), ('u', under)]
pipeline = Pipeline(steps=steps)
X, y = pipeline.fit_resample(X, y)

# Train the model
model = CatBoostClassifier(verbose=False)

# Evaluate on val set
model.fit(X, y, eval_set = (X_val, y_val), early_stopping_rounds = 30)

# Store results
r = {'Approach':f'Over (10) + under (3)', 
     'Log Loss':log_loss(y_val, model.predict_proba(X_val)),
     'F1':f1_score(y_val, model.predict(X_val), average='macro') 
    }
results.append(r)

print(r) # Show results
{'Approach': 'Over (10) + under (3)', 'Log Loss': 0.5799389927231922, 'F1': 0.4556547491919753}

Pretty good - and you could improve this further with a little extra tweaking.

Now on to the strategy I've actually been using in my submissions:

An Unreasonably Good Strategy...

After all of the above, we come back to the approach I first took when I noticed there was a class imbalance and that it differed between the train set and the val set... just throwing out most of the samples for the majority class. The code is simple - we just take a small (~17% in this case) sample of the rows diagnosed as 'normal'. It turns out this is one of the best strategies, outperforming all the attempts at oversampling and even edging out our modified-sample-weights run on log loss (though with a slightly lower F1). A benefit of this approach is that it works even when your model doesn't support sample weights - for example, I have used this with a simple Random Forest Classifier with great results.

In [9]:
df_us = pd.concat([
    df.loc[df.diagnosis == 'pre_alzheimer'],
    df.loc[df.diagnosis == 'post_alzheimer'],
    df.loc[df.diagnosis == 'normal'].sample(frac=1/6),
]).reset_index().drop('index', axis=1)

X = df_us.drop(['row_id', 'diagnosis'], axis=1).fillna(0)
y = df_us['diagnosis']
X_val = val.drop(['row_id', 'diagnosis'], axis=1).fillna(0)
y_val = val['diagnosis']

# Train the model
model = CatBoostClassifier(verbose=False, cat_features=['intersection_pos_rel_centre'])

# Evaluate on val set
model.fit(X, y, eval_set = (X_val, y_val), early_stopping_rounds = 30)

# Store results
r = {'Approach':'Just throw away some of the majority class samples', 
     'Log Loss':log_loss(y_val, model.predict_proba(X_val)),
     'F1':f1_score(y_val, model.predict(X_val), average='macro') 
    }
results.append(r)

print(r) # Show results
{'Approach': 'Just throw away some of the majority class samples', 'Log Loss': 0.5569981832627415, 'F1': 0.4683894696089818}

It's a little annoying how well this does - despite lots of tweaking I have yet to find anything that consistently does much better! Good news for all of you reading this too - simply copy the first few lines to transform your training dataset, then carry on with whatever models you were previously using :)
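
Since the random forest mentioned in the introduction never actually appears in this notebook, here is a rough sketch of the same trick with the RandomForestClassifier we imported at the top (the hyperparameters are just placeholders, and the categorical column is dropped since the model can't consume it directly):

# Same undersampled training frame, but with a plain scikit-learn random forest
X_rf = df_us.drop(['row_id', 'diagnosis', 'intersection_pos_rel_centre'], axis=1).fillna(0)
y_rf = df_us['diagnosis']
X_val_rf = val.drop(['row_id', 'diagnosis', 'intersection_pos_rel_centre'], axis=1).fillna(0)

rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
rf.fit(X_rf, y_rf)
print(log_loss(y_val, rf.predict_proba(X_val_rf)))
print(f1_score(y_val, rf.predict(X_val_rf), average='macro'))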

View our results

Taking a look at the final table of results, we can see that there are several approaches that do well, but ultimately ANYTHING that addresses the class imbalance problem will give a big boost over naively training on all of the data.

In [10]:
pd.DataFrame(results).sort_values(by='Log Loss')
Out[10]:
Approach Log Loss F1
11 Just throw away some of the majority class sam... 0.556998 0.468389
1 Modifying Sample Weights 0.559395 0.472760
10 Over (10) + under (3) 0.579939 0.455655
7 SMOTE (factor of 20) 0.600062 0.375539
8 SMOTE (factor of 30) 0.603194 0.404650
9 SMOTE (factor of 40) 0.606240 0.417547
6 SMOTE (factor of 15) 0.612045 0.352017
5 SMOTE (factor of 10) 0.617852 0.346938
4 SMOTE (factor of 6) 0.627254 0.344536
3 SMOTE (factor of 4) 0.637126 0.316938
2 SMOTE (factor of 2) 0.652145 0.295844
0 No modifications 0.674555 0.284810

Conclusions

I hope you've found this interesting. If you use this and get better results, please let me know! And if you have corrections, suggestions or other techniques that have worked for you, please share them in the discussions and let's all learn together.

Good luck!

Jonathan (@johnowhitaker)

Appendix

Producing the plot of class distribution:

In [11]:
from matplotlib import pyplot as plt
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
(df['diagnosis'].value_counts()*100/len(df)).plot(kind='bar', ax=ax1, title='Class Distribution (Train)')
(val['diagnosis'].value_counts()*100/len(val)).plot(kind='bar', ax=ax2,  title='Class Distribution (Val)')
plt.tight_layout()
plt.savefig('Class balance.png')

Comments

Johnowhitaker
Over 3 years ago

I've just noticed that I can comment here rather than as a discussion thread. Two points:
- You'll need to modify the paths to the data based on where your notebook is located, probably something like os.getenv("DATASET_PATH", "/ds_shared_drive/train.csv")
- One technique not included here is modifying the predicted probabilities post-hoc. I've seen this occasionally work to give a sliiight edge, but I don't like it because 1) it's fairly arbitrary, 2) you end up with probabilities that don't sum to 1, and 3) it's one of those tricks that might give a small improvement on the LB but adds no value to the solution itself.
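
For reference, a minimal sketch of that kind of post-hoc adjustment (the scaling factors here are purely hypothetical, and renormalizing keeps each row summing to 1):

# Hypothetical per-class scaling factors, keyed by class name to avoid ordering mistakes
scale = {'normal': 0.5, 'post_alzheimer': 1.5, 'pre_alzheimer': 1.5}
factors = np.array([scale[c] for c in model.classes_])

# Scale the predicted probabilities and renormalize each row to sum to 1
probs = model.predict_proba(X_val) * factors
probs = probs / probs.sum(axis=1, keepdims=True)
print(log_loss(y_val, probs))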

nlevin
Over 3 years ago

Hey John!

Just wanted to say thanks for this great notebook. I really appreciate the way you lay out each approach to balancing the classes and collect the results in a neat table at the end.

Not only did I learn some things about balancing classes, I also feel like I picked up a few tips for organizing my notebook and for cleaner, more efficient testing. Kudos!

Any chance you might make a similar notebook detailing some strategies for filling in missing values? ;)

stefanos_fafalios
Over 3 years ago

But are we allowed to use the validation dataset for training?

Will our submissions be evaluated on a third dataset after the competition finishes?

