ADDI Alzheimers Detection Challenge
Dealing with Class Imbalance
Looking at different ways to address class imbalance in this dataset
In this notebook I take a close look at some different ways we can address the difference in class balance between the train and validation sets (and presumably the test set).
We look at changing sample weights, over- and under-sampling, SMOTE, and some other tips and tricks.
I hope you find it helpful :)
Introduction¶
So you've made your first model for this challenge and it's getting a log loss of ~0.9 - some way behind the leaders at 0.6X. You're starting to think about feature engineering, adding more models to your ensemble, maybe trying one of those tabular deep learning models the cool kids are talking about. STOP! Before any of that, there is one BIG issue we need to deal with: class (im)balance.
The validation set (and presumably the test set) has a different class distribution from the training data. In this notebook we will look at many different ways we can correct for this class imbalance - picking one of these will boost your score tremendously (we're talking ~0.66 with a single simple random forest model). So, let's dive in.
Setup¶
Importing the libraries we'll be using, loading the data and getting ready to run our experiments.
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from catboost import CatBoostClassifier, Pool
from sklearn.metrics import f1_score, log_loss
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
# The training data
df = pd.read_csv('ds_shared_drive/train.csv')
print(df.shape)
df.head(2)
# The validation data (we merge in the labels for convenience)
val = pd.read_csv('ds_shared_drive/validation.csv')
val = pd.merge(val, pd.read_csv('ds_shared_drive/validation_ground_truth.csv'),
how='left', on='row_id')
print(val.shape)
val.head()
# We'll keep track of how different approaches perform
results = []
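Before running any experiments, it's worth eyeballing the mismatch we're trying to correct. This quick check isn't part of the experiments below - just a sanity check on the class proportions in each set:
# Compare class proportions (as fractions) in train vs validation
print(df['diagnosis'].value_counts(normalize=True).round(3))
print(val['diagnosis'].value_counts(normalize=True).round(3))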
Baseline #1 - Training on all data¶
This is a case where we don't do any correction for the class imbalance. Some models will do better than others - tree-based models like CatBoost will be less sensitive than some other model types, but they will still over-estimate the probability that a given sample falls into the majority class when making predictions on the validation set (since the 'normal' class is so much more common in the training data).
# Prep the data
X = df.drop(['row_id', 'diagnosis'], axis=1).fillna(0)
y = df['diagnosis']
X_val = val.drop(['row_id', 'diagnosis'], axis=1).fillna(0)
y_val = val['diagnosis']
# Train the model
model = CatBoostClassifier(verbose=False, cat_features=['intersection_pos_rel_centre'])
# Evaluate on val set
model.fit(X, y, eval_set = (X_val, y_val), early_stopping_rounds = 30)
# Store results
r = {'Approach': 'No modifications',
     'Log Loss': log_loss(y_val, model.predict_proba(X_val)),
     'F1': f1_score(y_val, model.predict(X_val), average='macro')}
results.append(r)
print(r) # Show results
A log loss of 0.67 on the validation set isn't terrible. We are using the validation set for early stopping - without that in place we get a log loss of 0.78 on our validation set and 0.8X on the leaderboard. So in a way, by using the validation set for early stopping we are already starting to combat our class balance problem... but we can do much better!
Adjusting Sample Weights¶
Models like CatBoost allow us to assign more weight to specific samples. In this case, we use this to place less weight on samples in the over-represented classes, combating the bias introduced by the imbalance:
# Prep the data
X = df.drop(['row_id', 'diagnosis'], axis=1).fillna(0)
y = df['diagnosis']
X_val = val.drop(['row_id', 'diagnosis'], axis=1).fillna(0)
y_val = val['diagnosis']
# Our class weights
weights = {
'normal':9/74, # Chosen based on some quick mental maths comparing the distribution of train vs val
'post_alzheimer':0.85,
'pre_alzheimer':1
}
# Applying these weights as sample weights by using Pool to wrap our training data
train_data = Pool(
data = X,
label = y,
weight = y.map(weights), # << The important bit
cat_features = [list(X.columns).index('intersection_pos_rel_centre')]
)
eval_data = Pool(
data = X_val,
label = y_val,
weight = y_val.map(lambda x: 1.0), # all validation samples get a weight of 1
cat_features = [list(X.columns).index('intersection_pos_rel_centre')]
)
# Train the model
model = CatBoostClassifier(verbose=False)
# Evaluate on val set
model.fit(train_data, eval_set = eval_data, early_stopping_rounds = 30)
# Store results
r = {'Approach': 'Modifying Sample Weights',
     'Log Loss': log_loss(y_val, model.predict_proba(X_val)),
     'F1': f1_score(y_val, model.predict(X_val), average='macro')}
results.append(r)
print(r) # Show results
As you can see, we're now getting a log loss of 0.56 on our validation set - a significant improvement! This approach is especially appealing since we don't throw out any data. However, not all models support sample weights. Next, let us investigate techniques for over- and under-sampling.
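Before we do, one quick aside: the weights above were picked by eyeballing the two class distributions. If you'd rather derive them, a rough sketch (my own illustration, not the exact numbers used above) is to weight each class by the ratio of its validation frequency to its training frequency:
# Sketch: derive per-class weights as validation frequency / training frequency.
# These won't match the hand-picked values above exactly, but capture the same idea.
train_freq = df['diagnosis'].value_counts(normalize=True)
val_freq = val['diagnosis'].value_counts(normalize=True)
derived_weights = (val_freq / train_freq).to_dict()
print(derived_weights)  # expect a small value for 'normal' relative to the rarer classes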
SMOTE¶
Look for resources on training with imbalanced data and odds are high you will encounter the Synthetic Minority Oversampling Technique (SMOTE). This is a technique for synthesizing additional samples for the under-represented classes. You can check out the imbalanced-learn documentation for more info.
Applying it is fairly simple. In this case, we run an experiment using SMOTE (from the imblearn library) to generate larger and larger numbers of synthetic examples, controlled by the factor variable in the loop below.
for factor in [2, 4, 6, 10, 15, 20, 30, 40]:
    # Prep the data
    X = df.drop(['row_id', 'diagnosis', 'intersection_pos_rel_centre'], axis=1).fillna(0)
    y = df['diagnosis']
    X_val = val.drop(['row_id', 'diagnosis', 'intersection_pos_rel_centre'], axis=1).fillna(0)
    y_val = val['diagnosis']

    # SMOTE on the training data
    oversample = SMOTE(sampling_strategy={
        'normal': 31208,
        'post_alzheimer': 1149*factor,
        'pre_alzheimer': 420*factor
    })
    X, y = oversample.fit_resample(X, y)

    # Train the model
    model = CatBoostClassifier(verbose=False)

    # Evaluate on val set
    model.fit(X, y, eval_set=(X_val, y_val), early_stopping_rounds=30)

    # Store results
    r = {'Approach': f'SMOTE (factor of {factor})',
         'Log Loss': log_loss(y_val, model.predict_proba(X_val)),
         'F1': f1_score(y_val, model.predict(X_val), average='macro')}
    results.append(r)
    print(r)  # Show results
At some point we start getting warnings that we've generated so many synthetic examples that one of our minority classes now outnumbers the original majority class. Up to a point, generating more synthetic samples does increase the performance, with our top score being ~0.59 - an improvement over our baseline but not as good as the previous approach.
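If you want to see exactly where those warnings come from, counting the resampled labels makes it obvious (a quick diagnostic, not part of the experiment itself):
from collections import Counter

# Inspect the class counts after the last oversample.fit_resample call.
# At a factor of 30+, post_alzheimer (1149 * factor) overtakes the 31,208 'normal' samples.
print(Counter(y))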
SMOTE isn't often used in isolation - instead, we typically see it in conjunction with some undersampling. Let's try that next:
Oversampling + Undersampling¶
imblearn's Pipeline makes it easy to chain together multiple stages. First, we use SMOTE to synthetically 'oversample' our smaller classes, then we use RandomUnderSampler to undersample the 'normal' class while keeping all the synthetic samples of the smaller classes. You can tweak the amounts - in this example we create 5x the examples from the smaller classes and cut the number of 'normal' samples down to 25%.
# Prep the data
X = df.drop(['row_id', 'diagnosis', 'intersection_pos_rel_centre'], axis=1).fillna(0)
y = df['diagnosis']
X_val = val.drop(['row_id', 'diagnosis', 'intersection_pos_rel_centre'], axis=1).fillna(0)
y_val = val['diagnosis']
# SMOTE on the training data
over = SMOTE(sampling_strategy={
'normal':31208,
'post_alzheimer':1149*5,
'pre_alzheimer':420*5
})
under = RandomUnderSampler(sampling_strategy={
'normal':int(31208/4), # keeping 25%
'post_alzheimer':1149*5, # Keeping all of the samples we generated in the previous step
'pre_alzheimer':420*5
})
steps = [('o', over), ('u', under)]
pipeline = Pipeline(steps=steps)
X, y = pipeline.fit_resample(X, y)
# Train the model
model = CatBoostClassifier(verbose=False)
# Evaluate on val set
model.fit(X, y, eval_set = (X_val, y_val), early_stopping_rounds = 30)
# Store results
r = {'Approach': 'SMOTE (x5) + undersample normal to 25%',
     'Log Loss': log_loss(y_val, model.predict_proba(X_val)),
     'F1': f1_score(y_val, model.predict(X_val), average='macro')}
results.append(r)
print(r) # Show results
Pretty good - and you could improve this further with a little extra tweaking.
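One way to do that tweaking is to sweep the oversampling factor and the fraction of 'normal' samples kept together, in the same style as the SMOTE loop earlier. This is just a sketch (with an arbitrary small grid, not something behind the scores above):
# Sketch: sweep a few oversampling / undersampling combinations together.
for over_factor, keep_frac in [(3, 0.5), (5, 0.25), (8, 0.2)]:
    X = df.drop(['row_id', 'diagnosis', 'intersection_pos_rel_centre'], axis=1).fillna(0)
    y = df['diagnosis']
    pipeline = Pipeline(steps=[
        ('o', SMOTE(sampling_strategy={'normal': 31208,
                                       'post_alzheimer': 1149*over_factor,
                                       'pre_alzheimer': 420*over_factor})),
        ('u', RandomUnderSampler(sampling_strategy={'normal': int(31208*keep_frac),
                                                    'post_alzheimer': 1149*over_factor,
                                                    'pre_alzheimer': 420*over_factor}))
    ])
    X, y = pipeline.fit_resample(X, y)
    model = CatBoostClassifier(verbose=False)
    model.fit(X, y, eval_set=(X_val, y_val), early_stopping_rounds=30)
    print(over_factor, keep_frac, log_loss(y_val, model.predict_proba(X_val)))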
Now on to the strategy I've actually been using in my submissions:
An Unreasonably Good Strategy...¶
After all of the above, we come back to the approach I first took when I noticed that the class balance differed between the train and validation sets: just throwing out most of the samples from the majority class. The code is simple - we take a small (~17% in this case) sample of the rows diagnosed as 'normal'. It turns out this is one of the best strategies, outperforming all the attempts at oversampling and nearly beating our entry with modified sample weights. A benefit of this approach is that it works even when your model doesn't support sample weights - for example, I have used this with a simple Random Forest Classifier with great results.
df_us = pd.concat([
    df.loc[df.diagnosis == 'pre_alzheimer'],
    df.loc[df.diagnosis == 'post_alzheimer'],
    df.loc[df.diagnosis == 'normal'].sample(frac=1/6),
]).reset_index(drop=True)
X = df_us.drop(['row_id', 'diagnosis'], axis=1).fillna(0)
y = df_us['diagnosis']
X_val = val.drop(['row_id', 'diagnosis'], axis=1).fillna(0)
y_val = val['diagnosis']
# Train the model
model = CatBoostClassifier(verbose=False, cat_features=['intersection_pos_rel_centre'])
# Evaluate on val set
model.fit(X, y, eval_set = (X_val, y_val), early_stopping_rounds = 30)
# Store results
r = {'Approach': 'Just throw away some of the majority class samples',
     'Log Loss': log_loss(y_val, model.predict_proba(X_val)),
     'F1': f1_score(y_val, model.predict(X_val), average='macro')}
results.append(r)
print(r) # Show results
It's a little annoying how well this does - despite lots of tweaking I have yet to find anything that consistently does much better! Good news for all of you reading this too - simply copy the first few lines above to transform your training dataset, then carry on with whatever models you were previously using :)
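For example, the undersampled frame drops straight into the scikit-learn models imported at the top. A rough sketch with RandomForestClassifier (illustrative hyperparameters, not the exact configuration behind the scores quoted above):
# Sketch: the same undersampled data with a plain RandomForestClassifier.
# sklearn trees can't take the categorical column directly, so drop it here.
X_rf = df_us.drop(['row_id', 'diagnosis', 'intersection_pos_rel_centre'], axis=1).fillna(0)
y_rf = df_us['diagnosis']
X_val_rf = val.drop(['row_id', 'diagnosis', 'intersection_pos_rel_centre'], axis=1).fillna(0)

rf = RandomForestClassifier(n_estimators=500, random_state=0)  # hyperparameters are illustrative
rf.fit(X_rf, y_rf)
print(log_loss(val['diagnosis'], rf.predict_proba(X_val_rf)))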
View our results¶
Taking a look at the final table of results, we can see that there are several approaches that do well, but ultimately ANYTHING that addresses the class imbalance problem will give a big boost over naively training on all of the data.
pd.DataFrame(results).sort_values(by='Log Loss')
Conclusions¶
I hope you've found this interesting. If you use this and get better results, please let me know! And if you have corrections, suggestions or other techniques that have worked for you, please share them in the discussions and let's all learn together.
Good luck!
Jonathan (@johnowhitaker)
Appendix¶
Producing the plot of class distribution:
from matplotlib import pyplot as plt
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
(df['diagnosis'].value_counts()*100/len(df)).plot(kind='bar', ax=ax1, title='Class Distribution (Train)')
(val['diagnosis'].value_counts()*100/len(val)).plot(kind='bar', ax=ax2, title='Class Distribution (Val)')
plt.tight_layout()
plt.savefig('Class balance.png')
Comments
I've just noticed that I can comment here rather than as a discussion thread. Two points:
- You'll need to modify the paths to the data based on where your notebook is located, probably something like os.getenv("DATASET_PATH", "/ds_shared_drive/train.csv")
- One technique not included here is modifying the predicted probabilities post-hoc. I've seen this occasionally give a slight edge, but I don't like it because 1) it's fairly arbitrary, 2) you end up with probabilities that don't sum to 1, which is questionable, and 3) it's one of those tricks that might give a small improvement on the LB but adds no value to the solution itself.
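For anyone wondering what that post-hoc adjustment looks like, here is a minimal sketch of the idea (my own illustration with arbitrary scaling factors, not code from the notebook above):
# Sketch: nudge the predicted probabilities of the rare classes upward after prediction.
# The factors are arbitrary and the rows no longer sum to 1 - hence the reservations above.
probs = pd.DataFrame(model.predict_proba(X_val), columns=model.classes_)
probs[['post_alzheimer', 'pre_alzheimer']] *= 1.2
print(probs.head())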
Hey John!
Just wanted to say thanks for this great notebook. I really appreciate the way you lay out each approach to balancing the classes and collect the results in a neat table at the end.
Not only did I learn some things about balancing classes, I also feel like I picked up a few tips for organizing my notebook and testing more cleanly and efficiently. Kudos!
Any chance you might make a similar notebook detailing some strategies for filling in missing values? ;)
But are we allowed to use the validation dataset for training?
Will our submissions be evaluated on a third dataset after the competition finishes?