ADDI Alzheimers Detection Challenge
What about neural networks? (0.694 LogLoss, 0.497 F1)
Trying out neural networks, imputing NaN values with KNNs, and exploring the data!
An exploration of what neural networks can bring to the competition.
Training a Neural Network (and other stuff)
Table of contents:
- Quick EDA & Discussion
- Impute the NaNs!
- Neural Networks!
- Cross validation.
- Submit!
Before we begin, a quick tip when working with ADDI workspaces.
There's a clipboard button on the top right that allows you to copy things from local clipboard to workspace clipboard so you don't have to type everything manually.
Obvious, right? I discovered that too late. Maybe it will save someone's time!
First, let's categorize the problem:
- It's a 3-class classification problem
- with class imbalance
- lots of missing values
The best solutions will most likely be big ensembles with good cross-validation (don't overfit to the leaderboard!) and some smart feature engineering.
It's also always a good idea to check the top solutions from previous competitions with similar properties. Here are some:
We'll be using a library called deeptables for the neural network. It's made for fitting deep learning models on tabular data fast and with a lot of options. Let's try it!
Define preprocessing code 💻
The code that is common between the training and the prediction sections should be defined here. During evaluation, we completely skip the training section. Please make sure to add any common logic between the training and prediction sections here.
Import common packages
Please import packages that are common for training and prediction phases here.
!pip install -q -U aicrowd-cli numpy pandas scikit-learn deeptables
%load_ext aicrowd.magic
import os
# Please use the absolute path for the location of the dataset.
# Or you can use a relative path with `os.path.join(os.getcwd(), "test_data/validation.csv")`
AICROWD_DATASET_PATH = os.getenv("DATASET_PATH", "/ds_shared_drive/validation.csv")
AICROWD_PREDICTIONS_PATH = os.getenv("PREDICTIONS_PATH", "predictions.csv")
AICROWD_ASSETS_DIR = "assets"
AICROWD_API_KEY = "" # Get your key from https://www.aicrowd.com/participants/me
import joblib
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.metrics import log_loss
from sklearn.impute import KNNImputer
from deeptables.models import deeptable
from sklearn.model_selection import StratifiedKFold
Training phase ⚙️
You can define your training code here. This section will be skipped during evaluation.
Load training data
Let's start by loading the data and encoding the diagnosis labels.
train = pd.read_csv(AICROWD_DATASET_PATH.replace("validation", "train"))
valid = pd.read_csv(AICROWD_DATASET_PATH)
valid_gt = pd.read_csv(AICROWD_DATASET_PATH.replace("validation", "validation_ground_truth"))
train.loc[train.diagnosis == "normal", "diagnosis"] = 0
train.loc[train.diagnosis == "pre_alzheimer", "diagnosis"] = 1
train.loc[train.diagnosis == "post_alzheimer", "diagnosis"] = 2
valid_gt.loc[valid_gt.diagnosis == "normal", "diagnosis"] = 0
valid_gt.loc[valid_gt.diagnosis == "pre_alzheimer", "diagnosis"] = 1
valid_gt.loc[valid_gt.diagnosis == "post_alzheimer", "diagnosis"] = 2
Now let's do a quick EDA, starting with the NaN values.
We'll see how many NaN values are there per row and per column:
null_row_count = train.isna().sum(axis=1).value_counts()
null_row_count = null_row_count.sort_values(ascending=False).head(20)
fig = sns.barplot(x=null_row_count.index, y=null_row_count)
plt.xlabel("Number of NaNs")
plt.ylabel("Number of rows")
plt.title("Number of NaNs per row")
plt.show()
#We'll replace column names with their indices for easier plots
train_ind_cols = train.copy()
train_ind_cols.columns = [x for x in range(len(train.columns))]
null_column_count = train_ind_cols.isna().sum(axis=0)
null_column_count = null_column_count.sort_values(ascending=False).head(20)
fig = sns.barplot(x=null_column_count.index, y=null_column_count)
plt.xlabel("Column Index")
plt.ylabel("Number of NaNs in column")
plt.title("Number of NaNs per column")
plt.show()
#And here are the names of the columns
pd.Series(train.iloc[:,null_column_count.index].columns, index = null_column_count.index)
As we can see, there are lots of NaN values in each row and column. There are many ways to deal with them. Here we'll use KNNImputer on the continuous columns: it replaces each NaN with the mean of that feature over the most similar rows (5 nearest neighbors by default). For categorical columns (defined here as fewer than 20 unique values) we'll just fill with the most frequent category.
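To make the two imputation strategies concrete, here is a minimal sketch on toy data (hypothetical values, not the ADDI columns). It uses `n_neighbors=2` instead of the default so the result is easy to check by hand.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy continuous features with missing entries (hypothetical values)
X_toy = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each NaN becomes the mean of that column over the 2 nearest rows,
# with distance measured on the features both rows have present
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X_toy)

# Toy categorical column: fill NaNs with the most frequent value
cat_toy = pd.Series([1.0, 1.0, 2.0, np.nan])
most_common = cat_toy.value_counts().index[0]
cat_filled = cat_toy.fillna(most_common)
```

Here the NaN in the first row becomes 4.0 (the mean of 3 and 5 from its two nearest rows) and the NaN in the third row becomes 5.5 (the mean of 3 and 8).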
Now let's take a look at the number of digits drawn for each class:
# Percentage of each class with at least one missing digit (fewer than 12 drawn)
(train.number_of_digits < 12).groupby(train.diagnosis).mean() * 100
More than half of the normal category (0) have at least 1 missing digit!
This is a reminder that normal is not normal. As mentioned in the competition overview, normal means "not an Alzheimer's patient", and it doesn't rule out other neurodegenerative diseases. This could help when doing feature engineering.
Train your model
Now let's do some preprocessing. We'll use the KNNImputer implementation from sklearn for continuous values, and fill categorical values with the most common value.
most_commons = []
continuous, categorical = [], []
for i in train.columns[1:-1]:
    # Define categorical as fewer than 20 unique values
    if len(train[i].unique()) < 20:
        categorical.append(i)
    else:
        continuous.append(i)
for i in categorical:
    most_common = train[i].value_counts().keys()[0]
    most_commons.append(most_common)
    train[i] = train[i].fillna(most_common)
imputer = KNNImputer()  # Here's the fun part (it takes some time)
train[continuous] = imputer.fit_transform(train[continuous])
#One-hot encode the only non-numeric value we have
one_hot = pd.get_dummies(train["intersection_pos_rel_centre"])
train = train.drop("intersection_pos_rel_centre", axis=1)
train = train.join(one_hot)
meta_train = [categorical,continuous,most_commons]
for i, col in enumerate(categorical):
    # We'll fill valid and test with the most common value from train
    valid[col] = valid[col].fillna(most_commons[i])
# Note: a new imputer is fit on valid here, mirroring the prediction phase
imputer = KNNImputer()
valid[continuous] = imputer.fit_transform(valid[continuous])
one_hot = pd.get_dummies(valid["intersection_pos_rel_centre"])
valid = valid.drop("intersection_pos_rel_centre", axis=1)
valid = valid.join(one_hot)
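The cells above fit one imputer on train and a second one on valid, and one-hot encode each frame separately. An alternative, sketched below on toy frames, is to reuse the train-fitted imputer and to align the one-hot columns with those seen in training. The frame contents, the `pos` column, and the helper name `align_columns` are hypothetical; this is an illustration under those assumptions, not the approach the notebook submits.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy stand-ins for train and valid (hypothetical values)
train_toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [10.0, 20.0, 30.0, 40.0],
    "pos": ["TL", "TR", "BL", "TL"],
})
valid_toy = pd.DataFrame({
    "a": [np.nan, 3.0],
    "b": [20.0, 25.0],
    "pos": ["TR", "TR"],  # "TL" and "BL" never appear here
})
continuous = ["a", "b"]

# Fit the imputer once on train, then only transform valid
imputer = KNNImputer(n_neighbors=2)
train_toy[continuous] = imputer.fit_transform(train_toy[continuous])
valid_toy[continuous] = imputer.transform(valid_toy[continuous])


def align_columns(df, reference_columns):
    """Add missing columns as 0 and drop extras so df matches training."""
    return df.reindex(columns=reference_columns, fill_value=0)


train_enc = pd.get_dummies(train_toy, columns=["pos"])
valid_enc = pd.get_dummies(valid_toy, columns=["pos"])
valid_enc = align_columns(valid_enc, train_enc.columns)
```

To use this in the prediction phase, the fitted imputer and the training column list would also have to be saved to the assets directory, for example with `joblib.dump`.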
A little bit of feature engineering: add the row-wise variance of the "euc_dist_digit_*" columns and of the columns ending in "dist from cen".
m = [col for col in train.columns if col.endswith("dist from cen")]
train["dist from cen_var"] = train[m].var(axis=1)
m = [col for col in train.columns if col.startswith("euc_dist_digit_")]
train["euc_dist_digit_var"] = train[m].var(axis=1)
m = [col for col in valid.columns if col.endswith("dist from cen")]
valid["dist from cen_var"] = valid[m].var(axis=1)
m = [col for col in valid.columns if col.startswith("euc_dist_digit_")]
valid["euc_dist_digit_var"] = valid[m].var(axis=1)
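Since the same two features are computed for train, valid and (later) test, the logic could live in one function. The sketch below uses a hypothetical helper name, `add_variance_features`, and toy column names that only imitate the dataset's naming pattern. It also excludes the output column `euc_dist_digit_var` from its own inputs, because that name starts with `euc_dist_digit_` and would otherwise be picked up if the cell were run twice.

```python
import pandas as pd


def add_variance_features(df):
    """Return a copy of df with the two row-wise variance features added."""
    out = df.copy()
    cen_cols = [c for c in df.columns if c.endswith("dist from cen")]
    # Skip the output column itself so re-running does not feed it back in
    euc_cols = [
        c for c in df.columns
        if c.startswith("euc_dist_digit_") and c != "euc_dist_digit_var"
    ]
    out["dist from cen_var"] = df[cen_cols].var(axis=1)
    out["euc_dist_digit_var"] = df[euc_cols].var(axis=1)
    return out


# Toy frame with hypothetical column names and values
toy = pd.DataFrame({
    "1 dist from cen": [1.0, 2.0],
    "2 dist from cen": [3.0, 2.0],
    "euc_dist_digit_1": [0.0, 1.0],
    "euc_dist_digit_2": [2.0, 1.0],
})
toy_fe = add_variance_features(toy)
toy_fe_twice = add_variance_features(toy_fe)  # same values as toy_fe
```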
We have arrived at the fun part!
We'll use the apply_class_weight parameter to let deeptables deal with the imbalance for us!
For cross-validation we'll use 10 stratified folds, and we'll also predict on and score the validation set.
You can try experimenting with different nets, hyperparameters, etc. from the documentation.
There's also a fit_cross_validation function available you can try out!
There are lots of models to choose from (and combine). And you could also tune hyperparameters for most of these models. Happy tinkering!
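For intuition on what class weighting does, here is the common "balanced" heuristic, `n_samples / (n_classes * count_per_class)`, computed on toy labels. Whether `apply_class_weight` uses exactly this formula internally is an assumption; the notebook does not verify it.

```python
import numpy as np

# Toy imbalanced labels (hypothetical): 6 normal, 3 pre-, 1 post-Alzheimer's
y_toy = np.array([0] * 6 + [1] * 3 + [2] * 1)

counts = np.bincount(y_toy)  # samples per class
n_classes = len(counts)
balanced = len(y_toy) / (n_classes * counts)

# Rarer classes get larger weights, so every class carries the same
# total weight (count * weight) in the loss
class_weight = dict(enumerate(balanced))
```

With these weights a mistake on the rarest class costs six times as much as a mistake on the most common one, which is the kind of correction the imbalance calls for.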
# StratifiedKFold is already imported above; it keeps the class ratio in every fold
skf = StratifiedKFold(n_splits=10, random_state=0, shuffle=True)
X_train = train.drop(["diagnosis", "row_id"], axis=1)
y_train = train['diagnosis'].astype(int)
# Each tuple should be (units, dropout rate, use batch norm); check the deeptables docs
dnn_params = {
    "hidden_units": (
        (512, 0, False),
        (256, 0, False),
        (128, 0, False),
        (64, 0, False),
        (32, 0, False),
        (16, 0, False),
        (8, 0, False),
    ),
    "dnn_activation": "relu",
}
conf = deeptable.ModelConfig(
    nets=["dnn_nets", "autoint_nets"],
    optimizer="RMSprop",
    earlystopping_patience=5,
    apply_class_weight=True,
    monitor_metric="val_loss",
    dnn_params=dnn_params,
)
score = 0.0
for fold, (itrain, ivalid) in enumerate(skf.split(X_train, y_train)):
    print("-" * 40)
    print(f"Running for fold {fold}")
    xtr, ytr = X_train.iloc[itrain], y_train.iloc[itrain]
    xv, yv = X_train.iloc[ivalid], y_train.iloc[ivalid]
    dt = deeptable.DeepTable(config=conf)
    model, history = dt.fit(
        xtr, ytr, epochs=100, verbose=0, validation_data=(xv, yv)
    )
    ll = log_loss(yv.values.astype(int), dt.predict_proba(xv))
    print("-" * 40)
    print(ll)
    score += ll
    dt.save(f"nn_{fold}")
# Mean log loss over the 10 folds
print(score / 10)
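The loop above averages the per-fold log losses. A related pattern is to collect out-of-fold predictions so the whole training set gets a single score, which is also handy later for ensembling. The sketch below shows this on toy data; `LogisticRegression` is only a lightweight stand-in for the DeepTable model, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
# Toy imbalanced 3-class data (hypothetical, not the ADDI data)
y_toy = np.array([0] * 60 + [1] * 30 + [2] * 10)
X_toy = rng.normal(size=(len(y_toy), 4)) + y_toy[:, None]

n_splits = 5
cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)

# One row of class probabilities per training sample, filled fold by fold
oof = np.zeros((len(y_toy), 3))
fold_class_counts = []
for tr_idx, va_idx in cv.split(X_toy, y_toy):
    clf = LogisticRegression(max_iter=1000)  # stand-in for the neural net
    clf.fit(X_toy[tr_idx], y_toy[tr_idx])
    oof[va_idx] = clf.predict_proba(X_toy[va_idx])
    fold_class_counts.append(np.bincount(y_toy[va_idx], minlength=3))

# Single score over all out-of-fold predictions
oof_score = log_loss(y_toy, oof, labels=[0, 1, 2])
```

Because the split is stratified, every validation fold keeps the 6:3:1 class ratio of the full toy set.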
dts = []
for i in range(10):
    model_filename = f"nn_{i}"
    dt = deeptable.DeepTable.load(model_filename)
    dts.append(dt)
# Average the predicted probabilities of the 10 fold models
X_valid = valid.iloc[:, 1:]  # drop row_id (the first column)
preds = np.mean([dt.predict_proba(X_valid) for dt in dts], axis=0)
One more thing to note: the model can produce wildly different results each time it's trained (though you can save the weights after training). Is there a better way to deal with the randomness? I'll leave that to you.
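One partial answer is to fix the random seeds before each training run. The helper below is a sketch: `set_global_seed` is a hypothetical name, and the TensorFlow call only runs if TensorFlow (which deeptables builds on) can be imported. Seeding may still not give bit-identical results on a GPU, so averaging models trained with several different seeds is another option worth trying.

```python
import random

import numpy as np


def set_global_seed(seed):
    """Seed Python, NumPy and, if available, TensorFlow."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import tensorflow as tf  # optional; skipped when not installed
        tf.random.set_seed(seed)
    except ImportError:
        pass


# Calling the helper with the same seed reproduces the same random draws
set_global_seed(0)
first_draw = (random.random(), np.random.rand(3))
set_global_seed(0)
second_draw = (random.random(), np.random.rand(3))
```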
# log_loss is already imported above; score the averaged predictions on the validation set
log_loss(valid_gt.diagnosis.values.astype(int), preds)
Save your trained model
meta = {
    "categorical": meta_train[0],
    "continuous": meta_train[1],
    "most_commons": meta_train[2],
}
!cp nn* {AICROWD_ASSETS_DIR} -r
meta_filename = f'{AICROWD_ASSETS_DIR}/nn_meta.pkl'
joblib.dump(meta, meta_filename)
Prediction phase 🔎
Please make sure to save the weights from the training section in your assets directory and load them in this section.
meta_filename = f'{AICROWD_ASSETS_DIR}/nn_meta.pkl'
meta = joblib.load(meta_filename)
dts = []
for i in range(10):
    model_filename = f"{AICROWD_ASSETS_DIR}/nn_{i}"
    dt = deeptable.DeepTable.load(model_filename)
    dts.append(dt)
categorical = meta['categorical']
continuous = meta['continuous']
most_commons = meta['most_commons']
Load test data
test_data = pd.read_csv(AICROWD_DATASET_PATH)
Generate predictions
for i, col in enumerate(categorical):
    # Fill with the most common train value loaded from the metadata
    test_data[col] = test_data[col].fillna(most_commons[i])
imputer = KNNImputer()
test_data[continuous] = imputer.fit_transform(test_data[continuous])
one_hot = pd.get_dummies(test_data["intersection_pos_rel_centre"])
test_data = test_data.drop("intersection_pos_rel_centre", axis=1)
test_data = test_data.join(one_hot)
m = [col for col in test_data.columns if col.endswith("dist from cen")]
test_data["dist from cen_var"] = test_data[m].var(axis=1)
m = [col for col in test_data.columns if col.startswith("euc_dist_digit_")]
test_data["euc_dist_digit_var"] = test_data[m].var(axis=1)
X_test = test_data.drop("row_id",axis=1)
# Average the predicted probabilities of the 10 fold models
preds = np.mean([dt.predict_proba(X_test) for dt in dts], axis=0)
predictions = {
    "row_id": test_data["row_id"].values,
    "normal_diagnosis_probability": preds[:, 0],
    "pre_alzheimer_diagnosis_probability": preds[:, 1],
    "post_alzheimer_diagnosis_probability": preds[:, 2],
}
predictions_df = pd.DataFrame.from_dict(predictions)
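Before writing the file, a few quick sanity checks on the prediction frame can catch mistakes early. The helper below is a sketch with a hypothetical name, `check_predictions`; the checks (no NaNs, values in [0, 1], rows summing to 1) are reasonable expectations for class probabilities, not rules taken from the competition.

```python
import numpy as np
import pandas as pd

PROB_COLUMNS = [
    "normal_diagnosis_probability",
    "pre_alzheimer_diagnosis_probability",
    "post_alzheimer_diagnosis_probability",
]


def check_predictions(df):
    """Raise ValueError if the prediction frame looks malformed."""
    missing = [c for c in ["row_id"] + PROB_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    probs = df[PROB_COLUMNS].to_numpy(dtype=float)
    if np.isnan(probs).any():
        raise ValueError("NaN probabilities found")
    if ((probs < 0) | (probs > 1)).any():
        raise ValueError("probabilities outside [0, 1]")
    if not np.allclose(probs.sum(axis=1), 1.0, atol=1e-4):
        raise ValueError("rows do not sum to 1")


# Toy frame with hypothetical row ids and probabilities
toy_predictions = pd.DataFrame({
    "row_id": ["r1", "r2"],
    "normal_diagnosis_probability": [0.7, 0.2],
    "pre_alzheimer_diagnosis_probability": [0.2, 0.3],
    "post_alzheimer_diagnosis_probability": [0.1, 0.5],
})
check_predictions(toy_predictions)  # passes silently
```

In the notebook this would be called as `check_predictions(predictions_df)` right before saving.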
Save predictions 📨
predictions_df.to_csv(AICROWD_PREDICTIONS_PATH, index=False)
Submit to AIcrowd 🚀
NOTE: PLEASE SAVE THE NOTEBOOK BEFORE SUBMITTING IT (Ctrl + S)
!DATASET_PATH=$AICROWD_DATASET_PATH \
aicrowd notebook submit \
--assets-dir $AICROWD_ASSETS_DIR \
--challenge addi-alzheimers-detection-challenge
Reflections
This was an exploration of what a neural network could do in this competition. The model shown has a slightly higher log loss on the public leaderboard. Maybe with some hyperparameter tuning it will help in an ensemble!