ADDI Alzheimers Detection Challenge
Simple EDA and Baseline - LB 0.66 (0.616 with a magic)
This notebook contains 1) a simple analysis, 2) simple feature engineering, and 3) a simple k-fold model that scores 0.76 CV and 0.66 LB.
The magic is the sampling ratio in cell 15: changing it from "nb_neg = nb_pos" to "nb_neg = nb_pos*2" will score 0.616 on the LB.
Simple EDA and baseline models¶
The challenge is to use the features extracted from the Clock Drawing Test to build an automated algorithm that predicts which of three phases each participant is in:
1) Pre-Alzheimer’s (Early Warning), 2) Post-Alzheimer’s (Detection), 3) Normal (not an Alzheimer’s patient)
In machine learning terms: this is a 3-class classification task.
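For reference, the model section below maps these labels to class indices, which also fixes the order of the probability columns in the submission:
target_values = ["normal", "post_alzheimer", "pre_alzheimer"]
label_map = dict(zip(target_values, range(len(target_values))))  # {'normal': 0, 'post_alzheimer': 1, 'pre_alzheimer': 2}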
How to use this notebook? 📝¶
- Update the config parameters. You can define the common variables here
Variable | Description
---|---
AICROWD_DATASET_PATH | Path to the file containing test data (the data will be available at /ds_shared_drive/ on the Aridhia workspace). This should be an absolute path.
AICROWD_PREDICTIONS_PATH | Path to write the output to.
AICROWD_ASSETS_DIR | In case your notebook needs additional files (like model weights, etc.), add them to a directory and specify its path here (please use a relative path). The contents of this directory will be sent to AIcrowd for evaluation.
AICROWD_API_KEY | To submit your code to AIcrowd, you need to provide your account's API key, available at https://www.aicrowd.com/participants/me
- Install packages. Please use the Install packages 🗃 section to install any packages you need.
- Train your models. All code within the Training phase ⚙️ section will be skipped during evaluation. Please make sure to save your model weights in the assets directory and load them in the prediction phase section.
Setup AIcrowd Utilities 🛠¶
We use this to bundle the files for submission and create a submission on AIcrowd. Do not edit this block.
!pip install -q -U aicrowd-cli
%load_ext aicrowd.magic
AIcrowd Runtime Configuration 🧷¶
Define configuration parameters. Please include any files needed for the notebook to run under ASSETS_DIR. We will copy the contents of this directory to your final submission file 🙂
The dataset is available under /ds_shared_drive on the workspace.
import os
# Please use an absolute path for the location of the dataset.
# Or you can build one from a relative path, e.g. `os.getcwd() + "/test_data/validation.csv"`
AICROWD_DATASET_PATH = os.getenv("DATASET_PATH", "/ds_shared_drive/validation.csv")
AICROWD_PREDICTIONS_PATH = os.getenv("PREDICTIONS_PATH", "predictions.csv")
AICROWD_ASSETS_DIR = "assets"
AICROWD_API_KEY = "" # Get your key from https://www.aicrowd.com/participants/me
Install packages 🗃¶
Please add all package installations in this section
!pip install numpy pandas
!pip install seaborn lightgbm scikit-learn
Define preprocessing code 💻¶
The code that is common between the training and the prediction sections should be defined here. During evaluation, we completely skip the training section. Please make sure to add any common logic between the training and prediction sections here.
Import common packages¶
Please import packages that are common for training and prediction phases here.
# some preprocessing code
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set()
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
import joblib
import warnings
warnings.filterwarnings("ignore")
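Since this section is meant to hold logic shared by training and prediction, the feature-engineering steps used in both phases below could be factored into a single helper. A minimal sketch (add_features is a hypothetical name; the notebook keeps these steps inline in each phase instead):
def add_features(df, numeric_features, cat_cols):
    """Apply the feature engineering shared by the training and prediction phases."""
    df = df.copy()
    # Cast numeric columns to float
    for c in numeric_features:
        df[c] = df[c].astype(float)
    # Fill categorical NaNs, then one-hot encode
    for c in cat_cols:
        df[c] = df[c].fillna("NA")
    dummies = pd.get_dummies(df[cat_cols], columns=cat_cols, dummy_na=True).add_prefix('CAT_')
    df = pd.concat([df, dummies], axis=1)
    # Count missing numeric values per row, then fill the rest
    df['cnt_NaN'] = df[numeric_features].isna().sum(axis=1)
    return df.fillna(-1)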
Training phase ⚙️¶
You can define your training code here. This section will be skipped during evaluation.
# model = define_your_model
Load training data¶
# load your data
AICROWD_DATASET_PATH
target_col = "diagnosis"
key_col = "row_id"
cat_cols = ['intersection_pos_rel_centre']
seed = 2021
target_values = ["normal", "post_alzheimer", "pre_alzheimer"]
# Derive the training CSV path from the validation path, then keep only labeled rows
train = pd.read_csv(AICROWD_DATASET_PATH.replace("validation", "train"))
train = train[train[target_col].isin(target_values)].copy().reset_index(drop=True)
print(train.shape)
features = train.columns[1:-1].to_list()  # drop row_id (first column) and the target (last column)
numeric_features = [c for c in features if c not in cat_cols]
for c in numeric_features:
    train[c] = train[c].astype(float)
print(train[target_col].value_counts())
train.tail(3)
Target¶
sns.countplot(x=target_col, data=train);
Numerical features¶
nb_shown = len(numeric_features)
fig, ax = plt.subplots(nb_shown, 1, figsize=(20,5*nb_shown))
colors = ["Green", "Blue", "Red"]
for i, col in enumerate(numeric_features[:nb_shown]):
    for value, color in zip(target_values, colors):
        sns.distplot(train.loc[train[target_col]==value, col],
                     ax=ax[i], color=color, norm_hist=True)
    ax[i].set_title("Train {}".format(col))
    ax[i].set_xlabel("")
Categorical features¶
There is only one categorical feature
sns.countplot(x=cat_cols[0], hue=target_col, data=train[cat_cols+[target_col]].fillna("NA"));
Balance the dataset and see the distribution again¶
# Keep every pre/post-Alzheimer row as the positive set
df_pos = train[train[target_col].isin(target_values[1:])]
nb_pos = df_pos.shape[0]
# The "magic" from the intro: nb_neg = nb_pos*2 scores 0.616 on the LB
nb_neg = nb_pos
df_neg = train[train[target_col] == "normal"].sample(n=nb_neg, random_state=seed)
df_samples = pd.concat([df_pos, df_neg]).sample(frac=1).reset_index(drop=True)
sns.countplot(x=cat_cols[0], hue=target_col, data=df_samples[cat_cols+[target_col]].fillna("NA"));
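A quick check that the undersampling behaved as intended:
# With nb_neg = nb_pos, the "normal" count should equal the two positive classes combined.
print(df_samples[target_col].value_counts())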
Train your model¶
# model.fit(train_data)
# some custom code block
Simple FE¶
print(cat_cols)
for c in cat_cols:
    df_samples[c].fillna("NA", inplace=True)
df_dummies = pd.get_dummies(df_samples[cat_cols], columns=cat_cols, dummy_na=True).add_prefix('CAT_')
dummy_cols = df_dummies.columns.to_list()
print(dummy_cols)
df_samples = pd.concat([df_samples, df_dummies], axis=1)
df_samples['cnt_NaN'] = df_samples[numeric_features].isna().sum(axis=1)
df_samples.fillna(-1, inplace=True)
df_samples.head(3)
model_features = df_samples.columns.to_list()
model_features = [c for c in model_features if c not in [key_col, target_col] + cat_cols]
# Drop columns that contain a single constant value
unique_value_cols = []
for c in model_features:
    if df_samples[c].unique().shape[0] == 1:
        unique_value_cols.append(c)
print(unique_value_cols)
model_features = [c for c in model_features if c not in unique_value_cols]
print(len(model_features))
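The same constant-column filter can be written as one vectorized pass (a sketch; behavior is identical because NaNs were already filled with -1):
nunique = df_samples[model_features].nunique()
model_features = [c for c in model_features if nunique[c] > 1]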
Train models with 5 folds¶
X_train = df_samples[model_features]
y_train = df_samples[target_col].map(dict(zip(target_values, list(range(len(target_values))))))
skf = StratifiedKFold(n_splits=5, random_state=2021, shuffle=True)
preds = 0.0
params = {
    "objective": "multiclass",
    "num_class": len(target_values),
    "bagging_seed": 2021,
    "verbosity": 1,
}
clfs = []
for fold, (itrain, ivalid) in enumerate(skf.split(X_train, y_train)):
    print("-"*40)
    print(f"Running for fold {fold}")
    lgb_train = lgb.Dataset(X_train.iloc[itrain], y_train.iloc[itrain])
    lgb_eval = lgb.Dataset(X_train.iloc[ivalid], y_train.iloc[ivalid], reference=lgb_train)
    clf = lgb.train(params, lgb_train, 1000, valid_sets=[lgb_eval],
                    early_stopping_rounds=100, verbose_eval=200)
    clfs.append(clf)
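To reproduce the CV score quoted at the top, out-of-fold predictions can be collected and scored. A sketch, assuming the leaderboard metric is multiclass log loss (the folds regenerate identically because skf uses a fixed random_state):
from sklearn.metrics import log_loss
oof = np.zeros((X_train.shape[0], len(target_values)))
for clf, (itrain, ivalid) in zip(clfs, skf.split(X_train, y_train)):
    # Each model predicts only on the rows it never saw during training
    oof[ivalid] = clf.predict(X_train.iloc[ivalid])
print("OOF log loss:", log_loss(y_train, oof))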
Let's see the feature importance of a model¶
lgb.plot_importance(clf, max_num_features=20);
Save your trained model¶
# model.save()
for i, clf in enumerate(clfs):
    model_filename = f'{AICROWD_ASSETS_DIR}/model_lgb_fold_{i}.pkl'
    joblib.dump(clf, model_filename)
meta = {
    "numeric_features": numeric_features,
    "cat_cols": cat_cols,
    "dummy_cols": dummy_cols,
    "model_features": model_features,
}
meta_filename = f'{AICROWD_ASSETS_DIR}/model_lgb_meta.pkl'
joblib.dump(meta, meta_filename)
Prediction phase 🔎¶
Please make sure to save the weights from the training section in your assets directory and load them in this section
# model = load_model_from_assets_dir(AIcrowdConfig.ASSETS_DIR)
nb_folds = 5 # skf.n_splits
clfs = []
for fold in range(nb_folds):
    print("-"*40)
    print(f"Running for fold {fold}")
    model_filename = f'{AICROWD_ASSETS_DIR}/model_lgb_fold_{fold}.pkl'
    clf = joblib.load(model_filename)
    clfs.append(clf)
print("-"*40)
meta_filename = f'{AICROWD_ASSETS_DIR}/model_lgb_meta.pkl'
meta = joblib.load(meta_filename)
print(meta.keys())
numeric_features = meta['numeric_features']
cat_cols = meta['cat_cols']
dummy_cols = meta['dummy_cols']
model_features = meta['model_features']
Load test data¶
test_data = pd.read_csv(AICROWD_DATASET_PATH)
test_data.head()
Generate predictions¶
test_data = test_data.copy()
for c in numeric_features:
    test_data[c] = test_data[c].astype(float)
for c in cat_cols:
    test_data[c].fillna("NA", inplace=True)
df_test_dummies = pd.get_dummies(test_data[cat_cols], columns=cat_cols, dummy_na=True).add_prefix('CAT_')
test_data = pd.concat([test_data, df_test_dummies], axis=1)
test_data['cnt_NaN'] = test_data[numeric_features].isna().sum(axis=1)
test_data.fillna(-1, inplace=True)
# Add any dummy column seen in training but absent from the test set
for c in dummy_cols:
    if c not in test_data.columns:
        test_data[c] = 0
print("Missing columns:", [c for c in model_features if c not in test_data.columns])
test_data.head(3)
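As an aside, a single reindex can guarantee that every training-time column exists in the test frame, filling absent dummy columns with 0 (a sketch; the loop above plus the selection below achieve the same result):
X_test_alt = test_data.reindex(columns=model_features, fill_value=0)  # equivalent to the manual alignment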
X_test = test_data[model_features]
preds = 0.0
nb_folds = 5 # skf.n_splits
for fold, clf in enumerate(clfs):
    print("-"*40)
    print(f"Running for fold {fold}")
    pred = clf.predict(X_test)
    preds += pred/nb_folds
print(preds.shape)
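A quick optional sanity check: LightGBM's multiclass probabilities sum to 1 per row, and averaging across folds preserves this.
# Each row of the fold-averaged probabilities should sum to ~1
assert np.allclose(preds.sum(axis=1), 1.0)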
predictions = {
    "row_id": test_data["row_id"].values,
    "normal_diagnosis_probability": preds[:, 0],
    "post_alzheimer_diagnosis_probability": preds[:, 1],
    "pre_alzheimer_diagnosis_probability": preds[:, 2],
}
predictions_df = pd.DataFrame.from_dict(predictions)
Save predictions 📨¶
predictions_df.to_csv(AICROWD_PREDICTIONS_PATH, index=False)
Submit to AIcrowd 🚀¶
NOTE: PLEASE SAVE THE NOTEBOOK BEFORE SUBMITTING IT (Ctrl + S)
!aicrowd login --api-key $AICROWD_API_KEY
!DATASET_PATH=$AICROWD_DATASET_PATH \
aicrowd notebook submit \
--assets-dir $AICROWD_ASSETS_DIR \
--challenge addi-alzheimers-detection-challenge
Comments
Hey, on submission I am receiving a KeyError at this line:
train = train[train[target_col].isin(target_values)].copy().reset_index(drop=True)
KeyError: 'diagnosis'
I used your notebook as a reference for loading the data. Can you help me resolve the issue?