ADDI Alzheimers Detection Challenge
Simple EDA and Baseline - LB 0.66 (0.616 with a magic)
This notebook contains 1) a simple analysis, 2) simple feature engineering, and 3) a simple k-fold model that scores 0.76 CV and 0.66 LB.
The magic is the sampling ratio in cell 15: changing it from "nb_neg = nb_pos" to "nb_neg = nb_pos*2" will score 0.616 on the LB.
Simple EDA and baseline models¶
The challenge is to use the features extracted from the Clock Drawing Test to build an automated algorithm that predicts which of three phases each participant is in:
1) Pre-Alzheimer’s (Early Warning), 2) Post-Alzheimer’s (Detection), 3) Normal (not an Alzheimer’s patient)
In machine learning terms: this is a 3-class classification task.
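For reference, the model section below maps these labels to class indices, which also fixes the order of the probability columns in the submission:
target_values = ["normal", "post_alzheimer", "pre_alzheimer"]
label_map = dict(zip(target_values, range(len(target_values))))  # {'normal': 0, 'post_alzheimer': 1, 'pre_alzheimer': 2}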
How to use this notebook? 📝¶
- Update the config parameters. You can define the common variables here
Variable | Description
---|---
AICROWD_DATASET_PATH | Path to the file containing test data (the data will be available at /ds_shared_drive/ on the Aridhia workspace). This should be an absolute path.
AICROWD_PREDICTIONS_PATH | Path to write the output to.
AICROWD_ASSETS_DIR | In case your notebook needs additional files (like model weights, etc.), add them to a directory and specify its path here (please use a relative path). The contents of this directory will be sent to AIcrowd for evaluation.
AICROWD_API_KEY | To submit your code to AIcrowd, you need to provide your account's API key, available at https://www.aicrowd.com/participants/me
- Install packages. Please use the Install packages 🗃 section to install any packages you need.
- Train your models. All code within the Training phase ⚙️ section will be skipped during evaluation. Please make sure to save your model weights in the assets directory and load them in the prediction phase section.
Setup AIcrowd Utilities 🛠¶
We use this to bundle the files for submission and create a submission on AIcrowd. Do not edit this block.
!pip install -q -U aicrowd-cli
%load_ext aicrowd.magic
AIcrowd Runtime Configuration 🧷¶
Define configuration parameters. Please include any files needed for the notebook to run under ASSETS_DIR. We will copy the contents of this directory to your final submission file 🙂
The dataset is available under /ds_shared_drive on the workspace.
import os
# Please use an absolute path for the location of the dataset.
# Or you can build one from a relative path, e.g. `os.getcwd() + "/test_data/validation.csv"`
AICROWD_DATASET_PATH = os.getenv("DATASET_PATH", "/ds_shared_drive/validation.csv")
AICROWD_PREDICTIONS_PATH = os.getenv("PREDICTIONS_PATH", "predictions.csv")
AICROWD_ASSETS_DIR = "assets"
AICROWD_API_KEY = "" # Get your key from https://www.aicrowd.com/participants/me
Install packages 🗃¶
Please add all package installations in this section
!pip install numpy pandas
!pip install seaborn lightgbm scikit-learn
Define preprocessing code 💻¶
The code that is common between the training and the prediction sections should be defined here. During evaluation, we completely skip the training section. Please make sure to add any common logic between the training and prediction sections here.
Import common packages¶
Please import packages that are common for training and prediction phases here.
# some preprocessing code
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set()
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
import joblib
import warnings
warnings.filterwarnings("ignore")
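Since this section is meant to hold logic shared by training and prediction, the feature-engineering steps used in both phases below could be factored into a single helper. A minimal sketch (add_features is a hypothetical name; the notebook keeps these steps inline in each phase instead):
def add_features(df, numeric_features, cat_cols):
    """Apply the feature engineering shared by the training and prediction phases."""
    df = df.copy()
    # Cast numeric columns to float
    for c in numeric_features:
        df[c] = df[c].astype(float)
    # Fill categorical NaNs, then one-hot encode
    for c in cat_cols:
        df[c] = df[c].fillna("NA")
    dummies = pd.get_dummies(df[cat_cols], columns=cat_cols, dummy_na=True).add_prefix('CAT_')
    df = pd.concat([df, dummies], axis=1)
    # Count missing numeric values per row, then fill the rest
    df['cnt_NaN'] = df[numeric_features].isna().sum(axis=1)
    return df.fillna(-1)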
Training phase ⚙️¶
You can define your training code here. This section will be skipped during evaluation.
# model = define_your_model
Load training data¶
# load your data
AICROWD_DATASET_PATH
target_col = "diagnosis"
key_col = "row_id"
cat_cols = ['intersection_pos_rel_centre']
seed = 2021
target_values = ["normal", "post_alzheimer", "pre_alzheimer"]
# Derive the training CSV path from the validation path, then keep only labeled rows
train = pd.read_csv(AICROWD_DATASET_PATH.replace("validation", "train"))
train = train[train[target_col].isin(target_values)].copy().reset_index(drop=True)
print(train.shape)
features = train.columns[1:-1].to_list()  # drop row_id (first column) and the target (last column)
numeric_features = [c for c in features if c not in cat_cols]
for c in numeric_features:
    train[c] = train[c].astype(float)
print(train[target_col].value_counts())
train.tail(3)
Target¶
sns.countplot(x=target_col, data=train);
Numerical features¶
nb_shown = len(numeric_features)
fig, ax = plt.subplots(nb_shown, 1, figsize=(20,5*nb_shown))
colors = ["Green", "Blue", "Red"]
for i, col in enumerate(numeric_features[:nb_shown]):
    for value, color in zip(target_values, colors):
        sns.distplot(train.loc[train[target_col]==value, col],
                     ax=ax[i], color=color, norm_hist=True)
    ax[i].set_title("Train {}".format(col))
    ax[i].set_xlabel("")
Categorical features¶
There is only one categorical feature
sns.countplot(x=cat_cols[0], hue=target_col, data=train[cat_cols+[target_col]].fillna("NA"));
Balance the dataset and see the distribution again¶
# Keep every pre/post-Alzheimer row as the positive set
df_pos = train[train[target_col].isin(target_values[1:])]
nb_pos = df_pos.shape[0]
# The "magic" from the intro: nb_neg = nb_pos*2 scores 0.616 on the LB
nb_neg = nb_pos
df_neg = train[train[target_col] == "normal"].sample(n=nb_neg, random_state=seed)
df_samples = pd.concat([df_pos, df_neg]).sample(frac=1).reset_index(drop=True)
sns.countplot(x=cat_cols[0], hue=target_col, data=df_samples[cat_cols+[target_col]].fillna("NA"));
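A quick check that the undersampling behaved as intended:
# With nb_neg = nb_pos, the "normal" count should equal the two positive classes combined.
print(df_samples[target_col].value_counts())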
Train your model¶
# model.fit(train_data)
# some custom code block
Simple FE¶
print(cat_cols)
for c in cat_cols:
    df_samples[c].fillna("NA", inplace=True)
df_dummies = pd.get_dummies(df_samples[cat_cols], columns=cat_cols, dummy_na=True).add_prefix('CAT_')
dummy_cols = df_dummies.columns.to_list()
print(dummy_cols)
df_samples = pd.concat([df_samples, df_dummies], axis=1)
df_samples['cnt_NaN'] = df_samples[numeric_features].isna().sum(axis=1)
df_samples.fillna(-1, inplace=True)
df_samples.head(3)
model_features = df_samples.columns.to_list()
model_features = [c for c in model_features if c not in [key_col, target_col] + cat_cols]
# Drop columns that contain a single constant value
unique_value_cols = []
for c in model_features:
    if df_samples[c].unique().shape[0] == 1:
        unique_value_cols.append(c)
print(unique_value_cols)
model_features = [c for c in model_features if c not in unique_value_cols]
print(len(model_features))
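The same constant-column filter can be written as one vectorized pass (a sketch; behavior is identical because NaNs were already filled with -1):
nunique = df_samples[model_features].nunique()
model_features = [c for c in model_features if nunique[c] > 1]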
Train models with 5 folds¶
X_train = df_samples[model_features]
y_train = df_samples[target_col].map(dict(zip(target_values, list(range(len(target_values))))))
skf = StratifiedKFold(n_splits=5, random_state=2021, shuffle=True)
preds = 0.0
params = {
    "objective": "multiclass",
    "num_class": len(target_values),
    "bagging_seed": 2021,
    "verbosity": 1,
}
clfs = []
for fold, (itrain, ivalid) in enumerate(skf.split(X_train, y_train)):
    print("-"*40)
    print(f"Running for fold {fold}")
    lgb_train = lgb.Dataset(X_train.iloc[itrain], y_train.iloc[itrain])
    lgb_eval = lgb.Dataset(X_train.iloc[ivalid], y_train.iloc[ivalid], reference=lgb_train)
    clf = lgb.train(params, lgb_train, 1000, valid_sets=[lgb_eval],
                    early_stopping_rounds=100, verbose_eval=200)
    clfs.append(clf)
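To reproduce the CV score quoted at the top, out-of-fold predictions can be collected and scored. A sketch, assuming the leaderboard metric is multiclass log loss (the folds regenerate identically because skf uses a fixed random_state):
from sklearn.metrics import log_loss
oof = np.zeros((X_train.shape[0], len(target_values)))
for clf, (itrain, ivalid) in zip(clfs, skf.split(X_train, y_train)):
    # Each model predicts only on the rows it never saw during training
    oof[ivalid] = clf.predict(X_train.iloc[ivalid])
print("OOF log loss:", log_loss(y_train, oof))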
Let's see the feature importance of a model¶
lgb.plot_importance(clf, max_num_features=20);
Save your trained model¶
# model.save()
for i, clf in enumerate(clfs):
    model_filename = f'{AICROWD_ASSETS_DIR}/model_lgb_fold_{i}.pkl'
    joblib.dump(clf, model_filename)
meta = {
    "numeric_features": numeric_features,
    "cat_cols": cat_cols,
    "dummy_cols": dummy_cols,
    "model_features": model_features,
}
meta_filename = f'{AICROWD_ASSETS_DIR}/model_lgb_meta.pkl'
joblib.dump(meta, meta_filename)
Prediction phase 🔎¶
Please make sure to save the weights from the training section in your assets directory and load them in this section
# model = load_model_from_assets_dir(AIcrowdConfig.ASSETS_DIR)
nb_folds = 5 # skf.n_splits
clfs = []
for fold in range(nb_folds):
    print("-"*40)
    print(f"Running for fold {fold}")
    model_filename = f'{AICROWD_ASSETS_DIR}/model_lgb_fold_{fold}.pkl'
    clf = joblib.load(model_filename)
    clfs.append(clf)
print("-"*40)
meta_filename = f'{AICROWD_ASSETS_DIR}/model_lgb_meta.pkl'
meta = joblib.load(meta_filename)
print(meta.keys())
numeric_features = meta['numeric_features']
cat_cols = meta['cat_cols']
dummy_cols = meta['dummy_cols']
model_features = meta['model_features']
Load test data¶
test_data = pd.read_csv(AICROWD_DATASET_PATH)
test_data.head()
Generate predictions¶
test_data = test_data.copy()
for c in numeric_features:
    test_data[c] = test_data[c].astype(float)
for c in cat_cols:
    test_data[c].fillna("NA", inplace=True)
df_test_dummies = pd.get_dummies(test_data[cat_cols], columns=cat_cols, dummy_na=True).add_prefix('CAT_')
test_data = pd.concat([test_data, df_test_dummies], axis=1)
test_data['cnt_NaN'] = test_data[numeric_features].isna().sum(axis=1)
test_data.fillna(-1, inplace=True)
# Add any dummy column seen in training but absent from the test set
for c in dummy_cols:
    if c not in test_data.columns:
        test_data[c] = 0
print("Missing columns:", [c for c in model_features if c not in test_data.columns])
test_data.head(3)
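As an aside, a single reindex can guarantee that every training-time column exists in the test frame, filling absent dummy columns with 0 (a sketch; the loop above plus the selection below achieve the same result):
X_test_alt = test_data.reindex(columns=model_features, fill_value=0)  # equivalent to the manual alignment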
X_test = test_data[model_features]
preds = 0.0
nb_folds = 5 # skf.n_splits
for fold, clf in enumerate(clfs):
    print("-"*40)
    print(f"Running for fold {fold}")
    pred = clf.predict(X_test)
    preds += pred/nb_folds
print(preds.shape)
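A quick optional sanity check: LightGBM's multiclass probabilities sum to 1 per row, and averaging across folds preserves this.
# Each row of the fold-averaged probabilities should sum to ~1
assert np.allclose(preds.sum(axis=1), 1.0)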
predictions = {
    "row_id": test_data["row_id"].values,
    "normal_diagnosis_probability": preds[:, 0],
    "post_alzheimer_diagnosis_probability": preds[:, 1],
    "pre_alzheimer_diagnosis_probability": preds[:, 2],
}
predictions_df = pd.DataFrame.from_dict(predictions)
Save predictions 📨¶
predictions_df.to_csv(AICROWD_PREDICTIONS_PATH, index=False)
Submit to AIcrowd 🚀¶
NOTE: PLEASE SAVE THE NOTEBOOK BEFORE SUBMITTING IT (Ctrl + S)
!aicrowd login --api-key $AICROWD_API_KEY
!DATASET_PATH=$AICROWD_DATASET_PATH \
aicrowd notebook submit \
--assets-dir $AICROWD_ASSETS_DIR \
--challenge addi-alzheimers-detection-challenge
Comments
Hey, on submission I am receiving a KeyError at this line:
train = train[train[target_col].isin(target_values)].copy().reset_index(drop=True)
KeyError: 'diagnosis'
I used your notebook as a reference for loading the data. Can you help me resolve the issue?