ADDI Alzheimers Detection Challenge
Detailed Data Analysis & Simple CatBoost - 0.640 on LB
Description of features and the entire dataset, selection of categorical features by logic, CatBoost
I plotted detailed graphs of the features and how they relate to the diagnosis, made basic summaries for the entire dataset, and trained a simple model. The analysis follows the organizers' PDF, so you can easily look up the description of any feature there. The LB scores are 0.640 and 0.447.
Setup AIcrowd Utilities 🛠¶
In [ ]:
!pip install -q -U aicrowd-cli
In [ ]:
%load_ext aicrowd.magic
AIcrowd Runtime Configuration 🧷¶
In [ ]:
import os
# Please use the absolute path for the location of the dataset.
# Or you can use a relative path, e.g. `os.getcwd() + "/test_data/validation.csv"`
AICROWD_TRAIN_DATASET_PATH = os.getenv("TRAIN_DATASET_PATH", "/ds_shared_drive/train.csv")
AICROWD_DATASET_PATH = os.getenv("DATASET_PATH", "/ds_shared_drive/validation.csv")
AICROWD_PREDICTIONS_PATH = os.getenv("PREDICTIONS_PATH", "predictions.csv")
AICROWD_ASSETS_DIR = "assets"
Install packages 🗃¶
In [ ]:
!pip install numpy pandas catboost scikit-learn
Define preprocessing code¶
Import common packages¶
In [ ]:
import numpy as np
import os
import random
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier
Training phase¶
Load training data¶
In [ ]:
train_data = pd.read_csv(AICROWD_TRAIN_DATASET_PATH)
train_data.head()
Out[ ]:
Features exploration¶
In [ ]:
regr_features = []
cat_features = []
Functions¶
In [ ]:
def get_corr(feature):
    features_corr = [feature]
    features_corr.append('diagnosis')
    df_corr = train_data[features_corr].copy()
    df_corr['diagnosis'] = pd.factorize(df_corr['diagnosis'])[0]
    if 'intersection_pos_rel_centre' in features_corr:
        df_corr['intersection_pos_rel_centre'] = pd.factorize(df_corr['intersection_pos_rel_centre'])[0]
    return df_corr.corr().values[0, 1]
In [ ]:
bold = '\033[1m'
ordinary = '\033[0m'

def feature_describe(feature):
    print(bold + 'Data type:' + ordinary, train_data[feature].dtype)
    print(bold + 'Share of missing values: ' + ordinary
          + str(round(100 * train_data[feature].isnull().sum() / train_data.shape[0], 2)) + '%')
    print(bold + 'Correlation with the diagnosis:' + ordinary, round(get_corr(feature), 2))
    if train_data[feature].dtype != object:
        print(bold + 'Min:' + ordinary, round(train_data[feature].min(), 2))
        print(bold + 'Mean:' + ordinary, round(train_data[feature].mean(), 2))
        print(bold + 'Max:' + ordinary, round(train_data[feature].max(), 2))
    unique_number = train_data[feature].nunique()
    uniques = train_data[feature].unique()
    print(bold + 'Number of unique values:' + ordinary, unique_number)
    print(bold + 'Example of unique values:' + ordinary, end=' ')
    for i in range(len(uniques[:5])):
        if i != len(uniques[:5]) - 1:
            if train_data[feature].dtype == object:
                print(uniques[i], end=', ')
            else:
                print(np.round(uniques[i], 2), end=', ')
        else:
            if train_data[feature].dtype == object:
                print(uniques[i])
            else:
                print(np.round(uniques[i], 2))
colors = ['orange', 'green', 'purple', 'deeppink', 'blue']

def show_distribution(feature):
    plt.figure(figsize=(8, 4), dpi=80)
    if train_data[feature].dtype == object:
        print('Mapping values:', end=' ')
        for i in range(len(set(pd.factorize(train_data[feature])[0]))):
            if list(set(pd.factorize(train_data[feature])[0]))[i] == -1:
                print('nan - -1', end=', ')
            else:
                print(list(set(pd.factorize(train_data[feature])[1]))[i] + ' - '
                      + str(list(set(pd.factorize(train_data[feature])[0]))[i]), end=', ')
        sns.kdeplot(pd.factorize(train_data[feature])[0][train_data[train_data.diagnosis == 'normal'].index],
                    label='normal', linewidth=3, shade=True, color='green', alpha=.5)
        sns.kdeplot(pd.factorize(train_data[feature])[0][train_data[train_data.diagnosis == 'pre_alzheimer'].index],
                    label='pre_alzheimer', linewidth=3, shade=True, color='orange', alpha=.5)
        sns.kdeplot(pd.factorize(train_data[feature])[0][train_data[train_data.diagnosis == 'post_alzheimer'].index],
                    label='post_alzheimer', linewidth=3, shade=True, color='blue', alpha=.5)
    else:
        sns.kdeplot(train_data.loc[train_data.diagnosis == 'normal', feature],
                    label='normal', linewidth=3, shade=True, color='green', alpha=.5)
        sns.kdeplot(train_data.loc[train_data.diagnosis == 'pre_alzheimer', feature],
                    label='pre_alzheimer', linewidth=3, shade=True, color='orange', alpha=.5)
        sns.kdeplot(train_data.loc[train_data.diagnosis == 'post_alzheimer', feature],
                    label='post_alzheimer', linewidth=3, shade=True, color='blue', alpha=.5)
    plt.xlabel('Value')
    plt.legend()
    plt.title(feature)
    plt.show()
def show_distribution_hist(feature):
    df = train_data.copy()
    if feature == 'intersection_pos_rel_centre':
        print('Mapping values:', end=' ')
        for i in range(len(set(pd.factorize(df[feature])[0]))):
            if list(set(pd.factorize(df[feature])[0]))[i] == -1:
                print('nan - -1', end=', ')
            else:
                print(list(set(pd.factorize(train_data[feature])[1]))[i] + ' - '
                      + str(list(set(pd.factorize(train_data[feature])[0]))[i]), end=', ')
        df['intersection_pos_rel_centre'] = pd.factorize(df['intersection_pos_rel_centre'])[0]
    _, ax = plt.subplots(1, 3, figsize=(16, 4), dpi=80)
    sns.histplot(df.loc[df.diagnosis == 'normal', feature],
                 ax=ax[0], label='normal', color='green', stat='probability', bins=10)
    sns.histplot(df.loc[df.diagnosis == 'pre_alzheimer', feature],
                 ax=ax[1], label='pre_alzheimer', color='orange', stat='probability', bins=10)
    sns.histplot(df.loc[df.diagnosis == 'post_alzheimer', feature],
                 ax=ax[2], label='post_alzheimer', color='blue', stat='probability', bins=10)
    ax[0].legend()
    ax[1].legend()
    ax[2].legend()
    plt.show()
In [ ]:
def show_corr(features):
    features_corr = features.copy()
    features_corr.append('diagnosis')
    df_corr = train_data[features_corr].copy()
    df_corr['diagnosis'] = pd.factorize(df_corr['diagnosis'])[0]
    if 'intersection_pos_rel_centre' in features_corr:
        df_corr['intersection_pos_rel_centre'] = pd.factorize(df_corr['intersection_pos_rel_centre'])[0]
    corr = df_corr.corr()
    mask = np.triu(np.ones_like(corr, dtype=bool))
    plt.figure(figsize=(20, 8))
    sns.heatmap(corr,
                mask=mask,
                cmap=sns.color_palette('dark:salmon_r', as_cmap=True),
                annot=True,
                center=0,
                linewidths=.5, cbar_kws={'shrink': .5})
    plt.show()
    del df_corr
    del features_corr
    del corr
    del mask
Clock and Digit Features¶
In [ ]:
clock_features = []
Final Rotation Angle¶
In [ ]:
feature = 'final_rotation_angle'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
clock_features.append(feature)
regr_features.append(feature)
Number of Digits¶
In [ ]:
feature = 'number_of_digits'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
clock_features.append(feature)
regr_features.append(feature)
Missing Digit Dummy Variables¶
In [ ]:
feature = 'missing_digit_1'
# similarly, there are 11 more dummy variables, one for each digit (missing_digit_2, missing_digit_3, etc.)
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
for i in range(1, 13):
    feature = 'missing_digit_{}'.format(i)
    cat_features.append(feature)
    clock_features.append(feature)
Deviation of Axis Digits (3, 6, 9 and 12) from Mid Axes¶
In [ ]:
feature = 'deviation_dist_from_mid_axis'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
clock_features.append(feature)
regr_features.append(feature)
Between Axis Digits Angle Metrics¶
In [ ]:
feature = 'between_axis_digits_angle_sum'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
clock_features.append(feature)
regr_features.append(feature)
In [ ]:
feature = 'between_axis_digits_angle_var'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
clock_features.append(feature)
regr_features.append(feature)
Between Digits Angle Metrics¶
In [ ]:
feature = 'between_digits_angle_cw_sum'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
clock_features.append(feature)
regr_features.append(feature)
In [ ]:
feature = 'between_digits_angle_cw_var'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
clock_features.append(feature)
regr_features.append(feature)
In [ ]:
feature = 'between_digits_angle_ccw_sum'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
clock_features.append(feature)
regr_features.append(feature)
In [ ]:
feature = 'between_digits_angle_ccw_var'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
clock_features.append(feature)
regr_features.append(feature)
Sequence Flag Clock Wise and Counter Clock Wise¶
In [ ]:
feature = 'sequence_flag_cw'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
clock_features.append(feature)
cat_features.append(feature)
In [ ]:
feature = 'sequence_flag_ccw'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
clock_features.append(feature)
cat_features.append(feature)
Correlation of clock and digits features¶
In [ ]:
show_corr(clock_features)
Hand Features¶
In [ ]:
hand_features = []
Number of Hands¶
In [ ]:
feature = 'number_of_hands'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
hand_features.append(feature)
regr_features.append(feature)
In [ ]:
feature = 'hand_count_dummy'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
hand_features.append(feature)
cat_features.append(feature)
Hand Length¶
In [ ]:
feature = 'hour_hand_length'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
hand_features.append(feature)
regr_features.append(feature)
In [ ]:
feature = 'minute_hand_length'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
hand_features.append(feature)
regr_features.append(feature)
In [ ]:
feature = 'single_hand_length'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
hand_features.append(feature)
regr_features.append(feature)
In [ ]:
feature = 'clockhand_ratio'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
hand_features.append(feature)
regr_features.append(feature)
In [ ]:
feature = 'clockhand_diff'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
hand_features.append(feature)
regr_features.append(feature)
Angle Between Hands¶
In [ ]:
feature = 'angle_between_hands'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
hand_features.append(feature)
regr_features.append(feature)
Deviation of Intersection Point of Hands from Geometric Centre¶
In [ ]:
feature = 'deviation_from_centre'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
hand_features.append(feature)
regr_features.append(feature)
In [ ]:
feature = 'intersection_pos_rel_centre'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
hand_features.append(feature)
cat_features.append(feature)
The Proximity of Hour and Minute from 11 and 2 Respectively¶
In [ ]:
feature = 'hour_proximity_from_11'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
hand_features.append(feature)
regr_features.append(feature)
In [ ]:
feature = 'minute_proximity_from_2'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
hand_features.append(feature)
regr_features.append(feature)
Digit Pointed by Hour and Minute Hand¶
In [ ]:
feature = 'hour_pointing_digit'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
hand_features.append(feature)
cat_features.append(feature)
In [ ]:
feature = 'minute_pointing_digit'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
hand_features.append(feature)
cat_features.append(feature)
Clock Hand Errors¶
In [ ]:
feature = 'eleven_ten_error'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
hand_features.append(feature)
cat_features.append(feature)
In [ ]:
feature = 'other_error'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
hand_features.append(feature)
cat_features.append(feature)
Correlation of hand features¶
In [ ]:
show_corr(hand_features)
Circle Features¶
In [ ]:
circle_features = []
Ellipse to Circle Ratio¶
In [ ]:
feature = 'ellipse_circle_ratio'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
circle_features.append(feature)
regr_features.append(feature)
Predicted Tremor and the Number of Defects¶
In [ ]:
feature = 'count_defects'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
circle_features.append(feature)
regr_features.append(feature)
In [ ]:
feature = 'pred_tremor'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
circle_features.append(feature)
cat_features.append(feature)
Percentage of Digits inside the Clock Face¶
In [ ]:
feature = 'percentage_inside_ellipse'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
circle_features.append(feature)
regr_features.append(feature)
The Length of the Major and Minor Axis of the Fitted Ellipse¶
In [ ]:
feature = 'double_major'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
circle_features.append(feature)
regr_features.append(feature)
In [ ]:
feature = 'double_minor'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
circle_features.append(feature)
regr_features.append(feature)
Area of the Top, Bottom, Left and Right Hemisphere of the Circle¶
In [ ]:
feature = 'top_area_perc'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
circle_features.append(feature)
regr_features.append(feature)
In [ ]:
feature = 'bottom_area_perc'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
circle_features.append(feature)
regr_features.append(feature)
In [ ]:
feature = 'left_area_perc'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
circle_features.append(feature)
regr_features.append(feature)
In [ ]:
feature = 'right_area_perc'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
circle_features.append(feature)
regr_features.append(feature)
Horizontal and vertical distances¶
In [ ]:
feature = 'horizontal_dist'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
circle_features.append(feature)
regr_features.append(feature)
In [ ]:
feature = 'vertical_dist'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
circle_features.append(feature)
regr_features.append(feature)
Euclidean Distance from Digits¶
In [ ]:
feature = 'euc_dist_digit_1'
# similarly, there are 11 more variables, one for each digit (euc_dist_digit_2, euc_dist_digit_3, etc.)
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
for i in range(1, 13):
    feature = 'euc_dist_digit_{}'.format(i)
    regr_features.append(feature)
    circle_features.append(feature)
Distance of Digits from clock center¶
In [ ]:
feature = '1 dist from cen'
# similarly, there are 11 more variables, one for each digit ('2 dist from cen', '3 dist from cen', etc.)
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
for i in range(1, 13):
    feature = '{} dist from cen'.format(i)
    regr_features.append(feature)
    circle_features.append(feature)
Area, Height, Width of Digit Bounding Boxes Metrics¶
In [ ]:
feature = 'area_digit_1'
# similarly, there are 11 more variables, one for each digit (area_digit_2, area_digit_3, etc.)
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
for i in range(1, 13):
    feature = 'area_digit_{}'.format(i)
    regr_features.append(feature)
    circle_features.append(feature)
In [ ]:
feature = 'height_digit_1'
# similarly, there are 11 more variables, one for each digit (height_digit_2, height_digit_3, etc.)
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
for i in range(1, 13):
    feature = 'height_digit_{}'.format(i)
    regr_features.append(feature)
    circle_features.append(feature)
In [ ]:
feature = 'width_digit_1'
# similarly, there are 11 more variables, one for each digit (width_digit_2, width_digit_3, etc.)
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
for i in range(1, 13):
    feature = 'width_digit_{}'.format(i)
    regr_features.append(feature)
    circle_features.append(feature)
In [ ]:
feature = 'variance_width'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
regr_features.append(feature)
circle_features.append(feature)
In [ ]:
feature = 'variance_height'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
regr_features.append(feature)
circle_features.append(feature)
In [ ]:
feature = 'variance_area'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
regr_features.append(feature)
circle_features.append(feature)
Time Features¶
In [ ]:
feature = 'time_diff'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
regr_features.append(feature)
circle_features.append(feature)
Centre Dot Detection¶
In [ ]:
feature = 'centre_dot_detect'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
cat_features.append(feature)
circle_features.append(feature)
Horizontal and vertical count¶
In [ ]:
feature = 'hor_count'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
circle_features.append(feature)
regr_features.append(feature)
In [ ]:
feature = 'vert_count'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
circle_features.append(feature)
regr_features.append(feature)
Correlation of circle features¶
In [ ]:
show_corr(circle_features[:20])
In [ ]:
show_corr(circle_features[20:40])
In [ ]:
show_corr(circle_features[40:60])
In [ ]:
show_corr(circle_features[60:])
Overview of all data¶
Functions¶
In [ ]:
def show_miss():
    df_miss = 100 * train_data.isnull().sum().sort_values(ascending=False) / train_data.shape[0]
    for i in [1, 5, 10, 15, 20, 30, 50]:
        print('{} columns have more than {}% missing values'.format(len(df_miss[df_miss > i]), i))
    plt.figure(figsize=(10, 10))
    sns.barplot(x=df_miss.head(26).values, y=df_miss.head(26).index)
    plt.title('Top 26 columns by percentage of missing values')

def show_corr_all():
    df_corr = train_data.copy()
    df_corr['diagnosis'] = pd.factorize(df_corr['diagnosis'])[0]
    df_corr['intersection_pos_rel_centre'] = pd.factorize(df_corr['intersection_pos_rel_centre'])[0]
    df_corr = df_corr.corr()['diagnosis'].sort_values(ascending=False).iloc[1:-2]
    df_corr = pd.concat([df_corr.head(20), df_corr.tail(10)])
    plt.figure(figsize=(10, 10))
    sns.barplot(x=df_corr.values, y=df_corr.index)
    plt.title('Top 20 positive and top 10 negative correlations with the diagnosis')
def dist_diagnosis():
    # value_counts keeps labels and counts aligned (unique() + groupby can return them in different orders)
    counts = train_data['diagnosis'].value_counts()
    plt.figure(figsize=(10, 5))
    sns.barplot(x=counts.index, y=counts.values, palette='rocket')
    plt.ylabel('Count')
    plt.show()
Missing values¶
In [ ]:
show_miss()
Categorical and regression features¶
I split the features into categorical and numerical (regression) groups based on my own judgment; this is not the only reasonable split.
In [ ]:
print(bold + 'All categorical features as a list:' + ordinary)
print(cat_features)
In [ ]:
train_data[cat_features].head()
Out[ ]:
In [ ]:
train_data[regr_features].head()
Out[ ]:
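As a quick sanity check, the two lists should cover every remaining column exactly once. The sketch below only assumes the `row_id` and `diagnosis` columns used elsewhere in this notebook; any column it reports as unassigned or duplicated simply wasn't classified above.
In [ ]:
feature_cols = [c for c in train_data.columns if c not in ('row_id', 'diagnosis')]
# Columns that ended up in neither group, and columns that ended up in both
unassigned = [c for c in feature_cols if c not in cat_features and c not in regr_features]
overlap = sorted(set(cat_features) & set(regr_features))
print('Unassigned columns:', unassigned)
print('Columns in both groups:', overlap)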
Top correlations¶
In [ ]:
show_corr_all()
Diagnosis distribution¶
In [ ]:
dist_diagnosis()
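The bar plot suggests the classes are far from balanced. A one-line sketch (using only `train_data` from above) prints the exact class shares; this imbalance is the reason for the `auto_class_weights` setting in the model below.
In [ ]:
# Share of each diagnosis class in the training data
print(train_data['diagnosis'].value_counts(normalize=True).round(3))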
Training¶
In [ ]:
train_data.dtypes[train_data.dtypes == object]
Out[ ]:
In [ ]:
# CatBoost does not accept NaN in categorical features, so fill them with a sentinel value
train_data[cat_features] = train_data[cat_features].fillna(999)
int_cat_features = [feature for feature in cat_features if feature != 'intersection_pos_rel_centre']
train_data[int_cat_features] = train_data[int_cat_features].astype(int)
In [ ]:
X_train, X_test, y_train, y_test = train_test_split(train_data.drop(['row_id', 'diagnosis'], axis=1),
                                                    train_data['diagnosis'],
                                                    test_size=0.15, stratify=train_data['diagnosis'], random_state=17)
model = CatBoostClassifier(loss_function='MultiClass',
                           auto_class_weights='SqrtBalanced')
model.fit(X_train, y_train, eval_set=(X_test, y_test), cat_features=cat_features, verbose=100)
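Before saving the model, it can be useful to score the 15% hold-out split locally. This is only a rough sketch: it uses multi-class log loss from scikit-learn as a proxy, which may not match exactly how the leaderboard computes its score.
In [ ]:
from sklearn.metrics import log_loss

# Probabilities on the hold-out split; columns follow model.classes_
val_proba = model.predict_proba(X_test)
print('Hold-out multi-class log loss:', round(log_loss(y_test, val_proba, labels=model.classes_), 4))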
In [ ]:
model.save_model(AICROWD_ASSETS_DIR + '/model_123')
np.save(AICROWD_ASSETS_DIR + '/cat', cat_features)
Prediction phase 🔎¶
In [ ]:
from catboost import CatBoostClassifier
model = CatBoostClassifier()
model.load_model(AICROWD_ASSETS_DIR + '/model_123')
Out[ ]:
Load test data¶
In [ ]:
test_data = pd.read_csv(AICROWD_DATASET_PATH)
cat_features = np.load(AICROWD_ASSETS_DIR + '/cat.npy', allow_pickle=True)
In [ ]:
# Apply the same categorical preprocessing as for the training data
test_data[cat_features] = test_data[cat_features].fillna(999)
int_cat_features = [feature for feature in cat_features if feature != 'intersection_pos_rel_centre']
test_data[int_cat_features] = test_data[int_cat_features].astype(int)
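A small defensive check (a sketch that assumes the validation file shares the training schema) is to confirm that every feature the model was trained on is present in the test data before predicting:
In [ ]:
# CatBoost stores the training feature names in model.feature_names_
missing_cols = [c for c in model.feature_names_ if c not in test_data.columns]
assert not missing_cols, 'Test data is missing columns: {}'.format(missing_cols)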
Generate predictions¶
In [ ]:
preds = model.predict_proba(test_data.drop(['row_id'], axis=1))
In [ ]:
# predict_proba columns follow model.classes_; the mapping below assumes the
# order (normal, post_alzheimer, pre_alzheimer)
predictions = {
    "row_id": test_data["row_id"].values,
    "normal_diagnosis_probability": preds[:, 0],
    "post_alzheimer_diagnosis_probability": preds[:, 1],
    "pre_alzheimer_diagnosis_probability": preds[:, 2],
}
predictions_df = pd.DataFrame.from_dict(predictions)
In [ ]:
# Renormalize so the three class probabilities sum to exactly 1 in every row
pred_sum = (predictions_df['normal_diagnosis_probability']
            + predictions_df['post_alzheimer_diagnosis_probability']
            + predictions_df['pre_alzheimer_diagnosis_probability'])
predictions_df['normal_diagnosis_probability'] /= pred_sum
predictions_df['post_alzheimer_diagnosis_probability'] /= pred_sum
predictions_df['pre_alzheimer_diagnosis_probability'] /= pred_sum
# Sanity check: every row should now sum to 1
predictions_df['normal_diagnosis_probability'] + predictions_df['post_alzheimer_diagnosis_probability'] + predictions_df['pre_alzheimer_diagnosis_probability']
Out[ ]:
In [ ]:
predictions_df.to_csv(AICROWD_PREDICTIONS_PATH, index=False)
Submit to AIcrowd 🚀¶
In [ ]:
!DATASET_PATH=$AICROWD_DATASET_PATH \
aicrowd notebook submit \
--assets-dir $AICROWD_ASSETS_DIR \
--challenge addi-alzheimers-detection-challenge