ADDI Alzheimers Detection Challenge

Detailed Data Analysis & Simple CatBoost - 0.640 on LB

Description of features and the entire dataset, selection of categorical features by logic, CatBoost

sweetlhare

I tried to make detailed graphs of the features and how each one relates to the diagnosis. I also put together basic summaries for the entire dataset and trained a model. The analysis follows the organizers' PDF, so you can easily look up the description of any feature there. The LB scores are 0.640 and 0.447.

Setup AIcrowd Utilities 🛠

In [ ]:
!pip install -q -U aicrowd-cli
In [ ]:
%load_ext aicrowd.magic

AIcrowd Runtime Configuration 🧷

In [ ]:
import os

# Please use the absolute path for the location of the dataset.
# Or you can use a relative path, e.g. os.getcwd() + "/test_data/validation.csv"
AICROWD_TRAIN_DATASET_PATH = os.getenv("TRAIN_DATASET_PATH", "/ds_shared_drive/train.csv")
AICROWD_DATASET_PATH = os.getenv("DATASET_PATH", "/ds_shared_drive/validation.csv")
AICROWD_PREDICTIONS_PATH = os.getenv("PREDICTIONS_PATH", "predictions.csv")
AICROWD_ASSETS_DIR = "assets"

Install packages 🗃

In [ ]:
!pip install numpy pandas catboost sklearn
Requirement already satisfied: numpy in ./conda/lib/python3.8/site-packages (1.20.2)
Requirement already satisfied: pandas in ./conda/lib/python3.8/site-packages (1.2.4)
Requirement already satisfied: catboost in ./conda/lib/python3.8/site-packages (0.25.1)
Requirement already satisfied: sklearn in ./conda/lib/python3.8/site-packages (0.0)
Requirement already satisfied: python-dateutil>=2.7.3 in ./conda/lib/python3.8/site-packages (from pandas) (2.8.1)
Requirement already satisfied: pytz>=2017.3 in ./conda/lib/python3.8/site-packages (from pandas) (2021.1)
Requirement already satisfied: scipy in ./conda/lib/python3.8/site-packages (from catboost) (1.6.3)
Requirement already satisfied: matplotlib in ./conda/lib/python3.8/site-packages (from catboost) (3.4.1)
Requirement already satisfied: graphviz in ./conda/lib/python3.8/site-packages (from catboost) (0.16)
Requirement already satisfied: six in ./conda/lib/python3.8/site-packages (from catboost) (1.15.0)
Requirement already satisfied: plotly in ./conda/lib/python3.8/site-packages (from catboost) (4.14.3)
Requirement already satisfied: scikit-learn in ./conda/lib/python3.8/site-packages (from sklearn) (0.24.2)
Requirement already satisfied: pyparsing>=2.2.1 in ./conda/lib/python3.8/site-packages (from matplotlib->catboost) (2.4.7)
Requirement already satisfied: pillow>=6.2.0 in ./conda/lib/python3.8/site-packages (from matplotlib->catboost) (8.2.0)
Requirement already satisfied: cycler>=0.10 in ./conda/lib/python3.8/site-packages (from matplotlib->catboost) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in ./conda/lib/python3.8/site-packages (from matplotlib->catboost) (1.3.1)
Requirement already satisfied: retrying>=1.3.3 in ./conda/lib/python3.8/site-packages (from plotly->catboost) (1.3.3)
Requirement already satisfied: joblib>=0.11 in ./conda/lib/python3.8/site-packages (from scikit-learn->sklearn) (1.0.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in ./conda/lib/python3.8/site-packages (from scikit-learn->sklearn) (2.1.0)

Define preprocessing code

Import common packages

In [ ]:
import numpy as np
import os
import random
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier

Training phase

Load training data

In [ ]:
train_data = pd.read_csv(AICROWD_TRAIN_DATASET_PATH)
train_data.head()
Out[ ]:
row_id number_of_digits missing_digit_1 missing_digit_2 missing_digit_3 missing_digit_4 missing_digit_5 missing_digit_6 missing_digit_7 missing_digit_8 ... bottom_area_perc left_area_perc right_area_perc hor_count vert_count eleven_ten_error other_error time_diff centre_dot_detect diagnosis
0 S0CIXBKIUEOUBNURP 12.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.526170 0.524975 0.474667 0 0 0 1 -105.0 0.0 normal
1 IW1Z4Z3H720OPW8LL 12.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000810 0.516212 0.483330 0 1 0 1 NaN NaN normal
2 PVUGU14JRSU44ZADT 12.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.488109 0.550606 0.449042 0 0 0 0 0.0 0.0 normal
3 RW5UTGMB9H67LWJHX 7.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 ... NaN NaN NaN 1 0 0 1 NaN NaN normal
4 W0IM2V6F6UP5LYS3E 12.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.512818 0.511865 0.487791 0 1 0 0 0.0 1.0 normal

5 rows × 122 columns

Features exploration

In [ ]:
regr_features = []
cat_features = []

Functions

In [ ]:
def get_corr(feature):
    # Pearson correlation between one feature and the factorized diagnosis
    features_corr = [feature, 'diagnosis']
    df_corr = train_data[features_corr].copy()
    df_corr['diagnosis'] = pd.factorize(df_corr['diagnosis'])[0]
    if 'intersection_pos_rel_centre' in features_corr:
        df_corr['intersection_pos_rel_centre'] = pd.factorize(df_corr['intersection_pos_rel_centre'])[0]
    return df_corr.corr().values[0, 1]
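As a quick usage example (a sketch, assuming train_data is loaded), get_corr returns the Pearson correlation between one feature and the factorized diagnosis; the value below matches the summary printed for this feature further down:

# Sign depends on pd.factorize's first-appearance ordering,
# so treat the magnitude as the meaningful part
print(round(get_corr('number_of_digits'), 2))  # -0.21 on this training set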
In [ ]:
bold = '\033[1m'
ordinary = '\033[0m'
def feature_describe(feature):
    # Print a compact summary: dtype, missing share, correlation, range, unique values
    print(bold+'Data type:'+ordinary, train_data[feature].dtype)
    print(bold+'Number of missing values: '+ordinary + str(round(100 * train_data[feature].isnull().sum() / train_data.shape[0], 2)) + '%')
    print(bold+'Correlation with the diagnosis:'+ordinary, round(get_corr(feature), 2))
    if train_data[feature].dtype != object:
        print(bold+'Min:'+ordinary, round(train_data[feature].min(), 2))
        print(bold+'Mean:'+ordinary, round(train_data[feature].mean(), 2))
        print(bold+'Max:'+ordinary, round(train_data[feature].max(), 2))
    unique_number = train_data[feature].nunique()
    uniques = train_data[feature].unique()
    print(bold+'Number of unique values:'+ordinary, unique_number)
    print(bold+'Example of unique values:'+ordinary, end=' ')    
    if train_data[feature].dtype == object:
        print(', '.join(str(u) for u in uniques[:5]))
    else:
        print(', '.join(str(np.round(u, 2)) for u in uniques[:5]))
            
colors = ['orange', 'green', 'purple', 'deeppink', 'blue']
            
def show_distribution(feature):
    # KDE of the feature by diagnosis class (object features are factorized first)
    plt.figure(figsize=(8, 4), dpi=80)
    if train_data[feature].dtype == object:
        # Show how pd.factorize maps each category (and NaN) to an integer code
        codes, labels = pd.factorize(train_data[feature])
        mapping = ['{} - {}'.format(label, code) for code, label in enumerate(labels)]
        if -1 in codes:
            mapping.append('nan - -1')
        print('Mapping values:', ', '.join(mapping))
        sns.kdeplot(codes[train_data[train_data.diagnosis == 'normal'].index], 
                    label='normal', linewidth=3, shade=True, color='green', alpha=.5)
        sns.kdeplot(codes[train_data[train_data.diagnosis == 'pre_alzheimer'].index], 
                    label='pre_alzheimer', linewidth=3, shade=True, color='orange', alpha=.5)
        sns.kdeplot(codes[train_data[train_data.diagnosis == 'post_alzheimer'].index], 
                    label='post_alzheimer', linewidth=3, shade=True, color='blue', alpha=.5)
    else:
        sns.kdeplot(train_data.loc[train_data.diagnosis == 'normal', feature], 
                    label='normal', linewidth=3, shade=True, color='green', alpha=.5)
        sns.kdeplot(train_data.loc[train_data.diagnosis == 'pre_alzheimer', feature], 
                    label='pre_alzheimer', linewidth=3, shade=True, color='orange', alpha=.5)
        sns.kdeplot(train_data.loc[train_data.diagnosis == 'post_alzheimer', feature], 
                    label='post_alzheimer', linewidth=3, shade=True, color='blue', alpha=.5)
    plt.xlabel('Value')
    plt.legend()
    plt.title(feature)
    plt.show()
    
def show_distribution_hist(feature):
    # Side-by-side per-class histograms (probability-normalized, 10 bins)
    
    df = train_data.copy()
    if feature == 'intersection_pos_rel_centre':
        # Same mapping printout as in show_distribution
        codes, labels = pd.factorize(df[feature])
        mapping = ['{} - {}'.format(label, code) for code, label in enumerate(labels)]
        if -1 in codes:
            mapping.append('nan - -1')
        print('Mapping values:', ', '.join(mapping))

    df['intersection_pos_rel_centre'] = pd.factorize(df['intersection_pos_rel_centre'])[0]
        
    _, ax = plt.subplots(1, 3, figsize=(16, 4), dpi=80)
    sns.histplot(df.loc[df.diagnosis == 'normal', feature], 
                 ax=ax[0], label='normal', color='green', stat='probability', bins=10)
    sns.histplot(df.loc[df.diagnosis == 'pre_alzheimer', feature], 
                 ax=ax[1], label='pre_alzheimer', color='orange', stat='probability', bins=10)
    sns.histplot(df.loc[df.diagnosis == 'post_alzheimer', feature], 
                 ax=ax[2], label='post_alzheimer', color='blue', stat='probability', bins=10)
    ax[0].legend()
    ax[1].legend()
    ax[2].legend()
    plt.show()
In [ ]:
def show_corr(features):
    # Lower-triangle correlation heatmap for the given features plus the diagnosis
    features_corr = features.copy()
    features_corr.append('diagnosis')
    df_corr = train_data[features_corr].copy()
    df_corr['diagnosis'] = pd.factorize(df_corr['diagnosis'])[0]
    if 'intersection_pos_rel_centre' in features_corr:
        df_corr['intersection_pos_rel_centre'] = pd.factorize(df_corr['intersection_pos_rel_centre'])[0]
    
    corr = df_corr.corr()
    mask = np.triu(np.ones_like(corr, dtype=bool))
    plt.figure(figsize=(20, 8))
    sns.heatmap(corr, 
                mask=mask,
                cmap=sns.color_palette('dark:salmon_r', as_cmap=True),
                annot=True,
                center=0,
                linewidths=.5, cbar_kws={'shrink': .5})
    plt.show()
    

Clock and Digit Features

In [ ]:
clock_features = []

Final Rotation Angle

In [ ]:
feature = 'final_rotation_angle'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
clock_features.append(feature)
regr_features.append(feature)
Data type: float64
Number of missing values: 0.23%
Correlation with the diagnosis: 0.05
Min: 0.0
Mean: 65.74
Max: 330.0
Number of unique values: 12
Example of unique values: 0.0, 330.0, 90.0, 270.0, 30.0

Number of Digits

In [ ]:
feature = 'number_of_digits'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
clock_features.append(feature)
regr_features.append(feature)
Data type: float64
Number of missing values: 0.23%
Correlation with the diagnosis: -0.21
Min: 1.0
Mean: 10.3
Max: 17.0
Number of unique values: 17
Example of unique values: 12.0, 7.0, 2.0, 11.0, 4.0

Missing Digit Dummy Variables

In [ ]:
feature = 'missing_digit_1'
# similarly, there are 11 more such variables, one per digit (missing_digit_2, missing_digit_3, etc.)

feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)

for i in range(1, 13):
    feature = 'missing_digit_{}'.format(i)
    cat_features.append(feature)
    clock_features.append(feature)
Data type: float64
Number of missing values: 0.23%
Correlation with the diagnosis: 0.12
Min: 0.0
Mean: 0.22
Max: 1.0
Number of unique values: 2
Example of unique values: 0.0, 1.0, nan

Deviation of Axis Digits (3, 6, 9 and 12) from Mid Axes

In [ ]:
feature = 'deviation_dist_from_mid_axis'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
clock_features.append(feature)
regr_features.append(feature)
Data type: float64
Number of missing values: 1.77%
Correlation with the diagnosis: 0.04
Min: 0.0
Mean: 32.2
Max: 125.71
Number of unique values: 4336
Example of unique values: 14.1, 21.12, 19.66, 34.02, 8.71

Between Axis Digits Angle Metrics

In [ ]:
feature = 'between_axis_digits_angle_sum'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
clock_features.append(feature)
regr_features.append(feature)
Data type: float64
Number of missing values: 11.09%
Correlation with the diagnosis: -0.08
Min: 0.0
Mean: 352.14
Max: 360.0
Number of unique values: 74
Example of unique values: nan, 360.0, 0.0, 352.32, 352.7
In [ ]:
feature = 'between_axis_digits_angle_var'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
clock_features.append(feature)
regr_features.append(feature)
Data type: float64
Number of missing values: 5.82%
Correlation with the diagnosis: 0.12
Min: 0.0
Mean: 2587.13
Max: 63116.01
Number of unique values: 30816
Example of unique values: 225.27, 382.13, 439.97, 6465.58, 59.51

Between Digits Angle Metrics

In [ ]:
feature = 'between_digits_angle_cw_sum'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
clock_features.append(feature)
regr_features.append(feature)
Data type: float64
Number of missing values: 38.9%
Correlation with the diagnosis: -0.12
Min: 0.0
Mean: 355.1
Max: 360.0
Number of unique values: 4
Example of unique values: 360.0, nan, 0.0, 343.05, 180.0
In [ ]:
feature = 'between_digits_angle_cw_var'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
clock_features.append(feature)
regr_features.append(feature)
Data type: float64
Number of missing values: 2.11%
Correlation with the diagnosis: 0.18
Min: 0.0
Mean: 3081.78
Max: 63259.73
Number of unique values: 32079
Example of unique values: 72.27, 69.82, 49.54, 13467.73, 50.76
In [ ]:
feature = 'between_digits_angle_ccw_sum'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
clock_features.append(feature)
regr_features.append(feature)
Data type: float64
Number of missing values: 97.43%
Correlation with the diagnosis: 0.04
Min: 0.0
Mean: 243.83
Max: 360.0
Number of unique values: 3
Example of unique values: nan, 360.0, 0.0, 228.66
In [ ]:
feature = 'between_digits_angle_ccw_var'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
clock_features.append(feature)
regr_features.append(feature)
Data type: float64
Number of missing values: 2.11%
Correlation with the diagnosis: 0.17
Min: 0.0
Mean: 3157.42
Max: 63259.73
Number of unique values: 32079
Example of unique values: 72.27, 69.82, 49.54, 13467.73, 50.76

Sequence Flag Clock Wise and Counter Clock Wise

In [ ]:
feature = 'sequence_flag_cw'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
clock_features.append(feature)
cat_features.append(feature)
Data type: float64
Number of missing values: 0.92%
Correlation with the diagnosis: -0.12
Min: 0.0
Mean: 0.75
Max: 1.0
Number of unique values: 2
Example of unique values: 1.0, 0.0, nan
In [ ]:
feature = 'sequence_flag_ccw'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
clock_features.append(feature)
cat_features.append(feature)
Data type: float64
Number of missing values: 0.92%
Correlation with the diagnosis: 0.14
Min: 0.0
Mean: 0.02
Max: 1.0
Number of unique values: 2
Example of unique values: 0.0, 1.0, nan

Correlation of clock and digits features

In [ ]:
show_corr(clock_features)

Hand Features

In [ ]:
hand_features = []

Number of Hands

In [ ]:
feature = 'number_of_hands'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
hand_features.append(feature)
regr_features.append(feature)
Data type: float64
Number of missing values: 6.57%
Correlation with the diagnosis: -0.12
Min: 1.0
Mean: 1.77
Max: 8.0
Number of unique values: 7
Example of unique values: 2.0, 1.0, 4.0, nan, 3.0
In [ ]:
feature = 'hand_count_dummy'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
hand_features.append(feature)
cat_features.append(feature)
Data type: float64
Number of missing values: 6.57%
Correlation with the diagnosis: -0.13
Min: 1.0
Mean: 1.77
Max: 3.0
Number of unique values: 3
Example of unique values: 2.0, 1.0, 3.0, nan

Hand Length

In [ ]:
feature = 'hour_hand_length'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
hand_features.append(feature)
regr_features.append(feature)
Data type: float64
Number of missing values: 30.25%
Correlation with the diagnosis: -0.0
Min: 23.16
Mean: 60.54
Max: 123.52
Number of unique values: 11931
Example of unique values: 53.16, nan, 70.54, 61.79, 59.54
In [ ]:
feature = 'minute_hand_length'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
hand_features.append(feature)
regr_features.append(feature)
Data type: float64
Number of missing values: 30.25%
Correlation with the diagnosis: -0.03
Min: 33.19
Mean: 80.87
Max: 133.69
Number of unique values: 13480
Example of unique values: 77.9, nan, 75.65, 68.25, 81.46
In [ ]:
feature = 'single_hand_length'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
hand_features.append(feature)
regr_features.append(feature)
Data type: float64
Number of missing values: 76.38%
Correlation with the diagnosis: -0.01
Min: 18.02
Mean: 74.6
Max: 292.85
Number of unique values: 6678
Example of unique values: nan, 81.24, 37.87, 48.28, 98.25
In [ ]:
feature = 'clockhand_ratio'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
hand_features.append(feature)
regr_features.append(feature)
Data type: float64
Number of missing values: 30.9%
Correlation with the diagnosis: -0.03
Min: 1.0
Mean: 1.38
Max: 2.5
Number of unique values: 22642
Example of unique values: 1.47, nan, 1.07, 1.1, 1.37
In [ ]:
feature = 'clockhand_diff'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
hand_features.append(feature)
regr_features.append(feature)
Data type: float64
Number of missing values: 30.33%
Correlation with the diagnosis: -0.03
Min: 0.0
Mean: 20.27
Max: 69.87
Number of unique values: 22834
Example of unique values: 24.73, nan, 5.11, 6.46, 21.92

Angle Between Hands

In [ ]:
feature = 'angle_between_hands'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
hand_features.append(feature)
regr_features.append(feature)
Data type: float64
Number of missing values: 30.25%
Correlation with the diagnosis: -0.08
Min: 0.03
Mean: 90.17
Max: 179.52
Number of unique values: 22838
Example of unique values: 32.63, nan, 72.45, 107.05, 119.91

Deviation of Intersection Point of Hands from Geometric Centre

In [ ]:
feature = 'deviation_from_centre'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
hand_features.append(feature)
regr_features.append(feature)
Data type: float64
Number of missing values: 30.46%
Correlation with the diagnosis: 0.09
Min: 0.1
Mean: 17.42
Max: 298.72
Number of unique values: 22793
Example of unique values: 88.54, nan, 8.24, 18.54, 24.92
In [ ]:
feature = 'intersection_pos_rel_centre'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
hand_features.append(feature)
cat_features.append(feature)
Data type: object
Number of missing values: 30.25%
Correlation with the diagnosis: -0.12
Number of unique values: 4
Example of unique values: TL, nan, BL, TR, BR
Mapping values: TL - 0, BL - 1, TR - 2, BR - 3, nan - -1
Mapping values: TL - 0, BL - 1, TR - 2, BR - 3, nan - -1

The Proximity of Hour and Minute from 11 and 2 Respectively

In [ ]:
feature = 'hour_proximity_from_11'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
hand_features.append(feature)
regr_features.append(feature)
Data type: float64
Number of missing values: 38.4%
Correlation with the diagnosis: 0.05
Min: 0.0
Mean: 24.92
Max: 179.77
Number of unique values: 20153
Example of unique values: 27.08, nan, 4.64, 8.39, 118.9
In [ ]:
feature = 'minute_proximity_from_2'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
hand_features.append(feature)
regr_features.append(feature)
Data type: float64
Number of missing values: 39.23%
Correlation with the diagnosis: 0.09
Min: 0.0
Mean: 33.27
Max: 179.93
Number of unique values: 19899
Example of unique values: 69.66, nan, 0.56, 3.32, 113.84

Digit Pointed by Hour and Minute Hand

In [ ]:
feature = 'hour_pointing_digit'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
hand_features.append(feature)
cat_features.append(feature)
Data type: float64
Number of missing values: 30.81%
Correlation with the diagnosis: -0.02
Min: 1.0
Mean: 9.05
Max: 12.0
Number of unique values: 12
Example of unique values: 12.0, nan, 11.0, 2.0, 3.0
In [ ]:
feature = 'minute_pointing_digit'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
hand_features.append(feature)
cat_features.append(feature)
Data type: float64
Number of missing values: 30.81%
Correlation with the diagnosis: 0.09
Min: 1.0
Mean: 4.39
Max: 12.0
Number of unique values: 12
Example of unique values: 11.0, nan, 2.0, 3.0, 12.0

Clock Hand Errors

In [ ]:
feature = 'eleven_ten_error'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
hand_features.append(feature)
cat_features.append(feature)
Data type: int64
Number of missing values: 0.0%
Correlation with the diagnosis: 0.03
Min: 0
Mean: 0.03
Max: 1
Number of unique values: 2
Example of unique values: 0, 1
In [ ]:
feature = 'other_error'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
hand_features.append(feature)
cat_features.append(feature)
Data type: int64
Number of missing values: 0.0%
Correlation with the diagnosis: 0.13
Min: 0
Mean: 0.61
Max: 1
Number of unique values: 2
Example of unique values: 1, 0

Correlation of hand features

In [ ]:
show_corr(hand_features)

Circle Features

In [ ]:
circle_features = []

Ellipse to Circle Ratio

In [ ]:
feature = 'ellipse_circle_ratio'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
circle_features.append(feature)
regr_features.append(feature)
Data type: float64
Number of missing values: 2.25%
Correlation with the diagnosis: -0.05
Min: 0.0
Mean: 79.12
Max: 99.97
Number of unique values: 32038
Example of unique values: 63.43, 42.04, 77.64, 67.62, 90.86

Predicted Tremor and the Number of Defects

In [ ]:
feature = 'count_defects'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
circle_features.append(feature)
regr_features.append(feature)
Data type: int64
Number of missing values: 0.0%
Correlation with the diagnosis: -0.07
Min: 1
Mean: 93.49
Max: 176
Number of unique values: 172
Example of unique values: 94, 38, 103, 47, 158
In [ ]:
feature = 'pred_tremor'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
circle_features.append(feature)
cat_features.append(feature)
Data type: int64
Number of missing values: 0.0%
Correlation with the diagnosis: 0.07
Min: 0
Mean: 0.32
Max: 1
Number of unique values: 2
Example of unique values: 0, 1

Percentage of Digits inside the Clock Face

In [ ]:
feature = 'percentage_inside_ellipse'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
circle_features.append(feature)
regr_features.append(feature)
Data type: float64
Number of missing values: 0.93%
Correlation with the diagnosis: -0.01
Min: 0.0
Mean: 0.94
Max: 1.0
Number of unique values: 70
Example of unique values: 0.69, 0.25, 0.92, 0.86, 1.0

The Length of the Major and Minor Axis of the Fitted Ellipse

In [ ]:
feature = 'double_major'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
circle_features.append(feature)
regr_features.append(feature)
Data type: float64
Number of missing values: 0.72%
Correlation with the diagnosis: 0.01
Min: 9.7
Mean: 120.24
Max: 499.39
Number of unique values: 32328
Example of unique values: 132.67, 106.8, 119.38, 123.76, 124.05
In [ ]:
feature = 'double_minor'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
circle_features.append(feature)
regr_features.append(feature)
Data type: float64
Number of missing values: 0.0%
Correlation with the diagnosis: -0.04
Min: 0.0
Mean: 106.54
Max: 305.19
Number of unique values: 32469
Example of unique values: 82.43, 52.93, 100.03, 74.46, 117.91

Area of the Top, Bottom, Left and Right Hemisphere of the Circle

In [ ]:
feature = 'top_area_perc'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
circle_features.append(feature)
regr_features.append(feature)
Data type: float64
Number of missing values: 4.76%
Correlation with the diagnosis: -0.02
Min: 0.0
Mean: 0.52
Max: 1.0
Number of unique values: 31153
Example of unique values: 0.47, 1.0, 0.51, nan, 0.49
In [ ]:
feature = 'bottom_area_perc'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
circle_features.append(feature)
regr_features.append(feature)
Data type: float64
Number of missing values: 4.76%
Correlation with the diagnosis: 0.02
Min: 0.0
Mean: 0.47
Max: 1.0
Number of unique values: 31149
Example of unique values: 0.53, 0.0, 0.49, nan, 0.51
In [ ]:
feature = 'left_area_perc'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
circle_features.append(feature)
regr_features.append(feature)
Data type: float64
Number of missing values: 4.76%
Correlation with the diagnosis: 0.0
Min: 0.0
Mean: 0.53
Max: 1.0
Number of unique values: 31163
Example of unique values: 0.52, 0.52, 0.55, nan, 0.51
In [ ]:
feature = 'right_area_perc'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
circle_features.append(feature)
regr_features.append(feature)
Data type: float64
Number of missing values: 4.76%
Correlation with the diagnosis: -0.01
Min: 0.0
Mean: 0.46
Max: 1.0
Number of unique values: 31162
Example of unique values: 0.47, 0.48, 0.45, nan, 0.49

Horizontal and vertical distances

In [ ]:
feature = 'horizontal_dist'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
circle_features.append(feature)
regr_features.append(feature)
Data type: float64
Number of missing values: 0.13%
Correlation with the diagnosis: -0.02
Min: 0.0
Mean: 113.92
Max: 492.88
Number of unique values: 32597
Example of unique values: 98.67, 52.95, 106.74, 122.83, 123.53
In [ ]:
feature = 'vertical_dist'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
circle_features.append(feature)
regr_features.append(feature)
Data type: float64
Number of missing values: 0.57%
Correlation with the diagnosis: -0.01
Min: 0.0
Mean: 111.58
Max: 499.39
Number of unique values: 32534
Example of unique values: 99.37, 106.68, 110.2, 74.67, 118.36

Euclidean Distance from Digits

In [ ]:
feature = 'euc_dist_digit_1'
# similarly, there are 11 more such variables, one per digit (euc_dist_digit_2, euc_dist_digit_3, etc.)
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
for i in range(1, 13):
    feature = 'euc_dist_digit_{}'.format(i)
    regr_features.append(feature)
    circle_features.append(feature)
Data type: float64
Number of missing values: 22.51%
Correlation with the diagnosis: 0.03
Min: 0.0
Mean: 30.29
Max: 119.96
Number of unique values: 23912
Example of unique values: 30.13, 15.46, 11.66, 9.46, 17.6

Distance of Digits from clock center

In [ ]:
feature = '1 dist from cen'
# similarly, there are 11 more such variables, one per digit (2 dist from cen, 3 dist from cen, etc.)
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
for i in range(1, 13):
    feature = '{} dist from cen'.format(i)
    regr_features.append(feature)
    circle_features.append(feature)
Data type: float64
Number of missing values: 22.36%
Correlation with the diagnosis: -0.04
Min: 3.35
Mean: 361.87
Max: 618.03
Number of unique values: 21147
Example of unique values: 325.1, 394.52, 369.43, 380.71, 374.91

Area, Height, Width of Digit Bounding Boxes Metrics

In [ ]:
feature = 'area_digit_1'
# similarly, there are 11 more such variables, one per digit (area_digit_2, area_digit_3, etc.)
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
for i in range(1, 13):
    feature = 'area_digit_{}'.format(i)
    regr_features.append(feature)
    circle_features.append(feature)
Data type: float64
Number of missing values: 22.36%
Correlation with the diagnosis: 0.04
Min: 640.0
Mean: 2308.11
Max: 9870.0
Number of unique values: 1965
Example of unique values: 3182.0, 2015.0, 2320.0, 4640.0, 1652.0
In [ ]:
feature = 'height_digit_1'
# similarly, there are 11 more such variables, one per digit (height_digit_2, height_digit_3, etc.)
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
for i in range(1, 13):
    feature = 'height_digit_{}'.format(i)
    regr_features.append(feature)
    circle_features.append(feature)
Data type: float64
Number of missing values: 22.36%
Correlation with the diagnosis: 0.01
Min: 19.0
Mean: 59.88
Max: 143.0
Number of unique values: 123
Example of unique values: 74.0, 65.0, 80.0, 116.0, 59.0
In [ ]:
feature = 'width_digit_1'
# similarly, there are 11 more such variables, one per digit (width_digit_2, width_digit_3, etc.)
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
for i in range(1, 13):
    feature = 'width_digit_{}'.format(i)
    regr_features.append(feature)
    circle_features.append(feature)
Data type: float64
Number of missing values: 22.36%
Correlation with the diagnosis: 0.04
Min: 18.0
Mean: 40.34
Max: 164.0
Number of unique values: 126
Example of unique values: 43.0, 31.0, 29.0, 40.0, 28.0
In [ ]:
feature = 'variance_width'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
regr_features.append(feature)
circle_features.append(feature)
Data type: float64
Number of missing values: 1.42%
Correlation with the diagnosis: 0.06
Min: 0.0
Mean: 363.58
Max: 5408.0
Number of unique values: 24755
Example of unique values: 682.64, 201.73, 362.39, 587.95, 276.79
In [ ]:
feature = 'variance_height'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
regr_features.append(feature)
circle_features.append(feature)
Data type: float64
Number of missing values: 1.42%
Correlation with the diagnosis: 0.06
Min: 0.0
Mean: 324.12
Max: 6844.5
Number of unique values: 24130
Example of unique values: 383.11, 293.15, 324.42, 301.29, 110.45
In [ ]:
feature = 'variance_area'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
regr_features.append(feature)
circle_features.append(feature)
Data type: float64
Number of missing values: 1.42%
Correlation with the diagnosis: 0.09
Min: 0.0
Mean: 5148403.27
Max: 153019556.9
Number of unique values: 32296
Example of unique values: 5683189.17, 3912113.54, 6395827.46, 6531155.57, 2105465.36

Time Features

In [ ]:
feature = 'time_diff'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
regr_features.append(feature)
circle_features.append(feature)
Data type: float64
Number of missing values: 31.27%
Correlation with the diagnosis: 0.01
Min: -110.0
Mean: 105.2
Max: 605.0
Number of unique values: 140
Example of unique values: -105.0, nan, 0.0, 495.0, 540.0

Centre Dot Detection

In [ ]:
feature = 'centre_dot_detect'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
cat_features.append(feature)
circle_features.append(feature)
Data type: float64
Number of missing values: 30.36%
Correlation with the diagnosis: -0.01
Min: 0.0
Mean: 0.24
Max: 1.0
Number of unique values: 2
Example of unique values: 0.0, nan, 1.0

Horizontal and vertical count

In [ ]:
feature = 'hor_count'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
circle_features.append(feature)
regr_features.append(feature)
Data type: int64
Number of missing values: 0.0%
Correlation with the diagnosis: -0.09
Min: 0
Mean: 0.69
Max: 3
Number of unique values: 4
Example of unique values: 0, 1, 2, 3
In [ ]:
feature = 'vert_count'
feature_describe(feature)
show_distribution(feature)
show_distribution_hist(feature)
circle_features.append(feature)
regr_features.append(feature)
Data type: int64
Number of missing values: 0.0%
Correlation with the diagnosis: -0.08
Min: 0
Mean: 0.76
Max: 3
Number of unique values: 4
Example of unique values: 0, 1, 2, 3

Correlation of circle features

In [ ]:
show_corr(circle_features[:20])
In [ ]:
show_corr(circle_features[20:40])
In [ ]:
show_corr(circle_features[40:60])
In [ ]:
show_corr(circle_features[60:])

Overview of all data

Functions

In [ ]:
def show_miss():
    df_miss = 100 * train_data.isnull().sum().sort_values(ascending=False) / train_data.shape[0]
    for i in [1, 5, 10, 15, 20, 30, 50]:
        print('{} columns have more than {}% missing values'.format(len(df_miss[df_miss > i]), i))
    
    plt.figure(figsize=(10, 10))
    sns.barplot(x=df_miss.head(26).values, y=df_miss.head(26).index)
    plt.title('Top 26 columns by percentage of missing values')
    
def show_corr_all():
    df_corr = train_data.copy()
    df_corr['diagnosis'] = pd.factorize(df_corr['diagnosis'])[0]
    df_corr['intersection_pos_rel_centre'] = pd.factorize(df_corr['intersection_pos_rel_centre'])[0]
    df_corr = df_corr.corr()['diagnosis'].sort_values(ascending=False).iloc[1:-2]
    df_corr = pd.concat([df_corr.head(20), df_corr.tail(10)])
    plt.figure(figsize=(10, 10))
    sns.barplot(x=df_corr.values, y=df_corr.index)
    plt.title('Top 20 and bottom 10 correlations with the diagnosis')
    
def dist_diagnosis():
    plt.figure(figsize=(10, 5))
    # value_counts keeps labels and counts aligned
    # (diagnosis.unique() and groupby can disagree on ordering)
    counts = train_data['diagnosis'].value_counts()
    sns.barplot(x=counts.index, y=counts.values, palette='rocket')
    plt.ylabel('Count')
    plt.show()

Missing values

In [ ]:
show_miss()
91 columns have more than 1% missing values
80 columns have more than 5% missing values
77 columns have more than 10% missing values
47 columns have more than 15% missing values
26 columns have more than 20% missing values
16 columns have more than 30% missing values
2 columns have more than 50% missing values

Categorical and regression features

I split the features into the two groups according to my own judgment; this is not the only reasonable split.

In [ ]:
print(bold+'These are all categorical features in the form of a list:'+ordinary)
print(cat_features)
These are all categorical features in the form of a list:
['missing_digit_1', 'missing_digit_2', 'missing_digit_3', 'missing_digit_4', 'missing_digit_5', 'missing_digit_6', 'missing_digit_7', 'missing_digit_8', 'missing_digit_9', 'missing_digit_10', 'missing_digit_11', 'missing_digit_12', 'sequence_flag_cw', 'sequence_flag_ccw', 'hand_count_dummy', 'intersection_pos_rel_centre', 'hour_pointing_digit', 'minute_pointing_digit', 'eleven_ten_error', 'other_error', 'pred_tremor', 'centre_dot_detect']
In [ ]:
train_data[cat_features].head()
Out[ ]:
missing_digit_1 missing_digit_2 missing_digit_3 missing_digit_4 missing_digit_5 missing_digit_6 missing_digit_7 missing_digit_8 missing_digit_9 missing_digit_10 ... sequence_flag_cw sequence_flag_ccw hand_count_dummy intersection_pos_rel_centre hour_pointing_digit minute_pointing_digit eleven_ten_error other_error pred_tremor centre_dot_detect
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 2.0 TL 12.0 11.0 0 1 0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 1.0 NaN NaN NaN 0 1 1 NaN
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 2.0 BL 11.0 2.0 0 0 0 0.0
3 0.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0 ... 0.0 0.0 1.0 NaN NaN NaN 0 1 1 NaN
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 2.0 TR 11.0 2.0 0 0 0 1.0

5 rows × 22 columns

In [ ]:
train_data[regr_features].head()
Out[ ]:
final_rotation_angle number_of_digits deviation_dist_from_mid_axis between_axis_digits_angle_sum between_axis_digits_angle_var between_digits_angle_cw_sum between_digits_angle_cw_var between_digits_angle_ccw_sum between_digits_angle_ccw_var number_of_hands ... width_digit_9 width_digit_10 width_digit_11 width_digit_12 variance_width variance_height variance_area time_diff hor_count vert_count
0 0.0 12.0 14.105000 NaN 225.273687 360.0 72.269406 NaN 72.269406 2.0 ... 64.0 65.0 55.0 120.0 682.636364 383.113636 5683189.174 -105.0 0 0
1 0.0 12.0 21.125000 360.0 382.127186 360.0 69.822716 NaN 69.822716 1.0 ... 54.0 83.0 67.0 80.0 201.727273 293.151515 3912113.538 NaN 0 1
2 0.0 12.0 19.662500 360.0 439.972719 360.0 49.540354 NaN 49.540354 2.0 ... 73.0 99.0 86.0 94.0 362.386364 324.424242 6395827.455 0.0 0 0
3 0.0 7.0 34.016667 360.0 6465.579942 NaN 13467.727800 NaN 13467.727800 1.0 ... NaN NaN 74.0 116.0 587.952381 301.285714 6531155.571 NaN 1 0
4 0.0 12.0 8.710000 360.0 59.508165 360.0 50.762378 NaN 50.762378 2.0 ... 41.0 81.0 64.0 87.0 276.787879 110.446970 2105465.356 0.0 0 1

5 rows × 96 columns
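As a quick sanity check on the hand-made split (a sketch, assuming the cells above have run): the two lists should be disjoint, and since 22 categorical + 96 numerical + 2 id/target columns account for 120 of the 122 columns, the print below should reveal the two leftover columns.

# The categorical and numerical lists must not overlap
assert set(cat_features).isdisjoint(regr_features)
covered = set(cat_features) | set(regr_features) | {'row_id', 'diagnosis'}
print('Columns in neither list:', sorted(set(train_data.columns) - covered))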

Top correlations

In [ ]:
show_corr_all()

Diagnosis distribution

In [ ]:
dist_diagnosis()

Training

In [ ]:
train_data.dtypes[train_data.dtypes == object]
Out[ ]:
row_id                         object
intersection_pos_rel_centre    object
diagnosis                      object
dtype: object
In [ ]:
# CatBoost needs categorical features as int/str with no NaNs,
# so fill missing values with a sentinel (999) and cast the numeric ones to int
train_data[cat_features] = train_data[cat_features].fillna(999)
train_data[[feature for feature in cat_features if feature != 'intersection_pos_rel_centre']] = train_data[[feature for feature in cat_features if feature != 'intersection_pos_rel_centre']].astype(int)
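A quick check (a sketch, assuming the cell above has run) that the categorical columns now satisfy CatBoost's requirements:

# No NaNs should remain; everything except intersection_pos_rel_centre
# (which stays object-typed because of its string labels) should be int
assert train_data[cat_features].isnull().sum().sum() == 0
print(train_data[cat_features].dtypes.value_counts())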
In [ ]:
X_train, X_test, y_train, y_test = train_test_split(train_data.drop(['row_id', 'diagnosis'], axis=1), train_data['diagnosis'], 
                                                    test_size=0.15, stratify=train_data['diagnosis'], random_state=17)

model = CatBoostClassifier(loss_function='MultiClass',
                          auto_class_weights='SqrtBalanced')
model.fit(X_train, y_train, eval_set=(X_test, y_test), cat_features=cat_features, verbose=100)
Learning rate set to 0.115406
0:	learn: 0.9862599	test: 0.9876717	best: 0.9876717 (0)	total: 269ms	remaining: 4m 28s
100:	learn: 0.4245284	test: 0.5377943	best: 0.5367516 (93)	total: 23s	remaining: 3m 24s
200:	learn: 0.3352676	test: 0.5394514	best: 0.5350195 (157)	total: 47.3s	remaining: 3m 8s
300:	learn: 0.2727060	test: 0.5486710	best: 0.5350195 (157)	total: 1m 11s	remaining: 2m 45s
400:	learn: 0.2246994	test: 0.5623510	best: 0.5350195 (157)	total: 1m 35s	remaining: 2m 22s
500:	learn: 0.1848185	test: 0.5801075	best: 0.5350195 (157)	total: 1m 59s	remaining: 1m 58s
600:	learn: 0.1552057	test: 0.5976177	best: 0.5350195 (157)	total: 2m 21s	remaining: 1m 34s
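In the log above the test loss bottoms out at iteration 157 and drifts up afterwards, so the model overfits well before the default 1000 iterations. A minimal sketch (same split and settings, not what was submitted) of letting CatBoost stop early via its standard early_stopping_rounds and use_best_model fit parameters:

# Sketch: stop once the eval metric has not improved for 100 iterations
# and roll back to the best one (iteration ~157 in the run above)
model = CatBoostClassifier(loss_function='MultiClass',
                           auto_class_weights='SqrtBalanced')
model.fit(X_train, y_train, eval_set=(X_test, y_test), cat_features=cat_features,
          verbose=100, early_stopping_rounds=100, use_best_model=True)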
In [ ]:
model.save_model(AICROWD_ASSETS_DIR + '/model_123')
np.save(AICROWD_ASSETS_DIR + '/cat', cat_features)

Prediction phase 🔎

In [ ]:
from catboost import CatBoostClassifier
model = CatBoostClassifier()
model.load_model(AICROWD_ASSETS_DIR + '/model_123')
Out[ ]:
<catboost.core.CatBoostClassifier at 0x7f8356c3eee0>

Load test data

In [ ]:
test_data = pd.read_csv(AICROWD_DATASET_PATH)
cat_features = np.load(AICROWD_ASSETS_DIR + '/cat.npy', allow_pickle=True)
In [ ]:
# Apply the same categorical preprocessing as in training
test_data[cat_features] = test_data[cat_features].fillna(999)
test_data[[feature for feature in cat_features if feature != 'intersection_pos_rel_centre']] = test_data[[feature for feature in cat_features if feature != 'intersection_pos_rel_centre']].astype(int)

Generate predictions

In [ ]:
preds = model.predict_proba(test_data.drop(['row_id'], axis=1))
In [ ]:
# predict_proba columns follow model.classes_
# (here: normal, post_alzheimer, pre_alzheimer)
predictions = {
    "row_id": test_data["row_id"].values,
    "normal_diagnosis_probability": preds[:, 0],
    "post_alzheimer_diagnosis_probability": preds[:, 1],
    "pre_alzheimer_diagnosis_probability": preds[:, 2],
}

predictions_df = pd.DataFrame.from_dict(predictions)
In [ ]:
pred_sum = predictions_df['normal_diagnosis_probability'] + predictions_df['post_alzheimer_diagnosis_probability'] + predictions_df['pre_alzheimer_diagnosis_probability']
predictions_df['normal_diagnosis_probability'] /= pred_sum 
predictions_df['post_alzheimer_diagnosis_probability'] /= pred_sum 
predictions_df['pre_alzheimer_diagnosis_probability'] /= pred_sum
predictions_df['normal_diagnosis_probability'] + predictions_df['post_alzheimer_diagnosis_probability'] + predictions_df['pre_alzheimer_diagnosis_probability']
Out[ ]:
0      1.0
1      1.0
2      1.0
3      1.0
4      1.0
      ... 
357    1.0
358    1.0
359    1.0
360    1.0
361    1.0
Length: 362, dtype: float64
In [ ]:
predictions_df.to_csv(AICROWD_PREDICTIONS_PATH, index=False)

Submit to AIcrowd 🚀

In [ ]:
!DATASET_PATH=$AICROWD_DATASET_PATH \
aicrowd notebook submit \
    --assets-dir $AICROWD_ASSETS_DIR \
    --challenge addi-alzheimers-detection-challenge
API Key valid
Saved API Key successfully!
Using notebook: /home/desktop0/features_exploration.ipynb for submission...
Removing existing files from submission directory...
Scrubbing API keys from the notebook...
Collecting notebook...
Validating the submission...
Executing install.ipynb...
[NbConvertApp] Converting notebook /home/desktop0/submission/install.ipynb to notebook
[NbConvertApp] Executing notebook with kernel: python
[NbConvertApp] Writing 3773 bytes to /home/desktop0/submission/install.nbconvert.ipynb
Executing predict.ipynb...
[NbConvertApp] Converting notebook /home/desktop0/submission/predict.ipynb to notebook
[NbConvertApp] Executing notebook with kernel: python
[NbConvertApp] Writing 9077 bytes to /home/desktop0/submission/predict.nbconvert.ipynb
submission.zip ━━━━━━━━━━━━━━━━━━━━ 100.0% • 13.9/13.9 MB • 974.2 kB/s • 0:00:00
                                                 ╭─────────────────────────╮                                                 
                                                 │ Successfully submitted! │                                                 
                                                 ╰─────────────────────────╯                                                 
                                                       Important links                                                       
┌──────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│  This submission │ https://www.aicrowd.com/challenges/addi-alzheimers-detection-challenge/submissions/138056              │
│                  │                                                                                                        │
│  All submissions │ https://www.aicrowd.com/challenges/addi-alzheimers-detection-challenge/submissions?my_submissions=true │
│                  │                                                                                                        │
│      Leaderboard │ https://www.aicrowd.com/challenges/addi-alzheimers-detection-challenge/leaderboards                    │
│                  │                                                                                                        │
│ Discussion forum │ https://discourse.aicrowd.com/c/addi-alzheimers-detection-challenge                                    │
│                  │                                                                                                        │
│   Challenge page │ https://www.aicrowd.com/challenges/addi-alzheimers-detection-challenge                                 │
└──────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────┘