
ADDI Alzheimers Detection Challenge

EDA, FE, HPO - All you need (LB: 0.640)

Detailed EDA, FE with Class Balancing, Hyper-Parameter Optimization of XGBoost using Optuna

jyot_makadiya

This notebook explains feature-level exploratory data analysis along with observation comments, simple feature engineering including class balancing and XGBoost hyper-parameter optimization using HPO framework Optuna.


What is the notebook about?

The challenge is to use the features extracted from the Clock Drawing Test to build an automated algorithm that predicts which of three phases each participant is in:

1) Pre-Alzheimer’s (Early Warning)
2) Post-Alzheimer’s (Detection)
3) Normal (Not an Alzheimer’s patient)

In machine learning terms: this is a 3-class classification task.

How to use this notebook? 📝

  • Update the config parameters. You can define the common variables here
    Variable: Description
    AICROWD_DATASET_PATH: Path to the file containing test data (the data will be available at /ds_shared_drive/ on the Aridhia workspace). This should be an absolute path.
    AICROWD_PREDICTIONS_PATH: Path to write the output to.
    AICROWD_ASSETS_DIR: In case your notebook needs additional files (like model weights, etc.), you can add them to a directory and specify the path to the directory here (please specify a relative path). The contents of this directory will be sent to AIcrowd for evaluation.
    AICROWD_API_KEY: In order to submit your code to AIcrowd, you need to provide your account's API key. This key is available at https://www.aicrowd.com/participants/me
  • Installing packages. Please use the Install packages 🗃 section to install the packages
  • Training your models. All the code within the Training phase ⚙️ section will be skipped during evaluation. Please make sure to save your model weights in the assets directory and load them in the predictions phase section
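
The save-then-load pattern described in the last bullet can be sketched as follows. This is only an illustration under assumptions: the file name `model.pkl`, the placeholder `model` dictionary and the temporary directory standing in for `AICROWD_ASSETS_DIR` are all hypothetical, and any picklable fitted model could take the placeholder's place.

```python
import os
import pickle
import tempfile

# Stand-in for AICROWD_ASSETS_DIR; a temporary folder keeps this sketch side-effect free.
assets_dir = os.path.join(tempfile.mkdtemp(), "assets")
os.makedirs(assets_dir, exist_ok=True)

# Hypothetical "model": any picklable object, e.g. a fitted classifier.
model = {"name": "placeholder-model", "params": {"max_depth": 4}}

# Training phase: persist the fitted model into the assets directory.
model_path = os.path.join(assets_dir, "model.pkl")  # hypothetical file name
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Prediction phase: the training code is skipped, so the model is reloaded from disk.
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

print(loaded_model == model)
```

Because the training section is skipped during evaluation, everything the prediction section needs has to be read back from the assets directory like this.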

Content:

  • Exploratory Data Analysis
  • Feature Engineering
  • Hyper-parameter Optimization
  • Training Best Parameters Model
  • Final Prediction and submission

Introduction:

Hello, I am Jyot Makadiya, a pre-final-year student pursuing a Bachelor of Technology in Computer Science & Engineering. I have been experimenting with data for a year now; so far the journey has been smooth and I have learned a lot along the way.
This challenge is a multiclass classification problem with 3 classes (Normal, Pre-Alzheimer’s, Post-Alzheimer’s). The main ingredients for a good score are reliable cross-validation on a balanced dataset, good feature engineering, and fine-tuned hyper-parameters along with ensembling.
This notebook covers my approach for this competition, starting with exploratory data analysis. It then covers simple feature engineering for a few features (I'll expand on FE and ensembling in the next part/walkthrough blog). Finally, we use Optuna for hyper-parameter optimization.
The aim of this notebook is to introduce you to a variety of concepts, including but not limited to hyper-parameter optimization (aka AutoML tools) and simple, feature-level EDA and FE.
For a better view of the graphs and plots, open this notebook in Colab using the "Open in Colab" button.

Setup AIcrowd Utilities 🛠

We use this to bundle the files for submission and create a submission on AIcrowd. Do not edit this block.

In [ ]:
!pip install -q -U aicrowd-cli
In [ ]:
%load_ext aicrowd.magic

AIcrowd Runtime Configuration 🧷

Define configuration parameters. Please include any files needed for the notebook to run under ASSETS_DIR. We will copy the contents of this directory to your final submission file 🙂

The dataset is available under /ds_shared_drive on the workspace.

In [ ]:
import os

# Please use an absolute path for the location of the dataset.
# Or you can build one from a relative path with `os.path.join(os.getcwd(), "test_data/validation.csv")`
AICROWD_DATASET_PATH = os.getenv("DATASET_PATH", "Z:/challenge-data/validation.csv")
AICROWD_PREDICTIONS_PATH = os.getenv("PREDICTIONS_PATH", "Z:/challenge-data/predictions.csv")
AICROWD_ASSETS_DIR = "assets"

Install packages 🗃

Please add all package installations in this section.

In [ ]:
!pip install -q numpy pandas
In [ ]:
!pip install -q xgboost scikit-learn seaborn lightgbm optuna

Define preprocessing code 💻

The code that is common between the training and the prediction sections should be defined here. During evaluation, we completely skip the training section. Please make sure to add any common logic between the training and prediction sections here.

Import common packages

Please import packages that are common for training and prediction phases here.

In [ ]:
import xgboost as xgb
import numpy as np
import pandas as pd
import pickle
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set()
sns.color_palette("rocket_r")
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 200)
pd.set_option('display.width', 1000)

import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, log_loss, f1_score
import joblib

import warnings
warnings.filterwarnings("ignore")
# df
# with open(AICROWD_DATASET_PATH) as f:
#     f.read()
# some preprocessing code
In [ ]:
# os.listdir('Z:/challenge-data/')
#Pre Processing functions

Training phase ⚙️

You can define your training code here. This section will be skipped during evaluation.

In [ ]:
# model = define_your_model

Load training data

In [ ]:
df_orig = pd.read_csv("Z:/challenge-data/train.csv")

df_valid = pd.read_csv("Z:/challenge-data/validation.csv")
df_valid_target = pd.read_csv("Z:/challenge-data/validation_ground_truth.csv")
df = df_orig.copy()
df.describe()
Out[ ]:
number_of_digits missing_digit_1 missing_digit_2 missing_digit_3 missing_digit_4 missing_digit_5 missing_digit_6 missing_digit_7 missing_digit_8 missing_digit_9 missing_digit_10 missing_digit_11 missing_digit_12 1 dist from cen 10 dist from cen 11 dist from cen 12 dist from cen 2 dist from cen 3 dist from cen 4 dist from cen 5 dist from cen 6 dist from cen 7 dist from cen 8 dist from cen 9 dist from cen euc_dist_digit_1 euc_dist_digit_2 euc_dist_digit_3 euc_dist_digit_4 euc_dist_digit_5 euc_dist_digit_6 euc_dist_digit_7 euc_dist_digit_8 euc_dist_digit_9 euc_dist_digit_10 euc_dist_digit_11 euc_dist_digit_12 area_digit_1 area_digit_2 area_digit_3 area_digit_4 area_digit_5 area_digit_6 area_digit_7 area_digit_8 area_digit_9 area_digit_10 area_digit_11 area_digit_12 height_digit_1 height_digit_2 height_digit_3 height_digit_4 height_digit_5 height_digit_6 height_digit_7 height_digit_8 height_digit_9 height_digit_10 height_digit_11 height_digit_12 width_digit_1 width_digit_2 width_digit_3 width_digit_4 width_digit_5 width_digit_6 width_digit_7 width_digit_8 width_digit_9 width_digit_10 width_digit_11 width_digit_12 variance_width variance_height variance_area deviation_dist_from_mid_axis between_axis_digits_angle_sum between_axis_digits_angle_var between_digits_angle_cw_sum between_digits_angle_cw_var between_digits_angle_ccw_sum between_digits_angle_ccw_var sequence_flag_cw sequence_flag_ccw number_of_hands hand_count_dummy hour_hand_length minute_hand_length single_hand_length clockhand_ratio clockhand_diff angle_between_hands deviation_from_centre hour_proximity_from_11 minute_proximity_from_2 hour_pointing_digit actual_hour_digit minute_pointing_digit actual_minute_digit final_rotation_angle ellipse_circle_ratio count_defects percentage_inside_ellipse pred_tremor double_major double_minor vertical_dist horizontal_dist top_area_perc bottom_area_perc left_area_perc right_area_perc hor_count vert_count eleven_ten_error other_error time_diff centre_dot_detect
count 32703.000000 32703.000000 32703.000000 32703.000000 32703.000000 32703.000000 32703.000000 32703.000000 32703.000000 32703.000000 32703.000000 32703.000000 32703.000000 25448.000000 27882.000000 27201.000000 28937.000000 27855.000000 28612.000000 27251.000000 26092.000000 28407.000000 28555.000000 28755.000000 26974.000000 25400.000000 27800.000000 28603.000000 27238.000000 26082.000000 28394.000000 28491.000000 28641.000000 2.693600e+04 27838.000000 27151.000000 28909.000000 25448.000000 27855.000000 28612.000000 27251.000000 26092.000000 28407.000000 28555.000000 28755.000000 26974.000000 27882.000000 27201.000000 28937.000000 25448.000000 27855.000000 28612.000000 27251.000000 26092.000000 28407.000000 28555.000000 28755.000000 26974.000000 27882.000000 27201.000000 28937.000000 25448.000000 27855.000000 28612.000000 27251.000000 26092.000000 28407.000000 28555.000000 28755.000000 26974.000000 27882.000000 27201.000000 28937.000000 32313.000000 32313.000000 3.231300e+04 32198.000000 29141.000000 30870.000000 20027.000000 32085.000000 844.000000 32085.000000 32474.000000 32474.000000 30623.000000 30623.000000 22861.000000 22861.000000 7741.000000 22650.000000 22835.000000 22861.000000 22793.000000 20191.000000 19919.000000 22677.000000 32777.0 22678.000000 32777.0 32703.000000 3.203900e+04 32777.000000 32472.000000 32777.000000 32540.000000 3.277600e+04 3.258900e+04 3.273500e+04 31218.000000 31218.000000 31218.000000 31218.000000 32777.000000 32777.000000 32777.000000 32777.000000 22526.000000 22826.000000
mean 10.299422 0.221845 0.148243 0.125096 0.166713 0.202153 0.131364 0.126839 0.120723 0.175183 0.147418 0.168241 0.115158 361.869732 367.418424 368.235873 370.796838 349.116177 337.542587 336.085919 335.550313 353.017822 368.547709 370.329200 375.631690 30.287315 32.834984 33.031035 32.049520 30.724226 28.135344 30.886070 32.250843 3.125026e+01 33.247571 32.644335 28.629239 2308.107671 4616.101562 5046.115231 5793.115665 7214.179250 6035.063259 4942.821748 5697.203373 5678.539964 6647.253927 5393.460167 6998.064450 59.880541 70.994184 80.247973 87.709479 88.637130 88.011054 81.457468 85.024761 89.002818 81.400330 75.654792 77.071742 40.342109 62.648717 61.032259 65.411545 79.223402 68.243391 60.447242 66.558233 63.786943 78.795890 69.117569 87.386426 363.578878 324.115546 5.148403e+06 32.202820 352.139508 2587.128279 355.100767 3081.777480 243.825427 3157.421099 0.750385 0.018538 1.772132 1.770369 60.538409 80.874117 74.602333 1.375478 20.270851 90.170001 17.420096 24.922338 33.267558 9.047449 11.0 4.393377 2.0 65.737088 7.911654e+01 93.489459 0.939555 0.317052 120.238950 1.065362e+02 1.115766e+02 1.139164e+02 0.519007 0.465878 0.525769 0.464433 0.693230 0.762211 0.025231 0.612655 105.199325 0.241172
std 2.345710 0.415494 0.355346 0.330832 0.372725 0.401612 0.337803 0.332797 0.325810 0.380129 0.354527 0.374086 0.319217 50.310698 48.060878 48.425983 48.005863 53.313076 51.175381 47.456872 46.910977 47.096105 50.956366 51.562665 45.795291 33.877417 31.828580 33.060628 31.662544 30.055328 31.245333 33.028061 33.840305 3.440961e+01 34.375507 34.165306 35.018626 1070.213451 2365.657591 2569.549735 2641.521129 3474.474015 2742.576668 2221.963276 2741.527329 2563.651971 3161.975614 2633.392241 3525.529979 20.742269 21.127381 23.071334 26.389729 23.653629 27.994521 24.236242 25.677814 27.629884 21.714878 20.775003 21.834246 17.562823 17.698763 19.128573 19.661431 24.229453 21.233585 19.571260 22.255969 21.164594 21.110245 20.374876 25.649241 306.449113 302.509846 6.805541e+06 32.276635 52.263430 5675.602203 41.688130 5648.419057 168.298221 5784.787525 0.432797 0.134888 0.457020 0.448046 14.191507 13.311371 35.873579 0.299056 13.359890 23.522379 19.001146 39.291724 46.534051 3.661448 0.0 3.965286 0.0 110.472325 1.453976e+01 39.504488 0.169569 0.465335 19.864539 1.295960e+01 1.900768e+01 1.467719e+01 0.180912 0.178807 0.160916 0.160926 0.675787 0.699355 0.156829 0.487151 205.429390 0.427804
min 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3.354102 5.852350 11.335784 22.102036 7.905694 15.206906 6.519202 7.826238 3.535534 14.422205 8.139410 14.115594 0.000760 0.003261 0.000000 0.000760 0.002345 0.000000 0.000491 0.000515 2.960000e-14 0.002010 0.001071 0.000000 640.000000 768.000000 828.000000 1036.000000 1152.000000 805.000000 777.000000 1054.000000 870.000000 888.000000 780.000000 1089.000000 19.000000 21.000000 29.000000 31.000000 30.000000 28.000000 24.000000 28.000000 24.000000 31.000000 26.000000 28.000000 18.000000 26.000000 23.000000 26.000000 32.000000 23.000000 21.000000 24.000000 25.000000 24.000000 24.000000 30.000000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 23.164818 33.186431 18.017081 1.000000 0.000000 0.029409 0.100852 0.000000 0.000000 1.000000 11.0 1.000000 2.0 0.000000 5.060000e-10 1.000000 0.000000 0.000000 9.696612 4.210000e-10 4.210000e-10 2.600000e-09 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -110.000000 0.000000
25% 10.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 336.580321 343.945581 342.212288 348.353987 320.153479 308.950947 309.714788 309.358914 328.613374 342.429190 345.253711 353.542784 6.680591 8.930655 8.190000 8.141702 7.908380 6.240000 6.903923 7.413237 6.890000e+00 7.541077 7.369695 5.200000 1537.000000 2942.000000 3240.000000 3915.000000 4758.000000 4041.000000 3312.000000 3760.000000 3848.000000 4380.000000 3540.000000 4473.000000 45.000000 56.000000 63.000000 69.000000 72.000000 67.000000 63.000000 67.000000 69.000000 66.000000 61.000000 62.000000 29.000000 50.000000 47.000000 52.000000 61.000000 54.000000 47.000000 50.000000 50.000000 64.000000 54.000000 69.000000 171.515152 148.277778 1.575504e+06 9.880000 360.000000 102.884207 360.000000 51.690940 0.000000 52.068825 1.000000 0.000000 2.000000 2.000000 50.270850 71.700628 53.825024 1.138048 9.375855 82.023132 8.079066 2.334051 1.728408 10.000000 11.0 2.000000 2.0 0.000000 7.802811e+01 78.000000 1.000000 0.000000 115.425250 1.027451e+02 1.077028e+02 1.099097e+02 0.472774 0.480002 0.502471 0.457726 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 11.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 367.434688 372.683512 372.667412 377.180328 353.802911 343.432854 340.306479 339.694716 358.558224 372.874309 375.153635 381.106284 14.935799 20.012485 18.980000 19.097071 18.527444 14.300000 15.724508 16.984068 1.586000e+01 17.561152 17.209259 11.960000 2065.000000 4104.000000 4508.000000 5270.000000 6525.000000 5580.000000 4559.000000 5145.000000 5280.000000 6075.000000 4872.000000 6240.000000 59.000000 68.000000 78.000000 86.000000 87.000000 86.000000 80.000000 84.000000 88.000000 79.000000 73.000000 74.000000 34.000000 60.000000 57.000000 61.000000 75.000000 64.000000 57.000000 62.000000 59.000000 76.000000 66.000000 84.000000 282.787879 246.386364 3.114518e+06 16.163333 360.000000 296.051705 360.000000 165.733916 360.000000 167.238094 1.000000 0.000000 2.000000 2.000000 59.852542 80.769660 68.324616 1.307708 18.837195 91.649346 13.104333 5.189904 4.258455 11.000000 11.0 2.000000 2.0 0.000000 8.375303e+01 103.000000 1.000000 0.000000 120.289760 1.093602e+02 1.130389e+02 1.160801e+02 0.493087 0.505231 0.520833 0.478135 1.000000 1.000000 0.000000 1.000000 0.000000 0.000000
75% 12.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 393.898464 397.112940 399.011278 401.186366 383.428285 371.767737 367.008855 366.316888 384.157520 400.218222 401.983364 404.633476 37.094044 46.669563 47.840000 45.860710 44.763078 35.100000 44.079657 46.748061 4.085250e+01 50.597531 48.329409 31.330000 2816.000000 5716.000000 6231.000000 7105.500000 8840.000000 7560.000000 6120.000000 7000.000000 7038.000000 8232.000000 6640.000000 8701.000000 73.000000 83.000000 94.000000 104.000000 103.000000 107.000000 98.000000 101.000000 107.000000 94.000000 88.000000 89.000000 47.000000 72.000000 71.000000 75.000000 93.000000 78.000000 70.000000 78.000000 72.000000 91.000000 80.000000 102.000000 456.363636 404.100000 6.090066e+06 44.200000 360.000000 2402.861831 360.000000 5215.235174 360.000000 5224.038548 1.000000 0.000000 2.000000 2.000000 70.048169 89.989302 85.473987 1.542887 29.315735 101.621967 20.451729 16.677090 77.721980 11.000000 11.0 10.000000 2.0 90.000000 8.742738e+01 121.000000 1.000000 1.000000 124.313176 1.135460e+02 1.163918e+02 1.201251e+02 0.516964 0.525196 0.540867 0.496303 1.000000 1.000000 0.000000 1.000000 60.000000 0.000000
max 17.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 618.025889 628.776988 613.843832 659.571073 568.624876 611.333379 580.975473 520.454849 586.950168 666.132119 608.481717 620.016935 119.957644 119.906309 119.860000 119.937391 119.643227 119.730000 119.997479 119.838808 1.199900e+02 119.915642 119.855309 119.990000 9870.000000 25088.000000 31482.000000 29946.000000 32200.000000 27378.000000 22866.000000 29503.000000 34524.000000 35280.000000 30338.000000 28362.000000 143.000000 213.000000 212.000000 241.000000 218.000000 225.000000 195.000000 256.000000 271.000000 213.000000 241.000000 287.000000 164.000000 193.000000 185.000000 204.000000 220.000000 209.000000 206.000000 277.000000 207.000000 219.000000 197.000000 220.000000 5408.000000 6844.500000 1.530196e+08 125.710000 360.000000 63116.006320 360.000000 63259.726600 360.000000 63259.726600 1.000000 1.000000 8.000000 3.000000 123.519704 133.691585 292.853059 2.498143 69.872251 179.518624 298.723197 179.774464 179.928116 12.000000 11.0 12.000000 2.0 330.000000 9.997281e+01 176.000000 1.000000 1.000000 499.391604 3.051857e+02 4.993892e+02 4.928847e+02 1.000000 1.000000 1.000000 1.000000 3.000000 3.000000 1.000000 1.000000 605.000000 1.000000
In [ ]:
# list(df.columns)

Exploratory Data Analysis

In [ ]:
# Final Rotation Angle in degrees

feat_col = df['final_rotation_angle']
feat_col.fillna(-5,inplace=True)
plt.figure(figsize=(14,8))
fig = sns.countplot(x = 'final_rotation_angle',data=df, palette='rocket_r', hue='diagnosis')
fig.set_xlabel("Rotation Angle in Degree",size=15)
fig.set_ylabel("Angle Frequency",size=15)
plt.title('Angle frequencies for all samples',size = 20)
plt.show()

We can notice that the rotation angle takes only 13 discrete values. Instead of using these directly, we can re-bin them into 4 indicator columns, each covering a 90-degree range, i.e. one quarter of the circle.

In [ ]:
print(f"number of unique values for rotation angles: {feat_col.nunique()}")

# Now we can convert the angle into 4 quarter-circle indicator columns.
# The whole comparison is wrapped in parentheses so that .astype applies to the combined mask.
df['rotation_angle_90'] = (feat_col <= 90).astype('int')    # also includes NaN, which was filled with -5 above
df['rotation_angle_180'] = ((90 < feat_col) & (feat_col <= 180)).astype('int')
df['rotation_angle_270'] = ((180 < feat_col) & (feat_col <= 270)).astype('int')
df['rotation_angle_360'] = (feat_col > 270).astype('int')

# We are not using these currently; instead we will use two columns for below 180 and above 180
number of unique values for rotation angles: 13
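
The two-column alternative mentioned in the last comment of the cell above could look like the following minimal sketch. The column names `rotation_angle_le_180` and `rotation_angle_gt_180` and the toy angles are assumptions for illustration; they are not taken from the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df['final_rotation_angle'], including one missing value.
angles = pd.Series([0, 30, 90, 180, 210, 330, np.nan], name="final_rotation_angle")

# Same sentinel as in the cell above: missing angles become -5.
angles_filled = angles.fillna(-5)

features = pd.DataFrame({
    # Hypothetical column names; the -5 sentinel falls into the first bucket.
    "rotation_angle_le_180": (angles_filled <= 180).astype("int"),
    "rotation_angle_gt_180": (angles_filled > 180).astype("int"),
})
print(features)
```

With this encoding every row belongs to exactly one of the two buckets, and missing angles are grouped with the lower half.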
In [ ]:
# number of digits 
feat_col = df['number_of_digits']
feat_col.fillna(-1,inplace=True)
plt.figure(figsize=(14,8))
fig = sns.countplot(data=df, x="number_of_digits",palette='rocket', hue="diagnosis" )
fig.set_xlabel("number of digits",size=15)
fig.set_ylabel("Digits Frequency",size=15)
plt.title('Num Digits frequencies for all samples',size = 20)
plt.show()
In [ ]:
print(f"number of unique values for number digits: {df['number_of_digits'].nunique()}")
number of unique values for number digits: 18

We can notice that most of the values lie in the 10 to 12 range, which is consistent with a large share of normal samples in our dataset. A new binary feature that is true when the count is 10, 11 or 12 may therefore be useful.
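
A possible version of that flag is sketched below on toy data. The feature name `digits_10_to_12` is hypothetical, and -1 plays the role of the fill value used for missing counts in the cell above.

```python
import pandas as pd

# Toy stand-in for df['number_of_digits']; -1 is the missing-value sentinel.
digits = pd.Series([12, 11, 10, 7, 15, -1, 12], name="number_of_digits")

# Hypothetical feature: 1 when the drawing has 10, 11 or 12 digits, else 0.
digits_10_to_12 = digits.isin([10, 11, 12]).astype("int")
print(digits_10_to_12.tolist())
```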

In [ ]:
# Let's look at some of the categorical features that are repeated once per digit
# Missing-digit flags
plt.figure()
fig, ax = plt.subplots(4, 3,figsize=(14, 20))
cont_features = [f"missing_digit_{i}" for i in range(1,13)]
for i,feature in enumerate(cont_features):
    plt.subplot(4, 3,i+1)
    sns.countplot(data=df, x=feature,palette='rocket' )
    plt.xlabel(f"Count of values for {feature}", fontsize=12);# plt.legend()
plt.show()
<Figure size 432x288 with 0 Axes>

The ratio is similar for almost all the digits, with around 5000 samples flagged as missing per digit. We can notice a larger share of missing values in the missing_digit_1 and missing_digit_5 variables.
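
Since the flags are 0/1, the share of missing digits can also be read off numerically as the column mean. The sketch below uses a small made-up frame in place of `df`; only the `missing_digit_` column naming follows the dataset.

```python
import pandas as pd

# Toy stand-in for the missing_digit_* columns of df.
toy = pd.DataFrame({
    "missing_digit_1": [1, 0, 1, 0],
    "missing_digit_2": [0, 0, 1, 0],
    "missing_digit_3": [0, 0, 0, 0],
})

missing_cols = [c for c in toy.columns if c.startswith("missing_digit_")]

# The mean of a 0/1 flag is the fraction of samples where the digit is missing.
missing_share = toy[missing_cols].mean().sort_values(ascending=False)
print(missing_share)
```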

In [ ]:
# Let's look at the Euclidean distance for each digit
# This feature can be calculated with the Euclidean distance formula between the ideal and the detected digit positions: sqrt(a^2+b^2)

plt.figure()
fig, ax = plt.subplots(4, 3,figsize=(14, 20))
cont_features = [f"euc_dist_digit_{i}" for i in range(1,13)]
for i,feature in enumerate(cont_features):
    plt.subplot(4, 3,i+1)
    df[feature].fillna(-10,inplace=True)
    sns.distplot(df[feature] , color='Red')
    plt.xlabel(f"Frequency of values for {feature}", fontsize=12);# plt.legend()
plt.show()
<Figure size 432x288 with 0 Axes>
In [ ]:
# Let's look at the Euclidean distance from the center (512,512) to each digit
# This feature can be calculated with the Euclidean distance formula between the center and the detected digit position: sqrt(a^2+b^2)

plt.figure()
fig, ax = plt.subplots(4, 3,figsize=(14, 20))
cont_features = [f"{i} dist from cen" for i in range(1,13)]
for i,feature in enumerate(cont_features):
    plt.subplot(4, 3,i+1)
    df[feature].fillna(-10,inplace=True)
    sns.distplot(df[feature] , color='Red')
    plt.xlabel(f"Frequency distribution of values for {feature}", fontsize=12);# plt.legend()
plt.show()
<Figure size 432x288 with 0 Axes>

The distributions look roughly Gaussian and symmetric, with most values spread over a range of about 200. Another thing to notice is that there are a lot of missing values in those variables.

In [ ]:
#Next set of variables are area for each digit bounding boxes

plt.figure()
fig, ax = plt.subplots(4, 3,figsize=(14, 20))
cont_features = [f"area_digit_{i}" for i in range(1,13)]
for i,feature in enumerate(cont_features):
    plt.subplot(4, 3,i+1)
    df[feature].fillna(-1,inplace=True)
    sns.distplot(df[feature] , color='Red')
    plt.xlabel(f"Frequency distribution for {feature}", fontsize=12);# plt.legend()
plt.show()
<Figure size 432x288 with 0 Axes>

We can notice that the distributions have large variance and seem to be skewed. We may use some feature engineering to correct that.
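
One common way to reduce right skew is a log transform. The notebook does not say which transform it has in mind, so the sketch below is an assumption, shown on made-up area values. Note that it has to be applied before missing values are replaced by the -1 sentinel, because `log1p(-1)` is negative infinity.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed areas with one missing value (NaN, not yet filled with -1).
area = pd.Series([640.0, 1500.0, 2000.0, 2800.0, 9870.0, np.nan], name="area_digit_1")

# log1p compresses the long right tail; NaN entries stay NaN.
area_log = np.log1p(area)

print(f"skew before: {area.skew():.3f}, after: {area_log.skew():.3f}")
```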

In [ ]:
#Next set of variables are height of each digit bounding boxes

plt.figure()
fig, ax = plt.subplots(4, 3,figsize=(14, 20))
cont_features = [f"height_digit_{i}" for i in range(1,13)]
for i,feature in enumerate(cont_features):
    plt.subplot(4, 3,i+1)
    df[feature].fillna(-1,inplace=True)
    sns.distplot(df[feature] , color='Red')
    plt.xlabel(f"Frequency distribution for {feature}", fontsize=12);# plt.legend()
plt.show()
<Figure size 432x288 with 0 Axes>

There is a lot of variance in the height of the bounding boxes. This may explain the different bounding-box sizes: as we can see, the size differs for some digits, in particular 11 and 12.

In [ ]:
#Next set of variables are width for each digit bounding boxes

plt.figure()
fig, ax = plt.subplots(4, 3,figsize=(14, 20))
cont_features = [f"width_digit_{i}" for i in range(1,13)]
for i,feature in enumerate(cont_features):
    plt.subplot(4, 3,i+1)
    df[feature].fillna(-1,inplace=True)
    sns.distplot(df[feature] , color='Red')
    plt.xlabel(f"Frequency distribution for {feature}", fontsize=12); # plt.legend()
plt.show()
<Figure size 432x288 with 0 Axes>

Again, we can notice some skewness and a large portion of missing values in these variables. The variance also differs for most of the variables.

In [ ]:
# we will now look into the variance features for width and height, to get insight into the data
plt.figure()
fig, ax = plt.subplots(1, 1,figsize=(14, 8))
sns.distplot(df['variance_height'],color="blue", kde=True,bins=120, label='variance_height')
sns.distplot(df['variance_width'],color="red", kde=True,bins=120, label='variance_width')
# sns.distplot(df['variance_area'],color="green", kde=True,bins=120, label='variance_area')
plt.title('Variance in height and width features',size = 20)
plt.legend()
plt.show()
<Figure size 432x288 with 0 Axes>

Surprisingly, the two distributions are almost identical, which at least tells us that the height and width variables are well correlated. Another thing to notice is that the area variable can be derived by multiplying the height and width features, since area = H*W for a bounding box. (Sadly, we can't recover the missing values in area from H and W, because they are missing in those two variables as well.)
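
The area = H*W relation and the point about missing values can be checked with a few lines. The sketch below uses a made-up frame whose numbers were chosen to satisfy the relation, so it illustrates the check rather than proving anything about the real data.

```python
import numpy as np
import pandas as pd

# Made-up bounding boxes for one digit; row 2 is missing in all three columns.
toy = pd.DataFrame({
    "height_digit_1": [40.0, 59.0, np.nan, 73.0],
    "width_digit_1": [16.0, 35.0, np.nan, 47.0],
    "area_digit_1": [640.0, 2065.0, np.nan, 3431.0],
})

reconstructed = toy["height_digit_1"] * toy["width_digit_1"]

# Compare only rows where both the reconstructed and the stored area exist.
both_present = reconstructed.notna() & toy["area_digit_1"].notna()
matches = np.isclose(reconstructed[both_present], toy.loc[both_present, "area_digit_1"])
print("area equals H*W on all complete rows:", matches.all())

# Rows where area is missing but could be rebuilt from H and W.
recoverable = toy["area_digit_1"].isna() & reconstructed.notna()
print("recoverable missing areas:", int(recoverable.sum()))
```

If `recoverable` is zero on the real data as well, the area columns carry no information beyond height and width.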

In [ ]:
# we will look into the variance of area, to get insight into the data
plt.figure()
fig, ax = plt.subplots(1, 1,figsize=(14, 8))
sns.distplot(df['variance_area'],color="green", kde=True,bins=120, label='variance_area')
plt.title('Variance in area feature',size = 20)
plt.legend()
plt.show()
<Figure size 432x288 with 0 Axes>
In [ ]:
#Next set of variables are Angle calculated as counterclockwise and clockwise sum,variance

plt.figure()
fig, ax = plt.subplots(2, 1,figsize=(14, 8))
cont_features = ['between_digits_angle_cw_sum','between_digits_angle_ccw_sum']
for i,feature in enumerate(cont_features):
    plt.subplot(2, 1,i+1)
    df[feature].fillna(-1,inplace=True)
    sns.countplot(data=df, x=feature,palette='rocket')
    plt.xlabel(f"count values Frequency distribution for {feature}", fontsize=12); # plt.legend()
plt.show()
<Figure size 432x288 with 0 Axes>
In [ ]:
#same with variance variable 

plt.figure()
fig, ax = plt.subplots(2, 1,figsize=(14, 8))
cont_features = ['between_digits_angle_cw_sum','between_digits_angle_ccw_sum']
for i,feature in enumerate(cont_features):
    plt.subplot(2, 1,i+1)
    df[feature].fillna(-1,inplace=True)
#     sns.distplot(df[feature],color="blue", kde=True,bins=120, label='sum')
    sns.distplot(df[feature.replace('sum','var')],color="red", kde=True,bins=120, label='var')
    plt.xlabel(f"Frequency distribution for {feature.replace('sum','var')}", fontsize=12); # plt.legend()
plt.show()
<Figure size 432x288 with 0 Axes>

In both cases the majority of the variance values are concentrated at 0, which indicates either very precise data or a large number of missing values; we can confirm the latter from the sum count plots.
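
To tell the two explanations apart numerically, one can compare the share of exact zeros with the share of missing entries. The following is a rough sketch on invented values; on the real data it would have to run before the `fillna(-1)` calls above replace the NaNs.

```python
import numpy as np
import pandas as pd

# Invented values for one sum/variance pair, with NaN marking missing entries.
toy = pd.DataFrame({
    "between_digits_angle_cw_sum": [360.0, 360.0, np.nan, 0.0, 360.0],
    "between_digits_angle_cw_var": [51.7, 0.0, np.nan, 0.0, 165.7],
})

summary = pd.Series({
    # NaN == 0 evaluates to False, so missing entries are not counted as zeros.
    "var_zero_share": (toy["between_digits_angle_cw_var"] == 0).mean(),
    "var_missing_share": toy["between_digits_angle_cw_var"].isna().mean(),
    "sum_missing_share": toy["between_digits_angle_cw_sum"].isna().mean(),
})
print(summary)
```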

In [ ]:
features = df_orig.columns[1:-1].to_list()
for f in features:
    print(f" {f} is having : {df[f].nunique()} distinct values")
 number_of_digits is having : 18 distinct values
 missing_digit_1 is having : 2 distinct values
 missing_digit_2 is having : 2 distinct values
 missing_digit_3 is having : 2 distinct values
 missing_digit_4 is having : 2 distinct values
 missing_digit_5 is having : 2 distinct values
 missing_digit_6 is having : 2 distinct values
 missing_digit_7 is having : 2 distinct values
 missing_digit_8 is having : 2 distinct values
 missing_digit_9 is having : 2 distinct values
 missing_digit_10 is having : 2 distinct values
 missing_digit_11 is having : 2 distinct values
 missing_digit_12 is having : 2 distinct values
 1 dist from cen is having : 21148 distinct values
 10 dist from cen is having : 22765 distinct values
 11 dist from cen is having : 22258 distinct values
 12 dist from cen is having : 21357 distinct values
 2 dist from cen is having : 22905 distinct values
 3 dist from cen is having : 23065 distinct values
 4 dist from cen is having : 22437 distinct values
 5 dist from cen is having : 21245 distinct values
 6 dist from cen is having : 21664 distinct values
 7 dist from cen is having : 23415 distinct values
 8 dist from cen is having : 23604 distinct values
 9 dist from cen is having : 20996 distinct values
 euc_dist_digit_1 is having : 23913 distinct values
 euc_dist_digit_2 is having : 26417 distinct values
 euc_dist_digit_3 is having : 912 distinct values
 euc_dist_digit_4 is having : 25887 distinct values
 euc_dist_digit_5 is having : 24739 distinct values
 euc_dist_digit_6 is having : 889 distinct values
 euc_dist_digit_7 is having : 26646 distinct values
 euc_dist_digit_8 is having : 27017 distinct values
 euc_dist_digit_9 is having : 919 distinct values
 euc_dist_digit_10 is having : 26196 distinct values
 euc_dist_digit_11 is having : 25628 distinct values
 euc_dist_digit_12 is having : 912 distinct values
 area_digit_1 is having : 1966 distinct values
 area_digit_2 is having : 3201 distinct values
 area_digit_3 is having : 3624 distinct values
 area_digit_4 is having : 3816 distinct values
 area_digit_5 is having : 4275 distinct values
 area_digit_6 is having : 3960 distinct values
 area_digit_7 is having : 3400 distinct values
 area_digit_8 is having : 3884 distinct values
 area_digit_9 is having : 3707 distinct values
 area_digit_10 is having : 3733 distinct values
 area_digit_11 is having : 3450 distinct values
 area_digit_12 is having : 4332 distinct values
 height_digit_1 is having : 124 distinct values
 height_digit_2 is having : 163 distinct values
 height_digit_3 is having : 170 distinct values
 height_digit_4 is having : 178 distinct values
 height_digit_5 is having : 171 distinct values
 height_digit_6 is having : 177 distinct values
 height_digit_7 is having : 157 distinct values
 height_digit_8 is having : 185 distinct values
 height_digit_9 is having : 195 distinct values
 height_digit_10 is having : 162 distinct values
 height_digit_11 is having : 151 distinct values
 height_digit_12 is having : 169 distinct values
 width_digit_1 is having : 127 distinct values
 width_digit_2 is having : 141 distinct values
 width_digit_3 is having : 147 distinct values
 width_digit_4 is having : 154 distinct values
 width_digit_5 is having : 177 distinct values
 width_digit_6 is having : 169 distinct values
 width_digit_7 is having : 148 distinct values
 width_digit_8 is having : 166 distinct values
 width_digit_9 is having : 161 distinct values
 width_digit_10 is having : 159 distinct values
 width_digit_11 is having : 162 distinct values
 width_digit_12 is having : 180 distinct values
 variance_width is having : 24755 distinct values
 variance_height is having : 24130 distinct values
 variance_area is having : 32296 distinct values
 deviation_dist_from_mid_axis is having : 4336 distinct values
 between_axis_digits_angle_sum is having : 74 distinct values
 between_axis_digits_angle_var is having : 30816 distinct values
 between_digits_angle_cw_sum is having : 5 distinct values
 between_digits_angle_cw_var is having : 32079 distinct values
 between_digits_angle_ccw_sum is having : 4 distinct values
 between_digits_angle_ccw_var is having : 32079 distinct values
 sequence_flag_cw is having : 2 distinct values
 sequence_flag_ccw is having : 2 distinct values
 number_of_hands is having : 7 distinct values
 hand_count_dummy is having : 3 distinct values
 hour_hand_length is having : 11931 distinct values
 minute_hand_length is having : 13480 distinct values
 single_hand_length is having : 6678 distinct values
 clockhand_ratio is having : 22642 distinct values
 clockhand_diff is having : 22834 distinct values
 angle_between_hands is having : 22838 distinct values
 deviation_from_centre is having : 22793 distinct values
 intersection_pos_rel_centre is having : 4 distinct values
 hour_proximity_from_11 is having : 20153 distinct values
 minute_proximity_from_2 is having : 19899 distinct values
 hour_pointing_digit is having : 12 distinct values
 actual_hour_digit is having : 1 distinct values
 minute_pointing_digit is having : 12 distinct values
 actual_minute_digit is having : 1 distinct values
 final_rotation_angle is having : 13 distinct values
 ellipse_circle_ratio is having : 32038 distinct values
 count_defects is having : 172 distinct values
 percentage_inside_ellipse is having : 70 distinct values
 pred_tremor is having : 2 distinct values
 double_major is having : 32328 distinct values
 double_minor is having : 32469 distinct values
 vertical_dist is having : 32534 distinct values
 horizontal_dist is having : 32597 distinct values
 top_area_perc is having : 31153 distinct values
 bottom_area_perc is having : 31149 distinct values
 left_area_perc is having : 31163 distinct values
 right_area_perc is having : 31162 distinct values
 hor_count is having : 4 distinct values
 vert_count is having : 4 distinct values
 eleven_ten_error is having : 2 distinct values
 other_error is having : 2 distinct values
 time_diff is having : 140 distinct values
 centre_dot_detect is having : 2 distinct values
In [ ]:
# Count plots for the categorical features that take only a few distinct values

cat_features = ['sequence_flag_cw',
 'sequence_flag_ccw',
 'number_of_hands',
 'hand_count_dummy',
 'pred_tremor',
 'hor_count',
 'vert_count',
 'eleven_ten_error',
 'other_error',
 'centre_dot_detect']
fig, axes = plt.subplots(4, 3, figsize=(14, 20))
for ax, feature in zip(axes.flat, cat_features):
    # Assign the filled column back so that df itself is updated;
    # missing values then show up as their own bar at -1
    df[feature] = df[feature].fillna(-1)
    sns.countplot(data=df, x=feature, palette='rocket', ax=ax)
    ax.set_xlabel(f"Count Values for {feature}", fontsize=12)
plt.show()

The count plots above cover the categorical features with at most 7 distinct values. We can already see some odd patterns in hand_count_dummy and number_of_hands. On closer inspection, the abnormal values (greater than 2) seem to come from rows with the normal diagnosis label.
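
One way to check where the abnormal hand counts come from is to filter those rows and count them per label. The sketch below uses a small made-up frame (`toy`) in place of the real training data, and the threshold of 2 simply follows the observation above. It illustrates the check and is not the exact code used for this notebook.

```python
import pandas as pd

# Hypothetical stand-in for the training frame; the notebook would use `df`.
toy = pd.DataFrame({
    "number_of_hands": [2, 2, 3, 5, 1, 2, 4, 2],
    "diagnosis": ["normal", "pre_alzheimer", "normal", "normal",
                  "post_alzheimer", "normal", "normal", "post_alzheimer"],
})

# Rows whose hand count is above the expected maximum of two.
abnormal = toy[toy["number_of_hands"] > 2]

# Number of abnormal rows per diagnosis label.
abnormal_by_class = abnormal["diagnosis"].value_counts()
print(abnormal_by_class)
```

On the real data, the same two lines applied to `df` would show how the abnormal counts split across the three labels.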

In [ ]:
#check the null values in training data
print(f" Training data has Null values : {df_orig.isnull().sum()}")
 Training data has Null values : row_id                               0
number_of_digits                    74
missing_digit_1                     74
missing_digit_2                     74
missing_digit_3                     74
missing_digit_4                     74
missing_digit_5                     74
missing_digit_6                     74
missing_digit_7                     74
missing_digit_8                     74
missing_digit_9                     74
missing_digit_10                    74
missing_digit_11                    74
missing_digit_12                    74
1 dist from cen                   7329
10 dist from cen                  4895
11 dist from cen                  5576
12 dist from cen                  3840
2 dist from cen                   4922
3 dist from cen                   4165
4 dist from cen                   5526
5 dist from cen                   6685
6 dist from cen                   4370
7 dist from cen                   4222
8 dist from cen                   4022
9 dist from cen                   5803
euc_dist_digit_1                  7377
euc_dist_digit_2                  4977
euc_dist_digit_3                  4174
euc_dist_digit_4                  5539
euc_dist_digit_5                  6695
euc_dist_digit_6                  4383
euc_dist_digit_7                  4286
euc_dist_digit_8                  4136
euc_dist_digit_9                  5841
euc_dist_digit_10                 4939
euc_dist_digit_11                 5626
euc_dist_digit_12                 3868
area_digit_1                      7329
area_digit_2                      4922
area_digit_3                      4165
area_digit_4                      5526
area_digit_5                      6685
area_digit_6                      4370
area_digit_7                      4222
area_digit_8                      4022
area_digit_9                      5803
area_digit_10                     4895
area_digit_11                     5576
area_digit_12                     3840
height_digit_1                    7329
height_digit_2                    4922
height_digit_3                    4165
height_digit_4                    5526
height_digit_5                    6685
height_digit_6                    4370
height_digit_7                    4222
height_digit_8                    4022
height_digit_9                    5803
height_digit_10                   4895
height_digit_11                   5576
height_digit_12                   3840
width_digit_1                     7329
width_digit_2                     4922
width_digit_3                     4165
width_digit_4                     5526
width_digit_5                     6685
width_digit_6                     4370
width_digit_7                     4222
width_digit_8                     4022
width_digit_9                     5803
width_digit_10                    4895
width_digit_11                    5576
width_digit_12                    3840
variance_width                     464
variance_height                    464
variance_area                      464
deviation_dist_from_mid_axis       579
between_axis_digits_angle_sum     3636
between_axis_digits_angle_var     1907
between_digits_angle_cw_sum      12750
between_digits_angle_cw_var        692
between_digits_angle_ccw_sum     31933
between_digits_angle_ccw_var       692
sequence_flag_cw                   303
sequence_flag_ccw                  303
number_of_hands                   2154
hand_count_dummy                  2154
hour_hand_length                  9916
minute_hand_length                9916
single_hand_length               25036
clockhand_ratio                  10127
clockhand_diff                    9942
angle_between_hands               9916
deviation_from_centre             9984
intersection_pos_rel_centre       9916
hour_proximity_from_11           12586
minute_proximity_from_2          12858
hour_pointing_digit              10100
actual_hour_digit                    0
minute_pointing_digit            10099
actual_minute_digit                  0
final_rotation_angle                74
ellipse_circle_ratio               738
count_defects                        0
percentage_inside_ellipse          305
pred_tremor                          0
double_major                       237
double_minor                         1
vertical_dist                      188
horizontal_dist                     42
top_area_perc                     1559
bottom_area_perc                  1559
left_area_perc                    1559
right_area_perc                   1559
hor_count                            0
vert_count                           0
eleven_ten_error                     0
other_error                          0
time_diff                        10251
centre_dot_detect                 9951
diagnosis                            0
dtype: int64
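
Raw null counts are hard to compare at a glance (for example, between_digits_angle_ccw_sum is missing in 31933 rows and single_hand_length in 25036). A possible follow-up, sketched here on a made-up frame named `toy`, is to express the counts as a fraction of rows and sort them so that the sparsest columns come first.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_orig with a few of its column names.
toy = pd.DataFrame({
    "hour_hand_length": [50.0, np.nan, 47.5, np.nan],
    "count_defects": [90, 100, 110, 120],
    "single_hand_length": [np.nan, np.nan, np.nan, 40.0],
})

# isnull() gives a boolean frame; its column mean is the fraction of missing rows.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```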
In [ ]:
# Finally, we look at the distributions of the remaining features, which have enough distinct values for a distribution plot
cont_features = ['deviation_dist_from_mid_axis',
 'between_axis_digits_angle_sum',
 'between_axis_digits_angle_var',
 'hour_hand_length',
 'minute_hand_length',
 'single_hand_length',
 'clockhand_ratio',
 'clockhand_diff',
 'angle_between_hands',
 'deviation_from_centre',
 'hour_proximity_from_11',
 'minute_proximity_from_2',
 ]
fig, axes = plt.subplots(4, 3, figsize=(14, 20))
for ax, feature in zip(axes.flat, cont_features):
    # Assign the filled column back so that df itself is updated
    df[feature] = df[feature].fillna(-1)
    sns.distplot(df[feature], color='blue', ax=ax)
    ax.set_xlabel(f"Distribution for {feature}", fontsize=12)
plt.show()
In [ ]:
cont_features = ['hour_pointing_digit',
 'minute_pointing_digit',
 'final_rotation_angle',
 'ellipse_circle_ratio',
 'count_defects',
 'percentage_inside_ellipse',
 'double_major',
 'double_minor',
 'vertical_dist',
 'horizontal_dist',
 'top_area_perc',
 'bottom_area_perc',
 'left_area_perc',
 'right_area_perc',
 'time_diff']
fig, axes = plt.subplots(5, 3, figsize=(14, 20))
for ax, feature in zip(axes.flat, cont_features):
    # Assign the filled column back so that df itself is updated
    df[feature] = df[feature].fillna(-1)
    sns.distplot(df[feature], color='blue', ax=ax)
    ax.set_xlabel(f"Distribution for {feature}", fontsize=12)
plt.show()
From the above distributions we can see that some features have low variance, others have values concentrated in a narrow range, and a few look discrete even though they are stored as continuous values.
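
The extreme case of low variance is a column with a single value: the distinct-value listing earlier reports exactly 1 distinct value for actual_hour_digit and actual_minute_digit. A minimal sketch of how such columns could be detected is given below. The frame `toy` is made up, and dropping the columns is only one possible follow-up, not something this notebook does.

```python
import pandas as pd

# Hypothetical stand-in for the training frame.
toy = pd.DataFrame({
    "actual_hour_digit": [11, 11, 11, 11],
    "actual_minute_digit": [2, 2, 2, 2],
    "count_defects": [90, 100, 110, 120],
})

# nunique(dropna=False) counts NaN as a value of its own, so a column
# that is partly missing is not reported as constant.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) <= 1]

# Constant columns cannot help a classifier separate the classes.
informative = toy.drop(columns=constant_cols)
print(constant_cols)
```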
In [ ]:
# One feature holds true categorical (string) values: intersection_pos_rel_centre

# Assign the filled column back so that df itself is updated
df['intersection_pos_rel_centre'] = df['intersection_pos_rel_centre'].fillna(-1)
plt.figure(figsize=(14,8))
ax = sns.countplot(data=df, x="intersection_pos_rel_centre", palette='rocket', hue="diagnosis")
ax.set_xlabel("categories in intersection_pos_rel_centre", size=15)
ax.set_ylabel("Frequency values", size=15)
plt.title('Categorical values distribution with classes', size=20)
plt.show()
In [ ]:
# Look at the target variable to get more insight into the data
fig, ax = plt.subplots(1, 1, figsize=(14, 8))
sns.countplot(data=df, x='diagnosis', palette='rocket', ax=ax)
plt.title('Distribution of target variable', size=20)
plt.show()

The classes are heavily imbalanced. We will address this later, during feature engineering, to make the class distribution more even.
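
To put a number on the imbalance rather than reading it off the plot, the class shares can be computed directly. The sketch below uses a made-up label series, so the 8:1:1 split is purely illustrative and says nothing about the real proportions.

```python
import pandas as pd

# Hypothetical labels; the notebook would use df['diagnosis'].
labels = pd.Series(
    ["normal"] * 8 + ["pre_alzheimer"] + ["post_alzheimer"],
    name="diagnosis",
)

# Absolute counts and relative shares per class.
class_counts = labels.value_counts()
class_share = labels.value_counts(normalize=True)

# Ratio between the largest and the smallest class.
imbalance_ratio = class_counts.max() / class_counts.min()
print(class_share)
print(f"largest / smallest class: {imbalance_ratio:.1f}")
```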

In [ ]:
def CorrMtx(df, dropDuplicates = True):

    # Compute the correlation matrix from the numeric columns only;
    # string columns such as diagnosis cannot be correlated.
    df = df.select_dtypes(include='number').corr()

    # Exclude duplicate correlations by masking upper right values
    if dropDuplicates:    
        mask = np.zeros_like(df, dtype=bool)
        mask[np.triu_indices_from(mask)] = True

    # Set background color / chart style
    sns.set_style(style = 'white')

    # Set up  matplotlib figure
    f, ax = plt.subplots(figsize=(11, 9))

    # Add diverging colormap from red to blue
    cmap = sns.diverging_palette(250, 10, as_cmap=True)

    # Draw correlation plot with or without duplicates
    if dropDuplicates:
        sns.heatmap(df, mask=mask, cmap=cmap, 
                square=True,
                linewidth=.5, cbar_kws={"shrink": .5}, ax=ax)
    else:
        sns.heatmap(df, cmap=cmap, 
                square=True,
                linewidth=.5, cbar_kws={"shrink": .5}, ax=ax)
df.fillna(-1,inplace=True)
CorrMtx(df, dropDuplicates =False)

The correlation plot gives a lot of insight into the data. For example, some clusters of features are strongly correlated with each other, while others are negatively correlated with a few features.
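
With this many features a heatmap is hard to read cell by cell, so it can help to list the most strongly correlated pairs explicitly. The following is a minimal sketch on a made-up three-column frame (`toy`, with columns `a`, `b`, `c` that do not exist in the dataset). It ranks every unordered pair by absolute correlation.

```python
from itertools import combinations

import pandas as pd

# Hypothetical numeric frame; b is an exact multiple of a.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})

corr = toy.corr()

# Visit each unordered pair once and rank by absolute correlation,
# so strong negative relationships are surfaced as well.
ranked_pairs = sorted(
    ((abs(corr.loc[x, y]), x, y) for x, y in combinations(corr.columns, 2)),
    reverse=True,
)

for strength, x, y in ranked_pairs:
    print(f"{x} ~ {y}: |r| = {strength:.3f}")
```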

Feature Engineering and Data Preparation

FE - Part I: Creating new features

In [ ]:
# Now we apply some feature engineering based on the conclusions drawn from the EDA above
df = df_orig.copy()

# Standardize features
def standardize(df):
    numeric = df.select_dtypes(include=['int64', 'float64'])
    
    # subtract the mean and divide by the standard deviation
    df[numeric.columns] = (numeric - numeric.mean()) / numeric.std()
    
    return df
 
# For now, fill the missing values with the sentinel -999
df.fillna(-999,inplace=True)

#Create more features from categorical features
df_dummies = pd.get_dummies(df['intersection_pos_rel_centre'], columns='intersection_pos_rel_centre',
                          dummy_na=False).add_prefix('c_i_')
df = df.drop('intersection_pos_rel_centre', axis=1)
df = pd.concat([df, df_dummies], axis=1)

df_dummies = pd.get_dummies(df['hand_count_dummy'], columns='hand_count_dummy',
                          dummy_na=False).add_prefix('c_h_')
df = df.drop('hand_count_dummy', axis=1)
df = pd.concat([df, df_dummies], axis=1)

feat_col = df['final_rotation_angle']
df['rotation_angle_180'] = (feat_col <= 180).astype('int')    # missing values were filled with -999 above, so they also land in this column
df['rotation_angle_360'] = (feat_col > 180).astype('int') 
df = df.drop('final_rotation_angle', axis=1)

features =df.columns[1:].to_list()
features.remove('diagnosis')

# Standardization is currently not used; uncomment the line below to enable it
# df = standardize(df)

features
Out[ ]:
['number_of_digits',
 'missing_digit_1',
 'missing_digit_2',
 'missing_digit_3',
 'missing_digit_4',
 'missing_digit_5',
 'missing_digit_6',
 'missing_digit_7',
 'missing_digit_8',
 'missing_digit_9',
 'missing_digit_10',
 'missing_digit_11',
 'missing_digit_12',
 '1 dist from cen',
 '10 dist from cen',
 '11 dist from cen',
 '12 dist from cen',
 '2 dist from cen',
 '3 dist from cen',
 '4 dist from cen',
 '5 dist from cen',
 '6 dist from cen',
 '7 dist from cen',
 '8 dist from cen',
 '9 dist from cen',
 'euc_dist_digit_1',
 'euc_dist_digit_2',
 'euc_dist_digit_3',
 'euc_dist_digit_4',
 'euc_dist_digit_5',
 'euc_dist_digit_6',
 'euc_dist_digit_7',
 'euc_dist_digit_8',
 'euc_dist_digit_9',
 'euc_dist_digit_10',
 'euc_dist_digit_11',
 'euc_dist_digit_12',
 'area_digit_1',
 'area_digit_2',
 'area_digit_3',
 'area_digit_4',
 'area_digit_5',
 'area_digit_6',
 'area_digit_7',
 'area_digit_8',
 'area_digit_9',
 'area_digit_10',
 'area_digit_11',
 'area_digit_12',
 'height_digit_1',
 'height_digit_2',
 'height_digit_3',
 'height_digit_4',
 'height_digit_5',
 'height_digit_6',
 'height_digit_7',
 'height_digit_8',
 'height_digit_9',
 'height_digit_10',
 'height_digit_11',
 'height_digit_12',
 'width_digit_1',
 'width_digit_2',
 'width_digit_3',
 'width_digit_4',
 'width_digit_5',
 'width_digit_6',
 'width_digit_7',
 'width_digit_8',
 'width_digit_9',
 'width_digit_10',
 'width_digit_11',
 'width_digit_12',
 'variance_width',
 'variance_height',
 'variance_area',
 'deviation_dist_from_mid_axis',
 'between_axis_digits_angle_sum',
 'between_axis_digits_angle_var',
 'between_digits_angle_cw_sum',
 'between_digits_angle_cw_var',
 'between_digits_angle_ccw_sum',
 'between_digits_angle_ccw_var',
 'sequence_flag_cw',
 'sequence_flag_ccw',
 'number_of_hands',
 'hour_hand_length',
 'minute_hand_length',
 'single_hand_length',
 'clockhand_ratio',
 'clockhand_diff',
 'angle_between_hands',
 'deviation_from_centre',
 'hour_proximity_from_11',
 'minute_proximity_from_2',
 'hour_pointing_digit',
 'actual_hour_digit',
 'minute_pointing_digit',
 'actual_minute_digit',
 'ellipse_circle_ratio',
 'count_defects',
 'percentage_inside_ellipse',
 'pred_tremor',
 'double_major',
 'double_minor',
 'vertical_dist',
 'horizontal_dist',
 'top_area_perc',
 'bottom_area_perc',
 'left_area_perc',
 'right_area_perc',
 'hor_count',
 'vert_count',
 'eleven_ten_error',
 'other_error',
 'time_diff',
 'centre_dot_detect',
 'c_i_-999',
 'c_i_BL',
 'c_i_BR',
 'c_i_TL',
 'c_i_TR',
 'c_h_-999.0',
 'c_h_1.0',
 'c_h_2.0',
 'c_h_3.0',
 'rotation_angle_180',
 'rotation_angle_360']
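
The `c_i_-999` and `c_h_-999.0` columns in the list above exist because the missing values were filled with -999 before one-hot encoding. An alternative, sketched here on a made-up frame and not what this notebook does, is to let `pd.get_dummies` create the missing-value indicator itself through `dummy_na=True` and set the prefix in the same call.

```python
import pandas as pd

# Hypothetical frame with one missing category value.
toy = pd.DataFrame({
    "intersection_pos_rel_centre": ["TL", "BR", None, "TL"],
    "number_of_hands": [2, 2, 1, 2],
})

# One call: encode the column, prefix the new names, and add an
# indicator column for rows where the category is missing.
encoded = pd.get_dummies(
    toy,
    columns=["intersection_pos_rel_centre"],
    prefix="c_i",
    dummy_na=True,
)
print(encoded.columns.tolist())
```

The original column is dropped automatically, so no separate `drop` and `concat` steps are needed.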

FE - Part II: Dealing with Class Imbalance

In [ ]:
# Undersample the majority class, one of the methods described in
# https://www.aicrowd.com/showcase/dealing-with-class-imbalance
# (a great notebook, check it out)

df_final = pd.concat([
    df.loc[df.diagnosis == 'pre_alzheimer'],
    df.loc[df.diagnosis == 'post_alzheimer'],
    df.loc[df.diagnosis == 'normal'].sample(frac=1/6),
]).reset_index(drop=True)



train_data = df_final[features]

target_dict = {'normal':0, 'post_alzheimer':1, 'pre_alzheimer':2}
remap_vals = {0:'normal', 1:'post_alzheimer',2:'pre_alzheimer'}
train_labels = df_final['diagnosis'].map(target_dict).astype('int')
train_data.describe()
Out[ ]:
number_of_digits missing_digit_1 missing_digit_2 missing_digit_3 missing_digit_4 missing_digit_5 missing_digit_6 missing_digit_7 missing_digit_8 missing_digit_9 missing_digit_10 missing_digit_11 missing_digit_12 1 dist from cen 10 dist from cen 11 dist from cen 12 dist from cen 2 dist from cen 3 dist from cen 4 dist from cen 5 dist from cen 6 dist from cen 7 dist from cen 8 dist from cen 9 dist from cen euc_dist_digit_1 euc_dist_digit_2 euc_dist_digit_3 euc_dist_digit_4 euc_dist_digit_5 euc_dist_digit_6 euc_dist_digit_7 euc_dist_digit_8 euc_dist_digit_9 euc_dist_digit_10 euc_dist_digit_11 euc_dist_digit_12 area_digit_1 area_digit_2 area_digit_3 area_digit_4 area_digit_5 area_digit_6 area_digit_7 area_digit_8 area_digit_9 area_digit_10 area_digit_11 area_digit_12 height_digit_1 height_digit_2 height_digit_3 height_digit_4 height_digit_5 height_digit_6 height_digit_7 height_digit_8 height_digit_9 height_digit_10 height_digit_11 height_digit_12 width_digit_1 width_digit_2 width_digit_3 width_digit_4 width_digit_5 width_digit_6 width_digit_7 width_digit_8 width_digit_9 width_digit_10 width_digit_11 width_digit_12 variance_width variance_height variance_area deviation_dist_from_mid_axis between_axis_digits_angle_sum between_axis_digits_angle_var between_digits_angle_cw_sum between_digits_angle_cw_var between_digits_angle_ccw_sum between_digits_angle_ccw_var sequence_flag_cw sequence_flag_ccw number_of_hands hour_hand_length minute_hand_length single_hand_length clockhand_ratio clockhand_diff angle_between_hands deviation_from_centre hour_proximity_from_11 minute_proximity_from_2 hour_pointing_digit actual_hour_digit minute_pointing_digit actual_minute_digit ellipse_circle_ratio count_defects percentage_inside_ellipse pred_tremor double_major double_minor vertical_dist horizontal_dist top_area_perc bottom_area_perc left_area_perc right_area_perc hor_count vert_count eleven_ten_error other_error time_diff centre_dot_detect c_i_-999 c_i_BL c_i_BR c_i_TL c_i_TR c_h_-999.0 
c_h_1.0 c_h_2.0 c_h_3.0 rotation_angle_180 rotation_angle_360
count 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.00000 6770.00000 6770.000000 6770.000000 6770.000000 6.770000e+03 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.0 6770.000000 6770.0 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000 6770.000000
mean 2.205465 -7.251551 -7.361300 -7.353767 -7.318168 -7.284786 -7.345495 -7.356721 -7.366765 -7.297932 -7.329985 -7.308715 -7.380059 -22.195355 85.820879 57.993805 158.878009 115.693513 95.045635 45.866884 0.618808 95.241264 122.723867 138.123734 49.233093 -260.504807 -144.774563 -150.807905 -188.369042 -224.448232 -163.786662 -151.936662 -142.008238 -210.628043 -176.689947 -199.641637 -127.947171 1403.209601 3704.183900 3996.926588 4343.154210 5229.291581 4716.534417 3895.291285 4670.187740 4162.444313 5161.030428 4045.102068 5928.502954 -238.270015 -112.619055 -113.378582 -146.426440 -181.362038 -116.460266 -109.724668 -95.389808 -167.023191 -137.998818 -165.289217 -86.609749 -251.928656 -119.627917 -128.924520 -163.127622 -188.362629 -131.741507 -126.184786 -109.935303 -185.789217 -139.82644 -169.98449 -78.567947 339.246680 297.689181 5.505836e+06 -5.602571 161.265643 2925.461328 -221.238988 3939.049096 -935.865731 4029.426765 -13.038257 -13.686263 -108.993353 -350.062017 -337.836826 -703.105294 -393.609485 -376.250703 -332.746307 -378.064557 -457.367168 -446.508045 -387.701773 11.0 -390.715953 2.0 49.553896 90.502806 -16.050169 0.351256 108.758189 105.726612 101.984333 112.154405 -52.487678 -52.529321 -52.477276 -52.535546 0.634417 0.703397 0.029985 0.675037 -329.566470 -388.681093 0.387592 0.137371 0.071344 0.268390 0.135303 0.110635 0.263811 0.612408 0.013146 0.806942 0.193058
std 87.278148 86.411620 86.401703 86.402388 86.405617 86.408632 86.403140 86.402120 86.401206 86.407446 86.404547 86.406472 86.399994 613.451077 550.043566 570.670442 494.856191 510.664305 514.112826 548.877680 576.482521 528.160353 521.823106 510.976726 583.188112 465.404599 393.170745 397.830391 425.453682 446.657144 402.940222 396.300214 389.499337 439.277714 417.979123 432.997542 372.519664 1767.826721 3103.843143 3340.847532 3690.797373 4749.247871 3699.724917 3047.743663 3642.054339 3728.097547 4299.486305 3650.198329 4513.071386 476.762205 404.418243 414.760512 446.948512 470.622550 425.089259 412.340230 404.560842 462.192311 435.362023 449.024706 388.646802 468.113349 401.068195 407.332302 437.938456 466.615689 417.432748 404.533581 397.901991 451.560504 434.40546 446.50366 392.231714 409.979873 406.970543 8.156777e+06 200.561033 469.422919 6543.878818 668.442520 6925.440770 276.609069 7079.345020 116.371406 116.294178 313.929704 516.422340 526.139065 479.932534 489.028573 496.811455 530.473060 496.803408 513.079050 518.163306 492.436627 0.0 490.465431 0.0 173.934939 40.263078 129.222672 0.477398 115.397321 19.316798 103.804854 43.576149 223.997320 223.987464 223.999767 223.985977 0.671055 0.693385 0.170559 0.468396 567.379782 487.238454 0.487237 0.344264 0.257418 0.443154 0.342072 0.313703 0.440731 0.487237 0.113909 0.394727 0.394727
min -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.00000 -999.00000 -999.000000 -999.000000 -999.000000 -9.990000e+02 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 11.0 -999.000000 2.0 -999.000000 1.000000 -999.000000 0.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 0.000000 0.000000 0.000000 0.000000 -999.000000 -999.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 9.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -999.000000 276.467545 260.469886 309.404670 272.446325 256.152779 238.876537 108.130784 277.323332 291.351686 300.264778 238.596499 -999.000000 3.367697 3.120000 1.670401 0.062912 1.950000 2.649224 3.084697 0.520000 1.696822 0.889803 2.470000 -999.000000 2160.000000 2295.000000 2405.250000 1774.000000 2684.000000 2378.000000 2829.750000 1841.250000 2816.000000 2024.250000 3422.000000 -999.000000 46.000000 52.000000 48.000000 40.250000 49.000000 49.000000 51.000000 40.000000 51.000000 44.000000 53.000000 -999.000000 43.000000 39.000000 41.000000 37.000000 43.000000 39.000000 43.000000 35.000000 50.00000 41.00000 59.000000 162.265828 137.811688 1.486458e+06 9.880000 360.000000 80.625211 -999.000000 56.718975 -999.000000 57.144170 0.000000 0.000000 1.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 11.0 -999.000000 2.0 75.499059 74.000000 1.000000 0.000000 114.799143 101.724518 106.522653 109.362514 0.464936 0.470940 0.497451 0.449839 0.000000 0.000000 0.000000 0.000000 -999.000000 -999.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000
50% 11.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 342.639388 355.813505 353.080904 366.623104 339.164783 327.199862 321.226749 316.376750 342.074735 356.309907 360.365959 362.882729 8.703903 16.609486 15.210000 13.757949 11.869623 11.180000 12.582122 14.012915 10.855000 12.929675 11.513083 10.140000 1674.000000 3599.000000 3940.500000 4480.000000 5368.500000 4872.000000 4032.000000 4606.000000 4390.500000 5211.000000 4160.000000 5656.500000 49.000000 63.000000 71.000000 75.000000 76.000000 77.000000 72.000000 77.000000 77.000000 72.000000 66.000000 71.000000 30.000000 56.000000 53.000000 56.000000 66.000000 60.000000 52.000000 58.000000 53.000000 70.00000 60.00000 79.000000 280.875758 241.621212 3.088450e+06 17.355000 360.000000 315.974661 360.000000 235.903932 -999.000000 237.594904 1.000000 0.000000 2.000000 47.181941 68.624733 -999.000000 1.086670 6.190447 76.210632 6.873668 0.558894 0.454038 2.000000 11.0 2.000000 2.0 82.958432 100.000000 1.000000 0.000000 120.067894 108.892404 112.513835 115.876461 0.490146 0.503934 0.518955 0.475576 1.000000 1.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 1.000000 0.000000
75% 12.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 383.768891 388.613481 388.896194 396.084587 376.630124 364.555723 357.789393 354.439787 375.583629 390.380263 394.259749 396.124585 26.665341 43.126611 41.340000 39.214867 35.956487 30.225000 37.757273 39.778335 32.077500 40.320899 35.761962 27.430000 2516.000000 5390.000000 5828.750000 6500.000000 8022.750000 7040.000000 5733.000000 6696.000000 6480.000000 7654.000000 6132.000000 8286.500000 67.000000 80.000000 90.000000 97.000000 97.000000 100.000000 93.000000 98.000000 100.000000 90.000000 84.000000 87.000000 40.000000 70.000000 68.000000 71.000000 87.000000 75.000000 67.000000 75.000000 68.000000 87.00000 76.00000 100.000000 466.358333 413.868182 6.414914e+06 48.100000 360.000000 2790.897863 360.000000 7494.666046 -999.000000 7519.096753 1.000000 0.000000 2.000000 63.744688 83.514959 40.275668 1.363993 21.825065 94.489077 15.576374 5.973103 5.561037 11.000000 11.0 2.000000 2.0 87.013714 119.000000 1.000000 1.000000 124.292840 113.400288 116.133562 120.156781 0.515807 0.525680 0.540454 0.495097 1.000000 1.000000 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000 1.000000 1.000000 0.000000 1.000000 0.000000
max 17.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 611.804707 587.053873 600.393621 659.571073 544.839655 524.434219 520.622704 490.549946 505.302385 666.132119 608.481717 593.485467 119.014060 119.906309 118.560000 119.546475 117.821062 115.700000 119.321309 119.838808 118.950000 118.928227 119.644143 119.730000 9782.000000 25088.000000 24480.000000 26850.000000 32200.000000 21812.000000 22866.000000 26085.000000 25584.000000 31622.000000 28000.000000 28362.000000 143.000000 196.000000 193.000000 223.000000 218.000000 202.000000 176.000000 256.000000 248.000000 201.000000 208.000000 287.000000 149.000000 193.000000 184.000000 181.000000 210.000000 209.000000 206.000000 192.000000 207.000000 202.00000 182.00000 220.000000 4232.000000 5832.000000 1.530196e+08 119.470000 360.000000 63116.006320 360.000000 63259.726600 360.000000 63259.726600 1.000000 1.000000 8.000000 110.544344 128.171298 292.853059 2.484998 69.628202 179.356082 295.987900 179.275012 179.599020 12.000000 11.0 12.000000 2.0 99.972806 170.000000 1.000000 1.000000 472.390325 253.593660 471.591459 471.512542 1.000000 1.000000 1.000000 1.000000 2.000000 2.000000 1.000000 1.000000 605.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
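
Because the undersampling step calls `.sample(frac=1/6)` without a seed, each run keeps a different subset of the normal class, and the summary statistics above will shift slightly between runs. A minimal sketch of a reproducible variant follows. The helper `undersample_normal`, its `seed` argument and the frame `toy` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Hypothetical frame: 12 normal rows and 2 rows for each minority class.
toy = pd.DataFrame({
    "diagnosis": ["normal"] * 12 + ["pre_alzheimer"] * 2 + ["post_alzheimer"] * 2,
    "feature": range(16),
})

def undersample_normal(frame, frac, seed):
    """Keep every minority row and a seeded random fraction of the normal rows."""
    minority = frame[frame["diagnosis"] != "normal"]
    majority = frame[frame["diagnosis"] == "normal"].sample(frac=frac, random_state=seed)
    return pd.concat([minority, majority]).reset_index(drop=True)

# The same seed yields the same subset on every call.
balanced_a = undersample_normal(toy, frac=1/6, seed=42)
balanced_b = undersample_normal(toy, frac=1/6, seed=42)
print(balanced_a["diagnosis"].value_counts())
```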

We have used only very simple feature engineering so far. In subsequent notebooks (probably a part 2 or a walkthrough blog/video), I'll explain more feature engineering methods and dig deeper into how the different FE techniques can be leveraged. For now, let's focus on hyper-parameter optimization.

Redundant Code

In [ ]:
# features = df_orig.columns[1:-1].to_list()
# cont_f = []
# for f in features:
#     print(f" {f} is having : {df[f].nunique()}")
#     if df[f].nunique() >= 7:
#         cont_f.append(f)
In [ ]:
# train = df[features]
# train = train.drop(['intersection_pos_rel_centre'],axis = 1)
# train.fillna(-1, inplace=True)
# # train_data = (train_data-train_data.mean())/train_data.std()
# train.describe()
In [ ]:
# target_values = list(df_orig['diagnosis'].unique())
# target_col = 'diagnosis'
# df_pos = df_orig[df_orig[target_col].isin(target_values[1:])]
# nb_pos = df_pos.shape[0]
# nb_neg = nb_pos*2
# df_neg = df_orig[df_orig[target_col] == "normal"].sample(n=nb_neg, random_state=42)
# df_samples = pd.concat([df_pos, df_neg]).sample(frac=1).reset_index(drop=True)

# train_data = df_samples[features]
# train_data.drop(['intersection_pos_rel_centre'],axis = 1, inplace=True)
# train_data.fillna(-1, inplace=True)
# # train_data = (train_data-train_data.mean())/train_data.std()
# train_data.describe()
In [ ]:
# df_orig['diagnosis'].unique()
In [ ]:
# target_dict = {'normal':0, 'post_alzheimer':1, 'pre_alzheimer':2}
# remap_vals = {0:'normal', 1:'post_alzheimer',2:'pre_alzheimer'}
# train_labels = df_samples['diagnosis'].map(target_dict).astype('int')
# train_labels

Train your model

Part I: Hyper-parameter Optimization using Optuna

In [ ]:
# Use 10% of the training data for validation while tuning hyperparameters
X_train, X_test, Y_train, y_test = train_test_split(train_data, train_labels, test_size=0.1, random_state=42)


# For simplicity we use Optuna's default sampler and pruner for tuning the hyperparameters. You can find more info about them
# at https://github.com/optuna/optuna/ [ps: I am one of the contributors so feel free to ask any queries or give feedback]
import optuna


def objective(trial):
    train_x, valid_x, train_y, valid_y = train_test_split(train_data, train_labels, test_size=0.1, random_state=42)
    dtrain = xgb.DMatrix(train_x, label=train_y)
    dvalid = xgb.DMatrix(valid_x, label=valid_y)

    param = {
        "verbosity": 0,
        "eval_metric":"mlogloss",
        "use_label_encoder":False,
        # L2 regularization weight.
        "reg_lambda": trial.suggest_float("lambda", 1e-8, 1.0, log=True),
        # L1 regularization weight.
        "reg_alpha": trial.suggest_float("alpha", 1e-8, 1.0, log=True),
        # fraction of training rows sampled for each tree.
        "subsample": trial.suggest_float("subsample", 0.2, 1.0),
        # fraction of features (columns) sampled for each tree.
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.2, 1.0),
        "learning_rate": trial.suggest_float("learning_rate", 0.001, 1.0, log=True),
        "max_depth": trial.suggest_int("max_depth", 8, 20),
        "n_estimators": trial.suggest_int("n_estimators", 50, 200),
    }
    model = xgb.XGBClassifier(**param)
    model.fit(train_x,train_y)
    pred_probs = model.predict_proba(valid_x)
    return log_loss(valid_y, pred_probs)
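
The objective returns the multi-class log loss, so it helps to know what scale to expect when reading the trial values. The sketch below re-implements the metric in plain Python on made-up predictions. It is a simplified version of the formula (no probability clipping), and the function name `multiclass_log_loss` is hypothetical. A model that predicts 1/3 for every class scores ln 3, about 1.0986, so a trial value close to that is no better than uniform guessing.

```python
import math

def multiclass_log_loss(true_labels, predicted_probs):
    """Mean negative log of the probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        total += -math.log(probs[label])
    return total / len(true_labels)

labels = [0, 1, 2]  # encoded as in target_dict: normal, post_alzheimer, pre_alzheimer

# Uniform guess: the same probability for each of the three classes.
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 3
chance_level = multiclass_log_loss(labels, uniform)

# Made-up predictions that put most of the mass on the true class.
confident = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
confident_loss = multiclass_log_loss(labels, confident)

print(f"uniform guess: {chance_level:.4f}")
print(f"confident and correct: {confident_loss:.4f}")
```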
In [ ]:
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)

print("Best trial:")
trial = study.best_trial

print("  Value: {}".format(trial.value))
print("  Params: ")
for key, value in trial.params.items():
    print("    {}: {}".format(key, value))
[I 2021-05-12 15:59:38,641] A new study created in memory with name: no-name-22701c7a-c3c0-4464-ad97-3aff52082ff5
[I 2021-05-12 15:59:43,741] Trial 0 finished with value: 0.9195696366414362 and parameters: {'lambda': 1.6349867526984736e-08, 'alpha': 1.3660860062064696e-06, 'subsample': 0.20079966534758473, 'colsample_bytree': 0.37119951076379953, 'learning_rate': 0.0038901255071057336, 'max_depth': 20, 'n_estimators': 70}. Best is trial 0 with value: 0.9195696366414362.
[I 2021-05-12 16:00:04,196] Trial 1 finished with value: 0.9635377036768014 and parameters: {'lambda': 0.004574820718124166, 'alpha': 2.9722885826155193e-05, 'subsample': 0.9208011834964993, 'colsample_bytree': 0.7567158878933506, 'learning_rate': 0.00129718056690989, 'max_depth': 11, 'n_estimators': 146}. Best is trial 0 with value: 0.9195696366414362.
[I 2021-05-12 16:00:23,199] Trial 2 finished with value: 0.6012728019910514 and parameters: {'lambda': 0.06532063043901398, 'alpha': 7.219229185669857e-08, 'subsample': 0.8064491134239615, 'colsample_bytree': 0.6532285969703563, 'learning_rate': 0.007698231074099377, 'max_depth': 9, 'n_estimators': 199}. Best is trial 2 with value: 0.6012728019910514.
[I 2021-05-12 16:00:36,058] Trial 3 finished with value: 1.0188958269333888 and parameters: {'lambda': 4.100103321864437e-06, 'alpha': 1.469784619079097e-06, 'subsample': 0.46997280290451515, 'colsample_bytree': 0.9817330486559255, 'learning_rate': 0.37994346952157776, 'max_depth': 8, 'n_estimators': 183}. Best is trial 2 with value: 0.6012728019910514.
[I 2021-05-12 16:01:07,492] Trial 4 finished with value: 0.7157922250657526 and parameters: {'lambda': 0.008884359462830286, 'alpha': 0.0013918845053601467, 'subsample': 0.8952897423649306, 'colsample_bytree': 0.7656061476589351, 'learning_rate': 0.005564061425281412, 'max_depth': 20, 'n_estimators': 149}. Best is trial 2 with value: 0.6012728019910514.
[I 2021-05-12 16:01:10,933] Trial 5 finished with value: 0.9841397848969459 and parameters: {'lambda': 0.002146058060039168, 'alpha': 0.18073047641261855, 'subsample': 0.397484066252481, 'colsample_bytree': 0.7971449141596241, 'learning_rate': 0.711811587096646, 'max_depth': 8, 'n_estimators': 52}. Best is trial 2 with value: 0.6012728019910514.
[I 2021-05-12 16:01:15,631] Trial 6 finished with value: 1.079011461771409 and parameters: {'lambda': 0.3172815083490704, 'alpha': 0.2074018202573672, 'subsample': 0.3852548230234022, 'colsample_bytree': 0.4810423024971776, 'learning_rate': 0.772503478998581, 'max_depth': 13, 'n_estimators': 129}. Best is trial 2 with value: 0.6012728019910514.
[I 2021-05-12 16:01:25,493] Trial 7 finished with value: 0.6409120366829854 and parameters: {'lambda': 0.00010606366275323564, 'alpha': 1.3231854083926023e-08, 'subsample': 0.744251168195859, 'colsample_bytree': 0.4299455255176272, 'learning_rate': 0.07107340110329274, 'max_depth': 13, 'n_estimators': 99}. Best is trial 2 with value: 0.6012728019910514.
[I 2021-05-12 16:01:32,129] Trial 8 finished with value: 0.5439758313217227 and parameters: {'lambda': 0.33083813627745107, 'alpha': 1.5911831635866442e-07, 'subsample': 0.82765124048449, 'colsample_bytree': 0.2279829749365476, 'learning_rate': 0.035925260456468336, 'max_depth': 12, 'n_estimators': 95}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:01:38,463] Trial 9 finished with value: 1.0134956397943775 and parameters: {'lambda': 6.776274672080595e-05, 'alpha': 0.056795678572885214, 'subsample': 0.6102348185161552, 'colsample_bytree': 0.976707107422057, 'learning_rate': 0.8199557848196738, 'max_depth': 15, 'n_estimators': 78}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:01:45,583] Trial 10 finished with value: 0.5941905979131743 and parameters: {'lambda': 0.9957324360369868, 'alpha': 2.076680191304403e-07, 'subsample': 0.684544680253138, 'colsample_bytree': 0.2286114677754202, 'learning_rate': 0.06678423843359269, 'max_depth': 16, 'n_estimators': 103}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:01:53,293] Trial 11 finished with value: 0.558872301090217 and parameters: {'lambda': 0.6552347455123961, 'alpha': 2.3169385235616206e-07, 'subsample': 0.653234773049951, 'colsample_bytree': 0.24031308330446927, 'learning_rate': 0.039180808153754215, 'max_depth': 17, 'n_estimators': 106}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:01:59,901] Trial 12 finished with value: 0.5558765758725882 and parameters: {'lambda': 0.7369091485314381, 'alpha': 2.8126831683363716e-05, 'subsample': 0.5439057556282585, 'colsample_bytree': 0.21214216238324685, 'learning_rate': 0.023287391978836985, 'max_depth': 17, 'n_estimators': 102}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:02:05,345] Trial 13 finished with value: 0.6327375523096702 and parameters: {'lambda': 0.060321088785949904, 'alpha': 0.00010613432399749767, 'subsample': 0.5053679835529601, 'colsample_bytree': 0.20209023588551986, 'learning_rate': 0.01594804011056962, 'max_depth': 18, 'n_estimators': 82}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:02:09,340] Trial 14 finished with value: 0.7116914281530954 and parameters: {'lambda': 2.5453952810966554e-08, 'alpha': 2.2966164218465713e-05, 'subsample': 0.9843494524124975, 'colsample_bytree': 0.3233026473291353, 'learning_rate': 0.209234484896145, 'max_depth': 11, 'n_estimators': 54}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:02:17,439] Trial 15 finished with value: 0.552392530850681 and parameters: {'lambda': 5.014880568541396e-07, 'alpha': 0.002565296674701982, 'subsample': 0.22883167938347604, 'colsample_bytree': 0.5263465314609196, 'learning_rate': 0.019551956408425334, 'max_depth': 18, 'n_estimators': 124}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:02:23,813] Trial 16 finished with value: 0.6979143200293344 and parameters: {'lambda': 1.2973372711244673e-06, 'alpha': 0.0017945347028109787, 'subsample': 0.24368971537894468, 'colsample_bytree': 0.5417594117009952, 'learning_rate': 0.13422959042553706, 'max_depth': 11, 'n_estimators': 127}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:02:46,457] Trial 17 finished with value: 0.5715046930876582 and parameters: {'lambda': 6.711648112115581e-07, 'alpha': 0.008114864419414643, 'subsample': 0.8071338719437253, 'colsample_bytree': 0.6280043772146156, 'learning_rate': 0.012410805891684423, 'max_depth': 14, 'n_estimators': 157}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:02:53,307] Trial 18 finished with value: 0.9546845668880063 and parameters: {'lambda': 1.9473381877226924e-07, 'alpha': 0.00032124477770214166, 'subsample': 0.2826032414232876, 'colsample_bytree': 0.3416417950739769, 'learning_rate': 0.0017414056809881896, 'max_depth': 19, 'n_estimators': 120}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:03:19,611] Trial 19 finished with value: 0.6713276867807376 and parameters: {'lambda': 3.4324133372634834e-05, 'alpha': 0.013171732785893786, 'subsample': 0.9816867890170315, 'colsample_bytree': 0.8889916979277526, 'learning_rate': 0.04229477429241404, 'max_depth': 12, 'n_estimators': 167}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:03:39,319] Trial 20 finished with value: 0.8613529729174018 and parameters: {'lambda': 7.43126857356227e-06, 'alpha': 3.022852631363205e-06, 'subsample': 0.8594906080158236, 'colsample_bytree': 0.5404173064590668, 'learning_rate': 0.003275403546508423, 'max_depth': 15, 'n_estimators': 118}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:03:48,251] Trial 21 finished with value: 0.5813804973395572 and parameters: {'lambda': 0.0003931298740367695, 'alpha': 9.307805111877995e-06, 'subsample': 0.5483616600559615, 'colsample_bytree': 0.28287453314700883, 'learning_rate': 0.019630774046001352, 'max_depth': 18, 'n_estimators': 90}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:03:52,858] Trial 22 finished with value: 0.631024004261462 and parameters: {'lambda': 0.06258532953868075, 'alpha': 0.00039906590899308164, 'subsample': 0.3285257743497002, 'colsample_bytree': 0.43167280657786383, 'learning_rate': 0.018793427580528474, 'max_depth': 17, 'n_estimators': 68}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:04:07,027] Trial 23 finished with value: 0.5692939623114563 and parameters: {'lambda': 1.0680239353387206e-07, 'alpha': 3.5566316565534124e-08, 'subsample': 0.716256689759055, 'colsample_bytree': 0.262616241837632, 'learning_rate': 0.030667941305320325, 'max_depth': 16, 'n_estimators': 139}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:04:15,588] Trial 24 finished with value: 0.6535982637991338 and parameters: {'lambda': 0.25448068138959007, 'alpha': 0.007304168983292869, 'subsample': 0.5768148453722621, 'colsample_bytree': 0.38980608904710046, 'learning_rate': 0.08499556493790805, 'max_depth': 18, 'n_estimators': 107}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:04:26,722] Trial 25 finished with value: 0.7011930414721744 and parameters: {'lambda': 0.0009091508617729856, 'alpha': 7.5790509217935e-05, 'subsample': 0.4456615729447536, 'colsample_bytree': 0.6745486061016551, 'learning_rate': 0.009612821423414911, 'max_depth': 14, 'n_estimators': 92}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:04:35,375] Trial 26 finished with value: 0.5509703861387168 and parameters: {'lambda': 0.014241710013410082, 'alpha': 0.0012737675771777972, 'subsample': 0.7825569112681832, 'colsample_bytree': 0.20596670917273524, 'learning_rate': 0.030404545660587586, 'max_depth': 19, 'n_estimators': 110}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:04:45,130] Trial 27 finished with value: 0.6254864225344268 and parameters: {'lambda': 0.02987454745886535, 'alpha': 0.0020637668494017125, 'subsample': 0.7707754748217122, 'colsample_bytree': 0.3119176940308491, 'learning_rate': 0.05358713313962714, 'max_depth': 19, 'n_estimators': 115}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:04:56,542] Trial 28 finished with value: 0.7235978848434658 and parameters: {'lambda': 0.01403596509504278, 'alpha': 0.03954710306945208, 'subsample': 0.8526279550262742, 'colsample_bytree': 0.5412666921490603, 'learning_rate': 0.11223971630025333, 'max_depth': 10, 'n_estimators': 134}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:05:06,999] Trial 29 finished with value: 0.9039875398358355 and parameters: {'lambda': 0.0004793778331294463, 'alpha': 0.0003814881557834932, 'subsample': 0.9348677563121771, 'colsample_bytree': 0.394508057603869, 'learning_rate': 0.0040076023340996384, 'max_depth': 20, 'n_estimators': 75}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:05:19,847] Trial 30 finished with value: 0.5543004662854984 and parameters: {'lambda': 1.5420576306444612e-08, 'alpha': 0.8286272487868824, 'subsample': 0.7911696988792145, 'colsample_bytree': 0.7172112890802471, 'learning_rate': 0.030544422871830627, 'max_depth': 13, 'n_estimators': 89}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:05:32,768] Trial 31 finished with value: 0.556463176673457 and parameters: {'lambda': 2.7694083953665918e-08, 'alpha': 0.3250902110632494, 'subsample': 0.8062160348412188, 'colsample_bytree': 0.6994586113543141, 'learning_rate': 0.0315697435069218, 'max_depth': 13, 'n_estimators': 90}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:05:42,328] Trial 32 finished with value: 0.6990377457490524 and parameters: {'lambda': 1.1291930864690077e-07, 'alpha': 0.040749304109750056, 'subsample': 0.6613774240008639, 'colsample_bytree': 0.8537777704403738, 'learning_rate': 0.012848315693840632, 'max_depth': 11, 'n_estimators': 69}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:05:58,927] Trial 33 finished with value: 0.555768231196849 and parameters: {'lambda': 1.3020269233111845e-08, 'alpha': 0.00422071844876201, 'subsample': 0.8483766635056502, 'colsample_bytree': 0.7403406186829168, 'learning_rate': 0.02596946182725228, 'max_depth': 12, 'n_estimators': 110}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:06:07,535] Trial 34 finished with value: 0.83272786814846 and parameters: {'lambda': 9.15007686586003e-06, 'alpha': 0.00013203434182416723, 'subsample': 0.9259142796153228, 'colsample_bytree': 0.5949503388916598, 'learning_rate': 0.007617066478424526, 'max_depth': 12, 'n_estimators': 60}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:06:16,988] Trial 35 finished with value: 0.7677976213070284 and parameters: {'lambda': 5.682319585356447e-07, 'alpha': 4.892213591879229e-07, 'subsample': 0.7769562242874049, 'colsample_bytree': 0.8284801770616087, 'learning_rate': 0.23353518546832233, 'max_depth': 10, 'n_estimators': 85}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:06:34,124] Trial 36 finished with value: 0.5686744241156729 and parameters: {'lambda': 0.0042326990122688755, 'alpha': 0.0010704938541182879, 'subsample': 0.7027872732016271, 'colsample_bytree': 0.9217383865380415, 'learning_rate': 0.03527130956513971, 'max_depth': 15, 'n_estimators': 95}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:06:57,322] Trial 37 finished with value: 0.6038816892873823 and parameters: {'lambda': 7.167199590663346e-08, 'alpha': 0.7384480724372152, 'subsample': 0.8777126909623407, 'colsample_bytree': 0.7446172967069293, 'learning_rate': 0.04854967241859484, 'max_depth': 19, 'n_estimators': 142}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:07:14,198] Trial 38 finished with value: 0.7140833424236432 and parameters: {'lambda': 0.1787103327358404, 'alpha': 2.7145901635824887e-06, 'subsample': 0.8074491095165193, 'colsample_bytree': 0.6039569973938803, 'learning_rate': 0.0067836313605354704, 'max_depth': 13, 'n_estimators': 123}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:07:23,361] Trial 39 finished with value: 0.5999528267138817 and parameters: {'lambda': 0.016342787430727198, 'alpha': 0.9006287926906134, 'subsample': 0.618906799988347, 'colsample_bytree': 0.45741160128240516, 'learning_rate': 0.01236147269936341, 'max_depth': 9, 'n_estimators': 131}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:07:41,681] Trial 40 finished with value: 0.6999605190981472 and parameters: {'lambda': 1.1330966109970059e-08, 'alpha': 1.0272576483174874e-08, 'subsample': 0.7301847063299857, 'colsample_bytree': 0.7130786059228844, 'learning_rate': 0.09127106702914806, 'max_depth': 20, 'n_estimators': 113}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:07:58,819] Trial 41 finished with value: 0.5595087209053371 and parameters: {'lambda': 3.6102863506940234e-08, 'alpha': 0.004992089413328167, 'subsample': 0.8429985042560257, 'colsample_bytree': 0.7760310533548564, 'learning_rate': 0.021699116804683727, 'max_depth': 12, 'n_estimators': 111}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:08:14,185] Trial 42 finished with value: 0.5638894339695991 and parameters: {'lambda': 1.1288425942790915e-08, 'alpha': 0.023954876339781377, 'subsample': 0.9522532371070629, 'colsample_bytree': 0.7461485742183953, 'learning_rate': 0.024212240095413288, 'max_depth': 12, 'n_estimators': 97}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:08:30,785] Trial 43 finished with value: 0.6308152922441181 and parameters: {'lambda': 3.4051144992833417e-07, 'alpha': 0.1231372766186487, 'subsample': 0.9057837369483771, 'colsample_bytree': 0.8101561490685745, 'learning_rate': 0.05967777310138885, 'max_depth': 14, 'n_estimators': 106}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:08:44,386] Trial 44 finished with value: 0.5663630858261793 and parameters: {'lambda': 1.2663189371629173e-06, 'alpha': 0.0027520548094786424, 'subsample': 0.7591503243644138, 'colsample_bytree': 0.6603098788602672, 'learning_rate': 0.03484101617568957, 'max_depth': 13, 'n_estimators': 100}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:08:54,557] Trial 45 finished with value: 0.6342182246585716 and parameters: {'lambda': 1.1354052583809866e-08, 'alpha': 0.0007241911751782543, 'subsample': 0.8227250109167384, 'colsample_bytree': 0.7082449288272578, 'learning_rate': 0.015151960040659704, 'max_depth': 10, 'n_estimators': 83}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:09:20,560] Trial 46 finished with value: 0.5718747880750031 and parameters: {'lambda': 5.271621351182447e-08, 'alpha': 0.003749125413858633, 'subsample': 0.8868298808233133, 'colsample_bytree': 0.867231453380404, 'learning_rate': 0.02473438018044617, 'max_depth': 13, 'n_estimators': 149}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:09:39,029] Trial 47 finished with value: 0.6066211906527933 and parameters: {'lambda': 2.5434925432329653e-05, 'alpha': 0.01830860135864865, 'subsample': 0.6706894847247397, 'colsample_bytree': 0.945898504387016, 'learning_rate': 0.04864244304813086, 'max_depth': 12, 'n_estimators': 124}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:09:55,072] Trial 48 finished with value: 0.6615006651624795 and parameters: {'lambda': 0.00015276590378494285, 'alpha': 0.00021910499564107314, 'subsample': 0.7702816765883712, 'colsample_bytree': 0.778173602749846, 'learning_rate': 0.009755895143165334, 'max_depth': 11, 'n_estimators': 111}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:10:02,468] Trial 49 finished with value: 0.6839060028561622 and parameters: {'lambda': 0.12991973783622773, 'alpha': 0.08567943074799311, 'subsample': 0.6334402537577137, 'colsample_bytree': 0.5121145018681567, 'learning_rate': 0.14973279889924992, 'max_depth': 16, 'n_estimators': 78}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:10:26,054] Trial 50 finished with value: 0.7531997957381541 and parameters: {'lambda': 2.495253410332865e-06, 'alpha': 5.98390286694122e-05, 'subsample': 0.9719514387358505, 'colsample_bytree': 0.6191629461686738, 'learning_rate': 0.07115781461247034, 'max_depth': 15, 'n_estimators': 196}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:10:29,560] Trial 51 finished with value: 0.5535589019593231 and parameters: {'lambda': 0.9452244010242742, 'alpha': 8.063659264258198e-06, 'subsample': 0.21896109976066547, 'colsample_bytree': 0.21915155178596238, 'learning_rate': 0.02677416216425438, 'max_depth': 17, 'n_estimators': 101}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:10:33,122] Trial 52 finished with value: 0.5466886827037113 and parameters: {'lambda': 0.511074951931679, 'alpha': 1.1046897266643477e-05, 'subsample': 0.21033872002280776, 'colsample_bytree': 0.2315174702986938, 'learning_rate': 0.028498064739051473, 'max_depth': 18, 'n_estimators': 100}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:10:36,803] Trial 53 finished with value: 0.5871077385047337 and parameters: {'lambda': 0.631998286377863, 'alpha': 8.779506220626414e-06, 'subsample': 0.2251500805586211, 'colsample_bytree': 0.20346022427270505, 'learning_rate': 0.017434630377452562, 'max_depth': 18, 'n_estimators': 101}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:10:40,745] Trial 54 finished with value: 0.5451431829108138 and parameters: {'lambda': 0.4256004603824704, 'alpha': 5.320442103515791e-07, 'subsample': 0.28048490693317624, 'colsample_bytree': 0.2525291803811883, 'learning_rate': 0.0403617734286861, 'max_depth': 19, 'n_estimators': 88}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:10:45,146] Trial 55 finished with value: 0.5554722776991402 and parameters: {'lambda': 0.45745651179323193, 'alpha': 8.732081498753008e-07, 'subsample': 0.28855667745005215, 'colsample_bytree': 0.25171986922037454, 'learning_rate': 0.04892345427900728, 'max_depth': 19, 'n_estimators': 97}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:10:48,099] Trial 56 finished with value: 0.5488883606825392 and parameters: {'lambda': 0.11891801544660276, 'alpha': 2.631866846093695e-07, 'subsample': 0.20239950180990837, 'colsample_bytree': 0.2854835903140222, 'learning_rate': 0.039166260641913794, 'max_depth': 17, 'n_estimators': 75}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:10:52,158] Trial 57 finished with value: 0.6028591805492615 and parameters: {'lambda': 0.09669298185844794, 'alpha': 7.32689458988185e-08, 'subsample': 0.359980360210848, 'colsample_bytree': 0.3494439420225374, 'learning_rate': 0.09271913276260339, 'max_depth': 20, 'n_estimators': 62}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:10:56,350] Trial 58 finished with value: 0.5448266950049375 and parameters: {'lambda': 0.03626179826209123, 'alpha': 9.27141261424062e-08, 'subsample': 0.27289740535372226, 'colsample_bytree': 0.29509333803783233, 'learning_rate': 0.03990156579946337, 'max_depth': 18, 'n_estimators': 81}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:10:59,668] Trial 59 finished with value: 0.5726700851754278 and parameters: {'lambda': 0.027306535127362894, 'alpha': 1.0963292098426224e-07, 'subsample': 0.2005543994931825, 'colsample_bytree': 0.29722878734755664, 'learning_rate': 0.06963920969062723, 'max_depth': 17, 'n_estimators': 76}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:11:04,369] Trial 60 finished with value: 0.5468015766357301 and parameters: {'lambda': 0.005249903271277778, 'alpha': 3.8946739297522934e-07, 'subsample': 0.258429922148393, 'colsample_bytree': 0.2729022015591938, 'learning_rate': 0.041386276225514426, 'max_depth': 19, 'n_estimators': 86}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:11:08,949] Trial 61 finished with value: 0.5458041890586233 and parameters: {'lambda': 0.0051814458801663285, 'alpha': 2.6751448157951585e-07, 'subsample': 0.2664097284913194, 'colsample_bytree': 0.28030618516547123, 'learning_rate': 0.04390642992502063, 'max_depth': 19, 'n_estimators': 86}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:11:12,850] Trial 62 finished with value: 0.5454425141435431 and parameters: {'lambda': 0.0036268596066761673, 'alpha': 3.07022674458088e-08, 'subsample': 0.2635304629004013, 'colsample_bytree': 0.26814816336468444, 'learning_rate': 0.04164221516040123, 'max_depth': 18, 'n_estimators': 73}. Best is trial 8 with value: 0.5439758313217227.
[I 2021-05-12 16:11:17,449] Trial 63 finished with value: 0.532580343227795 and parameters: {'lambda': 0.0021482290862969993, 'alpha': 2.4438454633711583e-08, 'subsample': 0.2658469152130181, 'colsample_bytree': 0.26317295728868534, 'learning_rate': 0.0419633326885014, 'max_depth': 18, 'n_estimators': 85}. Best is trial 63 with value: 0.532580343227795.
[I 2021-05-12 16:11:20,881] Trial 64 finished with value: 0.5455619081100958 and parameters: {'lambda': 0.003650530358382242, 'alpha': 2.392416161942314e-08, 'subsample': 0.3178635772980517, 'colsample_bytree': 0.2415398454323338, 'learning_rate': 0.06378468412808477, 'max_depth': 18, 'n_estimators': 61}. Best is trial 63 with value: 0.532580343227795.
[I 2021-05-12 16:11:24,207] Trial 65 finished with value: 0.6482702056213386 and parameters: {'lambda': 0.0010119351817173571, 'alpha': 1.72694871830583e-08, 'subsample': 0.3147647964352783, 'colsample_bytree': 0.3355767294713051, 'learning_rate': 0.14912442676143356, 'max_depth': 18, 'n_estimators': 50}. Best is trial 63 with value: 0.532580343227795.
[I 2021-05-12 16:11:29,562] Trial 66 finished with value: 0.5619241205103244 and parameters: {'lambda': 0.001820937914975054, 'alpha': 3.4699936491914344e-08, 'subsample': 0.39880437993780493, 'colsample_bytree': 0.36283741635239275, 'learning_rate': 0.0610965678138887, 'max_depth': 19, 'n_estimators': 63}. Best is trial 63 with value: 0.532580343227795.
[I 2021-05-12 16:11:33,097] Trial 67 finished with value: 0.5759682530506983 and parameters: {'lambda': 0.006594124584151247, 'alpha': 2.4223562213811096e-08, 'subsample': 0.2624900682695232, 'colsample_bytree': 0.3060381389200454, 'learning_rate': 0.0803100619645167, 'max_depth': 18, 'n_estimators': 66}. Best is trial 63 with value: 0.532580343227795.
[I 2021-05-12 16:11:36,145] Trial 68 finished with value: 0.5533519738644054 and parameters: {'lambda': 0.03557073810741478, 'alpha': 9.135883479243865e-08, 'subsample': 0.32710129095579726, 'colsample_bytree': 0.24682253436320883, 'learning_rate': 0.04243453298393327, 'max_depth': 20, 'n_estimators': 57}. Best is trial 63 with value: 0.532580343227795.
[I 2021-05-12 16:11:41,463] Trial 69 finished with value: 0.6272342846577711 and parameters: {'lambda': 0.00162710547622667, 'alpha': 4.930773639992504e-08, 'subsample': 0.38217639444528906, 'colsample_bytree': 0.38489488124864235, 'learning_rate': 0.11263050797881029, 'max_depth': 16, 'n_estimators': 73}. Best is trial 63 with value: 0.532580343227795.
[I 2021-05-12 16:11:45,279] Trial 70 finished with value: 0.7312466987934071 and parameters: {'lambda': 0.003585991576295025, 'alpha': 1.543017771767456e-07, 'subsample': 0.2960765672390549, 'colsample_bytree': 0.26906814302190557, 'learning_rate': 0.19350213837897506, 'max_depth': 17, 'n_estimators': 82}. Best is trial 63 with value: 0.532580343227795.
[I 2021-05-12 16:11:48,725] Trial 71 finished with value: 0.5663126838122504 and parameters: {'lambda': 0.262876504280296, 'alpha': 1.051793788102725e-06, 'subsample': 0.2568083029964536, 'colsample_bytree': 0.2307537593499657, 'learning_rate': 0.05794754990845982, 'max_depth': 18, 'n_estimators': 81}. Best is trial 63 with value: 0.532580343227795.
[I 2021-05-12 16:11:53,945] Trial 72 finished with value: 0.5528982526222396 and parameters: {'lambda': 0.0028936756879058673, 'alpha': 1.0616658816812026e-08, 'subsample': 0.3590243821950928, 'colsample_bytree': 0.32233114082842085, 'learning_rate': 0.03617709434845802, 'max_depth': 18, 'n_estimators': 70}. Best is trial 63 with value: 0.532580343227795.
[I 2021-05-12 16:11:58,223] Trial 73 finished with value: 0.5432369545517569 and parameters: {'lambda': 0.00843338646501243, 'alpha': 5.741499514487287e-07, 'subsample': 0.23854268259192382, 'colsample_bytree': 0.23484128351263964, 'learning_rate': 0.04584906620859215, 'max_depth': 19, 'n_estimators': 93}. Best is trial 63 with value: 0.532580343227795.
[I 2021-05-12 16:12:02,276] Trial 74 finished with value: 0.6367677213760224 and parameters: {'lambda': 0.010844554491811436, 'alpha': 5.915165138977224e-07, 'subsample': 0.24184496404379216, 'colsample_bytree': 0.2585588133507941, 'learning_rate': 0.11152519775737416, 'max_depth': 19, 'n_estimators': 92}. Best is trial 63 with value: 0.532580343227795.
[I 2021-05-12 16:12:06,379] Trial 75 finished with value: 0.5571174169526032 and parameters: {'lambda': 0.042075498669974995, 'alpha': 1.8733088524762418e-07, 'subsample': 0.3047555197534811, 'colsample_bytree': 0.20218612033890965, 'learning_rate': 0.049669689099621245, 'max_depth': 20, 'n_estimators': 86}. Best is trial 63 with value: 0.532580343227795.
[I 2021-05-12 16:12:11,393] Trial 76 finished with value: 0.6556599639367141 and parameters: {'lambda': 0.0003522880299796923, 'alpha': 2.4280882846385074e-06, 'subsample': 0.4410602341125248, 'colsample_bytree': 0.28346940514865787, 'learning_rate': 0.021210083183288837, 'max_depth': 19, 'n_estimators': 54}. Best is trial 63 with value: 0.532580343227795.
[I 2021-05-12 16:12:15,214] Trial 77 finished with value: 0.5779121745648784 and parameters: {'lambda': 0.007670335696446554, 'alpha': 5.144509968986996e-08, 'subsample': 0.27519849108934746, 'colsample_bytree': 0.2389379456094626, 'learning_rate': 0.07381013405515663, 'max_depth': 17, 'n_estimators': 79}. Best is trial 63 with value: 0.532580343227795.
[I 2021-05-12 16:12:22,014] Trial 78 finished with value: 0.5549856231711596 and parameters: {'lambda': 0.0008757698131320319, 'alpha': 2.0515435403873646e-08, 'subsample': 0.3350514149254721, 'colsample_bytree': 0.3027040689611254, 'learning_rate': 0.04064504360586133, 'max_depth': 20, 'n_estimators': 92}. Best is trial 63 with value: 0.532580343227795.
[I 2021-05-12 16:12:27,515] Trial 79 finished with value: 0.5645061856076573 and parameters: {'lambda': 0.016481327315946676, 'alpha': 1.1755913686334436e-07, 'subsample': 0.2747045691434007, 'colsample_bytree': 0.41647395260867737, 'learning_rate': 0.0541934018953002, 'max_depth': 19, 'n_estimators': 88}. Best is trial 63 with value: 0.532580343227795.
[I 2021-05-12 16:12:30,453] Trial 80 finished with value: 0.6106096981273843 and parameters: {'lambda': 0.021306925673849987, 'alpha': 3.085746990267951e-07, 'subsample': 0.237785290213129, 'colsample_bytree': 0.21865281284685356, 'learning_rate': 0.1012157564069846, 'max_depth': 18, 'n_estimators': 72}. Best is trial 63 with value: 0.532580343227795.
[I 2021-05-12 16:12:33,804] Trial 81 finished with value: 0.5450685095850879 and parameters: {'lambda': 0.3910441214359621, 'alpha': 1.4063486528810884e-06, 'subsample': 0.20763777384337134, 'colsample_bytree': 0.23345908182468905, 'learning_rate': 0.031095479288094544, 'max_depth': 18, 'n_estimators': 95}. Best is trial 63 with value: 0.532580343227795.
[I 2021-05-12 16:12:38,391] Trial 82 finished with value: 0.5479409344428311 and parameters: {'lambda': 0.055843623571873634, 'alpha': 7.672594084092638e-07, 'subsample': 0.34480903792607503, 'colsample_bytree': 0.2002849540511683, 'learning_rate': 0.032522772321552575, 'max_depth': 19, 'n_estimators': 93}. Best is trial 63 with value: 0.532580343227795.
[I 2021-05-12 16:12:42,271] Trial 83 finished with value: 0.5591167967458375 and parameters: {'lambda': 0.0025904239680238704, 'alpha': 3.487891481775715e-08, 'subsample': 0.31192301212712625, 'colsample_bytree': 0.25824144894328327, 'learning_rate': 0.06299170255298323, 'max_depth': 18, 'n_estimators': 65}. Best is trial 63 with value: 0.532580343227795.
[I 2021-05-12 16:12:46,430] Trial 84 finished with value: 0.5489946654949618 and parameters: {'lambda': 0.010711188709130822, 'alpha': 5.941148571370244e-08, 'subsample': 0.25283688396425996, 'colsample_bytree': 0.3205519203108514, 'learning_rate': 0.04473218024112889, 'max_depth': 17, 'n_estimators': 80}. Best is trial 63 with value: 0.532580343227795.
[I 2021-05-12 16:12:51,211] Trial 85 finished with value: 0.5454058441742045 and parameters: {'lambda': 0.21191114637154645, 'alpha': 1.7069021747223223e-06, 'subsample': 0.2816660640852129, 'colsample_bytree': 0.29354203438812176, 'learning_rate': 0.028383203721260002, 'max_depth': 19, 'n_estimators': 95}. Best is trial 63 with value: 0.532580343227795.
[I 2021-05-12 16:12:56,916] Trial 86 finished with value: 0.5560013901998057 and parameters: {'lambda': 0.2027037518890819, 'alpha': 4.4484025634973083e-07, 'subsample': 0.2881127096833037, 'colsample_bytree': 0.3584476957079319, 'learning_rate': 0.022970928466720997, 'max_depth': 18, 'n_estimators': 104}. Best is trial 63 with value: 0.532580343227795.
[I 2021-05-12 16:13:00,808] Trial 87 finished with value: 0.5452843927936505 and parameters: {'lambda': 0.0724776962530672, 'alpha': 1.6072138624716317e-06, 'subsample': 0.22493911938622116, 'colsample_bytree': 0.23930030823202367, 'learning_rate': 0.029051721782793364, 'max_depth': 20, 'n_estimators': 94}. Best is trial 63 with value: 0.532580343227795.
[I 2021-05-12 16:13:05,080] Trial 88 finished with value: 0.6122397920845882 and parameters: {'lambda': 0.3926214919691781, 'alpha': 5.20134245175365e-06, 'subsample': 0.22326360557950037, 'colsample_bytree': 0.33103931635246486, 'learning_rate': 0.01581492357996074, 'max_depth': 20, 'n_estimators': 96}. Best is trial 63 with value: 0.532580343227795.
[I 2021-05-12 16:13:09,699] Trial 89 finished with value: 0.54369354052703 and parameters: {'lambda': 0.08440941971756566, 'alpha': 1.71568843694643e-06, 'subsample': 0.23744357112602013, 'colsample_bytree': 0.21973680090136738, 'learning_rate': 0.02774601958698305, 'max_depth': 19, 'n_estimators': 117}. Best is trial 63 with value: 0.532580343227795.
[I 2021-05-12 16:13:13,216] Trial 90 finished with value: 0.5441979703216908 and parameters: {'lambda': 0.9995777421898492, 'alpha': 2.206074780945274e-06, 'subsample': 0.20149735524389611, 'colsample_bytree': 0.2297371201587289, 'learning_rate': 0.027868738355782217, 'max_depth': 20, 'n_estimators': 105}. Best is trial 63 with value: 0.532580343227795.
[I 2021-05-12 16:13:16,966] Trial 91 finished with value: 0.5474748504313708 and parameters: {'lambda': 0.07379241876467516, 'alpha': 1.3086158444928505e-06, 'subsample': 0.2013149371376218, 'colsample_bytree': 0.21711574870865735, 'learning_rate': 0.028177558449633468, 'max_depth': 20, 'n_estimators': 107}. Best is trial 63 with value: 0.532580343227795.
[I 2021-05-12 16:13:21,655] Trial 92 finished with value: 0.5589636626874854 and parameters: {'lambda': 0.9924600023337846, 'alpha': 1.4478910472321209e-06, 'subsample': 0.23754518211674278, 'colsample_bytree': 0.29511020881676076, 'learning_rate': 0.019762485413860297, 'max_depth': 20, 'n_estimators': 117}. Best is trial 63 with value: 0.532580343227795.
[I 2021-05-12 16:13:25,492] Trial 93 finished with value: 0.6191651385106928 and parameters: {'lambda': 0.16305634843175618, 'alpha': 4.237776388516178e-06, 'subsample': 0.21777905333493716, 'colsample_bytree': 0.22001140848856016, 'learning_rate': 0.013591279822230651, 'max_depth': 19, 'n_estimators': 104}. Best is trial 63 with value: 0.532580343227795.
[I 2021-05-12 16:13:28,917] Trial 94 finished with value: 0.5382495336344623 and parameters: {'lambda': 0.2846161397768814, 'alpha': 2.182145163362304e-06, 'subsample': 0.2027126808301043, 'colsample_bytree': 0.252497971647853, 'learning_rate': 0.03364593401093403, 'max_depth': 19, 'n_estimators': 95}. Best is trial 63 with value: 0.532580343227795.
[I 2021-05-12 16:13:31,894] Trial 95 finished with value: 0.5416538699680012 and parameters: {'lambda': 0.33279461471335114, 'alpha': 2.0415387526989004e-05, 'subsample': 0.20508863545504274, 'colsample_bytree': 0.20017135811162498, 'learning_rate': 0.03539778135261181, 'max_depth': 20, 'n_estimators': 90}. Best is trial 63 with value: 0.532580343227795.
[I 2021-05-12 16:13:34,983] Trial 96 finished with value: 0.5424150261076022 and parameters: {'lambda': 0.3291621846689235, 'alpha': 4.876706528920014e-06, 'subsample': 0.20585611814346355, 'colsample_bytree': 0.20477895858130177, 'learning_rate': 0.033806928092234574, 'max_depth': 19, 'n_estimators': 89}. Best is trial 63 with value: 0.532580343227795.
[I 2021-05-12 16:13:38,195] Trial 97 finished with value: 0.5487368975379442 and parameters: {'lambda': 0.32097427957389696, 'alpha': 1.7400655038010598e-05, 'subsample': 0.20080927372677157, 'colsample_bytree': 0.2073407648496909, 'learning_rate': 0.03323991772208836, 'max_depth': 20, 'n_estimators': 98}. Best is trial 63 with value: 0.532580343227795.
[I 2021-05-12 16:13:41,360] Trial 98 finished with value: 0.5682462060513137 and parameters: {'lambda': 0.7285672576968819, 'alpha': 3.781380558946924e-06, 'subsample': 0.23825709299194875, 'colsample_bytree': 0.20286188580937986, 'learning_rate': 0.024154833437132445, 'max_depth': 19, 'n_estimators': 90}. Best is trial 63 with value: 0.532580343227795.
[I 2021-05-12 16:13:45,103] Trial 99 finished with value: 0.6603433381067245 and parameters: {'lambda': 0.10708360425328674, 'alpha': 6.635710870254048e-06, 'subsample': 0.20524048569618314, 'colsample_bytree': 0.22566632418201965, 'learning_rate': 0.010550307671215309, 'max_depth': 19, 'n_estimators': 107}. Best is trial 63 with value: 0.532580343227795.
Best trial:
  Value: 0.532580343227795
  Params: 
    lambda: 0.0021482290862969993
    alpha: 2.4438454633711583e-08
    subsample: 0.2658469152130181
    colsample_bytree: 0.26317295728868534
    learning_rate: 0.0419633326885014
    max_depth: 18
    n_estimators: 85

Part II: Using the best parameters to train XGBoost

In [ ]:
# TODO: replace this single hold-out split with stratified K-fold training.

# Best hyper-parameters found by the Optuna study above (trial 63).
params = {'lambda': 0.0021482290862969993,
 'alpha': 2.4438454633711583e-08,
 'subsample': 0.2658469152130181,
 'colsample_bytree': 0.26317295728868534,
 'learning_rate': 0.0419633326885014,
 'max_depth': 18,
 'n_estimators': 85}

# Hold out 10% of the training data for early stopping and validation.
X_train, X_test, Y_train, y_test = train_test_split(train_data, train_labels, test_size=0.1, random_state=42)
# X_train, X_test, Y_train, y_test = train_test_split(train, y_train, test_size=0.1, random_state=42)
# model = xgb.XGBClassifier(**{'colsample_bylevel': 0.9, 'learning_rate': 0.05, 'max_depth': 20, 'n_estimators': 200,
#                                  'reg_lambda': 15, 'eval_metric':'mlogloss'
#                              }).fit(X_train, Y_train,eval_set=[(X_test,y_test)],verbose=True,early_stopping_rounds=10)
# Note: recent XGBoost releases expect early_stopping_rounds in the constructor rather than in fit().
model = xgb.XGBClassifier(**params).fit(X_train, Y_train, eval_set=[(X_test, y_test)], verbose=True, early_stopping_rounds=10)
[0]	validation_0-mlogloss:1.06489
[1]	validation_0-mlogloss:1.03391
[2]	validation_0-mlogloss:1.00446
[3]	validation_0-mlogloss:0.97824
[4]	validation_0-mlogloss:0.95265
[5]	validation_0-mlogloss:0.92869
[6]	validation_0-mlogloss:0.90672
[7]	validation_0-mlogloss:0.88671
[8]	validation_0-mlogloss:0.86817
[9]	validation_0-mlogloss:0.84989
[10]	validation_0-mlogloss:0.83220
[11]	validation_0-mlogloss:0.81559
[12]	validation_0-mlogloss:0.79999
[13]	validation_0-mlogloss:0.78577
[14]	validation_0-mlogloss:0.76979
[15]	validation_0-mlogloss:0.75615
[16]	validation_0-mlogloss:0.74430
[17]	validation_0-mlogloss:0.73394
[18]	validation_0-mlogloss:0.72340
[19]	validation_0-mlogloss:0.71213
[20]	validation_0-mlogloss:0.70240
[21]	validation_0-mlogloss:0.69234
[22]	validation_0-mlogloss:0.68344
[23]	validation_0-mlogloss:0.67649
[24]	validation_0-mlogloss:0.66963
[25]	validation_0-mlogloss:0.66135
[26]	validation_0-mlogloss:0.65426
[27]	validation_0-mlogloss:0.64670
[28]	validation_0-mlogloss:0.64092
[29]	validation_0-mlogloss:0.63407
[30]	validation_0-mlogloss:0.62783
[31]	validation_0-mlogloss:0.62136
[32]	validation_0-mlogloss:0.61609
[33]	validation_0-mlogloss:0.61149
[34]	validation_0-mlogloss:0.60653
[35]	validation_0-mlogloss:0.60186
[36]	validation_0-mlogloss:0.59752
[37]	validation_0-mlogloss:0.59198
[38]	validation_0-mlogloss:0.58920
[39]	validation_0-mlogloss:0.58555
[40]	validation_0-mlogloss:0.58232
[41]	validation_0-mlogloss:0.57922
[42]	validation_0-mlogloss:0.57612
[43]	validation_0-mlogloss:0.57331
[44]	validation_0-mlogloss:0.57131
[45]	validation_0-mlogloss:0.56983
[46]	validation_0-mlogloss:0.56668
[47]	validation_0-mlogloss:0.56498
[48]	validation_0-mlogloss:0.56203
[49]	validation_0-mlogloss:0.55954
[50]	validation_0-mlogloss:0.55627
[51]	validation_0-mlogloss:0.55448
[52]	validation_0-mlogloss:0.55283
[53]	validation_0-mlogloss:0.55103
[54]	validation_0-mlogloss:0.54848
[55]	validation_0-mlogloss:0.54717
[56]	validation_0-mlogloss:0.54580
[57]	validation_0-mlogloss:0.54485
[58]	validation_0-mlogloss:0.54429
[59]	validation_0-mlogloss:0.54348
[60]	validation_0-mlogloss:0.54173
[61]	validation_0-mlogloss:0.54092
[62]	validation_0-mlogloss:0.53995
[63]	validation_0-mlogloss:0.53866
[64]	validation_0-mlogloss:0.53831
[65]	validation_0-mlogloss:0.53765
[66]	validation_0-mlogloss:0.53714
[67]	validation_0-mlogloss:0.53656
[68]	validation_0-mlogloss:0.53638
[69]	validation_0-mlogloss:0.53584
[70]	validation_0-mlogloss:0.53612
[71]	validation_0-mlogloss:0.53557
[72]	validation_0-mlogloss:0.53456
[73]	validation_0-mlogloss:0.53392
[74]	validation_0-mlogloss:0.53278
[75]	validation_0-mlogloss:0.53291
[76]	validation_0-mlogloss:0.53192
[77]	validation_0-mlogloss:0.53182
[78]	validation_0-mlogloss:0.53220
[79]	validation_0-mlogloss:0.53227
[80]	validation_0-mlogloss:0.53185
[81]	validation_0-mlogloss:0.53200
[82]	validation_0-mlogloss:0.53186
[83]	validation_0-mlogloss:0.53255
[84]	validation_0-mlogloss:0.53258
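
The log above can be read with a patience-based stopping rule in mind. The following is a minimal pure-Python sketch of that rule, not XGBoost's actual implementation; `best_round` is a hypothetical helper, and the values are the last eleven rounds copied from the log. It suggests that the lowest validation loss was reached at round 77 and that the 85 estimators ran out before 10 rounds without improvement had passed.

```python
def best_round(losses, patience):
    """Return (best_index, stopped_early) under a patience-based stopping rule.

    Training halts once `patience` consecutive rounds fail to improve on the
    best loss seen so far; otherwise it runs through every logged round.
    """
    best_index = 0
    for i, loss in enumerate(losses):
        if loss < losses[best_index]:
            best_index = i
        elif i - best_index >= patience:
            return best_index, True
    return best_index, False


# Validation mlogloss for rounds 74-84, copied from the log above.
first_round = 74
tail = [0.53278, 0.53291, 0.53192, 0.53182, 0.53220, 0.53227,
        0.53185, 0.53200, 0.53186, 0.53255, 0.53258]

offset, stopped_early = best_round(tail, patience=10)
best_iteration = first_round + offset
print(best_iteration, stopped_early)  # 77 False
```

Since the loss is still close to its minimum at the final round, a larger `n_estimators` budget might be worth trying, although that is a guess rather than something this run shows.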
In [ ]:
test_y_orig = model.predict_proba(X_test)
print(test_y_orig.shape)

test_y = np.argmax(test_y_orig,axis=1)
print("acc",accuracy_score(y_test, test_y))
print("f1_score",f1_score(y_test,test_y, labels=[0,1,2],average='macro'))
print("logLoss",log_loss(y_test,test_y_orig))
(677, 3)
acc 0.7858197932053176
f1_score 0.41304707411345726
logLoss 0.5318208347560296
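
The hold-out metrics above show an accuracy of about 0.79 next to a macro F1 of about 0.41. One plausible reading is class imbalance: macro F1 averages the per-class F1 scores without weighting, so a model that mostly predicts the majority class can look accurate while scoring poorly on the rare classes. The sketch below illustrates this with made-up toy labels (not the challenge data); `macro_f1` is a hypothetical pure-Python helper, and in the notebook itself `f1_score(..., average='macro')` from scikit-learn plays this role.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores (a class with no hits scores 0)."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: 8 samples of class 0, one each of classes 1 and 2,
# and a model that always predicts class 0.
toy_true = [0] * 8 + [1] + [2]
toy_pred = [0] * 10

accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
score = macro_f1(toy_true, toy_pred, labels=[0, 1, 2])
print(round(accuracy, 3), round(score, 3))  # 0.8 0.296
```

Printing a per-class breakdown (for example with scikit-learn's `classification_report`) would show which of the three diagnoses drives the low macro score.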

Save your trained model

In [ ]:
# Persist the fitted model in the assets directory so the prediction phase can load it.
Filename = f'{AICROWD_ASSETS_DIR}/model_xgb_exp_4-2.pkl'

with open(Filename, "wb") as f:
    pickle.dump(model, f)

Prediction phase 🔎

Please make sure to save the weights from the training section in your assets directory and load them in this section

In [ ]:
# Load the pickled model saved during the training phase.
Filename = f'{AICROWD_ASSETS_DIR}/model_xgb_exp_4-2.pkl'
with open(Filename, "rb") as f:
    loaded_model = pickle.load(f)

Load test data

In [ ]:
test_df = pd.read_csv(AICROWD_DATASET_PATH)
test_df.head()
Out[ ]:
row_id number_of_digits missing_digit_1 missing_digit_2 missing_digit_3 missing_digit_4 missing_digit_5 missing_digit_6 missing_digit_7 missing_digit_8 missing_digit_9 missing_digit_10 missing_digit_11 missing_digit_12 1 dist from cen 10 dist from cen 11 dist from cen 12 dist from cen 2 dist from cen 3 dist from cen 4 dist from cen 5 dist from cen 6 dist from cen 7 dist from cen 8 dist from cen 9 dist from cen euc_dist_digit_1 euc_dist_digit_2 euc_dist_digit_3 euc_dist_digit_4 euc_dist_digit_5 euc_dist_digit_6 euc_dist_digit_7 euc_dist_digit_8 euc_dist_digit_9 euc_dist_digit_10 euc_dist_digit_11 euc_dist_digit_12 area_digit_1 area_digit_2 area_digit_3 area_digit_4 area_digit_5 area_digit_6 area_digit_7 area_digit_8 area_digit_9 area_digit_10 area_digit_11 area_digit_12 height_digit_1 height_digit_2 height_digit_3 height_digit_4 height_digit_5 height_digit_6 height_digit_7 height_digit_8 height_digit_9 height_digit_10 height_digit_11 height_digit_12 width_digit_1 width_digit_2 width_digit_3 width_digit_4 width_digit_5 width_digit_6 width_digit_7 width_digit_8 width_digit_9 width_digit_10 width_digit_11 width_digit_12 variance_width variance_height variance_area deviation_dist_from_mid_axis between_axis_digits_angle_sum between_axis_digits_angle_var between_digits_angle_cw_sum between_digits_angle_cw_var between_digits_angle_ccw_sum between_digits_angle_ccw_var sequence_flag_cw sequence_flag_ccw number_of_hands hand_count_dummy hour_hand_length minute_hand_length single_hand_length clockhand_ratio clockhand_diff angle_between_hands deviation_from_centre intersection_pos_rel_centre hour_proximity_from_11 minute_proximity_from_2 hour_pointing_digit actual_hour_digit minute_pointing_digit actual_minute_digit final_rotation_angle ellipse_circle_ratio count_defects percentage_inside_ellipse pred_tremor double_major double_minor vertical_dist horizontal_dist top_area_perc bottom_area_perc left_area_perc right_area_perc hor_count vert_count eleven_ten_error other_error 
time_diff centre_dot_detect
0 LA9JQ1JZMJ9D2MBZV 11.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 314.649805 NaN 408.240125 323.348110 321.706776 264.496219 203.330396 205.081081 282.015071 343.657169 416.716030 435.900218 6.119758 25.267069 17.29 6.006505 10.246421 14.43 4.778738 43.124586 46.80 NaN 67.293643 3.90 2001.0 4180.0 6318.0 6528.0 6370.0 8127.0 5610.0 3312.0 9372.0 NaN 3500.0 6336.0 69.0 95.0 117.0 128.0 98.0 129.0 102.0 69.0 142.0 NaN 70.0 72.0 29.0 44.0 54.0 51.0 65.0 63.0 55.0 48.0 66.0 NaN 50.0 88.0 225.618182 730.963636 4.773900e+06 20.605000 360.0 854.199907 NaN 8623.343673 NaN 8623.343673 0.0 0.0 3.0 3.0 NaN NaN 183.844962 NaN NaN NaN NaN NaN NaN NaN NaN 11 NaN 2 0.0 84.753550 106 1.000000 0 118.971780 106.379109 111.720745 112.581495 0.500272 0.499368 0.553194 0.446447 0 0 0 1 NaN NaN
1 PSSRCWAPTAG72A1NT 6.0 1.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 NaN NaN 235.663425 NaN NaN 325.616723 NaN NaN 288.257264 292.027396 334.951116 370.648756 NaN NaN 22.88 NaN NaN 72.80 72.787316 20.133319 96.33 NaN 60.955820 NaN NaN NaN 12390.0 NaN NaN 8848.0 5632.0 10434.0 7739.0 NaN 11834.0 NaN NaN NaN 118.0 NaN NaN 79.0 64.0 94.0 71.0 NaN 97.0 NaN NaN NaN 105.0 NaN NaN 112.0 88.0 111.0 109.0 NaN 122.0 NaN 126.166667 391.766667 6.631428e+06 64.003333 NaN 5998.258485 NaN 16273.285540 NaN 16273.285540 0.0 0.0 1.0 1.0 NaN NaN 99.180032 NaN NaN NaN NaN NaN NaN NaN NaN 11 NaN 2 180.0 73.359021 99 1.000000 0 123.968624 99.208099 104.829045 114.955335 0.572472 0.427196 0.496352 0.503273 0 1 0 1 NaN NaN
2 GCTODIZJB42VCBZRZ 11.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 438.627689 429.789774 447.455305 447.033835 409.185166 361.946474 359.824957 NaN 345.937133 366.201106 375.225266 427.154831 112.333641 100.371900 86.45 86.234478 NaN 89.57 94.556399 97.331146 111.02 111.411562 116.061975 116.22 3182.0 4473.0 4554.0 5032.0 NaN 5355.0 4148.0 4320.0 4420.0 7290.0 2726.0 5184.0 43.0 71.0 69.0 68.0 NaN 51.0 68.0 48.0 52.0 81.0 47.0 81.0 74.0 63.0 66.0 74.0 NaN 105.0 61.0 90.0 85.0 90.0 58.0 64.0 228.072727 192.618182 1.418911e+06 100.815000 360.0 315.683251 NaN 257.619483 NaN 257.619483 1.0 0.0 2.0 2.0 42.707325 78.437307 NaN 1.836624 35.729983 106.779868 55.597531 BL 6.15111 0.57766 11.0 11 2.0 2 270.0 86.346225 120 1.000000 0 124.134670 120.392100 122.909870 121.542463 0.494076 0.505583 0.503047 0.496615 1 0 0 0 0.0 0.0
3 7YMVQGV1CDB1WZFNE 3.0 1.0 0.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 NaN NaN NaN 408.827592 272.472476 NaN 195.714716 NaN NaN NaN NaN NaN NaN 2.506574 NaN 4.353660 NaN NaN NaN NaN NaN NaN NaN 12.48 NaN 1794.0 NaN 3416.0 NaN NaN NaN NaN NaN NaN NaN 3360.0 NaN 39.0 NaN 56.0 NaN NaN NaN NaN NaN NaN NaN 56.0 NaN 46.0 NaN 61.0 NaN NaN NaN NaN NaN NaN NaN 60.0 70.333333 96.333333 8.477293e+05 12.480000 360.0 NaN 360.0 11194.405100 NaN 11194.405100 1.0 0.0 3.0 3.0 NaN NaN 204.987533 NaN NaN NaN NaN NaN NaN NaN NaN 11 NaN 2 30.0 51.132436 16 0.800000 1 69.766987 53.627186 53.983727 69.002438 0.555033 0.444633 0.580023 0.419575 0 1 0 1 NaN NaN
4 PHEQC6DV3LTFJYIJU 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 1.0 1.0 1.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 436.069089 NaN NaN NaN NaN NaN NaN NaN NaN 113.252059 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 25542.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 129.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 198.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN 0.0 NaN 0.0 0.0 2.0 2.0 77.405367 92.911356 NaN 1.200322 15.505989 100.478258 8.853306 TR NaN NaN 8.0 11 8.0 2 30.0 54.115853 18 0.666667 1 112.043734 87.607876 94.088846 101.540792 0.603666 0.395976 0.494990 0.504604 0 0 0 1 150.0 0.0

Generate predictions

In [ ]:
# Test-set preprocessing: replace missing values with the sentinel -999
test_df.fillna(-999, inplace=True)

# One-hot encode the categorical features
df_dummies = pd.get_dummies(test_df['intersection_pos_rel_centre'],
                            dummy_na=False).add_prefix('c_i_')
test_df = test_df.drop('intersection_pos_rel_centre', axis=1)
test_df = pd.concat([test_df, df_dummies], axis=1)

df_dummies = pd.get_dummies(test_df['hand_count_dummy'],
                            dummy_na=False).add_prefix('c_h_')
test_df = test_df.drop('hand_count_dummy', axis=1)
test_df = pd.concat([test_df, df_dummies], axis=1)

# Split the rotation angle into two binary flags
feat_col = test_df['final_rotation_angle']
# NaN was filled with -999 above, so missing angles also fall into this column
test_df['rotation_angle_180'] = (feat_col <= 180).astype('int')
test_df['rotation_angle_360'] = (feat_col > 180).astype('int')
test_df = test_df.drop('final_rotation_angle', axis=1)

# Every column except row_id is a model feature
features = test_df.columns[1:].to_list()
# test_data = (test_data-test_data.mean())/test_data.std()
test_df.describe()
Out[ ]:
number_of_digits missing_digit_1 missing_digit_2 missing_digit_3 missing_digit_4 missing_digit_5 missing_digit_6 missing_digit_7 missing_digit_8 missing_digit_9 missing_digit_10 missing_digit_11 missing_digit_12 1 dist from cen 10 dist from cen 11 dist from cen 12 dist from cen 2 dist from cen 3 dist from cen 4 dist from cen 5 dist from cen 6 dist from cen 7 dist from cen 8 dist from cen 9 dist from cen euc_dist_digit_1 euc_dist_digit_2 euc_dist_digit_3 euc_dist_digit_4 euc_dist_digit_5 euc_dist_digit_6 euc_dist_digit_7 euc_dist_digit_8 euc_dist_digit_9 euc_dist_digit_10 euc_dist_digit_11 euc_dist_digit_12 area_digit_1 area_digit_2 area_digit_3 area_digit_4 area_digit_5 area_digit_6 area_digit_7 area_digit_8 area_digit_9 area_digit_10 area_digit_11 area_digit_12 height_digit_1 height_digit_2 height_digit_3 height_digit_4 height_digit_5 height_digit_6 height_digit_7 height_digit_8 height_digit_9 height_digit_10 height_digit_11 height_digit_12 width_digit_1 width_digit_2 width_digit_3 width_digit_4 width_digit_5 width_digit_6 width_digit_7 width_digit_8 width_digit_9 width_digit_10 width_digit_11 width_digit_12 variance_width variance_height variance_area deviation_dist_from_mid_axis between_axis_digits_angle_sum between_axis_digits_angle_var between_digits_angle_cw_sum between_digits_angle_cw_var between_digits_angle_ccw_sum between_digits_angle_ccw_var sequence_flag_cw sequence_flag_ccw number_of_hands hour_hand_length minute_hand_length single_hand_length clockhand_ratio clockhand_diff angle_between_hands deviation_from_centre hour_proximity_from_11 minute_proximity_from_2 hour_pointing_digit actual_hour_digit minute_pointing_digit actual_minute_digit ellipse_circle_ratio count_defects percentage_inside_ellipse pred_tremor double_major double_minor vertical_dist horizontal_dist top_area_perc bottom_area_perc left_area_perc right_area_perc hor_count vert_count eleven_ten_error other_error time_diff centre_dot_detect c_i_-999 c_i_BL c_i_BR c_i_TL c_i_TR c_h_-999.0 
c_h_1.0 c_h_2.0 c_h_3.0 rotation_angle_180 rotation_angle_360
counte+02 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.0 362.000000 362.0 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000 362.000000
mean 1.162983 -7.991713 -8.102210 -8.107735 -8.030387 -8.019337 -8.088398 -8.082873 -8.093923 -7.991713 -8.044199 -8.058011 -8.091160 -45.680436 31.642033 55.295891 102.210829 92.376567 90.287175 -13.501951 -26.706205 79.147537 85.511666 99.085287 -30.102850 -275.058934 -156.037805 -150.985663 -232.805849 -248.223235 -176.613370 -186.233368 -173.101758 -278.854475 -218.951005 -209.023783 -175.511602 1387.035912 3198.044199 3905.262431 3924.861878 4630.842541 4684.417127 3675.535912 4496.571823 3633.817680 4703.665746 3817.712707 5386.571823 -252.676796 -129.790055 -114.955801 -192.715470 -204.524862 -127.077348 -139.024862 -123.234807 -231.419890 -182.058011 -171.676796 -133.657459 -266.469613 -136.226519 -128.914365 -208.651934 -212.323204 -144.569061 -157.389503 -139.649171 -252.367403 -183.378453 -176.593923 -126.798343 291.317883 274.955367 4.573130e+06 -23.236683 133.085635 3015.339961 -248.618785 4345.408310 -924.364641 4067.561501 -29.718232 -30.317680 -150.320442 -378.732888 -367.350713 -717.272715 -418.681869 -402.568102 -361.103059 -409.578908 -470.287429 -475.533980 -417.058011 11.0 -422.248619 2.0 42.627259 90.502762 -34.977202 0.370166 102.970476 105.856352 96.312866 110.905686 -74.034884 -74.075900 -74.017108 -74.084348 0.640884 0.616022 0.030387 0.701657 -366.298343 -413.803867 0.414365 0.110497 0.063536 0.270718 0.140884 0.151934 0.248619 0.585635 0.013812 0.837017 0.162983
std 91.608728 90.718691 90.708251 90.707725 90.715052 90.716093 90.709563 90.710088 90.709038 90.718691 90.713749 90.712443 90.709301 620.279114 587.337546 577.393553 547.077751 523.548776 512.949847 582.246763 590.840671 540.226328 553.225637 541.535323 629.647782 470.360625 403.259548 398.283328 451.937553 458.762786 411.248521 420.471216 413.137277 470.931505 443.592290 439.178061 412.022393 1845.688981 2642.225254 3247.723684 3643.716621 4272.443444 3744.436749 3108.776373 3767.060731 3578.529749 4171.058854 3526.989193 4645.630064 484.438451 415.132879 414.544209 475.209106 481.700497 435.742522 437.047158 430.028528 498.334084 463.951228 452.171630 428.512073 475.397341 412.002566 408.079185 465.664206 476.957280 426.714034 427.533019 421.809131 484.546268 463.221351 449.561187 432.158271 417.552432 412.680847 5.320984e+06 238.272403 494.618100 7075.591485 670.813929 7278.555434 296.908559 7273.713097 171.828469 171.721675 359.713647 522.571851 532.151435 473.315023 494.400626 502.484286 537.687130 502.340675 515.915244 519.510469 498.608806 0.0 496.965762 0.0 193.684341 40.329684 186.314716 0.483517 145.715864 15.527933 132.187436 60.881096 262.957519 262.945862 262.962562 262.943447 0.684848 0.681426 0.171887 0.458164 570.532069 492.923960 0.493294 0.313942 0.244262 0.444946 0.348383 0.359453 0.432811 0.493294 0.116872 0.369862 0.369862
mine+02 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 11.0 -999.000000 2.0 -999.000000 1.000000 -999.000000 0.000000 -999.000000 11.273212 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 0.000000 0.000000 0.000000 0.000000 -999.000000 -999.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 8.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -999.000000 144.294198 262.042659 284.137571 249.118225 243.287434 -999.000000 -999.000000 266.724106 249.861748 271.547610 -999.000000 -999.000000 2.518303 3.542500 -999.000000 -999.000000 1.592500 1.207865 1.613887 -999.000000 0.473824 1.379151 1.982500 -999.000000 2094.250000 2217.000000 -999.000000 -999.000000 2775.000000 2095.500000 2782.000000 -999.000000 1980.250000 1917.000000 2965.750000 -999.000000 43.000000 52.250000 -999.000000 -999.000000 47.000000 46.000000 52.000000 -999.000000 41.000000 43.250000 51.000000 -999.000000 41.000000 39.000000 -999.000000 -999.000000 43.000000 36.000000 42.000000 -999.000000 41.250000 39.000000 55.250000 126.895833 126.715909 1.153591e+06 9.725625 360.000000 63.526159 -999.000000 47.786087 -999.000000 47.786087 0.000000 0.000000 1.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 -999.000000 11.0 -999.000000 2.0 75.978601 72.250000 1.000000 0.000000 115.206222 101.999828 105.967279 108.592271 0.465116 0.472290 0.496989 0.446083 0.000000 0.000000 0.000000 0.000000 -999.000000 -999.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000
50% 11.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 340.594135 348.951608 357.445736 367.249656 334.426383 322.181795 313.459855 312.240464 337.474813 358.576575 356.006472 358.128023 8.123884 17.638277 18.265000 10.728551 9.665663 9.620000 10.250842 13.793486 7.605000 11.827152 11.258665 9.360000 1598.000000 3256.000000 3942.500000 4191.000000 5068.500000 4931.000000 3825.000000 4560.000000 4180.000000 4884.500000 4048.500000 5137.000000 49.000000 61.000000 71.000000 73.000000 75.000000 79.500000 73.000000 76.500000 77.000000 71.000000 66.000000 67.000000 30.000000 52.000000 52.000000 54.000000 63.500000 60.000000 50.000000 57.000000 51.000000 68.000000 57.000000 74.000000 242.727273 233.799242 2.902186e+06 17.192500 360.000000 331.140164 360.000000 218.383128 -999.000000 211.791500 1.000000 0.000000 2.000000 45.604622 64.340685 -999.000000 1.078746 5.955294 75.630401 5.934735 0.207853 0.128705 2.000000 11.0 2.000000 2.0 82.989690 99.000000 1.000000 0.000000 120.411917 109.019512 112.206355 115.840623 0.494028 0.499047 0.519191 0.473735 1.000000 1.000000 0.000000 1.000000 -10.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 1.000000 0.000000
75% 12.000000 1.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 379.660083 392.670725 392.631904 401.540704 370.185361 359.760763 355.246755 356.668000 375.137852 391.751116 392.089319 395.523221 22.987777 41.173630 39.422500 33.101467 32.315729 25.902500 33.088856 33.130066 22.620000 29.896813 33.380212 24.765000 2472.000000 4670.000000 5692.500000 6287.500000 7552.250000 7100.500000 5622.000000 6390.000000 6016.500000 7194.500000 5731.000000 7551.250000 66.750000 75.750000 86.000000 94.000000 93.000000 102.000000 92.750000 97.000000 100.000000 85.000000 81.000000 84.000000 40.000000 64.750000 67.000000 69.000000 82.000000 72.000000 63.000000 71.000000 63.000000 85.000000 74.000000 92.000000 420.986742 403.266793 6.006315e+06 51.390625 360.000000 2449.662538 360.000000 8611.869491 -999.000000 7907.706964 1.000000 0.000000 2.000000 62.474110 81.952387 31.634053 1.337104 20.814321 93.535639 14.051095 5.927157 5.204592 11.000000 11.0 2.000000 2.0 86.643218 120.000000 1.000000 1.000000 124.412267 113.820122 116.341874 120.725598 0.514035 0.522828 0.539565 0.494168 1.000000 1.000000 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 1.000000 0.000000 1.000000 0.000000
max 13.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 492.941426 505.421853 495.745903 571.075520 486.444498 456.444137 437.433424 478.221967 535.724276 658.555427 529.761503 486.066868 112.333641 116.282230 114.400000 113.624643 108.472561 115.700000 115.930144 118.501563 111.410000 114.565815 116.061975 119.600000 8890.000000 12330.000000 16380.000000 19314.000000 22912.000000 19608.000000 17030.000000 25542.000000 14522.000000 18810.000000 22490.000000 23406.000000 127.000000 122.000000 133.000000 167.000000 154.000000 167.000000 152.000000 214.000000 165.000000 165.000000 173.000000 157.000000 127.000000 137.000000 163.000000 174.000000 179.000000 167.000000 139.000000 198.000000 145.000000 145.000000 142.000000 249.000000 2714.636364 2840.333333 3.639078e+07 114.205000 360.000000 46608.017880 360.000000 58155.070340 360.000000 58155.070340 1.000000 1.000000 3.000000 102.329077 116.749554 241.175192 2.318983 57.496215 179.918115 166.550485 172.292277 178.954617 12.000000 11.0 12.000000 2.0 99.147283 159.000000 1.000000 1.000000 391.080498 187.195461 390.900352 258.983248 0.999766 0.999609 0.999422 1.000000 3.000000 2.000000 1.000000 1.000000 600.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
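
One thing to watch with the one-hot step above: the dummy columns are derived from whatever categories happen to appear in the test file, so their set and order may differ from what the model saw during training. The following is a minimal pure-Python sketch of aligning a row to a fixed training-time column list. The column list, the row, and `align_to_training_columns` are all hypothetical; with pandas, `DataFrame.reindex(columns=..., fill_value=0)` achieves the same effect.

```python
def align_to_training_columns(row, training_columns, fill_value=0):
    """Return the row's values in training order, filling absent columns."""
    return [row.get(col, fill_value) for col in training_columns]


# Hypothetical training-time column order (not taken from the notebook).
training_columns = ["c_i_BL", "c_i_BR", "c_i_TL", "c_i_TR"]

# A test row from a batch that never contained the "TL" category and that
# carries an extra column the model was never trained on.
test_row = {"c_i_TR": 1, "c_i_BL": 0, "c_i_BR": 0, "c_i_XX": 1}

aligned = align_to_training_columns(test_row, training_columns)
print(aligned)  # [0, 0, 0, 1]
```

Saving the training feature list next to the model in the assets directory would make this alignment straightforward at prediction time.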
In [ ]:
test_data = test_df[features]
preds = loaded_model.predict_proba(test_data)
# preds
In [ ]:
# (preds==0).astype(int)
In [ ]:
check_val = False

if check_val:
    
    y_true = pd.read_csv(AICROWD_DATASET_PATH.replace("validation", "validation_ground_truth"))
    y_test = y_true['diagnosis'].map(target_dict).values
    preds_2 = np.argmax(preds,axis=1)
    
    print("acc",accuracy_score(y_test, preds_2))
    print("f1_score",f1_score(y_test,preds_2, labels=[0,1,2],average='macro'))
    print("logLoss",log_loss(y_test,preds))
In [ ]:
# predictions = {
#     "row_id": test_data["row_id"].values,
#     "normal_diagnosis_probability": (preds==0).astype(int),
#     "post_alzheimer_diagnosis_probability":(preds==1).astype(int),
#     "pre_alzheimer_diagnosis_probability": (preds==2).astype(int),
# }


predictions = {
    "row_id": test_df["row_id"].values,
    "normal_diagnosis_probability": preds[:,0],
    "post_alzheimer_diagnosis_probability":preds[:,1],
    "pre_alzheimer_diagnosis_probability": preds[:,2],
}

predictions_df = pd.DataFrame.from_dict(predictions)

Save predictions 📨

In [ ]:
predictions_df.to_csv(AICROWD_PREDICTIONS_PATH, index=False)

Submit to AIcrowd 🚀

NOTE: PLEASE SAVE THE NOTEBOOK BEFORE SUBMITTING IT (Ctrl + S)

In [ ]:
%env DATASET_PATH=$AICROWD_DATASET_PATH
    --assets-dir $AICROWD_ASSETS_DIR \
    --challenge addi-alzheimers-detection-challenge
env: DATASET_PATH=Z:/challenge-data/validation.csv
Using notebook: C:\Users\workspace\EDA, FE and HPO - All you need (LB - 0.640).ipynb for submission...
Removing existing files from submission directory...
Scrubbing API keys from the notebook...
Collecting notebook...
Validating the submission...
Executing install.ipynb...
Executing predict.ipynb...
submission.zip --------------------- 100.0% • 24.2/24.2 MB • 2.2 MB/s • 0:00:00
                                                 +-------------------------+
                                                 | Successfully submitted! |                                                 
                                                 +-------------------------+                                                 
                                                       Important links                                                       
+---------------------------------------------------------------------------------------------------------------------------+
|  This submission | https://www.aicrowd.com/challenges/addi-alzheimers-detection-challenge/submissions/136714              |
|                  |                                                                                                        |
|  All submissions | https://www.aicrowd.com/challenges/addi-alzheimers-detection-challenge/submissions?my_submissions=true |
|                  |                                                                                                        |
|      Leaderboard | https://www.aicrowd.com/challenges/addi-alzheimers-detection-challenge/leaderboards                    |
|                  |                                                                                                        |
| Discussion forum | https://discourse.aicrowd.com/c/addi-alzheimers-detection-challenge                                    |
|                  |                                                                                                        |
|   Challenge page | https://www.aicrowd.com/challenges/addi-alzheimers-detection-challenge                                 |
+---------------------------------------------------------------------------------------------------------------------------+
[NbConvertApp] Converting notebook C:\Users\workspace\submission/install.ipynb to notebook
[NbConvertApp] Executing notebook with kernel: python
[NbConvertApp] Writing 1080 bytes to C:\Users\workspace\submission\install.nbconvert.ipynb
[NbConvertApp] Converting notebook C:\Users\workspace\submission/predict.ipynb to notebook
[NbConvertApp] Executing notebook with kernel: python
[NbConvertApp] Writing 116528 bytes to C:\Users\workspace\submission\predict.nbconvert.ipynb

Conclusion:

This notebook demonstrates exploratory data analysis and uses its findings to guide some feature engineering. The next part explains the use of Optuna for hyper-parameter optimization of an XGBoost model. I plan to elaborate on the feature engineering and ensembling in a follow-up discussion/blog walkthrough thread. I'll also try to set up proper stratified K-fold cross-validation, as I find the current validation a bit unsatisfying. Stay tuned for the next part, and if you find this notebook useful, please don't forget to hit the like button at the top of the notebook.
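
As a pointer toward the stratified K-fold validation mentioned above, here is a minimal pure-Python sketch of the idea: indices are grouped by class, shuffled, and dealt round-robin so that every fold keeps roughly the same class proportions. `stratified_folds` and the toy labels are hypothetical; in practice scikit-learn's `StratifiedKFold` would be the natural choice.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k, seed=42):
    """Split sample indices into k folds with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices round-robin so each fold gets its share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Made-up, imbalanced toy labels (not the challenge data).
toy_labels = [0] * 6 + [1] * 3 + [2] * 3
folds = stratified_folds(toy_labels, k=3)
fold_class_counts = [Counter(toy_labels[i] for i in fold) for fold in folds]
for fold_id, counts in enumerate(fold_class_counts):
    print(fold_id, dict(sorted(counts.items())))
```

Each fold would serve once as the validation set while the model trains on the remaining folds, and averaging the per-fold log loss gives a steadier estimate than a single 10% hold-out.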

Comments

Johnowhitaker
Over 3 years ago

Great notebook!

santiactis
Over 3 years ago

Nice one! Really useful EDA.

k-_-k
Over 3 years ago

Thank you for your great notebook! I learned a lot :)
