ADDI Alzheimers Detection Challenge
EDA, FE, HPO - All you need (LB: 0.640)
Detailed EDA, FE with Class Balancing, Hyper-Parameter Optimization of XGBoost using Optuna
This notebook walks through feature-level exploratory data analysis with observation comments, simple feature engineering including class balancing, and XGBoost hyper-parameter optimization using the HPO framework Optuna.
What is the notebook about?¶
The challenge is to use the features extracted from the Clock Drawing Test to build an automated algorithm that predicts which of three phases each participant is in:
1. Pre-Alzheimer’s (Early Warning)
2. Post-Alzheimer’s (Detection)
3. Normal (Not an Alzheimer’s patient)
In machine learning terms: this is a 3-class classification task.
How to use this notebook? 📝¶
- Update the config parameters. You can define the common variables here
Variable | Description
---|---
AICROWD_DATASET_PATH | Path to the file containing the test data (the data will be available at /ds_shared_drive/ on the Aridhia workspace). This should be an absolute path.
AICROWD_PREDICTIONS_PATH | Path to write the output to.
AICROWD_ASSETS_DIR | In case your notebook needs additional files (like model weights, etc.), you can add them to a directory and specify the path to the directory here (please specify a relative path). The contents of this directory will be sent to AIcrowd for evaluation.
AICROWD_API_KEY | In order to submit your code to AIcrowd, you need to provide your account's API key. This key is available at https://www.aicrowd.com/participants/me
- Installing packages. Please use the Install packages 🗃 section to install the packages
- Training your models. All the code within the Training phase ⚙️ section will be skipped during evaluation. Please make sure to save your model weights in the assets directory and load them in the predictions phase section
Content:¶
- Exploratory Data Analysis
- Feature Engineering
- Hyper-parameter Optimization
- Training Best Parameters Model
- Final Prediction and submission
Introduction:¶
Hello, I am Jyot Makadiya, a pre-final-year student pursuing a Bachelor of Technology in Computer Science & Engineering. I have been experimenting with data for a year now; so far the journey has been smooth and I have learned a lot along the way.
This challenge can be framed as a multiclass classification problem with 3 classes (Normal, Pre-Alzheimer’s, Post-Alzheimer’s). The main tasks for achieving a good score are a solid cross-validation setup on a balanced dataset, good feature engineering, and fine-tuning hyper-parameters along with ensembling.
This notebook covers my approach for this competition, starting with exploratory data analysis. It then covers simple feature engineering for a few features (I'll expand the ideas of FE and ensembling in the next part/walkthrough blog). Finally, we use Optuna for hyper-parameter optimization.
The aim of this notebook is to introduce you to a variety of concepts, including but not limited to hyper-parameter optimization (AutoML tools) and simple but feature-level EDA and FE.
For a better view of the graphs and plots, open this notebook in Colab using the "Open in Colab" button.
Setup AIcrowd Utilities 🛠¶
We use this to bundle the files for submission and create a submission on AIcrowd. Do not edit this block.
!pip install -q -U aicrowd-cli
%load_ext aicrowd.magic
AIcrowd Runtime Configuration 🧷¶
Define configuration parameters. Please include any files needed for the notebook to run under ASSETS_DIR. We will copy the contents of this directory to your final submission file 🙂
The dataset is available under /ds_shared_drive on the workspace.
import os
# Please use the absolute path for the location of the dataset.
# Or you can use a relative path such as `os.getcwd() + "/test_data/validation.csv"`
AICROWD_DATASET_PATH = os.getenv("DATASET_PATH", "Z:/challenge-data/validation.csv")
AICROWD_PREDICTIONS_PATH = os.getenv("PREDICTIONS_PATH", "Z:/challenge-data/predictions.csv")
AICROWD_ASSETS_DIR = "assets"
Install packages 🗃¶
Please add all package installations in this section.
!pip install -q numpy pandas
!pip install -q xgboost scikit-learn seaborn lightgbm optuna
Define preprocessing code 💻¶
The code that is common between the training and the prediction sections should be defined here. During evaluation, we completely skip the training section. Please make sure to add any common logic between the training and prediction sections here.
Import common packages¶
Please import packages that are common for training and prediction phases here.
import xgboost as xgb
import numpy as np
import pandas as pd
import pickle
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set()
sns.color_palette("rocket_r")
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 200)
pd.set_option('display.width', 1000)
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, log_loss, f1_score
import joblib
import warnings
warnings.filterwarnings("ignore")
# df
# with open(AICROWD_DATASET_PATH) as f:
# f.read()
# some preprocessing code
# os.listdir('Z:/challenge-data/')
#Pre Processing functions
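As a concrete example of what could live in this common section, here is a sketch of a helper that factors out the feature engineering applied identically to the training and test frames later in the notebook (one-hot encoding of intersection_pos_rel_centre and hand_count_dummy, plus the two coarse rotation-angle flags). It is not wired into the rest of the notebook, which keeps the inline version; the function name preprocess_features is my own.
def preprocess_features(frame):
    """Sketch of shared preprocessing for the train and test frames (not used below)."""
    frame = frame.copy()
    frame.fillna(-999, inplace=True)
    # One-hot encode the two categorical columns, mirroring the inline FE further down.
    for col, prefix in [('intersection_pos_rel_centre', 'c_i_'), ('hand_count_dummy', 'c_h_')]:
        dummies = pd.get_dummies(frame[col], dummy_na=False).add_prefix(prefix)
        frame = pd.concat([frame.drop(col, axis=1), dummies], axis=1)
    # Collapse the rotation angle into two coarse flags (<= 180 and > 180 degrees).
    angle = frame['final_rotation_angle']
    frame['rotation_angle_180'] = (angle <= 180).astype('int')
    frame['rotation_angle_360'] = (angle > 180).astype('int')
    return frame.drop('final_rotation_angle', axis=1)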
Training phase ⚙️¶
You can define your training code here. This section will be skipped during evaluation.
# model = define_your_model
Load training data¶
df_orig = pd.read_csv("Z:/challenge-data/train.csv")
df_valid = pd.read_csv("Z:/challenge-data/validation.csv")
df_valid_target = pd.read_csv("Z:/challenge-data/validation_ground_truth.csv")
df = df_orig.copy()
df.describe()
# list(df.columns)
Exploratory Data Analysis¶
# Final Rotation Angle in degrees
feat_col = df['final_rotation_angle']
feat_col.fillna(-5,inplace=True)
plt.figure(figsize=(14,8))
fig = sns.countplot(x = 'final_rotation_angle',data=df, palette='rocket_r', hue='diagnosis')
fig.set_xlabel("Rotation Angle in Degree",size=15)
fig.set_ylabel("Angle Frequency",size=15)
plt.title('Angle frequencies for all samples',size = 20)
plt.show()
We can see that there are only 13 discrete values for the rotation angle. Instead of using these directly, we can bin them into 4 columns, each representing a 90-degree range (one quarter of the circle).
print(f"number of unique values for rotation angles: {feat_col.nunique()}")
#now we can bin the angle into 4 quarter columns
df['rotation_angle_90'] = (feat_col <= 90).astype('int') # NaN (filled with -5) also ends up in this column
df['rotation_angle_180'] = ((90 < feat_col) & (feat_col <= 180)).astype('int')
df['rotation_angle_270'] = ((180 < feat_col) & (feat_col <= 270)).astype('int')
df['rotation_angle_360'] = (feat_col > 270).astype('int')
#We are not using these four columns for now; later we will instead use two columns for below 180 and above 180
# number of digits
feat_col = df['number_of_digits']
feat_col.fillna(-1,inplace=True)
plt.figure(figsize=(14,8))
fig = sns.countplot(data=df, x="number_of_digits",palette='rocket', hue="diagnosis" )
fig.set_xlabel("number of digits",size=15)
fig.set_ylabel("Digits Frequency",size=15)
plt.title('Num Digits frequencies for all samples',size = 20)
plt.show()
print(f"number of unique values for number digits: {df['number_of_digits'].nunique()}")
We can see that most values lie in the 10-12 range, which is a good indicator for the large normal portion of our dataset. A new binary feature that is true when the count is 10, 11 or 12 may therefore be useful.
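A minimal sketch of that idea (the column name digits_10_to_12 is hypothetical, and this feature is not used later in the notebook):
# Hypothetical helper feature: flag drawings with a near-complete digit count (10-12).
df['digits_10_to_12'] = df['number_of_digits'].isin([10, 11, 12]).astype('int')
df.groupby('diagnosis')['digits_10_to_12'].mean()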
#Let's look at some of the categorical features that repeat across multiple digit instances
#For missing Digit values
plt.figure()
fig, ax = plt.subplots(4, 3,figsize=(14, 20))
cont_features = [f"missing_digit_{i}" for i in range(1,13)]
for i,feature in enumerate(cont_features):
plt.subplot(4, 3,i+1)
sns.countplot(data=df, x=feature,palette='rocket' )
plt.xlabel(f"Count of values for {feature}", fontsize=12);# plt.legend()
plt.show()
The ratio is similar for almost all digits, with around 5,000 values missing each. We can notice a larger missing portion in the missing_digit_1 and missing_digit_5 variables.
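A quick descriptive check of how many digits are missing per drawing (a sketch only; it does not create any feature used later):
missing_cols = [f"missing_digit_{i}" for i in range(1, 13)]
df[missing_cols].sum(axis=1).describe()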
#Let's look at Euclidean distance from digits
#this feature can be calculated using the Euclidean distance formula between ideal and detected digit positions: sqrt(a^2 + b^2)
plt.figure()
fig, ax = plt.subplots(4, 3,figsize=(14, 20))
cont_features = [f"euc_dist_digit_{i}" for i in range(1,13)]
for i,feature in enumerate(cont_features):
plt.subplot(4, 3,i+1)
df[feature].fillna(-10,inplace=True)
sns.distplot(df[feature] , color='Red')
plt.xlabel(f"Frequency of values for {feature}", fontsize=12);# plt.legend()
plt.show()
#Let's look at Euclidean distance from center(512,512) to digits
#this feature can be calculated using the Euclidean distance formula between the centre (512,512) and the detected digit positions: sqrt(a^2 + b^2)
plt.figure()
fig, ax = plt.subplots(4, 3,figsize=(14, 20))
cont_features = [f"{i} dist from cen" for i in range(1,13)]
for i,feature in enumerate(cont_features):
plt.subplot(4, 3,i+1)
df[feature].fillna(-10,inplace=True)
sns.distplot(df[feature] , color='Red')
plt.xlabel(f"Frequency distribution of values for {feature}", fontsize=12);# plt.legend()
plt.show()
The distributions appear roughly Gaussian and balanced, with a spread of around 200. Another thing to notice is that there are a lot of missing values in these variables.
#The next set of variables is the area of each digit's bounding box
plt.figure()
fig, ax = plt.subplots(4, 3,figsize=(14, 20))
cont_features = [f"area_digit_{i}" for i in range(1,13)]
for i,feature in enumerate(cont_features):
plt.subplot(4, 3,i+1)
df[feature].fillna(-1,inplace=True)
sns.distplot(df[feature] , color='Red')
plt.xlabel(f"Frequency distribution for {feature}", fontsize=12);# plt.legend()
plt.show()
We can notice that the distributions have large variance and appear skewed. We may use some feature engineering to correct this.
#The next set of variables is the height of each digit's bounding box
plt.figure()
fig, ax = plt.subplots(4, 3,figsize=(14, 20))
cont_features = [f"height_digit_{i}" for i in range(1,13)]
for i,feature in enumerate(cont_features):
plt.subplot(4, 3,i+1)
df[feature].fillna(-1,inplace=True)
sns.distplot(df[feature] , color='Red')
plt.xlabel(f"Frequency distribution for {feature}", fontsize=12);# plt.legend()
plt.show()
There is a lot of variance in the heights of the bounding boxes. This may be explained by the different sizes of the bounding boxes, since the size differs between the single-digit numbers and 11, 12.
#The next set of variables is the width of each digit's bounding box
plt.figure()
fig, ax = plt.subplots(4, 3,figsize=(14, 20))
cont_features = [f"width_digit_{i}" for i in range(1,13)]
for i,feature in enumerate(cont_features):
plt.subplot(4, 3,i+1)
df[feature].fillna(-1,inplace=True)
sns.distplot(df[feature] , color='Red')
plt.xlabel(f"Frequency distribution for {feature}", fontsize=12); # plt.legend()
plt.show()
Again we notice some skewness and a large portion of missing values in these variables. The variance also differs across most of them.
# we will now look into the variance of the height and width feature distributions, to get more insight into the data
plt.figure()
fig, ax = plt.subplots(1, 1,figsize=(14, 8))
sns.distplot(df['variance_height'],color="blue", kde=True,bins=120, label='variance_height')
sns.distplot(df['variance_width'],color="red", kde=True,bins=120, label='variance_width')
# sns.distplot(df['variance_area'],color="green", kde=True,bins=120, label='variance_area')
plt.title('Variance in height and width features',size = 20)
plt.legend()
plt.show()
Surprisingly, they are almost identical, which is good: we at least know that the height and width variables are strongly correlated. Another thing to notice is that the area variable can be derived by multiplying height and width, since area = H*W for a bounding box. (Sadly, we cannot recover the missing values in area from H and W, as they are missing in those variables too.)
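A quick sanity check of that area = H*W relationship for one digit (a sketch; rows where any of the three columns were missing and filled with -1 are excluded):
mask = (df['height_digit_1'] > 0) & (df['width_digit_1'] > 0) & (df['area_digit_1'] > 0)
# maximum absolute difference between the provided area and height*width
(df.loc[mask, 'height_digit_1'] * df.loc[mask, 'width_digit_1'] - df.loc[mask, 'area_digit_1']).abs().max()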
# we will look into the variance of the area, to get more insight into the data
plt.figure()
fig, ax = plt.subplots(1, 1,figsize=(14, 8))
sns.distplot(df['variance_area'],color="green", kde=True,bins=120, label='variance_area')
plt.title('Variance in area feature',size = 20)
plt.legend()
plt.show()
#The next set of variables is the between-digit angle, given as clockwise and counter-clockwise sums and variances
plt.figure()
fig, ax = plt.subplots(2, 1,figsize=(14, 8))
cont_features = ['between_digits_angle_cw_sum','between_digits_angle_ccw_sum']
for i,feature in enumerate(cont_features):
plt.subplot(2, 1,i+1)
df[feature].fillna(-1,inplace=True)
sns.countplot(data=df, x=feature,palette='rocket')
plt.xlabel(f"count values Frequency distribution for {feature}", fontsize=12); # plt.legend()
plt.show()
#same with variance variable
plt.figure()
fig, ax = plt.subplots(2, 1,figsize=(14, 8))
cont_features = ['between_digits_angle_cw_sum','between_digits_angle_ccw_sum']
for i,feature in enumerate(cont_features):
plt.subplot(2, 1,i+1)
df[feature].fillna(-1,inplace=True)
# sns.distplot(df[feature],color="blue", kde=True,bins=120, label='sum')
sns.distplot(df[feature.replace('sum','var')],color="red", kde=True,bins=120, label='var')
plt.xlabel(f"Frequency distribution for {feature.replace('sum','var')}", fontsize=12); # plt.legend()
plt.show()
The majority of the values above are concentrated at 0 for both variance variables. This indicates either very precise drawings or a large number of missing values, which we can confirm from the sum count plots.
features = df_orig.columns[1:-1].to_list()
for f in features:
print(f" {f} is having : {df[f].nunique()} distinct values")
#Now we take a look at the categorical features with only a few distinct values, as count plot distributions
plt.figure()
fig, ax = plt.subplots(4, 3,figsize=(14, 20))
cont_features = ['sequence_flag_cw',
'sequence_flag_ccw',
'number_of_hands',
'hand_count_dummy',
'pred_tremor',
'hor_count',
'vert_count',
'eleven_ten_error',
'other_error',
'centre_dot_detect']
for i,feature in enumerate(cont_features):
plt.subplot(4,3,i+1)
df[feature].fillna(-1,inplace=True)
sns.countplot(data=df, x=feature,palette='rocket')
plt.xlabel(f"Count Values for {feature}", fontsize=12); # plt.legend()
plt.show()
The count plots above cover all categorical features with fewer than 7 categories. We can already see some odd patterns in hand_count_dummy and number_of_hands. Upon further checking, the abnormal values (greater than 2) seem to come from the normal diagnosis label.
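That last observation can be verified with a quick lookup (a sketch using number_of_hands, one of the two columns mentioned above):
# Which diagnosis labels do the abnormal hand counts come from?
df.loc[df['number_of_hands'] > 2, 'diagnosis'].value_counts()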
#check the null values in training data
print(f" Training data has Null values : {df_orig.isnull().sum()}")
# Finally, we take a look at the remaining feature distributions, as they contain a large number of distinct values suitable for a distribution plot
plt.figure()
fig, ax = plt.subplots(4, 3,figsize=(14, 20))
cont_features = ['deviation_dist_from_mid_axis',
'between_axis_digits_angle_sum',
'between_axis_digits_angle_var',
'hour_hand_length',
'minute_hand_length',
'single_hand_length',
'clockhand_ratio',
'clockhand_diff',
'angle_between_hands',
'deviation_from_centre',
'hour_proximity_from_11',
'minute_proximity_from_2',
]
for i,feature in enumerate(cont_features):
plt.subplot(4, 3,i+1)
df[feature].fillna(-1,inplace=True)
sns.distplot(df[feature] , color='blue')
plt.xlabel(f"Distribution for {feature}", fontsize=12); # plt.legend()
plt.show()
cont_features = ['hour_pointing_digit',
'minute_pointing_digit',
'final_rotation_angle',
'ellipse_circle_ratio',
'count_defects',
'percentage_inside_ellipse',
'double_major',
'double_minor',
'vertical_dist',
'horizontal_dist',
'top_area_perc',
'bottom_area_perc',
'left_area_perc',
'right_area_perc',
'time_diff']
plt.figure()
fig, ax = plt.subplots(5, 3,figsize=(14, 20))
for i,feature in enumerate(cont_features):
plt.subplot(5, 3,i+1)
df[feature].fillna(-1,inplace=True)
sns.distplot(df[feature] , color='blue')
plt.xlabel(f"Distribution for {feature}", fontsize=12); # plt.legend()
plt.show()
#one categorical feature with true categorical values
# intersection_pos_rel_centre
feat_col = df['intersection_pos_rel_centre']
feat_col.fillna(-1,inplace=True)
plt.figure(figsize=(14,8))
fig = sns.countplot(data=df, x="intersection_pos_rel_centre",palette='rocket', hue="diagnosis" )
fig.set_xlabel("categories in intersection_pos_rel_centre",size=15)
fig.set_ylabel("Frequency values",size=15)
plt.title('Categorical values distribution with classes',size = 20)
plt.show()
# we will look into the final target variable to get more insight into the data
plt.figure()
fig, ax = plt.subplots(1, 1,figsize=(14, 8))
sns.countplot(data=df, x='diagnosis',palette='rocket')
plt.title('Distribution of target variable',size = 20)
plt.legend()
plt.show()
We can see a very large class imbalance, which we will address later during feature engineering to make the distribution more even.
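For reference, the exact class proportions behind the plot can be printed directly:
# share of each diagnosis class in the training data
df['diagnosis'].value_counts(normalize=True)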
def CorrMtx(df, dropDuplicates = True):
    # Compute the correlation matrix from the raw features
    df = df.corr()

    # Exclude duplicate correlations by masking the upper-right triangle
    if dropDuplicates:
        mask = np.zeros_like(df, dtype=bool)
        mask[np.triu_indices_from(mask)] = True

    # Set background color / chart style
    sns.set_style(style = 'white')

    # Set up matplotlib figure
    f, ax = plt.subplots(figsize=(11, 9))

    # Add diverging colormap from red to blue
    cmap = sns.diverging_palette(250, 10, as_cmap=True)

    # Draw correlation plot with or without duplicates
    if dropDuplicates:
        sns.heatmap(df, mask=mask, cmap=cmap,
                    square=True,
                    linewidth=.5, cbar_kws={"shrink": .5}, ax=ax)
    else:
        sns.heatmap(df, cmap=cmap,
                    square=True,
                    linewidth=.5, cbar_kws={"shrink": .5}, ax=ax)
df.fillna(-1,inplace=True)
CorrMtx(df, dropDuplicates =False)
The correlation plot is interesting as it gives a lot of insight into the data. For example, we can see that some clustered features are interlinked and highly correlated, while others show negative correlation with a few features.
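To make that observation concrete, the most strongly correlated feature pairs can be listed directly from the correlation matrix (a sketch; it reuses the already-filled df):
# absolute correlations, keeping only the upper triangle to avoid duplicate pairs
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
upper.stack().sort_values(ascending=False).head(10)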
Feature Engineering and Data Preparation¶
FE - Part I: Creating new features¶
# Now we apply some feature engineering from the conclusions drawn from above EDA
df = df_orig.copy()
# Standardize features
def standardize(df):
    numeric = df.select_dtypes(include=['int64', 'float64'])
    # subtract the mean and divide by the standard deviation
    df[numeric.columns] = (numeric - numeric.mean()) / numeric.std()
    return df
#we will use -999 to fill up the missing values as of now
df.fillna(-999,inplace=True)
#Create more features from categorical features
df_dummies = pd.get_dummies(df['intersection_pos_rel_centre'], columns='intersection_pos_rel_centre',
dummy_na=False).add_prefix('c_i_')
df = df.drop('intersection_pos_rel_centre', axis=1)
df = pd.concat([df, df_dummies], axis=1)
df_dummies = pd.get_dummies(df['hand_count_dummy'], columns='hand_count_dummy',
dummy_na=False).add_prefix('c_h_')
df = df.drop('hand_count_dummy', axis=1)
df = pd.concat([df, df_dummies], axis=1)
feat_col = df['final_rotation_angle']
df['rotation_angle_180'] = (feat_col <= 180).astype('int') #we will also include NaN in this column
df['rotation_angle_360'] = (feat_col > 180).astype('int')
df = df.drop('final_rotation_angle', axis=1)
features =df.columns[1:].to_list()
features.remove('diagnosis')
#currently we are not using standardize, but you can enable it by uncommenting the line below
# df = standardize(df)
features
FE - Part II: Dealing with Class Imbalance¶
#Now we will use one of the methods described in https://www.aicrowd.com/showcase/dealing-with-class-imbalance
#and used by https://www.aicrowd.com/showcase/dealing-with-class-imbalance
#check those out, great notebooks
df_final = pd.concat([
df.loc[df.diagnosis == 'pre_alzheimer'],
df.loc[df.diagnosis == 'post_alzheimer'],
df.loc[df.diagnosis == 'normal'].sample(frac=1/6),
]).reset_index().drop('index', axis=1)
train_data = df_final[features]
target_dict = {'normal':0, 'post_alzheimer':1, 'pre_alzheimer':2}
remap_vals = {0:'normal', 1:'post_alzheimer',2:'pre_alzheimer'}
train_labels = df_final['diagnosis'].map(target_dict).astype('int')
train_data.describe()
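A quick check (sketch) that the undersampling above produced a more balanced label distribution:
# class counts after keeping all minority samples and 1/6 of the 'normal' class
df_final['diagnosis'].value_counts()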
We have only used very simple feature engineering so far; in subsequent notebooks (probably a part 2 or a walkthrough blog/video) I'll explain more feature engineering methods and dig deeper into how we can leverage different FE techniques. For now, let's focus on hyper-parameter optimization.
Redundant Code¶
# features = df_orig.columns[1:-1].to_list()
# cont_f = []
# for f in features:
# print(f" {f} is having : {df[f].nunique()}")
# if df[f].nunique() >= 7:
# cont_f.append(f)
# train = df[features]
# train = train.drop(['intersection_pos_rel_centre'],axis = 1)
# train.fillna(-1, inplace=True)
# # train_data = (train_data-train_data.mean())/train_data.std()
# train.describe()
# target_values = list(df_orig['diagnosis'].unique())
# target_col = 'diagnosis'
# df_pos = df_orig[df_orig[target_col].isin(target_values[1:])]
# nb_pos = df_pos.shape[0]
# nb_neg = nb_pos*2
# df_neg = df_orig[df_orig[target_col] == "normal"].sample(n=nb_neg, random_state=42)
# df_samples = pd.concat([df_pos, df_neg]).sample(frac=1).reset_index(drop=True)
# train_data = df_samples[features]
# train_data.drop(['intersection_pos_rel_centre'],axis = 1, inplace=True)
# train_data.fillna(-1, inplace=True)
# # train_data = (train_data-train_data.mean())/train_data.std()
# train_data.describe()
# df_orig['diagnosis'].unique()
# target_dict = {'normal':0, 'post_alzheimer':1, 'pre_alzheimer':2}
# remap_vals = {0:'normal', 1:'post_alzheimer',2:'pre_alzheimer'}
# train_labels = df_samples['diagnosis'].map(target_dict).astype('int')
# train_labels
Train your model¶
Part I: Hyper-parameter Optimization using Optuna¶
#use 10% of the training data for validation while tuning hyper-parameters
X_train, X_test, Y_train, y_test = train_test_split(train_data, train_labels, test_size=0.1, random_state=42)
#For tuning hyper-parameters we use Optuna's default sampler and pruner for simplicity; you can find more info about them
#at https://github.com/optuna/optuna/ [ps: I am one of the contributors, so feel free to ask any queries or give feedback]
import optuna
def objective(trial):
    # hold out 10% of the (balanced) training data to evaluate each trial
    train_x, valid_x, train_y, valid_y = train_test_split(train_data, train_labels, test_size=0.1, random_state=42)

    param = {
        "verbosity": 0,
        "eval_metric": "mlogloss",
        "use_label_encoder": False,
        # L2 regularization weight.
        "reg_lambda": trial.suggest_float("lambda", 1e-8, 1.0, log=True),
        # L1 regularization weight.
        "reg_alpha": trial.suggest_float("alpha", 1e-8, 1.0, log=True),
        # row sampling ratio for the training data.
        "subsample": trial.suggest_float("subsample", 0.2, 1.0),
        # column sampling ratio per tree.
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.2, 1.0),
        "learning_rate": trial.suggest_float("learning_rate", 0.001, 1.0, log=True),
        "max_depth": trial.suggest_int("max_depth", 8, 20),
        "n_estimators": trial.suggest_int("n_estimators", 50, 200),
    }

    model = xgb.XGBClassifier(**param)
    model.fit(train_x, train_y)
    pred_labels = model.predict_proba(valid_x)
    # minimize the multiclass log loss on the held-out split
    return log_loss(valid_y, pred_labels)
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
print("Best trial:")
trial = study.best_trial
print(" Value: {}".format(trial.value))
print(" Params: ")
for key, value in trial.params.items():
print(" {}: {}".format(key, value))
Part II: Using the best parameters to train XGBoost¶
#Task remaining: use stratified K-fold cross-validation for training (a sketch follows the evaluation metrics below).
params = {'lambda': 0.0021482290862969993,
'alpha': 2.4438454633711583e-08,
'subsample': 0.2658469152130181,
'colsample_bytree': 0.26317295728868534,
'learning_rate': 0.0419633326885014,
'max_depth': 18,
'n_estimators': 85}
X_train, X_test, Y_train, y_test = train_test_split(train_data, train_labels, test_size=0.1, random_state=42)
# X_train, X_test, Y_train, y_test = train_test_split(train, y_train, test_size=0.1, random_state=42)
# model = xgb.XGBClassifier(**{'colsample_bylevel': 0.9, 'learning_rate': 0.05, 'max_depth': 20, 'n_estimators': 200,
# 'reg_lambda': 15, 'eval_metric':'mlogloss'
# }).fit(X_train, Y_train,eval_set=[(X_test,y_test)],verbose=True,early_stopping_rounds=10)
model = xgb.XGBClassifier(**params).fit(X_train, Y_train,eval_set=[(X_test,y_test)],verbose=True,early_stopping_rounds=10)
test_y_orig = model.predict_proba(X_test)
print(test_y_orig.shape)
test_y = np.argmax(test_y_orig,axis=1)
print("acc",accuracy_score(y_test, test_y))
print("f1_score",f1_score(y_test,test_y, labels=[0,1,2],average='macro'))
print("logLoss",log_loss(y_test,test_y_orig))
Save your trained model¶
# model.save()
Filename = f'{AICROWD_ASSETS_DIR}/model_xgb_exp_4-2.pkl'
pickle.dump(model, open(Filename, "wb"))
Prediction phase 🔎¶
Please make sure to save the weights from the training section in your assets directory and load them in this section
# model = load_model_from_assets_dir(AIcrowdConfig.ASSETS_DIR)
Filename = f'{AICROWD_ASSETS_DIR}/model_xgb_exp_4-2.pkl'
# load model from file
loaded_model = pickle.load(open(Filename, "rb"))
Load test data¶
test_df = pd.read_csv(AICROWD_DATASET_PATH)
test_df.head()
Generate predictions¶
#Test set Pre processing
test_df.fillna(-999,inplace=True)
#Create more features from categorical features
df_dummies = pd.get_dummies(test_df['intersection_pos_rel_centre'], columns='intersection_pos_rel_centre',
dummy_na=False).add_prefix('c_i_')
test_df = test_df.drop('intersection_pos_rel_centre', axis=1)
test_df = pd.concat([test_df, df_dummies], axis=1)
df_dummies = pd.get_dummies(test_df['hand_count_dummy'], columns='hand_count_dummy',
dummy_na=False).add_prefix('c_h_')
test_df = test_df.drop('hand_count_dummy', axis=1)
test_df = pd.concat([test_df, df_dummies], axis=1)
feat_col = test_df['final_rotation_angle']
test_df['rotation_angle_180'] = (feat_col <= 180).astype('int') #we will also include NaN in this column
test_df['rotation_angle_360'] = (feat_col > 180).astype('int')
test_df = test_df.drop('final_rotation_angle', axis=1)
features =test_df.columns[1:].to_list()
# test_data = (test_data-test_data.mean())/test_data.std()
test_df.describe()
test_data = test_df[features]
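One caveat worth guarding against (a defensive sketch, not part of the original pipeline): pd.get_dummies on the test set can yield a different set of dummy columns than on the training set if some categories are absent. XGBoost stores the training-time column names on its booster when fit on a DataFrame, so the test frame can be reindexed to match before predicting:
# Defensive step (sketch): align test columns with the columns the model was trained on.
expected_cols = loaded_model.get_booster().feature_names
if expected_cols is not None:
    test_data = test_data.reindex(columns=expected_cols, fill_value=0)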
preds = loaded_model.predict_proba(test_data)
# preds
# (preds==0).astype(int)
check_val =False
if check_val:
y_true = pd.read_csv(AICROWD_DATASET_PATH.replace("validation", "validation_ground_truth"))
y_test = y_true['diagnosis'].map(target_dict).values
preds_2 = np.argmax(preds,axis=1)
print("acc",accuracy_score(y_test, preds_2))
print("f1_score",f1_score(y_test,preds_2, labels=[0,1,2],average='macro'))
print("logLoss",log_loss(y_test,preds))
# predictions = {
# "row_id": test_data["row_id"].values,
# "normal_diagnosis_probability": (preds==0).astype(int),
# "post_alzheimer_diagnosis_probability":(preds==1).astype(int),
# "pre_alzheimer_diagnosis_probability": (preds==2).astype(int),
# }
predictions = {
"row_id": test_df["row_id"].values,
"normal_diagnosis_probability": preds[:,0],
"post_alzheimer_diagnosis_probability":preds[:,1],
"pre_alzheimer_diagnosis_probability": preds[:,2],
}
predictions_df = pd.DataFrame.from_dict(predictions)
Save predictions 📨¶
predictions_df.to_csv(AICROWD_PREDICTIONS_PATH, index=False)
Submit to AIcrowd 🚀¶
NOTE: PLEASE SAVE THE NOTEBOOK BEFORE SUBMITTING IT (Ctrl + S)
%env DATASET_PATH=$AICROWD_DATASET_PATH
!aicrowd notebook submit \
    --assets-dir $AICROWD_ASSETS_DIR \
    --challenge addi-alzheimers-detection-challenge
Conclusion:¶
This notebook demonstrates exploratory data analysis, from which we can perform some feature engineering. It then explains the use of Optuna for hyper-parameter optimization of an XGBoost model. My plan is to further elaborate on feature engineering and ensembling in a follow-up discussion/blog walkthrough. I'll also try to set up a solid StratifiedKFold cross-validation, as the current validation feels a bit unsatisfying to me. Stay tuned for the next part, and if you find this notebook useful, please don't forget to hit the like button at the top of the notebook.