Sentiment Classification
LightGBM with Cross Validation
A notebook to get started with LightGBM and set up a proper cross-validation scheme
Downloading the dataset ✨
In [1]:
!pip install -qq aicrowd-cli
%load_ext aicrowd.magic
In [2]:
%aicrowd login
In [3]:
!rm -rf data
!mkdir data
%aicrowd ds dl -c sentiment-classification -o data
Some imports ✔️
In [4]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import os
from ast import literal_eval
from sklearn.metrics import f1_score, accuracy_score
from sklearn.model_selection import StratifiedKFold
import lightgbm
In [5]:
train = pd.read_csv('./data/train.csv')
val = pd.read_csv('./data/val.csv')
test = pd.read_csv('./data/test.csv')
submission = pd.read_csv('./data/sample_submission.csv')
In [6]:
train = pd.concat([train, val]) # merge the train and validation sets; the k-fold scheme below creates its own validation splits
train.shape
Emotion mapper 🌀
In [7]:
train.label.unique()
label_mapper = {'negative' : 0, 'neutral' : 1, 'positive' : 2}
train['label'] = train['label'].map(label_mapper) # mapping the emotions to different values
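One thing worth knowing about `Series.map` with a dictionary is that any value missing from the dictionary silently becomes `NaN`. The sketch below uses a small toy series (not the competition data, and with a deliberately misspelled label) to show one way of checking for unmapped labels before training.

```python
import pandas as pd

# Toy labels standing in for train['label']; 'postive' is a deliberate typo
labels = pd.Series(['negative', 'neutral', 'positive', 'postive'])
label_mapper = {'negative': 0, 'neutral': 1, 'positive': 2}

# Values that are not keys of the dictionary are mapped to NaN
mapped = labels.map(label_mapper)

# Collect the original strings that failed to map
unmapped = labels[mapped.isna()].unique().tolist()
print('unmapped labels:', unmapped)
```

If the list comes back empty, every label was covered by the mapper.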
In [8]:
train.reset_index(drop = True, inplace = True)
Setting up 10-fold stratified cross validation 😵💫
In [9]:
folds = train.copy()
Fold = StratifiedKFold(n_splits = 10, shuffle = True, random_state = 42)
for n, (trn_, val_) in enumerate(Fold.split(folds, folds['label'])):
    folds.loc[val_, 'fold'] = int(n)
folds['fold'] = folds['fold'].astype(int)
print(folds.groupby(['fold', 'label']).size()) # each fold keeps roughly the same label proportions as the full dataset
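To make the effect of stratification easier to see, here is a minimal sketch on a toy, imbalanced label column (hypothetical data, 5 folds instead of 10). It follows the same pattern as the cell above and prints the per-fold label counts as a table.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels: 60 x class 0, 30 x class 1, 10 x class 2
toy = pd.DataFrame({'label': np.repeat([0, 1, 2], [60, 30, 10])})

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
toy['fold'] = -1
for n, (_, val_idx) in enumerate(skf.split(toy, toy['label'])):
    # val_idx holds the row positions assigned to validation fold n
    toy.loc[val_idx, 'fold'] = n

# Rows are folds, columns are labels, cells are counts
counts = toy.groupby(['fold', 'label']).size().unstack()
print(counts)
```

Each fold ends up with the same 6:3:1 class ratio as the full toy set, which is what keeps the per-fold validation scores comparable.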
LGBM model 💣
In [10]:
final_test_preds = []
for fold in range(10):
    print(f'###################### Fold {fold + 1} Training ############################')
    trn_idx = folds[folds['fold'] != fold].index
    val_idx = folds[folds['fold'] == fold].index
    train_folds = folds.loc[trn_idx].reset_index(drop=True)
    valid_folds = folds.loc[val_idx].reset_index(drop=True)
    train_targets = train_folds.label.values
    valid_targets = valid_folds.label.values
    # the embeddings are stored as stringified lists, so parse them into arrays
    train_features = np.array([literal_eval(embedding) for embedding in train_folds['embeddings'].values])
    valid_features = np.array([literal_eval(embedding) for embedding in valid_folds['embeddings'].values])
    params = {
        "objective" : "multiclass",
        "metric" : "multi_logloss",
        "num_leaves" : 40,
        "learning_rate" : 0.004,
        "bagging_fraction" : 0.6,
        "feature_fraction" : 0.6,
        "bagging_freq" : 6, # LightGBM's name for this parameter is bagging_freq
        "bagging_seed" : 42,
        "verbosity" : -1,
        "seed": 42
    }
    model = lightgbm.LGBMClassifier(**params)
    # note: fit() no longer accepts `verbose` from LightGBM 4.0 onward; drop the argument there
    model.fit(train_features, train_targets, eval_set = [(valid_features, valid_targets)], verbose = False)
    valid_preds = model.predict(valid_features)
    test_features = np.array([literal_eval(embedding) for embedding in submission['embeddings'].values])
    test_preds = model.predict(test_features)
    final_test_preds.append(test_preds)
    acc = accuracy_score(valid_targets, valid_preds)
    f1 = f1_score(valid_targets, valid_preds, average ='weighted')
    print('accuracy score : ', acc)
    print('f1 : ', f1)
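The loop above re-parses the embedding strings with `literal_eval` on every fold, which repeats the same work ten times. A possible alternative, sketched below on a toy frame, is to parse the column once and slice the resulting array per fold. The helper name `parse_embeddings` and the toy data are illustrative assumptions, not part of the original notebook.

```python
from ast import literal_eval

import numpy as np
import pandas as pd


def parse_embeddings(column):
    """Turn a column of stringified lists into a 2-D float array."""
    return np.array([literal_eval(s) for s in column], dtype=float)


# Toy frame standing in for `folds` (hypothetical values)
toy = pd.DataFrame({
    'embeddings': ['[0.1, 0.2]', '[0.3, 0.4]', '[0.5, 0.6]', '[0.7, 0.8]'],
    'fold': [0, 1, 0, 1],
})

# Parse once, outside the fold loop
all_features = parse_embeddings(toy['embeddings'])

for fold in range(2):
    # Boolean mask selecting the validation rows of this fold
    val_mask = (toy['fold'] == fold).to_numpy()
    train_features = all_features[~val_mask]
    valid_features = all_features[val_mask]
    print(fold, train_features.shape, valid_features.shape)
```

The same idea would apply to the test embeddings, which do not change between folds either.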
Since we have 10 sets of predictions, one per fold, we can take their mean and round it to the nearest label
In [11]:
final_test_preds = np.array(final_test_preds)
preds = np.mean(final_test_preds, axis = 0)
rounded = np.round(preds)
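Rounding the mean of predicted labels treats the classes as points on a number line, so a disagreement between negative (0) and positive (2) can land on neutral (1) even when no fold predicted it. A common alternative, offered here only as a sketch and not as the notebook's method, is to average class probabilities (for example from `model.predict_proba`) and take the argmax. The probabilities below are made up for illustration.

```python
import numpy as np

# Hypothetical per-fold probabilities for ONE test row: 3 folds x 1 row x 3 classes
fold_probas = np.array([
    [[0.6, 0.1, 0.3]],   # fold 1 votes negative (0)
    [[0.2, 0.1, 0.7]],   # fold 2 votes positive (2)
    [[0.3, 0.1, 0.6]],   # fold 3 votes positive (2)
])

# Label averaging: mean of the hard labels 0, 2, 2, then rounding
fold_labels = fold_probas.argmax(axis=2)
label_avg = np.round(fold_labels.mean(axis=0)).astype(int)

# Probability averaging: mean over folds, then the most likely class
proba_avg = fold_probas.mean(axis=0).argmax(axis=1)

print('label averaging      :', label_avg)
print('probability averaging:', proba_avg)
```

In this toy case label averaging picks neutral, which no fold voted for, while probability averaging picks positive.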
In [12]:
reverse_mapper = {}
for k, v in label_mapper.items():
    reverse_mapper[v] = k
Submission time! 🏁
In [13]:
submission['label'] = rounded
submission['label'] = submission['label'].astype('int')
submission['label'] = submission.label.map(reverse_mapper) #mapping the numbers back to emotions
submission.head()
In [14]:
!rm -rf assets
!mkdir assets
submission.to_csv(os.path.join("assets", "submission.csv"))
In [15]:
%aicrowd notebook submit -c sentiment-classification -a assets --no-verify
Thanks for sticking around till the end. Feel free to leave a comment with any questions; I'll respond to all of them.