Loading

Sentiment Classification

Lightgbm with Cross Validation

A notebook to get started with lightgbm and setting up a proper cross validation scheme

jinoooooooooo

Sentiment Classification Challenge

In this notebook, I will demonstrate the use of lightgbm in modelling and how to set up a proper cross validation scheme. Let's get it 😋

Downloading the dataset ✨

In [1]:
!pip install -qq aicrowd-cli
%load_ext aicrowd.magic
     |████████████████████████████████| 48 kB 5.2 MB/s 
     |████████████████████████████████| 170 kB 42.1 MB/s 
     |████████████████████████████████| 1.1 MB 51.1 MB/s 
     |████████████████████████████████| 54 kB 2.6 MB/s 
     |████████████████████████████████| 63 kB 1.9 MB/s 
     |████████████████████████████████| 214 kB 55.0 MB/s 
     |████████████████████████████████| 63 kB 1.7 MB/s 
     |████████████████████████████████| 51 kB 7.1 MB/s 
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests~=2.23.0, but you have requests 2.27.1 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
In [2]:
%aicrowd login
Please login here: https://api.aicrowd.com/auth/ItfgyVk_BF4oMIO9XGGk6MaZDGZGlH_KrhQ5l_xs4is
API Key valid
Gitlab access token valid
Saved details successfully!
In [3]:
!rm -rf data
!mkdir data
%aicrowd ds dl -c sentiment-classification -o data

Some imports ✔️

In [4]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

import os

from ast import literal_eval

from sklearn.metrics import f1_score, accuracy_score
from sklearn.model_selection import StratifiedKFold

import lightgbm
In [5]:
train = pd.read_csv('./data/train.csv')
val = pd.read_csv('./data/val.csv')
test = pd.read_csv('./data/test.csv')
submission = pd.read_csv('./data/sample_submission.csv')
In [6]:
train = pd.concat([train, val]) # concat the train and validation set, we will be using the k fold method later
train.shape
Out[6]:
(7000, 2)

Emotion mapper 🌀

In [7]:
train.label.unique()
label_mapper = {'negative' : 0, 'neutral' : 1, 'positive' : 2}
train['label'] = train['label'].map(label_mapper) # mapping the emotions to different values
In [8]:
train.reset_index(drop = True, inplace = True)

Setting up a 10 k fold 😵‍💫

In [9]:
folds = train.copy()
Fold = StratifiedKFold(n_splits = 10, shuffle = True, random_state = 42)
for n, (trn_, val_) in enumerate(Fold.split(folds, folds['label'])):
    folds.loc[val_, 'fold'] = int(n)

folds['fold'] = folds['fold'].astype(int)
print(folds.groupby(['fold', 'label']).size()) #now we can see each label is equally distributed among each fold
fold  label
0     0        226
      1        232
      2        242
1     0        227
      1        232
      2        241
2     0        227
      1        232
      2        241
3     0        226
      1        233
      2        241
4     0        226
      1        233
      2        241
5     0        226
      1        233
      2        241
6     0        226
      1        233
      2        241
7     0        226
      1        233
      2        241
8     0        226
      1        233
      2        241
9     0        226
      1        233
      2        241
dtype: int64

LGBM model 💣

In [10]:
final_test_preds = []
for fold in range(10):
    print(f'###################### Fold {fold + 1} Training ############################')
    trn_idx = folds[folds['fold'] != fold].index
    val_idx = folds[folds['fold'] == fold].index

    train_folds = folds.loc[trn_idx].reset_index(drop=True)
    valid_folds = folds.loc[val_idx].reset_index(drop=True)
    
    train_targets = train_folds.label.values
    valid_targets = valid_folds.label.values
    
    train_features = np.array([literal_eval(embedding) for embedding in train_folds['embeddings'].values])
    valid_features = np.array([literal_eval(embedding) for embedding in valid_folds['embeddings'].values])
    
    params = {
        "objective" : "multiclass",
        "metric" : "multi_logloss",
        "num_leaves" : 40,
        "learning_rate" : 0.004,
        "bagging_fraction" : 0.6,
        "feature_fraction" : 0.6,
        "bagging_frequency" : 6,
        "bagging_seed" : 42,
        "verbosity" : -1,
        "seed": 42
    }
    
    model = lightgbm.LGBMClassifier(**params)
    model.fit(train_features, train_targets, eval_set = [(valid_features, valid_targets)], verbose = False)
    valid_preds = model.predict(valid_features)
    
    test_features = np.array([literal_eval(embedding)  for embedding in submission['embeddings'].values])
    test_preds = model.predict(test_features)
    final_test_preds.append(test_preds)
    acc = accuracy_score(valid_targets, valid_preds)
    f1 = f1_score(valid_targets, valid_preds, average ='weighted')
    print('accuracy score : ', acc)
    print('f1 : ', f1)
###################### Fold 1 Training ############################
accuracy score :  0.6685714285714286
f1 :  0.6541209073449378
###################### Fold 2 Training ############################
accuracy score :  0.6842857142857143
f1 :  0.6717204562664548
###################### Fold 3 Training ############################
accuracy score :  0.7171428571428572
f1 :  0.704672286957367
###################### Fold 4 Training ############################
accuracy score :  0.6757142857142857
f1 :  0.6608600711775392
###################### Fold 5 Training ############################
accuracy score :  0.6685714285714286
f1 :  0.6538651494277097
###################### Fold 6 Training ############################
accuracy score :  0.6742857142857143
f1 :  0.656277207448636
###################### Fold 7 Training ############################
accuracy score :  0.65
f1 :  0.6372821314552594
###################### Fold 8 Training ############################
accuracy score :  0.6957142857142857
f1 :  0.6847875333176111
###################### Fold 9 Training ############################
accuracy score :  0.6642857142857143
f1 :  0.6450674104903439
###################### Fold 10 Training ############################
accuracy score :  0.6571428571428571
f1 :  0.6424764812340602

Since we have 10 set of predictions for 10 folds, we can apply a mean and then round it

In [11]:
final_test_preds = np.array(final_test_preds)
preds = np.mean(final_test_preds, axis = 0)
rounded = np.round(preds)
In [12]:
reverse_mapper = {}
for k, v in label_mapper.items():
    reverse_mapper[v] = k

Submission Time ! 🏁

In [13]:
submission['label'] = rounded
submission['label'] = submission['label'].astype('int')
submission['label'] = submission.label.map(reverse_mapper) #mapping the numbers back to emotions
submission.head()
Out[13]:
embeddings label
0 [0.08109518140554428, 0.3090009093284607, 1.36... positive
1 [0.6809610724449158, 1.1909409761428833, 0.892... positive
2 [0.14851869642734528, 0.7872061133384705, 0.89... neutral
3 [0.44697386026382446, 0.36429283022880554, 0.7... neutral
4 [1.8009324073791504, 0.26081395149230957, 0.40... negative
In [14]:
!rm -rf assets
!mkdir assets
submission.to_csv(os.path.join("assets", "submission.csv"))
In [15]:
%aicrowd notebook submit -c sentiment-classification -a assets --no-verify
Using notebook: sentiment-classification (1).ipynb for submission...
Scrubbing API keys from the notebook...
Collecting notebook...


                                                       ╭─────────────────────────╮                                                       
                                                       │ Successfully submitted! │                                                       
                                                       ╰─────────────────────────╯                                                       
                                                             Important links                                                             
┌──────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│  This submission │ https://www.aicrowd.com/challenges/ai-blitz-xiii/problems/sentiment-classification/submissions/172895              │
│                  │                                                                                                                    │
│  All submissions │ https://www.aicrowd.com/challenges/ai-blitz-xiii/problems/sentiment-classification/submissions?my_submissions=true │
│                  │                                                                                                                    │
│      Leaderboard │ https://www.aicrowd.com/challenges/ai-blitz-xiii/problems/sentiment-classification/leaderboards                    │
│                  │                                                                                                                    │
│ Discussion forum │ https://discourse.aicrowd.com/c/ai-blitz-xiii                                                                      │
│                  │                                                                                                                    │
│   Challenge page │ https://www.aicrowd.com/challenges/ai-blitz-xiii/problems/sentiment-classification                                 │
└──────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Thanks for sticking till the end, feel free to comment down for queries and help, I'll be responding to all of them


Comments

vad13irt
Almost 3 years ago

Thank for sharing 😄!

You must login before you can post a comment.

Execute