Sentiment Classification: SVM/LGBM/CatBoost/XGBC classifier

A notebook that tries four classifiers on this challenge: SVM, LightGBM, CatBoost, and XGBoost (from a high-schooler in Vietnam ^^)

trancongthinh

Set up

In [ ]:
!pip install aicrowd-cli
%load_ext aicrowd.magic
Requirement already satisfied: aicrowd-cli in /usr/local/lib/python3.7/dist-packages (0.1.13)
Requirement already satisfied: requests<3,>=2.25.1 in /usr/local/lib/python3.7/dist-packages (from aicrowd-cli) (2.27.1)
Requirement already satisfied: click<8,>=7.1.2 in /usr/local/lib/python3.7/dist-packages (from aicrowd-cli) (7.1.2)
Requirement already satisfied: rich<11,>=10.0.0 in /usr/local/lib/python3.7/dist-packages (from aicrowd-cli) (10.16.2)
Requirement already satisfied: requests-toolbelt<1,>=0.9.1 in /usr/local/lib/python3.7/dist-packages (from aicrowd-cli) (0.9.1)
Requirement already satisfied: toml<1,>=0.10.2 in /usr/local/lib/python3.7/dist-packages (from aicrowd-cli) (0.10.2)
Requirement already satisfied: python-slugify<6,>=5.0.0 in /usr/local/lib/python3.7/dist-packages (from aicrowd-cli) (5.0.2)
Requirement already satisfied: pyzmq==22.1.0 in /usr/local/lib/python3.7/dist-packages (from aicrowd-cli) (22.1.0)
Requirement already satisfied: GitPython==3.1.18 in /usr/local/lib/python3.7/dist-packages (from aicrowd-cli) (3.1.18)
Requirement already satisfied: tqdm<5,>=4.56.0 in /usr/local/lib/python3.7/dist-packages (from aicrowd-cli) (4.62.3)
Requirement already satisfied: typing-extensions>=3.7.4.0 in /usr/local/lib/python3.7/dist-packages (from GitPython==3.1.18->aicrowd-cli) (3.10.0.2)
Requirement already satisfied: gitdb<5,>=4.0.1 in /usr/local/lib/python3.7/dist-packages (from GitPython==3.1.18->aicrowd-cli) (4.0.9)
Requirement already satisfied: smmap<6,>=3.0.1 in /usr/local/lib/python3.7/dist-packages (from gitdb<5,>=4.0.1->GitPython==3.1.18->aicrowd-cli) (5.0.0)
Requirement already satisfied: text-unidecode>=1.3 in /usr/local/lib/python3.7/dist-packages (from python-slugify<6,>=5.0.0->aicrowd-cli) (1.3)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (2.0.11)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (2021.10.8)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (2.10)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (1.24.3)
Requirement already satisfied: pygments<3.0.0,>=2.6.0 in /usr/local/lib/python3.7/dist-packages (from rich<11,>=10.0.0->aicrowd-cli) (2.6.1)
Requirement already satisfied: commonmark<0.10.0,>=0.9.0 in /usr/local/lib/python3.7/dist-packages (from rich<11,>=10.0.0->aicrowd-cli) (0.9.1)
Requirement already satisfied: colorama<0.5.0,>=0.4.0 in /usr/local/lib/python3.7/dist-packages (from rich<11,>=10.0.0->aicrowd-cli) (0.4.4)
In [ ]:
%aicrowd login
Please login here: https://api.aicrowd.com/auth/8jGmSji8ElCTF4rQwV8-700DJfrvbOkjzfkKRfSdUPI
API Key valid
Gitlab access token valid
Saved details successfully!
In [ ]:
!rm -rf data
!mkdir data
%aicrowd ds dl -c sentiment-classification -o data

Process the data

In [ ]:
import pandas as pd
import numpy as np
import os
import tensorflow as tf
import ast
import time
In [ ]:
train_df = pd.read_csv("data/train.csv")
val_df = pd.read_csv("data/val.csv")
In [ ]:
test_df = pd.read_csv("data/test.csv")
In [ ]:
train_df.head()
Out[ ]:
embeddings label
0 [0.3206779360771179, 0.988215982913971, 1.0441... positive
1 [0.05074610561132431, 1.0742985010147095, 0.60... negative
2 [0.41962647438049316, 0.4505457878112793, 1.39... negative
3 [0.4361684024333954, 0.19191382825374603, 0.83... positive
4 [0.6382085084915161, 0.8352395296096802, 0.393... neutral
In [ ]:
x_train = []
y_train = []
x_val = []
y_val = []
label_dict = {'positive': 1, 'negative': 0, 'neutral': 2}

for i in range(len(train_df)):
  x_train.append(ast.literal_eval(train_df.embeddings[i]))
  y_train.append(label_dict[train_df.label[i]])

for i in range(len(val_df)):
  x_val.append(ast.literal_eval(val_df.embeddings[i]))
  y_val.append(label_dict[val_df.label[i]])

# One-hot encoding is left disabled: the classifiers below take integer class labels.
# y_train = tf.keras.utils.to_categorical(y_train, num_classes = 3)
# y_val = tf.keras.utils.to_categorical(y_val, num_classes = 3)
x_train = np.array(x_train)
x_val = np.array(x_val)
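
The same parsing can be wrapped in one function so that the train, validation, and test splits all go through identical steps. This is a minimal sketch, not the notebook's own code: `parse_split` and `LABEL_DICT` are hypothetical names, and the two demo rows are made-up values that only show the shapes involved.

```python
import ast

import numpy as np

# Same mapping as label_dict in the cell above.
LABEL_DICT = {'positive': 1, 'negative': 0, 'neutral': 2}


def parse_split(embedding_strings, label_strings=None):
    """Convert stringified embedding lists (and optional labels) to arrays.

    Hypothetical helper: each embedding is stored in the CSV as text such as
    "[0.32, 0.98, ...]", so it is parsed with ast.literal_eval.
    """
    x = np.array([ast.literal_eval(s) for s in embedding_strings],
                 dtype=np.float32)
    if label_strings is None:
        # The test split has no labels.
        return x, None
    y = np.array([LABEL_DICT[name] for name in label_strings], dtype=np.int64)
    return x, y


# Tiny made-up example: two 3-dimensional embeddings.
x_demo, y_demo = parse_split(["[0.1, 0.2, 0.3]", "[1.0, 0.5, 0.25]"],
                             ["positive", "neutral"])
print(x_demo.shape, y_demo.tolist())
```

In the notebook this would presumably be called as `parse_split(train_df.embeddings, train_df.label)`.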

SVM

In [ ]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import f1_score
In [ ]:
C_values = [0.01, 0.1, 1, 10, 100]
gamma_values = [0.01, 0.1, 1, 10, 100]

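# gamma is searched only for the rbf kernel here. The linear kernel has no gamma;
# poly does accept one in scikit-learn, but it is left at its default in this search.
# So we define two search spaces.
# With the rbf kernel, search over gamma and C
rbf_search = {'kernel': ['rbf'], 'gamma': gamma_values, 'C': C_values}
# With the linear and poly kernels, search over C only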
linear_poly_search = {'kernel': ['linear','poly'], 'C': C_values}

# List of the two search spaces
param_grid = [rbf_search, linear_poly_search]
# Fix random_state so the runs are reproducible
model = SVC(random_state = 42)
# Grid search with 3-fold cross-validation
grid = GridSearchCV(model, param_grid, cv = 3, verbose = 1)
# Fit every candidate on the training data
grid.fit(x_train, y_train)
Fitting 3 folds for each of 35 candidates, totalling 105 fits
Out[ ]:
GridSearchCV(cv=3, estimator=SVC(random_state=42),
             param_grid=[{'C': [0.01, 0.1, 1, 10, 100],
                          'gamma': [0.01, 0.1, 1, 10, 100], 'kernel': ['rbf']},
                         {'C': [0.01, 0.1, 1, 10, 100],
                          'kernel': ['linear', 'poly']}],
             verbose=1)
In [ ]:
# Print the best parameters found by the grid search
print(grid.best_params_)

# Keep the refitted best estimator
best_model = grid.best_estimator_
{'C': 1, 'kernel': 'poly'}
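
The notebook imports `f1_score` and loads a validation split, but the SVM cells never score the tuned model on it. Below is a minimal sketch of that check. It runs on synthetic data from `make_classification` so that it is self-contained; in the notebook one would presumably call `best_model.predict(x_val)` and compare against `y_val` instead. Macro averaging for F1 is an assumption, not the challenge's confirmed metric.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic 3-class stand-in for the embedding data (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25,
                                          random_state=42, stratify=y)

# Same kernel and C that the grid search above reported as best.
svm = SVC(kernel='poly', C=1, random_state=42).fit(X_tr, y_tr)

# Score on the held-out split rather than on the training data.
pred = svm.predict(X_va)
acc = accuracy_score(y_va, pred)
macro_f1 = f1_score(y_va, pred, average='macro')
print(f'accuracy: {acc:.3f}  macro F1: {macro_f1:.3f}')
```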
In [ ]:
# Finer grid search over C, restricted to the poly kernel
C_values = list(np.linspace(0.1, 10, 100))
poly_search = {'kernel': ['poly'], 'C': C_values}

model = SVC(random_state = 42)
grid = GridSearchCV(model, poly_search, cv = 3, verbose = 1)
grid.fit(x_train, y_train)
Fitting 3 folds for each of 100 candidates, totalling 300 fits
Out[ ]:
GridSearchCV(cv=3, estimator=SVC(random_state=42),
             param_grid={'C': [0.1, 0.2, 0.30000000000000004, 0.4, 0.5, 0.6,
                               0.7000000000000001, 0.8, 0.9, 1.0, 1.1,
                               1.2000000000000002, 1.3000000000000003,
                               1.4000000000000001, 1.5000000000000002, 1.6,
                               1.7000000000000002, 1.8000000000000003,
                               1.9000000000000001, 2.0, 2.1, 2.2,
                               2.3000000000000003, 2.4000000000000004,
                               2.5000000000000004, 2.6, 2.7, 2.8000000000000003,
                               2.9000000000000004, 3.0000000000000004, ...],
                         'kernel': ['poly']},
             verbose=1)
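
The cell above does not print the result of the refined search or reassign `best_model`, so its winner is only reachable through `grid`. The following sketch shows how such a search could be inspected and its best estimator kept. It uses synthetic data and a short, log-spaced C grid to stay fast; both are assumptions for illustration, and `refined_model` is a hypothetical name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic 3-class stand-in for the embedding data (assumption).
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# Five log-spaced C values between 0.1 and 10 instead of 100 linear steps.
poly_search = {'kernel': ['poly'], 'C': list(np.logspace(-1, 1, 5))}
search = GridSearchCV(SVC(random_state=42), poly_search, cv=3)
search.fit(X, y)

# best_estimator_ is refitted on all of X, y with the winning parameters.
refined_model = search.best_estimator_
mean_scores = search.cv_results_['mean_test_score']

print(search.best_params_)
for c_value, score in zip(poly_search['C'], mean_scores):
    print(f'C={c_value:.3f}  mean CV accuracy={score:.3f}')
```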

LGBM/CatBoost/XGBC classifier

In [ ]:
!pip install catboost
Collecting catboost
  Downloading catboost-1.0.4-cp37-none-manylinux1_x86_64.whl (76.1 MB)
     |████████████████████████████████| 76.1 MB 56.2 MB/s 
Requirement already satisfied: plotly in /usr/local/lib/python3.7/dist-packages (from catboost) (5.5.0)
Requirement already satisfied: graphviz in /usr/local/lib/python3.7/dist-packages (from catboost) (0.10.1)
Requirement already satisfied: pandas>=0.24.0 in /usr/local/lib/python3.7/dist-packages (from catboost) (1.3.5)
Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from catboost) (1.15.0)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.7/dist-packages (from catboost) (3.2.2)
Requirement already satisfied: numpy>=1.16.0 in /usr/local/lib/python3.7/dist-packages (from catboost) (1.19.5)
Requirement already satisfied: scipy in /usr/local/lib/python3.7/dist-packages (from catboost) (1.4.1)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.24.0->catboost) (2018.9)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.24.0->catboost) (2.8.2)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->catboost) (1.3.2)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages (from matplotlib->catboost) (0.11.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->catboost) (3.0.7)
Requirement already satisfied: tenacity>=6.2.0 in /usr/local/lib/python3.7/dist-packages (from plotly->catboost) (8.0.1)
Installing collected packages: catboost
Successfully installed catboost-1.0.4
In [ ]:
from sklearn.metrics import accuracy_score
In [ ]:
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
In [ ]:
lgb_params = {
    'objective' : 'multiclass',
    'metric' : 'multi_logloss',
    'device' : 'cpu',
}

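start_time = time.time()
model = LGBMClassifier(**lgb_params)
# fit(verbose=...) worked with the LightGBM version used here; newer releases may no longer accept it
model.fit(x_train, y_train, verbose = 0)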

val_pred = model.predict(x_val)
acc = accuracy_score(y_val, val_pred)
run_time = time.time() - start_time

print(acc)
print(f'Run time: {run_time:.2f}s')
0.726
Run time: 50.63s
In [ ]:
best_model = model
In [ ]:
catb_params = {
    "objective": "MultiClass",
    "task_type": "CPU",
}

start_time = time.time()
model = CatBoostClassifier(**catb_params)
model.fit(x_train, y_train, verbose = 0)

val_pred = model.predict(x_val)
acc = accuracy_score(y_val, val_pred)
run_time = time.time() - start_time

print(acc)
print(f'Run time: {run_time:.2f}s')
0.754
Run time: 344.24s
In [ ]:
best_model = model
In [ ]:
xgb_params = {
    'objective': 'multi:softmax',
    'eval_metric': 'mlogloss',
    'predictor': 'cpu_predictor'}

start_time = time.time()
model = XGBClassifier(**xgb_params)
model.fit(x_train, y_train, verbose = 0)

val_pred = model.predict(x_val)
acc = accuracy_score(y_val, val_pred)
run_time = time.time() - start_time

print(acc)
print(f"Run time: {run_time:.2f}s")
0.7015
Run time: 49.77s
In [ ]:
best_model = model
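
Each cell above overwrites `best_model`, so the model that reaches the Submit section is whichever cell ran last, not necessarily the most accurate one. Here is a small sketch of choosing by validation accuracy instead, using the three accuracies printed above. `val_scores` and `best_name` are hypothetical names, and the SVM is left out because the notebook does not report a validation accuracy for it.

```python
# Validation accuracies as printed by the three cells above.
val_scores = {'LGBM': 0.726, 'CatBoost': 0.754, 'XGBoost': 0.7015}

# Pick the entry with the highest validation accuracy.
best_name = max(val_scores, key=val_scores.get)
print(best_name, val_scores[best_name])
```

With the fitted models stored in a dict under the same keys, say a hypothetical `fitted_models`, the choice could be carried forward with `best_model = fitted_models[best_name]`.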

Submit

In [ ]:
x_test = []
for i in range(len(test_df)):
  x_test.append(ast.literal_eval(test_df.embeddings[i]))
In [ ]:
labels = dict((v,k) for k,v in label_dict.items())
pred = best_model.predict(x_test)
print(pred[0:10])
[1 2 2 2 0 1 1 1 0 1]
In [ ]:
results = []
for i in pred:
  results.append(labels[i])
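
The loop above assumes `pred` is a flat array of integer class ids. A slightly more defensive sketch follows. It also handles a column-shaped prediction array, which some classifiers may return for multiclass problems. `decode_labels` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

# Same mapping as in the notebook, and its inverse.
label_dict = {'positive': 1, 'negative': 0, 'neutral': 2}
labels = {v: k for k, v in label_dict.items()}


def decode_labels(pred):
    """Map integer class ids back to label names.

    np.ravel flattens both shape (n,) and shape (n, 1) inputs, and int()
    turns NumPy scalars into plain dict keys.
    """
    return [labels[int(i)] for i in np.ravel(pred)]


flat = decode_labels(np.array([1, 2, 2, 2, 0]))
column = decode_labels(np.array([[1], [0]]))
print(flat)
print(column)
```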
In [ ]:
test_df['label'] = results
test_df
Out[ ]:
embeddings label
0 [0.08109518140554428, 0.3090009093284607, 1.36... positive
1 [0.6809610724449158, 1.1909409761428833, 0.892... neutral
2 [0.14851869642734528, 0.7872061133384705, 0.89... neutral
3 [0.44697386026382446, 0.36429283022880554, 0.7... neutral
4 [1.8009324073791504, 0.26081395149230957, 0.40... negative
... ... ...
2996 [0.9138844609260559, 0.9460961222648621, 0.571... negative
2997 [0.7667452096939087, 0.7896291613578796, 0.648... negative
2998 [0.8158280849456787, 2.404792070388794, 0.9924... neutral
2999 [0.4161085784435272, 0.3146701455116272, 1.139... positive
3000 [0.7037264108657837, 0.6421875357627869, 1.215... negative

3001 rows × 2 columns

In [ ]:
!rm -rf assets
!mkdir assets
test_df.to_csv(os.path.join("assets", "submission.csv"))
In [ ]:
%aicrowd notebook submit -c sentiment-classification -a assets --no-verify
Using notebook: Sentiment-Classification.ipynb for submission...
Scrubbing API keys from the notebook...
Collecting notebook...


                                                       ╭─────────────────────────╮                                                       
                                                       │ Successfully submitted! │                                                       
                                                       ╰─────────────────────────╯                                                       
                                                             Important links                                                             
┌──────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│  This submission │ https://www.aicrowd.com/challenges/ai-blitz-xiii/problems/sentiment-classification/submissions/173128              │
│                  │                                                                                                                    │
│  All submissions │ https://www.aicrowd.com/challenges/ai-blitz-xiii/problems/sentiment-classification/submissions?my_submissions=true │
│                  │                                                                                                                    │
│      Leaderboard │ https://www.aicrowd.com/challenges/ai-blitz-xiii/problems/sentiment-classification/leaderboards                    │
│                  │                                                                                                                    │
│ Discussion forum │ https://discourse.aicrowd.com/c/ai-blitz-xiii                                                                      │
│                  │                                                                                                                    │
│   Challenge page │ https://www.aicrowd.com/challenges/ai-blitz-xiii/problems/sentiment-classification                                 │
└──────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
