
[Getting Started Notebook] AGREE Challenge

This is baseline code to get you started with the challenge.

gauransh_k

You can use this code to start understanding the data and to create a baseline model for further improvements.

Starter Code for AGREE Practice Challenge

Note: Create a copy of the notebook and use the copy for submission. Go to File > Save a Copy in Drive to create a new copy.

Downloading Dataset

Installing aicrowd-cli

In [1]:
!pip install aicrowd-cli
%load_ext aicrowd.magic
Requirement already satisfied: aicrowd-cli in /home/gauransh/anaconda3/lib/python3.8/site-packages (0.1.10)
Requirement already satisfied: rich<11,>=10.0.0 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from aicrowd-cli) (10.15.2)
Requirement already satisfied: requests<3,>=2.25.1 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from aicrowd-cli) (2.26.0)
Requirement already satisfied: requests-toolbelt<1,>=0.9.1 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from aicrowd-cli) (0.9.1)
Requirement already satisfied: GitPython==3.1.18 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from aicrowd-cli) (3.1.18)
Requirement already satisfied: toml<1,>=0.10.2 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from aicrowd-cli) (0.10.2)
Requirement already satisfied: click<8,>=7.1.2 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from aicrowd-cli) (7.1.2)
Requirement already satisfied: pyzmq==22.1.0 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from aicrowd-cli) (22.1.0)
Requirement already satisfied: tqdm<5,>=4.56.0 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from aicrowd-cli) (4.62.2)
Requirement already satisfied: gitdb<5,>=4.0.1 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from GitPython==3.1.18->aicrowd-cli) (4.0.9)
Requirement already satisfied: smmap<6,>=3.0.1 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from gitdb<5,>=4.0.1->GitPython==3.1.18->aicrowd-cli) (5.0.0)
Requirement already satisfied: charset-normalizer~=2.0.0 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from requests<3,>=2.25.1->aicrowd-cli) (2.0.0)
Requirement already satisfied: idna<4,>=2.5 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from requests<3,>=2.25.1->aicrowd-cli) (3.1)
Requirement already satisfied: certifi>=2017.4.17 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from requests<3,>=2.25.1->aicrowd-cli) (2021.5.30)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from requests<3,>=2.25.1->aicrowd-cli) (1.26.6)
Requirement already satisfied: colorama<0.5.0,>=0.4.0 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from rich<11,>=10.0.0->aicrowd-cli) (0.4.4)
Requirement already satisfied: commonmark<0.10.0,>=0.9.0 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from rich<11,>=10.0.0->aicrowd-cli) (0.9.1)
Requirement already satisfied: pygments<3.0.0,>=2.6.0 in /home/gauransh/anaconda3/lib/python3.8/site-packages (from rich<11,>=10.0.0->aicrowd-cli) (2.10.0)
In [2]:
%aicrowd login
Please login here: https://api.aicrowd.com/auth/IqLVXeP3B-1vtTLZ4BPTMoMuBiZL5CDXYLablaSoIkI
Opening in existing browser session.
API Key valid
Saved API Key successfully!
In [2]:
!rm -rf data
!mkdir data
%aicrowd ds dl -c agree -o data
In [3]:
# removing extra blank line at the end from the dataset
!sed -i '$ d' data/train.csv
!sed -i '$ d' data/test.csv

Importing Libraries

In this baseline, we will use the scikit-learn (sklearn) library to train the model and generate the predictions.

In [4]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from scipy.sparse import hstack
import os
from IPython.display import display

Reading the dataset

Here, we will read the train.csv which contains both training samples & labels, and test.csv which contains testing samples.

In [5]:
# Reading the CSVs; the files have no header row, so the column names are passed explicitly
name = ["unit_id", "golden_or_not", "unit_state", "trusted_judgments", "last_judgment_at", "agree_or_not_variance", "sentence", "agree_or_not"]
train_data_df = pd.read_csv("data/train.csv", names=name, encoding='ISO-8859-1')
test_data_df = pd.read_csv("data/test.csv", names=name[:-1], encoding='ISO-8859-1')

# Preview the first rows of each frame and print their shapes
display(train_data_df.head())
display(test_data_df.head())
print(train_data_df.shape, test_data_df.shape)
   unit_id    golden_or_not  unit_state  trusted_judgments  last_judgment_at  agree_or_not_variance  sentence                                           agree_or_not
0  703393947  False          finalized   3                  4/12/15 17:24     0.471                  captain can be used with the same meaning of m...  4.67
1  703395558  False          finalized   3                  4/12/15 16:04     0.471                  action can be used with the same meaning of en...  3.67
2  703398442  False          finalized   3                  4/14/15 12:58     0.471                  climb can be used as the opposite of go_down       4.67
3  703395452  False          finalized   3                  4/12/15 10:45     0.000                  baby is a kind of person                           5.00
4  703401266  False          finalized   3                  4/14/15 6:27      0.471                  garden can be used with the same meaning of yard   4.67

   unit_id    golden_or_not  unit_state  trusted_judgments  last_judgment_at  agree_or_not_variance  sentence
0  703393947  False          finalized   3                  4/12/15 17:24     0.471                  captain can be used with the same meaning of m...
1  703395484  False          finalized   3                  4/12/15 11:40     0.943                  if lose is true
2  703401918  False          finalized   3                  4/14/15 5:16      0.471                  design can be used as the opposite of destroy
3  703395476  False          finalized   3                  4/12/15 13:50     1.247                  program is a kind of performance
4  703397489  False          finalized   3                  4/14/15 7:18      0.471                  sign can be used with the same meaning of mark
(6644, 8) (1662, 7)

Data Preprocessing

The dataset mixes categorical and textual columns. We first one-hot encode the categorical features (golden_or_not and unit_state), then convert each sentence into TF-IDF features that the regression model can use.

In [6]:
# utility function to one hot encode the dataset
def one_hot_df(df):
    df = pd.concat([df, pd.get_dummies(df["golden_or_not"])],axis=1)
    df.drop("golden_or_not",axis=1, inplace=True)
    df = pd.concat([df, pd.get_dummies(df["unit_state"])],axis=1)
    df.drop("unit_state",axis=1, inplace=True)
    return df
In [7]:
train_data_df = one_hot_df(train_data_df)
test_data_df = one_hot_df(test_data_df)
display(train_data_df)
display(test_data_df)
      unit_id    trusted_judgments  last_judgment_at  agree_or_not_variance  sentence                                           agree_or_not  False  True  finalized  golden
0     703393947  3                  4/12/15 17:24     0.471                  captain can be used with the same meaning of m...  4.67          1      0     1          0
1     703395558  3                  4/12/15 16:04     0.471                  action can be used with the same meaning of en...  3.67          1      0     1          0
2     703398442  3                  4/14/15 12:58     0.471                  climb can be used as the opposite of go_down       4.67          1      0     1          0
3     703395452  3                  4/12/15 10:45     0.000                  baby is a kind of person                           5.00          1      0     1          0
4     703401266  3                  4/14/15 6:27      0.471                  garden can be used with the same meaning of yard   4.67          1      0     1          0
...   ...        ...                ...               ...                    ...                                                ...           ...    ...   ...        ...
6639  703401541  3                  4/14/15 6:47      0.000                  venus is part of person                            1.00          1      0     1          0
6640  703401351  3                  4/12/15 21:14     0.000                  hate can be used as the opposite of love           5.00          1      0     1          0
6641  703400471  3                  4/13/15 16:10     0.000                  slice is a kind of business                        1.00          1      0     1          0
6642  703399213  3                  4/12/15 21:19     1.886                  fence can be used with the same meaning of pale    2.33          1      0     1          0
6643  703397433  3                  4/14/15 6:54      0.471                  time is a kind of experience                       4.33          1      0     1          0

6644 rows × 10 columns

      unit_id    trusted_judgments  last_judgment_at  agree_or_not_variance  sentence                                           False  True  finalized  golden
0     703393947  3                  4/12/15 17:24     0.471                  captain can be used with the same meaning of m...  1      0     1          0
1     703395484  3                  4/12/15 11:40     0.943                  if lose is true                                    1      0     1          0
2     703401918  3                  4/14/15 5:16      0.471                  design can be used as the opposite of destroy      1      0     1          0
3     703395476  3                  4/12/15 13:50     1.247                  program is a kind of performance                   1      0     1          0
4     703397489  3                  4/14/15 7:18      0.471                  sign can be used with the same meaning of mark     1      0     1          0
...   ...        ...                ...               ...                    ...                                                ...    ...   ...        ...
1657  703396215  3                  4/12/15 15:53     0.943                  screen can be used with the same meaning of pick   1      0     1          0
1658  703394932  3                  4/12/15 12:18     0.471                  movement is part of clock                          1      0     1          0
1659  703401404  3                  4/14/15 10:14     0.471                  communism is a kind of society                     1      0     1          0
1660  703400337  3                  4/14/15 8:41      0.000                  defeat can be used as the opposite of win          1      0     1          0
1661  703395073  3                  4/12/15 14:40     0.471                  crowd can be used as the opposite of small         1      0     1          0

1662 rows × 9 columns
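
One thing to watch with the one-hot step above: `pd.get_dummies` only creates columns for the categories it actually sees, so train and test can end up with different columns if a category is missing from one of them. The sketch below shows one way to guard against that on toy data. The frames `toy_train` and `toy_test` and the `state` prefix are illustrative assumptions, not part of the challenge data.

```python
import pandas as pd

# Toy frames standing in for train/test; the values are illustrative only.
toy_train = pd.DataFrame({"unit_state": ["finalized", "golden", "finalized"]})
toy_test = pd.DataFrame({"unit_state": ["finalized", "finalized"]})

# One-hot encode each frame separately, with a prefix so column names stay readable.
train_encoded = pd.get_dummies(toy_train, columns=["unit_state"], prefix="state")
test_encoded = pd.get_dummies(toy_test, columns=["unit_state"], prefix="state")

# Reindex the test columns to match train; categories missing in test are filled with 0.
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(train_encoded.columns))
print(list(test_encoded.columns))
```

In this notebook both frames happen to contain the same categories, so the extra alignment is a precaution rather than a fix.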

Transforming Train Data for Submission

In [8]:
# Note: Series.str.lower() and Series.replace() return new Series. Their results
# are not assigned here, so the 'sentence' column reaches the vectorizer unchanged.
# TfidfVectorizer lowercases and tokenizes the text itself by default.
train_data_df['sentence'].str.lower()
train_data_df['sentence'].replace('[^a-zA-Z0-9]', ' ', regex = True)

# Convert the sentences to a matrix of TF-IDF features, keeping only terms
# that appear in at least 5 training sentences (min_df=5)
vectorizer = TfidfVectorizer(min_df=5)
X_tfidf = vectorizer.fit_transform(train_data_df['sentence'])

# Merge the TF-IDF features into the DataFrame and drop the now-redundant sentence column
train_data_df = pd.concat([train_data_df, pd.DataFrame(X_tfidf.toarray())], axis=1)
train_data_df.drop("sentence", axis=1, inplace=True)
display(train_data_df)
display(train_data_df)
      unit_id    trusted_judgments  last_judgment_at  agree_or_not_variance  agree_or_not  False  True  finalized  golden  0    ...  1046  1047  1048  1049  1050  1051  1052  1053  1054  1055
0     703393947  3                  4/12/15 17:24     0.471                  4.67          1      0     1          0       0.0  ...  0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
1     703395558  3                  4/12/15 16:04     0.471                  3.67          1      0     1          0       0.0  ...  0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
2     703398442  3                  4/14/15 12:58     0.471                  4.67          1      0     1          0       0.0  ...  0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
3     703395452  3                  4/12/15 10:45     0.000                  5.00          1      0     1          0       0.0  ...  0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
4     703401266  3                  4/14/15 6:27      0.471                  4.67          1      0     1          0       0.0  ...  0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
...   ...        ...                ...               ...                    ...           ...    ...   ...        ...     ...  ...  ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
6639  703401541  3                  4/14/15 6:47      0.000                  1.00          1      0     1          0       0.0  ...  0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
6640  703401351  3                  4/12/15 21:14     0.000                  5.00          1      0     1          0       0.0  ...  0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
6641  703400471  3                  4/13/15 16:10     0.000                  1.00          1      0     1          0       0.0  ...  0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
6642  703399213  3                  4/12/15 21:19     1.886                  2.33          1      0     1          0       0.0  ...  0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
6643  703397433  3                  4/14/15 6:54      0.471                  4.33          1      0     1          0       0.0  ...  0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0

6644 rows × 1065 columns
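
A detail worth knowing when cleaning text with pandas: string methods such as `.str.lower()` and `.replace()` return a new Series and leave the original column untouched unless the result is assigned back. The following is a minimal sketch on a one-row toy frame (the frame `toy` is an illustrative assumption) showing the difference. Applying this kind of cleaning to the real data would likely change the TF-IDF vocabulary, for example by splitting `go_down` into two tokens.

```python
import pandas as pd

toy = pd.DataFrame({"sentence": ["Climb can be used as the opposite of go_down"]})

# Calling the method without assignment leaves the column untouched.
toy["sentence"].str.lower()
unchanged = toy["sentence"].iloc[0]

# Assigning the result back is what makes the cleaning persist.
toy["sentence"] = toy["sentence"].str.lower()
toy["sentence"] = toy["sentence"].replace("[^a-zA-Z0-9]", " ", regex=True)
cleaned = toy["sentence"].iloc[0]

print(unchanged)
print(cleaned)
```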

In [9]:
# Separating data from the dataframe for final training
X = normalize(train_data_df.drop(["unit_id", "last_judgment_at", "agree_or_not"], axis=1).to_numpy())
y = train_data_df["agree_or_not"].to_numpy()
print(X.shape, y.shape)
(6644, 1062) (6644,)
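
For readers unfamiliar with `sklearn.preprocessing.normalize`: with its default arguments it rescales each row (sample) to unit L2 norm, rather than standardizing each column. The sketch below illustrates this on a tiny hand-made array; the array `rows` is an illustrative assumption.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Two toy samples with two features each.
rows = np.array([[3.0, 4.0], [0.0, 2.0]])

# Defaults are norm="l2" and axis=1, so every row is divided by its own L2 norm.
unit_rows = normalize(rows)
row_norms = np.linalg.norm(unit_rows, axis=1)

print(unit_rows)
print(row_norms)
```

Because the scaling is per row, a comparatively large column such as `trusted_judgments` can dominate each row's norm. A per-column scaler may be worth trying as an alternative.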

Transforming Test Data for Submission

In [10]:
# Note: Series.str.lower() and Series.replace() return new Series. Their results
# are not assigned here, so the 'sentence' column reaches the vectorizer unchanged.
# TfidfVectorizer lowercases and tokenizes the text itself by default.
test_data_df['sentence'].str.lower()
test_data_df['sentence'].replace('[^a-zA-Z0-9]', ' ', regex = True)

# Reuse the vectorizer fitted on the training data (transform, not fit_transform)
# so the test features have the same columns as the training features
X_tfidf_test = vectorizer.transform(test_data_df['sentence'])

# Merge the TF-IDF features into the DataFrame and drop the now-redundant sentence column
test_data_df = pd.concat([test_data_df, pd.DataFrame(X_tfidf_test.toarray())], axis=1)
test_data_df.drop("sentence", axis=1, inplace=True)
display(test_data_df)
      unit_id    trusted_judgments  last_judgment_at  agree_or_not_variance  False  True  finalized  golden  0    1    ...  1046  1047  1048  1049  1050  1051  1052  1053  1054  1055
0     703393947  3                  4/12/15 17:24     0.471                  1      0     1          0       0.0  0.0  ...  0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
1     703395484  3                  4/12/15 11:40     0.943                  1      0     1          0       0.0  0.0  ...  0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
2     703401918  3                  4/14/15 5:16      0.471                  1      0     1          0       0.0  0.0  ...  0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
3     703395476  3                  4/12/15 13:50     1.247                  1      0     1          0       0.0  0.0  ...  0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
4     703397489  3                  4/14/15 7:18      0.471                  1      0     1          0       0.0  0.0  ...  0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
...   ...        ...                ...               ...                    ...    ...   ...        ...     ...  ...  ...  ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
1657  703396215  3                  4/12/15 15:53     0.943                  1      0     1          0       0.0  0.0  ...  0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
1658  703394932  3                  4/12/15 12:18     0.471                  1      0     1          0       0.0  0.0  ...  0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
1659  703401404  3                  4/14/15 10:14     0.471                  1      0     1          0       0.0  0.0  ...  0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
1660  703400337  3                  4/14/15 8:41      0.000                  1      0     1          0       0.0  0.0  ...  0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
1661  703395073  3                  4/12/15 14:40     0.471                  1      0     1          0       0.0  0.0  ...  0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0

1662 rows × 1064 columns
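
The train and test transformations above rely on fitting the vectorizer once and reusing it. The sketch below shows that pattern on a few toy sentences, along with the effect of `min_df`. The sentence lists and the choice `min_df=2` are illustrative assumptions (the notebook itself uses `min_df=5` on the full training set).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_sentences = [
    "baby is a kind of person",
    "time is a kind of experience",
    "hate can be used as the opposite of love",
]
test_sentences = ["program is a kind of performance"]

# min_df=2 keeps only terms that occur in at least two training sentences.
toy_vectorizer = TfidfVectorizer(min_df=2)

# Learn the vocabulary and IDF weights from the training sentences only.
train_matrix = toy_vectorizer.fit_transform(train_sentences)

# Reuse the fitted vocabulary for test; unseen words are simply ignored.
test_matrix = toy_vectorizer.transform(test_sentences)

vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)
print(train_matrix.shape, test_matrix.shape)
```

Note that the default tokenizer drops single-character tokens such as "a", so they never enter the vocabulary.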

Splitting the data

In [11]:
# Splitting the data into training (80%) and validation (20%) sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
print(X_train.shape)
print(y_train.shape)
(5315, 1062)
(5315,)
In [12]:
X_train[0], y_train[0]
Out[12]:
(array([0.85812971, 0.13472636, 0.28604324, ..., 0.        , 0.        ,
        0.        ]),
 1.67)
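
Since `train_test_split` shuffles the rows randomly, each run of the split above can produce a different partition and therefore a slightly different validation score. A minimal sketch of making the split repeatable is shown here; the toy arrays and the seed value `0` are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Passing the same random_state yields the same partition on every call.
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

X_tr, X_va, y_tr, y_va = split_a
print(X_tr.shape, X_va.shape)
```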

Training the Model

In [13]:
model = GradientBoostingRegressor()
model.fit(X_train, y_train)
Out[13]:
GradientBoostingRegressor()
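
The model above uses scikit-learn's default hyperparameters. As a starting point for experimentation, the sketch below fits a `GradientBoostingRegressor` with a few explicitly set parameters on synthetic data. The data and the parameter values are illustrative assumptions, not tuned settings for this challenge.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic regression problem: the target depends linearly on the first feature.
rng = np.random.default_rng(0)
X_syn = rng.uniform(-1, 1, size=(200, 3))
y_syn = 2.0 * X_syn[:, 0] + rng.normal(scale=0.1, size=200)

# Explicit hyperparameters; random_state makes the fit repeatable.
toy_model = GradientBoostingRegressor(
    n_estimators=50, learning_rate=0.1, max_depth=3, random_state=0
)
toy_model.fit(X_syn, y_syn)

toy_preds = toy_model.predict(X_syn[:5])
print(toy_preds.shape)
```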

Validation

In [14]:
model.score(X_val, y_val)
Out[14]:
0.20836791644202257
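
For a regressor, `score` returns the coefficient of determination R², so the value above means the model explains roughly 21% of the variance in the validation labels. The sketch below computes R² by hand on toy values and compares it with `sklearn.metrics.r2_score`. It also computes RMSE, which is expressed in the same units as the label. The arrays are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

r2_sklearn = r2_score(y_true, y_pred)

# Root mean squared error, in the units of the target.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print(r2_manual, r2_sklearn, rmse)
```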

That completes the baseline. Next, let's generate predictions on the real test data and see how to submit them to the challenge.

Predictions

In [15]:
# Separating data from the dataframe for final testing
X_test = normalize(test_data_df.drop(["unit_id", "last_judgment_at"], axis=1).to_numpy())
print(X_test.shape)
(1662, 1062)
In [16]:
# Predicting the labels
predictions = model.predict(X_test)
predictions.shape
Out[16]:
(1662,)
In [17]:
# Converting the predictions array into a pandas DataFrame
submission = pd.DataFrame({"agree_or_not":predictions})
submission
Out[17]:
      agree_or_not
0     3.982392
1     2.019913
2     3.553756
3     2.976471
4     3.982392
...   ...
1657  3.075700
1658  4.165608
1659  3.839829
1660  3.839829
1661  3.553756

1662 rows × 1 columns

In [18]:
# Saving the pandas dataframe
!rm -rf assets
!mkdir assets
submission.to_csv(os.path.join("assets", "submission.csv"), index=False)
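
Before uploading, it can help to read the saved file back and confirm that it has the expected column and one row per test sample. The sketch below does this round trip in a temporary directory with toy predictions; the values and the temporary path are illustrative assumptions.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy predictions standing in for model.predict(X_test).
toy_predictions = np.array([3.98, 2.02, 3.55])
toy_submission = pd.DataFrame({"agree_or_not": toy_predictions})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "submission.csv")
    # index=False keeps the row index out of the file.
    toy_submission.to_csv(path, index=False)
    round_trip = pd.read_csv(path)

print(list(round_trip.columns), len(round_trip))
```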

Submitting our Predictions

Note: Please save the notebook before submitting it (Ctrl + S).

In [19]:
!!aicrowd submission create -c agree -f assets/submission.csv
Out[19]:
['submission.csv ━━━━━━━━━━━━━━━━━━━━━━ 100.0% • 32.3/30.7 KB • 8.0 MB/s • 0:00:00',
 '                                  ╭─────────────────────────╮                                  ',
 '                                  │ Successfully submitted! │                                  ',
 '                                  ╰─────────────────────────╯                                  ',
 '                                        Important links                                        ',
 '┌──────────────────┬──────────────────────────────────────────────────────────────────────────┐',
 '│  This submission │ https://www.aicrowd.com/challenges/agree/submissions/167693              │',
 '│                  │                                                                          │',
 '│  All submissions │ https://www.aicrowd.com/challenges/agree/submissions?my_submissions=true │',
 '│                  │                                                                          │',
 '│      Leaderboard │ https://www.aicrowd.com/challenges/agree/leaderboards                    │',
 '│                  │                                                                          │',
 '│ Discussion forum │ https://discourse.aicrowd.com/c/agree                                    │',
 '│                  │                                                                          │',
 '│   Challenge page │ https://www.aicrowd.com/challenges/agree                                 │',
 '└──────────────────┴──────────────────────────────────────────────────────────────────────────┘',
 "{'submission_id': 167693, 'created_at': '2021-12-12T20:50:44.348Z'}"]