Programming Language Classification
Getting Started Notebook for Language Classification
A getting started notebook with random submission for the challenge.
Getting Started with Programming Language Classification
In this puzzle, we have to identify the programming language of a code snippet. Since the snippets are text, we first need to tokenize them and turn them into numeric features before a classifier can use them. Along the way, we will learn more about tokenization and classification algorithms.
In this starter notebook:
For tokenization: We will use CountVectorizer and TfidfTransformer.
For Classification: We will use Multinomial Naive Bayes Classifier.
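As a rough illustration of what the tokenization step produces, the sketch below runs CountVectorizer and TfidfTransformer on two made-up snippets. The snippets and variable names are illustrative assumptions, not part of the challenge data. With its default settings, CountVectorizer lowercases the text and keeps only tokens of two or more word characters, so single-letter names and punctuation are dropped.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Two toy snippets (hypothetical examples, not taken from the dataset)
snippets = [
    "def add(a, b): return a + b",
    "int add(int a, int b) { return a + b; }",
]

# Step 1: token counts per snippet
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(snippets)

# Columns of the count matrix follow the alphabetically sorted vocabulary
vocab = sorted(vectorizer.vocabulary_)
print(vocab)             # ['add', 'def', 'int', 'return']
print(counts.toarray())  # rows: [1 1 0 1] and [1 0 3 1]

# Step 2: re-weight the counts with tf-idf (rows are L2-normalized by default)
tfidf = TfidfTransformer().fit_transform(counts)
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)
print(row_norms)
```

Because the default tokenizer discards symbols such as `{`, `;` and `+`, which can be strong hints for a language, a custom `token_pattern` may be worth experimenting with later.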
AIcrowd code utilities for downloading data for Language Classification
!pip install aicrowd-cli
%load_ext aicrowd.magic
Login to AIcrowd ㊗¶
%aicrowd login
Download Dataset¶
We will create a folder named data and download the files there.
!rm -rf data
!mkdir data
%aicrowd ds dl -c programming-language-classification -o data
Importing Libraries:¶
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sn
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score,accuracy_score,f1_score
from sklearn import set_config
set_config(display="diagram")
plt.rcParams["figure.figsize"] = (15,6)
Diving into the dataset 🕵️‍♂️¶
train_df = pd.read_csv("data/train.csv")
test_df = pd.read_csv("data/test.csv")
train_df.head()
Distribution by Programming Language¶
sn.countplot(x="language", data=train_df)
# Number of unique programming languages in the dataset
train_df["language"].nunique()
Encoding ✊:¶
Here, we will encode the programming language names and create a column/feature that will be the prediction target, with the help of LabelEncoder.
In simple words, LabelEncoder encodes the target labels with values between 0 and n_classes-1. Learn more about LabelEncoder.
from sklearn.preprocessing import LabelEncoder
LE = LabelEncoder().fit(train_df.language)
train_df["target"] = LE.transform(train_df.language)
train_df.head()
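The encoding step can be tried in isolation. The sketch below uses a small hypothetical list of language names (not the real dataset) to show that LabelEncoder assigns integer codes in sorted order of the class names, and that inverse_transform recovers the original names, which is what the prediction phase relies on later.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, used only for illustration
labels = ["python", "c", "java", "python"]

encoder = LabelEncoder().fit(labels)
print(encoder.classes_.tolist())  # ['c', 'java', 'python']

encoded = encoder.transform(labels)
print(encoded.tolist())           # [2, 0, 1, 2]

# Map the integer codes back to language names
decoded = encoder.inverse_transform(encoded)
print(decoded.tolist())           # ['python', 'c', 'java', 'python']
```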
Splitting the dataset¶
Here we will split our dataset into training, validation and test sets (70% / 15% / 15%).
# shuffle=False keeps the original row order, so random_state has no effect here.
X_train, X_comb, Y_train, Y_comb = train_test_split(train_df["code"], train_df["target"], test_size=0.3, random_state=0, shuffle=False)
# Split the held-out 30% in half: 15% validation, 15% test.
X_validation, X_test, Y_validation, Y_test = train_test_split(X_comb, Y_comb, test_size=0.5, random_state=0, shuffle=False)
X_train.shape,X_validation.shape,X_test.shape,Y_train.shape,Y_validation.shape,Y_test.shape
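The split above is sequential. If the rows of train.csv happened to be grouped by language (an assumption, not something this notebook verifies), a shuffled, stratified split could give more representative subsets. A minimal sketch of that alternative on a synthetic DataFrame, where `toy` and its columns are made up for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for train_df: 100 rows, 4 balanced classes
toy = pd.DataFrame({
    "code": [f"snippet {i}" for i in range(100)],
    "target": [i % 4 for i in range(100)],
})

# 70% train, 30% held out, keeping class proportions similar in both parts
X_tr, X_rest, y_tr, y_rest = train_test_split(
    toy["code"], toy["target"],
    test_size=0.3, random_state=0, stratify=toy["target"],
)

# Split the held-out part in half: 15% validation, 15% test
X_val, X_te, y_val, y_te = train_test_split(
    X_rest, y_rest,
    test_size=0.5, random_state=0, stratify=y_rest,
)

print(len(X_tr), len(X_val), len(X_te))  # 70 15 15
```

Here random_state does matter, since shuffling is enabled by default whenever stratify is used.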
Baseline-Model ⛹:¶
To build the model, we are going to create a simple pipeline. We'll use:
CountVectorizer: Convert a collection of text documents to a matrix of token counts. Learn more about CountVectorizer
TfidfTransformer: Transform a count matrix to a normalized tf (term frequency) or tf-idf representation. Learn more about TfidfTransformer
MultinomialNB: It implements the naive Bayes algorithm for multinomially distributed data.
classifier = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])
classifier = classifier.fit(X_train, Y_train)
classifier
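To see the three stages working together without the real data, the same pipeline can be fitted on a handful of toy snippets. Everything in this sketch (snippets, labels, the `toy_` names) is hypothetical and only meant to show the fit/predict flow. String labels are passed directly here for readability, whereas the notebook uses the integer target column.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny hand-written training set (illustrative only)
toy_code = [
    "def foo(): return 1",
    "import numpy as np",
    "public static void main",
    "System.out.println(x);",
]
toy_labels = ["python", "python", "java", "java"]

# Same three stages as the baseline pipeline
toy_classifier = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
toy_classifier.fit(toy_code, toy_labels)

# Tokens never seen in training ("os", "run") are simply ignored
toy_predictions = toy_classifier.predict(["import os", "public void run"]).tolist()
print(toy_predictions)  # ['python', 'java']
```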
print("Validation F1:", f1_score(Y_validation, classifier.predict(X_validation), average='macro'))
print("Validation Accuracy:", accuracy_score(Y_validation, classifier.predict(X_validation)) * 100)
print("Test F1:", f1_score(Y_test, classifier.predict(X_test), average='macro'))
print("Test Accuracy:", accuracy_score(Y_test, classifier.predict(X_test)) * 100)
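For readers unfamiliar with average='macro': the macro F1 score is the unweighted mean of the per-class F1 scores, so rare languages count as much as common ones. The sketch below recomputes it by hand on a small made-up set of labels and compares the result with scikit-learn. The labels are illustrative assumptions, not results from this notebook.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical true and predicted class ids for six samples
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

per_class_f1 = []
for cls in sorted(set(y_true)):
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    # F1 = 2*TP / (2*TP + FP + FN); the denominator is non-zero for this toy data
    per_class_f1.append(2 * tp / (2 * tp + fp + fn))

manual_macro_f1 = sum(per_class_f1) / len(per_class_f1)
sklearn_macro_f1 = f1_score(y_true, y_pred, average="macro")

print(per_class_f1)                    # roughly [0.5, 0.8, 0.667]
print(manual_macro_f1, sklearn_macro_f1)
print(accuracy_score(y_true, y_pred))  # 4 of 6 correct
```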
Prediction Phase ✈¶
test_df.shape
test_df['target'] = classifier.predict(test_df["code"])
test_df.head()
test_df["prediction"] = LE.inverse_transform(test_df.target)
Generating Prediction File¶
test_df = test_df.sample(frac=1)
test_df.head()
!rm -rf assets
!mkdir assets
test_df.to_csv(os.path.join("assets", "submission.csv"))
Submitting our Predictions¶
Note: Please save the notebook before submitting it (Ctrl + S).
%aicrowd notebook submit -c programming-language-classification -a assets --no-verify