Programming Language Classification

Getting Started Notebook for Language Classification

A getting started notebook with random submission for the challenge.


Getting Started with Programming Language Classification

In this puzzle, we have to identify the programming language of a code snippet. Because the snippets are text, we first need to tokenize them. Along the way, we will learn more about tokenization and classification algorithms.

In this starter notebook:

For tokenization: We will use CountVectorizer and TfidfTransformer.

For Classification: We will use Multinomial Naive Bayes Classifier.
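To make the tokenization step concrete, here is a minimal sketch of what CountVectorizer and TfidfTransformer produce. The two snippets are made up for illustration and are not taken from the challenge data, and default settings are assumed.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Two made-up snippets, for illustration only (not from the dataset).
snippets = [
    "int sum = 0; for (int i = 0; i < n; i++) sum += i;",
    "def total(n): return sum(range(n))",
]

# With default settings, CountVectorizer lowercases the text and keeps
# only tokens of two or more word characters.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(snippets)  # sparse matrix: snippets x vocabulary

# Learned tokens in column order (columns are sorted alphabetically).
vocabulary = sorted(vectorizer.vocabulary_)

# Re-weight the raw counts by inverse document frequency.
tfidf = TfidfTransformer().fit_transform(counts)

print(vocabulary)
print(counts.toarray())
print(tfidf.shape)
```

Note that the default token pattern drops operators, brackets, and single-character names such as `i` and `n`. For source code these symbols may carry useful signal, so a custom `token_pattern` is one thing worth experimenting with.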

AIcrowd code utilities for downloading data for Language Classification

Download the files 💾

Download AIcrowd CLI

We will first install aicrowd-cli, which helps you download the data and later make a submission directly from the notebook.

!pip install aicrowd-cli
%load_ext aicrowd.magic
Login to AIcrowd ㊗

%aicrowd login
Download Dataset

We will create a folder named data and download the files there.

!rm -rf data
!mkdir data
%aicrowd ds dl -c programming-language-classification -o data

Importing Libraries:

import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sn

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score,accuracy_score,f1_score

from sklearn import set_config

plt.rcParams["figure.figsize"] = (15,6)

Diving into the dataset 🕵️‍♂️

train_df = pd.read_csv("data/train.csv")
test_df = pd.read_csv("data/test.csv")
      id  code                                           language
0  14026  var result = testObj1 | testObj2;\...          c-sharp
1  12201  /// Initializes a new instance of ...          c-sharp
2  17074  /*\n\n Explanation :- a user gives a Strin...  javascript
3  21102  int sum = 0;\n\n for (int i = ...              c-plus-plus
4  53065  if (p->data < min)\n\n {\n\n ...               c

Distribution by Programming Language

# No. of unique programming languages in the dataset
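The code for this step is not shown here. One possible way to count the languages and their frequencies is sketched below; `toy_df` is a hypothetical stand-in for `train_df`, assuming only that it has a `language` column.

```python
import pandas as pd

# Hypothetical stand-in for train_df; only the "language" column matters here.
toy_df = pd.DataFrame(
    {"language": ["python", "c", "python", "java", "c", "python"]}
)

# Number of distinct languages.
n_languages = toy_df["language"].nunique()

# Samples per language, sorted from most to least frequent.
language_counts = toy_df["language"].value_counts()

print(n_languages)
print(language_counts)
```

Calling `language_counts.plot(kind="bar")` would draw the same distribution as a bar chart.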

Encoding ✊:

Here, we will encode the programming language with LabelEncoder and create a column/feature that will be the target for prediction.

In simple words, LabelEncoder encodes the target labels with values between 0 and n_classes-1. See the scikit-learn LabelEncoder documentation to learn more.
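As a small illustration of that mapping, the sketch below fits a LabelEncoder on a made-up list of labels (not the challenge data) and shows the round trip through `transform` and `inverse_transform`.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels, for illustration only.
languages = ["python", "c", "javascript", "c", "python"]

encoder = LabelEncoder().fit(languages)

# Classes are stored in sorted order; each class's index is its integer code.
classes = list(encoder.classes_)

encoded = encoder.transform(languages)        # strings -> integer codes
decoded = encoder.inverse_transform(encoded)  # integer codes -> strings

print(classes)
print(list(encoded))
print(list(decoded))
```

The same `inverse_transform` call is what turns numeric predictions back into language names later on.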

from sklearn.preprocessing import LabelEncoder
LE = LabelEncoder().fit(train_df.language)
train_df["target"] = LE.transform(train_df.language)
      id  code                                           language     target
0  14026  var result = testObj1 | testObj2;\...          c-sharp      3
1  12201  /// Initializes a new instance of ...          c-sharp      3
2  17074  /*\n\n Explanation :- a user gives a Strin...  javascript   8
3  21102  int sum = 0;\n\n for (int i = ...              c-plus-plus  2
4  53065  if (p->data < min)\n\n {\n\n ...               c            1

Splitting the dataset

Here we will split our dataset into training, validation, and test sets.

X_train, X_comb, Y_train, Y_comb = train_test_split(train_df["code"], train_df["target"], test_size=0.3, random_state=0, shuffle=False)
X_validation, X_test, Y_validation, Y_test = train_test_split(X_comb, Y_comb, test_size=0.5, random_state=0, shuffle=False)
((31939,), (6844,), (6845,), (31939,), (6844,), (6845,))
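The two-stage split above yields roughly 70% training, 15% validation, and 15% test data. The sketch below reproduces the same pattern on 100 made-up samples so the proportions and ordering are easy to see; the variable names are illustrative only.

```python
from sklearn.model_selection import train_test_split

# 100 made-up samples with dummy labels, for illustration only.
samples = list(range(100))
labels = [i % 2 for i in samples]

# Stage 1: hold out the last 30% of the data.
X_tr, X_rest, y_tr, y_rest = train_test_split(
    samples, labels, test_size=0.3, shuffle=False
)

# Stage 2: cut the held-out part in half for validation and test.
X_val, X_te, y_val, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, shuffle=False
)

print(len(X_tr), len(X_val), len(X_te))
print(X_tr[0], X_val[0], X_te[0])
```

Because `shuffle=False`, the rows keep their original order and `random_state` has no effect. If the rows happen to be grouped by language, shuffling or a stratified split may be worth trying.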

Baseline-Model ⛹:

To build the model, we are going to create a simple pipeline. We'll use:

  • CountVectorizer: Convert a collection of text documents to a matrix of token counts. Learn more about CountVectorizer

  • TfidfTransformer: Transform a count matrix to a normalized tf (term frequency) or tf-idf representation. Learn more about TfidfTransformer

  • MultinomialNB: It implements the naive Bayes algorithm for multinomially distributed data.

classifier = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])
classifier = classifier.fit(X_train, Y_train)
Pipeline(steps=[('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
                ('clf', MultinomialNB())])
print("F1:" ,f1_score(Y_validation,classifier.predict(X_validation),average='macro'))
print("Accuracy:" ,accuracy_score(Y_validation,classifier.predict(X_validation))*100)
F1: 0.3301637817357806
Accuracy: 65.31268264172998
print("F1:" ,f1_score(Y_test,classifier.predict(X_test),average='macro'))
print("Accuracy:" ,accuracy_score(Y_test,classifier.predict(X_test))*100)
F1: 0.32193277845963536
Accuracy: 65.2154857560263
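The macro F1 score is much lower than the accuracy. One possible explanation, which the results above do not confirm, is that some languages are predicted rarely or never: macro F1 averages the per-class F1 scores with equal weight, so poorly predicted classes pull it down even when overall accuracy is decent. The sketch below shows the effect on made-up labels.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: class 0 dominates and the "model" always predicts class 0.
y_true = [0] * 8 + [1, 2]
y_pred = [0] * 10

# 8 of 10 predictions are correct.
accuracy = accuracy_score(y_true, y_pred)

# Classes 1 and 2 are never predicted, so their F1 is 0 and they drag
# the unweighted (macro) average down. zero_division=0 silences the
# warning for classes with no predicted samples.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print("Accuracy:", accuracy)
print("Macro F1:", macro_f1)
```

Calling `f1_score(..., average=None)` on the real validation data returns one score per class, which shows which languages the baseline struggles with.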

Prediction Phase ✈

(9277, 2)
test_df['target'] = classifier.predict(test_df["code"])
      id  code                                               target
0  10684  28 = 22 + 23 + 24\n\n 33 = 32 + 23 + 24\n\n 49...  11
1  17536  this.path = path;\n\n this.estimat...              11
2  26383  {\n\n ...                                          1
3  29090  /**\n\n * Class for converting from "any" bas...   11
4  10482  { cout<<"Destructing base \n"; } ...               2
test_df["prediction"] = LE.inverse_transform(test_df["target"])

Generating Prediction File

test_df = test_df.sample(frac=1)
         id  code                                               target  prediction
1886  16780  }\n\n }\n\n }\n                                    11      python
7674  95609  free(p);\n\n }\n\n }\n\n /////////...              2       c-plus-plus
2753  16603  """\n\n Arguments:\n\n ...                         11      python
 785  81619  /** Creates a random set of points distributed...  11      python
4747  19636  using (new AssertionScope())\n\n ...               3       c-sharp
!rm -rf assets
!mkdir assets
test_df.to_csv(os.path.join("assets", "submission.csv"))

Submitting our Predictions

Note : Please save the notebook before submitting it (Ctrl + S)

%aicrowd notebook submit -c programming-language-classification -a assets --no-verify
