Getting Started with Programming Language Classification

In this puzzle, the task is to identify the programming language of a code snippet. Since the snippets are plain text, we first need to tokenize them and turn them into numeric features before training a classifier. Along the way, we will learn more about tokenization and classification algorithms.

In this starter notebook:

For tokenization: We will use CountVectorizer and TfidfTransformer.

For Classification: We will use Multinomial Naive Bayes Classifier.

AIcrowd code utilities for downloading data for Language Classification

Download the files 💾¶

Download AIcrowd CLI¶

We will first install aicrowd-cli, which lets you download the dataset and later make a submission directly from the notebook.

In [ ]:

!pip install aicrowd-cli
%load_ext aicrowd.magic

Collecting aicrowd-cli
  Downloading aicrowd_cli-0.1.10-py3-none-any.whl (44 kB)
     |████████████████████████████████| 44 kB 2.7 MB/s 
Requirement already satisfied: toml<1,>=0.10.2 in /usr/local/lib/python3.7/dist-packages (from aicrowd-cli) (0.10.2)
Collecting requests-toolbelt<1,>=0.9.1
  Downloading requests_toolbelt-0.9.1-py2.py3-none-any.whl (54 kB)
     |████████████████████████████████| 54 kB 3.1 MB/s 
Collecting rich<11,>=10.0.0
  Downloading rich-10.16.1-py3-none-any.whl (214 kB)
     |████████████████████████████████| 214 kB 22.3 MB/s 
Collecting pyzmq==22.1.0
  Downloading pyzmq-22.1.0-cp37-cp37m-manylinux1_x86_64.whl (1.1 MB)
     |████████████████████████████████| 1.1 MB 44.4 MB/s 
Requirement already satisfied: tqdm<5,>=4.56.0 in /usr/local/lib/python3.7/dist-packages (from aicrowd-cli) (4.62.3)
Collecting GitPython==3.1.18
  Downloading GitPython-3.1.18-py3-none-any.whl (170 kB)
     |████████████████████████████████| 170 kB 56.8 MB/s 
Collecting requests<3,>=2.25.1
  Downloading requests-2.26.0-py2.py3-none-any.whl (62 kB)
     |████████████████████████████████| 62 kB 994 kB/s 
Requirement already satisfied: click<8,>=7.1.2 in /usr/local/lib/python3.7/dist-packages (from aicrowd-cli) (7.1.2)
Requirement already satisfied: typing-extensions>=3.7.4.0 in /usr/local/lib/python3.7/dist-packages (from GitPython==3.1.18->aicrowd-cli) (3.10.0.2)
Collecting gitdb<5,>=4.0.1
  Downloading gitdb-4.0.9-py3-none-any.whl (63 kB)
     |████████████████████████████████| 63 kB 2.0 MB/s 
Collecting smmap<6,>=3.0.1
  Downloading smmap-5.0.0-py3-none-any.whl (24 kB)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (1.24.3)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (2.0.8)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (2021.10.8)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.25.1->aicrowd-cli) (2.10)
Collecting colorama<0.5.0,>=0.4.0
  Downloading colorama-0.4.4-py2.py3-none-any.whl (16 kB)
Requirement already satisfied: pygments<3.0.0,>=2.6.0 in /usr/local/lib/python3.7/dist-packages (from rich<11,>=10.0.0->aicrowd-cli) (2.6.1)
Collecting commonmark<0.10.0,>=0.9.0
  Downloading commonmark-0.9.1-py2.py3-none-any.whl (51 kB)
     |████████████████████████████████| 51 kB 7.6 MB/s 
Installing collected packages: smmap, requests, gitdb, commonmark, colorama, rich, requests-toolbelt, pyzmq, GitPython, aicrowd-cli
  Attempting uninstall: requests
    Found existing installation: requests 2.23.0
    Uninstalling requests-2.23.0:
      Successfully uninstalled requests-2.23.0
  Attempting uninstall: pyzmq
    Found existing installation: pyzmq 22.3.0
    Uninstalling pyzmq-22.3.0:
      Successfully uninstalled pyzmq-22.3.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests~=2.23.0, but you have requests 2.26.0 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
Successfully installed GitPython-3.1.18 aicrowd-cli-0.1.10 colorama-0.4.4 commonmark-0.9.1 gitdb-4.0.9 pyzmq-22.1.0 requests-2.26.0 requests-toolbelt-0.9.1 rich-10.16.1 smmap-5.0.0

Login to AIcrowd ㊗¶

In [ ]:

%aicrowd login

Please login here: https://api.aicrowd.com/auth/YNKO9RiTFTggqCK7TpnZTllYXRrS2M3lSdjoupjW0Nk
API Key valid
Saved API Key successfully!

Download Dataset¶

We will create a folder named data and download the files there.

In [ ]:

!rm -rf data
!mkdir data
%aicrowd ds dl -c programming-language-classification -o data

Importing Libraries:¶

In [ ]:

import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sn

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score,accuracy_score,f1_score

from sklearn import set_config
set_config(display="diagram")

plt.rcParams["figure.figsize"] = (15,6)

Diving in the dataset 🕵️‍♂️¶

In [ ]:

train_df = pd.read_csv("data/train.csv")

In [ ]:

test_df = pd.read_csv("data/test.csv")

In [ ]:

train_df.head()

Out[ ]:

	id	code	language
0	14026	var result = testObj1 \| testObj2;\...	c-sharp
1	12201	/// Initializes a new instance of ...	c-sharp
2	17074	/*\n\n Explanation :- a user gives a Strin...	javascript
3	21102	int sum = 0;\n\n for (int i = ...	c-plus-plus
4	53065	if (p->data < min)\n\n {\n\n ...	c

Distribution by Programming Language¶

In [ ]:

sn.countplot(x=train_df['language'])

Out[ ]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f6f6f87ed90>

In [ ]:

# Number of unique programming languages in the dataset

train_df["language"].nunique()

Out[ ]:

Encoding ✊:¶

Here, we will encode the programming language labels with LabelEncoder and store the result in a new column, target, which the model will learn to predict.

In simple terms, LabelEncoder encodes the target labels as integers between 0 and n_classes-1. See the scikit-learn documentation to learn more about LabelEncoder.
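
As a small illustration of the encoding described above, the sketch below fits a LabelEncoder on a made-up list of language names. The toy_labels values and the toy_* variable names are assumptions for this example, not the dataset's labels. LabelEncoder sorts the unique labels, so the integer codes follow alphabetical order, and inverse_transform maps the codes back to names.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels for illustration only; not taken from the dataset.
toy_labels = ["python", "c", "java", "c", "python"]

toy_encoder = LabelEncoder().fit(toy_labels)

# classes_ holds the sorted unique labels; a label's index is its integer code.
print(toy_encoder.classes_.tolist())   # ['c', 'java', 'python']

toy_codes = toy_encoder.transform(toy_labels)
print(toy_codes.tolist())              # [2, 0, 1, 0, 2]

# inverse_transform recovers the original strings from the integer codes.
toy_decoded = toy_encoder.inverse_transform(toy_codes)
print(toy_decoded.tolist())            # ['python', 'c', 'java', 'c', 'python']
```

The same inverse_transform call is what later turns the model's integer predictions back into language names.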

In [ ]:

from sklearn.preprocessing import LabelEncoder
LE = LabelEncoder().fit(train_df.language)

In [ ]:

train_df["target"] = LE.transform(train_df.language)

In [ ]:

train_df.head()

Out[ ]:

	id	code	language	target
0	14026	var result = testObj1 \| testObj2;\...	c-sharp	3
1	12201	/// Initializes a new instance of ...	c-sharp	3
2	17074	/*\n\n Explanation :- a user gives a Strin...	javascript	8
3	21102	int sum = 0;\n\n for (int i = ...	c-plus-plus	2
4	53065	if (p->data < min)\n\n {\n\n ...	c	1

Splitting the dataset¶

Here we split our dataset into training, validation, and test sets (70%, 15%, and 15% of the rows, respectively).

In [ ]:

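# Keep the first 70% of rows for training, then halve the remaining 30% into validation and test.
# With shuffle=False the rows are split in their original order, so random_state has no effect.
X_train, X_comb, Y_train, Y_comb = train_test_split(train_df["code"], train_df["target"], test_size=0.3, random_state=0, shuffle=False)
X_validation, X_test, Y_validation, Y_test = train_test_split(X_comb, Y_comb, test_size=0.5, random_state=0, shuffle=False)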

In [ ]:

X_train.shape,X_validation.shape,X_test.shape,Y_train.shape,Y_validation.shape,Y_test.shape

Out[ ]:

((31939,), (6844,), (6845,), (31939,), (6844,), (6845,))

Baseline-Model ⛹:¶

To build the model, we will create a simple pipeline with three steps:

CountVectorizer: Converts a collection of text documents to a matrix of token counts.
TfidfTransformer: Transforms a count matrix to a normalized tf (term frequency) or tf-idf representation.
MultinomialNB: Implements the naive Bayes algorithm for multinomially distributed data.
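
To see what the tokenization step does to source code, the sketch below runs the analyzer of a default CountVectorizer on a single made-up snippet. By default it lowercases the text and keeps only tokens of two or more word characters, so operators and one-letter names are dropped. The second analyzer uses an assumed alternative token_pattern (not part of this notebook) that also keeps single symbols; whether it helps on this dataset is untested here.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up snippet for illustration only.
snippet = "int x = a + b;"

# Default analyzer: lowercases and keeps only tokens of 2+ word characters.
default_tokens = CountVectorizer().build_analyzer()(snippet)
print(default_tokens)   # ['int']

# Assumed alternative pattern (not from the notebook): whole words OR single symbols.
symbol_pattern = r"\w+|[^\w\s]"
symbol_analyzer = CountVectorizer(token_pattern=symbol_pattern, lowercase=False).build_analyzer()
symbol_tokens = symbol_analyzer(snippet)
print(symbol_tokens)    # ['int', 'x', '=', 'a', '+', 'b', ';']
```

Punctuation such as ; or -> can be a strong hint for some languages, so the token pattern is one knob worth experimenting with.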

In [ ]:

classifier = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])
classifier = classifier.fit(X_train, Y_train)

In [ ]:

classifier

Out[ ]:

Pipeline

Pipeline(steps=[('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
                ('clf', MultinomialNB())])

CountVectorizer

CountVectorizer()

TfidfTransformer

TfidfTransformer()

MultinomialNB

MultinomialNB()
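
For intuition about what the fitted pipeline returns, here is a minimal sketch of the same three steps trained on four made-up snippets. The toy_code and toy_lang values are assumptions for illustration, not dataset rows. It uses predict_proba to inspect the per-class probabilities instead of only the top prediction.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up snippets and labels, for illustration only.
toy_code = [
    "def add(a, b): return a + b",
    "import numpy as np",
    "public static void main(String[] args)",
    "System.out.println(total);",
]
toy_lang = ["python", "python", "java", "java"]

toy_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
toy_clf.fit(toy_code, toy_lang)

# predict_proba returns one row per input and one column per class,
# with columns ordered as in toy_clf.classes_.
toy_proba = toy_clf.predict_proba(["import pandas as pd"])
print(toy_clf.classes_.tolist())
print(toy_proba.round(3))
```

Each row of toy_proba sums to 1, so low maximum values can flag snippets the model is unsure about.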

In [ ]:

print("F1:" ,f1_score(Y_validation,classifier.predict(X_validation),average='macro'))
print("Accuracy:" ,accuracy_score(Y_validation,classifier.predict(X_validation))*100)

F1: 0.3301637817357806
Accuracy: 65.31268264172998

In [ ]:

print("F1:" ,f1_score(Y_test,classifier.predict(X_test),average='macro'))
print("Accuracy:" ,accuracy_score(Y_test,classifier.predict(X_test))*100)

F1: 0.32193277845963536
Accuracy: 65.2154857560263
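
The macro F1 reported above (about 0.32 to 0.33) is much lower than the accuracy (about 65%). Macro F1 averages the per-class F1 scores with equal weight, so a gap like this is what one would expect if some languages are rarely predicted correctly, although the notebook does not show per-class scores. The sketch below reproduces the effect on made-up labels (toy_true and toy_pred are assumptions for this example).

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up imbalanced labels: four samples of class 0, two of class 1.
toy_true = [0, 0, 0, 0, 1, 1]
# A degenerate classifier that always predicts the majority class.
toy_pred = [0, 0, 0, 0, 0, 0]

toy_acc = float(accuracy_score(toy_true, toy_pred))
# zero_division=0 scores class 1 (never predicted) as F1 = 0 without a warning.
toy_macro_f1 = float(f1_score(toy_true, toy_pred, average="macro", zero_division=0))

print(round(toy_acc, 3))       # 0.667
print(round(toy_macro_f1, 3))  # 0.4
```

Here class 0 gets F1 = 0.8 and class 1 gets F1 = 0, so the macro average is 0.4 even though two thirds of the predictions are correct.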

Prediction Phase ✈¶

In [ ]:

test_df.shape

Out[ ]:

(9277, 2)

In [ ]:

test_df['target'] = classifier.predict(test_df["code"])

In [ ]:

test_df.head()

Out[ ]:

	id	code	target
0	10684	28 = 22 + 23 + 24\n\n 33 = 32 + 23 + 24\n\n 49...	11
1	17536	this.path = path;\n\n this.estimat...	11
2	26383	{\n\n ...	1
3	29090	/*\n\n Class for converting from "any" bas...	11
4	10482	{ cout<<"Destructing base \n"; } ...	2

In [ ]:

test_df["prediction"] = LE.inverse_transform(test_df.target)

Generating Prediction File¶

In [ ]:

test_df = test_df.sample(frac=1)
test_df.head()

Out[ ]:

	id	code	target	prediction
1886	16780	}\n\n }\n\n }\n	11	python
7674	95609	free(p);\n\n }\n\n }\n\n /////////...	2	c-plus-plus
2753	16603	"""\n\n Arguments:\n\n ...	11	python
785	81619	/** Creates a random set of points distributed...	11	python
4747	19636	using (new AssertionScope())\n\n ...	3	c-sharp

In [ ]:

!rm -rf assets
!mkdir assets
test_df.to_csv(os.path.join("assets", "submission.csv"))

Submitting our Predictions¶

Note: Please save the notebook before submitting it (Ctrl + S).

In [ ]:

%aicrowd notebook submit -c programming-language-classification -a assets --no-verify

Using notebook: Baseline_Programming_Language_Classification.ipynb for submission...
Scrubbing API keys from the notebook...
Collecting notebook...

                                                            ╭─────────────────────────╮                                                            
                                                            │ Successfully submitted! │                                                            
                                                            ╰─────────────────────────╯

                                                                  Important links                                                                  
┌──────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│  This submission │ https://www.aicrowd.com/challenges/ai-blitz-xii/problems/programming-language-classification/submissions/169597              │
│                  │                                                                                                                              │
│  All submissions │ https://www.aicrowd.com/challenges/ai-blitz-xii/problems/programming-language-classification/submissions?my_submissions=true │
│                  │                                                                                                                              │
│      Leaderboard │ https://www.aicrowd.com/challenges/ai-blitz-xii/problems/programming-language-classification/leaderboards                    │
│                  │                                                                                                                              │
│ Discussion forum │ https://discourse.aicrowd.com/c/ai-blitz-xii                                                                                 │
│                  │                                                                                                                              │
│   Challenge page │ https://www.aicrowd.com/challenges/ai-blitz-xii/problems/programming-language-classification                                 │
└──────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
