Programming Language Classification

Getting Started Notebook for Language Classification

A getting started notebook with random submission for the challenge.

ashivani

Getting Started with Programming Language Classification

In this puzzle, the task is to identify the programming language of a code snippet. The snippets are plain text, so the first step is to tokenize them and turn the tokens into numeric features that a classifier can use. Along the way, we will learn more about tokenization and classification algorithms.

In this starter notebook:

For tokenization and feature weighting: We will use CountVectorizer and TfidfTransformer.

For Classification: We will use Multinomial Naive Bayes Classifier.
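To make the tokenization step concrete, here is a minimal sketch of what CountVectorizer's default analyzer keeps from a line of code. The snippet is made up for illustration, and the second token pattern is an assumption of ours, not something the baseline uses.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical one-line snippet, used only for illustration.
snippet = "int sum = 0; for (int i = 0; i < n; i++) { sum += i; }"

# build_analyzer() returns the function CountVectorizer applies to each document:
# lowercasing followed by the default token pattern r"(?u)\b\w\w+\b",
# which keeps only runs of two or more word characters.
default_tokens = CountVectorizer().build_analyzer()(snippet)
print(default_tokens)

# A looser pattern (an assumption, not part of the baseline) that also keeps
# single characters and runs of punctuation such as "+=".
code_pattern = r"\w+|[^\w\s]+"
code_tokens = CountVectorizer(token_pattern=code_pattern).build_analyzer()(snippet)
print(code_tokens)
```

With the default pattern, single-character names and all operators are dropped, which may matter for code because punctuation often distinguishes one language from another.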

AIcrowd code utilities for downloading data for Language Classification

Download the files 💾

Download AIcrowd CLI

We will first install aicrowd-cli, which lets us download the dataset and later make a submission directly from the notebook.

In [ ]:
!pip install aicrowd-cli
%load_ext aicrowd.magic
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests~=2.23.0, but you have requests 2.26.0 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
Successfully installed GitPython-3.1.18 aicrowd-cli-0.1.10 colorama-0.4.4 commonmark-0.9.1 gitdb-4.0.9 pyzmq-22.1.0 requests-2.26.0 requests-toolbelt-0.9.1 rich-10.16.1 smmap-5.0.0

Login to AIcrowd ㊗

In [ ]:
%aicrowd login
Please login here: https://api.aicrowd.com/auth/YNKO9RiTFTggqCK7TpnZTllYXRrS2M3lSdjoupjW0Nk
API Key valid
Saved API Key successfully!

Download Dataset

We will create a folder named data and download the files there.

In [ ]:
!rm -rf data
!mkdir data
%aicrowd ds dl -c programming-language-classification -o data

Importing Libraries:

In [ ]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sn

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score,accuracy_score,f1_score

from sklearn import set_config
set_config(display="diagram")

plt.rcParams["figure.figsize"] = (15,6)

Diving in the dataset 🕵️‍♂️

In [ ]:
train_df = pd.read_csv("data/train.csv")
In [ ]:
test_df = pd.read_csv("data/test.csv")
In [ ]:
train_df.head()
Out[ ]:
      id  code                                           language
0  14026  var result = testObj1 | testObj2;\...          c-sharp
1  12201  /// Initializes a new instance of ...          c-sharp
2  17074  /*\n\n Explanation :- a user gives a Strin...  javascript
3  21102  int sum = 0;\n\n for (int i = ...              c-plus-plus
4  53065  if (p->data < min)\n\n {\n\n ...               c

Distribution by Programming Language

In [ ]:
sn.countplot(x=train_df['language'])
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6f6f87ed90>
In [ ]:
# Number of unique programming languages in the dataset

train_df["language"].nunique()
Out[ ]:
15
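The count plot shows the class balance visually. The same information can be read as numbers with value_counts. Below is a small sketch on a made-up Series that stands in for train_df["language"]; the labels and counts are not from the dataset.

```python
import pandas as pd

# Toy stand-in for train_df["language"]; the labels and counts are made up.
languages = pd.Series(["python"] * 6 + ["c"] * 3 + ["go"] * 1, name="language")

# Absolute counts per class, sorted from most to least frequent.
counts = languages.value_counts()

# Share of each class. A skewed distribution here is a hint that accuracy
# alone may overstate how well a model handles the rare classes.
shares = languages.value_counts(normalize=True)

print(counts)
print(shares.round(2))
```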

Encoding ✊:

Here, we will encode the programming language with the help of Label Encoder and store the result in a new column that serves as the prediction target.

In simple words, Label Encoder encodes the target labels as integers between 0 and n_classes-1. Learn more about Label Encoder.
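As a quick illustration of that mapping, the sketch below fits a LabelEncoder on a handful of made-up labels and converts them in both directions. The labels are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels for illustration; the real ones come from train_df.language.
labels = ["python", "c", "javascript", "c", "python"]

encoder = LabelEncoder().fit(labels)

# classes_ is sorted, so each integer code is the label's position
# in alphabetical order.
classes = list(encoder.classes_)
print(classes)

# transform maps strings to integer codes.
codes = encoder.transform(labels)
print(codes.tolist())

# inverse_transform maps integer codes back to the original strings.
decoded = encoder.inverse_transform(codes)
print(decoded.tolist())
```

The inverse mapping is what lets us turn the model's integer predictions back into language names later on.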

In [ ]:
from sklearn.preprocessing import LabelEncoder
LE = LabelEncoder().fit(train_df.language)
In [ ]:
train_df["target"] = LE.transform(train_df.language)
In [ ]:
train_df.head()
Out[ ]:
      id  code                                           language     target
0  14026  var result = testObj1 | testObj2;\...          c-sharp      3
1  12201  /// Initializes a new instance of ...          c-sharp      3
2  17074  /*\n\n Explanation :- a user gives a Strin...  javascript   8
3  21102  int sum = 0;\n\n for (int i = ...              c-plus-plus  2
4  53065  if (p->data < min)\n\n {\n\n ...               c            1

Splitting the dataset

Here we will split our dataset into training, validation and test sets in a 70/15/15 ratio. Because shuffle is set to False, the rows are split in their original order.

In [ ]:
# shuffle=False keeps the original row order, so random_state has no effect here.
X_train, X_comb, Y_train, Y_comb = train_test_split(train_df["code"], train_df["target"], test_size=0.3, random_state=0, shuffle=False)
X_validation, X_test, Y_validation, Y_test = train_test_split(X_comb, Y_comb, test_size=0.5, random_state=0, shuffle=False)
In [ ]:
X_train.shape,X_validation.shape,X_test.shape,Y_train.shape,Y_validation.shape,Y_test.shape
Out[ ]:
((31939,), (6844,), (6845,), (31939,), (6844,), (6845,))
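The two-stage split can be checked on dummy data. The sketch below uses 100 placeholder samples (not the real dataset) to show how a 30% hold-out followed by a 50/50 cut yields a 70/15/15 split, and that with shuffle=False the split is purely positional.

```python
from sklearn.model_selection import train_test_split

# 100 dummy samples stand in for the code snippets and their targets.
X = list(range(100))
y = [i % 5 for i in X]

# Stage 1: hold out 30% of the rows. With shuffle=False the split is positional:
# the first 70 rows become the training set and the last 30 are held out.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, shuffle=False)

# Stage 2: cut the held-out rows in half to get validation and test sets.
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, test_size=0.5, shuffle=False)

print(len(X_train), len(X_val), len(X_test))
```

If the rows of the CSV happen to be ordered in some way, a positional split may give the three sets different class balances. A shuffled split with the stratify argument is one alternative; note that train_test_split only accepts stratify when shuffle=True.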

Baseline-Model ⛹:

To build the model, we are going to create a simple pipeline. We'll use:

  • CountVectorizer: Converts a collection of text documents to a matrix of token counts. Learn more about CountVectorizer

  • TfidfTransformer: Transforms a count matrix to a normalized tf (term frequency) or tf-idf representation. Learn more about TfidfTransformer

  • MultinomialNB: Implements the naive Bayes algorithm for multinomially distributed data.
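Before wrapping these three steps in a Pipeline, it can help to run them one at a time and look at the intermediate results. The sketch below does this on a tiny made-up corpus; the snippets, labels and variable names are ours and only serve as an illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus: two Python-like and two C-like snippets.
docs = [
    "def add(a, b): return a + b",
    "import os\nprint(os.getcwd())",
    "#include <stdio.h>\nint main() { return 0; }",
    'int sum = 0; printf("%d", sum);',
]
labels = ["python", "python", "c", "c"]

# Step 1: token counts, one row per document and one column per vocabulary term.
vect = CountVectorizer()
counts = vect.fit_transform(docs)
print(counts.shape, sorted(vect.vocabulary_))

# Step 2: reweight counts by inverse document frequency. With the default
# norm="l2", every non-empty row ends up with unit Euclidean length.
tfidf_transformer = TfidfTransformer()
tfidf = tfidf_transformer.fit_transform(counts)
row_norms = np.asarray(np.sqrt(tfidf.multiply(tfidf).sum(axis=1))).ravel()
print(row_norms)

# Step 3: fit naive Bayes on the weighted features.
clf = MultinomialNB().fit(tfidf, labels)

# For new text, call transform (not fit_transform) so the fitted vocabulary
# and idf weights are reused.
new_doc = ['int main() { printf("hi"); return 0; }']
new_features = tfidf_transformer.transform(vect.transform(new_doc))
prediction = clf.predict(new_features)
print(prediction)
```

A Pipeline performs the same sequence, which is why fit and predict can be called on it with raw strings.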

In [ ]:
classifier = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])
classifier = classifier.fit(X_train, Y_train)
In [ ]:
classifier
Out[ ]:
Pipeline(steps=[('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
                ('clf', MultinomialNB())])
In [ ]:
print("F1:" ,f1_score(Y_validation,classifier.predict(X_validation),average='macro'))
print("Accuracy:" ,accuracy_score(Y_validation,classifier.predict(X_validation))*100)
F1: 0.3301637817357806
Accuracy: 65.31268264172998
In [ ]:
print("F1:" ,f1_score(Y_test,classifier.predict(X_test),average='macro'))
print("Accuracy:" ,accuracy_score(Y_test,classifier.predict(X_test))*100)
F1: 0.32193277845963536
Accuracy: 65.2154857560263
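The macro F1 (about 0.33) is much lower than the accuracy (about 65%). One possible explanation, which this notebook does not verify, is that the model does well on a few frequent languages and poorly on the rarer ones. The sketch below computes both metrics by hand on made-up labels to show how that pattern arises.

```python
# Made-up labels: class 0 dominates, classes 1 and 2 are rare.
y_true = [0] * 8 + [1] + [2]
# A model that always predicts the majority class.
y_pred = [0] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(label):
    """F1 for one class, treating it as the positive label (0.0 if undefined)."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Macro averaging gives every class the same weight, however rare it is.
classes = sorted(set(y_true))
macro_f1 = sum(f1_for_class(c) for c in classes) / len(classes)

print(f"accuracy={accuracy:.2f}, macro F1={macro_f1:.2f}")
```

On these labels, f1_score(y_true, y_pred, average='macro') should give the same number. To see which languages pull the average down on the real data, per-class scores (for example with average=None) would be the next thing to look at.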

Prediction Phase ✈

In [ ]:
test_df.shape
Out[ ]:
(9277, 2)
In [ ]:
test_df['target'] = classifier.predict(test_df["code"])
In [ ]:
test_df.head()
Out[ ]:
      id  code                                               target
0  10684  28 = 22 + 23 + 24\n\n 33 = 32 + 23 + 24\n\n 49...  11
1  17536  this.path = path;\n\n this.estimat...              11
2  26383  {\n\n ...                                          1
3  29090  /**\n\n * Class for converting from "any" bas...   11
4  10482  { cout<<"Destructing base \n"; } ...               2
In [ ]:
test_df["prediction"] = LE.inverse_transform(test_df.target)

Generating Prediction File

In [ ]:
test_df = test_df.sample(frac=1)
test_df.head()
Out[ ]:
         id  code                                               target  prediction
1886  16780  }\n\n }\n\n }\n                                    11      python
7674  95609  free(p);\n\n }\n\n }\n\n /////////...              2       c-plus-plus
2753  16603  """\n\n Arguments:\n\n ...                         11      python
 785  81619  /** Creates a random set of points distributed...  11      python
4747  19636  using (new AssertionScope())\n\n ...               3       c-sharp
In [ ]:
!rm -rf assets
!mkdir assets
test_df.to_csv(os.path.join("assets", "submission.csv"))

Submitting our Predictions

Note: Please save the notebook before submitting it (Ctrl + S).

In [ ]:
%aicrowd notebook submit -c programming-language-classification -a assets --no-verify
Using notebook: Baseline_Programming_Language_Classification.ipynb for submission...
Scrubbing API keys from the notebook...
Collecting notebook...


                                                            ╭─────────────────────────╮                                                            
                                                            │ Successfully submitted! │                                                            
                                                            ╰─────────────────────────╯                                                            
                                                                  Important links                                                                  
┌──────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│  This submission │ https://www.aicrowd.com/challenges/ai-blitz-xii/problems/programming-language-classification/submissions/169597              │
│                  │                                                                                                                              │
│  All submissions │ https://www.aicrowd.com/challenges/ai-blitz-xii/problems/programming-language-classification/submissions?my_submissions=true │
│                  │                                                                                                                              │
│      Leaderboard │ https://www.aicrowd.com/challenges/ai-blitz-xii/problems/programming-language-classification/leaderboards                    │
│                  │                                                                                                                              │
│ Discussion forum │ https://discourse.aicrowd.com/c/ai-blitz-xii                                                                                 │
│                  │                                                                                                                              │
│   Challenge page │ https://www.aicrowd.com/challenges/ai-blitz-xii/problems/programming-language-classification                                 │
└──────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘