NLP Feature Engineering

A detailed solution for submission 148413 submitted for challenge NLP Feature Engineering

BerAnton

Solution for NLP Feature Engineering (LB: 0.803)

This solution utilises a count vectorizer, TF-IDF weighting, and a stopword filter for feature engineering.
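As a minimal sketch of the idea on toy sentences (not the challenge data): stopwords are removed first, then n-gram counts are computed and reweighted with TF-IDF.

from gensim.parsing.preprocessing import remove_stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["the cat sat on the mat", "the dog sat on the log"]
cleaned = [remove_stopwords(d) for d in docs]  # -> "cat sat mat", "dog sat log"
counts = CountVectorizer(ngram_range=(1, 3)).fit_transform(cleaned)
tfidf = TfidfTransformer(use_idf=True).fit_transform(counts)
print(tfidf.toarray().round(2))  # one row of n-gram weights per document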

AIcrowd Runtime Configuration 🧷

Define configuration parameters. Please include any files needed for the notebook to run under ASSETS_DIR. We will copy the contents of this directory to your final submission file 🙂

The dataset is available under /data on the workspace.

In [1]:
import os

# Please use the absolute path for the location of the dataset.
# Or you can use a relative path with `os.getcwd() + "/test_data/test.csv"`
AICROWD_DATASET_PATH = os.getenv("DATASET_PATH", os.getcwd()+"/data/data.csv")
AICROWD_OUTPUTS_PATH = os.getenv("OUTPUTS_DIR", "")
AICROWD_ASSETS_DIR = os.getenv("ASSETS_DIR", "assets")

Install packages 🗃

We are going to use sklearn to do Count Vectorization and TF-IDF.

In [2]:
!pip install --upgrade scikit-learn gensim
!pip install -q -U aicrowd-cli
Collecting scikit-learn
  Downloading https://files.pythonhosted.org/packages/a8/eb/a48f25c967526b66d5f1fa7a984594f0bf0a5afafa94a8c4dbc317744620/scikit_learn-0.24.2-cp37-cp37m-manylinux2010_x86_64.whl (22.3MB)
     |████████████████████████████████| 22.3MB 1.5MB/s 
Collecting gensim
  Downloading https://files.pythonhosted.org/packages/44/52/f1417772965652d4ca6f901515debcd9d6c5430969e8c02ee7737e6de61c/gensim-4.0.1-cp37-cp37m-manylinux1_x86_64.whl (23.9MB)
     |████████████████████████████████| 23.9MB 46.4MB/s 
Requirement already satisfied, skipping upgrade: numpy>=1.13.3 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.19.5)
Requirement already satisfied, skipping upgrade: scipy>=0.19.1 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.4.1)
Collecting threadpoolctl>=2.0.0
  Downloading https://files.pythonhosted.org/packages/f7/12/ec3f2e203afa394a149911729357aa48affc59c20e2c1c8297a60f33f133/threadpoolctl-2.1.0-py3-none-any.whl
Requirement already satisfied, skipping upgrade: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.0.1)
Requirement already satisfied, skipping upgrade: smart-open>=1.8.1 in /usr/local/lib/python3.7/dist-packages (from gensim) (5.1.0)
Installing collected packages: threadpoolctl, scikit-learn, gensim
  Found existing installation: scikit-learn 0.22.2.post1
    Uninstalling scikit-learn-0.22.2.post1:
      Successfully uninstalled scikit-learn-0.22.2.post1
  Found existing installation: gensim 3.6.0
    Uninstalling gensim-3.6.0:
      Successfully uninstalled gensim-3.6.0
Successfully installed gensim-4.0.1 scikit-learn-0.24.2 threadpoolctl-2.1.0
ERROR: google-colab 1.0.0 has requirement requests~=2.23.0, but you'll have requests 2.25.1 which is incompatible.
ERROR: datascience 0.10.6 has requirement folium==0.2.1, but you'll have folium 0.8.3 which is incompatible.

Define preprocessing code 💻

The code that is shared between the training and prediction sections should be defined here. During evaluation, the training section is skipped entirely, so please make sure any common logic lives in this section.

In [3]:
from glob import glob
import os
import pandas as pd
import numpy as np
from sklearn import model_selection
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, accuracy_score
import sklearn

Training phase ⚙️

You can define your training code here. This section will be skipped during evaluation.

For this solution approach there is no training needed! 🙂

In [4]:

API Key valid
Saved API Key successfully!
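The code for this cell was scrubbed together with the API key (the submission log at the end of this notebook notes the scrubbing step). With aicrowd-cli, the login is usually a single command along these lines, with a placeholder standing in for the real key:

!aicrowd login --api-key "YOUR_API_KEY"  # placeholder; the actual key was scrubbed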
In [5]:
# Downloading the Dataset
!mkdir data
data.csv: 100% 110k/110k [00:00<00:00, 735kB/s]
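The download command itself is not visible above. With aicrowd-cli it is usually along the lines below; the challenge slug is taken from the submit command at the end of this notebook, and the exact invocation is an assumption:

!cd data && aicrowd dataset download --challenge nlp-feature-engineering  # assumed invocation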

Prediction phase 🔎

Generating the features for the test dataset.

In [28]:
test_dataset = pd.read_csv(AICROWD_DATASET_PATH)
test_dataset
Out[28]:
id text feature
0 0 Zero-divisors (ZDs) derived by Cayley-Dickson ... [0.3745401188473625, 0.9507143064099162, 0.731...
1 1 This paper is an exposition of the so-called i... [0.9327284833540133, 0.8660638895004084, 0.045...
2 2 Zero-divisors (ZDs) derived by Cayley-Dickson ... [0.9442664891134339, 0.47421421665746377, 0.86...
3 3 We calculate the equation of state of dense hy... [0.18114934953468032, 0.6811178539649828, 0.18...
4 4 The Donald-Flanigan conjecture asserts that fo... [0.5435382173426461, 0.08172534574677826, 0.45...
5 5 Let $E$ be a primarily quasilocal field, $M/E$... [0.7945155444907487, 0.7070864772666982, 0.050...
6 6 The paper deals with the study of labor market... [0.3129073942136482, 0.27109625376406576, 0.59...
7 7 Axisymmetric equilibria with incompressible fl... [0.40680480095172356, 0.3282331056783394, 0.45...
8 8 This paper analyses the possibilities of perfo... [0.013682414760681105, 0.08159872000483837, 0....
9 9 I show that an (n+2)-dimensional n-Lie algebra... [0.9562918815133613, 0.37667644042946247, 0.33...
In [34]:
from gensim.parsing.preprocessing import remove_stopwords
from sklearn.feature_extraction.text import CountVectorizer

# Strip stopwords, then count the 512 most frequent unigrams, bigrams and trigrams.
count_vect = CountVectorizer(max_features=512, ngram_range=(1, 3))
X_train_counts = count_vect.fit_transform([remove_stopwords(i) for i in test_dataset.text.tolist()])

from sklearn.feature_extraction.text import TfidfTransformer

# Reweight the raw counts with TF-IDF.
tf_transformer = TfidfTransformer(use_idf=True).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)

# Scale the weights to [0, 5] and round, so every feature is a small integer.
X_train_tf = np.round(X_train_tf.toarray() * 5).astype(int)

# Replace the placeholder feature column with the engineered features.
test_dataset.feature = [str(i) for i in X_train_tf.tolist()]
test_dataset
test_dataset
Out[34]:
id text feature
0 0 Zero-divisors (ZDs) derived by Cayley-Dickson ... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 1 This paper is an exposition of the so-called i... [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 2 Zero-divisors (ZDs) derived by Cayley-Dickson ... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
3 3 We calculate the equation of state of dense hy... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4 4 The Donald-Flanigan conjecture asserts that fo... [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
5 5 Let $E$ be a primarily quasilocal field, $M/E$... [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
6 6 The paper deals with the study of labor market... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
7 7 Axisymmetric equilibria with incompressible fl... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
8 8 This paper analyses the possibilities of perfo... [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
9 9 I show that an (n+2)-dimensional n-Lie algebra... [0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
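Note that TfidfTransformer L2-normalises each row by default, so every TF-IDF weight lies in [0, 1]; multiplying by 5 and rounding therefore maps each feature into the small integer range 0 to 5 while preserving the relative term weights.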
In [35]:
test_dataset.to_csv(os.path.join(AICROWD_OUTPUTS_PATH,'submission.csv'), index=False)

Submit to AIcrowd

In [36]:
!DATASET_PATH=$AICROWD_DATASET_PATH \
aicrowd -v notebook submit \
    --assets-dir $AICROWD_ASSETS_DIR \
    --challenge nlp-feature-engineering
WARNING: Assets directory is empty
Using notebook: /content/drive/MyDrive/Colab Notebooks/Task_4.ipynb for submission...
Removing existing files from submission directory...
Scrubbing API keys from the notebook...
Collecting notebook...
Validating the submission...
Executing install.ipynb...
[NbConvertApp] Converting notebook /content/submission/install.ipynb to notebook
[NbConvertApp] Executing notebook with kernel: python3

Aborted!
KeyboardInterrupt
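The run above was interrupted manually during validation, hence the KeyboardInterrupt. The "Assets directory is empty" warning is expected for this solution: no model is trained, so there are no assets to ship.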
