AI Blitz #9
Different Methods in NLP Feature Engineering
A notebook featuring different ways of extracting features from text
A walkthrough of different feature engineering methods, from conventional approaches to SOTA methods such as Transformers
Motivation behind Feature Engineering in Natural Language Processing¶
Let us start with why we are interested in employing feature engineering.
Say you are given an unstructured dataset (images, text, audio, videos, etc.) and you now have to apply a machine learning algorithm. But how do you convert the dataset into numbers?
For NLP data such as text, we are interested in finding a vector representation of the words, which in turn helps the ML algorithm learn better.
So what should we do for this competition?¶
In this competition, we should use the given dataset data.csv to generate features from the text, i.e. we should convert the text into its vector representation.
The generated features will be used to train a classical machine learning model in the testing phase, and the results are evaluated based on that.
In this notebook, we will look at some interesting ways to create features for text using NLP techniques.
Note: Some of the methods here cannot be directly used for submission, since they take a long time to create the vectors.
Let's load the Data¶
Download the dataset using Aicrowd CLI
We will also be using the previous dataset, for experimentation and learning purposes
Install packages 🗃¶
!pip install aicrowd-cli -q
!pip install gensim zeugma pandas numpy -q
API_KEY = "" # Please enter your API Key from [https://www.aicrowd.com/participants/me]
!aicrowd login --api-key $API_KEY
import os
# Please use the absolute path for the location of the dataset.
# Or you can use relative path with `os.getcwd() + "test_data/test.csv"`
AICROWD_DATASET_PATH = os.getenv("DATASET_PATH", os.getcwd()+"/data/data.csv")
AICROWD_OUTPUTS_PATH = os.getenv("OUTPUTS_DIR", "")
AICROWD_ASSETS_DIR = os.getenv("ASSETS_DIR", "assets")
# Downloading the Dataset
!mkdir data
!aicrowd dataset download --challenge nlp-feature-engineering -j 3 -o data
# Downloading the research paper classification dataset for training purposes
!mkdir research-paper-data
!aicrowd dataset download --challenge research-paper-classification -j 3 -o research-paper-data
Peek into the dataset¶
Remember, this competition is quite different from the other ones: the shared dataset contains only 10 samples, and we can use any other dataset to train a model for converting these 10 samples into features.
Here, I will be using the dataset from the previous research paper classification task.
import pandas as pd
train_data_path = "/content/research-paper-data/train.csv"
val_data_path = "/content/research-paper-data/val.csv"
test_data_path = "/content/research-paper-data/test.csv"
train_data = pd.read_csv(train_data_path)
#make a copy of the original dataset
train = train_data.copy()
train.head()
train.shape
Preprocessing in NLP¶
Before moving on to vectorization, let's look at some methods for preprocessing the sentences that will help us in later stages
Converting text into lowercase¶
Let's convert all the text into lowercase, which will help us in the later preprocessing steps
def to_lowercase(text):
    return text.lower()
train["text"] = train["text"].apply(to_lowercase)
train.head()
Tokenization¶
Tokenization is nothing but splitting each sentence into smaller units called tokens. There are many types of tokenization; the most important ones are
- Word level tokenization (splitting by words)
- Character level tokenization (splitting by characters)
- Subword based tokenization (splitting by subwords; see the sketch after the character example below)
We will implement each of these for experimental purposes.
Word tokenization¶
from nltk import word_tokenize
import nltk
nltk.download('punkt')
#apply word tokenize
train['word_tokenize'] = train['text'].apply(word_tokenize)
From the output below, you can see that the sentence has been split into words, including punctuation
train['word_tokenize'][0]
Char tokenization¶
Character tokenization is a way of tokenizing by splitting into characters
text = "NLP for feature engeneering"
lst = [x for x in text]
print(lst)
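Subword tokenization¶
Subword tokenization splits rare words into smaller, frequently occurring pieces; it is what modern Transformer models use. As a minimal sketch (assuming the Hugging Face transformers library, which is installed later in this notebook for the BERT section, and the bert-base-uncased checkpoint):
from transformers import AutoTokenizer  # requires: pip install transformers (done later in this notebook)

# load a pretrained WordPiece tokenizer
subword_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# rare or compound words are split into pieces prefixed with "##"
print(subword_tokenizer.tokenize("NLP for feature engineering"))
# output will look something like: ['nl', '##p', 'for', 'feature', 'engineering']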
Stopwords Removal¶
Stopword removal is one of the essential preprocessing steps in NLP projects. It involves removing uninformative words such as "and", "is", "was", and "the". We remove these words because they have little impact on the topic of a sentence.
#nltk contains all the stopwords
from nltk.corpus import stopwords
nltk.download('stopwords')
stopword = stopwords.words('english')
def remove_stopwords(text):
    """Custom function to remove the stopwords."""
    return " ".join([word for word in str(text).split() if word not in stopword])
# Exclude stopwords with Python's list comprehension and pandas.DataFrame.apply.
train['text_without_stopwords'] = train['text'].apply(remove_stopwords)
train['text_without_stopwords'][0]
stopword is simply a list of frequent English words:
stopword
#sanity check
train['text_without_stopwords'][0]
You may notice that the text above no longer contains any stopwords. Now, for our competition, we can finalize our approach by applying the word tokenizer to our main text field
train
#apply word tokenization
#df['tokenized_sents'] = df.apply(lambda row: nltk.word_tokenize(row['sentences']), axis=1)
train['text'] = train['text'].apply(word_tokenize)
# using list comprehension
def list_to_str(text):
    return ' '.join([str(elem) for elem in text])
train['text'] = train['text'].apply(list_to_str)
#print(train['text'][0])
#remove all the stopwords
train['text'] = train['text'].apply(remove_stopwords)
train['text']
Representing words as vectors¶
Let's start our workflow for the competition. Here I will walk you through several methods for representing words as vectors.
As always, there are many ways to do this.
We will start with the simplest one.
One Hot Encoding¶
Let's say we have a corpus consisting of all the unique words in this dataset; this collection is called the vocab.
Example: [cat, dog, word, text, research, ...]. After applying one-hot encoding, each word is represented by a vector with a one at its own index and zeros everywhere else.
cat: [1,0,0,0,0,...] dog: [0,1,0,0,...] and the same goes for all the other words!
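As a minimal sketch of the idea (using a tiny, made-up vocabulary rather than the competition data):
# toy vocabulary
vocab = ["cat", "dog", "word", "text", "research"]

# a one-hot vector has a 1 at the word's own index and 0 everywhere else
def one_hot(word, vocab):
    return [1 if word == v else 0 for v in vocab]

print(one_hot("cat", vocab))  # [1, 0, 0, 0, 0]
print(one_hot("dog", vocab))  # [0, 1, 0, 0, 0]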
train
pd.get_dummies(train['text'])
Problems:¶
As you can see, there are some problems with one-hot encoding:
- With one-hot encoding, the vocab tends to be very large (for example, 50K for PTB, 13M for the Google 1T corpus)
- These representations do not capture any notion of similarity
- We can't find any relationships between these words; ideally we would want representations that let us find similar words
TFIDFVectorizer¶
The Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer is better than one-hot encoding in many ways.
In a large vocabulary, some words appear very frequently (e.g. 'we', 'are'), so they carry very little meaningful information about the actual contents of a document.
Tf means term frequency, while tf-idf means term frequency times inverse document frequency.
Term Frequency:
The number of times a word appears in a document divided by the total number of words in the document. Every document has its own term frequency.
Inverse Document Frequency:
The inverse document frequency helps us identify whether a term is common or rare across the documents.
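Concretely, with $N$ documents in the corpus and $\mathrm{df}(t)$ the number of documents containing term $t$, the textbook definitions are:
$$\mathrm{tf}(t,d)=\frac{\text{count of } t \text{ in } d}{\text{number of terms in } d},\qquad \mathrm{idf}(t)=\log\frac{N}{\mathrm{df}(t)},\qquad \text{tf-idf}(t,d)=\mathrm{tf}(t,d)\cdot\mathrm{idf}(t)$$
(Note that scikit-learn's TfidfVectorizer uses a smoothed variant of the idf term, so its exact values differ slightly from this formula.)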
To learn more about TF-IDF, have a look here
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
# fit_transform returns a sparse (n_documents x vocab_size) matrix,
# so keep it in its own variable rather than storing it as a DataFrame column
tfidf_features = tfidf.fit_transform(train['text'])
tfidf_features
Problems:¶
Though with TF-IDF you can conveniently compute the similarity between two documents, it has some shortcomings, such as
- TF-IDF does not capture position in text, semantics, co-occurrences in different documents, etc.
- Cannot capture semantics (e.g. as compared to topic models, word embeddings)
Method Ahead¶
The above methods are quite old and are called count based models, since they rely on (co-)occurrence counts of words.
We will now see methods which directly learn word representations (these are called (direct) prediction based models). Using these models we will construct embeddings from the text, which will help us capture the semantics and context of the text.
Continuous Bag of Words¶
Consider this task: predict the n-th word given the previous n-1 words.
Example: we propose a deep learning model (the word to be predicted)
Training data: all n-word windows in your corpus.
Training data for this task is easy to obtain (take all n-word windows from the whole of Wikipedia).
To model this task, we can use the CBOW model, which is part of Word2Vec. We will be using gensim for the implementation.
As mentioned in the starter notebook, gensim expects a list of words for each sentence as its input format, so we can use the word-tokenized column from our data.
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser
sentences = train['word_tokenize'].values
w2v_model = Word2Vec()
w2v_model.build_vocab(sentences)
# Will take around 4-8 minutes
w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30)
Since Word2Vec models utilize the context and semantics of the corpus, they can even find similar words:
w2v_model.wv.most_similar(positive=["novel"])
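For the competition we eventually need one fixed-size vector per document rather than per word. A common trick (a small sketch, not part of the submission pipeline above) is to average the Word2Vec vectors of the tokens in each text:
import numpy as np

def sentence_vector(tokens, model):
    """Average the Word2Vec vectors of the tokens that are in the vocabulary."""
    vectors = [model.wv[token] for token in tokens if token in model.wv]
    if not vectors:
        # fall back to a zero vector if none of the tokens are known
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

# one fixed-size feature vector per document
w2v_features = np.vstack([sentence_vector(tokens, w2v_model) for tokens in sentences])
w2v_features.shape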
Skipgram¶
Skip-gram is similar to CBOW, but the main difference is that CBOW uses the context to predict a target word, while skip-gram uses a word to predict the target context.
We can implement Skipgram in gensim with a very small change during the initialization, by using
Word2Vec(sg=1)
This will create a skipgram model
# sg=1 selects skip-gram; the default (sg=0) is CBOW
skipgram = Word2Vec(sg=1)
skipgram.build_vocab(sentences)
# will take some time
skipgram.train(sentences, total_examples=skipgram.corpus_count, epochs=30)
skipgram.wv.most_similar(positive=["research"])
Define preprocessing code 💻¶
from zeugma.embeddings import EmbeddingTransformer

# download the pretrained GloVe embeddings (done once)
glove = EmbeddingTransformer('glove')

def glove_embeddings(features):
    # map each text to a fixed-size GloVe-based feature vector
    glove_features = glove.transform(features)
    return glove_features
features = train["text"]
glove_features = glove_embeddings(features)
glove_features
test_dataset = pd.read_csv(AICROWD_DATASET_PATH)
test_dataset
features = test_dataset["text"]
glove_features = glove_embeddings(features)
test_dataset['feature'] = [str(i) for i in glove_features.tolist()]
test_dataset
Other pretrained embeddings¶
There are many pretrained word embeddings available besides GloVe, each built with a different method.
If you are interested, you can also explore ELMo and fastText!
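For example, pretrained fastText vectors can be loaded through gensim's downloader API (a small sketch; the model name below comes from the gensim-data catalogue and the download is around 1 GB):
import gensim.downloader as api
import numpy as np

# load pretrained fastText word vectors
fasttext_vectors = api.load("fasttext-wiki-news-subwords-300")

def fasttext_sentence_vector(text):
    """Average the fastText vectors of the known words in a text."""
    words = [w for w in text.split() if w in fasttext_vectors]
    if not words:
        return np.zeros(fasttext_vectors.vector_size)
    return np.mean([fasttext_vectors[w] for w in words], axis=0)

fasttext_sentence_vector("nlp feature engineering")[:5]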
Transformers for extracting embeddings¶
The Transformer is a deep learning architecture that was introduced in the paper Attention Is All You Need. It is considered a massive development in research, since it doesn't use any convolutions or recurrent cells for NLP.
In the years since, Transformers have undergone massive development through pretrained architectures such as BERT and RoBERTa.
The main advantage of these methods is that the architectures come already pretrained, so they provide good results on smaller datasets as well.
!pip install transformers -q
import torch
import transformers
from transformers import AutoTokenizer, AutoModel
from torch.utils.data import Dataset, DataLoader
import numpy as np
import torch.nn as nn
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
class FeatureDataset(Dataset):
    """Wraps the text column and tokenizes each sample on the fly."""

    def __init__(self, df, tokenizer):
        self.text = df['text'].to_numpy()
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.text)

    def __getitem__(self, idx):
        encoded = self.tokenizer(self.text[idx], return_tensors='pt',
                                 max_length=512, padding='max_length',
                                 truncation=True)
        return encoded
bert = AutoModel.from_pretrained(checkpoint)
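Before wrapping BERT in a custom module, here is a quick look at what the raw encoder returns (an illustration only; the notebook's actual pipeline is defined below): tokenize one sentence and mean-pool the last hidden states into a single 768-dimensional vector.
# quick illustration: one sentence -> one 768-dimensional embedding
with torch.no_grad():
    encoded = tokenizer("nlp feature engineering", return_tensors="pt")
    hidden = bert(**encoded).last_hidden_state          # shape: (1, seq_len, 768)
    sentence_embedding = hidden.mean(dim=1).squeeze(0)  # mean-pool over tokens
sentence_embedding.shape  # torch.Size([768])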
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.bert = bert
        # add a dropout layer on top of the BERT hidden states
        self.dropout = nn.Dropout(0.1)
        self.linear = nn.Linear(768, 1)

    def forward(self, **xb):
        x = self.bert(**xb)[0]   # last hidden states: (batch, seq_len, 768)
        x = self.dropout(x)
        x = self.linear(x)       # (batch, seq_len, 1)
        return x
def get_embeddings(df, model, tokenizer):
    """Extract embeddings from the given dataset using a pretrained Transformer."""
    model.eval()
    ds = FeatureDataset(df, tokenizer)
    dl = DataLoader(ds,
                    batch_size=2,
                    shuffle=False,
                    num_workers=4,
                    pin_memory=True,
                    drop_last=False)
    embeddings = list()
    with torch.no_grad():
        for i, inputs in enumerate(dl):
            # the tokenizer adds an extra dimension; flatten each tensor to (batch, seq_len)
            inputs = {key: val.reshape(val.shape[0], -1).cpu() for key, val in inputs.items()}
            outputs = model(**inputs)
            outputs = outputs.detach().cpu().numpy()
            embeddings.extend(outputs)
    return np.array(embeddings)
model = Model()
embeddings = get_embeddings(test_dataset,model,tokenizer)
test_dataset['feature'] = [i for i in embeddings]
test_dataset