AI Blitz #9: Completed #educational Weight: 20.0

Welcome thread | Looking for teammates? | Easy-2-Follow Code Notebooks

Don't forget to participate in the Community Contribution Prize!

# Introduction

Now that you have tackled binary and multi-class classification, it's time to step up the NLP challenge and deshuffle some text using Transformers and BERT! To solve this puzzle we will be using HuggingFace, another very popular and powerful Natural Language Processing library with a large collection of pre-trained models such as Google BERT, OpenAI GPT, and many more. In this challenge, you'll perform sequence-to-sequence prediction to convert a scrambled text into the right one!

## Getting Started

The first step is getting familiar with Transformers. Not the robots from space, but a novel NLP architecture that aims to solve sequence-to-sequence tasks while handling long-range dependencies with ease. It relies entirely on self-attention to compute representations of its input and output, without using sequence-aligned RNNs or convolution. The well-known paper "Attention Is All You Need" discusses this in detail. For a quick theoretical understanding of the concept, check out this blog. Using the BERT model through HuggingFace, you can start training and predicting on the samples.
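
To make the self-attention idea a little more concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is only an illustration: the function names and the toy vectors are made up for this example, and it leaves out the learned query/key/value projections, multiple heads, and masking that a real Transformer such as BERT uses.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention for one sequence of vectors.

    Illustrative only: no learned projections, heads, or masking.
    """
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs, weights = [], []
    for query in queries:
        # Similarity of this token to every token, scaled by sqrt(d_k).
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
        weights_row = softmax(scores)
        weights.append(weights_row)
        # Each output is a weighted average of all value vectors.
        outputs.append(
            [sum(w * value[d] for w, value in zip(weights_row, values)) for d in range(dim)]
        )
    return outputs, weights


# Toy input: three token vectors of dimension 2, used as queries, keys and values.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, attention_weights = self_attention(tokens, tokens, tokens)
```

Every output vector mixes information from all positions at once, which is why the architecture can relate distant words without stepping through the sequence as an RNN would.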

An easy guide to solving this problem is right here! The starter kit breaks down everything, from downloading the dataset, loading the libraries, and processing the data to creating, training, and testing the model.

## Dataset

Each dataset file has two columns, text and label. The text column holds the scrambled sentence that needs to be unscrambled, and the label column holds the unscrambled version of that sentence. Below is a dataset sample:

| text | label |
| --- | --- |
| data science enables enthusiasts real-world experts solve to AIcrowd challenges. and through collaboratively problems | AIcrowd enables data science experts and enthusiasts to collaboratively solve real-world problems, through challenges. |

A scrambled sentence may have several valid unscrambled forms. In that case, you can submit at most 5 different output sentences for a text, like this:

| text | label |
| --- | --- |
| data science enables enthusiasts real-world experts solve to AIcrowd challenges. and through collaboratively problems | ["AIcrowd enables data science experts and enthusiasts to collaboratively solve real-world problems, through challenges", "AIcrowd enables data enthusiasts and experts to collaboratively solve real-world science problems, through challenges"] |
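
One possible way to build such a multi-candidate label is sketched below. It assumes the label cell should hold a JSON-style list string, as the sample above suggests; the helper name `format_label` and the de-duplication step are illustrative choices, not something the challenge prescribes.

```python
import json

MAX_CANDIDATES = 5  # limit stated in the challenge description


def format_label(candidates):
    """Drop blanks and duplicates (keeping order), cap at 5, and serialise as a list string.

    Assumption: the evaluator accepts a JSON-style list in the label column.
    """
    unique = list(dict.fromkeys(c.strip() for c in candidates if c.strip()))
    return json.dumps(unique[:MAX_CANDIDATES])


label = format_label([
    "AIcrowd enables data science experts and enthusiasts to collaboratively solve real-world problems, through challenges",
    "AIcrowd enables data enthusiasts and experts to collaboratively solve real-world science problems, through challenges",
])
```

Ordering the candidates from most to least confident is a sensible habit, though the description does not say whether order matters.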

## Files

The following files are available in the resources section:

• train.csv - (40000 samples) This CSV file contains a text column with the scrambled sentence and a label column with the unscrambled sentence.
• val.csv - (4000 samples) This CSV file contains a text column with the scrambled sentence and a label column with the unscrambled sentence.
• test.csv - (1000 samples) This CSV file contains a text column with the scrambled sentence and a label column with random unscrambled sentences. This file also serves as sample_submission.csv.
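
As a rough sketch, the files can be read with Python's standard `csv` module. The helper name `load_pairs` is hypothetical, and the code assumes UTF-8 files with a `text,label` header row. So that the snippet runs on its own, it writes a one-row stand-in file built from the sample shown earlier; with the real data you would call `load_pairs("train.csv")` instead.

```python
import csv
import tempfile
from pathlib import Path


def load_pairs(csv_path):
    """Return (scrambled, unscrambled) pairs from a CSV with text and label columns."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return [(row["text"], row["label"]) for row in csv.DictReader(handle)]


# Demo on a tiny stand-in file (assumption: same column names as the real files).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "train.csv"
    with open(demo_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["text", "label"])
        writer.writerow([
            "data science enables enthusiasts real-world experts solve to AIcrowd challenges. and through collaboratively problems",
            "AIcrowd enables data science experts and enthusiasts to collaboratively solve real-world problems, through challenges.",
        ])
    pairs = load_pairs(demo_path)
```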

## Submission

• Create a submission directory.
• Use test.csv and fill in the corresponding labels.
• Save the filled-in file in the submission directory under the name submission.csv.
• Inside the submission directory, put the .ipynb notebook with which you trained the model and ran inference, saved as original_notebook.ipynb.

Overall, this is what your submission directory should look like:

submission
├── submission.csv
└── original_notebook.ipynb

Zip the submission directory!
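
The packaging steps above could be scripted roughly as follows. This is a sketch under assumptions: the helper name `build_submission` is made up, the archive is assumed to keep the top-level `submission` folder shown in the tree above, and the demo uses a placeholder notebook and a single dummy row rather than real predictions.

```python
import csv
import shutil
import tempfile
import zipfile
from pathlib import Path


def build_submission(rows, notebook_path, work_dir):
    """Write submission.csv, copy the notebook, and zip the submission directory."""
    submission_dir = Path(work_dir) / "submission"
    submission_dir.mkdir(parents=True, exist_ok=True)
    with open(submission_dir / "submission.csv", "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["text", "label"])
        writer.writerows(rows)
    shutil.copy(notebook_path, submission_dir / "original_notebook.ipynb")
    # root_dir/base_dir keep "submission/" as the top-level folder in the zip
    # (an assumption about what the platform expects).
    archive = shutil.make_archive(
        str(Path(work_dir) / "submission"), "zip", root_dir=work_dir, base_dir="submission"
    )
    return Path(archive)


# Demo with placeholder content in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    notebook = Path(tmp) / "my_notebook.ipynb"
    notebook.write_text("{}", encoding="utf-8")
    archive_path = build_submission([("world hello", "hello world")], notebook, tmp)
    with zipfile.ZipFile(archive_path) as archive_file:
        archive_names = sorted(archive_file.namelist())
```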

Make your first submission here!

## Evaluation Criteria

During evaluation, the mean Word Error Rate (WER) between each ground-truth text and the corresponding submitted text is used to score the model. The WER is computed with the wer function from the jiwer Python library.
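
For a quick local check without installing anything, WER can be approximated with a word-level edit distance, as in the sketch below. This is a stand-in for `jiwer.wer(reference, hypothesis)`, not the organisers' evaluation code: it splits on whitespace only, assumes non-empty references, and averages per-sentence scores, which is one reading of "mean WER". How multiple candidate sentences are scored is not covered here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # previous[j] holds the distance between the reference prefix seen so far and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)  # assumes a non-empty reference


def mean_wer(references, hypotheses):
    """Average the per-sentence WER over all (reference, hypothesis) pairs."""
    scores = [word_error_rate(r, h) for r, h in zip(references, hypotheses)]
    return sum(scores) / len(scores)


references = ["the cat sat on the mat", "hello world"]
hypotheses = ["the cat sat on the mat", "world hello"]
score = mean_wer(references, hypotheses)
```

A lower score is better: a perfect unscrambling gives 0, and a sentence with every word in the wrong position scores around 1.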