Challenge Categories
Challenges Entered
Improve RAG with Real-World Benchmarks
Shopping Session Dataset
Interactive embodied agents for Human-AI collaboration
Participant | Rating |
---|---|
amiruddin_nagri | 0 |
day-day-up ESCI Challenge for Improving Product Search
Meta Comprehensive RAG Benchmark: KDD Cup 2-9d1937
ESCI Challenge for Improving Product Search
My solution (good-good-study, day-day-up)
Over 2 years ago
- I didn't consider these situations much; I simply split the text on whitespace and punctuation (for Japanese, I used MeCab as the word segmentation tool). I don't think this has much impact: the keywords are only used as model input, so they don't need to be very accurate.
- I am more familiar with TF-IDF, so I chose TF-IDF. I haven't studied or tested the effect of BM25 in detail.
- One advantage of TF-IDF is its speed: I can run the keyword extraction inside the PyTorch Dataset and DataLoader. More complex keyword extraction methods may give better results, but they would also be much more expensive. In addition, I tried training an embedding in advance for each query and its corresponding product set and using it as a query feature, but it did not work well.
My solution (good-good-study, day-day-up)
Over 2 years ago
Basic Solution
All my models are based on the InfoXLM large model. I concatenated the training sets of Task1 and Task2 and de-duplicated them to form a new training set. All three tasks then use the same models trained on this new training set. In the end, I used 8 models for Task1 and 4 models for Task2 and Task3.
The output of the model can be submitted to different tasks after different processing:
- Task1: order the products by $\hat P_{exact} + 0.1 \cdot \hat P_{substitute} + 0.01 \cdot \hat P_{completion}$. The class weights are the gains of the four labels.
- Task2: take the label with the highest predicted probability as the prediction.
- Task3: the prediction is whether $\hat P_{substitute}$ is greater than 0.5.
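As a rough illustration of the three processing rules above, here is a minimal Python sketch. The label order in `LABELS`, the zero weight for the fourth label, and all function names are assumptions made for the example; the post only gives the weights 1, 0.1 and 0.01 and the 0.5 threshold.

```python
# Assumed order of the model's four output probabilities (not stated in the post).
LABELS = ("exact", "substitute", "completion", "irrelevant")

# Weights from the post; the weight of the fourth label is assumed to be 0.
WEIGHTS = {"exact": 1.0, "substitute": 0.1, "completion": 0.01, "irrelevant": 0.0}


def task1_score(probs):
    """Weighted sum of class probabilities used to rank a product."""
    return sum(WEIGHTS[label] * p for label, p in zip(LABELS, probs))


def task1_rank(product_probs):
    """Return product ids ordered by descending Task1 score.

    product_probs maps a (hypothetical) product id to its class probabilities.
    """
    return sorted(product_probs, key=lambda pid: task1_score(product_probs[pid]), reverse=True)


def task2_label(probs):
    """Return the label with the highest predicted probability."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best]


def task3_is_substitute(probs, threshold=0.5):
    """Return True when the substitute probability exceeds the threshold."""
    return probs[LABELS.index("substitute")] > threshold
```

The same probability vector feeds all three functions, which is what allows one set of trained models to serve every task.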
Keywords of Query
A query is short text, which makes its meaning hard to understand. Therefore, I treat the titles of all products corresponding to a query as one document and use TF-IDF to extract keywords from it. I also extract keywords from `product_bullet_point` and `product_description` for each query. The extracted keywords can then be used as features of the query and added to the input text.
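The keyword extraction described above could look roughly like the following sketch. It is a hand-written TF-IDF in plain Python, not the author's code: the tokenizer, the smoothed IDF formula, `top_k`, and the function names are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-word characters (an assumed, simple tokenizer)."""
    return re.findall(r"\w+", text.lower())


def query_keywords(query_to_titles, top_k=3):
    """Return the top_k TF-IDF keywords per query.

    Each query's product titles are merged into one document, so a term
    scores highly when it is frequent for this query but rare for others.
    """
    docs = {
        query: Counter(tok for title in titles for tok in tokenize(title))
        for query, titles in query_to_titles.items()
    }
    n_docs = len(docs)
    # Document frequency: in how many query documents each term appears.
    doc_freq = Counter(tok for counts in docs.values() for tok in counts)

    keywords = {}
    for query, counts in docs.items():
        total = sum(counts.values())
        scores = {
            tok: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0)
            for tok, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda tok: (-scores[tok], tok))
        keywords[query] = ranked[:top_k]
    return keywords
```

The returned keywords could then be appended to the query text before tokenization for the model; how exactly they are joined is not specified in the post.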
In addition, I also add the brand and color names of all products with the same query to the model input. (Intuitively, if a word in the query represents a brand but we don't point it out explicitly, it will affect the predictions for products that are not of this brand.)
This idea gives a large gain for the Task2 and Task3 models: I got an improvement of more than 0.01 from it on Task2. With this idea, my single-model Task2 score on the public leaderboard is 0.821 (without post-processing).
But I don't get much gain on Task1. I think this is because Task1 focuses on ordering different products within the same query, so the features of the query matter less and the features of the products matter more.
Self Distillation
I obtain predicted probabilities for the whole training set through 10-fold cross-validation on the training data, take the mean of the predicted probability and the true (one-hot) label as a soft label, and then use this soft label for model training. For example, if the predicted probability of a sample is (0.4, 0.3, 0.2, 0.1) and the true label is 0, then the soft label is (0.7, 0.15, 0.1, 0.05), which I use for model training.
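The soft-label construction can be written in a few lines. This is a minimal sketch; the function name and the `alpha` parameter are assumptions, with `alpha=0.5` reproducing the plain mean described above.

```python
def soft_label(oof_probs, true_label, alpha=0.5):
    """Blend the one-hot true label with out-of-fold predicted probabilities.

    alpha is the weight of the one-hot label; 0.5 gives the simple mean.
    """
    one_hot = [1.0 if i == true_label else 0.0 for i in range(len(oof_probs))]
    return [alpha * h + (1.0 - alpha) * p for h, p in zip(one_hot, oof_probs)]


# The example from the post: prediction (0.4, 0.3, 0.2, 0.1), true label 0.
example = soft_label((0.4, 0.3, 0.2, 0.1), true_label=0)
```

Because both inputs sum to 1, the blended vector is still a valid probability distribution and can be used directly as a training target.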
This approach significantly improves the robustness of the model and reduces the impact of noisy data. With it, my single-model Task2 score on the public leaderboard is 0.824 (without post-processing).
However, this approach reduces the benefit of model ensembling. Using four models only improves the result to 0.826. If many models are used, this method does not seem to bring a significant gain.
Post Processing
In the last several days of the competition, I found that the threshold has a large impact on Task3, and further found that the Task2 score can also be significantly improved by increasing the probability of particular labels. After exploring, I think the labeled data may consist of two parts: one part is the Task1 data, and all of the data is used as Task2 and Task3 data. If so, after the leaked samples are removed from the Task2 test set, the distribution of the data set changes significantly, which is why post-processing can improve the score. After discovering this, I improved my Task2 score to 0.830 through simple post-processing rules.
Later, I used a LightGBM model to replace the manually designed post-processing rules, and added two features: the sample index (the data is not shuffled, which is a small leak) and whether the sample appears in the Task1 public test set. This improved my score to 0.832.
External data
I crawled the titles and reviews of the products in English, Spanish, Japanese and Chinese, as well as the product images, from Amazon. But, perhaps because I used them in the wrong way, I only got a gain of 0.001 from the crawled titles. I may publish these data later. You are welcome to explore how review data and image data can help improve search ranking.
Model acceleration
- PyTorch AMP (automatic mixed precision).
- I read 1024 samples from the DataLoader at a time and sort them by the number of non-padded tokens. The 1024 samples are then split into 16 batches, so that shorter texts take less prediction time.
- For a model ensemble, not every model needs to predict every sample. For example, suppose we have four models, and the mean predicted probability of the first three models is (0.7, 0.1, 0.1, 0.1). Then the fourth model does not need to predict this sample, because its prediction cannot change the final result. Even if the fourth model predicts (0.0, 1.0, 0.0, 0.0) for this sample, the mean predicted probability of the four models is still (0.525, 0.325, 0.075, 0.075), and the final prediction is still the first label. Based on this idea, we can skip many unnecessary predictions by the third and fourth models.
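The last two acceleration ideas can be sketched in plain Python as follows. Both helpers and their names are hypothetical; the skip test uses a conservative bound (each remaining model could give probability 1 to a competing label and 0 to the current leader), which is one way to implement the idea, not necessarily the author's exact check.

```python
def length_sorted_chunks(token_lengths, num_chunks):
    """Sort sample indices by token count and split them into chunks.

    Samples of similar length end up in the same chunk, so short texts
    need little padding.
    """
    order = sorted(range(len(token_lengths)), key=lambda i: token_lengths[i])
    chunk_size = max(1, -(-len(order) // num_chunks))  # ceiling division
    return [order[i:i + chunk_size] for i in range(0, len(order), chunk_size)]


def can_skip_remaining(probs_so_far, total_models):
    """Return True if the remaining models cannot change the argmax.

    probs_so_far holds one probability vector per model already run.
    """
    remaining = total_models - len(probs_so_far)
    sums = [sum(column) for column in zip(*probs_so_far)]
    top = max(range(len(sums)), key=sums.__getitem__)
    # Worst case: every remaining model adds 1.0 to a rival and 0.0 to the leader.
    return all(sums[top] > s + remaining for j, s in enumerate(sums) if j != top)


# The example from the post: three models averaging (0.7, 0.1, 0.1, 0.1), four in total.
skip = can_skip_remaining([(0.7, 0.1, 0.1, 0.1)] * 3, total_models=4)
```

With 1024 samples and 16 chunks, `length_sorted_chunks` yields batches of 64 indices each, matching the batch layout described in the list.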
Deadline Extension to 20th July && Increased Timeout of 120 mins
Over 2 years ago
@mohanty I notice that there are some new submissions after the competition deadline. Has the deadline been extended again?
Deadline Extension to 20th July && Increased Timeout of 120 mins
Over 2 years ago
The submission pipeline isn't broken; we just need to wait in the queue for a few hours. Don't extend it again.
Deadline Extension to 20th July && Increased Timeout of 120 mins
Over 2 years ago
Although I know that we cannot change your decision, I still want to tell you how it hurts many participants who have worked very hard from the beginning. We followed all the rules, made our submissions and have been waiting for the deadline. Now, in the last two days of the competition, you tell us that we need to keep fighting for another six days. It's like telling a marathon runner who has run 40 km that the finish line is at 50 km. Although the final ranking may not change much, we need to spend a lot of extra energy.
Deadline Extension to 20th July && Increased Timeout of 120 mins
Over 2 years ago
Totally agree, I felt very tortured when I saw this news. Both the extended schedule and the increased timeout are bad news for me.
[Updated] Customize Dockerfile for both phase
Over 2 years ago
I tested torch-1.12.0+cu113 and torch-1.12.0+cu116 on my machine with the 450 driver, and both of them can use the GPU normally.
In [1]: import torch
In [2]: !nvidia-smi | head -n 4
Wed Jul 6 11:06:06 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.03 Driver Version: 450.119.03 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
In [3]: torch.version.cuda
Out[3]: '11.6'
In [4]: torch.zeros((2, 2)).cuda()
Out[4]:
tensor([[0., 0.],
[0., 0.]], device='cuda:0')
[Updated] Customize Dockerfile for both phase
Over 2 years ago
Thank you very much for exploring and sharing this. I have some comments that may be helpful to others.
Indeed, I find that the 450 driver can support all CUDA 11.x versions. And if you only use PyTorch, you don't need to install CUDA yourself, since PyTorch ships with its own bundled CUDA (which is why the PyTorch whl file is so large).
Code Submission Round Launched
Over 2 years ago
pandas 1.4.x only supports Python 3.8+. Your problem is probably caused by the Python version in the Docker image being lower than 3.8.
Code Submission Round Launched
Over 2 years ago
@mohanty @shivam Does the 30-minute time constraint mean that we have 30 minutes to run our prediction code? My submission on Task2 always fails without any error message. I think this may be caused by a timeout, but the time between the failure and the log line `aicrowd_evaluations.evaluator.client:register:168 - connected to evaluation server` is always around 27 minutes, which is less than 30 minutes.
Code Submission Round Launched
Over 2 years ago
@xuange_cui I have met a similar problem before. After I disabled the debug mode, I could submit normally.
Is there a limit on the total number of submissions when merging teams?
Over 2 years ago
Hi @shivam, I still get a "no submission slots remaining for today" error. Is this because I just created a team?
EDIT: I resubmitted it, and this time it's OK.
Is there a limit on the total number of submissions when merging teams?
Over 2 years ago
Hi, shivam. I got a "Submission failed : The participant has no submission slots remaining for today." error on my third code submission today. Don't we have 5 submissions each day?
Code Submission Round Launched
Over 2 years ago
Hi, what is the exact size of the private dataset for each task?
F1 Score for ranking task 2 and 3
Over 2 years ago
For single-label multiclass classification, micro-F1, micro-precision, micro-recall and accuracy are always the same, since we predict one and only one label for each sample.
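A small check of this claim, as a sketch with made-up labels: every wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so the pooled counts give identical precision, recall, F1 and accuracy. The function name and the toy data are illustrative only.

```python
def micro_metrics(y_true, y_pred):
    """Return (micro-precision, micro-recall, micro-F1, accuracy)."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1
            elif p == label:
                fp += 1  # predicted this label, but the truth differs
            elif t == label:
                fn += 1  # true label missed by the prediction
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return precision, recall, f1, accuracy


# Toy example: 4 of 6 predictions are correct.
metrics = micro_metrics([0, 1, 2, 3, 0, 1], [0, 2, 2, 3, 1, 1])
```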
Datasets Released & Submissions Open
Almost 3 years ago
Just click the download button and copy the link from the browser's downloads page; it should be an AWS link.
About submission times
10 months ago
Does this mean that I can only submit 5 times for each track if I want to participate in all three tracks? That seems unreasonable.