I saw an interesting discussion post by @gaurav_singhal here.
I managed to read some great resources about active learning:
DeepAL: Deep Active Learning in Python, Kuan-Hao Huang, 2021 -> my implementations are heavily based on this.
And I ended up trying a lot of the methods.
So here are some implementation notebooks using this challenge's data:
It basically takes the maximum predicted probability across the labels for each sample, then queries the samples where that maximum is lowest (least-confidence sampling).
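A minimal NumPy sketch of that idea (the `probs` array is a made-up stand-in for the model's softmax outputs on the unlabelled pool, not the notebook's actual data):

```python
import numpy as np

# Dummy softmax outputs: 10 unlabelled samples, 4 classes (assumption for the sketch).
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(4), size=10)

max_conf = probs.max(axis=1)                 # model's top probability per sample
n_query = 3
query_idx = np.argsort(max_conf)[:n_query]   # lowest top-probability = least confident
```

`query_idx` would then be the samples whose labels you buy next.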
It takes the difference between each prediction's entropy and the mean entropy, then queries the lowest ones.
Sort the probabilities from highest to lowest, then take the difference between the top label probabilities (a margin-style criterion). The formula is quite weird; I'm not confident about using this one.
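One common version of this is margin sampling: the gap between the top-1 and top-2 class probabilities, querying the smallest gaps. A hedged sketch (this may not match the notebook's exact formula, and `probs` is dummy data):

```python
import numpy as np

rng = np.random.default_rng(2)
probs = rng.dirichlet(np.ones(4), size=10)   # dummy softmax outputs (assumption)

sorted_probs = np.sort(probs, axis=1)[:, ::-1]    # each row sorted descending
margin = sorted_probs[:, 0] - sorted_probs[:, 1]  # top-1 minus top-2 probability
n_query = 3
query_idx = np.argsort(margin)[:n_query]          # smallest margin = most ambiguous
```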
It's the slowest one! Basically: collect the embeddings, cluster them, compute each unlabelled sample's distance to the clusters, and query the ones farthest from any cluster.
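A dependency-free sketch of those steps, with random vectors standing in for real embeddings and a tiny Lloyd's k-means standing in for whatever clustering the notebook actually uses:

```python
import numpy as np

rng = np.random.default_rng(3)
emb = rng.normal(size=(50, 8))   # stand-in embeddings for the unlabelled pool

def kmeans_lite(x, k, iters=20, seed=0):
    """Tiny Lloyd's k-means so the sketch has no sklearn dependency."""
    r = np.random.default_rng(seed)
    centers = x[r.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        labels = np.linalg.norm(x[:, None] - centers[None], axis=2).argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = x[labels == j].mean(axis=0)
    return centers

centers = kmeans_lite(emb, k=5)
# distance from each sample to its NEAREST cluster centre
nearest = np.linalg.norm(emb[:, None] - centers[None], axis=2).min(axis=1)
n_query = 3
query_idx = np.argsort(nearest)[-n_query:]   # farthest from any cluster
```

Samples that are far from every centroid are poorly covered by the current clusters, which is why they're worth labelling.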
And this continues my earlier experiment here: https://www.aicrowd.com/showcase/lb-0-880-my-experiment-results-baseline-too-i-guess
Here are the results from each method!
(results table: % Score Increase* for each method)
*I'll rerun this multiple times to get a standard-deviation (±) interval on the results.
In the paper implementations, these methods usually run multiple 'rounds' of buying labels before the end result is good, which I think is difficult to achieve within this competition's limited runtime (well, that's the challenge). So balance training epochs against rounds however you like, while still staying under the 3-hour time limit. The notebook's default settings are obviously not the best!
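The rounds-vs-epochs trade-off under a time budget can be sketched like this. `train_one_epoch` and `buy_labels` are hypothetical callables standing in for your own training step and the competition's label-purchase step; they are not real competition APIs:

```python
import time

TIME_BUDGET_S = 3 * 60 * 60      # the competition's 3-hour limit
SAFETY_FRACTION = 0.9            # stop well before actually hitting the limit

def run_rounds(train_one_epoch, buy_labels, n_rounds=5, epochs_per_round=2):
    """Round-based active learning skeleton: train a bit, buy labels, repeat.

    More rounds means fresher queries; more epochs per round means a
    better-trained model at each query. Both eat the same time budget.
    """
    start = time.time()
    rounds_done = 0
    for rnd in range(n_rounds):
        for _ in range(epochs_per_round):
            train_one_epoch()
        rounds_done += 1
        if time.time() - start > SAFETY_FRACTION * TIME_BUDGET_S:
            break                 # out of time: skip the remaining rounds
        if rnd < n_rounds - 1:
            buy_labels()          # the query strategy picks what to label next
    return rounds_done
```

With the defaults above you'd spend 10 epochs of training across 5 rounds; shifting the split (e.g. 2 rounds of 5 epochs) is the knob to tune against the 3-hour limit.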
I'm planning to add more methods soon.
Feel free to comment or correct me if you have an improvement, a correction, or anything else for these implementations!
Hope this will help you guys!