
Learning to Smell

Right fingerprint is all you need

Using PubChem along with k-nearest neighbours and more!

lacemaker

Dear Community,

My submission is also a very basic one, yet it scores high on the current leaderboard. I hope to find some spare time to write something more interesting in the upcoming rounds, which is why I've decided to publish it.

I wrote a Medium post with a short explanation and some thoughts about what to do next.
A Google Colab notebook is also available, and the full source code is in the GitHub repository (latticetower/learning-to-smell-baseline).

In [93]:
import os
import pandas as pd
import numpy as np
from sklearn.neighbors import NearestNeighbors
In [94]:
!wget https://www.dropbox.com/s/3b2ta3qr706d1ua/aicrowd-learning-to-smell-data.zip
--2020-10-04 21:26:18--  https://www.dropbox.com/s/3b2ta3qr706d1ua/aicrowd-learning-to-smell-data.zip
Resolving www.dropbox.com (www.dropbox.com)... 162.125.3.1, 2620:100:6018:1::a27d:301
Connecting to www.dropbox.com (www.dropbox.com)|162.125.3.1|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/3b2ta3qr706d1ua/aicrowd-learning-to-smell-data.zip [following]
--2020-10-04 21:26:18--  https://www.dropbox.com/s/raw/3b2ta3qr706d1ua/aicrowd-learning-to-smell-data.zip
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc29ccbd4c826b54d9e4d4306de4.dl.dropboxusercontent.com/cd/0/inline/BApvdQ-un7z_yczvNGMd6IeXZtoBgClGSpMwWXjfyO2ZKJi3I0ihigar1U5eh9f8zN3Vv4xhPS6PdAahtbFED218gHKEsUpFgWXYIchCJNYuJj2RIMMQxRqmhvoN7uDz3u0/file# [following]
--2020-10-04 21:26:18--  https://uc29ccbd4c826b54d9e4d4306de4.dl.dropboxusercontent.com/cd/0/inline/BApvdQ-un7z_yczvNGMd6IeXZtoBgClGSpMwWXjfyO2ZKJi3I0ihigar1U5eh9f8zN3Vv4xhPS6PdAahtbFED218gHKEsUpFgWXYIchCJNYuJj2RIMMQxRqmhvoN7uDz3u0/file
Resolving uc29ccbd4c826b54d9e4d4306de4.dl.dropboxusercontent.com (uc29ccbd4c826b54d9e4d4306de4.dl.dropboxusercontent.com)... 162.125.3.15, 2620:100:6018:15::a27d:30f
Connecting to uc29ccbd4c826b54d9e4d4306de4.dl.dropboxusercontent.com (uc29ccbd4c826b54d9e4d4306de4.dl.dropboxusercontent.com)|162.125.3.15|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /cd/0/inline2/BApLZWzS3Rc-uTVYiTx2MzXcdnJQrAaEfAVW5Zowwmm5O-WXBaDX05HT3GbYqUlkz2Q-ZVyEBne5Q3f0LkHk4aoGEl9pCg1UbelpokG9xWbrqcfYIvX-f1NO_bPX1g3pqYRAVdp5V88ZyUZbiCJYcViE3AWXc5K5UHKx18m-dfq54TgJwwEnWuKo6bPf6qduztcvX9F2E0Xq8yleam_tlwPJAeQScnj1DfF4HSUhj57Q9PEAgVb0yi1TPFVUGGF8_zCyfpHNlHSwbl2EiOgMbTUoJEEt4Amryz-rtanq12Jd5ReU-b-8lvD7V_jkvVK1rxu9tE5PCVPkukkp5gjhpCdSYt04mweexIuq4o3bIgjxFQ/file [following]
--2020-10-04 21:26:19--  https://uc29ccbd4c826b54d9e4d4306de4.dl.dropboxusercontent.com/cd/0/inline2/BApLZWzS3Rc-uTVYiTx2MzXcdnJQrAaEfAVW5Zowwmm5O-WXBaDX05HT3GbYqUlkz2Q-ZVyEBne5Q3f0LkHk4aoGEl9pCg1UbelpokG9xWbrqcfYIvX-f1NO_bPX1g3pqYRAVdp5V88ZyUZbiCJYcViE3AWXc5K5UHKx18m-dfq54TgJwwEnWuKo6bPf6qduztcvX9F2E0Xq8yleam_tlwPJAeQScnj1DfF4HSUhj57Q9PEAgVb0yi1TPFVUGGF8_zCyfpHNlHSwbl2EiOgMbTUoJEEt4Amryz-rtanq12Jd5ReU-b-8lvD7V_jkvVK1rxu9tE5PCVPkukkp5gjhpCdSYt04mweexIuq4o3bIgjxFQ/file
Reusing existing connection to uc29ccbd4c826b54d9e4d4306de4.dl.dropboxusercontent.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 56694 (55K) [application/zip]
Saving to: ‘aicrowd-learning-to-smell-data.zip.6’

aicrowd-learning-to 100%[===================>]  55.37K  --.-KB/s    in 0.01s   

2020-10-04 21:26:19 (4.40 MB/s) - ‘aicrowd-learning-to-smell-data.zip.6’ saved [56694/56694]

In [95]:
!unzip -o aicrowd-learning-to-smell-data.zip
Archive:  aicrowd-learning-to-smell-data.zip
  inflating: data/test.csv           
  inflating: data/train.csv          
  inflating: data/vocabulary.txt     
In [96]:
os.listdir("./data")
Out[96]:
['data', 'test.csv', 'vocabulary.txt', 'train.csv']
In [97]:
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")
vocab = pd.read_csv("data/vocabulary.txt", header=None)

I used precomputed fingerprints from PubChem. To reproduce this step, you can run python download_data_from_pubchem.py, which is available in the GitHub repository, or simply download the file with the collected fingerprints from there:

In [98]:
!wget https://raw.githubusercontent.com/latticetower/learning-to-smell-baseline/main/pubchem_fingerprints.csv
--2020-10-04 21:26:23--  https://raw.githubusercontent.com/latticetower/learning-to-smell-baseline/main/pubchem_fingerprints.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1359150 (1.3M) [text/plain]
Saving to: ‘pubchem_fingerprints.csv.3’

pubchem_fingerprint 100%[===================>]   1.30M  --.-KB/s    in 0.1s    

2020-10-04 21:26:23 (13.0 MB/s) - ‘pubchem_fingerprints.csv.3’ saved [1359150/1359150]

In [99]:
fingerprints = pd.read_csv("pubchem_fingerprints.csv")
In [100]:
train_df = train.merge(fingerprints, on="SMILES", how="left")
test_df = test.merge(fingerprints, on="SMILES", how="left")
print(train_df.fingerprint.isnull().sum(), "train molecules have no associated fingerprint")
print(test_df.fingerprint.isnull().sum(), "test molecules have no associated fingerprint")
33 train molecules have no associated fingerprint
5 test molecules have no associated fingerprint
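
To make the left merge above easier to follow, here is a minimal sketch on made-up data. The SMILES strings, the fingerprint values, and the names `molecules` and `lookup` are assumptions for illustration only. It shows that a left merge keeps every molecule and leaves NaN where no fingerprint was found, and that pandas' `indicator=True` option can list the unmatched rows.

```python
import pandas as pd

# Toy data: SMILES strings and fingerprint values are made up for illustration.
molecules = pd.DataFrame({"SMILES": ["CCO", "CC(=O)O", "c1ccccc1"]})
lookup = pd.DataFrame({"SMILES": ["CCO", "c1ccccc1"], "fingerprint": ["a0", "0f"]})

# A left merge keeps every row of `molecules`; rows absent from `lookup` get NaN.
# indicator=True adds a `_merge` column saying where each row came from.
merged = molecules.merge(lookup, on="SMILES", how="left", indicator=True)

# Molecules that have no fingerprint in the lookup table.
missing = merged.loc[merged["_merge"] == "left_only", "SMILES"].tolist()
print(missing)
print(merged["fingerprint"].isnull().sum(), "molecules have no associated fingerprint")
```

Note that if the lookup table contained the same SMILES more than once, a left merge would duplicate the matching rows, so it may be worth checking the row count after merging.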

Only molecules with an available fingerprint can take part in the nearest-neighbour search, so I filter both the train and test data and then unpack the hex-encoded fingerprints into bit vectors before computing the k nearest neighbours.

In [101]:
def to_bits(x):
    """Unpack a hex-encoded fingerprint string into a 1-D array of 0/1 bits."""
    try:
        return np.unpackbits(np.frombuffer(bytes.fromhex(x), dtype=np.uint8))
    except Exception as e:
        # Show the offending value, then re-raise so a bad fingerprint is not hidden
        print(e)
        print(x)
        raise


# Keep only the molecules that have a fingerprint, then stack the bit vectors into a 2-D array
train_df = train_df[~train_df.fingerprint.isnull()]
train_fingerprints = train_df.fingerprint.apply(to_bits)
train_fingerprints = np.stack(train_fingerprints.values)

test_df = test_df[~test_df.fingerprint.isnull()]
test_fingerprints = test_df.fingerprint.apply(to_bits)
test_fingerprints = np.stack(test_fingerprints.values)
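
The unpacking step can be checked by hand on a tiny input. The sketch below uses a made-up two-byte hex string (not a real PubChem fingerprint) and shows the two stages separately: hex digits to bytes, then bytes to bits.

```python
import numpy as np

# Made-up 2-byte value for illustration, not a real PubChem fingerprint.
hex_fingerprint = "a00f"

# Each pair of hex digits becomes one byte: 0xa0 = 160, 0x0f = 15.
raw_bytes = np.frombuffer(bytes.fromhex(hex_fingerprint), dtype=np.uint8)

# Each byte becomes 8 bits, most significant bit first (NumPy's default bit order).
bits = np.unpackbits(raw_bytes)

print(raw_bytes.tolist())
print(bits.tolist())
```

A hex string of n characters therefore yields a vector of 4n bits, so all fingerprints must have the same length for `np.stack` to succeed.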
In [102]:
nbrs = NearestNeighbors(n_neighbors=5, algorithm='ball_tree').fit(train_fingerprints)
distances, neighbour_indices = nbrs.kneighbors(test_fingerprints)
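
As a small illustration of what `kneighbors` returns, here is a sketch on made-up 4-bit vectors. The arrays and the names `train_bits`, `query_bits`, `toy_distances` and `toy_indices` are assumptions for this example only. With the default metric the distance is Euclidean, and for 0/1 vectors the squared Euclidean distance equals the number of differing bits.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Made-up bit vectors, stored as floats for the distance computation.
train_bits = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
], dtype=float)
query_bits = np.array([[1, 1, 0, 1]], dtype=float)

knn = NearestNeighbors(n_neighbors=2, algorithm="ball_tree").fit(train_bits)

# Both results have shape (number of queries, n_neighbors).
# Indices are row positions in `train_bits`, sorted from nearest to farthest.
toy_distances, toy_indices = knn.kneighbors(query_bits)

print(toy_indices)
print(toy_distances ** 2)  # squared distances: count of differing bits
```

Here the query differs from row 0 in one bit and from row 1 in two bits, so those two rows are returned in that order.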
In [103]:
for i, neighbours in zip(test_df.index, neighbour_indices):
    test.loc[i, "PREDICTIONS"] = ";".join([train.loc[train_df.index[x], "SENTENCE"] for x in neighbours])
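
One detail in the loop above is easy to miss: the neighbour indices are row positions in the filtered fingerprint array, not labels of the original DataFrame, which is why they are mapped back through `train_df.index`. The following sketch on made-up data (all names prefixed with `toy_` are hypothetical) shows that mapping.

```python
import pandas as pd

# Made-up data: row 1 has no fingerprint and will be filtered out.
toy_train = pd.DataFrame({
    "SENTENCE": ["fruity", "woody", "mint", "oily"],
    "fingerprint": ["a0", None, "0f", "ff"],
})

# Filtering keeps the original index labels: 0, 2, 3.
toy_filtered = toy_train[~toy_train.fingerprint.isnull()]

# Suppose a nearest-neighbour search over the filtered rows returned positions 1 and 2.
neighbour_positions = [1, 2]

# Position -> original label -> SENTENCE in the unfiltered frame.
labels = [toy_filtered.index[p] for p in neighbour_positions]
prediction = ";".join(toy_train.loc[label, "SENTENCE"] for label in labels)

print(labels)
print(prediction)
```

Using the positions directly as labels would pick rows 1 and 2 of the unfiltered frame instead, which are different molecules.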
In [104]:
test.PREDICTIONS.isnull().sum()
Out[104]:
5

A few predictions are still missing. I fill them with the five most common SENTENCE values from the train dataset.

In [105]:
train.SENTENCE.value_counts()[:5]
Out[105]:
odorless    57
mint        36
fruity      32
woody       28
oily        24
Name: SENTENCE, dtype: int64
In [106]:
default_prediction = ";".join(train.SENTENCE.value_counts()[:5].index)
In [107]:
test.loc[test.PREDICTIONS.isnull(), "PREDICTIONS"] = default_prediction
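
The fallback step can be sketched on made-up data as follows. The frames `toy_train` and `toy_test` and the choice of two labels instead of five are assumptions for illustration. `value_counts` sorts by descending frequency, so the first index entries are the most common values.

```python
import pandas as pd

# Made-up data: "mint" appears 3 times, "fruity" twice, "woody" once.
toy_train = pd.DataFrame(
    {"SENTENCE": ["mint", "mint", "mint", "fruity", "fruity", "woody"]}
)
toy_test = pd.DataFrame(
    {"SMILES": ["CCO", "CCN"], "PREDICTIONS": ["woody;mint", None]}
)

# Join the two most frequent labels into one default prediction string.
fallback = ";".join(toy_train.SENTENCE.value_counts().index[:2])

# Fill only the rows that have no prediction yet.
toy_test.loc[toy_test.PREDICTIONS.isnull(), "PREDICTIONS"] = fallback

print(fallback)
print(toy_test.PREDICTIONS.tolist())
```

Rows that already have a prediction are left untouched, because the boolean mask selects only the missing ones.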
In [108]:
test.to_csv("baseline_submission.csv", index=False)
In [109]:
from google.colab import files
files.download("baseline_submission.csv")