Learning to Smell
Olfactory Explainer
This is a quick guide exploring what CNNs can do with an image representation of molecules.
I made a simple Colab notebook explainer for the AIcrowd Learning to Smell Challenge (https://t.co/XOsSRGAyhv) and would really appreciate your feedback!
September 30, 2020
Covered in this guide:
1- Downloading the data and installing RDKit
2- Visualizing molecules using RDKit
3- Using a pretrained inception_v3 model
4- Making predictions and creating the submission file
We first download the data:
!gdown --id 1t5be8KLHOz3YuSmiiPQjopb4c_q2U4tG
!unzip olfactorydata.zip
!ls
Next we install RDKit:
Credits to: https://stackoverflow.com/questions/63152656/installing-rdkit-in-google-colab
!wget -q -c https://repo.continuum.io/miniconda/Miniconda3-py37_4.8.3-Linux-x86_64.sh
!chmod +x Miniconda3-py37_4.8.3-Linux-x86_64.sh
!time bash ./Miniconda3-py37_4.8.3-Linux-x86_64.sh -b -q -f -p /usr/local
!time conda install -q -y -c conda-forge rdkit
import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')
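If the install worked, RDKit should now be importable; a quick optional check is to print its version:
from rdkit import rdBase
print(rdBase.rdkitVersion)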
import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
sample_submission = pd.read_csv("sample_submission.csv")
vocabulary = open("vocabulary.txt").read()
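As a quick optional look at what we just loaded (the files have SMILES and SENTENCE columns), we can print the shapes and the first few rows:
print(train.shape, test.shape, sample_submission.shape)
train.head()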
Convert the SMILES strings to RDKit Mol objects:
mols = [Chem.MolFromSmiles(smile) for smile in train.SMILES]
mols_test = [Chem.MolFromSmiles(smile) for smile in test.SMILES]
Let's have a look at the data:
img = Draw.MolsToGridImage(mols[:8], molsPerRow=4, subImgSize=(256, 256), legends=train.SENTENCE.tolist()[:8])
img
Saving the molecules as images:
!mkdir imgs
!mkdir imgs_test
for i in range(len(mols)):
    Draw.MolToFile(mols[i], "imgs/{}.png".format(i), size=(256, 256), imageType="png")
for i in range(len(mols_test)):
    Draw.MolToFile(mols_test[i], "imgs_test/{}.png".format(i), size=(256, 256), imageType="png")
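A quick optional check that all the images were written:
import os
print(len(os.listdir("imgs")), len(os.listdir("imgs_test")))  # should match len(mols) and len(mols_test)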
Representing each label sentence as a binary (multi-hot) vector over the vocabulary:
vocabulary = [v for v in vocabulary.split('\n') if v]  # split the file into labels, dropping any empty trailing entry
def oneHotSentence(sentence):
    l = np.zeros(len(vocabulary))
    for label in sentence.split(','):
        l[vocabulary.index(label)] = 1
    return l

y = np.zeros((train.shape[0], len(vocabulary)), dtype=np.float32)
for i in range(train.shape[0]):
    y[i] = oneHotSentence(train.SENTENCE.iloc[i])
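To sanity check the encoding, we can decode the first row back into labels (in vocabulary order) and compare it with the original SENTENCE:
decoded = [vocabulary[j] for j in np.where(y[0] == 1)[0]]
print(decoded)
print(train.SENTENCE.iloc[0])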
Dataset and training pipeline:
Credits to: https://towardsdatascience.com/multi-label-image-classification-in-tensorflow-2-0-7d4cf8a4bc72
import tensorflow as tf

tf.random.set_seed(0)

IMG_SIZE = 256
CHANNELS = 3
BATCH_SIZE = 64
AUTOTUNE = tf.data.experimental.AUTOTUNE  # Adapt preprocessing and prefetching dynamically to reduce GPU and CPU idle time
SHUFFLE_BUFFER_SIZE = 1024  # Shuffle the training data in chunks of 1024 observations

def parse_function(filename, label):
    # Read an image from a file
    image_string = tf.io.read_file(filename)
    # Decode it into a dense tensor
    image_decoded = tf.image.decode_png(image_string, channels=CHANNELS)
    # Normalize it from [0, 255] to [0.0, 1.0]
    image_normalized = tf.cast(image_decoded, tf.float32) / 255.0
    return image_normalized, label
def create_dataset(filenames, labels, is_training=True):
    # Create a first dataset of file paths and labels
    dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
    # Parse and preprocess observations in parallel
    dataset = dataset.map(parse_function, num_parallel_calls=AUTOTUNE)
    if is_training:
        # This is a small dataset, only load it once, and keep it in memory.
        dataset = dataset.cache()
        # Shuffle the data within a buffer of SHUFFLE_BUFFER_SIZE observations
        dataset = dataset.shuffle(buffer_size=SHUFFLE_BUFFER_SIZE)
    # Batch the data for multiple steps
    dataset = dataset.batch(BATCH_SIZE)
    # Fetch batches in the background while the model is training.
    dataset = dataset.prefetch(buffer_size=AUTOTUNE)
    return dataset
filenames = np.array(["imgs/{}.png".format(i) for i in range(train.shape[0])])
Split into training and validation sets:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(filenames, y, test_size=0.1, random_state=0)
train_ds = create_dataset(X_train, y_train)
val_ds = create_dataset(X_val, y_val, is_training=False)  # keep validation data unshuffled so predictions stay aligned with X_val
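We can pull one batch to confirm the shapes coming out of the pipeline (an optional check):
for images, labels in train_ds.take(1):
    print(images.shape, labels.shape)  # expect (64, 256, 256, 3) and (64, len(vocabulary))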
Use TensorFlow Hub to get a pretrained inception_v3 model:
import tensorflow_hub as hub
feature_extractor_url = "https://tfhub.dev/google/imagenet/inception_v3/feature_vector/4"
feature_extractor_layer = hub.KerasLayer(feature_extractor_url,
                                         input_shape=(IMG_SIZE, IMG_SIZE, CHANNELS))
feature_extractor_layer.trainable = False
from tensorflow.keras import layers
from tensorflow.keras.losses import BinaryCrossentropy
model = tf.keras.Sequential([
    feature_extractor_layer,
    layers.Dense(1024, activation='relu', name='hidden_layer'),
    layers.Dense(len(vocabulary), activation='sigmoid', name='output')
])
model.summary()
LR = 1e-5
EPOCHS = 8
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=LR),
    loss=BinaryCrossentropy(),
    metrics=["accuracy"])
Start training:
history = model.fit(train_ds,
                    epochs=EPOCHS,
                    validation_data=val_ds)
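It can also help to plot the loss curves from history to see how training and validation diverge (a minimal sketch using matplotlib):
import matplotlib.pyplot as plt

plt.plot(history.history["loss"], label="train loss")
plt.plot(history.history["val_loss"], label="val loss")
plt.xlabel("epoch")
plt.ylabel("binary cross-entropy")
plt.legend()
plt.show()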
val_preds = model.predict(val_ds)
Let's sanity check some outputs:
ind = 64
for i in np.argpartition(val_preds[ind], -4)[-4:]:  # indices of the top four predictions
    print(vocabulary[i])
filename = X_val[ind]
train.SENTENCE.iloc[int(filename[5:-4])]  # ground-truth sentence for this validation image
Looks like the model is overfitting; let's submit predictions anyway.
Now load the test set and make predictions:
filenames_test = ["imgs_test/{}.png".format(i) for i in range(test.shape[0])]
imgs_test = np.zeros((len(filenames_test), IMG_SIZE, IMG_SIZE, CHANNELS))
for i, filename in enumerate(filenames_test):
    imgs_test[i] = parse_function(filename, 0)[0]
preds_test = model.predict(imgs_test)
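Loading every test image into one array is fine at this scale; as a sketch of a more memory-friendly alternative (reusing parse_function with a dummy label; variable names here are just illustrative), the test images could also be streamed through a tf.data pipeline:
test_ds = tf.data.Dataset.from_tensor_slices(
    (filenames_test, np.zeros(len(filenames_test), dtype=np.float32)))
test_ds = test_ds.map(parse_function, num_parallel_calls=AUTOTUNE).batch(BATCH_SIZE)
preds_test_streamed = model.predict(test_ds)  # should match preds_test above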
In this part we take the top 15 predicted labels and format them as five 3-label sentences (separated by ';') for submission:
predictions = []
for i in range(preds_test.shape[0]):
    top_15 = np.argpartition(preds_test[i], -15)[-15:]  # indices of the top 15 predictions (unordered)
    top_15_voc = [vocabulary[j] for j in top_15]  # map indices to vocabulary words
    pred_string = ""
    for j in range(0, 15, 3):
        pred_string += ','.join(top_15_voc[j:j+3]) + ';'
    predictions.append(pred_string[:-1])  # drop the trailing ';'
sample_submission.PREDICTIONS = predictions
sample_submission.to_csv("CNN_submission.csv",index=None)
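As a last optional check, reload the submission file and eyeball the first rows:
pd.read_csv("CNN_submission.csv").head()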
This was a quick guide to exploring CNNs in the context of this competition and getting a basic pipeline up and running.
If you have suggestions or improvements, I'm all ears.