Learning to Smell
Using the SMILESVec package to train a fully-connected neural network
Explains how to use vectors created with the SMILESVec package to train a fully-connected neural network with TensorFlow Keras
Hi everyone,
I wrote a Google Colab tutorial/explainer on how to use vectors created with the SMILESVec package to train a fully-connected neural network using TensorFlow Keras on the Learning to Smell dataset:
https://colab.research.google.com/drive/1cePlnWwWOsYxwqs8NWebVHFwRr624tNc?usp=sharing
Let me know if you have any suggestions or questions, always happy to help out!
Cheers,
Cas
Introduction¶
The Learning to Smell challenge is all about using machine learning to learn attributes of chemical compounds in order to predict what they might smell like. The dataset provided consists of SMILES strings and smell labels.
In this notebook I show you how to use SMILES vectors created with the SMILESVec package to train a fully-connected neural network using the Keras library. SMILESVec package: https://github.com/hkmztrk/SMILESVecProteinRepresentation
Loading data files and setting up the environment¶
!gdown --id 1N6MRQR-N-rRag84uH11GxQDGPm56ZTzj
!unzip smell_colab.zip
!ls
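As an optional sanity check, you can peek at the first few rows of the label file once it's unzipped, to see the format we'll be parsing below:
# Optional peek at the label file: each row after the header holds
# the comma-separated smell labels for one molecule
with open('/content/y_train.csv') as f:
    for _ in range(3):
        print(f.readline().strip())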
import csv
import pickle

import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Sequential
TRAIN_PICKLE_DIR = "/content/smiles_train.vec"  ## pickled SMILESVec embeddings (train)
TEST_PICKLE_DIR = "/content/smiles.vec"  ## pickled SMILESVec embeddings (test)
VOCABULARY_DIR = '/content/vocabulary.txt'  ## one smell label per line
Y_TRAIN_DIR = '/content/y_train.csv'  ## comma-separated smell labels per molecule
TRAIN_SMILES = '/content/smiles_train.txt'  ## raw SMILES strings (train)
TEST_SMILES = '/content/smiles_test.txt'  ## raw SMILES strings (test)
Create functions and load the data¶
def load_data(pickle_dir):
    ## Load the pickled list of SMILESVec embeddings and drop the
    ## final entry (the .vec files contain one trailing element that
    ## does not correspond to a molecule)
    with open(pickle_dir, "rb") as f:
        SMILESVec = pickle.load(f)
    SMILES_array = np.asarray(SMILESVec)
    return SMILES_array[:-1]
def dictionarize_vocabulary():
    ## Map every smell label in vocabulary.txt to its line index
    with open(VOCABULARY_DIR, newline='') as f:
        vocabulary = list(csv.reader(f))
    vocabulary_dict = {}
    for i, row in enumerate(vocabulary):
        vocabulary_dict[row[0]] = i
    return vocabulary_dict
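For illustration (labels invented): if vocabulary.txt contained the lines fruity, rose and woody, this function would return {'fruity': 0, 'rose': 1, 'woody': 2}; the reversed dictionary built further down maps the indices back to labels.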
def get_words():
    ## Read the label file and skip the header row
    with open(Y_TRAIN_DIR, newline='') as f:
        words = list(csv.reader(f))
    return words[1:]
def split_words(words):
    ## Split each molecule's comma-separated label string into a tuple
    ## of individual labels, e.g. 'fruity,rose' -> ('fruity', 'rose')
    y_train = []
    for molecule in words:
        y_train.append(tuple(molecule[0].split(',')))
    return y_train
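To make the parsing concrete, here is what split_words does to a couple of invented rows, shaped the way csv.reader returns them:
## Invented example rows: csv.reader yields each line as a list with
## a single field containing the comma-separated labels
toy_rows = [['fruity,rose'], ['woody']]
print(split_words(toy_rows))  ## [('fruity', 'rose'), ('woody',)]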
def normalize_labels(labels):
    ## Scale each multi-hot label vector so it sums to 1, turning it
    ## into a probability distribution over the smell classes
    normalized_labels = []
    for label in labels:
        normalized_labels.append(label / np.sum(label))
    return np.asarray(normalized_labels)
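A quick worked example of the normalization: a molecule with three active labels ends up with weight 1/3 on each, so every target row sums to 1, which is what categorical crossentropy expects:
## A row with three active labels becomes a uniform distribution
## over those three classes
print(normalize_labels(np.array([[1, 0, 1, 1]])))
## [[0.33333333 0.         0.33333333 0.33333333]]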
Load the training and test data
X_vector = load_data(TRAIN_PICKLE_DIR)
x_test_vector = load_data(TEST_PICKLE_DIR)
Load the targets, split the data and normalize
vocabulary_dictionary = dictionarize_vocabulary()
reversed_vocabulary_dictionary = {value : key for (key, value) in vocabulary_dictionary.items()}
words = get_words()
Y = split_words(words)
one_hot = MultiLabelBinarizer()
Y = one_hot.fit_transform(Y)  ## (n_molecules, n_classes) matrix of 0s and 1s
x_train_vector, x_val_vector, y_train, y_val = train_test_split(X_vector, Y, test_size = 0.2, random_state = 1996)
y_train_normalized = normalize_labels(y_train)
y_val_normalized = normalize_labels(y_val)
Y_normalized = normalize_labels(Y)
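If you haven't used MultiLabelBinarizer before, this minimal toy example (labels invented) shows what the fit_transform above produces:
## Toy example: one column per distinct label, one 0/1 row per molecule
toy = [('fruity', 'rose'), ('woody',), ('fruity',)]
mlb = MultiLabelBinarizer()
print(mlb.fit_transform(toy))  ## [[1 1 0], [0 0 1], [1 0 0]]
print(mlb.classes_)  ## ['fruity' 'rose' 'woody']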
print(x_train_vector.shape) ## each SMILESVec embedding has 100 dimensions
print(y_train.shape) ## 109 smell classes, encoded as 0s and 1s
print(reversed_vocabulary_dictionary[0], reversed_vocabulary_dictionary[1]) ## what kind of classes are we looking at?
Building the model and creating top-5 predictions¶
Define the function to build the model
def build_model():
    model = Sequential()
    model.add(Dense(124, input_dim=100, activation='relu'))  ## hidden layer on the 100-dim input vectors
    model.add(Dropout(0.5))  ## dropout against overfitting
    model.add(Dense(109, activation='sigmoid'))  ## one output per smell class
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
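A note on this setup: because each molecule can carry several labels, the targets were normalized to sum to 1 so they form a valid probability distribution for the categorical crossentropy loss, while the sigmoid output layer still lets several classes activate at once. A common alternative for multi-label problems is to keep the raw 0/1 targets and train with binary_crossentropy instead; both are worth experimenting with.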
Train the model until the validation loss stops improving
es_callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5) ## Stop training once val loss hasn't improved for 5 epochs
checkpoint = ModelCheckpoint('weights.h5', monitor='val_loss', save_best_only=True, verbose=0) ## Only keep the weights of the best epoch
model = build_model() ## Wrapping the model in a function allows you to re-initialize the weights
model.summary() ## Let's see what the model looks like
Let's start training!
model.fit(x_train_vector, y_train_normalized, validation_data=(x_val_vector, y_val_normalized), epochs = 1000, verbose = 1, callbacks=[es_callback, checkpoint])
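One thing to keep in mind: ModelCheckpoint only writes the best weights to disk, while the model in memory holds the weights from the final epoch. If you want to predict with the best epoch's weights, reload them first (or pass restore_best_weights=True to EarlyStopping):
model.load_weights('weights.h5')  ## restore the weights of the best epoch before predicting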
Now we can create a prediction vector for every value in the test set
preds = model.predict(x_test_vector)
Keep the top 5 predictions and turn them back into words using the reversed vocabulary dictionary
top_5_preds = []
for pred in preds:
    top_values_index = sorted(range(len(pred)), key=lambda i: pred[i])[-5:]  ## sort the class indices by score and keep the top 5
    top_values_words = []
    for i in range(5):
        top_values_words.append(reversed_vocabulary_dictionary[top_values_index[i]])  ## look up the indices in the dictionary
    top_5_preds.append(top_values_words)
print(top_5_preds[0:10])
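Finally, if you want to turn these predictions into a submission file, a minimal sketch could look like the following; the column names and label separator are assumptions here, so double-check the submission format on the challenge page:
## Sketch only: pair each test SMILES with its predicted labels.
## Header names and the label separator are assumptions; check the
## challenge's submission format before submitting.
with open(TEST_SMILES) as f:
    test_smiles = [line.strip() for line in f if line.strip()]

with open('submission.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['SMILES', 'PREDICTIONS'])
    for smiles, labels in zip(test_smiles, top_5_preds):
        writer.writerow([smiles, ','.join(labels)])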
If you have any feedback or questions, feel free to reach out on the discussion board! Always happy to help.