Graph Convolutions Using Protein Structures

This post shows how to implement a simple graph convolutional deep learning method for predicting interfaces between protein residues: given a pair of interacting proteins, can we classify a pair of amino acid residues (one from each protein) as interacting or not? It is based on the NIPS 2017 paper "Protein Interface Prediction using Graph Convolutional Networks".

The code and data here are a simplified, easy-to-understand version based on the published repository. My goal is to quickly introduce readers to a deep learning architecture that uses graph convolutions without getting bogged down in the internals of protein structure. The full implementation of everything discussed below can be found at full code.

I start with a description of the training data and then show how to build a graph convolutional model step by step to solve the protein interface prediction problem. First, load the training data 'train.cpkl':

import pickle
import os
#assumes train.cpkl is placed in a local directory 'data'.
train_data_file = os.path.join('./data/','train.cpkl')   
train_list, train_data = pickle.load(open(train_data_file,'rb'),encoding='latin1')
print('No of molecules in Training = ', len(train_data))
molecule_pair = train_data[0]
print(molecule_pair.keys())

Output:

No of molecules in Training =  175
dict_keys(['l_vertex', 'complex_code', 'r_hood_indices', 'l_hood_indices', 
'r_vertex', 'label', 'r_edge', 'l_edge'])

Here the training data has 175 pairs of molecules. Each molecule pair is a dictionary with 8 keys: entries for the ligand molecule are prefixed with "l_" and entries for the receptor molecule with "r_".

print("l_vertex = [%d,%d]"%(molecule_pair['l_vertex'].shape))
print("l_edge = [%d,%d,%d]"%(molecule_pair['l_edge'].shape))
print("l_hood_indices = [%d,%d,%d]"%(molecule_pair['l_hood_indices'].shape))

print('--------------------------------------------------')

print("r_vertex = [%d,%d]"%(molecule_pair['r_vertex'].shape))
print("r_edge = [%d,%d,%d]"%(molecule_pair['r_edge'].shape))
print("r_hood_indices = [%d,%d,%d]"%(molecule_pair['r_hood_indices'].shape))

print('--------------------------------------------------')

print("label = [%d,%d]"%(molecule_pair['label'].shape))
print(molecule_pair['label'][0:5,:])

Output:

    l_vertex = [185,70]
    l_edge = [185,20,2]
    l_hood_indices = [185,20,1]
    --------------------------------------------------
    r_vertex = [362,70]
    r_edge = [362,20,2]
    r_hood_indices = [362,20,1]
    --------------------------------------------------
    label = [1683,3]
    [[111 358   1]
     [ 11 287   1]
     [115 301   1]
     [ 10 286   1]
     [ 34 352   1]]

Thus, for this molecule pair, the ligand molecule has 185 residues, each with 20 neighbors (the neighborhood size is fixed in this dataset), and the receptor molecule has 362 residues, again each with 20 neighbors. Each row of label holds a ligand residue index, a receptor residue index, and a +1/-1 flag indicating whether that residue pair is interacting (+1) or not (-1). Note that not all residue pairs are present in label: the authors downsample the negative examples to an overall ratio of 10:1 negatives to positives. A quick check of this ratio across the whole training set is sketched below.
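This snippet is not part of the original code; it simply counts the +1 and -1 labels over all molecule pairs to confirm the roughly 10:1 ratio:

import numpy as np

# count interacting (+1) and non-interacting (-1) residue pairs over all molecule pairs
n_pos = sum(int(np.sum(m['label'][:, 2] == 1)) for m in train_data)
n_neg = sum(int(np.sum(m['label'][:, 2] == -1)) for m in train_data)
print('positives =', n_pos, ', negatives =', n_neg, ', neg/pos ratio =', n_neg / n_pos)

With this description, let's move on to defining the placeholder tensors for building the graph convolutional network.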

import tensorflow as tf

in_nv_dims = train_data[0]["l_vertex"].shape[-1]
in_ne_dims = train_data[0]["l_edge"].shape[-1]
in_nhood_size = train_data[0]["l_hood_indices"].shape[1]

in_vertex1 = tf.placeholder(tf.float32,[None,in_nv_dims],"vertex1")
in_vertex2 = tf.placeholder(tf.float32,[None,in_nv_dims],"vertex2")
in_edge1 = tf.placeholder(tf.float32,[None,in_nhood_size,in_ne_dims],"edge1")
in_edge2 = tf.placeholder(tf.float32,[None,in_nhood_size,in_ne_dims],"edge2")
in_hood_indices1 = tf.placeholder(tf.int32,[None,in_nhood_size,1],"hood_indices1")
in_hood_indices2 = tf.placeholder(tf.int32,[None,in_nhood_size,1],"hood_indices2")

input1 = in_vertex1, in_edge1, in_hood_indices1
input2 = in_vertex2, in_edge2, in_hood_indices2

examples = tf.placeholder(tf.int32,[None,2],"examples")
labels = tf.placeholder(tf.float32,[None],"labels")
dropout_keep_prob = tf.placeholder(tf.float32,shape=[],name="dropout_keep_prob")

The code implements the following network architecture :

ligand-residue --> ligand_graph_conv_layer1 --> ligand_graph_conv_layer2   
                                                                         |
                                                                          --> merged --> dense1 --> dense2 --> prediction 
                                                                         | 
receptor-residue --> receptor_graph_conv_layer1 --> receptor_graph_conv_layer2

So input1 represents the input to the network carrying the information for the ligand residues of the pairs, and analogously input2 carries the information for the receptor residues. We will use the following function 'node_average_model' to perform the graph convolution in each of the graph conv layers (ligand_graph_conv_layer1/2 and receptor_graph_conv_layer1/2):

import numpy as np

def initializer(init, shape):  #helper function to initialize a tensor
    if init == "zero":
        return tf.zeros(shape)
    elif init == "he":
        fan_in = np.prod(shape[0:-1])
        std = 1/np.sqrt(fan_in)
        return tf.random_uniform(shape, minval=-std, maxval=std)

def nonlinearity(nl): #helper function to determine the type of non-linearity
    if nl == "relu":
        return tf.nn.relu
    elif nl == "tanh":
        return tf.nn.tanh
    elif nl == "linear" or nl == "none":
        return lambda x: x
    
# the function that defines the ops for graph convolution.
def node_average_model(input, params, filters=None, dropout_keep_prob=1.0, trainable=True):
    vertices, edges, nh_indices = input
    nh_indices = tf.squeeze(nh_indices, axis=2)
    v_shape = vertices.get_shape()
    # number of real neighbors per residue; -1 entries in nh_indices are padding
    nh_sizes = tf.expand_dims(tf.count_nonzero(nh_indices + 1, axis=1, dtype=tf.float32), -1)
    if params is None:
        # create new weights
        Wc = tf.Variable(initializer("he", (v_shape[1].value, filters)), name="Wc", trainable=trainable)  # (v_dims, filters)
        Wn = tf.Variable(initializer("he", (v_shape[1].value, filters)), name="Wn", trainable=trainable)  # (v_dims, filters)
        b = tf.Variable(initializer("zero", (filters,)), name="b", trainable=trainable)
    else:
        Wn, Wc = params["Wn"], params["Wc"]
        filters = Wc.get_shape()[-1].value
        b = params["b"]
    params = {"Wn": Wn, "Wc": Wc, "b": b}
    # generate vertex signals
    Zc = tf.matmul(vertices, Wc, name="Zc")  # (n_verts, filters)
    # create neighbor signals
    v_Wn = tf.matmul(vertices, Wn, name="v_Wn")  # (n_verts, filters)
    Zn = tf.divide(tf.reduce_sum(tf.gather(v_Wn, nh_indices), 1), tf.maximum(nh_sizes, tf.ones_like(nh_sizes)))  # (n_verts, v_filters)
    nonlin = nonlinearity("relu")
    sig = Zn + Zc + b
    h = tf.reshape(nonlin(sig), tf.constant([-1, filters]))
    h = tf.nn.dropout(h, dropout_keep_prob)
    return h, params

The tensor operations in the function 'node_average_model' above implement the node-average graph convolution from the paper: for residue i with feature vector x_i and neighborhood N_i,

    h_i = ReLU( Wc · x_i + (1/|N_i|) · Σ_{j ∈ N_i} Wn · x_j + b )

where Wc and Wn are the learned center and neighbor weight matrices and b is a bias.
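To make the operation concrete, here is a minimal NumPy sketch (toy data and toy_* names of my own, not part of the original code) of the same computation for 4 residues with 3 features and 2 neighbors each:

import numpy as np

toy_vertices = np.random.rand(4, 3)                           # (n_verts, v_dims)
toy_nh_indices = np.array([[1, 2], [0, 3], [0, 1], [1, 2]])   # neighbor ids per residue
toy_Wc = np.random.rand(3, 5)                                 # center weights (v_dims, filters)
toy_Wn = np.random.rand(3, 5)                                 # neighbor weights (v_dims, filters)
toy_b = np.zeros(5)

Zc = toy_vertices @ toy_Wc                       # signal from each residue itself
v_Wn = toy_vertices @ toy_Wn                     # neighbor signals before averaging
Zn = v_Wn[toy_nh_indices].sum(axis=1) / 2.0      # average over the 2 neighbors
h = np.maximum(Zc + Zn + toy_b, 0)               # ReLU, shape (4, 5)
print(h.shape)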

#layer 1
layer_no = 1
name = "ligand_graph_conv_layer1"
with tf.name_scope(name):
        output, params = node_average_model(input1, None, filters=256, dropout_keep_prob=0.5)
        input1 = output, in_edge1, in_hood_indices1

name = "receptor_graph_conv_layer1"
with tf.name_scope(name):
        output, _ = node_average_model(input2, params, filters=256, dropout_keep_prob=0.5)
        input2 = output, in_edge2, in_hood_indices2

#layer 2
layer_no = 2
name = "ligand_graph_conv_layer2"
with tf.name_scope(name):
        output, params = node_average_model(input1, None, filters=256, dropout_keep_prob=0.5)
        input1 = output, in_edge1, in_hood_indices1

name = "receptor_graph_conv_layer2"
with tf.name_scope(name):
        output, _ = node_average_model(input2, params, filters=256, dropout_keep_prob=0.5)
        input2 = output, in_edge2, in_hood_indices2

Note that the weights are shared between ligand_graph_conv_layer1 and receptor_graph_conv_layer1: the params returned by the ligand layer are passed into the receptor layer. Similarly, weights are shared between ligand_graph_conv_layer2 and receptor_graph_conv_layer2. Next, add the layer that merges the outputs of 'ligand_graph_conv_layer2' and 'receptor_graph_conv_layer2': for each example pair, the ligand-residue and receptor-residue feature vectors are gathered and concatenated along the feature dimension in both possible orders, (ligand,receptor) and (receptor,ligand).

# merged layers
layer_no = 3
name = "merged"
merge_input1 = input1[0] 
merge_input2 = input2[0]
with tf.name_scope(name):
    m_out1 = tf.gather(merge_input1, examples[:, 0])
    m_out2 = tf.gather(merge_input2, examples[:, 1])
    # concatenate in both possible orders : (ligand,receptor) and (receptor,ligand).
    output1 = tf.concat([m_out1, m_out2], axis=0)
    output2 = tf.concat([m_out2, m_out1], axis=0)
    merged_output = tf.concat((output1, output2), axis=1)
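As a quick shape check (not in the original post): for a minibatch of B example pairs and 256 filters per graph conv layer, m_out1 and m_out2 are [B,256], output1 and output2 stack the two orderings along the batch axis to give [2B,256], and merged_output concatenates them along the feature axis to give [2B,512].

# the batch dimension is unknown at graph-construction time, so this prints (?, 512)
print(merged_output.get_shape())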

Next add the two densely connected layers using the merged output. Define the function for the dense layer:

def dense(input, out_dims=None, dropout_keep_prob=1.0, nonlin=True, trainable=True):
    input = tf.nn.dropout(input, dropout_keep_prob)
    in_dims = input.get_shape()[-1].value
    out_dims = in_dims if out_dims is None else out_dims
    W = tf.Variable(initializer("he", [in_dims, out_dims]), name="w", trainable=trainable)
    b = tf.Variable(initializer("zero", [out_dims]), name="b", trainable=trainable)
    Z = tf.matmul(input, W) + b
    if(nonlin):
        nonlin = nonlinearity("relu")
        Z = nonlin(Z)
    Z = tf.nn.dropout(Z, dropout_keep_prob)
    return Z

And connect it with merged_output:

# dense layer 1 
layer_no = 4
name = "dense1"
with tf.name_scope(name):
        dense1_output = dense(merged_output, out_dims=512, dropout_keep_prob=0.5, nonlin=True, trainable=True)

# dense layer 2
layer_no = 5
name = "dense2"
with tf.name_scope(name):
        dense2_output = dense(dense1_output, out_dims=1, dropout_keep_prob=0.5, nonlin=False, trainable=True)

Add the final layer, which averages the predictions from the two orderings. For B example pairs, dense2_output has 2B rows: the first B rows come from the (ligand,receptor) ordering and the last B from the (receptor,ligand) ordering. tf.split cuts it back into the two halves, tf.stack puts them on a new leading axis, and tf.reduce_mean averages across that axis to give one prediction per pair.

# add layer to get mean prediction across both the orders (ligand,receptor) and (receptor,ligand)
layer_no = 6
name = "do_prediction"
with tf.name_scope(name):
     preds = tf.reduce_mean(tf.stack(tf.split(dense2_output, 2)), 0)

pn_ratio = 0.1        
learning_rate = 0.05

#add loss op
# Loss and optimizer
with tf.name_scope("loss"):
     scale_vector = (pn_ratio * (labels - 1) / -2) + ((labels + 1) / 2)
     logits = tf.concat([-preds, preds], axis=1)
     labels_stacked = tf.stack([(labels - 1) / -2, (labels + 1) / 2], axis=1)
     loss = tf.losses.softmax_cross_entropy(labels_stacked, logits, weights=scale_vector)
     with tf.name_scope("optimizer"):
        # generate an op which trains the model
        train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
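The scale_vector compensates for the 10:1 downsampling by weighting each example in the cross-entropy loss: positive examples (label +1) get weight 1.0, negative examples (label -1) get weight pn_ratio. A tiny NumPy check of this mapping (not part of the original pipeline, reusing pn_ratio from above):

import numpy as np

toy_labels = np.array([1., -1., 1., -1.])
toy_scale = (pn_ratio * (toy_labels - 1) / -2) + ((toy_labels + 1) / 2)
print(toy_scale)   # -> [1.  0.1  1.  0.1]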

Now it is time to train the model; here we will do just one epoch. In practice, you will need to run more than 100 epochs to get reasonable test results.

def build_feed_dict(minibatch):
   feed_dict = {
                    in_vertex1: minibatch["l_vertex"], in_edge1: minibatch["l_edge"],
                    in_vertex2: minibatch["r_vertex"], in_edge2: minibatch["r_edge"],
                    in_hood_indices1: minibatch["l_hood_indices"],
                    in_hood_indices2: minibatch["r_hood_indices"],
                    examples: minibatch["label"][:, :2],
                    labels: minibatch["label"][:, 2],
                    dropout_keep_prob: dropout_keep
   }
   return feed_dict

num_epochs =  1 #change this while real training.
minibatch_size = 128
dropout_keep = 0.5

sess = tf.Session()
sess.run(tf.global_variables_initializer())

print("Training Graph Conv Model ...")

for epoch in range(0, num_epochs):
     """
     Trains model for one pass through training data, one protein at a time
     Each protein is split into minibatches of paired examples.
     Features for the entire protein is passed to model, but only a minibatch of examples are passed
     """
     prot_perm = np.random.permutation(len(train_data))
     ii = 0
     nn = 0
     avg_loss = 0
     # loop through each protein
     for protein in prot_perm:
         # extract just data for this protein to create minibatches.
         prot_data = train_data[protein]
         pair_examples = prot_data["label"]
         n = len(pair_examples)
         shuffle_indices = np.random.permutation(np.arange(n)).astype(int)
         # loop through each minibatch of residue-pair examples for this protein.
         for i in range(int(n / minibatch_size)):
             # extract data for this minibatch
             index = int(i * minibatch_size)
             example_pairs = pair_examples[shuffle_indices[index: index + minibatch_size]]
             minibatch = {} 
             #copy data to minibatch.
             for feature_type in prot_data:
                 if feature_type == "label":
                    minibatch["label"] = example_pairs
                 else:
                     minibatch[feature_type] = prot_data[feature_type]
             # train the model
             feed_dict = build_feed_dict(minibatch)
             _,loss_v = sess.run([train_op,loss], feed_dict=feed_dict)
             avg_loss += loss_v
             ii += 1
             print(ii," Batch loss = ",loss_v)
         nn += n
     print("Epoch_end =",epoch,", avg_loss = ",avg_loss/ii," nn = ",nn)

One epoch of training gives:

    Training Graph Conv Model ...
    1  Batch loss =  0.112723
    2  Batch loss =  0.118632
    3  Batch loss =  0.125916
    4  Batch loss =  0.162415
    5  Batch loss =  0.117011
    6  Batch loss =  0.112846
    7  Batch loss =  0.135485
    8  Batch loss =  0.119522
    9  Batch loss =  0.112589
     .
     .
     .
    1285  Batch loss =  0.125246
    1286  Batch loss =  0.136294
    1287  Batch loss =  0.11121
    1288  Batch loss =  0.118586
    1289  Batch loss =  0.123084
    Epoch_end = 0 , avg_loss =  0.125743936039  nn =  176044

Now evaluate the trained model on the training data itself and compute the ROC AUC:

import copy
from sklearn.metrics import roc_curve, auc, average_precision_score

all_preds = []
all_labels = []
all_losses = []
for prot_data in train_data:
     temp_data = copy.deepcopy(prot_data)
     n = prot_data['label'].shape[0] #no of labels for this protein molecule.
     #split the labels into chunks of minibatch_size.
     batch_split_points = np.arange(0,n,minibatch_size)[1:]
     batches = np.array_split(prot_data['label'],batch_split_points)
     for a_batch in batches:
        temp_data['label'] = a_batch
        feed_dict = build_feed_dict(temp_data)
        res = sess.run([loss,preds,labels], feed_dict=feed_dict)
        pred_v = np.squeeze(res[1])
        if len(pred_v.shape)==0:
           pred_v = [pred_v]
           all_preds += pred_v
        else:
           pred_v = pred_v.tolist()
           all_preds += pred_v
        all_labels += res[2].tolist()
        all_losses += [res[0]]

fpr, tpr, _ = roc_curve(all_labels, all_preds)
roc_auc = auc(fpr, tpr)
print('mean loss = ',np.mean(all_losses))
print('roc_auc = ',roc_auc)

Output :

    mean loss =  0.121488
    roc_auc =  0.569184936411

This looks awful, but it is evaluation after just one epoch of training. You will get much better results with more than 100 epochs of training.
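Since average_precision_score is already imported above but unused, you could also report the area under the precision-recall curve, which is often more informative than ROC AUC for labels this imbalanced. A possible addition (not in the original post):

# convert the +1/-1 labels to 1/0 and compute average precision (area under the PR curve)
binary_labels = [1 if l == 1 else 0 for l in all_labels]
print('average_precision = ', average_precision_score(binary_labels, all_preds))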

Written on February 2, 2018