Bi-LSTM-based Query Resolver

This project enables users to get answers to their medical queries. A Bi-LSTM model is trained on a set of frequently asked medical questions and their answers, and a new query is matched to the stored question with the closest semantic relationship to it.

Code

The similarity model uses the pretrained word embeddings of gsarti/biobert-nli (link) and fine-tunes them on the training data to generate meaningful sentence embeddings for biomedical semantic-similarity analysis. Another pretrained embedding model that performed well on the training set is BioBERT (implementation details here).
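
As an illustration, the per-token word embeddings can be pulled from gsarti/biobert-nli with Hugging Face transformers. This is a minimal sketch, and the example sentence is arbitrary:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gsarti/biobert-nli")
encoder = AutoModel.from_pretrained("gsarti/biobert-nli")

sentence = "What are the symptoms of anemia?"  # arbitrary example query
inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = encoder(**inputs)

# Shape (1, seq_len, 768): one 768-dimensional embedding per token.
word_embeddings = outputs.last_hidden_state
```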

The pretrained model is fine-tuned by adding a stacked BiLSTM (two BiLSTM layers) and a fully connected layer on top of the pretrained model. The BiLSTMs extract meaningful sentence embeddings from the input word embeddings: the model takes the word embeddings of a sentence as input and generates an embedding for the sentence as a whole. The hidden state of each LSTM has the same size as each word embedding (768 in this case). The BiLSTMs are able to retain important context from different parts of the sentence and therefore generate a more semantically accurate embedding. The outputs of the last forward LSTM unit and the last backward LSTM unit are concatenated and fed into a fully connected layer, which generates the final sentence embedding. A deeper stack (for example, 10 BiLSTM layers) would likely retain context better than a 2-layer stack, but such a model could not be trained with the limited computing resources available.
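
A minimal sketch of this head in PyTorch follows. The output dimension of the fully connected layer is an assumption, since it is not stated above; taking the final hidden states of both directions matches the description of concatenating the last forward and backward outputs.

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """Stacked BiLSTM (2 layers) + fully connected head, as described above."""

    def __init__(self, embed_dim: int = 768, out_dim: int = 768):
        # out_dim is an assumption; the final embedding size is not given above.
        super().__init__()
        self.bilstm = nn.LSTM(
            input_size=embed_dim,
            hidden_size=embed_dim,  # hidden state matches the word-embedding size (768)
            num_layers=2,           # stacked BiLSTM: 2 layers
            bidirectional=True,
            batch_first=True,
        )
        # Concatenation of the last forward and backward states -> 2 * embed_dim.
        self.fc = nn.Linear(2 * embed_dim, out_dim)

    def forward(self, word_embeddings: torch.Tensor) -> torch.Tensor:
        # word_embeddings: (batch, seq_len, embed_dim)
        _, (h_n, _) = self.bilstm(word_embeddings)
        # h_n: (num_layers * 2, batch, embed_dim); the last two entries are the
        # final layer's forward and backward hidden states.
        sentence = torch.cat([h_n[-2], h_n[-1]], dim=1)
        return self.fc(sentence)
```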

The model was trained as a Siamese network with contrastive loss as the criterion and Adam as the optimizer. The training set consists of 10,486 sentence pairs and the test set of 2,080 sentence pairs; both were drawn from various repositories of Asma Ben Abacha (link).
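
A sketch of a training step, assuming PyTorch: the contrastive loss below is the standard formulation (Hadsell et al., 2006), the margin and learning rate are assumed values not specified above, and `SentenceEncoder` is the class sketched earlier.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb1, emb2, label, margin=1.0):
    # label: float tensor, 1.0 for similar pairs, 0.0 for dissimilar pairs.
    # The margin value here is an assumption.
    dist = F.pairwise_distance(emb1, emb2)
    similar_term = label * dist.pow(2)
    dissimilar_term = (1 - label) * torch.clamp(margin - dist, min=0).pow(2)
    return (similar_term + dissimilar_term).mean()

model = SentenceEncoder()  # the BiLSTM head sketched above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # assumed learning rate

def train_step(words1, words2, label):
    # Siamese training step: both sentences pass through the same encoder.
    optimizer.zero_grad()
    loss = contrastive_loss(model(words1), model(words2), label)
    loss.backward()
    optimizer.step()
    return loss.item()
```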

The model performs extremely well on both the training set and the test set:

  • Accuracy on training set: 0.9674
  • Accuracy on test set: 0.9692
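
One common way to compute pair accuracy for a contrastive Siamese model is to threshold the embedding distance; the sketch below assumes this approach, and the threshold value is illustrative rather than the one actually used.

```python
import torch
import torch.nn.functional as F

def pair_accuracy(emb1, emb2, labels, threshold=0.5):
    # Classify a pair as "similar" when the embedding distance falls below a
    # threshold; the threshold value here is an assumption.
    dist = F.pairwise_distance(emb1, emb2)
    predictions = (dist < threshold).long()
    return (predictions == labels).float().mean().item()
```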

However, since the training and test sets are drawn from similar sources, the model might be overfit to this data and require further training on the question-answer database. A little overfitting to the question-answer database is in fact desirable, since accurate matching against the questions in that database is the goal.

The stored question with the closest semantic relationship to the user's new query is taken to be the one with the minimum pairwise Euclidean distance between their sentence embeddings.
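
A minimal retrieval sketch, assuming the embeddings of the stored questions have been precomputed with the trained encoder; `resolve_query`, `faq_embeddings`, and `answers` are illustrative names:

```python
import torch

def resolve_query(query_embedding, faq_embeddings, answers):
    # query_embedding: (embed_dim,); faq_embeddings: (num_questions, embed_dim).
    # torch.cdist computes pairwise Euclidean distances: (1, d) x (n, d) -> (1, n).
    distances = torch.cdist(query_embedding.unsqueeze(0), faq_embeddings)
    best_match = distances.argmin().item()  # index of the closest stored question
    return answers[best_match]
```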