Is Bidirectional Sentence Parsing the Future of Natural Language Processing?

Bharathi Sridhar


Traditionally, text is analysed unidirectionally: from left to right. But bidirectional parsing is quickly becoming more prevalent in Natural Language Processing (NLP) and deep learning. Deep learning methods often require large amounts of data to extract patterns, but in many NLP tasks labelled data is scarce, so a model is usually pre-trained on unlabelled data to learn universal language representations. The Transformer is a deep learning model proposed in the paper Attention Is All You Need[1] by researchers at Google and the University of Toronto in 2017. It is an architecture for transforming one sequence into another with the help of two parts (an Encoder and a Decoder), using attention to boost speed[2].


Traditionally, sentiment analysis aimed at extracting discriminative features such as term frequency and other features that indicate text sentiment. Part-of-Speech (POS) is used as an additional feature to improve the performance of sentiment analysis[3]. POS indicates how a word is used in a sentence based on grammar rules, while negation inverts the sentiment polarity (usually a score between -1 and +1 that indicates the positivity, negativity, or neutrality of a phrase or sentence). After all applicable features are found, they are used to build models such as Naïve Bayes and Support Vector Machines (SVM).

Naïve Bayes is a probabilistic classifier based on Bayes' Theorem; it makes predictions from the probability of each feature given a class, treating the features as independent of one another. Support Vector Machines are supervised learning models (requiring labelled examples) which can perform nonlinear classification and regression analysis. Regression analysis is the process of estimating the relationship between a dependent variable and one or more independent variables.
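To make the Naïve Bayes description concrete, the following is a minimal sketch of a multinomial Naïve Bayes sentiment classifier with add-one smoothing. The tiny training set, the "pos"/"neg" labels, and the function names train_naive_bayes and predict are hypothetical and exist only for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """docs is a list of (tokens, label) pairs. Returns the counts needed for prediction."""
    class_counts = Counter()            # number of documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict(tokens, class_counts, word_counts, vocab):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # words never seen in training are ignored
            # Add-one smoothing avoids a zero probability for unseen (word, class) pairs
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data
train = [
    ("great film loved it".split(), "pos"),
    ("wonderful acting great story".split(), "pos"),
    ("terrible plot hated it".split(), "neg"),
    ("boring and terrible acting".split(), "neg"),
]
model = train_naive_bayes(train)
label = predict("great story".split(), *model)
```

Each word contributes to the score independently of the others, which is the "naïve" independence assumption.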

Using term frequency and POS as features can determine text sentiment to a degree, but cannot extract information related to emotion or polarity[4].

Figure 1. Graphical representation of Support Vector Machines (SVMs)[5]

As seen in Figure 1, the hyperplane W^T x + b = 0 functions as the decision boundary that separates the data into two regions (one represented in red, the other in blue). Since SVMs are primarily used for classification, their purpose is to take the input and produce a decision boundary that separates the data according to the required classification criteria. Each side of the boundary corresponds to one class of data points.
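As a rough sketch of how such a boundary is used once it is known, the snippet below evaluates W^T x + b for a point and assigns a class from the sign. The weights, bias, and points are hypothetical, and the training step that finds the boundary is not shown.

```python
def decision_value(w, x, b):
    """Signed score W^T x + b; it is zero exactly on the decision boundary."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, x, b):
    """Points on or above the boundary are 'blue', points below are 'red'."""
    return "blue" if decision_value(w, x, b) >= 0 else "red"

# Hypothetical boundary: x1 - x2 = 0
w = [1.0, -1.0]
b = 0.0
points = [(2.0, 1.0), (1.0, 3.0)]
labels = [classify(w, p, b) for p in points]
```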

Recently, deep learning techniques have become popular in the field of NLP and many researchers have applied them to real-world problems[6]. Long Short-Term Memory (LSTM) is a Recurrent Neural Network (RNN) model that learns from sequential data and stores the previous state. Because RNNs can take inputs of widely varying lengths, they are widely applied to problems such as handwriting and speech recognition. LSTM has now been extended to Bidirectional LSTM (Bi-LSTM).

Conventional LSTMs can learn sequential data only from left to right. However, in natural language and conversation, patterns sometimes also run from right to left. Bi-LSTM can learn in both directions and combine the two pieces of information to form a prediction.
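The idea of reading a sequence in both directions can be sketched with a much simpler recurrence than an LSTM. The example below uses a plain tanh RNN with a scalar state as a stand-in (an assumption made to keep the code short). The weights and inputs are hypothetical, and the function names are illustrative only.

```python
import math

def run_rnn(inputs, w_x, w_h):
    """Simple tanh recurrence: h_t = tanh(w_x * x_t + w_h * h_{t-1})."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def run_bidirectional(inputs, fwd_weights, bwd_weights):
    """Run one pass left to right and one right to left, then pair the states per position."""
    forward = run_rnn(inputs, *fwd_weights)
    # The backward pass reads the reversed sequence; its states are flipped back
    # so that index t in both lists refers to the same input position.
    backward = run_rnn(inputs[::-1], *bwd_weights)[::-1]
    return list(zip(forward, backward))

sequence = [0.5, -1.0, 0.25]  # hypothetical inputs
states = run_bidirectional(sequence, fwd_weights=(0.8, 0.5), bwd_weights=(0.6, 0.4))
```

At every position the model now has one state summarising everything to the left and one summarising everything to the right, which is what a Bi-LSTM combines to form its prediction.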

Since conventional deep learning models cannot learn directly from text, a pre-processor is required to transform a word into a vector. This is called Word Embedding. A widely used technique for word embedding is word2vec[6]. The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text[7]. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence[7]. As the name implies, word2vec represents each distinct word with a particular list of numbers called a vector[7]. The vectors are chosen carefully such that a simple mathematical function (the cosine similarity between the vectors) indicates the level of semantic similarity between the words represented by those vectors [7].
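The cosine similarity mentioned above can be computed in a few lines. The three-dimensional vectors below are hypothetical and were not produced by a trained word2vec model; they only illustrate that related words should score higher than unrelated ones.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, for illustration only
embeddings = {
    "good": [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "table": [0.1, 0.9, -0.5],
}
sim_close = cosine_similarity(embeddings["good"], embeddings["great"])
sim_far = cosine_similarity(embeddings["good"], embeddings["table"])
```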

Figure 2: The basic structure of any deep learning model, consisting of hidden layer(s) which support processing complex relationships between data points[8]

A good word embedding model must be trained on a very large corpus that contains a large number of words. Training such a model is highly time-consuming; thus, transfer learning is used instead. Transfer learning stores knowledge gained while solving one problem and applies it to a different but related problem[9].


The framework of sentiment analysis is divided into three main parts: pre-processing with Tokenization and POS-Tagging, feature extraction, and machine learning modelling[10].

Figure 3. Overview of the sentiment analysis framework

As seen in Figure 3, which represents the sentiment analysis framework, the process begins with data collection from websites, from which the required paragraphs/sentences are extracted. This text is then fed into the preprocessing unit, where it undergoes several processing steps before it can be passed to any model.

Part I: Pre-processing

  1. Tokenization is required to segment the words in every sentence because each word must be transformed into a vector. Usually, an algorithm is used to segment the words in a sentence, and the choice depends on the language being analysed. For English, the tokenization strategy is to use white space and punctuation as token delimiters[11].

Figure 4. An example of a sentence being tokenized
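A minimal sketch of white-space and punctuation tokenization for English is shown below. The regular expression and the function name tokenize are illustrative choices, not a standard; the pattern keeps simple contractions such as "isn't" together and drops punctuation marks.

```python
import re

def tokenize(sentence):
    """Split on white space and punctuation; punctuation marks are dropped."""
    # One or more letters/digits, optionally followed by an apostrophe and more letters
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", sentence)

tokens = tokenize("Bidirectional parsing isn't new, but it works!")
```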

  2. Stopword Removal is the process of removing words that add little meaning, in order to save storage space and speed up processing[12].

Figure 5. Tokenised text undergoing the process of Stopword removal [12]
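Stopword removal amounts to filtering tokens against a list. The stopword set below is a small hypothetical sample; real lists are much longer and depend on the language and the task.

```python
# Small illustrative stopword list (an assumption, not a standard list)
STOPWORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens whose lower-cased form is not a stopword."""
    return [t for t in tokens if t.lower() not in stopwords]

filtered = remove_stopwords(["The", "future", "of", "NLP", "is", "bidirectional"])
```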

  3. Stemming is a linguistic normalisation step that reduces words to their root form by removing derivational affixes[13]. Typically, the stemming process begins with identifying and removing prefixes, suffixes, and inappropriate pluralisation[13]. For example, a stemming algorithm might normalise swimming, swims, swimmer, and so on, into swim; an irregular form such as swam cannot be recovered by removing affixes and is better handled by lemmatization. The idea is to reduce the computation and storage required for algorithmic processing[13].

Figure 6. An example for Stemming and Lemmatization [13]
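The following is a deliberately naive suffix-stripping stemmer, written only to illustrate the idea. It is not the Porter algorithm or any other published stemmer, and the suffix list and the consonant-doubling rule are assumptions chosen to fit the example words.

```python
SUFFIXES = ("ing", "ers", "er", "s")  # checked in this order; illustrative only

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "swimm" becomes "swim"
            if len(stem) >= 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word

stems = [naive_stem(w) for w in ["swimming", "swims", "swimmer", "swam"]]
```

The irregular form "swam" has no suffix to strip, so it is left unchanged; mapping it to "swim" needs vocabulary knowledge, which is what lemmatization provides.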

  4. Lemmatization transforms words into their base forms (lemmas) and keeps them linguistically correct. It does so with the help of a vocabulary and morphological analysis. Lemmatization is usually more sophisticated than stemming, since a stemmer works on an individual word without knowledge of its context or use (knowledge as to whether the stemmed word is a noun, verb, adjective, etc. is lost)[14].
  5. Spelling Normalisation corrects misspelled words, which can otherwise lead to an unnecessary expansion in the size of the vector space needed to represent a document.
  6. POS-Tagging is the process of classifying each word as a part of speech, based on its definition and context[3]. Eight basic POS are generally used: noun, verb, pronoun, preposition, adverb, conjunction, adjective, and article.

Figure 7. The above picture shows some text which has been tagged based on various benchmarks [15]
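To illustrate why a POS tag helps lemmatization, here is a toy dictionary-based lemmatizer. The lookup table, the tag names, and the function name lemmatize are hypothetical; a real lemmatizer relies on a full vocabulary and morphological analysis rather than a handful of entries.

```python
# Hypothetical (word, POS) -> lemma table, for illustration only
LEMMAS = {
    ("swam", "VERB"): "swim",
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("better", "ADJ"): "good",
}

def lemmatize(word, pos):
    """Look up the lemma for a word in a given part of speech; fall back to the word itself."""
    return LEMMAS.get((word.lower(), pos), word.lower())

verb_lemma = lemmatize("saw", "VERB")
noun_lemma = lemmatize("saw", "NOUN")
```

The same surface form maps to different lemmas depending on its part of speech, which a stemmer working on the isolated word cannot distinguish.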

Part II: Features

  1. Word Embedding Feature: An embedding is a matrix in which each column is the vector that corresponds to an item in your vocabulary[16]. To get the dense vector for a single vocabulary item, you retrieve the column corresponding to that item[16]. The position of a word in the vector space is learned from text and is based on the words that surround it when it is used. Word embeddings can be trained on the input corpus itself or can be taken from pre-trained word embeddings such as GloVe[17], FastText[18], and Word2Vec.

Figure 8. A diagrammatic example of Word Embedding [16]
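Retrieving a word's vector from an embedding matrix is a column lookup. The sketch below uses a hypothetical three-word vocabulary and made-up two-dimensional values to show the mechanics.

```python
vocab = ["cat", "dog", "car"]  # hypothetical vocabulary

# 2 x 3 matrix: one row per embedding dimension, one column per vocabulary item
embedding_matrix = [
    [0.7, 0.6, -0.2],
    [0.1, 0.2, 0.9],
]

def lookup(word):
    """Return the column of the embedding matrix that corresponds to the word."""
    col = vocab.index(word)
    return [row[col] for row in embedding_matrix]

dog_vector = lookup("dog")
```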

  2. POS Embedding Feature is similar to the word embedding feature (POS tags are used as the words in the corpus). This helps the model learn the structure of the sentence by representing POS tags as vectors (the same process as word embedding[16], as represented in Figure 8). These embeddings are fed into the model along with the data and form the Embedding layer.
  3. Sentic Feature represents emotions in vector form and is inspired by the Hourglass of Emotions[19]. The Hourglass of Emotions, in turn, allows affective information to be classified both in a categorical way (according to a wider number of emotion categories) and in a dimensional format (which facilitates comparison and aggregation)[20]. A Sentic Feature is a five-dimensional vector[19]. The first four elements are pleasantness, attention, sensitivity, and aptitude values, while the last element represents a polarity value[19]. All values range between −1 and 1[19].

Figure 9. An example of some text for which sentic features have been identified [21]

Part III: Deep Learning Model

Deep learning models are utilised in numerous areas including object detection, speech recognition, and image classification. The advantage of these models is their multiple non-linear layers, which can capture complex patterns and decision boundaries from the input dataset, as seen in Figure 10. Following the success of deep learning models in vision and speech, researchers have shifted their focus to sentiment analysis and natural language processing.

Figure 10. Structure of a deep learning model [22]

Sentiment analysis can be broadly classified into traditional approaches (as seen in Figure 11) and deep learning methods (as seen in Figure 12). Traditional approaches involve features such as the bag-of-words model, which is based on combinations of words and their probability-based sentiment strength[23]. However, manually engineering all applicable features is highly tedious. Even a slight modification in the text requires repeating the entire process.

Figure 11. Traditional approach of processing text[24]
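As a small illustration of the bag-of-words representation used by traditional approaches, the sketch below turns a tokenized text into a vector of word counts over a fixed vocabulary. The vocabulary and the sentence are hypothetical.

```python
from collections import Counter

def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs; word order is discarded."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vocabulary = ["good", "bad", "movie"]  # hypothetical fixed vocabulary
vector = bag_of_words("good good movie".split(), vocabulary)
```

Because the vocabulary and the features built on it are fixed by hand, any change to the text or the feature set means recomputing these vectors.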

Thus, one of the primary advantages of deep neural networks is that they learn features automatically through back-propagation rather than relying on manual feature engineering. The emergence of deep networks like Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), and of word embedding methods like word2vec and GloVe, marks the new era of machine learning algorithms[25].

Figure 12. RNN, LSTM based approach of processing text[24]

However, one of the issues with deep learning is that deep networks can generalise poorly to unseen inputs due to the complexity of the model. Additionally, word embeddings capture only distributional information and not the polarity of text, on which sentiment analysis is heavily dependent. Polarity, in this case, refers to the aspect of sentiment analysis that tries to understand the opinion of the text, i.e., whether it is positive or negative. Since most of these models do not capture polarity, they are shallow (lacking depth of understanding of whether the text is positive or negative) and unidirectional.

Transformer models

Transformer models make use of self-attention layers[1] that reduce the computational path length in a deep learning network[26]. Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment, and learning task-independent sentence representations[1].

Figure 13. Structure of a Transformer model[1]
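A minimal sketch of scaled dot-product self-attention is given below. To keep it short, it assumes that queries, keys, and values are all the raw input vectors, whereas the Transformer in [1] first applies learned projections and uses several attention heads; the input values are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with queries = keys = values = x."""
    d = len(x[0])
    output = []
    for query in x:
        # Compare this position with every position of the same sequence
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in x]
        weights = softmax(scores)
        # The new representation is a weighted average of all positions
        output.append([sum(w * value[i] for w, value in zip(weights, x)) for i in range(d)])
    return output

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical 3-token sequence
y = self_attention(x)
```

Every output position is computed from all input positions in a single step, with no dependence on a previous output, which is why the per-position computations can run in parallel.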

The Transformer is principally an Encoder-Decoder architecture for seq2seq learning tasks like machine translation and named-entity recognition (NER) tagging[1]. A standard tool for modeling two sequences with recurrent networks is the encoder-decoder architecture, in which the second sequence (the target) is processed conditioned on the first one (the source)[27].

The Long Short-Term Memory (LSTM) model is a common building block of Encoder-Decoder architectures[27]. With sequence-dependent data, LSTM gives meaning to the sequence while retaining the parts it finds important and discarding the parts it finds unimportant[26]. Thus, LSTM is a special kind of RNN that uses a ‘memory cell’ in addition to standard units, making it possible to retain information for longer periods of time than a plain RNN.

A Seq2Seq[28] model typically consists of an Encoder and a Decoder where an Encoder takes the input sequence and maps it into an n-dimensional vector[27]. That vector is passed into the Decoder which relays it into an output sequence. Given that modern day language often occurs as some combination of multiple natural languages, with numerous abbreviations and emojis, encoders and decoders help to translate the given data into a single language.

Figure 15. Example of Seq2seq model to translate languages [29]

These models are versatile in sentiment analysis applications and are therefore used in a variety of NLP tasks, including machine translation, text summarisation, speech recognition, and Q and A systems[27].

Drawbacks of seq2seq model that are addressed by Transformers

Seq2seq models have some constraints. The Transformer design removes the recurrence of the Seq2Seq model (in which each step is computed from the previous hidden state) and instead relies entirely on self-attention[1]. A self-attention layer connects all positions with a constant number of sequentially executed operations (whereas an RNN layer requires O(n) sequential operations). In terms of computational complexity, self-attention layers are faster than RNN layers when the sequence length n is smaller than the representation dimensionality d, which is most often the case with sentence representations used by state-of-the-art models in machine translation[1].
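The comparison between n and d can be checked with a back-of-the-envelope calculation. The sketch below uses the per-layer complexities reported in [1], roughly n² · d for self-attention and n · d² for a recurrent layer, with constant factors dropped; the concrete sizes are hypothetical.

```python
def self_attention_ops(n, d):
    """Order-of-magnitude per-layer cost of self-attention: n^2 * d."""
    return n * n * d

def recurrent_ops(n, d):
    """Order-of-magnitude per-layer cost of a recurrent layer: n * d^2."""
    return n * d * d

# Hypothetical sizes: a 50-token sentence with 512-dimensional representations (n < d)
short = (self_attention_ops(50, 512), recurrent_ops(50, 512))
# A much longer sequence, where n > d and the advantage reverses
long = (self_attention_ops(2000, 512), recurrent_ops(2000, 512))
```

For sentence-length inputs the self-attention count is the smaller of the two, while for very long sequences the quadratic term in n dominates.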

The shorter the path between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies. Because a self-attention layer connects every pair of positions with a constant number of sequentially executed operations, these paths stay short.

Attention mechanism

One approach to solving the problem of losing relevant information in long sentences is the attention mechanism[1]. Each time the model predicts an output word, it uses only the parts of the input where the most pertinent information is concentrated, as opposed to the whole sentence. The encoder works as it usually would, but the decoder’s hidden state is computed from a context vector, the previous output, and the previous hidden state. Context vectors are weighted sums of the annotations computed by the encoder, and every output word has its own context vector[30]. This context vector is then passed to the decoder, which generates the required sequence.
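The context vector described above can be sketched as a softmax-weighted sum of encoder annotations. The example scores each annotation with a plain dot product against the decoder state, which is a simplifying assumption (other scoring functions, such as a small feed-forward network, are also used); all numbers are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(decoder_state, annotations):
    """Weight each encoder annotation by its relevance to the decoder state and sum them."""
    scores = [sum(s * a for s, a in zip(decoder_state, ann)) for ann in annotations]
    weights = softmax(scores)
    dim = len(annotations[0])
    context = [sum(w * ann[i] for w, ann in zip(weights, annotations)) for i in range(dim)]
    return weights, context

# Hypothetical encoder annotations (one per input word) and decoder state
annotations = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, context = context_vector([2.0, 0.0], annotations)
```

The input word whose annotation best matches the decoder state receives the largest weight, so the context vector is dominated by the most pertinent part of the input.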

Table 1. Comparison between various algorithms


|  | Naive Bayes | Support Vector Machines | Recurrent Neural Network (RNN) | RNN with Long Short-Term Memory (LSTM) |  |
| --- | --- | --- | --- | --- | --- |
|  | Independent: no semantic relationship between words |  | Relationships between words are considered | Relationships between words are considered in one direction | Relationships between words are considered in both directions |
|  | Not available | Not available |  |  |  |
| Data type |  |  | Sequential data | Sequential data | Sequential data |
|  | Training speed is slow. The time taken for training grows as data grows | Training time increases as the data grows | Training time is small | Training time is smaller | Training time is smaller than that of LSTM |
|  | In-memory requirement | Higher memory requirement | Memory requirement is comparatively less as data can be distributed | Memory requirement per neuron increases | Higher memory requirement than that of LSTM |
| Can data distributed between systems be processed independently? | Cannot be distributed | Cannot be distributed | Can be distributed | Can be distributed | Parallelisation of sequence data required |
| Is there sensitivity to the quantity and quality of data? | Works well with small training set | A large amount of training data is required | A large amount of training data is required | A large amount of training data is required | A large amount of training data is required |
| Data collection | Existing lexicons | Is tedious | Easy to collect | Easy to collect | Pre-trained models are available |

Limitations of the Transformer

The Transformer improves on RNN-based seq2seq models, but it is not completely without fault.

  1. Attention based models only deal with fixed-length text strings.
  2. The tokenization of text causes context fragmentation.


Transformers have a network architecture based on the self-attention mechanism and do not rely on recurrence or convolutions[1]. Thus, computations are executed in parallel, making Transformers more efficient and quicker to train. The future of NLP is ever-evolving and impacts our lives on a day-to-day basis. With models such as Transformers, Transformer-XL, and BERT (Bidirectional Encoder Representations from Transformers), the evolution of NLP continues positively.


I would like to thank Mr. Abhishek Nigam, Product Head, Machine Learning Products at Times Internet Limited, for providing vital mentorship and guidance throughout this research process over the course of my internship.


  1. Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. “Attention Is All You Need.” December 6, 2017.
  2. Allard, Maxime. “What Is a Transformer? – Inside Machine Learning – DZone AI.” January 25, 2019.
  3. “Part-of-Speech Tagging.” Wikipedia. Wikimedia Foundation, March 15, 2021.
  4. Medhat, Walaa, Ahmed Hassan, and Hoda Korashy. “Sentiment Analysis Algorithms and Applications: A Survey.” Ain Shams Engineering Journal 5, no. 4 (2014): 1093–1113. ISSN 2090-4479.
  5. Ryckel, François de. “Machine Learning with R.” Chapter 13 Support Vector Machine, February 23, 2019.
  6. Ayutthaya, Thititorn Seneewong, and Kitsuchart Pasupa. “Thai Sentiment Analysis via Bidirectional LSTM-CNN Model with Embedding Vectors and Sentic Features.” 2018 International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), 2018.
  7. “Word2vec.” Wikipedia. Wikimedia Foundation, April 15, 2021.
  8. Bansal, Shivam. “A Comprehensive Guide to Understand and Implement Text Classification in Python.” Analytics Vidhya, July 26, 2019.
  9. “Transfer Learning.” Wikipedia. Wikimedia Foundation, April 4, 2021.
  10. Harshith. “Text Preprocessing in Natural Language Processing Using Python.” Medium. Towards Data Science, January 18, 2020.
  11. Miner, Gary. Practical Text Mining and Statistical Analysis for Non-Structured Text Data Applications. Amsterdam: Academic Press, 2016.
  12. Duyanh. “Text Processing.” June 14, 2019.
  13. Sawhney, Prateek. “Introduction to Stemming vs Lemmatization (NLP).” LaptrinhX. LaptrinhX, September 3, 2020.
  14. Schütze, Hinrich. Stemming and lemmatization, 2009.
  15. Horan, Cathal. “NLP Datasets: How Good Is Your Deep Learning Model?” FloydHub Blog. FloydHub Blog, June 10, 2020.
  16. “Embeddings: Translating to a Lower-Dimensional Space.” Google. Accessed November 20, 2020.
  17. Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. GloVe: Global Vectors for Word Representation. Accessed May 20, 2021.
  18. “FastText.” Facebook Research, February 22, 2018.
  19. Poria, Soujanya, Amir Hussain, and Erik Cambria. “Sentic Patterns: Sentiment Data Flow Analysis by Means of Dynamic Linguistic Patterns.” Multimodal Sentiment Analysis, 2018, 117–51.
  20. Cambria, Erik, Amir Hussain, and Andrew Livingstone. “The Hourglass of Emotions – Sentic.” Accessed July 3, 2021.
  21. “SenticNet.” API « SenticNet. Accessed November 20, 2020.
  22. “Feedforward Deep Learning Models.” Feedforward Deep Learning Models · UC Business Analytics R Programming Guide. Accessed November 20, 2020.
  23. Medhat, Walaa, Ahmed Hassan, and Hoda Korashy. “Sentiment Analysis Algorithms and Applications: A Survey.” Ain Shams Engineering Journal 5, no. 4 (2014): 1093–1113.
  24. Dang, Nhan Cach, María N. Moreno-García, and Fernando De la Prieta. “Sentiment Analysis Based on Deep Learning: A Comparative Study.” June 5, 2020.
  25. Métais, Elisabeth. “Bidirectional Transformer Based Multi-Task Learning.” Essay. In Natural Language Processing and Information Systems: 24th International Conference on Applications of Natural Language to Information Systems, NLDB 2019, Salford, UK, June 26-28, 2019: Proceedings. Cham: Springer, 2019.
  26. Deguchi, Hiroyuki, Akihiro Tamura, and Takashi Ninomiya. “Dependency-Based Self-Attention for Transformer NMT.” Proceedings – Natural Language Processing in a Deep Learning World, 2019.
  27. Cheng, Jianpeng, Li Dong, and Mirella Lapata. “Long Short-Term Memory-Networks for Machine Reading.” Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016.
  28. Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. “Sequence to Sequence Learning with Neural Networks.” arXiv, December 14, 2014.
  29. Vagadia, Hardik. “Seq2Seq Models : French to English Translation Using Encoder-Decoder Model with Attention.” Medium. Analytics Vidhya, July 6, 2020.
  30. Synced. “A Brief Overview of Attention Mechanism.” Medium. SyncedReview, September 25, 2017.
  31. Goyal, Palash, Sumit Pandey, and Karan Jain. “Introduction to Natural Language Processing and Deep Learning.” Deep Learning for Natural Language Processing, 2018, 1–74.
  32. Luong, Minh-Thang, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. “Multi-Task Sequence to Sequence Learning.” March 1, 2016.
  33. Applied Course. Accessed November 20, 2020.

About the Author

Bharathi Sridhar is a Class 12 student at Maharishi Vidya Mandir Senior Secondary School, Chennai. She is keen on Data Science, AI, Human-Centric Computing, and NLP. Working with a leading Indian Internet media organization, she contributed to identifying context extraction algorithms. She looks forward to majoring in Computer Science and collaborating on impactful research.
