Why learning by example is the only way to optimise language models
When I moved to Spain at the age of 21, I thought my GCSE-level knowledge of grammar rules and structures would help me get by. What I later learned was that hearing Spanish, in real-life scenarios, was the best way to expand my vocabulary and develop my linguistic skills. It turns out that when it comes to Natural Language Processing (NLP) the same logic applies.
Language is not just a set of independent, meaningful words (a vocabulary); those words only take on meaning when they are understood in context. So, instead of trying to specify language with formal grammar rules and structures, you are better off training the model (or your brain) on real-life examples. This is particularly true when NLP is being used in a legal setting, for example for contract review and analysis.
Evolution of NLP
The goal of NLP is to build a technique or model that can understand human language as it is used naturally. Let’s take a look at the evolution of NLP: where it started, and where we are today in our quest to imitate and understand human language and apply it to different tasks.
GloVe and word2vec
Traditional methods of language processing, known as word embeddings, fell short of capturing the full meaning of a text. The most common approaches, namely GloVe and word2vec, mapped every single word to a single vector. Since this approach looked at each word in isolation, it could not capture the full contextual meaning of a sentence.
Let’s unpick the limitations of these traditional methods by looking at an example: “I miss you” is a simple sentence made up of three words. You cannot determine who misses who by looking at one word in isolation. You need all three words to know that I am the one missing you. A language model must understand the basic structure between words in order to understand the full meaning of a sentence, no matter how simple the sentence might be.
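To make this limitation concrete, here is a toy sketch of a static embedding lookup. The three-dimensional vectors are invented for illustration (real GloVe or word2vec vectors have hundreds of dimensions), but the structural point holds: each word gets exactly one vector, wherever it appears.

```python
# Toy static word embeddings: every word maps to exactly one vector,
# regardless of its position or role in the sentence.
# (Vectors are invented for illustration, not real GloVe/word2vec output.)
embeddings = {
    "i":    [0.2, 0.9, 0.1],
    "you":  [0.8, 0.1, 0.5],
    "me":   [0.3, 0.8, 0.2],
    "miss": [0.5, 0.5, 0.9],
}

def embed(sentence):
    """Look up each word's single, context-free vector."""
    return [embeddings[word] for word in sentence.lower().split()]

vectors_a = embed("I miss you")   # "you" as the object
vectors_b = embed("You miss me")  # "you" as the subject

# The vector for "you" is identical in both sentences, so the model
# cannot tell subject from object from the embeddings alone.
print(vectors_a[2] == vectors_b[0])  # True
```

Because the lookup ignores the surrounding words, any information about who misses whom is simply lost.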
RNN & LSTMs
The next wave of NLP techniques looked at the relationships between the words in a sentence to determine its meaning. At the time, the most commonly used approaches for capturing dependencies between words were deep learning models, namely Recurrent Neural Networks (RNNs), or more specifically LSTMs (Long Short-Term Memory networks).
This state-of-the-art algorithm is used to process sequential data, most notably in the creation of Siri and Google’s voice search. It was one of the first architectures built with short-term memory, meaning it remembers prior information it has been exposed to. While these advances are important, these techniques can only transfer information in a fixed order, so the longer your sentence, the less likely it is that the information it contains will be fully retained.
Let’s go back to our previous example, with two sentences this time: ‘I miss you’ and ‘you miss me’. Here, the word ‘you’ takes on two different contextual meanings, being the object of the first sentence and the subject of the second. We need to train our LSTM model on both of these instances, because the meaning of ‘you’ in each role cannot be learned independently using the same LSTM model.
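The strictly sequential nature of an RNN can be sketched in a few lines. This is a minimal one-dimensional recurrent update (a real LSTM adds gating on top of the same pattern), with weights and inputs invented for illustration: each step depends on the previous hidden state, so steps cannot run in parallel and word order changes the result.

```python
import math

# Toy one-dimensional recurrent update: h_t = tanh(W * x_t + U * h_{t-1}).
# Weights are invented for illustration; a real LSTM adds input/forget/
# output gates on top of this same sequential pattern.
W, U = 0.7, 0.9

def run_rnn(inputs, h=0.0):
    # Each step consumes the PREVIOUS hidden state, so the steps cannot
    # be parallelised, and reordering the inputs changes the outcome.
    for x in inputs:
        h = math.tanh(W * x + U * h)
    return h

sequence = [0.1, 0.5, 0.9]  # numeric stand-ins for "I", "miss", "you"
print(run_rnn(sequence))                   # one reading order...
print(run_rnn(list(reversed(sequence))))   # ...a different final state reversed
```

The final hidden state also has to carry everything seen so far, which is why long sentences tend to lose information by the end.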
So what does this mean? It means that a significant amount of data is required for very simple tasks, such as sentiment analysis (determining if a sentence is positive, neutral or negative). Fortunately, the academic world was quick to acknowledge this inherent flaw, and a breakthrough came from teams working on language models. They saw how numerical outputs could represent meaning and maintain contextual dependencies between words. In 2018, a paper was published introducing the language model ‘ELMo’ (Embeddings from Language Models). This paper effectively established that it was possible to improve model efficiency by building in an overall understanding of language.
If we refer back to our previous examples, ‘I miss you’ and ‘you miss me’, ELMo could understand the word ‘you’ in its own context without having to retrain the model each time. This enabled better results for most NLP tasks.
The problem with the ELMo model was that it was built using LSTMs and, as previously mentioned, all the data had to be processed in a specific order, with each calculation relying on the previous result. This meant that ELMo could not take advantage of parallel processing (i.e. many computations running at the same time), which significantly slowed down training.
Who or what is BERT?
In 2017, Google introduced the Transformer, a significant improvement on RNNs because it does not need to process data in order: a Transformer can relate any word to every other word in a sentence, rather than processing words one at a time.
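The mechanism behind this is self-attention. Here is a minimal pure-Python sketch of (unscaled) dot-product self-attention, with two-dimensional word vectors invented for illustration: every word’s output is a weighted mix of all the words’ vectors, computed in one shot rather than left to right.

```python
import math

# Toy self-attention: each word's output blends information from ALL
# words in the sentence at once (no sequential pass).
# Two-dimensional vectors invented for illustration.
words = ["i", "miss", "you"]
vecs  = [[0.2, 0.9], [0.5, 0.5], [0.8, 0.1]]

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    outputs = []
    for q in vectors:                     # each word acts as a "query"...
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in vectors]
        weights = softmax(scores)         # ...scored against EVERY word
        mixed = [sum(w * v[d] for w, v in zip(weights, vectors))
                 for d in range(len(q))]
        outputs.append(mixed)
    return outputs

for word, out in zip(words, self_attention(vecs)):
    print(word, [round(x, 3) for x in out])
```

Because each row is computed independently of the others, all of them can be calculated in parallel, which is exactly the property the LSTM lacked.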
Then the breakthrough happened. Google took the language-model intuition developed in academia and built a language model based on Transformers: BERT (Bidirectional Encoder Representations from Transformers). It is a machine learning technique that applies bidirectional training to the Transformer’s encoder.
By looking at all the surrounding words, the Transformer enables BERT to understand the full context of a word. This revolutionary approach takes arbitrary text as input and produces a fixed-length vector (a numeric representation of the text) as output. Think of this as a translation step into a secret machine language.
Using the Transformer’s bidirectional capability, BERT is pre-trained on two different NLP tasks:
Masked Language Modeling
The aim of Masked Language Modeling is to generate accurate numerical representations of words. It does this by randomly hiding one word (or several words) in a sentence and then having the programme predict what the ‘masked’ word is based on the context of the sentence. BERT is forced to identify the masked word from context alone, rather than from any fixed, pre-assigned identity. For example: ‘I need to call my boss, my ___ was late this week’. Looking at the entire context of the sentence, you might argue that the most likely word to fit here is ‘paycheck’.
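A toy version of this prediction step can be sketched as candidate scoring: rank possible fillers for the blank by how well they fit a vector built from the context words. The embeddings below are invented for illustration; BERT learns equivalent scores from billions of masked-word examples rather than from a hand-made table.

```python
# Toy masked-word prediction for "my ___ was late this week":
# score each candidate against an averaged context vector.
# All embeddings are invented for illustration.
context_words = ["call", "boss", "late", "week"]
embeddings = {
    "call":     [0.4, 0.6, 0.2],
    "boss":     [0.5, 0.7, 0.1],
    "late":     [0.3, 0.5, 0.3],
    "week":     [0.4, 0.4, 0.2],
    # candidate fillers for the masked position
    "paycheck": [0.45, 0.6, 0.2],
    "banana":   [0.0, 0.1, 0.9],
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Average the context vectors, then rank candidates against the context.
context = [sum(embeddings[w][d] for w in context_words) / len(context_words)
           for d in range(3)]
candidates = ["paycheck", "banana"]
best = max(candidates, key=lambda w: dot(embeddings[w], context))
print(best)  # "paycheck" fits this context better than "banana"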
Next Sentence Prediction
Next sentence prediction forces the model to understand how sentences fit together. The programme predicts whether two sentences have a logical, sequential connection or whether their relationship is random. Let’s take a look at two sentences:
- I’m really enjoying this film; and
- I’m worried about work.
Which of the two has a logical, sequential connection to ‘I need to call my boss, my paycheck was late this week’? BERT works by clustering concepts based on sentiment. So, imagine ‘love’ and ‘hate’ plotted on a graph: you would expect these concepts to be far apart. However, sentences that express similar sentiments, like ‘I am so disappointed in this corporation’ and ‘I don’t like this company’, would be clustered together, despite the fact that different words are used.
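That clustering idea can be sketched with cosine similarity over sentence vectors. Here the sentence vectors are simple averages of invented two-dimensional word vectors (the first dimension loosely encoding “negative” sentiment, the second “positive”), which is a stand-in for the richer representations BERT actually produces.

```python
import math

# Toy sentiment clustering: sentence vectors (averages of invented
# two-dimensional word vectors) compared with cosine similarity.
# Dimension 0 ~ "negative", dimension 1 ~ "positive" (an assumption
# of this sketch, not how BERT's dimensions actually work).
embeddings = {
    "disappointed": [0.9, 0.1], "corporation": [0.3, 0.4],
    "dont":         [0.8, 0.1], "like":        [0.1, 0.9],
    "company":      [0.3, 0.4], "love":        [0.0, 1.0],
}

def sentence_vector(words):
    return [sum(embeddings[w][d] for w in words) / len(words)
            for d in range(2)]

def cosine(a, b):
    return dot_product(a, b) / (math.hypot(*a) * math.hypot(*b))

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

negative_1 = sentence_vector(["disappointed", "corporation"])
negative_2 = sentence_vector(["dont", "like", "company"])
positive   = sentence_vector(["love", "company"])

# The two negative sentences cluster together despite sharing no
# sentiment word, while the positive sentence sits further away.
print(cosine(negative_1, negative_2) > cosine(negative_1, positive))  # True
```

With handcrafted vectors this is of course rigged to work; the remarkable thing about BERT is that it learns representations with this property from raw text.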
BERT is the rocket booster to kickstart your NLP projects
BERT addresses the flaws of previous NLP techniques: the hours spent training models on data to make them sufficiently smart and, in most cases, the difficulty of finding enough training data to give the models a well-rounded education.
In contrast, the BERT framework was pre-trained using the entire English Wikipedia corpus (together with a large corpus of books). This means that BERT has already had exposure to an enormous variety of language structures, nuances and ambiguous language. The best bit is that BERT can be used as a base layer of knowledge in your NLP projects. You can build upon it and fine-tune it for your specific needs. This process is called transfer learning, and it means that rather than starting from scratch, you can get your model up to speed on a lot less data. It is cheaper, faster and ultimately smarter, because you are building upon an existing layer of intelligence.
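Transfer learning can be sketched in miniature: freeze a “pre-trained” encoder and train only a small classification head on a handful of labelled examples. The encoder below is a deliberately crude stand-in (word-count features over invented cue lists), not a real BERT; the point is purely structural, in that its weights never change during fine-tuning.

```python
import math

# Toy transfer learning: a FROZEN "pre-trained" encoder turns text into
# features, and only a tiny logistic-regression head is trained.
# The encoder and cue words are invented stand-ins, not a real BERT.
POSITIVE_CUES = {"love", "great", "enjoy"}
NEGATIVE_CUES = {"hate", "late", "bad"}

def frozen_encoder(text):
    words = text.lower().split()
    return [sum(w in POSITIVE_CUES for w in words),
            sum(w in NEGATIVE_CUES for w in words)]

# Trainable head: logistic regression on top of the frozen features.
weights, bias = [0.0, 0.0], 0.0
data = [("i love this great film", 1), ("i hate this bad company", 0),
        ("i enjoy this", 1), ("my paycheck was late", 0)]

for _ in range(200):                          # a few gradient steps
    for text, label in data:
        x = frozen_encoder(text)              # encoder weights untouched
        p = 1 / (1 + math.exp(-(weights[0]*x[0] + weights[1]*x[1] + bias)))
        grad = p - label
        weights = [w - 0.1 * grad * xi for w, xi in zip(weights, x)]
        bias -= 0.1 * grad

def predict(text):
    x = frozen_encoder(text)
    return 1 / (1 + math.exp(-(weights[0]*x[0] + weights[1]*x[1] + bias)))

print(predict("i love it") > 0.5)   # classified as positive
print(predict("so bad") < 0.5)      # classified as negative
```

Only two weights and a bias were trained here, on four examples; with a real pre-trained model, the same division of labour is what lets fine-tuning succeed on comparatively tiny datasets.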
Making language models legal-savvy
Despite the fact that using AI to analyse text has never been easier, the revolution has not yet arrived in the legal world. Legal technology has emerged in recent years to provide solutions to help lawyers read contracts. Using language models to analyse content-rich documentation, like contracts, is making waves in the most traditional of industries.
Surprisingly, many of the key players on the market still rely on training individual small models for specific tasks. This training process requires large amounts of niche legal data and human supervision to get the model up to speed; it is expensive and tedious. But the crucial point is that this type of contract analysis technology isn’t true AI: the lawyer still has to manually review the clause and spend time finding the answer to their specific question. To reach true AI, we want a tool that does more than just ‘highlight’ the clause you need to review, one that actually finds the answers to your questions for you, thereby providing you with actionable information quickly.
So why are legal technologies falling into the legacy trap when there is an easier, cheaper and smarter way to do it?
If you think about it, massive language models, like BERT, are pre-trained with such enormous amounts of data, that the base level of natural language is already at a very high level to begin with. If you take this model and fine-tune it with exposure to very specific legal terminology, it will be able to incorporate it quickly into its pre-trained ‘brain’.
By way of comparison, if we think back to my experience in Spain: if I already had a C1 level knowledge of Spanish prior to moving to Madrid, I would have found it a lot easier to fine-tune my linguistic skills with examples of more colloquial ways of speaking or even incorporate very specific vocabulary.
The point is that these massive language models have already learned from so many examples, that they require far fewer example contracts to be able to pick up ‘legal language’ than small models that are built from scratch.
At Della, we take full advantage of advances in NLP, like transfer learning, and we regularly take time to analyse the latest research and apply it to our model. So the question is: why aren’t the other providers taking advantage of these developments? The current generation of legal service providers may be falling into the very trap they would have told their customers to avoid: attachment to legacy techniques.