Introduction
With the release of ChatGPT last November, generative AI has captured the attention and imagination of the general public and made a big splash in the business world. Share prices of the Big Five tech companies (Alphabet, Amazon, Apple, Meta, and Microsoft) as well as Nvidia have skyrocketed. It even prompted The Economist to come up with a dedicated early-adopters index.
In this blog post, which consists of two parts that will be released independently, we briefly put generative AI in perspective and discuss its strengths as well as its limitations in light of challenges encountered in the legal world.
Large Language Models
The generative AI discussed here has its roots in language models. In its simplest form, a language model predicts the next word, given the previous words in a sentence. Borrowing ideas from machine translation, Machine Learning (ML) researchers developed a neural network architecture called the Transformer that is uniquely suited to this task. Since training such a language model merely involves randomly masking words in a sentence and instructing the model to predict them, any text corpus out there (Common Crawl with its billions of Internet pages, the Book Corpus of more than 10,000 books, Wikipedia) can be used for training – hence the rise of Large Language Models, or LLMs for short. This type of training is referred to as unsupervised pre-training, as it requires no human supervision and does not target any specific downstream task. During pre-training, LLMs acquire a general language understanding, which improves as the models become larger and are given more training data.

Because it took its inspiration from machine translation, the original Transformer architecture consists of two parts: an encoder and a decoder. When, say, a German sentence is translated into English, the source sentence is first encoded by the encoder. The decoder then takes the encoded German sentence as input and generates the English translation in a second step.
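To make these pre-training objectives concrete, here is a minimal sketch using the Hugging Face transformers library; the library and the bert-base-uncased and gpt2 checkpoints are illustrative choices on our part, not something prescribed above.

```python
from transformers import pipeline

# Masked-word prediction: an encoder-style model is asked to fill in a masked word.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("The agreement was signed by both [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))

# Next-word prediction: a decoder-style model continues the given text
# one word (token) at a time.
generate = pipeline("text-generation", model="gpt2")
print(generate("The agreement was signed by both", max_new_tokens=10)[0]["generated_text"])
```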
Since the introduction of the Transformer model, it has been established that, in addition to the original architecture, the two parts can also be trained independently as stand-alone models. This resulted in encoder-only and decoder-only language models, each with its own strengths in carrying out different downstream Natural Language Processing (NLP) tasks (a short code sketch contrasting the two follows the lists below). For example, encoder-only language models such as BERT excel in:
- Information Retrieval (the technology at the core of search engines),
- Named Entity Recognition (NER) (used to recognize, for example, persons, organisations, or locations in documents),
- document classification (used to determine the type of document based on its content),
- extractive Q&A (used to answer questions with text snippets contained – verbatim – in a corpus of documents),
whereas decoder-only language models, also referred to as generative AI (GenAI) models, such as GPT or LLaMA excel in:
- text summarization,
- text generation,
- mimicking human conversation (chatbots).
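The contrast between the two branches can be sketched with the same library as before; the dslim/bert-base-NER and gpt2 checkpoints below are again merely illustrative assumptions.

```python
from transformers import pipeline

# Encoder-only (BERT-style) model on a typical downstream task: Named Entity Recognition.
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
print(ner("Johannes Vermeer painted Girl with a Pearl Earring in Delft."))

# Decoder-only (GPT-style) model on a typical downstream task: free-form text generation.
generator = pipeline("text-generation", model="gpt2")
print(generator("The parties agree that", max_new_tokens=20)[0]["generated_text"])
```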
Although pre-trained simply by predicting masked words, LLMs from both branches have shown surprising capabilities in carrying out downstream tasks when provided with just a few examples, or even no example at all – a phenomenon known as few- or zero-shot learning. Such unforeseen capabilities include determining the sentiment of a sentence, classifying documents, reasoning, and translation. How exactly these models achieve such feats is presently unknown and the subject of active research.
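As a rough illustration of the mechanics of few-shot learning: the examples are simply placed in the prompt and the model is asked to continue, with no parameters being changed. The gpt2 checkpoint is an assumption made to keep the sketch self-contained; a model this small will not answer reliably, whereas much larger LLMs often do.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Two labeled examples ("shots") followed by the case we actually want answered.
few_shot_prompt = (
    "Review: The food was excellent.\nSentiment: positive\n"
    "Review: The service was terribly slow.\nSentiment: negative\n"
    "Review: I would come back any time.\nSentiment:"
)
print(generator(few_shot_prompt, max_new_tokens=2)[0]["generated_text"])
```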
The practice of coercing an LLM into carrying out a specific task through carefully worded instructions is known as prompt engineering. In this approach, the model itself is not updated, i.e., its parameters remain frozen. This differs fundamentally from the more established ML approach of fine-tuning a model. In that approach, sufficient training examples of a specific task are provided to a pre-trained model, so that it can learn to carry out that task by tweaking its parameters. After fine-tuning, the model parameters differ from what they were before. Since fine-tuning requires labeled data as training examples, which in turn have to be created by humans, it is referred to as supervised training. Because of the huge cost of the human effort involved in labeling data, this forms the bottleneck of contemporary ML applications, sometimes jokingly called AI’s human bottleneck. One of the main advantages of pre-trained LLMs is that they substantially reduce the need for labeled data.
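A minimal sketch of what fine-tuning looks like in code, assuming a bert-base-uncased checkpoint, PyTorch, and two made-up labeled examples: here the optimizer explicitly updates the model's parameters, in contrast to prompting, where they stay frozen.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Pre-trained encoder plus a fresh classification head (checkpoint is an illustrative choice).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# A toy labeled dataset created by a human: 1 = contract clause, 0 = not a contract clause.
texts = ["The lease terminates on 31 December.", "The weather was pleasant."]
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for _ in range(3):  # a few passes over the toy data
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    loss = model(**batch, labels=labels).loss  # supervised loss from the human labels
    loss.backward()
    optimizer.step()        # the parameters are updated here
    optimizer.zero_grad()
```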
A recent trend in LLM research is to fine-tune a pre-trained LLM on many different tasks in parallel, to arrive at a single model with state-of-the-art performance on a myriad of downstream NLP tasks. To this end, labeled NLP datasets that have been compiled and shared by the ML community over the years are presented to the model together with a description of each task in natural language.
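Conceptually, preparing such multi-task training data amounts to pairing each labeled example with a natural-language description of its task; the sketch below uses invented task descriptions and examples purely for illustration.

```python
# Natural-language descriptions of two (hypothetical) tasks.
tasks = {
    "sentiment": "Decide whether the sentiment of the following sentence is positive or negative.",
    "entailment": "Does the first sentence imply the second? Answer yes or no.",
}

# Labeled examples as they might appear in existing (here: made-up) NLP datasets.
examples = [
    {"task": "sentiment", "input": "The verdict came as a great relief.", "target": "positive"},
    {"task": "entailment", "input": "The contract was signed. / An agreement exists.", "target": "yes"},
]

# Each training record combines the task description with the labeled example.
instruction_data = [
    {"prompt": f"{tasks[ex['task']]}\n\n{ex['input']}", "completion": ex["target"]}
    for ex in examples
]
for record in instruction_data:
    print(record["prompt"], "->", record["completion"])
```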
Convergent vs. Divergent Thinking
The distinction between encoder-only and decoder-only language models, with their complementary capabilities, is not unlike a distinction made in psychology. In the 1950s, the American psychologist J.P. Guilford introduced the concepts of convergent and divergent thinking in the context of problem solving. In non-technical terms, convergent thinking can be described as
the ability to give the "correct" answer to standard questions that do not require significant creativity, for instance in most tasks in school and on standardized multiple-choice tests for intelligence
whereas divergent thinking can be described as
a thought process used to generate creative ideas by exploring many possible solutions. It typically occurs in a spontaneous, free-flowing manner,
Given this distinction, one should not expect a physicist, who is typically on the extreme convergent side of the convergent-to-divergent thinking scale, to produce a sensible exposé on the painting "Girl with a Pearl Earring" by Johannes Vermeer when put in front of it; nor should one expect an art historian, who is typically on the other, extreme divergent side of the scale, to come up with Kepler's laws when put in front of a dataset of planetary orbits.