
From GPT to Llama: Understanding the Evolution of AI Language Models

1. Introduction

When OpenAI published the paper introducing GPT-3 (the third Generative Pre-trained Transformer) in June 2020, it took the world of AI by storm. Commentators everywhere began writing about it, some hyping it up and others explaining why it might not deserve the attention it was receiving, but nearly all of them acknowledged the rapid growth of AI language models as a field. Even those who focused on GPT-3's shortcomings had to concede that it was a leap forward in natural language processing, the frontrunner in a race in which each year's models seemed to belong to an entirely different category from the last.

Natural language processing (NLP) has been powered by machine learning since the start of the 21st century. Algorithms that intelligently transform and analyze text data, making it easier to work with, are now commercially available. Language models assist in a variety of language processing tasks, including text completion, classification, and generation, and as the technology advanced it became able to grasp the context of a query and provide a precise answer. This essay's goal is to take you through the evolution of AI language models from 2018 onward. Even though there have been further updates and advancements since, we will try to convey how big the leap was from GPT to Llama: a model family that embodies a different philosophy for training, efficiency, and openness in natural language processing.

2. Foundations of AI Language Models

For someone immersed in technology, AI, or enterprise software, NLP signifies Natural Language Processing, a field that has been in use since the early 1960s to give computers the ability to understand and process human input. Early systems relied largely on programs written by humans to read and interpret text. One may wonder why NLP was not popular a decade ago. One pivotal factor was the unavailability of ready-made algorithms and libraries that could handle NLP's intricacies. So was the field suddenly developed in the past few years? No. The answer is neural networks and deep learning. Neural networks, which loosely mimic the human brain, have some great applications and have led to the development of today's complex AI language models.

Thus, the conceptual development started in the 1960s, but the field only became practical in the past few years. Neural networks and deep learning are the backbone of almost all modern AI models. They were developed around the idea of allowing a system to learn and improve over time once initialized with some information, an approach that greatly reduced hand-coding and manual programming work. Recursive neural networks, trained through backpropagation, extended this idea to structured data such as parse trees. In 1997, the LSTM (Long Short-Term Memory) model was introduced by Sepp Hochreiter and Jürgen Schmidhuber; it could learn and retain information across very long spans, even more than 1,000 time steps. The Gated Recurrent Unit (GRU), a lighter version of the LSTM, was introduced by Kyunghyun Cho and colleagues in 2014; it includes an update gate and a reset gate in each recurrent cell. These models are still used as instrumental tools when working with sequence data.
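The update and reset gates of a GRU can be sketched in a few lines of NumPy. This is a minimal illustration of a single recurrent step under simplified assumptions (no biases, randomly initialized weights with illustrative names), not any particular library's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h_prev, params):
    """One GRU step. The update gate z decides how much of the old state
    to keep; the reset gate r decides how much of it to use when forming
    the candidate state."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x + Uz @ h_prev)               # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev)               # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))   # candidate state
    return (1 - z) * h_prev + z * h_tilde           # blended new state
```

Because the gates interpolate between the previous state and the candidate, gradients can flow through the `(1 - z) * h_prev` path largely unimpeded, which is what lets these cells retain information over long spans.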

2.1. Key Concepts in Natural Language Processing

There are a number of relevant concepts one should be familiar with before delving into the world of language models and natural language processing. First, a basic introduction to syntax is in order. Syntax refers to the arrangement of words and phrases to create well-formed sentences in a given language. Semantics is the component of language that focuses on meaning. Not to be confused with the meaning of individual words or individual grammatical frames, semantics is an overarching concept concerned with determining the meaning of longer strings of text. For example, "I took my dog to the park" would be semantically analyzed as a whole, based on how the meanings of "I," "took," "my," "dog," "to," "the," and "park" combine, rather than on the meaning of each word in isolation.

One of the most basic, yet practical, services to come out of the world of natural language processing (NLP) is Named Entity Recognition (NER). By identifying words and phrases in a sentence and matching those strings of text against a pre-defined list or type, NER allows services to pick out specific important pieces of information within a body of text. For a medical services company, a basic NER service could help analyze patient charts for mentions of drugs, qualitative side effects, or medical conditions for easy retrieval later.

Syntactic parsing refers to the use of a language's syntax to break sentences into words or phrases and establish the relationships between them. Tokens are the strings of characters into which a sentence is broken down, and the function of a token can often be inferred from its position in the sequence; semantically, the role of a text string can often be determined by its relationships to the other strings in the sentence. Finally, language generation is the act of creating sentences and phrases according to grammatical rules, using the appropriate sounds or characters for words. Generative AI and language models take advantage of language generation to create sentences, paragraphs, or even full-length articles without direct human input: they can write about specific topics, paraphrase ideas in existing texts, or continue writing from minimal input or none at all.
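A list-based NER service of the kind described above can be sketched in a few lines. The gazetteer entries and the `tag_entities` helper here are illustrative inventions, not a real medical vocabulary:

```python
# Minimal sketch of list-based named-entity recognition: match tokens
# against a predefined gazetteer mapping strings to entity types.
GAZETTEER = {
    "aspirin": "DRUG",
    "ibuprofen": "DRUG",
    "nausea": "SIDE_EFFECT",
    "hypertension": "CONDITION",
}

def tag_entities(text):
    # Lowercase and strip simple punctuation before matching.
    tokens = text.lower().replace(",", "").replace(".", "").split()
    return [(tok, GAZETTEER[tok]) for tok in tokens if tok in GAZETTEER]

print(tag_entities("Patient took aspirin and reported nausea."))
# [('aspirin', 'DRUG'), ('nausea', 'SIDE_EFFECT')]
```

Production NER systems go far beyond exact lookup (handling multi-word entities, inflections, and context), but the input/output shape is the same: spans of text labeled with types.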

2.2. Neural Networks and Deep Learning

Classifiers, including language models, refer to a broad family of statistical models that determine the category into which a new data point should fall. Neural networks are a class of machine learning technologies that excel at using image, audio, and other sensory data to create sophisticated models that can match or better humans on a range of sensory tasks. In their parameters, neural networks can store large quantities of detailed information about how objects relate to one another and how they are expressed in language.

Before building large production systems, researchers used simulated data sets to optimize model designs. Over the last decade, recurrent neural networks (RNNs, including LSTMs) and convolutional neural network (CNN) models transformed NLP. The efficacy of these neural architectures on traditional language tasks such as parsing, paraphrasing, and inflection suggests that GPT is structurally very similar to other standard pre-trained neural language models, just exceptionally large and computationally expensive.

GPT-like language models apply a causal attention mask so that each position can attend only to earlier positions, which is what makes autoregressive generation possible. It is worth noting that unsupervised language-model training is what largely defines the GPT family. Other techniques instead take an already trained GPT network and fine-tune its hidden weights on diverse, large-scale supervised datasets and tasks.
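The causal attention mask can be illustrated directly: a lower-triangular matrix zeroes out the probability of attending to any future position. A minimal NumPy sketch (illustrative, not any specific framework's API):

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend only to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    """Softmax over attention scores with disallowed positions set to -inf,
    so they receive exactly zero attention weight."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)
```

With uniform scores, position 0 can only attend to itself, while the last position spreads its attention evenly over the whole prefix; that asymmetry is the entire mechanism behind autoregressive decoding.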

3. The Rise of GPT Models

In June 2018, the OpenAI research institution unveiled a high-level AI model called GPT (Generative Pre-trained Transformer). This landmark event is often identified as the dawn of the GPT models, a series of increasingly skilled and efficient models that preceded the flagship release of GPT-3 in June 2020. The models were unveiled in ascending order, with the GPT-2 model actually being released in a piecemeal fashion during 2019 due to concerns at the time about its potential use to create misleading or harmful content.

The GPT-1 model was considered small in terms of parameter count, with a total of 117 million parameters. Even the smallest Llama models are still larger than the pioneering GPT-1, and the largest rank among the most extensive and data-hungry language models to date. One significant shift with GPT-2 was the jump to more than 1.5 billion parameters. These parameters, in part, are what give the models their task-agnostic or unsupervised nature: a GPT model can perform multiple tasks or learn from data without requiring example inputs and outputs, carrying out a task without explicit instructions. The most recent and popular incarnation is GPT-3, a gargantuan neural network boasting 175 billion parameters; in 2021, GPT-3 represented the state of the art for commercially deployed language models.
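These parameter counts can be roughly reproduced with a standard back-of-envelope formula: each transformer block contributes about 12·d² weights (roughly 4·d² for attention and 8·d² for the feed-forward layers), plus token and position embeddings. A sketch using the published architecture figures for each model (the formula is an approximation that ignores biases and layer norms):

```python
def approx_params(n_layers, d_model, vocab_size, n_ctx):
    """Rough decoder-only transformer parameter count:
    ~12*d^2 weights per block, plus token and position embeddings."""
    blocks = n_layers * 12 * d_model ** 2
    embeddings = (vocab_size + n_ctx) * d_model
    return blocks + embeddings

# Published layer count, model width, vocabulary, and context length:
for name, cfg in {
    "GPT-1": (12, 768, 40478, 512),
    "GPT-2": (48, 1600, 50257, 1024),
    "GPT-3": (96, 12288, 50257, 2048),
}.items():
    print(name, f"{approx_params(*cfg)/1e6:,.0f}M")
```

The estimates land close to the quoted 117M, 1.5B, and 175B figures, which shows that the growth across generations came almost entirely from deeper stacks and wider layers, not from a change in the basic recipe.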

3.1. GPT-1 and Its Impact

GPT-1 is the original model of the series, and its successors are essentially constructed from the same blueprint. What made the model exceptional was its size for its time: GPT-1's transformer houses 12 layers with 12 attention heads and a total of 117M parameters. Once fine-tuned, the pre-trained autoregressive language model could replace models trained end to end on the target task as a discriminative classifier, and the scientific community saw a notable gain: improvements over the prior state of the art on 9 of the 12 benchmark tasks studied. Although GPT-1 produced remarkable performance given the compute available at its creation, its limitations soon became evident to researchers. A major shortcoming is that GPT-1 does not learn a transferable, task- and domain-agnostic model of the world.

Instead, when trained on tasks whose signals correlate with patterns in the pre-training corpus, it tends to rely too heavily on those weak patterns and fails on tasks whose signals are orthogonal to them. Furthermore, GPT-1 has no built-in mechanism for conditioning on information at the model level, so necessary prior knowledge can be lost when generating content for specific prompts, in a way a more general model would avoid. Finally, generating prose with an autoregressive transformer incurs a hefty cost at inference time: the model must produce a probability distribution over the entire vocabulary at every step, so in practice techniques such as top-k sampling are used to truncate that distribution and keep generation manageable. Despite these limitations, GPT-1 showcased a myriad of possibilities for modern LM training.
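Top-k sampling, mentioned above, keeps only the k highest-scoring candidate tokens, renormalizes their probabilities, and samples from that truncated set. A minimal NumPy sketch:

```python
import numpy as np

def top_k_sample(logits, k, rng):
    """Sample the next token from only the k highest-scoring candidates,
    cutting off the long tail of low-probability tokens."""
    top = np.argpartition(logits, -k)[-k:]      # indices of the k best logits
    z = logits[top] - logits[top].max()         # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()             # softmax over the top k only
    return int(rng.choice(top, p=p))
```

Truncating the tail trades a little diversity for a large reduction in implausible continuations, which is why it became a default decoding strategy for GPT-style models.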

3.2. Advancements in GPT-2 and GPT-3

GPT-1 was a big step in the direction of large pre-trained language models. Before 2018, progress on neural NLP benchmarks had lagged behind the hardware speedups that GPUs delivered for convolutional models, but that was not going to be the case with the transformer. Roughly eight months after the first version, GPT-2 was launched. The first key advancement of GPT-2 was the size of the network: it grew to a whopping 1.5 billion parameters, more than a tenfold increase over GPT-1. The second advancement was the demonstration that a model trained purely on left-to-right next-word prediction over a large web corpus could perform tasks such as machine translation, summarization, and question answering in a zero-shot fashion, without any task-specific fine-tuning.

The next iteration, GPT-3, arrived with up to 175 billion parameters, an increase of more than 100x over GPT-2. The largest variant (the "Davinci" model) approached human performance on practically relevant NLP benchmarks, scoring 71.8 on SuperGLUE in the few-shot setting against a human baseline of 89.8. This was one of the key drivers of ambitious research into large language models: the prospect of rapid gains simply from scale. GPT-3 was in fact released as a family of models of different sizes, ranging from roughly 125 million parameters up to the full 175-billion-parameter "Davinci" configuration, so that users could trade capability against cost.

4. Beyond GPT: Exploring Next-Generation Language Models

Many language models have been proposed after OpenAI's GPT model, exploring different dimensions and paradigms. Some focus on the scalability and efficiency of the model, fostering efficient inference through model quantization and acceleration for devices with limited resources. Others focus on making these models converge faster, or on reducing the number of parameters while keeping performance the same. For instance, some work has explored the "slimmest" possible GPT by pruning the least important tokens, channels, and layers, yielding a leaner model. In parallel, innovators have proposed new models that defy the basic paradigm driving GPT and excel in directions not reachable by simply extending GPT-style models.
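Model quantization, one of the efficiency directions mentioned above, can be illustrated with a minimal symmetric int8 scheme. This is a sketch of the general idea (store weights as 8-bit integers plus one float scale), not any particular library's method:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: map floats into [-127, 127]
    using a single scale, quartering memory versus float32."""
    scale = max(np.abs(w).max() / 127.0, 1e-12)  # guard against all-zero tensors
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights; error is bounded by scale/2."""
    return q.astype(np.float32) * scale
```

Real deployments refine this with per-channel scales and outlier handling, but even this naive scheme shows why quantization matters: a 7-billion-parameter model drops from ~28 GB of float32 weights to ~7 GB.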

One approach is to make the base model more efficient, though harder to train, by reducing the multiplicity of transformer layers. Starting from a slow GPT system, one can then layer in multiple improvements and scalability until the result far outstrips the minimal starting point; comparing how performance grows across different GPT-based starting points gives an approximate estimate of how much each change contributes. Another school of thought advises against further extension of GPT for a different reason: the underlying model may simply not benefit from such effort, even though developments in other applications of GPT could still yield useful improvements. Regardless of the rationale, approaching the problem from a different understanding adds a fresh perspective on next-generation models, whether through better understanding of existing models or through finding new advantages.

5. Llama: The Latest Innovation in AI Language Models

In February 2023, Meta AI introduced Llama, an open family of language models that represents an evolution of the GPT lineage. Like GPT-3, Llama is released in multiple sizes that share the same architecture but vary in the number of parameters: 7 billion, 13 billion, 33 billion, and 65 billion. Unlike GPT-3, Llama achieves its accuracy at a fraction of the size: the 13-billion-parameter Llama outperforms the 175-billion-parameter GPT-3 on most benchmarks. Testing revealed that Llama can generate more coherent text than GPT-3, writing fewer implausible or ungrammatical sentences, and that it outperforms GPT-3 in commonsense reasoning and factual recall. In this context, "commonsense reasoning" specifically tested completions that require background information not explicitly present in the prompt. While still imperfect, these results exceed GPT-3 and point toward the future of language models.

This performance surpassed that of other large, state-of-the-art language models at far smaller scales. It is still far from perfect, however: Llama remains exploitable in adversarial settings to a similar extent as GPT-3, producing substantial error rates on adversarial evaluation sets. To benefit the transparency and ethics of the AI community, Llama's weights were released to researchers, allowing the community to inspect, fine-tune, and probe the model in ways that closed APIs do not permit. Llama also incorporates architectural refinements that reduce the memory and compute requirements of its transformer.

5.1. Introduction to Llama

Llama is among the newest AI language models, similar in spirit to GPT-3. This section introduces Llama to an audience already familiar with language models in general, drawing on the paper "LLaMA: Open and Efficient Foundation Language Models". The goal is to convey the key aspects of Llama and to place it in the context of the development of AI language models from GPT-3 onward.

Llama is trained on between 1.0 and 1.4 trillion tokens drawn exclusively from publicly available data, with models ranging from 7 billion to 65 billion parameters. Like GPT-3, it is a decoder-only transformer that encodes a prompt as a sequence of tokens and predicts the next token autoregressively, but it departs from GPT-3 in several architectural details. It normalizes the input of each transformer sub-layer rather than the output, using RMSNorm, to improve training stability; it replaces the ReLU activation in the feed-forward blocks with SwiGLU; and it removes absolute positional embeddings in favor of rotary positional embeddings (RoPE), which encode position by rotating the query and key vectors at each layer of the network.
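One of these departures, pre-normalization with RMSNorm, is simple enough to sketch directly. A minimal NumPy illustration of the published formula (the learnable `weight` vector is initialized to ones here for demonstration):

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm as used by Llama: rescale activations by their
    root-mean-square instead of subtracting a mean as LayerNorm does."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * weight
```

Compared with LayerNorm, RMSNorm drops the mean-centering step and the bias term, which makes it slightly cheaper while, per the Llama paper, keeping training stable when applied to sub-layer inputs.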

5.2. Key Features and Improvements

The Llama models incorporate several improvements on top of the GPT-3 recipe: pre-normalization of each sub-layer with RMSNorm; the SwiGLU activation function in place of ReLU; rotary positional embeddings instead of absolute ones; a byte-pair-encoding tokenizer that falls back to bytes for characters it cannot decompose; an efficient causal attention implementation that avoids storing masked attention weights; activation checkpointing to reduce recomputation during the backward pass; and careful filtering and deduplication of the publicly available training data.

Some curious readers may hypothesize from the above that Llama is merely GPT-3.5 in disguise. That could not be further from the truth. Llama's design choices are a direct, organic response to a wide array of limitations observed in GPT-3-class models, and both the set of technologies employed and the changes made to the GPT recipe should be viewed as an outcome of the quest for a qualitatively better model. Most of these modifications share a common "multilevel" nature: they change the architecture, the training methodology, and the external setup for development, training, and fine-tuning alike. Consequently, Llama is best regarded as a distinct product in the evolution of AI language models: a mix of revamped technologies that synthesizes the best of the GPT experience with a much-needed antidote to the downfalls of the past.

6. Applications and Implications of Advanced Language Models

It has been decades since researchers started looking for ways to teach machines to understand and interpret human language, and each generation of language models has made its predecessors look primitive in hindsight, with newer and improved versions rendering the old ones obsolete. But what can these language models actually be used for?

When we train a language model on large-scale text data, certain capabilities emerge after training. The model can predict the next word after a sentence or generate new text in any style and on any topic. But, as Greg Brockman of OpenAI has noted, that is really just the tip of the iceberg: the trained model can semantically analyze and reason over what it learned from the training data.

Machine learning and AI systems are not programmed with explicit rules but learn from patterns and exceptions in real-world data. Greg Brockman, in a recorded conversation, foresees chatbots with real-time knowledge of everything from last night's Lakers game to the structure of the periodic table. With lightning-fast fact-checking, GPT-3-class language models can help you build arguments and challenge or verify others' statements. They are used in economic and e-commerce applications ranging from identifying underperforming job ads to evaluating financial documents. These models can help organize, summarize, and analyze business records, improve business processes, and draft and rebut arguments. They can report on their own strengths and limitations, spot misleading data, and help separate scientific fact from fiction, and they can be designed to scale with data.

7. Conclusion and Future Directions

We have taken a cursory glance at a number of different approaches to language and understanding text, some 60 years apart from one another. On our historical whistle-stop tour of NLP, we’ve seen that explanatory ambition can range widely, and that the basis and scope of these explanations can swing wildly even within the same algorithmic tradition. GPT-3 and friends are different from their psycho-linguistic forebears in lots of ways. They are, for one thing, big. This article aims to offer an insight into how we might move beyond the “technical account” that we have presented here. There can be little doubt that LLMs have had an impact on key areas of linguistic technology and on the market for language understanding. They have as yet had very little practical impact in fields such as language teaching or linguistics. A tantalizing direction for future research could be linguistic explorations that have LLMs at their center, rather than the fringes of a conversation about the formal basis of NLP. The path from GPT to Llama is vast and challenging to traverse. It represents an exciting direction for future work.

Language models generate text by learning to predict what comes next, given what has come before, one word at a time. The GPT models, powered by a central architecture called the Transformer, have achieved several impressive milestones along the way. This essay has offered a high-level, comparative, and gently technical roadmap of how we got here: from the behaviorist learning models of the 20th century, exemplified by Claude Shannon's information-theoretic counterpoint to B. F. Skinner, through the distributional and grammar-based statistical work of Brown and colleagues in the 1990s, to today's large neural language models with their explicit connections to psychology and language acquisition.
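The "predict what comes next" idea reduces, in its simplest possible form, to counting continuations. A toy bigram sketch on an invented corpus (real models replace these counts with a neural network over vast text collections):

```python
from collections import defaultdict, Counter

# Learn next-word counts from a tiny corpus, then generate by
# repeatedly emitting the most likely continuation.
corpus = "the cat sat on the mat and the cat sat on the floor".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def generate(start, length):
    out = [start]
    for _ in range(length):
        if out[-1] not in counts:
            break  # dead end: no observed continuation
        out.append(counts[out[-1]].most_common(1)[0][0])
    return " ".join(out)

print(generate("the", 4))  # → "the cat sat on the"
```

Everything from GPT to Llama is, at heart, a vastly more expressive version of this loop: condition on the prefix, score every possible next token, emit one, repeat.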
