
Beyond the Hype: A Realistic Look at Large Language Models • Jodie Burchell • GOTO 2024



Introduction

Jodie Burchell
GOTO 2024

Thank you for the lovely introduction. I presented a version of this talk in Porto last year, and it has evolved since then. I appreciate you all joining me for this session. As Adele mentioned, my name is Jodie Burchell, and I'm currently a Developer Advocate at JetBrains. I've worked as a data scientist for nearly a decade, with a significant focus on natural language processing (NLP). One of my previous roles involved working with early large language models such as BERT. I've stayed connected with my former colleagues and have watched the rapid advances in AI with both excitement and concern.

One of my major concerns lies in the messaging surrounding large language models (LLMs). Today, I aim to provide a measured perspective on LLMs and cut through the hype that surrounds them.

Over the past two years, we've witnessed an intense AI hype cycle. Claims range from assertions that models like LaMDA exhibit sentience to predictions that generative models will replace significant portions of the white-collar job market. Even more sensational claims suggest we are on the brink of an AI apocalypse. This deluge of opinions can make it difficult for the average individual to discern the true utility and implications of these models—whether they are practical tools or just gimmicks.

Our focus over the next 40 minutes will be on the actual applications and limitations of LLMs. We'll explore the context and scientific foundations of these models while challenging some of the more extravagant claims, like the idea that we are nearing artificial general intelligence (AGI).

The recent emergence of models such as GPT-3.5 and GPT-4, the models behind ChatGPT, may seem abrupt, but they have roots in a long history of NLP research. Early language models aimed to automate text-based tasks that previously required extensive manual labor, such as text classification and summarization.

LLMs fall under a category of models known as neural networks, which were first proposed in the 1940s as a means to mimic the human brain. Although there were significant advances in the 1980s, the practical application of neural networks remained limited until the 21st century due to their substantial computational power requirements. Researchers discovered that larger models can learn from data more effectively, leading to improved accuracy in predictions. However, this also necessitated the development of efficient processing units adept at matrix multiplication.

The breakthrough came with the development of CUDA (Compute Unified Device Architecture), which transformed GPUs into machines optimized for matrix calculations, thereby facilitating the training of large neural networks. As models increased in size, they also became more data-hungry. This led to the development of the Common Crawl dataset, a massive collection of indexed web pages, which provided the necessary text data for training sophisticated language models.
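To make the matrix-multiplication point concrete, here is a minimal sketch (not from the talk) using PyTorch, which dispatches the operation to CUDA kernels when a GPU is available:

```python
import torch

# Use the GPU if one is available; fall back to the CPU otherwise.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Multiplying two 4096x4096 matrices takes roughly 69 billion
# multiply-add operations, which a GPU runs as thousands of
# parallel threads rather than a sequential CPU loop.
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)
c = a @ b

print(c.shape, "computed on", device)
```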

The modeling architecture also saw key innovations. For instance, Long Short-Term Memory (LSTM) networks, introduced in 1997, allowed researchers to capture complex relationships between words, dramatically improving performance on various NLP tasks. However, LSTMs process words sequentially, which limits their scalability.

The introduction of Transformer models marked a significant departure from this sequential approach by allowing parallel processing of data. This enabled researchers to build larger models that could learn richer representations of language. The Generative Pre-trained Transformer (GPT) family emerged from this architecture, with GPT-1 debuting in 2018. Many modern LLMs, including ChatGPT, build on it.

Examining GPT models designed by OpenAI, we see that the potential for LLMs extends beyond pure text generation. The initial focus on machine translation required a combination of an encoder model (to learn the source language) and a decoder (to generate the target language). Researchers soon realized that decoders could be effective on their own for various tasks. By training a model to predict the next word in a sequence from a single language input, they created a scalable training dataset that required less manual preparation.
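As an illustration of that next-word objective, here is a minimal sketch (not from the talk) that asks the small, openly available GPT-2 model, via Hugging Face's transformers library, to score candidate next tokens:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 is used here purely because it is small and openly available.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The cat sat on the", return_tensors="pt")

with torch.no_grad():
    # logits has shape (batch, sequence_length, vocabulary_size);
    # the last position scores every token as a possible continuation.
    logits = model(**inputs).logits

next_token_id = int(logits[0, -1].argmax())
print(tokenizer.decode([next_token_id]))  # most likely next word
```

Training simply pushes up the probability of the true next word across billions of such examples, which is why the dataset needs no manual labeling.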

Models have continued to evolve and grow, with parameter counts in GPT architectures skyrocketing from roughly 120 million in GPT-1 to a reported one trillion in GPT-4. The quality of LLM responses reflects these advances: GPT-1 produced text with little coherence or context, while more recent models provide rich, contextually appropriate outputs such as detailed essays.

The perception of LLMs in both the tech community and beyond is currently dominated by claims of AGI. These claims often confuse surface-level performance with the properties of actual intelligence. For instance, when chess champion Garry Kasparov lost to IBM's Deep Blue in 1997, many speculated that AGI was on the horizon. However, success on narrow tasks does not imply that these systems possess true intelligence.

Rather than treating impressive outputs as direct evidence of intelligence, we must acknowledge that models excel by optimizing for their training objectives and may resort to shortcuts. This is known as the "Kaggle effect": algorithms tailored to a specific task outperform humans on that task but struggle with anything outside the scope of their training.

To assess whether LLMs display real intelligence, we can look at the hierarchy of generalization proposed by AI researcher François Chollet. Its levels range from systems with narrow, skill-based generalization, which can memorize solutions and provide limited responses, to systems capable of broad, human-level generalization. Current LLMs fall into the narrow category, far short of true AGI.

A recent surge in claims about LLMs passing medical and legal exams has sparked concern that they could replace professionals in those fields. One researcher tested GPT-4 on coding problems and found that it performed well on problems from its training data but failed on equivalent problems released after its training cut-off, revealing how poorly LLMs generalize beyond what they have seen.

Despite these limitations, LLMs remain effective tools for natural language tasks. As I've emphasized, they should be deployed within their problem domains, such as translation, summarization, and question-answering. The question-answering capabilities of LLMs can be enhanced through techniques like retrieval-augmented generation (RAG), where the model retrieves supplementary information from external sources to answer queries accurately.

For a practical demonstration of RAG, using a substantial PDF document of PyCharm's documentation, we could build an application with LangChain, which allows users to query the text effectively. The process involves selecting an LLM, loading in the documentation, chunking the text, converting the chunks into embeddings, and retrieving relevant chunks to answer questions.
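A compressed sketch of those steps might look like the following. Import paths and class names shift between LangChain releases, and the file name, model names, and parameter values here are all illustrative, so treat this as an outline rather than a drop-in script:

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.chains import RetrievalQA

# 1. Load the documentation PDF (the path is a placeholder).
docs = PyPDFLoader("pycharm-documentation.pdf").load()

# 2. Chunk the text; chunk_size and chunk_overlap are tuning knobs.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# 3. Convert the chunks into embeddings and index them in a vector store.
store = FAISS.from_documents(chunks, OpenAIEmbeddings())

# 4. At query time, retrieve the k most relevant chunks and have the
#    LLM answer the question using them as context.
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o-mini"),
    retriever=store.as_retriever(search_kwargs={"k": 4}),
)

print(qa.invoke({"query": "How do I configure a Python interpreter?"})["result"])
```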

However, deploying LLMs presents challenges. The performance of RAG applications depends on tuning parameters such as the chunk size and the number of retrieved chunks. Not all LLMs are suitable for every task, and understanding their strengths and weaknesses is vital for good outcomes. To choose the right model for a given use case, developers should consult established benchmarks or create their own assessment datasets tailored to their specific domains.
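To illustrate that last point, a home-grown assessment set can be as simple as a handful of domain questions paired with phrases a correct answer should contain. The sketch below scores the qa chain from the previous example this way; the questions and expected phrases are invented, and a real evaluation would use richer metrics such as exact match, semantic similarity, or human review:

```python
# A toy domain-specific assessment set (contents are illustrative).
eval_set = [
    {"question": "How do I configure a Python interpreter?",
     "must_mention": "interpreter"},
    {"question": "How do I enable version control integration?",
     "must_mention": "git"},
]

hits = 0
for case in eval_set:
    # RetrievalQA returns a dict; the generated answer is under "result".
    answer = qa.invoke({"query": case["question"]})["result"]
    hits += case["must_mention"].lower() in answer.lower()

print(f"{hits}/{len(eval_set)} answers mention the expected phrase")
```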

In conclusion, LLMs are powerful yet limited tools. While they do not signify the advent of AGI, they have practical applications that require careful consideration. Addressing challenges such as tuning and performance measurement mirrors the long-standing issues we have faced within software development and machine learning. Thank you for your attention.


Keywords

  • Large Language Models (LLMs)
  • Natural Language Processing (NLP)
  • Artificial General Intelligence (AGI)
  • Machine Translation
  • Generative Pre-Trained Transformers (GPT)
  • Retrieval Augmented Generation (RAG)
  • Neural Networks
  • CUDA
  • Common Crawl
  • Chunking

FAQ

Q: What are Large Language Models?
A: Large Language Models (LLMs) are sophisticated AI models designed for processing and generating human language. They are primarily based on neural network architectures.

Q: How do LLMs differ from traditional AI systems?
A: LLMs use deep learning to generate context-aware text and perform tasks like translation and summarization, while traditional AI systems might rely on rule-based programming and are less adaptable.

Q: What is the significance of the term artificial general intelligence (AGI)?
A: AGI refers to a theoretical form of AI capable of understanding or learning any intellectual task that a human can do, which LLMs are currently not close to achieving.

Q: How does retrieval-augmented generation (RAG) work?
A: RAG combines the capabilities of LLMs with external data retrieval systems to answer questions more accurately by providing up-to-date context.

Q: What challenges do developers face when deploying LLMs?
A: Developers must fine-tune hyperparameters, understand the specific capabilities of different models, and evaluate performance against relevant benchmarks to ensure effective deployment.
