
class: impact

The LLM Course

Chapter 1

Introduction to NLP, LLMs & Transformers

.center[Hugging Face Logo]

??? Welcome to the first chapter of the LLM Course! In this session, we'll dive into the foundational concepts of Natural Language Processing (NLP), explore the capabilities of Large Language Models (LLMs), and understand the revolutionary Transformer architecture that underpins many of these powerful tools.


Introduction to NLP, LLMs & Transformers

  • Understanding the Basics
  • Exploring Capabilities
  • How They Work

??? Today, we'll cover three main points: understanding the basics of these technologies, exploring their impressive capabilities, and getting an initial look at their inner workings.


What is Natural Language Processing (NLP)?

Definition: A field blending linguistics and machine learning to enable computers to understand and process human language.

Common Tasks:

  • Classifying sentences (sentiment, spam)
  • Classifying words (POS tagging, NER)
  • Generating text (autocomplete, translation, summarization)

??? So, what exactly is Natural Language Processing? It's an interdisciplinary field combining linguistics and machine learning, aiming to equip computers with the ability to understand, interpret, and process human language effectively. Common NLP tasks include classifying sentences (like sentiment analysis or spam detection), classifying individual words (such as Part-of-Speech (POS) tagging or Named Entity Recognition (NER) to identify entities like people, locations, and organizations), and generating text (for applications like autocomplete, machine translation, or document summarization).


The Rise of Large Language Models (LLMs)

Shift: From task-specific models to versatile, large-scale models.

Definition: Massive models trained on vast text data.

Key Characteristics:

  • Versatility: Handle multiple NLP tasks.
  • Power: Learn intricate language patterns.
  • Emergent Abilities: Show unexpected capabilities at scale.

Examples: GPT, Llama, BERT

??? Historically, NLP often involved creating specialized models for each specific task. However, the landscape shifted dramatically with the emergence of LLMs. These are exceptionally large neural networks trained on enormous datasets of text and code. Key characteristics distinguish LLMs: they exhibit remarkable versatility, capable of handling diverse NLP tasks within one framework; they possess immense power, learning complex language patterns that lead to state-of-the-art performance; and they often display emergent abilities – unexpected capabilities that arise as model size increases. Prominent examples include models like the Generative Pre-trained Transformer (GPT) series, Llama, and Bidirectional Encoder Representations from Transformers (BERT). These represent a paradigm shift towards leveraging single, powerful, adaptable models for a multitude of language challenges.


The Growth of Transformer Models

.center[Model Parameters Growth]

??? This chart illustrates the exponential growth in the size of Transformer models, measured by their number of parameters, over recent years. While there are exceptions like DistilBERT (intentionally designed to be smaller and faster), the dominant trend has been towards larger models to achieve higher performance. This scaling has unlocked significant capabilities but also brings challenges related to computational demands and environmental footprint. The clear trajectory from millions to billions, and even trillions, of parameters highlights how model scale has been a primary driver of recent advancements in NLP.


Environmental Impact of LLMs

.center[Carbon Footprint]

??? Training these massive LLMs comes with a substantial environmental cost, as depicted in this chart showing the carbon footprint associated with training a large model. A single training run can have a carbon footprint comparable to the lifetime emissions of several cars. This underscores the critical importance of sharing pretrained models. Platforms like the Hugging Face Hub facilitate this sharing, allowing the community to build upon existing models instead of repeatedly training from scratch. This collaborative approach significantly reduces the collective computational burden and environmental impact, distributing the cost across numerous users and applications.


Why is Language Processing Hard?

Human Understanding: Context, nuance, sarcasm are easy for humans, hard for machines.

Machine Needs: Text requires processing into a machine-understandable format.

LLM Challenges: Despite advances, LLMs still struggle with the full complexity of human language and can inherit biases from their training data.

??? Why does processing human language pose such a challenge for computers? Humans effortlessly grasp context, nuance, implied meanings, and even sarcasm. Machines, however, require language to be converted into a structured, numerical format they can process. While LLMs represent a huge leap forward, they still grapple with the full richness and complexity of human communication. We rely on shared knowledge, cultural context, and subtle cues that are difficult to encode explicitly. Even today's most advanced models can struggle with these aspects and may also inherit biases from the vast datasets they are trained on.


The Hugging Face Ecosystem

.center[Hugging Face Hub]

  • Thousands of pretrained models
  • Datasets for various tasks
  • Spaces for demos
  • Community collaboration
  • Used by major companies and organizations

??? The Hugging Face ecosystem plays a pivotal role in the modern NLP and AI landscape, serving as a central hub for collaboration and resource sharing. Many leading companies and research organizations utilize and contribute to this platform. Key components include the Model Hub, hosting thousands of freely downloadable pretrained models; extensive Datasets for training and evaluation; Spaces for hosting interactive demos; and libraries like 🤗 Transformers. This open and collaborative environment has significantly accelerated progress by making state-of-the-art AI accessible to a global community.


Transformers: What Can They Do?

Architecture: Introduced in 2017, powers most modern LLMs.

Accessibility: Libraries like Hugging Face 🤗 Transformers make them easy to use.

Tool Highlight: The pipeline() function.

.center[Hugging Face logo]

??? Let's focus on the Transformer architecture. Introduced in the seminal 2017 paper 'Attention Is All You Need', this architecture forms the backbone of most contemporary LLMs. Libraries like Hugging Face 🤗 Transformers simplify the process of using these powerful models. Before Transformers, architectures like Long Short-Term Memory networks (LSTMs) were common for sequence tasks, but they struggled with long-range dependencies in text and were less efficient to train due to their sequential nature. Transformers addressed these limitations, particularly through the innovative 'attention mechanism', enabling better handling of long sequences and more parallelizable, efficient training on massive datasets.


The pipeline() Function

Simplicity: Perform complex tasks with minimal code.

Example: Sentiment Analysis

```python
from transformers import pipeline

classifier = pipeline('sentiment-analysis')
classifier('This is a great course!')
```

Output:

```
[{'label': 'POSITIVE', 'score': 0.9998}]
```

??? A great entry point into using 🤗 Transformers is the pipeline() function. It provides a high-level abstraction, allowing you to perform complex NLP tasks with minimal code. As shown in the example, sentiment analysis takes just three lines. The pipeline() handles the necessary steps behind the scenes: preprocessing the input text (tokenization), feeding it to the model for inference, and post-processing the model's output into a human-readable format. This ease of use significantly lowers the barrier to entry and has been instrumental in the broad adoption of Transformer models.


pipeline() Task Examples (Text)

  • Sentiment Analysis: Determine if text is positive/negative
  • Named Entity Recognition: Identify people, organizations, locations
  • Question Answering: Extract answers from context
  • Text Generation: Complete prompts with new text
  • Summarization: Condense long documents
  • Translation: Convert text between languages
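
A minimal sketch of two of these tasks through the same interface; no model is specified, so the library falls back to its default checkpoint for each task and exact outputs will vary:

```python
from transformers import pipeline

# Named Entity Recognition: group sub-word pieces back into whole entities
ner = pipeline("ner", grouped_entities=True)
print(ner("Hugging Face was founded in New York City."))

# Summarization: condense a longer passage into a few sentences
summarizer = pipeline("summarization")
print(summarizer(
    "Transformers are a family of neural network architectures introduced in 2017. "
    "They rely on an attention mechanism instead of recurrence, which makes training "
    "highly parallelizable and lets the model relate distant words in a sequence.",
    max_length=40,
))
```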

??? The pipeline() function is incredibly versatile, supporting numerous NLP tasks out-of-the-box. For text-based applications, this includes: Sentiment Analysis (determining positive/negative tone), Named Entity Recognition (identifying entities like people, places, organizations), Question Answering (extracting answers from context), Text Generation (completing prompts or creating new text), Summarization (condensing documents), and Translation (converting between languages). Each task typically leverages a model fine-tuned specifically for that purpose, all accessible through the consistent pipeline() interface.


pipeline() Task Examples (Beyond Text)

Image Classification:

```python
from transformers import pipeline

img_classifier = pipeline('image-classification')
img_classifier('path/to/your/image.jpg')
```

Automatic Speech Recognition:

```python
from transformers import pipeline

transcriber = pipeline('automatic-speech-recognition')
transcriber('path/to/your/audio.flac')
```

??? The power of Transformers extends beyond text. The pipeline() function also supports tasks in other modalities, demonstrating the architecture's flexibility. Examples include Image Classification (identifying objects in images) and Automatic Speech Recognition (ASR - transcribing spoken audio to text). This ability to handle multimodal data stems from adapting the core Transformer concepts, like the attention mechanism, to process sequences derived from images (e.g., patches) or audio (e.g., frames).


How Do Transformers Work?

.center[Transformer Architecture]

??? At its core, the original Transformer architecture, as shown in this diagram from the 'Attention Is All You Need' paper, comprises two main blocks: an Encoder (left side) and a Decoder (right side). The Encoder's role is to process the input sequence and build a rich representation of it. The Decoder then uses this representation, along with the sequence generated so far, to produce the output sequence. The groundbreaking element connecting these is the attention mechanism, which enables the model to dynamically weigh the importance of different parts of the input sequence when generating each part of the output.


The Attention Mechanism

Core Idea: Allow the model to focus on relevant parts of the input when processing each word.

Example: In translating "You like this course" to French:

  • When translating "like", attention focuses on "You" (for conjugation)
  • When translating "this", attention focuses on "course" (for gender agreement)

Key Innovation: Enables modeling of long-range dependencies in text.
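
The underlying computation can be sketched in a few lines of NumPy. This is a toy illustration of scaled dot-product attention with random vectors, not any particular model's implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V — the core computation behind attention."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # how relevant each key is to each query
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights

# Toy self-attention over 4 "token" vectors of dimension 8
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
_, attn = scaled_dot_product_attention(x, x, x)
print(attn.round(2))  # row i shows how much token i attends to every other token
```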

??? The attention mechanism is arguably the most crucial innovation of the Transformer architecture. It empowers the model to selectively focus on the most relevant parts of the input sequence when processing or generating each element (like a word). Consider translating 'You like this course' to French: To correctly conjugate 'like' (aimer), the model must attend to 'You'. To choose the correct form of 'this' (ce/cet/cette), it needs to attend to 'course' to determine its gender. This dynamic weighting of input elements allows Transformers to effectively model long-range dependencies and contextual relationships in data, a significant advantage over earlier sequential models.


Transformer Architectures: Overview

Most models fall into three main categories:

  1. Encoder-only: Best for understanding tasks (BERT)
  2. Decoder-only: Best for generation tasks (GPT)
  3. Encoder-Decoder: Best for transformation tasks (T5)

??? While based on the original design, most modern Transformer models utilize one of three main architectural patterns: 1. Encoder-only: Uses just the encoder stack. Excels at tasks requiring a deep understanding of the entire input sequence (e.g., BERT). 2. Decoder-only: Uses just the decoder stack. Ideal for generative tasks where the output is produced sequentially based on previous context (e.g., GPT). 3. Encoder-Decoder: Uses both stacks. Suited for sequence-to-sequence tasks that transform an input sequence into a new output sequence (e.g., T5, BART). Each architecture is optimized for different kinds of problems, so understanding their characteristics is key to selecting the appropriate model.


Architecture 1: Encoder-only

Examples: BERT, RoBERTa

How it works: Uses only the encoder part. Attention layers access the entire input sentence (bi-directional).

Best for: Tasks requiring full sentence understanding.

  • Sentence Classification
  • Named Entity Recognition (NER)
  • Extractive Question Answering

.center[BERT Architecture]

??? Encoder-only models, exemplified by BERT and RoBERTa, leverage solely the encoder component of the Transformer. A key characteristic is their use of bi-directional attention, meaning that when processing any given word, the attention layers can access information from all other words in the input sentence, both preceding and succeeding. This allows the model to build a deep contextual understanding of the entire input. Consequently, these models excel at tasks demanding comprehensive input analysis, such as sentence classification, Named Entity Recognition (NER), and extractive question answering (where the answer is a span within the input text). They are generally not used for free-form text generation.


Architecture 2: Decoder-only

Examples: GPT, Llama

How it works: Uses only the decoder part. Predicts the next word based on preceding words (causal/auto-regressive attention).

Best for: Text generation tasks.

  • Auto-completion
  • Creative Writing
  • Chatbots

.center[GPT Architecture]
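
As a quick illustration, a text-generation pipeline with a small decoder-only checkpoint (gpt2 is assumed here purely because it is small and fast to download):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
generator(
    "In this course, we will teach you how to",
    max_new_tokens=30,        # length of the continuation
    num_return_sequences=2,   # two alternative completions
)
```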

??? Decoder-only models, such as those in the GPT family and Llama, utilize only the decoder stack. They operate differently from encoders: attention is causal or auto-regressive. This means that when predicting the next word (or token), the model can only attend to the words that came before it in the sequence, plus the current position. This inherent structure makes them exceptionally well-suited for text generation tasks, where the goal is to predict subsequent tokens based on preceding context. Applications include auto-completion, creative writing assistants, and chatbots. Indeed, many of the most prominent modern LLMs fall into this category.


Architecture 3: Encoder-Decoder

Examples: BART, T5, MarianMT

How it works: Uses both encoder (processes input) and decoder (generates output). Attention in the decoder can access encoder outputs.

Best for: Sequence-to-sequence tasks (transforming input to output).

  • Translation
  • Summarization
  • Generative Question Answering

.center[Encoder-Decoder Architecture]
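
For instance, translation with a MarianMT checkpoint; the French-to-English model name below is one assumed example:

```python
from transformers import pipeline

# The encoder reads the French sentence; the decoder generates the English one.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")
```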

??? Encoder-Decoder models, also known as sequence-to-sequence models (e.g., BART, Text-to-Text Transfer Transformer (T5), MarianMT for translation), employ the complete original Transformer structure. The encoder first processes the entire input sequence to create a comprehensive representation. The decoder then generates the output sequence token by token, utilizing both its own previously generated tokens (causal attention) and the encoder's output representation (cross-attention). This architecture excels at tasks that involve transforming an input sequence into a different output sequence, such as machine translation, text summarization, or generative question answering (where the answer is generated, not just extracted). They effectively combine the input understanding strengths of encoders with the text generation capabilities of decoders.


Model Architectures and Tasks

| Architecture    | Examples      | Best For                  |
|-----------------|---------------|---------------------------|
| Encoder-only    | BERT, RoBERTa | Classification, NER, Q&A  |
| Decoder-only    | GPT, LLaMA    | Text generation, Chatbots |
| Encoder-Decoder | BART, T5      | Translation, Summarization |

??? This table provides a concise summary of the three primary Transformer architectures and their typical applications. To reiterate: Encoder-only models (like BERT) are strong for tasks requiring deep input understanding (classification, NER). Decoder-only models (like GPT, Llama) excel at generating text sequentially (chatbots, creative writing). Encoder-Decoder models (like BART, T5) are ideal for transforming input sequences into output sequences (translation, summarization). Selecting the most appropriate architecture is a critical initial decision in designing an effective NLP solution, as each is optimized for different strengths.


Causal Language Modeling

.col-6[
Key Characteristics:

  • Used by decoder models (GPT family)
  • Predicts next word based on previous words
  • Unidirectional attention (left-to-right)
  • Well-suited for text generation
  • Examples: GPT, LLaMA, Claude
]
.col-6[
.center[Causal Language Modeling]
]
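
A small sketch of what "predict the next word" looks like in practice, assuming gpt2 as a stand-in causal model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                # (batch, sequence_length, vocab_size)

next_token_probs = logits[0, -1].softmax(dim=-1)   # distribution over the next token
top = next_token_probs.topk(5)
print([tokenizer.decode(int(i)) for i in top.indices])  # the five most likely continuations
```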

??? Causal Language Modeling (CLM) is the training objective typically associated with decoder-only models (like the GPT family, Llama, Claude). The core task is to predict the next token in a sequence, given all the preceding tokens. This inherently relies on unidirectional or causal attention – the model can only 'look back' at previous tokens and the current position. The term 'causal' highlights that the prediction at any point depends only on past information, mirroring the natural flow of language generation. This auto-regressive process is fundamental to how these models generate coherent text, one token after another.


Masked Language Modeling

.col-6[
Key Characteristics:

  • Used by encoder models (BERT family)
  • Masks random words and predicts them
  • Bidirectional attention (full context)
  • Well-suited for understanding tasks
  • Examples: BERT, RoBERTa, DeBERTa
]
.col-6[
.center[Masked Language Modeling]
]

??? Masked Language Modeling (MLM) is the pretraining strategy characteristic of encoder-only models (like BERT, RoBERTa, DeBERTa). During training, a certain percentage of input tokens are randomly replaced with a special [MASK] token. The model's objective is then to predict the original identity of these masked tokens, using the surrounding unmasked context. This task necessitates bidirectional attention, as the model needs to consider context from both the left and the right of the mask to make an accurate prediction. MLM forces the model to develop a deep understanding of word meanings in context, making it highly effective for downstream tasks requiring rich text representations, such as classification or NER.


Transfer Learning in NLP

The key to efficient NLP:

  • Train once on general data
  • Adapt to specific tasks
  • Reduces computational costs
  • Democratizes access to advanced AI

??? Transfer learning is a cornerstone technique in modern NLP. The core idea is to leverage the knowledge captured by a model trained on a massive, general dataset (pretraining) and adapt it to a new, specific task (fine-tuning). This approach significantly reduces the need for task-specific data and computational resources compared to training from scratch. It has democratized access to powerful AI, enabling smaller teams and individuals to build sophisticated applications by standing on the shoulders of large pretrained models. Let's examine the two key phases: pretraining and fine-tuning.


Pretraining Phase

.col-6[
Characteristics:

  • Train on massive datasets (billions of words)
  • Learn general language patterns and representations
  • Computationally expensive (can cost millions)
  • Usually done once by research labs or companies
  • Foundation for many downstream tasks
]
.col-6[
.center[Transfer Learning]
]

??? The pretraining phase involves training a model from scratch on an enormous corpus of text (often billions or trillions of words). The goal is to learn general statistical patterns, grammar, and world knowledge embedded in the language. This is achieved using self-supervised objectives like Masked Language Modeling (MLM) or Causal Language Modeling (CLM), which don't require manual labels. Pretraining is extremely computationally intensive, often costing millions of dollars and requiring specialized hardware. It's typically performed once by large research labs or companies, creating foundational models that others can then adapt.


Fine-tuning Phase

.col-6[
Characteristics:

  • Adapt pretrained model to specific tasks
  • Use smaller, task-specific datasets
  • Much less expensive than pretraining
  • Can be done on consumer hardware
  • Preserves general knowledge while adding specialized capabilities
]
.col-6[
.center[Fine-tuning]
]
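
A compressed sketch of what fine-tuning can look like with the Trainer API; the dataset (imdb), checkpoint, sample size, and hyperparameters here are illustrative assumptions, not a recommended recipe:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

raw = load_dataset("imdb")  # small labeled task-specific dataset
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True)

tokenized = raw.map(tokenize, batched=True)

args = TrainingArguments(output_dir="bert-imdb", num_train_epochs=1,
                         per_device_train_batch_size=8)
trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                  train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)))
trainer.train()  # adapts the pretrained weights to the sentiment task
```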

??? The fine-tuning phase takes a pretrained model and further trains it on a smaller, labeled dataset specific to a target task (e.g., sentiment analysis, medical text summarization). This adaptation process is significantly less computationally expensive than pretraining and often feasible on standard hardware. Fine-tuning allows the model to specialize its general language understanding, acquired during pretraining, for the nuances of the specific task. It effectively transfers the broad knowledge base and refines it for a particular application, requiring far less task-specific data than training from scratch.


Understanding LLM Inference

Inference: The process of using a trained model to generate predictions or outputs.

Key Components:

  • Input processing (tokenization)
  • Model computation
  • Output generation
  • Sampling strategies

??? Once a model is trained (either pretrained or fine-tuned), we use it for inference – the process of generating predictions or outputs for new, unseen inputs. For LLMs, inference typically means taking an input prompt and generating a textual response. Understanding the mechanics of inference is vital for deploying these models effectively, as it involves trade-offs between speed, cost, memory usage, and output quality. Key components include how the input is processed, the model's computations, how the output is generated step-by-step, and the strategies used to select the next token.


The Two-Phase Inference Process

1. Prefill Phase:

  • Process the entire input prompt
  • Compute initial hidden states
  • Computationally intensive for long prompts
  • Only happens once per generation

2. Decode Phase:

  • Generate one token at a time
  • Use previous tokens as context
  • Repeat until completion criteria met
  • Most of the generation time is spent here
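
A hedged sketch of both phases made explicit, assuming gpt2 as a small stand-in model; in practice model.generate() runs this loop for you:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Prefill: run the whole prompt once and keep the attention cache
prompt_ids = tokenizer("The Transformer architecture", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(prompt_ids, use_cache=True)
past, next_id = out.past_key_values, out.logits[:, -1].argmax(dim=-1, keepdim=True)

# Decode: feed only the newest token plus the cache, one step at a time
generated = [next_id]
for _ in range(10):
    with torch.no_grad():
        out = model(next_id, past_key_values=past, use_cache=True)
    past, next_id = out.past_key_values, out.logits[:, -1].argmax(dim=-1, keepdim=True)
    generated.append(next_id)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```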

??? LLM inference typically proceeds in two distinct phases: 1. Prefill Phase: The model processes the entire input prompt simultaneously. This involves calculating initial internal states (often called hidden states or activations) based on the prompt. This phase is compute-bound, especially for long prompts, as it involves parallel processing across the input length, but it only occurs once per generation request. Think of it as the model 'reading and understanding' the prompt. 2. Decode Phase: The model generates the output token by token, auto-regressively. In each step, it uses the context of the prompt and all previously generated tokens to predict the next token. This phase is memory-bandwidth-bound, as it involves sequential generation and accessing previously computed states. This is the 'writing' part, repeated until a stopping condition is met.


Sampling Strategies

Temperature: Controls randomness (higher = more creative, lower = more deterministic)

Top-k Sampling: Only consider the k most likely next tokens

Top-p (Nucleus) Sampling: Consider tokens covering p probability mass

Beam Search: Explore multiple possible continuations in parallel
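
These settings can be passed straight to generation; the prompt and the gpt2 checkpoint below are assumptions for illustration:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = "The future of AI is"

generator(prompt, do_sample=True, temperature=0.2, max_new_tokens=30)             # focused, near-deterministic
generator(prompt, do_sample=True, temperature=1.2, top_p=0.9, max_new_tokens=30)  # nucleus sampling, more diverse
generator(prompt, do_sample=True, top_k=50, max_new_tokens=30)                    # only the 50 most likely tokens are eligible
generator(prompt, do_sample=False, num_beams=4, max_new_tokens=30)                # beam search over 4 candidate sequences
```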

??? During the decode phase, selecting the next token isn't always about picking the single most probable one. Various sampling strategies allow control over the generation process: Temperature: Adjusts the randomness of predictions. Lower values (<1.0) make generation more focused and deterministic; higher values (>1.0) increase diversity and creativity, potentially at the cost of coherence. Top-k Sampling: Limits the selection pool to the 'k' most probable next tokens. Top-p (Nucleus) Sampling: Selects from the smallest set of tokens whose cumulative probability exceeds a threshold 'p'. Beam Search: Explores multiple potential sequences (beams) in parallel, choosing the overall most probable sequence at the end. These strategies enable tuning the output for different needs, balancing predictability and creativity.


Optimizing Inference

KV Cache: Store key-value pairs from previous tokens to avoid recomputation

Quantization: Reduce precision of model weights (e.g., FP16, INT8)

Batching: Process multiple requests together for better hardware utilization

Model Pruning: Remove less important weights to reduce model size
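
Batching in particular is a one-liner in practice. A minimal sketch, assuming a commonly used sentiment checkpoint; lower-precision loading is noted in a comment because it assumes a GPU:

```python
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    # On a GPU, passing torch_dtype=torch.float16 when loading roughly halves weight memory.
)

texts = ["Great course!", "Terrible weather today.", "The model is fast."]
# Batching processes several inputs per forward pass for better hardware utilization
print(classifier(texts, batch_size=8))
```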

??? Making LLM inference efficient is critical for practical deployment. Several optimization techniques are commonly used: KV Cache: Stores intermediate results (Key and Value projections from attention layers) for previously processed tokens, avoiding redundant computations during the decode phase. This significantly speeds up generation but increases memory usage. Quantization: Reduces the numerical precision of model weights (e.g., from 32-bit floats to 8-bit integers). This shrinks model size and speeds up computation, often with minimal impact on accuracy. Batching: Processing multiple input prompts simultaneously to maximize hardware (GPU) utilization. Model Pruning/Sparsification: Removing less important weights or structures from the model to reduce its size and computational cost. These techniques are essential for reducing latency, cost, and memory footprint in production.


Bias and Limitations

Source: Models learn from vast internet data, including harmful stereotypes (sexist, racist, etc.).

Problem: LLMs can replicate and amplify these biases.

Caution: Fine-tuning does not automatically remove underlying bias. Awareness is crucial.

??? Despite their impressive capabilities, LLMs have significant limitations and potential harms. A major concern is bias. Trained on vast datasets scraped from the internet, models inevitably learn and can replicate harmful societal biases present in that data – including sexist, racist, and other prejudiced stereotypes. They learn statistical patterns, not true understanding or ethical reasoning. Consequently, their outputs can reflect and even amplify these biases. It's crucial to remember that fine-tuning on specific data doesn't automatically eliminate these underlying biases learned during pretraining.


Bias Example: BERT Fill-Mask

Code:

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
print(unmasker("This man works as a [MASK]."))
print(unmasker("This woman works as a [MASK]."))
```

Results:

  • Man: lawyer, carpenter, doctor, waiter, mechanic
  • Woman: nurse, waitress, teacher, maid, prostitute

??? This code snippet provides a concrete example of gender bias using the BERT model in a fill-mask task. When prompted to predict occupations for 'This man works as a...' versus 'This woman works as a...', the model suggests stereotypical roles (e.g., 'lawyer', 'carpenter' for men; 'nurse', 'waitress' for women). This occurs even though BERT was pretrained on seemingly neutral sources like Wikipedia and BookCorpus. It starkly illustrates how models absorb and reflect societal biases present in their training data, highlighting the need for careful evaluation and mitigation when using them in real-world scenarios.


Sources of Bias in Models

Three main sources:

  1. Training Data: Biased data leads to biased models
  2. Pretrained Models: Bias persists through transfer learning
  3. Optimization Metrics: What we optimize for shapes model behavior

Mitigation approaches:

  • Careful dataset curation
  • Bias evaluation and monitoring
  • Fine-tuning with debiased data
  • Post-processing techniques

??? Bias in AI models can originate from several sources throughout the development lifecycle: 1. Training Data: The most direct source; biased data leads to biased models. If the data reflects societal stereotypes, the model will learn them. 2. Pretrained Models: Bias embedded in foundational models persists even after fine-tuning on new data (negative transfer). 3. Optimization Metrics & Objectives: The very definition of 'good performance' can inadvertently favor biased outcomes if not carefully designed. Addressing bias requires a multi-faceted approach, including careful dataset curation and filtering, rigorous evaluation and monitoring for bias, specialized fine-tuning techniques using debiased data or objectives, and potentially post-processing model outputs.


Summary & Recap

  • NLP & LLMs: Defined the field and the impact of large models.
  • Transformer Capabilities: Explored tasks solvable via pipeline().
  • Attention: Introduced the core mechanism.
  • Architectures: Covered Encoder-only, Decoder-only, Encoder-Decoder.
  • Transfer Learning: Pretraining and fine-tuning approach.
  • Inference: Understanding how models generate text.
  • Bias: Highlighted the importance of awareness.

??? Let's quickly recap the key topics we've covered in this chapter: We defined Natural Language Processing (NLP) and saw how Large Language Models (LLMs) represent a major advancement. We explored the diverse capabilities of Transformers using the simple pipeline() function for tasks ranging from classification to generation. We introduced the core concept of the attention mechanism. We differentiated between the three main Transformer architectures (Encoder-only, Decoder-only, Encoder-Decoder) and their typical use cases. We explained the crucial transfer learning paradigm involving pretraining and fine-tuning. We looked into the mechanics of LLM inference, including sampling and optimization. Finally, we highlighted the critical issues of bias and limitations in these models. This provides a solid foundation for the rest of the course.


Next Steps

  • Deeper dive into the 🤗 Transformers library.
  • Data processing techniques.
  • Fine-tuning models for specific needs.
  • Exploring advanced concepts.

??? Building on this foundation, the upcoming chapters will delve into more practical aspects. We will explore the 🤗 Transformers library in greater detail, learn essential data processing techniques for preparing text for models, walk through the process of fine-tuning pretrained models for specific tasks, and touch upon more advanced concepts and deployment strategies. Get ready to gain hands-on skills for effectively working with these powerful models.


class: center, middle

Thank You!

??? Thank you for joining this first chapter. We hope this introduction has sparked your interest, and we look forward to seeing you in the next lesson!