Spaces: Running

burtenshaw committed
Commit · 985b2b6
1 Parent(s): a5ddf9d

first commit
Browse files

- .DS_Store +0 -0
- .gitignore +18 -0
- chapter1/material/1.md +121 -0
- chapter1/material/10.md +258 -0
- chapter1/material/2.md +67 -0
- chapter1/material/3.md +410 -0
- chapter1/material/4.md +178 -0
- chapter1/material/5.md +250 -0
- chapter1/material/6.md +219 -0
- chapter1/material/7.md +241 -0
- chapter1/material/8.md +32 -0
- chapter1/material/9.md +66 -0
- chapter1/presentation.md +518 -0
- chapter1/template/index.html +22 -0
- chapter1/template/remark.min.js +0 -0
- chapter1/template/style.scss +259 -0
- pyproject.toml +7 -0
- scripts/create_video.py +218 -0
- scripts/transcription_to_audio.py +318 -0
.DS_Store
ADDED
Binary file (6.15 kB)
.gitignore
CHANGED
@@ -172,3 +172,21 @@ cython_debug/
 
 # PyPI configuration file
 .pypirc
+
+# block media files
+*.mp4
+*.mov
+*.avi
+*.mkv
+*.flv
+*.wmv
+*.wav
+
+# block images and pdf
+*.png
+*.jpg
+*.jpeg
+*.gif
+*.bmp
+*.tiff
+*.pdf
chapter1/material/1.md
ADDED
@@ -0,0 +1,121 @@
# Introduction[[introduction]]

<CourseFloatingBanner
    chapter={1}
    classNames="absolute z-10 right-0 top-0"
/>

## Welcome to the 🤗 Course![[welcome-to-the-course]]

<Youtube id="00GKzGyWFEs" />

This course will teach you about large language models (LLMs) and natural language processing (NLP) using libraries from the [Hugging Face](https://huggingface.co/) ecosystem — [🤗 Transformers](https://github.com/huggingface/transformers), [🤗 Datasets](https://github.com/huggingface/datasets), [🤗 Tokenizers](https://github.com/huggingface/tokenizers), and [🤗 Accelerate](https://github.com/huggingface/accelerate) — as well as the [Hugging Face Hub](https://huggingface.co/models). It's completely free and without ads.

## Understanding NLP and LLMs[[understanding-nlp-and-llms]]

While this course was originally focused on NLP (Natural Language Processing), it has evolved to emphasize Large Language Models (LLMs), which represent the latest advancement in the field.

**What's the difference?**
- **NLP (Natural Language Processing)** is the broader field focused on enabling computers to understand, interpret, and generate human language. NLP encompasses many techniques and tasks such as sentiment analysis, named entity recognition, and machine translation.
- **LLMs (Large Language Models)** are a powerful subset of NLP models characterized by their massive size, extensive training data, and ability to perform a wide range of language tasks with minimal task-specific training. Models like the Llama, GPT, or Claude series are examples of LLMs that have revolutionized what's possible in NLP.

Throughout this course, you'll learn about both traditional NLP concepts and cutting-edge LLM techniques, as understanding the foundations of NLP is crucial for working effectively with LLMs.

## What to expect?[[what-to-expect]]

Here is a brief overview of the course:

<div class="flex justify-center">
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/summary.svg" alt="Brief overview of the chapters of the course.">
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/summary-dark.svg" alt="Brief overview of the chapters of the course.">
</div>

- Chapters 1 to 4 provide an introduction to the main concepts of the 🤗 Transformers library. By the end of this part of the course, you will be familiar with how Transformer models work and will know how to use a model from the [Hugging Face Hub](https://huggingface.co/models), fine-tune it on a dataset, and share your results on the Hub!
- Chapters 5 to 8 teach the basics of 🤗 Datasets and 🤗 Tokenizers before diving into classic NLP tasks and LLM techniques. By the end of this part, you will be able to tackle the most common language processing challenges by yourself.
- Chapter 9 goes beyond NLP to cover how to build and share demos of your models on the 🤗 Hub. By the end of this part, you will be ready to showcase your 🤗 Transformers application to the world!
- Chapters 10 to 12 dive into advanced LLM topics like fine-tuning, curating high-quality datasets, and building reasoning models.

This course:

* Requires a good knowledge of Python
* Is better taken after an introductory deep learning course, such as [fast.ai's](https://www.fast.ai/) [Practical Deep Learning for Coders](https://course.fast.ai/) or one of the programs developed by [DeepLearning.AI](https://www.deeplearning.ai/)
* Does not expect prior [PyTorch](https://pytorch.org/) or [TensorFlow](https://www.tensorflow.org/) knowledge, though some familiarity with either of those will help

After you've completed this course, we recommend checking out DeepLearning.AI's [Natural Language Processing Specialization](https://www.coursera.org/specializations/natural-language-processing?utm_source=deeplearning-ai&utm_medium=institutions&utm_campaign=20211011-nlp-2-hugging_face-page-nlp-refresh), which covers a wide range of traditional NLP models like naive Bayes and LSTMs that are well worth knowing about!

## Who are we?[[who-are-we]]

About the authors:

[**Abubakar Abid**](https://huggingface.co/abidlabs) completed his PhD at Stanford in applied machine learning. During his PhD, he founded [Gradio](https://github.com/gradio-app/gradio), an open-source Python library that has been used to build over 600,000 machine learning demos. Gradio was acquired by Hugging Face, which is where Abubakar now serves as a machine learning team lead.

[**Ben Burtenshaw**](https://huggingface.co/burtenshaw) is a Machine Learning Engineer at Hugging Face. He completed his PhD in Natural Language Processing at the University of Antwerp, where he applied Transformer models to generate children's stories with the goal of improving literacy skills. Since then, he has focused on educational materials and tools for the wider community.

[**Matthew Carrigan**](https://huggingface.co/Rocketknight1) is a Machine Learning Engineer at Hugging Face. He lives in Dublin, Ireland and previously worked as an ML engineer at Parse.ly and before that as a post-doctoral researcher at Trinity College Dublin. He does not believe we're going to get to AGI by scaling existing architectures, but has high hopes for robot immortality regardless.

[**Lysandre Debut**](https://huggingface.co/lysandre) is a Machine Learning Engineer at Hugging Face and has been working on the 🤗 Transformers library since the very early development stages. His aim is to make NLP accessible for everyone by developing tools with a very simple API.

[**Sylvain Gugger**](https://huggingface.co/sgugger) is a Research Engineer at Hugging Face and one of the core maintainers of the 🤗 Transformers library. Previously he was a Research Scientist at fast.ai, and he co-wrote _[Deep Learning for Coders with fastai and PyTorch](https://learning.oreilly.com/library/view/deep-learning-for/9781492045519/)_ with Jeremy Howard. The main focus of his research is on making deep learning more accessible, by designing and improving techniques that allow models to train fast on limited resources.

[**Dawood Khan**](https://huggingface.co/dawoodkhan82) is a Machine Learning Engineer at Hugging Face. He's from NYC and graduated from New York University studying Computer Science. After working as an iOS Engineer for a few years, Dawood quit to start Gradio with his fellow co-founders. Gradio was eventually acquired by Hugging Face.

[**Merve Noyan**](https://huggingface.co/merve) is a developer advocate at Hugging Face, working on developing tools and building content around them to democratize machine learning for everyone.

[**Lucile Saulnier**](https://huggingface.co/SaulLu) is a machine learning engineer at Hugging Face, developing and supporting the use of open source tools. She is also actively involved in many research projects in the field of Natural Language Processing such as collaborative training and BigScience.

[**Lewis Tunstall**](https://huggingface.co/lewtun) is a machine learning engineer at Hugging Face, focused on developing open-source tools and making them accessible to the wider community. He is also a co-author of the O'Reilly book [Natural Language Processing with Transformers](https://www.oreilly.com/library/view/natural-language-processing/9781098136789/).

[**Leandro von Werra**](https://huggingface.co/lvwerra) is a machine learning engineer in the open-source team at Hugging Face and also a co-author of the O'Reilly book [Natural Language Processing with Transformers](https://www.oreilly.com/library/view/natural-language-processing/9781098136789/). He has several years of industry experience bringing NLP projects to production by working across the whole machine learning stack.

## FAQ[[faq]]

Here are some answers to frequently asked questions:

- **Does taking this course lead to a certification?**
Currently we do not have any certification for this course. However, we are working on a certification program for the Hugging Face ecosystem -- stay tuned!

- **How much time should I spend on this course?**
Each chapter in this course is designed to be completed in 1 week, with approximately 6-8 hours of work per week. However, you can take as much time as you need to complete the course.

- **Where can I ask a question if I have one?**
If you have a question about any section of the course, just click on the "*Ask a question*" banner at the top of the page to be automatically redirected to the right section of the [Hugging Face forums](https://discuss.huggingface.co/):

<img src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/forum-button.png" alt="Link to the Hugging Face forums" width="75%">

Note that a list of [project ideas](https://discuss.huggingface.co/c/course/course-event/25) is also available on the forums if you wish to practice more once you have completed the course.

- **Where can I get the code for the course?**
For each section, click on the banner at the top of the page to run the code in either Google Colab or Amazon SageMaker Studio Lab:

<img src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/notebook-buttons.png" alt="Link to the Hugging Face course notebooks" width="75%">

The Jupyter notebooks containing all the code from the course are hosted on the [`huggingface/notebooks`](https://github.com/huggingface/notebooks) repo. If you wish to generate them locally, check out the instructions in the [`course`](https://github.com/huggingface/course#-jupyter-notebooks) repo on GitHub.

- **How can I contribute to the course?**
There are many ways to contribute to the course! If you find a typo or a bug, please open an issue on the [`course`](https://github.com/huggingface/course) repo. If you would like to help translate the course into your native language, check out the instructions [here](https://github.com/huggingface/course#translating-the-course-into-your-language).

- **What were the choices made for each translation?**
Each translation has a glossary and `TRANSLATING.txt` file that details the choices that were made for machine learning jargon etc. You can find an example for German [here](https://github.com/huggingface/course/blob/main/chapters/de/TRANSLATING.txt).

- **Can I reuse this course?**
Of course! The course is released under the permissive [Apache 2 license](https://www.apache.org/licenses/LICENSE-2.0.html). This means that you must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. If you would like to cite the course, please use the following BibTeX:

```
@misc{huggingfacecourse,
  author = {Hugging Face},
  title = {The Hugging Face Course, 2022},
  howpublished = "\url{https://huggingface.co/course}",
  year = {2022},
  note = "[Online; accessed <today>]"
}
```

## Let's Go
Are you ready to roll? In this chapter, you will learn:

* How to use the `pipeline()` function to solve NLP tasks such as text generation and classification
* About the Transformer architecture
* How to distinguish between encoder, decoder, and encoder-decoder architectures and use cases
chapter1/material/10.md
ADDED
@@ -0,0 +1,258 @@
<!-- DISABLE-FRONTMATTER-SECTIONS -->

# End-of-chapter quiz[[end-of-chapter-quiz]]

<CourseFloatingBanner
    chapter={1}
    classNames="absolute z-10 right-0 top-0"
/>

This chapter covered a lot of ground! Don't worry if you didn't grasp all the details; the next chapters will help you understand how things work under the hood.

First, though, let's test what you learned in this chapter!

### 1. Explore the Hub and look for the `roberta-large-mnli` checkpoint. What task does it perform?

<Question
    choices={[
        {
            text: "Summarization",
            explain: "Look again at the <a href=\"https://huggingface.co/roberta-large-mnli\">roberta-large-mnli page</a>."
        },
        {
            text: "Text classification",
            explain: "More precisely, it classifies if two sentences are logically linked across three labels (contradiction, neutral, entailment) — a task also called <em>natural language inference</em>.",
            correct: true
        },
        {
            text: "Text generation",
            explain: "Look again at the <a href=\"https://huggingface.co/roberta-large-mnli\">roberta-large-mnli page</a>."
        }
    ]}
/>

### 2. What will the following code return?

```py
from transformers import pipeline

ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")
```

<Question
    choices={[
        {
            text: "It will return classification scores for this sentence, with labels \"positive\" or \"negative\".",
            explain: "This is incorrect — this would be a <code>sentiment-analysis</code> pipeline."
        },
        {
            text: "It will return a generated text completing this sentence.",
            explain: "This is incorrect — it would be a <code>text-generation</code> pipeline."
        },
        {
            text: "It will return the words representing persons, organizations or locations.",
            explain: "Furthermore, with <code>grouped_entities=True</code>, it will group together the words belonging to the same entity, like \"Hugging Face\".",
            correct: true
        }
    ]}
/>

### 3. What should replace ... in this code sample?

```py
from transformers import pipeline

filler = pipeline("fill-mask", model="bert-base-cased")
result = filler("...")
```

<Question
    choices={[
        {
            text: "This <mask> has been waiting for you.",
            explain: "This is incorrect. Check out the <code>bert-base-cased</code> model card and try to spot your mistake."
        },
        {
            text: "This [MASK] has been waiting for you.",
            explain: "Correct! This model's mask token is [MASK].",
            correct: true
        },
        {
            text: "This man has been waiting for you.",
            explain: "This is incorrect. This pipeline fills in masked words, so it needs a mask token somewhere."
        }
    ]}
/>

### 4. Why will this code fail?

```py
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
result = classifier("This is a course about the Transformers library")
```

<Question
    choices={[
        {
            text: "This pipeline requires that labels be given to classify this text.",
            explain: "Right — the correct code needs to include <code>candidate_labels=[...]</code>.",
            correct: true
        },
        {
            text: "This pipeline requires several sentences, not just one.",
            explain: "This is incorrect, though when properly used, this pipeline can take a list of sentences to process (like all other pipelines)."
        },
        {
            text: "The 🤗 Transformers library is broken, as usual.",
            explain: "We won't dignify this answer with a comment!"
        },
        {
            text: "This pipeline requires longer inputs; this one is too short.",
            explain: "This is incorrect. Note that a very long text will be truncated when processed by this pipeline."
        }
    ]}
/>

### 5. What does "transfer learning" mean?

<Question
    choices={[
        {
            text: "Transferring the knowledge of a pretrained model to a new model by training it on the same dataset.",
            explain: "No, that would be two versions of the same model."
        },
        {
            text: "Transferring the knowledge of a pretrained model to a new model by initializing the second model with the first model's weights.",
            explain: "Correct: when the second model is trained on a new task, it *transfers* the knowledge of the first model.",
            correct: true
        },
        {
            text: "Transferring the knowledge of a pretrained model to a new model by building the second model with the same architecture as the first model.",
            explain: "The architecture is just the way the model is built; there is no knowledge shared or transferred in this case."
        }
    ]}
/>

### 6. True or false? A language model usually does not need labels for its pretraining.

<Question
    choices={[
        {
            text: "True",
            explain: "The pretraining is usually <em>self-supervised</em>, which means the labels are created automatically from the inputs (like predicting the next word or filling in some masked words).",
            correct: true
        },
        {
            text: "False",
            explain: "This is not the correct answer."
        }
    ]}
/>

### 7. Select the sentence that best describes the terms "model", "architecture", and "weights".

<Question
    choices={[
        {
            text: "If a model is a building, its architecture is the blueprint and the weights are the people living inside.",
            explain: "Following this metaphor, the weights would be the bricks and other materials used to construct the building."
        },
        {
            text: "An architecture is a map to build a model and its weights are the cities represented on the map.",
            explain: "The problem with this metaphor is that a map usually represents one existing reality (there is only one city in France named Paris). For a given architecture, multiple weights are possible."
        },
        {
            text: "An architecture is a succession of mathematical functions to build a model and its weights are those functions' parameters.",
            explain: "The same set of mathematical functions (architecture) can be used to build different models by using different parameters (weights).",
            correct: true
        }
    ]}
/>

### 8. Which of these types of models would you use for completing prompts with generated text?

<Question
    choices={[
        {
            text: "An encoder model",
            explain: "An encoder model generates a representation of the whole sentence that is better suited for tasks like classification."
        },
        {
            text: "A decoder model",
            explain: "Decoder models are perfectly suited for text generation from a prompt.",
            correct: true
        },
        {
            text: "A sequence-to-sequence model",
            explain: "Sequence-to-sequence models are better suited for tasks where you want to generate sentences in relation to the input sentences, not a given prompt."
        }
    ]}
/>

### 9. Which of these types of models would you use for summarizing texts?

<Question
    choices={[
        {
            text: "An encoder model",
            explain: "An encoder model generates a representation of the whole sentence that is better suited for tasks like classification."
        },
        {
            text: "A decoder model",
            explain: "Decoder models are good for generating output text (like summaries), but they don't have the ability to exploit a context like the whole text to summarize."
        },
        {
            text: "A sequence-to-sequence model",
            explain: "Sequence-to-sequence models are perfectly suited for a summarization task.",
            correct: true
        }
    ]}
/>

### 10. Which of these types of models would you use for classifying text inputs according to certain labels?

<Question
    choices={[
        {
            text: "An encoder model",
            explain: "An encoder model generates a representation of the whole sentence which is perfectly suited for a task like classification.",
            correct: true
        },
        {
            text: "A decoder model",
            explain: "Decoder models are good for generating output texts, not extracting a label out of a sentence."
        },
        {
            text: "A sequence-to-sequence model",
            explain: "Sequence-to-sequence models are better suited for tasks where you want to generate text based on an input sentence, not a label."
        }
    ]}
/>

### 11. What possible sources can the bias observed in a model have?

<Question
    choices={[
        {
            text: "The model is a fine-tuned version of a pretrained model and it picked up its bias from it.",
            explain: "When applying Transfer Learning, the bias in the pretrained model used persists in the fine-tuned model.",
            correct: true
        },
        {
            text: "The data the model was trained on is biased.",
            explain: "This is the most obvious source of bias, but not the only one.",
            correct: true
        },
        {
            text: "The metric the model was optimizing for is biased.",
            explain: "A less obvious source of bias is the way the model is trained. Your model will blindly optimize for whatever metric you chose, without any second thoughts.",
            correct: true
        }
    ]}
/>
chapter1/material/2.md
ADDED
@@ -0,0 +1,67 @@
# Natural Language Processing and Large Language Models[[natural-language-processing-and-large-language-models]]

<CourseFloatingBanner
    chapter={1}
    classNames="absolute z-10 right-0 top-0"
/>

Before jumping into Transformer models, let's do a quick overview of what natural language processing is, how large language models have transformed the field, and why we care about it.

## What is NLP?[[what-is-nlp]]

NLP is a field of linguistics and machine learning focused on understanding everything related to human language. The aim of NLP tasks is not only to understand single words individually, but to be able to understand the context of those words.

The following is a list of common NLP tasks, with some examples of each:

- **Classifying whole sentences**: Getting the sentiment of a review, detecting if an email is spam, determining if a sentence is grammatically correct or whether two sentences are logically related or not
- **Classifying each word in a sentence**: Identifying the grammatical components of a sentence (noun, verb, adjective), or the named entities (person, location, organization)
- **Generating text content**: Completing a prompt with auto-generated text, filling in the blanks in a text with masked words
- **Extracting an answer from a text**: Given a question and a context, extracting the answer to the question based on the information provided in the context
- **Generating a new sentence from an input text**: Translating a text into another language, summarizing a text

NLP isn't limited to written text though. It also tackles complex challenges in speech recognition and computer vision, such as generating a transcript of an audio sample or a description of an image.

## What are Large Language Models (LLMs)?[[what-are-llms]]

Large Language Models (LLMs) are a type of artificial intelligence model designed to understand, generate, and manipulate human language. They represent a significant advancement in the field of natural language processing (NLP).

<Tip>

A large language model is an AI system trained on massive amounts of text data that can understand and generate human-like text, recognize patterns in language, and perform a wide variety of language tasks without task-specific training.

</Tip>

LLMs are generalist models that can perform an impressive range of tasks within a single model:

- Generate human-like text for creative writing, emails, or reports
- Answer questions based on their training data
- Summarize long documents
- Translate between languages
- Write and debug computer code
- Reason through complex problems

However, they also have important limitations:

- They can generate incorrect information confidently (hallucinations)
- They lack true understanding of the world and operate purely on statistical patterns
- They may reproduce biases present in their training data or inputs
- They have limited context windows (though this is improving)
- They require significant computational resources

## The Rise of Large Language Models (LLMs)[[rise-of-llms]]

In recent years, the field of NLP has been revolutionized by Large Language Models (LLMs). These models, which include architectures like GPT (Generative Pre-trained Transformer) and [Llama](https://huggingface.co/meta-llama), have transformed what's possible in language processing.

LLMs are characterized by:
- **Scale**: They contain millions, billions, or even hundreds of billions of parameters
- **General capabilities**: They can perform multiple tasks without task-specific training
- **In-context learning**: They can learn from examples provided in the prompt (see the short sketch after this list)
- **Emergent abilities**: As these models grow in size, they demonstrate capabilities that weren't explicitly programmed or anticipated

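To make "learning from examples provided in the prompt" concrete, here is a small, hedged sketch (not part of the original course) using the `text-generation` pipeline. The few-shot prompt format is the point; a small demo model such as `distilgpt2` will follow it far less reliably than the large LLMs discussed above.

```python
from transformers import pipeline

# Illustrative few-shot prompt: the examples inside the prompt are the only
# "training" signal the model sees -- its weights are never updated.
generator = pipeline("text-generation", model="distilgpt2")

prompt = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "cheese => fromage\n"
    "plush giraffe =>"
)

print(generator(prompt, max_new_tokens=5)[0]["generated_text"])
```
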
The advent of LLMs has shifted the paradigm from building specialized models for specific NLP tasks to using a single, large model that can be prompted or fine-tuned to address a wide range of language tasks. This has made sophisticated language processing more accessible while also introducing new challenges in areas like efficiency, ethics, and deployment.

## Why is language processing challenging?[[why-is-it-challenging]]

Computers don't process information in the same way as humans. For example, when we read the sentence "I am hungry," we can easily understand its meaning. Similarly, given two sentences such as "I am hungry" and "I am sad," we're able to easily determine how similar they are. For machine learning (ML) models, such tasks are more difficult. The text needs to be processed in a way that enables the model to learn from it. And because language is complex, we need to think carefully about how this processing must be done. There has been a lot of research done on how to represent text, and we will look at some methods in the next chapter.

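As a rough illustration of what "processing text so a model can learn from it" can look like, here is a hedged sketch that turns each sentence into a vector with the `feature-extraction` pipeline and compares the vectors with cosine similarity. The checkpoint `sentence-transformers/all-MiniLM-L6-v2`, the mean pooling, and the extra sentence are illustrative choices, not something prescribed by the course.

```python
import numpy as np
from transformers import pipeline

# One embedding per token; we average them into a single sentence vector.
extractor = pipeline("feature-extraction", model="sentence-transformers/all-MiniLM-L6-v2")

def embed(sentence):
    token_vectors = np.array(extractor(sentence)[0])  # shape: (num_tokens, hidden_size)
    return token_vectors.mean(axis=0)                 # crude mean pooling

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embed("I am hungry"), embed("I am starving")))  # closer in meaning
print(cosine(embed("I am hungry"), embed("I am sad")))       # further apart
```
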
Even with the advances in LLMs, many fundamental challenges remain. These include understanding ambiguity, cultural context, sarcasm, and humor. LLMs address these challenges through massive training on diverse datasets, but still often fall short of human-level understanding in many complex scenarios.
chapter1/material/3.md
ADDED
@@ -0,0 +1,410 @@
# Transformers, what can they do?[[transformers-what-can-they-do]]

<CourseFloatingBanner chapter={1}
  classNames="absolute z-10 right-0 top-0"
  notebooks={[
    {label: "Google Colab", value: "https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/en/chapter1/section3.ipynb"},
    {label: "AWS Studio", value: "https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/master/course/en/chapter1/section3.ipynb"},
]} />

In this section, we will look at what Transformer models can do and use our first tool from the 🤗 Transformers library: the `pipeline()` function.

<Tip>
👀 See that <em>Open in Colab</em> button on the top right? Click on it to open a Google Colab notebook with all the code samples of this section. This button will be present in any section containing code examples.

If you want to run the examples locally, we recommend taking a look at the <a href="/course/chapter0">setup</a>.
</Tip>

## Transformers are everywhere![[transformers-are-everywhere]]

Transformer models are used to solve all kinds of tasks across different modalities, including natural language processing (NLP), computer vision, audio processing, and more. Here are some of the companies and organizations using Hugging Face and Transformer models, who also contribute back to the community by sharing their models:

<img src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/companies.PNG" alt="Companies using Hugging Face" width="100%">

The [🤗 Transformers library](https://github.com/huggingface/transformers) provides the functionality to create and use those shared models. The [Model Hub](https://huggingface.co/models) contains thousands of pretrained models that anyone can download and use. You can also upload your own models to the Hub!

<Tip>
⚠️ The Hugging Face Hub is not limited to Transformer models. Anyone can share any kind of models or datasets they want! <a href="https://huggingface.co/join">Create a huggingface.co</a> account to benefit from all available features!
</Tip>

Before diving into how Transformer models work under the hood, let's look at a few examples of how they can be used to solve some interesting NLP problems.

## Working with pipelines[[working-with-pipelines]]

<Youtube id="tiZFewofSLM" />

The most basic object in the 🤗 Transformers library is the `pipeline()` function. It connects a model with its necessary preprocessing and postprocessing steps, allowing us to directly input any text and get an intelligible answer:

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")
```

```python out
[{'label': 'POSITIVE', 'score': 0.9598047137260437}]
```

We can even pass several sentences!

```python
classifier(
    ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
)
```

```python out
[{'label': 'POSITIVE', 'score': 0.9598047137260437},
 {'label': 'NEGATIVE', 'score': 0.9994558095932007}]
```

By default, this pipeline selects a particular pretrained model that has been fine-tuned for sentiment analysis in English. The model is downloaded and cached when you create the `classifier` object. If you rerun the command, the cached model will be used instead and there is no need to download the model again.

There are three main steps involved when you pass some text to a pipeline (sketched in code right after this list):

1. The text is preprocessed into a format the model can understand.
2. The preprocessed inputs are passed to the model.
3. The predictions of the model are post-processed, so you can make sense of them.

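Chapter 2 unpacks these steps in detail, but as a quick, hedged preview, this is roughly what the sentiment-analysis pipeline above does by hand. It assumes PyTorch and the checkpoint `distilbert-base-uncased-finetuned-sst-2-english` (the kind of English sentiment model the default pipeline downloads):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# 1. Preprocessing: raw text -> token IDs the model understands
inputs = tokenizer("I've been waiting for a HuggingFace course my whole life.", return_tensors="pt")

# 2. Model: token IDs -> raw scores (logits)
with torch.no_grad():
    logits = model(**inputs).logits

# 3. Post-processing: logits -> a human-readable label and probability
probabilities = torch.softmax(logits, dim=-1)[0]
predicted_id = int(probabilities.argmax())
print(model.config.id2label[predicted_id], round(float(probabilities[predicted_id]), 4))
```
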
## Available pipelines for different modalities

The `pipeline()` function supports multiple modalities, allowing you to work with text, images, audio, and even multimodal tasks. In this course we'll focus on text tasks, but it's useful to understand the Transformer architecture's potential, so we'll briefly outline it here.

Here's an overview of what's available:

### Text pipelines

- `sentiment-analysis`: Analyze whether text is positive or negative
- `text-classification`: Classify text into predefined categories
- `text-generation`: Generate text from a prompt
- `fill-mask`: Fill in missing words in text
- `summarization`: Create a shorter version of a text while preserving key information
- `translation`: Translate text from one language to another
- `question-answering`: Answer questions based on context
- `zero-shot-classification`: Classify text without prior training on specific labels
- `ner` (named entity recognition): Identify entities like people, locations, organizations
- `feature-extraction`: Extract vector representations of text

### Image pipelines

- `image-classification`: Identify what's in an image
- `object-detection`: Locate and identify objects in images
- `image-segmentation`: Identify which pixels belong to which object
- `depth-estimation`: Estimate the depth of objects in an image
- `image-to-text`: Generate text descriptions of images

### Audio pipelines

- `automatic-speech-recognition`: Convert speech to text
- `audio-classification`: Classify audio into categories
- `text-to-speech`: Convert text to spoken audio

### Multimodal pipelines

- `document-question-answering`: Answer questions about documents with text and images
- `visual-question-answering`: Answer questions about images

Let's explore some of these pipelines in more detail!

## Zero-shot classification[[zero-shot-classification]]

We'll start by tackling a more challenging task where we need to classify texts that haven't been labelled. This is a common scenario in real-world projects because annotating text is usually time-consuming and requires domain expertise. For this use case, the `zero-shot-classification` pipeline is very powerful: it allows you to specify which labels to use for the classification, so you don't have to rely on the labels of the pretrained model. You've already seen how the model can classify a sentence as positive or negative using those two labels — but it can also classify the text using any other set of labels you like.

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)
```

```python out
{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445963859558105, 0.111976258456707, 0.043427448719739914]}
```

This pipeline is called _zero-shot_ because you don't need to fine-tune the model on your data to use it. It can directly return probability scores for any list of labels you want!

<Tip>

✏️ **Try it out!** Play around with your own sequences and labels and see how the model behaves.

</Tip>


## Text generation[[text-generation]]

Now let's see how to use a pipeline to generate some text. The main idea here is that you provide a prompt and the model will auto-complete it by generating the remaining text. This is similar to the predictive text feature that is found on many phones. Text generation involves randomness, so it's normal if you don't get the same results as shown below.

```python
from transformers import pipeline

generator = pipeline("text-generation")
generator("In this course, we will teach you how to")
```

```python out
[{'generated_text': 'In this course, we will teach you how to understand and use '
                    'data flow and data interchange when handling user data. We '
                    'will be working with one or more of the most commonly used '
                    'data flows — data flows of various types, as seen by the '
                    'HTTP'}]
```

You can control how many different sequences are generated with the argument `num_return_sequences` and the total length of the output text with the argument `max_length`.

<Tip>

✏️ **Try it out!** Use the `num_return_sequences` and `max_length` arguments to generate two sentences of 15 words each.

</Tip>


## Using any model from the Hub in a pipeline[[using-any-model-from-the-hub-in-a-pipeline]]

The previous examples used the default model for the task at hand, but you can also choose a particular model from the Hub to use in a pipeline for a specific task — say, text generation. Go to the [Model Hub](https://huggingface.co/models) and click on the corresponding tag on the left to display only the supported models for that task. You should get to a page like [this one](https://huggingface.co/models?pipeline_tag=text-generation).

Let's try the [`distilgpt2`](https://huggingface.co/distilgpt2) model! Here's how to load it in the same pipeline as before:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)
```

```python out
[{'generated_text': 'In this course, we will teach you how to manipulate the world and '
                    'move your mental and physical capabilities to your advantage.'},
 {'generated_text': 'In this course, we will teach you how to become an expert and '
                    'practice realtime, and with a hands on experience on both real '
                    'time and real'}]
```

You can refine your search for a model by clicking on the language tags, and pick a model that will generate text in another language. The Model Hub even contains checkpoints for multilingual models that support several languages.

Once you select a model by clicking on it, you'll see that there is a widget enabling you to try it directly online. This way you can quickly test the model's capabilities before downloading it.

<Tip>

✏️ **Try it out!** Use the filters to find a text generation model for another language. Feel free to play with the widget and use it in a pipeline!

</Tip>

### The Inference API[[the-inference-api]]

All the models can be tested directly through your browser using the Inference API, which is available on the Hugging Face [website](https://huggingface.co/). You can play with the model directly on this page by inputting custom text and watching the model process the input data.

The Inference API that powers the widget is also available as a paid product, which comes in handy if you need it for your workflows. See the [pricing page](https://huggingface.co/pricing) for more details.

## Mask filling[[mask-filling]]

The next pipeline you'll try is `fill-mask`. The idea of this task is to fill in the blanks in a given text:

```python
from transformers import pipeline

unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)
```

```python out
[{'sequence': 'This course will teach you all about mathematical models.',
  'score': 0.19619831442832947,
  'token': 30412,
  'token_str': ' mathematical'},
 {'sequence': 'This course will teach you all about computational models.',
  'score': 0.04052725434303284,
  'token': 38163,
  'token_str': ' computational'}]
```

The `top_k` argument controls how many possibilities you want to be displayed. Note that here the model fills in the special `<mask>` word, which is often referred to as a *mask token*. Other mask-filling models might have different mask tokens, so it's always good to verify the proper mask word when exploring other models. One way to check it is by looking at the mask word used in the widget.

<Tip>

✏️ **Try it out!** Search for the `bert-base-cased` model on the Hub and identify its mask word in the Inference API widget. What does this model predict for the sentence in our `pipeline` example above?

</Tip>

## Named entity recognition[[named-entity-recognition]]

Named entity recognition (NER) is a task where the model has to find which parts of the input text correspond to entities such as persons, locations, or organizations. Let's look at an example:

```python
from transformers import pipeline

ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")
```

```python out
[{'entity_group': 'PER', 'score': 0.99816, 'word': 'Sylvain', 'start': 11, 'end': 18},
 {'entity_group': 'ORG', 'score': 0.97960, 'word': 'Hugging Face', 'start': 33, 'end': 45},
 {'entity_group': 'LOC', 'score': 0.99321, 'word': 'Brooklyn', 'start': 49, 'end': 57}
]
```

Here the model correctly identified that Sylvain is a person (PER), Hugging Face an organization (ORG), and Brooklyn a location (LOC).

We pass the option `grouped_entities=True` in the pipeline creation function to tell the pipeline to group together the parts of the sentence that correspond to the same entity: here the model correctly grouped "Hugging" and "Face" as a single organization, even though the name consists of multiple words. In fact, as we will see in the next chapter, the preprocessing even splits some words into smaller parts. For instance, `Sylvain` is split into four pieces: `S`, `##yl`, `##va`, and `##in`. In the post-processing step, the pipeline successfully regrouped those pieces.

<Tip>

✏️ **Try it out!** Search the Model Hub for a model able to do part-of-speech tagging (usually abbreviated as POS) in English. What does this model predict for the sentence in the example above?

</Tip>

## Question answering[[question-answering]]

The `question-answering` pipeline answers questions using information from a given context:

```python
from transformers import pipeline

question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)
```

```python out
{'score': 0.6385916471481323, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}
```

Note that this pipeline works by extracting information from the provided context; it does not generate the answer.

## Summarization[[summarization]]

Summarization is the task of reducing a text into a shorter text while keeping all (or most) of the important aspects referenced in the text. Here's an example:

```python
from transformers import pipeline

summarizer = pipeline("summarization")
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering graduates
    and a lack of well-educated engineers.
"""
)
```

```python out
[{'summary_text': ' America has changed dramatically during recent years . The '
                  'number of engineering graduates in the U.S. has declined in '
                  'traditional engineering disciplines such as mechanical, civil '
                  ', electrical, chemical, and aeronautical engineering . Rapidly '
                  'developing economies such as China and India, as well as other '
                  'industrial countries in Europe and Asia, continue to encourage '
                  'and advance engineering .'}]
```

Like with text generation, you can specify a `max_length` or a `min_length` for the result.


## Translation[[translation]]

For translation, you can use a default model if you provide a language pair in the task name (such as `"translation_en_to_fr"`), but the easiest way is to pick the model you want to use on the [Model Hub](https://huggingface.co/models). Here we'll try translating from French to English:

```python
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")
```

```python out
[{'translation_text': 'This course is produced by Hugging Face.'}]
```

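The task-name shortcut mentioned above would look something like the sketch below; which default checkpoint 🤗 Transformers resolves `translation_en_to_fr` to may change over time, so pinning a model as in the example above is the more reproducible option.

```python
from transformers import pipeline

# English -> French using whatever default model the task name maps to
translator = pipeline("translation_en_to_fr")
translator("This course is produced by Hugging Face.")
```
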
Like with text generation and summarization, you can specify a `max_length` or a `min_length` for the result.

<Tip>

✏️ **Try it out!** Search for translation models in other languages and try to translate the previous sentence into a few different languages.

</Tip>

## Image and audio pipelines

Beyond text, Transformer models can also work with images and audio. Here are a few examples:

### Image classification

```python
from transformers import pipeline

image_classifier = pipeline("image-classification")
result = image_classifier(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
)
print(result)
```

```python out
[{'score': 0.4585989117622375, 'label': 'lynx, catamount'},
 {'score': 0.10480582714080811, 'label': 'Egyptian cat'},
 {'score': 0.08250780403614044, 'label': 'tabby, tabby cat'},
 {'score': 0.0732262134552002, 'label': 'tiger cat'},
 {'score': 0.04910798370838165, 'label': 'cougar, puma, catamount, mountain lion, painter, panther, Felis concolor'}]
```

### Automatic speech recognition

```python
from transformers import pipeline

transcriber = pipeline("automatic-speech-recognition")
result = transcriber(
    "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac"
)
print(result)
```

```python out
{'text': 'I HAVE A DREAM BUT ONE DAY THIS NATION WILL RISE UP LIVE UP THE TRUE MEANING OF ITS CREED'}
```

## Combining data from multiple sources

One powerful application of Transformer models is their ability to combine and process data from multiple sources. This is especially useful when you need to:

1. Search across multiple databases or repositories
2. Consolidate information from different formats (text, images, audio)
3. Create a unified view of related information

For example, you could build a system that:
- Searches for information across databases in multiple modalities like text and image
- Combines results from different sources into a single coherent response (for example, from an audio file and a text description)
- Presents the most relevant information from a database of documents and metadata


## Conclusion

The pipelines shown in this chapter are mostly for demonstrative purposes. They were programmed for specific tasks and cannot perform variations of them. In the next chapter, you'll learn what's inside a `pipeline()` function and how to customize its behavior.
chapter1/material/4.md
ADDED
@@ -0,0 +1,178 @@
# How do Transformers work?[[how-do-transformers-work]]

<CourseFloatingBanner
    chapter={1}
    classNames="absolute z-10 right-0 top-0"
/>

In this section, we will take a high-level look at the architecture of Transformer models.

## A bit of Transformer history[[a-bit-of-transformer-history]]

Here are some reference points in the (short) history of Transformer models:

<div class="flex justify-center">
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/transformers_chrono.svg" alt="A brief chronology of Transformers models.">
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/transformers_chrono-dark.svg" alt="A brief chronology of Transformers models.">
</div>

The [Transformer architecture](https://arxiv.org/abs/1706.03762) was introduced in June 2017. The focus of the original research was on translation tasks. This was followed by the introduction of several influential models, including:

- **June 2018**: [GPT](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf), the first pretrained Transformer model, used for fine-tuning on various NLP tasks and obtained state-of-the-art results

- **October 2018**: [BERT](https://arxiv.org/abs/1810.04805), another large pretrained model, this one designed to produce better summaries of sentences (more on this in the next chapter!)

- **February 2019**: [GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf), an improved (and bigger) version of GPT that was not immediately publicly released due to ethical concerns

- **October 2019**: [DistilBERT](https://arxiv.org/abs/1910.01108), a distilled version of BERT that is 60% faster, 40% lighter in memory, and still retains 97% of BERT's performance

- **October 2019**: [BART](https://arxiv.org/abs/1910.13461) and [T5](https://arxiv.org/abs/1910.10683), two large pretrained models using the same architecture as the original Transformer model (the first to do so)

- **May 2020**: [GPT-3](https://arxiv.org/abs/2005.14165), an even bigger version of GPT-2 that is able to perform well on a variety of tasks without the need for fine-tuning (called _zero-shot learning_)

This list is far from comprehensive, and is just meant to highlight a few of the different kinds of Transformer models. Broadly, they can be grouped into three categories:

- GPT-like (also called _auto-regressive_ Transformer models)
- BERT-like (also called _auto-encoding_ Transformer models)
- BART/T5-like (also called _sequence-to-sequence_ Transformer models)

We will dive into these families in more depth later on.

## Transformers are language models[[transformers-are-language-models]]

All the Transformer models mentioned above (GPT, BERT, BART, T5, etc.) have been trained as *language models*. This means they have been trained on large amounts of raw text in a self-supervised fashion. Self-supervised learning is a type of training in which the objective is automatically computed from the inputs of the model. That means that humans are not needed to label the data!

This type of model develops a statistical understanding of the language it has been trained on, but it's not very useful for specific practical tasks. Because of this, the general pretrained model then goes through a process called *transfer learning*. During this process, the model is fine-tuned in a supervised way -- that is, using human-annotated labels -- on a given task.

An example of a task is predicting the next word in a sentence having read the *n* previous words. This is called *causal language modeling* because the output depends on the past and present inputs, but not the future ones.

<div class="flex justify-center">
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/causal_modeling.svg" alt="Example of causal language modeling in which the next word from a sentence is predicted.">
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/causal_modeling-dark.svg" alt="Example of causal language modeling in which the next word from a sentence is predicted.">
</div>

Another example is *masked language modeling*, in which the model predicts a masked word in the sentence.

<div class="flex justify-center">
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/masked_modeling.svg" alt="Example of masked language modeling in which a masked word from a sentence is predicted.">
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/masked_modeling-dark.svg" alt="Example of masked language modeling in which a masked word from a sentence is predicted.">
</div>
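
To make these two objectives concrete, here is a minimal sketch using the `pipeline()` function (the `gpt2` and `bert-base-uncased` checkpoints and the prompts are illustrative choices, not requirements):

```python
from transformers import pipeline

# Causal language modeling: continue a sequence from left to right.
generator = pipeline("text-generation", model="gpt2")
print(generator("In this course, we will teach you how to", max_new_tokens=15))

# Masked language modeling: fill in a hidden word using context on both sides.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
print(unmasker("This course will teach you all about [MASK] models."))
```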

## Transformers are big models[[transformers-are-big-models]]

Apart from a few outliers (like DistilBERT), the general strategy to achieve better performance is by increasing the models' sizes as well as the amount of data they are pretrained on.

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/model_parameters.png" alt="Number of parameters of recent Transformers models" width="90%">
</div>

Unfortunately, training a model, especially a large one, requires a large amount of data. This becomes very costly in terms of time and compute resources. It even translates to environmental impact, as can be seen in the following graph.

<div class="flex justify-center">
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/carbon_footprint.svg" alt="The carbon footprint of a large language model.">
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/carbon_footprint-dark.svg" alt="The carbon footprint of a large language model.">
</div>

<Youtube id="ftWlj4FBHTg"/>

And this graph shows a project for a (very big) model led by a team consciously trying to reduce the environmental impact of pretraining. The footprint of running lots of trials to get the best hyperparameters would be even higher.

Imagine if each time a research team, a student organization, or a company wanted to train a model, it did so from scratch. This would lead to huge, unnecessary global costs!

This is why sharing language models is paramount: sharing the trained weights and building on top of already trained weights reduces the overall compute cost and carbon footprint of the community.

By the way, you can evaluate the carbon footprint of your models' training through several tools, for example [ML CO2 Impact](https://mlco2.github.io/impact/) or [Code Carbon](https://codecarbon.io/), which is integrated into 🤗 Transformers. To learn more about this, you can read this [blog post](https://huggingface.co/blog/carbon-emissions-on-the-hub), which shows how to generate an `emissions.csv` file with an estimate of the footprint of your training, as well as the [documentation](https://huggingface.co/docs/hub/model-cards-co2) of 🤗 Transformers addressing this topic.
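
As a rough sketch of what tracking with Code Carbon looks like (the `codecarbon` package must be installed separately, and `train_my_model()` is a placeholder for your own training loop):

```python
from codecarbon import EmissionsTracker

tracker = EmissionsTracker(output_file="emissions.csv")  # writes an emissions.csv report
tracker.start()
try:
    train_my_model()  # placeholder: your actual training code goes here
finally:
    emissions = tracker.stop()  # estimated emissions in kg of CO2-equivalent
    print(f"Estimated emissions: {emissions} kg CO2eq")
```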

## Transfer Learning[[transfer-learning]]

<Youtube id="BqqfQnyjmgg" />

*Pretraining* is the act of training a model from scratch: the weights are randomly initialized, and the training starts without any prior knowledge.

<div class="flex justify-center">
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/pretraining.svg" alt="The pretraining of a language model is costly in both time and money.">
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/pretraining-dark.svg" alt="The pretraining of a language model is costly in both time and money.">
</div>

This pretraining is usually done on very large amounts of data. Therefore, it requires a very large corpus of data, and training can take up to several weeks.

*Fine-tuning*, on the other hand, is the training done **after** a model has been pretrained. To perform fine-tuning, you first acquire a pretrained language model, then perform additional training with a dataset specific to your task. Wait -- why not simply train the model for your final use case from the start (from **scratch**)? There are a couple of reasons:

* The pretrained model was already trained on a dataset that has some similarities with the fine-tuning dataset. The fine-tuning process is thus able to take advantage of knowledge acquired by the initial model during pretraining (for instance, with NLP problems, the pretrained model will have some kind of statistical understanding of the language you are using for your task).
* Since the pretrained model was already trained on lots of data, the fine-tuning requires way less data to get decent results.
* For the same reason, the amount of time and resources needed to get good results is much lower.

For example, one could leverage a pretrained model trained on the English language and then fine-tune it on an arXiv corpus, resulting in a science/research-based model. The fine-tuning will only require a limited amount of data: the knowledge the pretrained model has acquired is "transferred," hence the term *transfer learning*.

<div class="flex justify-center">
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/finetuning.svg" alt="The fine-tuning of a language model is cheaper than pretraining in both time and money.">
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/finetuning-dark.svg" alt="The fine-tuning of a language model is cheaper than pretraining in both time and money.">
</div>

Fine-tuning a model therefore has lower time, data, financial, and environmental costs. It is also quicker and easier to iterate over different fine-tuning schemes, as the training is less constraining than a full pretraining.

This process will also achieve better results than training from scratch (unless you have lots of data), which is why you should always try to leverage a pretrained model -- one as close as possible to the task you have at hand -- and fine-tune it.

## General architecture[[general-architecture]]

In this section, we'll go over the general architecture of the Transformer model. Don't worry if you don't understand some of the concepts; there are detailed sections later covering each of the components.

<Youtube id="H39Z_720T5s" />

## Introduction[[introduction]]

The model is primarily composed of two blocks:

* **Encoder (left)**: The encoder receives an input and builds a representation of it (its features). This means that the model is optimized to acquire understanding from the input.
* **Decoder (right)**: The decoder uses the encoder's representation (features) along with other inputs to generate a target sequence. This means that the model is optimized for generating outputs.

<div class="flex justify-center">
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/transformers_blocks.svg" alt="Architecture of a Transformers model">
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/transformers_blocks-dark.svg" alt="Architecture of a Transformers model">
</div>

Each of these parts can be used independently, depending on the task:

* **Encoder-only models**: Good for tasks that require understanding of the input, such as sentence classification and named entity recognition.
* **Decoder-only models**: Good for generative tasks such as text generation.
* **Encoder-decoder models** or **sequence-to-sequence models**: Good for generative tasks that require an input, such as translation or summarization.

We will dive into those architectures independently in later sections.

## Attention layers[[attention-layers]]

A key feature of Transformer models is that they are built with special layers called *attention layers*. In fact, the title of the paper introducing the Transformer architecture was ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762)! We will explore the details of attention layers later in the course; for now, all you need to know is that this layer will tell the model to pay specific attention to certain words in the sentence you passed it (and more or less ignore the others) when dealing with the representation of each word.

To put this into context, consider the task of translating text from English to French. Given the input "You like this course", a translation model will need to also attend to the adjacent word "You" to get the proper translation for the word "like", because in French the verb "like" is conjugated differently depending on the subject. The rest of the sentence, however, is not useful for the translation of that word. In the same vein, when translating "this" the model will also need to pay attention to the word "course", because "this" translates differently depending on whether the associated noun is masculine or feminine. Again, the other words in the sentence will not matter for the translation of "course". With more complex sentences (and more complex grammar rules), the model would need to pay special attention to words that might appear farther away in the sentence to properly translate each word.

The same concept applies to any task associated with natural language: a word by itself has a meaning, but that meaning is deeply affected by the context, which can be any other word (or words) before or after the word being studied.

Now that you have an idea of what attention layers are all about, let's take a closer look at the Transformer architecture.

## The original architecture[[the-original-architecture]]

The Transformer architecture was originally designed for translation. During training, the encoder receives inputs (sentences) in a certain language, while the decoder receives the same sentences in the desired target language. In the encoder, the attention layers can use all the words in a sentence (since, as we just saw, the translation of a given word can be dependent on what is after as well as before it in the sentence). The decoder, however, works sequentially and can only pay attention to the words in the sentence that it has already translated (so, only the words before the word currently being generated). For example, when we have predicted the first three words of the translated target, we give them to the decoder which then uses all the inputs of the encoder to try to predict the fourth word.

To speed things up during training (when the model has access to target sentences), the decoder is fed the whole target, but it is not allowed to use future words (if it had access to the word at position 2 when trying to predict the word at position 2, the problem would not be very hard!). For instance, when trying to predict the fourth word, the attention layer will only have access to the words in positions 1 to 3.

The original Transformer architecture looked like this, with the encoder on the left and the decoder on the right:

<div class="flex justify-center">
<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/transformers.svg" alt="Architecture of a Transformers model">
<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter1/transformers-dark.svg" alt="Architecture of a Transformers model">
</div>

Note that the first attention layer in a decoder block pays attention to all (past) inputs to the decoder, but the second attention layer uses the output of the encoder. It can thus access the whole input sentence to best predict the current word. This is very useful as different languages can have grammatical rules that put the words in different orders, or some context provided later in the sentence may be helpful to determine the best translation of a given word.

The *attention mask* can also be used in the encoder/decoder to prevent the model from paying attention to some special words -- for instance, the special padding word used to make all the inputs the same length when batching together sentences.
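
As a quick sketch of where that padding mask comes from (the `bert-base-cased` tokenizer is just an example choice):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
batch = tokenizer(
    ["A short sentence.", "A noticeably longer sentence that needs no padding at all."],
    padding=True,
    return_tensors="pt",
)
# Padded positions get attention_mask == 0, so the model ignores them.
print(batch["attention_mask"])
```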

## Architectures vs. checkpoints[[architecture-vs-checkpoints]]

As we dive into Transformer models in this course, you'll see mentions of *architectures* and *checkpoints* as well as *models*. These terms all have slightly different meanings:

* **Architecture**: This is the skeleton of the model -- the definition of each layer and each operation that happens within the model.
* **Checkpoints**: These are the weights that will be loaded in a given architecture.
* **Model**: This is an umbrella term that isn't as precise as "architecture" or "checkpoint": it can mean both. This course will specify *architecture* or *checkpoint* when it matters, to reduce ambiguity.

For example, BERT is an architecture while `bert-base-cased`, a set of weights trained by the Google team for the first release of BERT, is a checkpoint. However, one can say "the BERT model" and "the `bert-base-cased` model."
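
In code, the distinction might look like this sketch: the configuration defines the architecture, while `from_pretrained()` loads a checkpoint's weights into it.

```python
from transformers import BertConfig, BertModel

# Architecture only: the BERT skeleton with randomly initialized weights.
config = BertConfig()
random_bert = BertModel(config)

# Architecture + checkpoint: the `bert-base-cased` weights loaded into that skeleton.
pretrained_bert = BertModel.from_pretrained("bert-base-cased")
```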
chapter1/material/5.md
ADDED
@@ -0,0 +1,250 @@
# How 🤗 Transformers solve tasks

In [🤗 Transformers can do](/course/chapter1/2), you learned about natural language processing (NLP), speech and audio, computer vision tasks, and some important applications of them. This page will look closely at how models solve these tasks and explain what's happening under the hood. There are many ways to solve a given task; some models may implement certain techniques or even approach the task from a new angle, but for Transformer models the general idea is the same. Owing to its flexible architecture, most models are a variant of an encoder, a decoder, or an encoder-decoder structure.

<Tip>

Before diving into specific architectural variants, it's helpful to understand that most tasks follow a similar pattern: input data is processed through a model, and the output is interpreted for a specific task. The differences lie in how the data is prepared, what model architecture variant is used, and how the output is processed.

</Tip>

To explain how tasks are solved, we'll walk through what goes on inside the model to output useful predictions. We'll cover the following models and their corresponding tasks:

- [Wav2Vec2](model_doc/wav2vec2) for audio classification and automatic speech recognition (ASR)
- [Vision Transformer (ViT)](model_doc/vit) and [ConvNeXT](model_doc/convnext) for image classification
- [DETR](model_doc/detr) for object detection
- [Mask2Former](model_doc/mask2former) for image segmentation
- [GLPN](model_doc/glpn) for depth estimation
- [BERT](model_doc/bert) for NLP tasks like text classification, token classification and question answering that use an encoder
- [GPT2](model_doc/gpt2) for NLP tasks like text generation that use a decoder
- [BART](model_doc/bart) for NLP tasks like summarization and translation that use an encoder-decoder

<Tip>

Before you go further, it is good to have some basic knowledge of the original Transformer architecture. Knowing how encoders, decoders, and attention work will aid you in understanding how different Transformer models work. Be sure to check out [the previous section](https://huggingface.co/course/chapter1/4?fw=pt) for more information!

</Tip>

## Transformer models for Language

Language models are at the heart of modern NLP. They're designed to understand and generate human language by learning the statistical patterns and relationships between words or tokens in text.

The Transformer was initially designed for machine translation, and since then it has become the default architecture for solving a wide range of AI tasks. Some tasks lend themselves to the Transformer's encoder structure, while others are better suited for the decoder. Still other tasks make use of both the Transformer's encoder-decoder structure.

<Tip>
A language model is a statistical model that predicts the probability of a sequence of words. Modern language models can both understand and generate text, making them versatile tools for many NLP tasks.
</Tip>

### How language models work

Language models work by being trained to predict the probability of a word given the context of surrounding words. This gives them a foundational understanding of language that can generalize to other tasks.

There are two main approaches for training a transformer model:

1. **Masked language modeling (MLM)**: Used by encoder models like BERT, this approach randomly masks some tokens in the input and trains the model to predict the original tokens based on the surrounding context. This allows the model to learn bidirectional context (looking at words both before and after the masked word).

2. **Causal language modeling (CLM)**: Used by decoder models like GPT, this approach predicts the next token based on all previous tokens in the sequence. The model can only use context from the left (previous tokens) to predict the next token.

### Types of language models

In the Transformers library, language models generally fall into three architectural categories:

1. **Encoder-only models** (like BERT): These models use a bidirectional approach to understand context from both directions. They're best suited for tasks that require deep understanding of text, such as classification, named entity recognition, and question answering.

2. **Decoder-only models** (like GPT): These models process text from left to right and are particularly good at text generation tasks. They can complete sentences, write essays, or even generate code based on a prompt.

3. **Encoder-decoder models** (like T5, BART): These models combine both approaches, using an encoder to understand the input and a decoder to generate output. They excel at sequence-to-sequence tasks like translation, summarization, and question answering.

Language models are typically pretrained on large amounts of text data in a self-supervised manner (without human annotations), then fine-tuned on specific tasks. This approach, known as transfer learning, allows these models to adapt to many different NLP tasks with relatively small amounts of task-specific data.

In the following sections, we'll explore specific model architectures and how they're applied to various tasks across speech, vision, and text domains.

<Tip>
Understanding which part of the Transformer architecture (encoder, decoder, or both) is best suited for a particular NLP task is key to choosing the right model. Generally, tasks requiring bidirectional context use encoders, tasks generating text use decoders, and tasks converting one sequence to another use encoder-decoders.
</Tip>

### Text classification

Text classification involves assigning predefined categories to text documents, such as sentiment analysis, topic classification, or spam detection.

[BERT](model_doc/bert) is an encoder-only model and is the first model to effectively implement deep bidirectionality to learn richer representations of the text by attending to words on both sides.

1. BERT uses [WordPiece](tokenizer_summary#wordpiece) tokenization to generate a token embedding of the text. To tell the difference between a single sentence and a pair of sentences, a special `[SEP]` token is added to differentiate them. A special `[CLS]` token is added to the beginning of every sequence of text. The final output with the `[CLS]` token is used as the input to the classification head for classification tasks. BERT also adds a segment embedding to denote whether a token belongs to the first or second sentence in a pair of sentences.

2. BERT is pretrained with two objectives: masked language modeling and next-sentence prediction. In masked language modeling, some percentage of the input tokens are randomly masked, and the model needs to predict these. This solves the issue of bidirectionality, where the model could otherwise cheat and see all the words and "predict" the next word. The final hidden states of the predicted mask tokens are passed to a feedforward network with a softmax over the vocabulary to predict the masked word.

The second pretraining objective is next-sentence prediction. The model must predict whether sentence B follows sentence A. Half of the time sentence B is the next sentence, and the other half of the time, sentence B is a random sentence. The prediction, whether it is the next sentence or not, is passed to a feedforward network with a softmax over the two classes (`IsNext` and `NotNext`).

3. The input embeddings are passed through multiple encoder layers to output some final hidden states.

To use the pretrained model for text classification, add a sequence classification head on top of the base BERT model. The sequence classification head is a linear layer that accepts the final hidden states and performs a linear transformation to convert them into logits. The cross-entropy loss is calculated between the logits and target to find the most likely label.
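
As a rough sketch of what that looks like in code (the `distilbert-base-uncased-finetuned-sst-2-english` sentiment checkpoint is just one example of a model that already ships with such a head):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

inputs = tokenizer("I really enjoyed this course!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # one logit per class

predicted_class = logits.argmax(dim=-1).item()
print(model.config.id2label[predicted_class])
```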

Ready to try your hand at text classification? Check out our complete [text classification guide](tasks/sequence_classification) to learn how to finetune DistilBERT and use it for inference!

### Token classification

Token classification involves assigning a label to each token in a sequence, such as in named entity recognition or part-of-speech tagging.

To use BERT for token classification tasks like named entity recognition (NER), add a token classification head on top of the base BERT model. The token classification head is a linear layer that accepts the final hidden states and performs a linear transformation to convert them into logits. The cross-entropy loss is calculated between the logits and each token to find the most likely label.
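
A minimal sketch of token-level predictions, assuming an NER checkpoint such as `dslim/bert-base-NER` (any token classification checkpoint works the same way):

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

checkpoint = "dslim/bert-base-NER"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint)

inputs = tokenizer("Hugging Face is based in New York City.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # one row of logits per token

predictions = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print([(token, model.config.id2label[p.item()]) for token, p in zip(tokens, predictions)])
```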

Ready to try your hand at token classification? Check out our complete [token classification guide](tasks/token_classification) to learn how to finetune DistilBERT and use it for inference!

### Question answering

Question answering involves finding the answer to a question within a given context or passage.

To use BERT for question answering, add a span classification head on top of the base BERT model. This linear layer accepts the final hidden states and performs a linear transformation to compute the `span` start and end logits corresponding to the answer. The cross-entropy loss is calculated between the logits and the label position to find the most likely span of text corresponding to the answer.
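
Here is a minimal sketch with the `question-answering` pipeline, which wraps exactly this kind of span head (the `distilbert-base-cased-distilled-squad` checkpoint is just an example):

```python
from transformers import pipeline

question_answerer = pipeline(
    "question-answering", model="distilbert-base-cased-distilled-squad"
)
result = question_answerer(
    question="What does the span classification head predict?",
    context="The span classification head predicts the start and end positions of the answer within the context.",
)
print(result)  # answer text plus its start/end character positions and a score
```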

Ready to try your hand at question answering? Check out our complete [question answering guide](tasks/question_answering) to learn how to finetune DistilBERT and use it for inference!

<Tip>
💡 Notice how easy it is to use BERT for different tasks once it's been pretrained. You only need to add a specific head to the pretrained model to manipulate the hidden states into your desired output!
</Tip>

### Text generation

Text generation involves creating coherent and contextually relevant text based on a prompt or input.

[GPT-2](model_doc/gpt2) is a decoder-only model pretrained on a large amount of text. It can generate convincing (though not always true!) text given a prompt and complete other NLP tasks like question answering despite not being explicitly trained to.

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/gpt2_architecture.png"/>
</div>

1. GPT-2 uses [byte pair encoding (BPE)](tokenizer_summary#bytepair-encoding-bpe) to tokenize words and generate a token embedding. Positional encodings are added to the token embeddings to indicate the position of each token in the sequence. The input embeddings are passed through multiple decoder blocks to output some final hidden state. Within each decoder block, GPT-2 uses a *masked self-attention* layer, which means GPT-2 can't attend to future tokens. It is only allowed to attend to tokens on the left. This is different from BERT's [`mask`] token because, in masked self-attention, an attention mask is used to set the score to `0` for future tokens.

2. The output from the decoder is passed to a language modeling head, which performs a linear transformation to convert the hidden states into logits. The labels are the next tokens in the sequence, created by shifting the logits to the right by one. The cross-entropy loss is calculated between the shifted logits and the labels to output the next most likely token.

GPT-2's pretraining objective is based entirely on [causal language modeling](glossary#causal-language-modeling), predicting the next word in a sequence. This makes GPT-2 especially good at tasks that involve generating text.
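
To see the language modeling head in action, here is a small sketch (the `gpt2` checkpoint is an example, and greedy argmax decoding is used only to keep the code short):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Transformers are a kind of neural", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, sequence_length, vocab_size)

# The logits at the last position score every vocabulary token as a candidate next token.
next_token_id = logits[0, -1].argmax()
print(tokenizer.decode(next_token_id))
```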

Ready to try your hand at text generation? Check out our complete [causal language modeling guide](tasks/language_modeling#causal-language-modeling) to learn how to finetune DistilGPT-2 and use it for inference!

<Tip>
For more information about text generation, check out the [text generation strategies](generation_strategies) guide!
</Tip>

### Summarization

Summarization involves condensing a longer text into a shorter version while preserving its key information and meaning.

Encoder-decoder models like [BART](model_doc/bart) and [T5](model_doc/t5) are designed for the sequence-to-sequence pattern of a summarization task. We'll explain how BART works in this section, and then you can try finetuning T5 at the end.

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bart_architecture.png"/>
</div>

1. BART's encoder architecture is very similar to BERT and accepts a token and positional embedding of the text. BART is pretrained by corrupting the input and then reconstructing it with the decoder. Unlike other encoders with specific corruption strategies, BART can apply any type of corruption. The *text infilling* corruption strategy works the best though. In text infilling, a number of text spans are replaced with a **single** [`mask`] token. This is important because the model has to predict the masked tokens, and it teaches the model to predict the number of missing tokens. The input embeddings and masked spans are passed through the encoder to output some final hidden states, but unlike BERT, BART doesn't add a final feedforward network at the end to predict a word.

2. The encoder's output is passed to the decoder, which must predict the masked tokens and any uncorrupted tokens from the encoder's output. This gives additional context to help the decoder restore the original text. The output from the decoder is passed to a language modeling head, which performs a linear transformation to convert the hidden states into logits. The cross-entropy loss is calculated between the logits and the label, which is just the token shifted to the right.
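
A minimal sketch of using such a model through the `summarization` pipeline (the `facebook/bart-large-cnn` checkpoint is one example of a BART model fine-tuned for summarization):

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
summary = summarizer(
    "BART is pretrained by corrupting text and learning to reconstruct it with its decoder. "
    "Because the decoder already knows how to generate fluent text from an encoded input, "
    "fine-tuning it on article and summary pairs turns it into an effective summarizer.",
    max_length=40,
    min_length=10,
)
print(summary[0]["summary_text"])
```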

Ready to try your hand at summarization? Check out our complete [summarization guide](tasks/summarization) to learn how to finetune T5 and use it for inference!

<Tip>
For more information about text generation, check out the [text generation strategies](generation_strategies) guide!
</Tip>

### Translation

Translation involves converting text from one language to another while preserving its meaning.

Translation is another example of a sequence-to-sequence task, which means you can use an encoder-decoder model like [BART](model_doc/bart) or [T5](model_doc/t5) to do it. We'll explain how BART works in this section, and then you can try finetuning T5 at the end.

BART adapts to translation by adding a separate randomly initialized encoder to map a source language to an input that can be decoded into the target language. This new encoder's embeddings are passed to the pretrained encoder instead of the original word embeddings. The source encoder is trained by updating the source encoder, positional embeddings, and input embeddings with the cross-entropy loss from the model output. The model parameters are frozen in this first step, and all the model parameters are trained together in the second step.

BART has since been followed up by a multilingual version, mBART, intended for translation and pretrained on many different languages.

Ready to try your hand at translation? Check out our complete [translation guide](tasks/translation) to learn how to finetune T5 and use it for inference!

<Tip>
As you've seen throughout this guide, many models follow similar patterns despite addressing different tasks. Understanding these common patterns can help you quickly grasp how new models work and how to adapt existing models to your specific needs.
</Tip>

## Modalities beyond text

Transformers are not limited to text. They can also be applied to other modalities like speech and audio, images, and video. Of course, in this course we will focus on text, but we can briefly introduce the other modalities.

### Speech and audio

Let's start by exploring how Transformer models handle speech and audio data, which presents unique challenges compared to text or images.

[Wav2Vec2](model_doc/wav2vec2) is a self-supervised model pretrained on unlabeled speech data and finetuned on labeled data for audio classification and automatic speech recognition.

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/wav2vec2_architecture.png"/>
</div>

This model has four main components:

1. A *feature encoder* takes the raw audio waveform, normalizes it to zero mean and unit variance, and converts it into a sequence of feature vectors that are each 20ms long.

2. Waveforms are continuous by nature, so they can't be divided into separate units like a sequence of text can be split into words. That's why the feature vectors are passed to a *quantization module*, which aims to learn discrete speech units. The speech unit is chosen from a collection of codewords, known as a *codebook* (you can think of this as the vocabulary). From the codebook, the vector or speech unit that best represents the continuous audio input is chosen and forwarded through the model.

3. About half of the feature vectors are randomly masked, and the masked feature vectors are fed to a *context network*, which is a Transformer encoder that also adds relative positional embeddings.

4. The pretraining objective of the context network is a *contrastive task*. The model has to predict the true quantized speech representation of the masked prediction from a set of false ones, encouraging the model to find the most similar context vector and quantized speech unit (the target label).

Now that wav2vec2 is pretrained, you can finetune it on your data for audio classification or automatic speech recognition!

<Tip>
The key innovation in Wav2Vec2 is how it converts continuous audio signals into discrete units that can be processed like text tokens. This bridging between continuous and discrete representations is what makes it so powerful for speech tasks.
</Tip>

### Audio classification

To use the pretrained model for audio classification, add a sequence classification head on top of the base Wav2Vec2 model. The classification head is a linear layer that accepts the encoder's hidden states. The hidden states represent the learned features from each audio frame, which can have varying lengths. To create one vector of fixed length, the hidden states are pooled first and then transformed into logits over the class labels. The cross-entropy loss is calculated between the logits and target to find the most likely class.

Ready to try your hand at audio classification? Check out our complete [audio classification guide](tasks/audio_classification) to learn how to finetune Wav2Vec2 and use it for inference!

### Automatic speech recognition

To use the pretrained model for automatic speech recognition, add a language modeling head on top of the base Wav2Vec2 model for [connectionist temporal classification (CTC)](glossary#connectionist-temporal-classification-ctc). The language modeling head is a linear layer that accepts the encoder's hidden states and transforms them into logits. Each logit represents a token class (the number of tokens comes from the task vocabulary). The CTC loss is calculated between the logits and targets to find the most likely sequence of tokens, which are then decoded into a transcription.

Ready to try your hand at automatic speech recognition? Check out our complete [automatic speech recognition guide](tasks/asr) to learn how to finetune Wav2Vec2 and use it for inference!

### Computer vision

Now let's move on to computer vision tasks, which deal with understanding and interpreting visual information from images or videos.

There are two ways to approach computer vision tasks:

1. Split an image into a sequence of patches and process them in parallel with a Transformer.
2. Use a modern CNN, like [ConvNeXT](model_doc/convnext), which relies on convolutional layers but adopts modern network designs.

<Tip>

A third approach mixes Transformers with convolutions (for example, [Convolutional Vision Transformer](model_doc/cvt) or [LeViT](model_doc/levit)). We won't discuss those because they just combine the two approaches we examine here.

</Tip>

ViT and ConvNeXT are commonly used for image classification, but for other vision tasks like object detection, segmentation, and depth estimation, we'll look at DETR, Mask2Former and GLPN, respectively; these models are better suited for those tasks.

### Image classification

Image classification is one of the fundamental computer vision tasks. Let's see how different model architectures approach this problem.

ViT and ConvNeXT can both be used for image classification; the main difference is that ViT uses an attention mechanism while ConvNeXT uses convolutions.

[ViT](model_doc/vit) replaces convolutions entirely with a pure Transformer architecture. If you're familiar with the original Transformer, then you're already most of the way toward understanding ViT.

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/vit_architecture.jpg"/>
</div>

The main change ViT introduced was in how images are fed to a Transformer:

1. An image is split into square non-overlapping patches, each of which gets turned into a vector or *patch embedding*. The patch embeddings are generated from a convolutional 2D layer which creates the proper input dimensions (which for a base Transformer is 768 values for each patch embedding). If you had a 224x224 pixel image, you could split it into 196 16x16 image patches. Just like how text is tokenized into words, an image is "tokenized" into a sequence of patches.

2. A *learnable embedding* -- a special `[CLS]` token -- is added to the beginning of the patch embeddings just like BERT. The final hidden state of the `[CLS]` token is used as the input to the attached classification head; other outputs are ignored. This token helps the model learn how to encode a representation of the image.

3. The last thing to add to the patch and learnable embeddings are the *position embeddings*, because the model doesn't know how the image patches are ordered. The position embeddings are also learnable and have the same size as the patch embeddings. Finally, all of the embeddings are passed to the Transformer encoder.

4. The output, specifically only the output with the `[CLS]` token, is passed to a multilayer perceptron head (MLP). ViT's pretraining objective is simply classification. Like other classification heads, the MLP head converts the output into logits over the class labels and calculates the cross-entropy loss to find the most likely class.

Ready to try your hand at image classification? Check out our complete [image classification guide](tasks/image_classification) to learn how to finetune ViT and use it for inference!

<Tip>
Notice the parallel between ViT and BERT: both use a special token (`[CLS]`) to capture the overall representation, both add position information to their embeddings, and both use a Transformer encoder to process the sequence of tokens/patches.
</Tip>
chapter1/material/6.md
ADDED
@@ -0,0 +1,219 @@
<CourseFloatingBanner
    chapter={1}
    classNames="absolute z-10 right-0 top-0"
/>

# Transformer Architectures[[transformer-architectures]]

In the previous sections, we introduced the general Transformer architecture and explored how these models can solve various tasks. Now, let's take a closer look at the three main architectural variants of Transformer models and understand when to use each one.

<Tip>

Remember that most Transformer models use one of three architectures: encoder-only, decoder-only, or encoder-decoder (sequence-to-sequence). Understanding these differences will help you choose the right model for your specific task.

</Tip>

## Encoder models[[encoder-models]]

<Youtube id="MUqNwgPjJvQ" />

Encoder models use only the encoder of a Transformer model. At each stage, the attention layers can access all the words in the initial sentence. These models are often characterized as having "bi-directional" attention, and are often called *auto-encoding models*.

The pretraining of these models usually revolves around somehow corrupting a given sentence (for instance, by masking random words in it) and tasking the model with finding or reconstructing the initial sentence.

Encoder models are best suited for tasks requiring an understanding of the full sentence, such as sentence classification, named entity recognition (and more generally word classification), and extractive question answering.

<Tip>
As we saw in [How 🤗 Transformers solve tasks](chapter1/4a), encoder models like BERT excel at understanding text because they can look at the entire context in both directions. This makes them perfect for tasks where comprehension of the whole input is important.
</Tip>

Representatives of this family of models include:

- [BERT](https://huggingface.co/docs/transformers/model_doc/bert)
- [DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)
- [ModernBERT](https://huggingface.co/docs/transformers/en/model_doc/modernbert)

## Decoder models[[decoder-models]]

<Youtube id="d_ixlCubqQw" />

Decoder models use only the decoder of a Transformer model. At each stage, for a given word the attention layers can only access the words positioned before it in the sentence. These models are often called *auto-regressive models*.

The pretraining of decoder models usually revolves around predicting the next word in the sentence.

These models are best suited for tasks involving text generation.

<Tip>
Decoder models like GPT are designed to generate text by predicting one token at a time. As we explored in [How 🤗 Transformers solve tasks](chapter1/4a), they can only see previous tokens, which makes them excellent for creative text generation but less ideal for tasks requiring bidirectional understanding.
</Tip>

Representatives of this family of models include:

- [Hugging Face SmolLM Series](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct)
- [Meta's Llama Series](https://huggingface.co/docs/transformers/en/model_doc/llama4)
- [Google's Gemma Series](https://huggingface.co/docs/transformers/main/en/model_doc/gemma3)
- [DeepSeek's V3](https://huggingface.co/deepseek-ai/DeepSeek-V3)

### Modern Large Language Models (LLMs)

Most modern Large Language Models (LLMs) use the decoder-only architecture. These models have grown dramatically in size and capabilities over the past few years, with some of the largest models containing hundreds of billions of parameters.

Modern LLMs are typically trained in two phases:

1. **Pretraining**: The model learns to predict the next token on vast amounts of text data
2. **Instruction tuning**: The model is fine-tuned to follow instructions and generate helpful responses

This approach has led to models that can understand and generate human-like text across a wide range of topics and tasks.
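
As a sketch of what using such an instruction-tuned model looks like (the `HuggingFaceTB/SmolLM2-1.7B-Instruct` checkpoint linked above is used here; recent versions of the `text-generation` pipeline accept chat-style messages directly):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-1.7B-Instruct")
messages = [
    {
        "role": "user",
        "content": "Explain the difference between pretraining and instruction tuning in one sentence.",
    }
]
outputs = generator(messages, max_new_tokens=60)
print(outputs[0]["generated_text"])
```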
|
67 |
+
#### Key capabilities of modern LLMs
|
68 |
+
|
69 |
+
Modern decoder-based LLMs have demonstrated impressive capabilities:
|
70 |
+
|
71 |
+
| Capability | Description | Example |
|
72 |
+
|------------|-------------|---------|
|
73 |
+
| Text generation | Creating coherent and contextually relevant text | Writing essays, stories, or emails |
|
74 |
+
| Summarization | Condensing long documents into shorter versions | Creating executive summaries of reports |
|
75 |
+
| Translation | Converting text between languages | Translating English to Spanish |
|
76 |
+
| Question answering | Providing answers to factual questions | "What is the capital of France?" |
|
77 |
+
| Code generation | Writing or completing code snippets | Creating a function based on a description |
|
78 |
+
| Reasoning | Working through problems step by step | Solving math problems or logical puzzles |
|
79 |
+
| Few-shot learning | Learning from a few examples in the prompt | Classifying text after seeing just 2-3 examples |
|
80 |
+
|
81 |
+
You can experiment with decoder-based LLMs directly in your browser. Here's an interactive demo of a large language model:
|
82 |
+
|
83 |
+
<iframe
|
84 |
+
src="https://huggingface.co/spaces/course-demos/llm-demo"
|
85 |
+
frameborder="0"
|
86 |
+
width="100%"
|
87 |
+
height="450"
|
88 |
+
></iframe>
|
89 |
+
|
90 |
+
## Sequence-to-sequence models[[sequence-to-sequence-models]]
|
91 |
+
|
92 |
+
<Youtube id="0_4KEb08xrE" />
|
93 |
+
|
94 |
+
Encoder-decoder models (also called *sequence-to-sequence models*) use both parts of the Transformer architecture. At each stage, the attention layers of the encoder can access all the words in the initial sentence, whereas the attention layers of the decoder can only access the words positioned before a given word in the input.
|
95 |
+
|
96 |
+
The pretraining of these models can take different forms, but it often involves reconstructing a sentence for which the input has been somehow corrupted (for instance by masking random words). The pretraining of the T5 model consists of replacing random spans of text (that can contain several words) with a single mask special token, and the task is then to predict the text that this mask token replaces.
|
97 |
+
|
98 |
+
Sequence-to-sequence models are best suited for tasks revolving around generating new sentences depending on a given input, such as summarization, translation, or generative question answering.
|
99 |
+
|
100 |
+
<Tip>
|
101 |
+
As we saw in [How 🤗 Transformers solve tasks](chapter1/4a), encoder-decoder models like BART and T5 combine the strengths of both architectures. The encoder provides deep bidirectional understanding of the input, while the decoder generates appropriate output text. This makes them perfect for tasks that transform one sequence into another, like translation or summarization.
|
102 |
+
</Tip>
|
103 |
+
|
104 |
+
### Practical applications
|
105 |
+
|
106 |
+
Sequence-to-sequence models excel at tasks that require transforming one form of text into another while preserving meaning. Some practical applications include:
|
107 |
+
|
108 |
+
| Application | Description | Example Model |
|
109 |
+
|-------------|-------------|---------------|
|
110 |
+
| Machine translation | Converting text between languages | Marian, T5 |
|
111 |
+
| Text summarization | Creating concise summaries of longer texts | BART, T5 |
|
112 |
+
| Data-to-text generation | Converting structured data into natural language | T5 |
|
113 |
+
| Grammar correction | Fixing grammatical errors in text | T5 |
|
114 |
+
| Question answering | Generating answers based on context | BART, T5 |
|
115 |
+
|
116 |
+
Here's an interactive demo of a sequence-to-sequence model for translation:
|
117 |
+
|
118 |
+
<iframe
|
119 |
+
src="https://huggingface.co/spaces/course-demos/translation-demo"
|
120 |
+
frameborder="0"
|
121 |
+
width="100%"
|
122 |
+
height="450"
|
123 |
+
></iframe>
|
124 |
+
|
125 |
+
Representatives of this family of models include:
|
126 |
+
|
127 |
+
- [BART](https://huggingface.co/docs/transformers/model_doc/bart)
|
128 |
+
- [mBART](https://huggingface.co/docs/transformers/model_doc/mbart)
|
129 |
+
- [Marian](https://huggingface.co/docs/transformers/model_doc/marian)
|
130 |
+
- [T5](https://huggingface.co/docs/transformers/model_doc/t5)
|
131 |
+
|
132 |
+
## Choosing the right architecture[[choosing-the-right-architecture]]
|
133 |
+
|
134 |
+
When working on a specific NLP task, how do you decide which architecture to use? Here's a quick guide:
|
135 |
+
|
136 |
+
| Task | Suggested Architecture | Examples |
|
137 |
+
|------|------------------------|----------|
|
138 |
+
| Text classification (sentiment, topic) | Encoder | BERT, RoBERTa |
|
139 |
+
| Text generation (creative writing) | Decoder | GPT, LLaMA |
|
140 |
+
| Translation | Encoder-Decoder | T5, BART |
|
141 |
+
| Summarization | Encoder-Decoder | BART, T5 |
|
142 |
+
| Named entity recognition | Encoder | BERT, RoBERTa |
|
143 |
+
| Question answering (extractive) | Encoder | BERT, RoBERTa |
|
144 |
+
| Question answering (generative) | Encoder-Decoder or Decoder | T5, GPT |
|
145 |
+
| Conversational AI | Decoder | GPT, LLaMA |
|
146 |
+
|
147 |
+
<Tip>
|
148 |
+
When in doubt about which model to use, consider:
|
149 |
+
1. What kind of understanding does your task need? (Bidirectional or unidirectional)
|
150 |
+
2. Are you generating new text or analyzing existing text?
|
151 |
+
3. Do you need to transform one sequence into another?
|
152 |
+
|
153 |
+
The answers to these questions will guide you toward the right architecture.
|
154 |
+
</Tip>

## The evolution of LLMs

Large Language Models have evolved rapidly in recent years, with each generation bringing significant improvements in capabilities. Here's an interactive timeline showing this evolution:

<iframe
	src="https://huggingface.co/spaces/course-demos/llm-timeline"
	frameborder="0"
	width="100%"
	height="450"
></iframe>

## Attention mechanisms[[attention-mechanisms]]

Most Transformer models use full attention, in the sense that the attention matrix is square. This can be a big computational bottleneck when you have long texts. Longformer and Reformer are models that try to be more efficient by using a sparse version of the attention matrix to speed up training.

<Tip>

Standard attention mechanisms have a computational complexity of O(n²), where n is the sequence length. This becomes problematic for very long sequences. The specialized attention mechanisms below help address this limitation.

</Tip>

### LSH attention

[Reformer](model_doc/reformer) uses LSH attention. In softmax(QK^t), only the biggest elements (in the softmax dimension) of the matrix QK^t give useful contributions. So for each query q in Q, we can consider only the keys k in K that are close to q. A hash function is used to determine whether q and k are close. The attention mask is modified to mask the current token (except at the first position), because it would give a query and a key that are equal (and therefore very similar to each other). Since the hash can be a bit random, several hash functions are used in practice (determined by an n_rounds parameter) and then averaged together.

### Local attention

[Longformer](model_doc/longformer) uses local attention: often, the local context (e.g., what are the two tokens to the left and right?) is enough to take action for a given token. Also, by stacking attention layers that have a small window, the last layer will have a receptive field of more than just the tokens in the window, allowing the model to build a representation of the whole sentence.

Some preselected input tokens are also given global attention: for those few tokens, the attention matrix can access all tokens, and this process is symmetric: all other tokens have access to those specific tokens (on top of the ones in their local window). This is shown in Figure 2d of the paper; see below for a sample attention mask:

<div class="flex justify-center">
  <img scale="50 %" align="center" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/local_attention_mask.png"/>
</div>

Using attention matrices with fewer parameters then allows the model to handle inputs with a longer sequence length.
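
As a rough illustration of the idea (not Longformer's actual implementation), here is a sketch that builds a local attention mask with a fixed window size plus a single global token:

```python
import numpy as np

def local_attention_mask(seq_len: int, window: int, global_tokens=(0,)) -> np.ndarray:
    """Return a boolean mask where True means 'position i may attend to position j'."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True          # each token sees its local neighborhood
    for g in global_tokens:
        mask[g, :] = True              # global tokens see everything...
        mask[:, g] = True              # ...and everything sees the global tokens
    return mask

print(local_attention_mask(seq_len=8, window=2))
```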

### Axial positional encodings

[Reformer](model_doc/reformer) uses axial positional encodings: in traditional transformer models, the positional encoding E is a matrix of size \\(l\\) by \\(d\\), \\(l\\) being the sequence length and \\(d\\) the dimension of the hidden state. If you have very long texts, this matrix can be huge and take way too much space on the GPU. To alleviate that, axial positional encodings consist of factorizing that big matrix E into two smaller matrices E1 and E2, with dimensions \\(l_{1} \times d_{1}\\) and \\(l_{2} \times d_{2}\\), such that \\(l_{1} \times l_{2} = l\\) and \\(d_{1} + d_{2} = d\\) (with the product for the lengths, this ends up being way smaller). The embedding for time step \\(j\\) in E is obtained by concatenating the embeddings for time step \\(j \% l_{1}\\) in E1 and \\(j // l_{1}\\) in E2.
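
A minimal sketch of this factorization, using NumPy and made-up dimensions, makes the memory saving concrete:

```python
import numpy as np

l1, l2 = 128, 128          # l = l1 * l2 = 16,384 positions
d1, d2 = 256, 256          # d = d1 + d2 = 512 hidden dimensions

E1 = np.random.randn(l1, d1)   # stores l1 x d1 values
E2 = np.random.randn(l2, d2)   # stores l2 x d2 values

def axial_position_embedding(j: int) -> np.ndarray:
    """Embedding for time step j, built from the two factor matrices."""
    return np.concatenate([E1[j % l1], E2[j // l1]])

full_size = (l1 * l2) * (d1 + d2)   # what a full l x d matrix would need
factored_size = l1 * d1 + l2 * d2   # what the axial factorization stores
print(full_size, factored_size)     # 8,388,608 vs 65,536 values
```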

## Conclusion[[conclusion]]

In this section, we've explored the three main Transformer architectures and some specialized attention mechanisms. Understanding these architectural differences is crucial for selecting the right model for your specific NLP task.

As we move forward in the course, you'll get hands-on experience with these different architectures and learn how to fine-tune them for your specific needs. In the next section, we'll look at some of the limitations and biases present in these models that you should be aware of when deploying them.
chapter1/material/7.md
ADDED
@@ -0,0 +1,241 @@
# Inference with LLMs[[inference-with-llms]]

<CourseFloatingBanner
    chapter={1}
    classNames="absolute z-10 right-0 top-0"
/>

On this page, we'll explore the core concepts behind LLM inference, providing a comprehensive understanding of how these models generate text and the key components involved in the inference process.

## Understanding the Basics

Let's start with the fundamentals. Inference is the process of using a trained LLM to generate human-like text from a given input prompt. Language models use their knowledge from training to formulate responses one word at a time. The model leverages learned probabilities from billions of parameters to predict and generate the next token in a sequence. This sequential generation is what allows LLMs to produce coherent and contextually relevant text.

## The Role of Attention

The attention mechanism is what gives LLMs their ability to understand context and generate coherent responses. When predicting the next word, not every word in a sentence carries equal weight - for example, in the sentence *"The capital of France is ..."*, the words "France" and "capital" are crucial for determining that "Paris" should come next. This ability to focus on relevant information is what we call attention.

<img src="https://huggingface.co/datasets/agents-course/course-images/resolve/main/en/unit1/AttentionSceneFinal.gif" alt="Visual Gif of Attention" width="60%">

This process of identifying the most relevant words to predict the next token has proven to be incredibly effective. Although the basic principle of training LLMs—predicting the next token—has remained generally consistent since BERT and GPT-2, there have been significant advancements in scaling neural networks and making the attention mechanism work for longer and longer sequences, at lower and lower costs.

<Tip>

In short, the attention mechanism is the key to LLMs being able to generate text that is both coherent and context-aware. It sets modern LLMs apart from previous generations of language models.

</Tip>

### Context Length and Attention Span

Now that we understand attention, let's explore how much context an LLM can actually handle. This brings us to context length, or the model's 'attention span'.

The context length refers to the maximum number of tokens (words or parts of words) that the LLM can process at once. Think of it as the size of the model's working memory.

These capabilities are limited by several practical factors:
- The model's architecture and size
- Available computational resources
- The complexity of the input and desired output

In an ideal world, we could feed unlimited context to the model, but hardware constraints and computational costs make this impractical. This is why different models are designed with different context lengths to balance capability with efficiency.
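
One practical way to reason about context length is simply to count tokens before sending a prompt. A small sketch (the checkpoint name is just an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example checkpoint

prompt = "The capital of France is"
n_tokens = len(tokenizer(prompt)["input_ids"])
print(f"{n_tokens} tokens out of a maximum of {tokenizer.model_max_length}")
```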

<Tip>

The context length is the maximum number of tokens the model can consider at once when generating a response.

</Tip>

### The Art of Prompting

When we pass information to LLMs, we structure our input in a way that guides the generation of the LLM toward the desired output. This is called _prompting_.

Understanding how LLMs process information helps us craft better prompts. Since the model's primary task is to predict the next token by analyzing the importance of each input token, the wording of your input sequence becomes crucial.

<Tip>

Careful design of the prompt makes it easier **to guide the generation of the LLM toward the desired output**.

</Tip>

## The Two-Phase Inference Process

Now that we understand the basic components, let's dive into how LLMs actually generate text. The process can be broken down into two main phases: prefill and decode. These phases work together like an assembly line, each playing a crucial role in producing coherent text.

### The Prefill Phase

The prefill phase is like the preparation stage in cooking - it's where all the initial ingredients are processed and made ready. This phase involves three key steps:

1. **Tokenization**: Converting the input text into tokens (think of these as the basic building blocks the model understands)
2. **Embedding Conversion**: Transforming these tokens into numerical representations that capture their meaning
3. **Initial Processing**: Running these embeddings through the model's neural networks to create a rich understanding of the context

This phase is computationally intensive because it needs to process all input tokens at once. Think of it as reading and understanding an entire paragraph before starting to write a response.
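
The first two steps are easy to see for yourself with a tokenizer and model from 🤗 Transformers (again, the checkpoint is only an illustrative choice):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Step 1: tokenization - text becomes integer token IDs
inputs = tokenizer("The capital of France is", return_tensors="pt")
print(inputs["input_ids"])

# Steps 2-3: the model embeds the tokens and runs them through its layers,
# producing one vector of next-token scores (logits) per input position
outputs = model(**inputs)
print(outputs.logits.shape)  # (batch_size, sequence_length, vocab_size)
```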

### The Decode Phase

After the prefill phase has processed the input, we move to the decode phase - this is where the actual text generation happens. The model generates one token at a time in what we call an autoregressive process (where each new token depends on all previous tokens).

The decode phase involves several key steps that happen for each new token:
1. **Attention Computation**: Looking back at all previous tokens to understand context
2. **Probability Calculation**: Determining the likelihood of each possible next token
3. **Token Selection**: Choosing the next token based on these probabilities
4. **Continuation Check**: Deciding whether to continue or stop generation

This phase is memory-intensive because the model needs to keep track of all previously generated tokens and their relationships.

## Sampling Strategies

Now that we understand how the model generates text, let's explore the various ways we can control this generation process. Just like a writer might choose between being more creative or more precise, we can adjust how the model makes its token selections.

### Understanding Token Selection: From Probabilities to Token Choices

When the model needs to choose the next token, it starts with raw scores (called logits) for every word in its vocabulary, which are turned into probabilities. But how do we turn these probabilities into actual choices? Let's break down the process:

<img src="https://huggingface.co/reasoning-course/images/blob/main/inference/1.png" alt="Token Selection Process" />

1. **Raw Logits**: Think of these as the model's initial gut feelings about each possible next word
2. **Temperature Control**: Like a creativity dial - higher settings (>1.0) make choices more random and creative, lower settings (<1.0) make them more focused and deterministic
3. **Top-p (Nucleus) Sampling**: Instead of considering all possible words, we only look at the most likely ones that add up to our chosen probability threshold (e.g., top 90%)
4. **Top-k Filtering**: An alternative approach where we only consider the k most likely next words
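
Here is a small, self-contained sketch of that process - temperature scaling followed by top-k and top-p filtering - operating on a toy logit vector (the numbers are made up for illustration):

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=5, top_p=0.9,
                      rng=np.random.default_rng(0)):
    # 1. Temperature: rescale the raw logits before the softmax
    logits = np.asarray(logits, dtype=float) / temperature

    # 2. Softmax: turn scores into probabilities
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # 3. Top-k: keep only the k most likely tokens
    order = np.argsort(probs)[::-1][:top_k]

    # 4. Top-p: within those, keep the smallest set covering p probability mass
    cumulative = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cumulative, top_p) + 1)]

    # 5. Renormalize and sample one token ID
    final = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=final))

toy_logits = [2.0, 1.0, 0.5, 0.1, -1.0, -2.0]   # one score per vocabulary entry
print(sample_next_token(toy_logits))
```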

### Managing Repetition: Keeping Output Fresh

One common challenge with LLMs is their tendency to repeat themselves - much like a speaker who keeps returning to the same points. To address this, we use two types of penalties:

1. **Presence Penalty**: A fixed penalty applied to any token that has appeared before, regardless of how often. This helps prevent the model from reusing the same words.
2. **Frequency Penalty**: A scaling penalty that increases based on how often a token has been used. The more a word appears, the less likely it is to be chosen again.

<img src="https://huggingface.co/reasoning-course/images/blob/main/inference/2.png" alt="Token Selection Process" />

These penalties are applied early in the token selection process, adjusting the raw scores before other sampling strategies are applied. Think of them as gentle nudges encouraging the model to explore new vocabulary.
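
As a hedged sketch of how such penalties could be applied to the logits, given a count of how often each token has already appeared (this mirrors the general presence/frequency-penalty idea rather than any specific library's implementation):

```python
import numpy as np

def apply_penalties(logits, token_counts, presence_penalty=0.5, frequency_penalty=0.3):
    """Lower the scores of tokens that have already been generated."""
    logits = np.asarray(logits, dtype=float).copy()
    for token_id, count in token_counts.items():
        if count > 0:
            logits[token_id] -= presence_penalty           # flat penalty for appearing at all
            logits[token_id] -= frequency_penalty * count  # grows with each repetition
    return logits

logits = np.array([3.0, 2.5, 1.0, 0.2])
already_generated = {0: 4, 1: 1}    # token 0 used four times, token 1 once
print(apply_penalties(logits, already_generated))
```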

### Controlling Generation Length: Setting Boundaries

Just as a good story needs proper pacing and length, we need ways to control how much text our LLM generates. This is crucial for practical applications - whether we're generating a tweet-length response or a full blog post.

We can control generation length in several ways:
1. **Token Limits**: Setting minimum and maximum token counts
2. **Stop Sequences**: Defining specific patterns that signal the end of generation
3. **End-of-Sequence Detection**: Letting the model naturally conclude its response

For example, if we want to generate a single paragraph, we might set a maximum of 100 tokens and use "\n\n" as a stop sequence. This ensures our output stays focused and appropriately sized for its purpose.
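
With 🤗 Transformers, length limits map directly onto generation arguments. A minimal sketch (checkpoint chosen only for illustration):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

result = generator(
    "Large language models generate text",
    max_new_tokens=100,   # hard upper bound on the number of generated tokens
    min_new_tokens=20,    # force at least a short continuation
    do_sample=True,
    temperature=0.7,
)
print(result[0]["generated_text"])
```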

<img src="https://huggingface.co/reasoning-course/images/blob/main/inference/3.png" alt="Token Selection Process" />

### Beam Search: Looking Ahead for Better Coherence

While the strategies we've discussed so far make decisions one token at a time, beam search takes a more holistic approach. Instead of committing to a single choice at each step, it explores multiple possible paths simultaneously - like a chess player thinking several moves ahead.

<img src="https://huggingface.co/reasoning-course/images/blob/main/inference/4.png" alt="Beam Search" />

Here's how it works:
1. At each step, maintain multiple candidate sequences (typically 5-10)
2. For each candidate, compute probabilities for the next token
3. Keep only the most promising combinations of sequences and next tokens
4. Continue this process until reaching the desired length or stop condition
5. Select the sequence with the highest overall probability

This approach often produces more coherent and grammatically correct text, though it requires more computational resources than simpler methods.
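
In practice, beam search is usually just a generation argument. A hedged example with `generate()` (again using an illustrative small checkpoint):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The future of open-source AI is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=5,            # keep 5 candidate sequences alive at each step
    max_new_tokens=40,
    early_stopping=True,    # stop once all beams have finished
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```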

## Practical Challenges and Optimization

As we wrap up our exploration of LLM inference, let's look at the practical challenges you'll face when deploying these models, and how to measure and optimize their performance.

### Key Performance Metrics

When working with LLMs, four critical metrics will shape your implementation decisions:

1. **Time to First Token (TTFT)**: How quickly can you get the first response? This is crucial for user experience and is primarily affected by the prefill phase.
2. **Time Per Output Token (TPOT)**: How fast can you generate subsequent tokens? This determines the overall generation speed.
3. **Throughput**: How many requests can you handle simultaneously? This affects scaling and cost efficiency.
4. **VRAM Usage**: How much GPU memory do you need? This often becomes the primary constraint in real-world applications.

### The Context Length Challenge

One of the most significant challenges in LLM inference is managing context length effectively. Longer contexts provide more information but come with substantial costs:

- **Memory Usage**: Grows quadratically with context length
- **Processing Speed**: Decreases linearly with longer contexts
- **Resource Allocation**: Requires careful balancing of VRAM usage

Recent models like Qwen2.5-1M offer impressive 1M token context windows, but this comes at the cost of significantly slower inference times. The key is finding the right balance for your specific use case.

<div style="max-width: 800px; margin: 20px auto; padding: 20px; font-family: system-ui;">
  <div style="border: 2px solid #ddd; border-radius: 8px; padding: 20px; margin-bottom: 20px;">
    <div style="display: flex; align-items: center; margin-bottom: 15px;">
      <div style="flex: 1; text-align: center; padding: 10px; background: #f0f0f0; border-radius: 4px;">
        Input Text (Raw)
      </div>
      <div style="margin: 0 10px;">→</div>
      <div style="flex: 1; text-align: center; padding: 10px; background: #e1f5fe; border-radius: 4px;">
        Tokenized Input
      </div>
    </div>
    <div style="display: flex; margin-bottom: 15px;">
      <div style="flex: 1; border: 1px solid #ccc; padding: 10px; margin: 5px; background: #e8f5e9; border-radius: 4px; text-align: center;">
        Context Window<br/>(e.g., 4K tokens)
        <div style="display: flex; margin-top: 10px;">
          <div style="flex: 1; background: #81c784; margin: 2px; height: 20px; border-radius: 2px;"></div>
          <div style="flex: 1; background: #81c784; margin: 2px; height: 20px; border-radius: 2px;"></div>
          <div style="flex: 1; background: #81c784; margin: 2px; height: 20px; border-radius: 2px;"></div>
          <div style="flex: 1; background: #81c784; margin: 2px; height: 20px; border-radius: 2px;"></div>
        </div>
      </div>
    </div>
    <div style="display: flex; justify-content: space-between; text-align: center; font-size: 0.9em; color: #666;">
      <div style="flex: 1;">
        <div style="border: 1px solid #ffcc80; padding: 8px; margin: 5px; background: #fff3e0; border-radius: 4px;">
          Memory Usage<br/>∝ Length²
        </div>
      </div>
      <div style="flex: 1;">
        <div style="border: 1px solid #90caf9; padding: 8px; margin: 5px; background: #e3f2fd; border-radius: 4px;">
          Processing Time<br/>∝ Length
        </div>
      </div>
    </div>
  </div>
</div>

### The KV Cache Optimization

To address these challenges, one of the most powerful optimizations is KV (Key-Value) caching. This technique significantly improves inference speed by storing and reusing intermediate calculations. This optimization:
- Reduces repeated calculations
- Improves generation speed
- Makes long-context generation practical

The trade-off is additional memory usage, but the performance benefits usually far outweigh this cost.
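
In 🤗 Transformers, the KV cache is used by default during generation. The sketch below (with an illustrative checkpoint) simply makes the trade-off visible by timing generation with the cache turned on and off:

```python
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("Key-value caching speeds up decoding because", return_tensors="pt")

for use_cache in (True, False):
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=50, use_cache=use_cache, do_sample=False)
    print(f"use_cache={use_cache}: {time.perf_counter() - start:.2f}s")
```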

## Conclusion

Understanding LLM inference is crucial for effectively deploying and optimizing these powerful models. We've covered the key components:

- The fundamental role of attention and context
- The two-phase inference process
- Various sampling strategies for controlling generation
- Practical challenges and optimizations

By mastering these concepts, you'll be better equipped to build applications that leverage LLMs effectively and efficiently.

Remember that the field of LLM inference is rapidly evolving, with new techniques and optimizations emerging regularly. Stay curious and keep experimenting with different approaches to find what works best for your specific use cases.
chapter1/material/8.md
ADDED
@@ -0,0 +1,32 @@
# Bias and limitations[[bias-and-limitations]]

<CourseFloatingBanner chapter={1}
  classNames="absolute z-10 right-0 top-0"
  notebooks={[
    {label: "Google Colab", value: "https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/en/chapter1/section8.ipynb"},
    {label: "Aws Studio", value: "https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/master/course/en/chapter1/section8.ipynb"},
]} />

If your intent is to use a pretrained model or a fine-tuned version in production, please be aware that, while these models are powerful tools, they come with limitations. The biggest of these is that, to enable pretraining on large amounts of data, researchers often scrape all the content they can find, taking the best as well as the worst of what is available on the internet.

To give a quick illustration, let's go back to the example of a `fill-mask` pipeline with the BERT model:

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])

result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])
```

```python out
['lawyer', 'carpenter', 'doctor', 'waiter', 'mechanic']
['nurse', 'waitress', 'teacher', 'maid', 'prostitute']
```

When asked to fill in the missing word in these two sentences, the model gives only one gender-free answer (waiter/waitress). The others are work occupations usually associated with one specific gender -- and yes, prostitute ended up in the top 5 possibilities the model associates with "woman" and "work." This happens even though BERT is one of the rare Transformer models not built by scraping data from all over the internet, but rather using apparently neutral data (it's trained on the [English Wikipedia](https://huggingface.co/datasets/wikipedia) and [BookCorpus](https://huggingface.co/datasets/bookcorpus) datasets).

When you use these tools, you therefore need to keep in the back of your mind that the original model you are using could very easily generate sexist, racist, or homophobic content. Fine-tuning the model on your data won't make this intrinsic bias disappear.
chapter1/material/9.md
ADDED
@@ -0,0 +1,66 @@
# Summary[[summary]]

<CourseFloatingBanner
    chapter={1}
    classNames="absolute z-10 right-0 top-0"
/>

In this chapter, you've been introduced to the fundamentals of Transformer models, Large Language Models (LLMs), and how they're revolutionizing AI and beyond.

## Key concepts covered

### Natural Language Processing and LLMs

We explored what NLP is and how Large Language Models have transformed the field. You learned that:
- NLP encompasses a wide range of tasks from classification to generation
- LLMs are powerful models trained on massive amounts of text data
- These models can perform multiple tasks within a single architecture
- Despite their capabilities, LLMs have limitations including hallucinations and bias

### Transformer capabilities

You saw how the `pipeline()` function from 🤗 Transformers makes it easy to use pre-trained models for various tasks:
- Text classification, token classification, and question answering
- Text generation and summarization
- Translation and other sequence-to-sequence tasks
- Speech recognition and image classification

### Transformer architecture

We discussed how Transformer models work at a high level, including:
- The importance of the attention mechanism
- How transfer learning enables models to adapt to specific tasks
- The three main architectural variants: encoder-only, decoder-only, and encoder-decoder

### Model architectures and their applications

A key aspect of this chapter was understanding which architecture to use for different tasks:

| Model | Examples | Tasks |
|-----------------|------------------------------|-----------------------------------------------------------------------------------|
| Encoder-only | BERT, DistilBERT, ModernBERT | Sentence classification, named entity recognition, extractive question answering |
| Decoder-only | GPT, LLaMA, Gemma, SmolLM | Text generation, conversational AI, creative writing |
| Encoder-decoder | BART, T5, Marian, mBART | Summarization, translation, generative question answering |

### Modern LLM developments

You also learned about recent developments in the field:
- How LLMs have grown in size and capability over time
- The concept of scaling laws and how they guide model development
- Specialized attention mechanisms that help models process longer sequences
- The two-phase training approach of pretraining and instruction tuning

### Practical applications

Throughout the chapter, you've seen how these models can be applied to real-world problems:
- Using the Hugging Face Hub to find and use pre-trained models
- Leveraging the Inference API to test models directly in your browser
- Understanding which models are best suited for specific tasks

## Looking ahead

Now that you have a solid understanding of what Transformer models are and how they work at a high level, you're ready to dive deeper into how to use them effectively. In the next chapters, you'll learn how to:

- Use the Transformers library to load and fine-tune models
- Process different types of data for model input
- Adapt pre-trained models to your specific tasks
- Deploy models for practical applications

The foundation you've built in this chapter will serve you well as you explore more advanced topics and techniques in the coming sections.
chapter1/presentation.md
ADDED
@@ -0,0 +1,518 @@
class: impact
|
2 |
+
|
3 |
+
# The LLM Course
|
4 |
+
## Chapter 1
|
5 |
+
## Introduction to NLP, LLMs & Transformers
|
6 |
+
|
7 |
+
.center[]
|
8 |
+
|
9 |
+
???
|
10 |
+
Welcome to the first chapter of the LLM Course! In this session, we'll dive into the foundational concepts of Natural Language Processing (NLP), explore the capabilities of Large Language Models (LLMs), and understand the revolutionary Transformer architecture that underpins many of these powerful tools.
|
11 |
+
|
12 |
+
---
|
13 |
+
|
14 |
+
# Introduction to NLP, LLMs & Transformers
|
15 |
+
|
16 |
+
- Understanding the Basics
|
17 |
+
- Exploring Capabilities
|
18 |
+
- How They Work
|
19 |
+
|
20 |
+
???
|
21 |
+
Today, we'll cover three main points: understanding the basics of these technologies, exploring their impressive capabilities, and getting an initial look at their inner workings.
|
22 |
+
|
23 |
+
---
|
24 |
+
|
25 |
+
# What is Natural Language Processing (NLP)?
|
26 |
+
|
27 |
+
**Definition**: Field blending linguistics and machine learning to enable computers to understand/process human language.
|
28 |
+
|
29 |
+
**Common Tasks**:
|
30 |
+
- Classifying sentences (sentiment, spam)
|
31 |
+
- Classifying words (POS tagging, NER)
|
32 |
+
- Generating text (autocomplete, translation, summarization)
|
33 |
+
|
34 |
+
???
|
35 |
+
So, what exactly is Natural Language Processing? It's an interdisciplinary field combining linguistics and machine learning, aiming to equip computers with the ability to understand, interpret, and process human language effectively. Common NLP tasks include classifying sentences (like sentiment analysis or spam detection), classifying individual words (such as Part-of-Speech (POS) tagging or Named Entity Recognition (NER) to identify entities like people, locations, and organizations), and generating text (for applications like autocomplete, machine translation, or document summarization).
|
36 |
+
|
37 |
+
---
|
38 |
+
|
39 |
+
# The Rise of Large Language Models (LLMs)
|
40 |
+
|
41 |
+
**Shift**: From task-specific models to versatile, large-scale models.
|
42 |
+
|
43 |
+
**Definition**: Massive models trained on vast text data.
|
44 |
+
|
45 |
+
**Key Characteristics**:
|
46 |
+
- Versatility: Handle multiple NLP tasks.
|
47 |
+
- Power: Learn intricate language patterns.
|
48 |
+
- Emergent Abilities: Show unexpected capabilities at scale.
|
49 |
+
|
50 |
+
**Examples**: GPT, Llama, BERT
|
51 |
+
|
52 |
+
???
|
53 |
+
Historically, NLP often involved creating specialized models for each specific task. However, the landscape shifted dramatically with the emergence of LLMs. These are exceptionally large neural networks trained on enormous datasets of text and code. Key characteristics distinguish LLMs: they exhibit remarkable *versatility*, capable of handling diverse NLP tasks within one framework; they possess immense *power*, learning complex language patterns that lead to state-of-the-art performance; and they often display *emergent abilities* – unexpected capabilities that arise as model size increases. Prominent examples include models like the Generative Pre-trained Transformer (GPT) series, Llama, and Bidirectional Encoder Representations from Transformers (BERT). These represent a paradigm shift towards leveraging single, powerful, adaptable models for a multitude of language challenges.
|
54 |
+
|
55 |
+
---
|
56 |
+
|
57 |
+
# The Growth of Transformer Models
|
58 |
+
|
59 |
+
.center[]
|
60 |
+
|
61 |
+
???
|
62 |
+
This chart illustrates the exponential growth in the size of Transformer models, measured by their number of parameters, over recent years. While there are exceptions like DistilBERT (intentionally designed to be smaller and faster), the dominant trend has been towards larger models to achieve higher performance. This scaling has unlocked significant capabilities but also brings challenges related to computational demands and environmental footprint. The clear trajectory from millions to billions, and even trillions, of parameters highlights how model scale has been a primary driver of recent advancements in NLP.
|
63 |
+
|
64 |
+
---
|
65 |
+
|
66 |
+
# Environmental Impact of LLMs
|
67 |
+
|
68 |
+
.center[]
|
69 |
+
|
70 |
+
???
|
71 |
+
Training these massive LLMs comes with a substantial environmental cost, as depicted in this chart showing the carbon footprint associated with training a large model. A single training run can have a carbon footprint comparable to the lifetime emissions of several cars. This underscores the critical importance of sharing pretrained models. Platforms like the Hugging Face Hub facilitate this sharing, allowing the community to build upon existing models instead of repeatedly training from scratch. This collaborative approach significantly reduces the collective computational burden and environmental impact, distributing the cost across numerous users and applications.
|
72 |
+
|
73 |
+
---
|
74 |
+
|
75 |
+
# Why is Language Processing Hard?
|
76 |
+
|
77 |
+
**Human Understanding**: Context, nuance, sarcasm are easy for humans, hard for machines.
|
78 |
+
|
79 |
+
**Machine Needs**: Text requires processing into a machine-understandable format.
|
80 |
+
|
81 |
+
**LLM Challenges**: Despite advances, LLMs struggle with full complexity and can inherit biases from training data.
|
82 |
+
|
83 |
+
???
|
84 |
+
Why does processing human language pose such a challenge for computers? Humans effortlessly grasp context, nuance, implied meanings, and even sarcasm. Machines, however, require language to be converted into a structured, numerical format they can process. While LLMs represent a huge leap forward, they still grapple with the full richness and complexity of human communication. We rely on shared knowledge, cultural context, and subtle cues that are difficult to encode explicitly. Even today's most advanced models can struggle with these aspects and may also inherit biases from the vast datasets they are trained on.
|
85 |
+
|
86 |
+
---
|
87 |
+
|
88 |
+
# The Hugging Face Ecosystem
|
89 |
+
|
90 |
+
.center[]
|
91 |
+
|
92 |
+
- Thousands of pretrained models
|
93 |
+
- Datasets for various tasks
|
94 |
+
- Spaces for demos
|
95 |
+
- Community collaboration
|
96 |
+
- Used by major companies and organizations
|
97 |
+
|
98 |
+
???
|
99 |
+
The Hugging Face ecosystem plays a pivotal role in the modern NLP and AI landscape, serving as a central hub for collaboration and resource sharing. Many leading companies and research organizations utilize and contribute to this platform. Key components include the Model Hub, hosting thousands of freely downloadable pretrained models; extensive Datasets for training and evaluation; Spaces for hosting interactive demos; and libraries like 🤗 Transformers. This open and collaborative environment has significantly accelerated progress by making state-of-the-art AI accessible to a global community.
|
100 |
+
|
101 |
+
---
|
102 |
+
|
103 |
+
# Transformers: What Can They Do?
|
104 |
+
|
105 |
+
**Architecture**: Introduced in 2017, powers most modern LLMs.
|
106 |
+
|
107 |
+
**Accessibility**: Libraries like Hugging Face 🤗 Transformers make them easy to use.
|
108 |
+
|
109 |
+
**Tool Highlight**: The pipeline() function.
|
110 |
+
|
111 |
+
.center[]
|
112 |
+
|
113 |
+
???
|
114 |
+
Let's focus on the Transformer architecture. Introduced in the seminal 2017 paper 'Attention Is All You Need', this architecture forms the backbone of most contemporary LLMs. Libraries like Hugging Face 🤗 Transformers simplify the process of using these powerful models. Before Transformers, architectures like Long Short-Term Memory networks (LSTMs) were common for sequence tasks, but they struggled with long-range dependencies in text and were less efficient to train due to their sequential nature. Transformers addressed these limitations, particularly through the innovative 'attention mechanism', enabling better handling of long sequences and more parallelizable, efficient training on massive datasets.
|
115 |
+
|
116 |
+
---
|
117 |
+
|
118 |
+
# The pipeline() Function
|
119 |
+
|
120 |
+
**Simplicity**: Perform complex tasks with minimal code.
|
121 |
+
|
122 |
+
**Example**: Sentiment Analysis
|
123 |
+
|
124 |
+
```python
|
125 |
+
from transformers import pipeline
|
126 |
+
classifier = pipeline('sentiment-analysis')
|
127 |
+
classifier('This is a great course!')
|
128 |
+
```
|
129 |
+
|
130 |
+
**Output**:
|
131 |
+
```
|
132 |
+
[{'label': 'POSITIVE', 'score': 0.9998}]
|
133 |
+
```
|
134 |
+
|
135 |
+
???
|
136 |
+
A great entry point into using 🤗 Transformers is the `pipeline()` function. It provides a high-level abstraction, allowing you to perform complex NLP tasks with minimal code. As shown in the example, sentiment analysis takes just three lines. The `pipeline()` handles the necessary steps behind the scenes: preprocessing the input text (tokenization), feeding it to the model for inference, and post-processing the model's output into a human-readable format. This ease of use significantly lowers the barrier to entry and has been instrumental in the broad adoption of Transformer models.
|
137 |
+
|
138 |
+
---
|
139 |
+
|
140 |
+
# pipeline() Task Examples (Text)
|
141 |
+
|
142 |
+
- **Sentiment Analysis**: Determine if text is positive/negative
|
143 |
+
- **Named Entity Recognition**: Identify people, organizations, locations
|
144 |
+
- **Question Answering**: Extract answers from context
|
145 |
+
- **Text Generation**: Complete prompts with new text
|
146 |
+
- **Summarization**: Condense long documents
|
147 |
+
- **Translation**: Convert text between languages
|
148 |
+
|
149 |
+
???
|
150 |
+
The `pipeline()` function is incredibly versatile, supporting numerous NLP tasks out-of-the-box. For text-based applications, this includes: Sentiment Analysis (determining positive/negative tone), Named Entity Recognition (identifying entities like people, places, organizations), Question Answering (extracting answers from context), Text Generation (completing prompts or creating new text), Summarization (condensing documents), and Translation (converting between languages). Each task typically leverages a model fine-tuned specifically for that purpose, all accessible through the consistent `pipeline()` interface.
|
151 |
+
|
152 |
+
---
|
153 |
+
|
154 |
+
# pipeline() Task Examples (Beyond Text)
|
155 |
+
|
156 |
+
**Image Classification**:
|
157 |
+
```python
|
158 |
+
from transformers import pipeline
|
159 |
+
|
160 |
+
img_classifier = pipeline('image-classification')
|
161 |
+
img_classifier('path/to/your/image.jpg')
|
162 |
+
```
|
163 |
+
|
164 |
+
**Automatic Speech Recognition**:
|
165 |
+
|
166 |
+
```python
|
167 |
+
from transformers import pipeline
|
168 |
+
|
169 |
+
transcriber = pipeline('automatic-speech-recognition')
|
170 |
+
transcriber('path/to/your/audio.flac')
|
171 |
+
```
|
172 |
+
|
173 |
+
???
|
174 |
+
The power of Transformers extends beyond text. The `pipeline()` function also supports tasks in other modalities, demonstrating the architecture's flexibility. Examples include Image Classification (identifying objects in images) and Automatic Speech Recognition (ASR - transcribing spoken audio to text). This ability to handle multimodal data stems from adapting the core Transformer concepts, like the attention mechanism, to process sequences derived from images (e.g., patches) or audio (e.g., frames).
|
175 |
+
|
176 |
+
---
|
177 |
+
|
178 |
+
# How Do Transformers Work?
|
179 |
+
|
180 |
+
.center[]
|
181 |
+
|
182 |
+
???
|
183 |
+
At its core, the original Transformer architecture, as shown in this diagram from the 'Attention Is All You Need' paper, comprises two main blocks: an Encoder (left side) and a Decoder (right side). The Encoder's role is to process the input sequence and build a rich representation of it. The Decoder then uses this representation, along with the sequence generated so far, to produce the output sequence. The groundbreaking element connecting these is the *attention mechanism*, which enables the model to dynamically weigh the importance of different parts of the input sequence when generating each part of the output.
|
184 |
+
|
185 |
+
---
|
186 |
+
|
187 |
+
# The Attention Mechanism
|
188 |
+
|
189 |
+
**Core Idea**: Allow the model to focus on relevant parts of the input when processing each word.
|
190 |
+
|
191 |
+
**Example**: In translating "You like this course" to French:
|
192 |
+
- When translating "like", attention focuses on "You" (for conjugation)
|
193 |
+
- When translating "this", attention focuses on "course" (for gender agreement)
|
194 |
+
|
195 |
+
**Key Innovation**: Enables modeling of long-range dependencies in text.
|
196 |
+
|
197 |
+
???
|
198 |
+
The *attention mechanism* is arguably the most crucial innovation of the Transformer architecture. It empowers the model to selectively focus on the most relevant parts of the input sequence when processing or generating each element (like a word). Consider translating 'You like this course' to French: To correctly conjugate 'like' (aimer), the model must attend to 'You'. To choose the correct form of 'this' (ce/cet/cette), it needs to attend to 'course' to determine its gender. This dynamic weighting of input elements allows Transformers to effectively model long-range dependencies and contextual relationships in data, a significant advantage over earlier sequential models.
|
199 |
+
|
200 |
+
---
|
201 |
+
|
202 |
+
# Transformer Architectures: Overview
|
203 |
+
|
204 |
+
Most models fall into three main categories:
|
205 |
+
|
206 |
+
1. **Encoder-only**: Best for understanding tasks (BERT)
|
207 |
+
2. **Decoder-only**: Best for generation tasks (GPT)
|
208 |
+
3. **Encoder-Decoder**: Best for transformation tasks (T5)
|
209 |
+
|
210 |
+
???
|
211 |
+
While based on the original design, most modern Transformer models utilize one of three main architectural patterns: 1. **Encoder-only**: Uses just the encoder stack. Excels at tasks requiring a deep understanding of the entire input sequence (e.g., BERT). 2. **Decoder-only**: Uses just the decoder stack. Ideal for generative tasks where the output is produced sequentially based on previous context (e.g., GPT). 3. **Encoder-Decoder**: Uses both stacks. Suited for sequence-to-sequence tasks that transform an input sequence into a new output sequence (e.g., T5, BART). Each architecture is optimized for different kinds of problems, so understanding their characteristics is key to selecting the appropriate model.
|
212 |
+
|
213 |
+
---
|
214 |
+
|
215 |
+
# Architecture 1: Encoder-only
|
216 |
+
|
217 |
+
**Examples**: BERT, RoBERTa
|
218 |
+
|
219 |
+
**How it works**: Uses only the encoder part. Attention layers access the entire input sentence (bi-directional).
|
220 |
+
|
221 |
+
**Best for**: Tasks requiring full sentence understanding.
|
222 |
+
- Sentence Classification
|
223 |
+
- Named Entity Recognition (NER)
|
224 |
+
- Extractive Question Answering
|
225 |
+
|
226 |
+
.center[]
|
227 |
+
|
228 |
+
???
|
229 |
+
Encoder-only models, exemplified by BERT and RoBERTa, leverage solely the encoder component of the Transformer. A key characteristic is their use of *bi-directional attention*, meaning that when processing any given word, the attention layers can access information from *all* other words in the input sentence, both preceding and succeeding. This allows the model to build a deep contextual understanding of the entire input. Consequently, these models excel at tasks demanding comprehensive input analysis, such as sentence classification, Named Entity Recognition (NER), and extractive question answering (where the answer is a span within the input text). They are generally not used for free-form text generation.
|
230 |
+
|
231 |
+
---
|
232 |
+
|
233 |
+
# Architecture 2: Decoder-only
|
234 |
+
|
235 |
+
**Examples**: GPT, Llama
|
236 |
+
|
237 |
+
**How it works**: Uses only the decoder part. Predicts the next word based on preceding words (causal/auto-regressive attention).
|
238 |
+
|
239 |
+
**Best for**: Text generation tasks.
|
240 |
+
- Auto-completion
|
241 |
+
- Creative Writing
|
242 |
+
- Chatbots
|
243 |
+
|
244 |
+
.center[]
|
245 |
+
|
246 |
+
???
|
247 |
+
Decoder-only models, such as those in the GPT family and Llama, utilize only the decoder stack. They operate differently from encoders: attention is *causal* or *auto-regressive*. This means that when predicting the next word (or token), the model can only attend to the words that came *before* it in the sequence, plus the current position. This inherent structure makes them exceptionally well-suited for text generation tasks, where the goal is to predict subsequent tokens based on preceding context. Applications include auto-completion, creative writing assistants, and chatbots. Indeed, many of the most prominent modern LLMs fall into this category.
|
248 |
+
|
249 |
+
---
|
250 |
+
|
251 |
+
# Architecture 3: Encoder-Decoder
|
252 |
+
|
253 |
+
**Examples**: BART, T5, MarianMT
|
254 |
+
|
255 |
+
**How it works**: Uses both encoder (processes input) and decoder (generates output). Attention in the decoder can access encoder outputs.
|
256 |
+
|
257 |
+
**Best for**: Sequence-to-sequence tasks (transforming input to output).
|
258 |
+
- Translation
|
259 |
+
- Summarization
|
260 |
+
- Generative Question Answering
|
261 |
+
|
262 |
+
.center[]
|
263 |
+
|
264 |
+
???
|
265 |
+
Encoder-Decoder models, also known as sequence-to-sequence models (e.g., BART, Text-to-Text Transfer Transformer (T5), MarianMT for translation), employ the complete original Transformer structure. The encoder first processes the entire input sequence to create a comprehensive representation. The decoder then generates the output sequence token by token, utilizing both its own previously generated tokens (causal attention) and the encoder's output representation (cross-attention). This architecture excels at tasks that involve transforming an input sequence into a different output sequence, such as machine translation, text summarization, or generative question answering (where the answer is generated, not just extracted). They effectively combine the input understanding strengths of encoders with the text generation capabilities of decoders.
|
266 |
+
|
267 |
+
---
|
268 |
+
|
269 |
+
# Model Architectures and Tasks
|
270 |
+
|
271 |
+
| Architecture | Examples | Best For |
|
272 |
+
|--------------|----------|----------|
|
273 |
+
| **Encoder-only** | BERT, RoBERTa | Classification, NER, Q&A |
|
274 |
+
| **Decoder-only** | GPT, LLaMA | Text generation, Chatbots |
|
275 |
+
| **Encoder-Decoder** | BART, T5 | Translation, Summarization |
|
276 |
+
|
277 |
+
???
|
278 |
+
This table provides a concise summary of the three primary Transformer architectures and their typical applications. To reiterate: Encoder-only models (like BERT) are strong for tasks requiring deep input understanding (classification, NER). Decoder-only models (like GPT, Llama) excel at generating text sequentially (chatbots, creative writing). Encoder-Decoder models (like BART, T5) are ideal for transforming input sequences into output sequences (translation, summarization). Selecting the most appropriate architecture is a critical initial decision in designing an effective NLP solution, as each is optimized for different strengths.
|
279 |
+
|
280 |
+
---
|
281 |
+
|
282 |
+
# Causal Language Modeling
|
283 |
+
|
284 |
+
.col-6[
|
285 |
+
**Key Characteristics:**
|
286 |
+
- Used by decoder models (GPT family)
|
287 |
+
- Predicts next word based on previous words
|
288 |
+
- Unidirectional attention (left-to-right)
|
289 |
+
- Well-suited for text generation
|
290 |
+
- Examples: GPT, LLaMA, Claude
|
291 |
+
]
|
292 |
+
.col-6[
|
293 |
+
.center[]
|
294 |
+
]
|
295 |
+
|
296 |
+
???
|
297 |
+
Causal Language Modeling (CLM) is the training objective typically associated with decoder-only models (like the GPT family, Llama, Claude). The core task is to predict the *next* token in a sequence, given all the preceding tokens. This inherently relies on *unidirectional* or *causal* attention – the model can only 'look back' at previous tokens and the current position. The term 'causal' highlights that the prediction at any point depends only on past information, mirroring the natural flow of language generation. This auto-regressive process is fundamental to how these models generate coherent text, one token after another.
|
298 |
+
|
299 |
+
---
|
300 |
+
|
301 |
+
# Masked Language Modeling
|
302 |
+
|
303 |
+
.col-6[
|
304 |
+
**Key Characteristics:**
|
305 |
+
- Used by encoder models (BERT family)
|
306 |
+
- Masks random words and predicts them
|
307 |
+
- Bidirectional attention (full context)
|
308 |
+
- Well-suited for understanding tasks
|
309 |
+
- Examples: BERT, RoBERTa, DeBERTa
|
310 |
+
]
|
311 |
+
.col-6[
|
312 |
+
.center[]
|
313 |
+
]
|
314 |
+
|
315 |
+
???
|
316 |
+
Masked Language Modeling (MLM) is the pretraining strategy characteristic of encoder-only models (like BERT, RoBERTa, DeBERTa). During training, a certain percentage of input tokens are randomly replaced with a special `[MASK]` token. The model's objective is then to predict the *original* identity of these masked tokens, using the surrounding *unmasked* context. This task necessitates *bidirectional* attention, as the model needs to consider context from both the left and the right of the mask to make an accurate prediction. MLM forces the model to develop a deep understanding of word meanings in context, making it highly effective for downstream tasks requiring rich text representations, such as classification or NER.
|
317 |
+
|
318 |
+
---
|
319 |
+
|
320 |
+
# Transfer Learning in NLP
|
321 |
+
|
322 |
+
**The key to efficient NLP:**
|
323 |
+
- Train once on general data
|
324 |
+
- Adapt to specific tasks
|
325 |
+
- Reduces computational costs
|
326 |
+
- Democratizes access to advanced AI
|
327 |
+
|
328 |
+
???
|
329 |
+
Transfer learning is a cornerstone technique in modern NLP. The core idea is to leverage the knowledge captured by a model trained on a massive, general dataset (pretraining) and adapt it to a new, specific task (fine-tuning). This approach significantly reduces the need for task-specific data and computational resources compared to training from scratch. It has democratized access to powerful AI, enabling smaller teams and individuals to build sophisticated applications by standing on the shoulders of large pretrained models. Let's examine the two key phases: pretraining and fine-tuning.
|
330 |
+
|
331 |
+
---
|
332 |
+
|
333 |
+
# Pretraining Phase
|
334 |
+
|
335 |
+
.col-6[
|
336 |
+
**Characteristics:**
|
337 |
+
- Train on massive datasets (billions of words)
|
338 |
+
- Learn general language patterns and representations
|
339 |
+
- Computationally expensive (can cost millions)
|
340 |
+
- Usually done once by research labs or companies
|
341 |
+
- Foundation for many downstream tasks
|
342 |
+
]
|
343 |
+
.col-6[
|
344 |
+
.center[]
|
345 |
+
]
|
346 |
+
|
347 |
+
???
|
348 |
+
The *pretraining* phase involves training a model from scratch on an enormous corpus of text (often billions or trillions of words). The goal is to learn general statistical patterns, grammar, and world knowledge embedded in the language. This is achieved using self-supervised objectives like Masked Language Modeling (MLM) or Causal Language Modeling (CLM), which don't require manual labels. Pretraining is extremely computationally intensive, often costing millions of dollars and requiring specialized hardware. It's typically performed once by large research labs or companies, creating foundational models that others can then adapt.
|
349 |
+
|
350 |
+
---
|
351 |
+
|
352 |
+
# Fine-tuning Phase
|
353 |
+
|
354 |
+
.col-6[
|
355 |
+
**Characteristics:**
|
356 |
+
- Adapt pretrained model to specific tasks
|
357 |
+
- Use smaller, task-specific datasets
|
358 |
+
- Much less expensive than pretraining
|
359 |
+
- Can be done on consumer hardware
|
360 |
+
- Preserves general knowledge while adding specialized capabilities
|
361 |
+
]
|
362 |
+
.col-6[
|
363 |
+
.center[]
|
364 |
+
]
|
365 |
+
|
366 |
+
???
|
367 |
+
The *fine-tuning* phase takes a pretrained model and further trains it on a smaller, labeled dataset specific to a target task (e.g., sentiment analysis, medical text summarization). This adaptation process is significantly less computationally expensive than pretraining and often feasible on standard hardware. Fine-tuning allows the model to specialize its general language understanding, acquired during pretraining, for the nuances of the specific task. It effectively transfers the broad knowledge base and refines it for a particular application, requiring far less task-specific data than training from scratch.
|
368 |
+
|
369 |
+
---
|
370 |
+
|
371 |
+
# Understanding LLM Inference
|
372 |
+
|
373 |
+
**Inference**: The process of using a trained model to generate predictions or outputs.
|
374 |
+
|
375 |
+
**Key Components**:
|
376 |
+
- Input processing (tokenization)
|
377 |
+
- Model computation
|
378 |
+
- Output generation
|
379 |
+
- Sampling strategies
|
380 |
+
|
381 |
+
???
|
382 |
+
Once a model is trained (either pretrained or fine-tuned), we use it for *inference* – the process of generating predictions or outputs for new, unseen inputs. For LLMs, inference typically means taking an input prompt and generating a textual response. Understanding the mechanics of inference is vital for deploying these models effectively, as it involves trade-offs between speed, cost, memory usage, and output quality. Key components include how the input is processed, the model's computations, how the output is generated step-by-step, and the strategies used to select the next token.
|

---

# The Two-Phase Inference Process

**1. Prefill Phase**:
- Process the entire input prompt
- Compute initial hidden states
- Computationally intensive for long prompts
- Only happens once per generation

**2. Decode Phase**:
- Generate one token at a time
- Use previous tokens as context
- Repeat until completion criteria met
- Most of the generation time is spent here

???
LLM inference typically proceeds in two distinct phases: 1. **Prefill Phase**: The model processes the entire input prompt simultaneously. This involves calculating initial internal states (often called hidden states or activations) based on the prompt. This phase is compute-bound, especially for long prompts, as it involves parallel processing across the input length, but it only occurs once per generation request. Think of it as the model 'reading and understanding' the prompt. 2. **Decode Phase**: The model generates the output token by token, auto-regressively. In each step, it uses the context of the prompt *and* all previously generated tokens to predict the next token. This phase is memory-bandwidth-bound, as it involves sequential generation and accessing previously computed states. This is the 'writing' part, repeated until a stopping condition is met.
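The two phases are easiest to see in a hand-rolled greedy decoding loop. This is only a sketch (greedy selection, `gpt2` as a stand-in model) that makes the prefill pass and the token-by-token decode loop explicit; in practice `generate()` handles all of this:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The attention mechanism", return_tensors="pt")

with torch.no_grad():
    # Prefill: a single forward pass over the whole prompt
    out = model(**inputs, use_cache=True)
    past_key_values = out.past_key_values
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)

    generated = [next_token]
    # Decode: one token per step, feeding back only the newest token
    for _ in range(20):
        out = model(input_ids=next_token, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_token)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```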

---

# Sampling Strategies

**Temperature**: Controls randomness (higher = more creative, lower = more deterministic)

**Top-k Sampling**: Only consider the k most likely next tokens

**Top-p (Nucleus) Sampling**: Consider tokens covering p probability mass

**Beam Search**: Explore multiple possible continuations in parallel

???
During the decode phase, selecting the *next* token isn't always about picking the single most probable one. Various *sampling strategies* allow control over the generation process: **Temperature**: Adjusts the randomness of predictions. Lower values (<1.0) make generation more focused and deterministic; higher values (>1.0) increase diversity and creativity, potentially at the cost of coherence. **Top-k Sampling**: Limits the selection pool to the 'k' most probable next tokens. **Top-p (Nucleus) Sampling**: Selects from the smallest set of tokens whose cumulative probability exceeds a threshold 'p'. **Beam Search**: Explores multiple potential sequences (beams) in parallel, choosing the overall most probable sequence at the end. These strategies enable tuning the output for different needs, balancing predictability and creativity.
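These strategies correspond directly to arguments of `generate()` in 🤗 Transformers. A minimal sketch, again using `gpt2` purely as a small example model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("Once upon a time", return_tensors="pt")

# Temperature + top-k + top-p sampling: more diverse, creative continuations
sampled = model.generate(**inputs, max_new_tokens=20, do_sample=True,
                         temperature=0.8, top_k=50, top_p=0.95)

# Beam search: keeps 4 candidate sequences and returns the most probable one
beamed = model.generate(**inputs, max_new_tokens=20, num_beams=4)

print(tokenizer.decode(sampled[0], skip_special_tokens=True))
print(tokenizer.decode(beamed[0], skip_special_tokens=True))
```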

---

# Optimizing Inference

**KV Cache**: Store key-value pairs from previous tokens to avoid recomputation

**Quantization**: Reduce precision of model weights (e.g., FP16, INT8)

**Batching**: Process multiple requests together for better hardware utilization

**Model Pruning**: Remove less important weights to reduce model size

???
Making LLM inference efficient is critical for practical deployment. Several optimization techniques are commonly used: **KV Cache**: Stores intermediate results (Key and Value projections from attention layers) for previously processed tokens, avoiding redundant computations during the decode phase. This significantly speeds up generation but increases memory usage. **Quantization**: Reduces the numerical precision of model weights (e.g., from 32-bit floats to 8-bit integers). This shrinks model size and speeds up computation, often with minimal impact on accuracy. **Batching**: Processing multiple input prompts simultaneously to maximize hardware (GPU) utilization. **Model Pruning/Sparsification**: Removing less important weights or structures from the model to reduce its size and computational cost. These techniques are essential for reducing latency, cost, and memory footprint in production.
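A minimal sketch of two of these levers, half-precision weights and batched generation, assuming a CUDA GPU is available; the KV cache (`use_cache`) is enabled by default in `generate()` and is shown here only for emphasis:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # illustrative; in practice a much larger model benefits far more
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # pad on the left for decoder-only generation

# Reduced precision: half the memory of float32, faster on modern GPUs
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# Batching: several prompts processed together in one padded batch
prompts = ["The capital of France is", "Attention mechanisms allow"]
batch = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")

outputs = model.generate(**batch, max_new_tokens=20, use_cache=True,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```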

---

# Bias and Limitations

**Source**: Models learn from vast internet data, including harmful stereotypes (sexist, racist, etc.).

**Problem**: LLMs can replicate and amplify these biases.

**Caution**: Fine-tuning does not automatically remove underlying bias. Awareness is crucial.

???
Despite their impressive capabilities, LLMs have significant limitations and potential harms. A major concern is *bias*. Trained on vast datasets scraped from the internet, models inevitably learn and can replicate harmful societal biases present in that data – including sexist, racist, and other prejudiced stereotypes. They learn statistical patterns, not true understanding or ethical reasoning. Consequently, their outputs can reflect and even amplify these biases. It's crucial to remember that fine-tuning on specific data doesn't automatically eliminate these underlying biases learned during pretraining.

---

# Bias Example: BERT Fill-Mask

**Code**:
```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
print(unmasker("This man works as a [MASK]."))
print(unmasker("This woman works as a [MASK]."))
```

**Results**:
- Man: lawyer, carpenter, doctor, waiter, mechanic
- Woman: nurse, waitress, teacher, maid, prostitute

???
This code snippet provides a concrete example of gender bias using the BERT model in a `fill-mask` task. When prompted to predict occupations for 'This man works as a...' versus 'This woman works as a...', the model suggests stereotypical roles (e.g., 'lawyer', 'carpenter' for men; 'nurse', 'waitress' for women). This occurs even though BERT was pretrained on seemingly neutral sources like Wikipedia and BookCorpus. It starkly illustrates how models absorb and reflect societal biases present in their training data, highlighting the need for careful evaluation and mitigation when using them in real-world scenarios.

---

# Sources of Bias in Models

**Three main sources**:
1. **Training Data**: Biased data leads to biased models
2. **Pretrained Models**: Bias persists through transfer learning
3. **Optimization Metrics**: What we optimize for shapes model behavior

**Mitigation approaches**:
- Careful dataset curation
- Bias evaluation and monitoring
- Fine-tuning with debiased data
- Post-processing techniques

???
Bias in AI models can originate from several sources throughout the development lifecycle: 1. **Training Data**: The most direct source; biased data leads to biased models. If the data reflects societal stereotypes, the model will learn them. 2. **Pretrained Models**: Bias embedded in foundational models persists even after fine-tuning on new data (negative transfer). 3. **Optimization Metrics & Objectives**: The very definition of 'good performance' can inadvertently favor biased outcomes if not carefully designed. Addressing bias requires a multi-faceted approach, including careful dataset curation and filtering, rigorous evaluation and monitoring for bias, specialized fine-tuning techniques using debiased data or objectives, and potentially post-processing model outputs.
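Bias evaluation and monitoring can start very simply, for instance by logging a model's top completions for templated prompts, in the spirit of the BERT example on the previous slide. A minimal sketch; the templates and the five-token probe are illustrative choices:

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

templates = [
    "This man works as a [MASK].",
    "This woman works as a [MASK].",
]

# Log the top predictions per template so shifts can be compared and monitored over time
for template in templates:
    top_tokens = [pred["token_str"] for pred in unmasker(template, top_k=5)]
    print(f"{template} -> {top_tokens}")
```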

---

# Summary & Recap

- NLP & LLMs: Defined the field and the impact of large models.
- Transformer Capabilities: Explored tasks solvable via pipeline().
- Attention: Introduced the core mechanism.
- Architectures: Covered Encoder-only, Decoder-only, Encoder-Decoder.
- Transfer Learning: Pretraining and fine-tuning approach.
- Inference: Understanding how models generate text.
- Bias: Highlighted the importance of awareness.

???
Let's quickly recap the key topics we've covered in this chapter: We defined Natural Language Processing (NLP) and saw how Large Language Models (LLMs) represent a major advancement. We explored the diverse capabilities of Transformers using the simple `pipeline()` function for tasks ranging from classification to generation. We introduced the core concept of the *attention mechanism*. We differentiated between the three main Transformer architectures (Encoder-only, Decoder-only, Encoder-Decoder) and their typical use cases. We explained the crucial *transfer learning* paradigm involving pretraining and fine-tuning. We looked into the mechanics of LLM *inference*, including sampling and optimization. Finally, we highlighted the critical issues of *bias and limitations* in these models. This provides a solid foundation for the rest of the course.

---

# Next Steps

- Deeper dive into the 🤗 Transformers library.
- Data processing techniques.
- Fine-tuning models for specific needs.
- Exploring advanced concepts.

???
Building on this foundation, the upcoming chapters will delve into more practical aspects. We will explore the 🤗 Transformers library in greater detail, learn essential data processing techniques for preparing text for models, walk through the process of fine-tuning pretrained models for specific tasks, and touch upon more advanced concepts and deployment strategies. Get ready to gain hands-on skills for effectively working with these powerful models.

---

class: center, middle

# Thank You!

???
Thank you for joining this first chapter. We hope this introduction has sparked your interest, and we look forward to seeing you in the next lesson!
chapter1/template/index.html
ADDED
@@ -0,0 +1,22 @@
<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8">
    <title>{{title}}</title>
    {{{style}}}
    <script src="remark.min.js"></script>
    <script>
      function create() {
        return remark.create({
          {{{source}}},
          ratio: '16:9',
          highlightLines: true,
          countIncrementalSlides: false,
          highlightStyle: 'github'
        });
      }
    </script>
  </head>
  <body onload="slideshow = create()">
  </body>
</html>
chapter1/template/remark.min.js
ADDED
The diff for this file is too large to render.
See raw diff
|
|
chapter1/template/style.scss
ADDED
@@ -0,0 +1,259 @@
1 |
+
@use "sass:math";
|
2 |
+
|
3 |
+
// Theme
|
4 |
+
// ---------------------------------------------------------------------------
|
5 |
+
|
6 |
+
// Hugging Face colors
|
7 |
+
$primary : #FFD21E; // HF Yellow
|
8 |
+
$secondary : #FF9D00; // HF Orange
|
9 |
+
$tertiary : #6B7280; // HF Gray
|
10 |
+
$light : #FFF;
|
11 |
+
$dark : #333;
|
12 |
+
$text-dark : #212121;
|
13 |
+
$text-light : $light;
|
14 |
+
$code-background : #F8F8F8;
|
15 |
+
$overlay : transparentize(#000, .5);
|
16 |
+
$font-size : 28px;
|
17 |
+
$font-size-impact : 128px;
|
18 |
+
$font : Arial, Helvetica, sans-serif;
|
19 |
+
$font-title : Arial, Helvetica, sans-serif;
|
20 |
+
$font-fixed : 'Lucida Console', Monaco, monospace;
|
21 |
+
$margin : 20px;
|
22 |
+
$iframe-scale : 1.5;
|
23 |
+
|
24 |
+
|
25 |
+
// CSS Base
|
26 |
+
// ---------------------------------------------------------------------------
|
27 |
+
|
28 |
+
* { box-sizing: border-box; }
|
29 |
+
body { font-family: $font; }
|
30 |
+
h1, h2, h3, h4, h5, h6 {
|
31 |
+
margin: 0 0 $margin 0;
|
32 |
+
font-family: $font-title;
|
33 |
+
}
|
34 |
+
h1 { color: $primary; }
|
35 |
+
h2 { color: $secondary; }
|
36 |
+
h3 { color: $tertiary; }
|
37 |
+
li { margin-bottom: .25em; };
|
38 |
+
pre, code {
|
39 |
+
text-align: left;
|
40 |
+
font-family: $font-fixed;
|
41 |
+
color: $secondary;
|
42 |
+
background: $code-background;
|
43 |
+
}
|
44 |
+
a, a:visited, a:hover, a:active { color: $text-dark; }
|
45 |
+
img {
|
46 |
+
vertical-align: inherit;
|
47 |
+
max-width: 100%;
|
48 |
+
height: auto;
|
49 |
+
}
|
50 |
+
blockquote {
|
51 |
+
border-left: 8px solid;
|
52 |
+
padding-left: .5em;
|
53 |
+
color: $tertiary;
|
54 |
+
text-align: left;
|
55 |
+
margin: 1em 0;
|
56 |
+
& > p { margin: 0; }
|
57 |
+
}
|
58 |
+
|
59 |
+
|
60 |
+
// Remark base
|
61 |
+
// ---------------------------------------------------------------------------
|
62 |
+
|
63 |
+
.remark-container { background: $dark; }
|
64 |
+
.remark-slide-scaler { box-shadow: none; }
|
65 |
+
.remark-notes { font-size: 1.5em; }
|
66 |
+
|
67 |
+
.remark-slide-content {
|
68 |
+
font-size: $font-size;
|
69 |
+
padding: 1em 2em;
|
70 |
+
color: $text-dark;
|
71 |
+
background-size: cover;
|
72 |
+
}
|
73 |
+
|
74 |
+
.remark-slide-number {
|
75 |
+
color: $text-light;
|
76 |
+
right: 1em;
|
77 |
+
opacity: .6;
|
78 |
+
font-size: 0.8em;
|
79 |
+
z-index: 2;
|
80 |
+
.no-counter & { display: none; }
|
81 |
+
}
|
82 |
+
|
83 |
+
// Additions
|
84 |
+
.impact {
|
85 |
+
background-color: #f0f0f0;
|
86 |
+
vertical-align: middle;
|
87 |
+
text-align: center;
|
88 |
+
&, h1, h2 { color: $dark; } // Dark text on yellow background for better readability
|
89 |
+
h1 { font-size: $font-size-impact; }
|
90 |
+
}
|
91 |
+
|
92 |
+
.full {
|
93 |
+
&, h1, h2 { color: $text-light; }
|
94 |
+
&iframe {
|
95 |
+
height: calc(math.div(100%, $iframe-scale) - 1.2em);
|
96 |
+
width: math.div(100%, $iframe-scale);
|
97 |
+
transform: scale($iframe-scale);
|
98 |
+
transform-origin: 0 0;
|
99 |
+
border: 0;
|
100 |
+
}
|
101 |
+
}
|
102 |
+
|
103 |
+
.bottom-bar {
|
104 |
+
background-color: $primary;
|
105 |
+
color: $dark; // Dark text on yellow background for better readability
|
106 |
+
position: absolute;
|
107 |
+
bottom: 0;
|
108 |
+
left: 0;
|
109 |
+
right: 0;
|
110 |
+
font-size: 20px;
|
111 |
+
padding: .8em;
|
112 |
+
text-align: left;
|
113 |
+
z-index: 1;
|
114 |
+
p { margin: 0;}
|
115 |
+
.impact &, .full & { display: none; }
|
116 |
+
}
|
117 |
+
|
118 |
+
|
119 |
+
// Utilities
|
120 |
+
// ---------------------------------------------------------------------------
|
121 |
+
|
122 |
+
// Positioning
|
123 |
+
.side-layer {
|
124 |
+
position: absolute;
|
125 |
+
left: 0;
|
126 |
+
width: 100%;
|
127 |
+
padding: 0 2em;
|
128 |
+
}
|
129 |
+
.middle { &, & img, & span { vertical-align: middle; } };
|
130 |
+
.top { vertical-align: top; };
|
131 |
+
.bottom { vertical-align: bottom; };
|
132 |
+
.inline-block {
|
133 |
+
p, ul, ol, blockquote {
|
134 |
+
display: inline-block;
|
135 |
+
text-align: left;
|
136 |
+
}
|
137 |
+
}
|
138 |
+
.no-margin { &, & > p, & > pre, & > ul, & > ol { margin: 0; } }
|
139 |
+
.no-padding { padding: 0; }
|
140 |
+
.space-left { padding-left: 1em; }
|
141 |
+
.space-right { padding-right: 1em; }
|
142 |
+
|
143 |
+
// Images
|
144 |
+
.responsive > img { width: 100%; height: auto; };
|
145 |
+
.contain { background-size: contain; };
|
146 |
+
.overlay { box-shadow: inset 0 0 0 9999px $overlay; }
|
147 |
+
|
148 |
+
// Center images
|
149 |
+
.center {
|
150 |
+
text-align: center;
|
151 |
+
img {
|
152 |
+
display: block;
|
153 |
+
margin: 0 auto;
|
154 |
+
max-width: 100%;
|
155 |
+
max-height: 400px; // Limit height to fit on slides
|
156 |
+
}
|
157 |
+
}
|
158 |
+
|
159 |
+
// Text
|
160 |
+
.left { text-align: left; }
|
161 |
+
.right { text-align: right; }
|
162 |
+
.justify { text-align: justify; }
|
163 |
+
.primary { color: $primary; }
|
164 |
+
.primary-bg { background-color: $primary; }
|
165 |
+
.secondary { color: $secondary; }
|
166 |
+
.secondary-bg { background-color: $secondary; }
|
167 |
+
.tertiary { color: $tertiary; }
|
168 |
+
.tertiary-bg { background-color: $tertiary; }
|
169 |
+
.alt { color: $secondary; };
|
170 |
+
.em { color: $tertiary; };
|
171 |
+
.thin { font-weight: 200; }
|
172 |
+
.huge { font-size: 2em; }
|
173 |
+
.big { font-size: 1.5em; }
|
174 |
+
.small { font-size: .8em; }
|
175 |
+
.strike { text-decoration: line-through; }
|
176 |
+
.dark { color: $dark; }
|
177 |
+
.dark-bg { background-color: $dark; }
|
178 |
+
.light { color: $light; }
|
179 |
+
.light-bg { background-color: $light; }
|
180 |
+
.alt-bg { background-color: $secondary; };
|
181 |
+
|
182 |
+
// Simple 12-columns grid system
|
183 |
+
.row {
|
184 |
+
width: 100%;
|
185 |
+
&::after {
|
186 |
+
content: '';
|
187 |
+
display: table;
|
188 |
+
clear: both;
|
189 |
+
}
|
190 |
+
&.table { display: table; };
|
191 |
+
&.table [class^="col-"] {
|
192 |
+
float: none;
|
193 |
+
display: table-cell;
|
194 |
+
vertical-align: inherit;
|
195 |
+
}
|
196 |
+
}
|
197 |
+
|
198 |
+
[class^="col-"] {
|
199 |
+
float: left;
|
200 |
+
&.inline-block {
|
201 |
+
float: none;
|
202 |
+
display: inline-block;
|
203 |
+
}
|
204 |
+
}
|
205 |
+
|
206 |
+
@for $i from 1 through 12 {
|
207 |
+
.col-#{$i} { width: math.div(100%, 12) * $i; }
|
208 |
+
}
|
209 |
+
|
210 |
+
// Animations
|
211 |
+
@keyframes fadeIn {
|
212 |
+
from { opacity: 0; }
|
213 |
+
to { opacity: 1; }
|
214 |
+
}
|
215 |
+
|
216 |
+
.animation-fade {
|
217 |
+
animation-duration: 300ms;
|
218 |
+
animation-fill-mode: both;
|
219 |
+
animation-timing-function: ease-out;
|
220 |
+
.remark-visible & { animation-name: fadeIn; }
|
221 |
+
}
|
222 |
+
|
223 |
+
// Hugging Face specific styles
|
224 |
+
.hf-logo {
|
225 |
+
max-height: 80px;
|
226 |
+
margin: 20px 0;
|
227 |
+
}
|
228 |
+
|
229 |
+
// Fix PDF print with chrome
|
230 |
+
// ---------------------------------------------------------------------------
|
231 |
+
|
232 |
+
@page {
|
233 |
+
// 908px 681px for 4/3 slides
|
234 |
+
size: 1210px 681px;
|
235 |
+
margin: 0;
|
236 |
+
}
|
237 |
+
|
238 |
+
@media print {
|
239 |
+
.remark-slide-scaler {
|
240 |
+
width: 100% !important;
|
241 |
+
height: 100% !important;
|
242 |
+
transform: scale(1) !important;
|
243 |
+
top: 0 !important;
|
244 |
+
left: 0 !important;
|
245 |
+
}
|
246 |
+
}
|
247 |
+
|
248 |
+
// Code
|
249 |
+
.remark-code {
|
250 |
+
font-size: .7em;
|
251 |
+
line-height: 1.2;
|
252 |
+
background: $dark !important;
|
253 |
+
color: $light !important;
|
254 |
+
|
255 |
+
}
|
256 |
+
|
257 |
+
.remark-code .hljs-keyword { color: $primary; }
|
258 |
+
|
259 |
+
.remark-code .hljs-string { color: $secondary; }
|
pyproject.toml
ADDED
@@ -0,0 +1,7 @@
[project]
name = "video-course"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.11"
dependencies = []
scripts/create_video.py
ADDED
@@ -0,0 +1,218 @@
1 |
+
import os
|
2 |
+
import glob
|
3 |
+
import argparse
|
4 |
+
import sys
|
5 |
+
from typing import List, Tuple
|
6 |
+
|
7 |
+
import natsort
|
8 |
+
from moviepy import *
|
9 |
+
from pdf2image import convert_from_path
|
10 |
+
|
11 |
+
|
12 |
+
def parse_arguments():
|
13 |
+
"""Parse command line arguments."""
|
14 |
+
parser = argparse.ArgumentParser(
|
15 |
+
description="Create a video from PDF slides and audio files."
|
16 |
+
)
|
17 |
+
parser.add_argument(
|
18 |
+
"--pdf", required=True, help="Path to the PDF file containing slides"
|
19 |
+
)
|
20 |
+
parser.add_argument(
|
21 |
+
"--audio-dir", required=True, help="Directory containing audio files"
|
22 |
+
)
|
23 |
+
parser.add_argument(
|
24 |
+
"--audio-pattern",
|
25 |
+
default="*.wav",
|
26 |
+
help="Pattern to match audio files (default: *.wav)",
|
27 |
+
)
|
28 |
+
parser.add_argument(
|
29 |
+
"--buffer",
|
30 |
+
type=float,
|
31 |
+
default=1.5,
|
32 |
+
help="Buffer time in seconds after each audio clip (default: 1.5)",
|
33 |
+
)
|
34 |
+
parser.add_argument(
|
35 |
+
"--output",
|
36 |
+
default="final_presentation.mp4",
|
37 |
+
help="Output video filename (default: final_presentation.mp4)",
|
38 |
+
)
|
39 |
+
parser.add_argument(
|
40 |
+
"--fps", type=int, default=5, help="Frame rate of output video (default: 5)"
|
41 |
+
)
|
42 |
+
parser.add_argument(
|
43 |
+
"--dpi",
|
44 |
+
type=int,
|
45 |
+
default=120,
|
46 |
+
help="DPI for PDF to image conversion (default: 120)",
|
47 |
+
)
|
48 |
+
return parser.parse_args()
|
49 |
+
|
50 |
+
|
51 |
+
def find_audio_files(audio_dir: str, pattern: str) -> List[str]:
|
52 |
+
"""Find and sort audio files in the specified directory."""
|
53 |
+
search_pattern = os.path.join(audio_dir, pattern)
|
54 |
+
audio_files = natsort.natsorted(glob.glob(search_pattern))
|
55 |
+
return audio_files
|
56 |
+
|
57 |
+
|
58 |
+
def convert_pdf_to_images(pdf_path: str, dpi: int) -> List:
|
59 |
+
"""Convert PDF pages to images."""
|
60 |
+
print(f"Converting PDF '{pdf_path}' to images...")
|
61 |
+
try:
|
62 |
+
pdf_images = convert_from_path(pdf_path, dpi=dpi)
|
63 |
+
print(f"Successfully converted {len(pdf_images)} pages from PDF.")
|
64 |
+
return pdf_images
|
65 |
+
except Exception as e:
|
66 |
+
print(f"Error converting PDF to images: {e}")
|
67 |
+
sys.exit(1)
|
68 |
+
|
69 |
+
|
70 |
+
def create_video_clips(
|
71 |
+
pdf_images: List, audio_files: List[str], buffer_seconds: float, output_fps: int
|
72 |
+
) -> List:
|
73 |
+
"""Create video clips from images and audio files."""
|
74 |
+
video_clips_list = []
|
75 |
+
|
76 |
+
print("\nCreating individual video clips...")
|
77 |
+
for i, (img, aud_file) in enumerate(zip(pdf_images, audio_files)):
|
78 |
+
print(
|
79 |
+
f"Processing pair {i + 1}/{len(pdf_images)}: "
|
80 |
+
f"Page {i + 1} + {os.path.basename(aud_file)}"
|
81 |
+
)
|
82 |
+
try:
|
83 |
+
# Load audio to get duration
|
84 |
+
audio_clip = AudioFileClip(aud_file)
|
85 |
+
audio_duration = audio_clip.duration
|
86 |
+
|
87 |
+
# Calculate target duration for the image clip
|
88 |
+
target_duration = audio_duration + buffer_seconds
|
89 |
+
|
90 |
+
# Create a temporary file for the image
|
91 |
+
temp_img_path = f"temp_slide_{i + 1}.png"
|
92 |
+
img.save(temp_img_path, "PNG")
|
93 |
+
|
94 |
+
# Create video clip from image with the correct duration
|
95 |
+
# In MoviePy v2.0+, we use ImageSequenceClip with a single image
|
96 |
+
img_clip = ImageSequenceClip([temp_img_path], durations=[target_duration])
|
97 |
+
|
98 |
+
# Set FPS for the individual clip
|
99 |
+
img_clip = img_clip.with_fps(output_fps)
|
100 |
+
|
101 |
+
# Set the audio for the image clip
|
102 |
+
video_clip_with_audio = img_clip.with_audio(audio_clip)
|
103 |
+
|
104 |
+
video_clips_list.append(video_clip_with_audio)
|
105 |
+
print(
|
106 |
+
f" -> Clip created (Audio: {audio_duration:.2f}s + "
|
107 |
+
f"Buffer: {buffer_seconds:.2f}s = "
|
108 |
+
f"Total: {target_duration:.2f}s)"
|
109 |
+
)
|
110 |
+
|
111 |
+
except Exception as e:
|
112 |
+
print(f" Error processing pair {i + 1}: {e}")
|
113 |
+
print(" Skipping this pair.")
|
114 |
+
# Close clips if they were opened, to release file handles
|
115 |
+
if "audio_clip" in locals() and audio_clip:
|
116 |
+
audio_clip.close()
|
117 |
+
if "img_clip" in locals() and img_clip:
|
118 |
+
img_clip.close()
|
119 |
+
if "video_clip_with_audio" in locals() and video_clip_with_audio:
|
120 |
+
video_clip_with_audio.close()
|
121 |
+
|
122 |
+
return video_clips_list
|
123 |
+
|
124 |
+
|
125 |
+
def concatenate_clips(
|
126 |
+
video_clips_list: List, output_file: str, output_fps: int
|
127 |
+
) -> None:
|
128 |
+
"""Concatenate video clips and write to output file."""
|
129 |
+
if not video_clips_list:
|
130 |
+
print("\nNo video clips were successfully created. Exiting.")
|
131 |
+
sys.exit(1)
|
132 |
+
|
133 |
+
print(f"\nConcatenating {len(video_clips_list)} clips...")
|
134 |
+
final_clip = None
|
135 |
+
try:
|
136 |
+
final_clip = concatenate_videoclips(video_clips_list, method="compose")
|
137 |
+
|
138 |
+
print(f"Writing final video file: {output_file}...")
|
139 |
+
# Write the final video file
|
140 |
+
final_clip.write_videofile(
|
141 |
+
output_file,
|
142 |
+
fps=output_fps,
|
143 |
+
codec="libx264",
|
144 |
+
audio_codec="aac",
|
145 |
+
# logger=None, # Suppress verbose output
|
146 |
+
)
|
147 |
+
print("Final video file written successfully.")
|
148 |
+
|
149 |
+
except Exception as e:
|
150 |
+
print(f"\nError during concatenation or writing video file: {e}")
|
151 |
+
print("Ensure you have enough free disk space and RAM.")
|
152 |
+
|
153 |
+
finally:
|
154 |
+
# Close clips to release resources
|
155 |
+
if final_clip:
|
156 |
+
final_clip.close()
|
157 |
+
for clip in video_clips_list:
|
158 |
+
clip.close()
|
159 |
+
|
160 |
+
|
161 |
+
def cleanup_temp_files(pdf_images: List) -> None:
|
162 |
+
"""Clean up temporary image files."""
|
163 |
+
print("\nCleaning up temporary files...")
|
164 |
+
for i in range(len(pdf_images)):
|
165 |
+
temp_img_path = f"temp_slide_{i + 1}.png"
|
166 |
+
if os.path.exists(temp_img_path):
|
167 |
+
os.remove(temp_img_path)
|
168 |
+
|
169 |
+
|
170 |
+
def main():
|
171 |
+
"""Main function to run the script."""
|
172 |
+
args = parse_arguments()
|
173 |
+
|
174 |
+
# Validate inputs
|
175 |
+
if not os.path.exists(args.pdf):
|
176 |
+
print(f"Error: PDF file '{args.pdf}' not found.")
|
177 |
+
sys.exit(1)
|
178 |
+
|
179 |
+
if not os.path.exists(args.audio_dir):
|
180 |
+
print(f"Error: Audio directory '{args.audio_dir}' not found.")
|
181 |
+
sys.exit(1)
|
182 |
+
|
183 |
+
# Find audio files
|
184 |
+
audio_files = find_audio_files(args.audio_dir, args.audio_pattern)
|
185 |
+
if not audio_files:
|
186 |
+
print(
|
187 |
+
f"Error: No audio files found matching pattern '{args.audio_pattern}' "
|
188 |
+
f"in directory '{args.audio_dir}'."
|
189 |
+
)
|
190 |
+
sys.exit(1)
|
191 |
+
|
192 |
+
# Convert PDF to images
|
193 |
+
pdf_images = convert_pdf_to_images(args.pdf, args.dpi)
|
194 |
+
|
195 |
+
# Check if number of PDF pages matches number of audio files
|
196 |
+
if len(pdf_images) != len(audio_files):
|
197 |
+
print("Error: Mismatched number of files found.")
|
198 |
+
print(f" PDF pages ({len(pdf_images)})")
|
199 |
+
print(f" Audio files ({len(audio_files)}): {audio_files}")
|
200 |
+
print("Please ensure you have one corresponding audio file for each PDF page.")
|
201 |
+
sys.exit(1)
|
202 |
+
|
203 |
+
print(f"Found {len(pdf_images)} PDF pages with {len(audio_files)} audio files.")
|
204 |
+
|
205 |
+
# Create video clips
|
206 |
+
video_clips = create_video_clips(pdf_images, audio_files, args.buffer, args.fps)
|
207 |
+
|
208 |
+
# Concatenate clips and create final video
|
209 |
+
concatenate_clips(video_clips, args.output, args.fps)
|
210 |
+
|
211 |
+
# Clean up
|
212 |
+
cleanup_temp_files(pdf_images)
|
213 |
+
|
214 |
+
print("\nScript finished.")
|
215 |
+
|
216 |
+
|
217 |
+
if __name__ == "__main__":
|
218 |
+
main()
|
scripts/transcription_to_audio.py
ADDED
@@ -0,0 +1,318 @@
1 |
+
#!/usr/bin/env python3
|
2 |
+
"""
|
3 |
+
Script to extract speaker notes from presentation.md files and convert them to
|
4 |
+
audio files.
|
5 |
+
|
6 |
+
Usage:
|
7 |
+
python transcription_to_audio.py path/to/chapter/directory
|
8 |
+
|
9 |
+
This will:
|
10 |
+
1. Parse the presentation.md file in the specified directory
|
11 |
+
2. Extract speaker notes (text between ??? markers)
|
12 |
+
3. Generate audio files using FAL AI with optional voice customization
|
13 |
+
4. Save audio files in {dir}/audio/{n}.wav format
|
14 |
+
"""
|
15 |
+
|
16 |
+
import re
|
17 |
+
import sys
|
18 |
+
import os
|
19 |
+
import argparse
|
20 |
+
import json
|
21 |
+
import hashlib
|
22 |
+
from pathlib import Path
|
23 |
+
import logging
|
24 |
+
import time
|
25 |
+
import requests
|
26 |
+
|
27 |
+
from dotenv import load_dotenv
|
28 |
+
import fal_client
|
29 |
+
|
30 |
+
load_dotenv()
|
31 |
+
|
32 |
+
# Configure logging
|
33 |
+
logging.basicConfig(
|
34 |
+
level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
|
35 |
+
)
|
36 |
+
|
37 |
+
logger = logging.getLogger(__name__)
|
38 |
+
|
39 |
+
VOICE_ID = os.getenv("VOICE_ID")
|
40 |
+
|
41 |
+
|
42 |
+
def extract_speaker_notes(markdown_content):
|
43 |
+
"""Extract speaker notes from markdown content."""
|
44 |
+
# Pattern to match content between ??? markers
|
45 |
+
pattern = r"\?\?\?(.*?)(?=\n---|\n$|\Z)"
|
46 |
+
|
47 |
+
# Find all matches using regex
|
48 |
+
matches = re.findall(pattern, markdown_content, re.DOTALL)
|
49 |
+
|
50 |
+
# Clean up the extracted notes
|
51 |
+
notes = [note.strip() for note in matches]
|
52 |
+
|
53 |
+
return notes
|
54 |
+
|
55 |
+
|
56 |
+
def get_cache_key(text, voice, speed, emotion, language):
|
57 |
+
"""Generate a unique cache key for the given parameters."""
|
58 |
+
# Create a string with all parameters
|
59 |
+
params_str = f"{text}|{voice}|{speed}|{emotion}|{language}"
|
60 |
+
|
61 |
+
# Generate a hash of the parameters
|
62 |
+
return hashlib.md5(params_str.encode()).hexdigest()
|
63 |
+
|
64 |
+
|
65 |
+
def load_cache(cache_file):
|
66 |
+
"""Load the cache from a file."""
|
67 |
+
if not cache_file.exists():
|
68 |
+
return {}
|
69 |
+
|
70 |
+
try:
|
71 |
+
with open(cache_file, "r") as f:
|
72 |
+
return json.load(f)
|
73 |
+
except (json.JSONDecodeError, IOError) as e:
|
74 |
+
logger.warning(f"Error loading cache: {e}")
|
75 |
+
return {}
|
76 |
+
|
77 |
+
|
78 |
+
def save_cache(cache_data, cache_file):
|
79 |
+
"""Save the cache to a file."""
|
80 |
+
try:
|
81 |
+
with open(cache_file, "w") as f:
|
82 |
+
json.dump(cache_data, f)
|
83 |
+
except IOError as e:
|
84 |
+
logger.warning(f"Error saving cache: {e}")
|
85 |
+
|
86 |
+
|
87 |
+
def text_to_speech(
|
88 |
+
text,
|
89 |
+
output_file,
|
90 |
+
voice=None,
|
91 |
+
speed=1.0,
|
92 |
+
emotion="happy",
|
93 |
+
language="English",
|
94 |
+
cache_dir=None,
|
95 |
+
):
|
96 |
+
"""Convert text to speech using FAL AI and save as audio file.
|
97 |
+
|
98 |
+
Args:
|
99 |
+
text: The text to convert to speech
|
100 |
+
output_file: Path to save the output audio file
|
101 |
+
voice: The voice ID to use
|
102 |
+
speed: Speech speed (0.5-2.0)
|
103 |
+
emotion: Emotion to apply (neutral, happy, sad, etc.)
|
104 |
+
language: Language for language boost
|
105 |
+
cache_dir: Directory to store cache files
|
106 |
+
"""
|
107 |
+
try:
|
108 |
+
start_time = time.monotonic()
|
109 |
+
|
110 |
+
# Create the output directory if it doesn't exist
|
111 |
+
output_file.parent.mkdir(exist_ok=True)
|
112 |
+
|
113 |
+
# Set up caching
|
114 |
+
cache_file = None
|
115 |
+
cache_data = {}
|
116 |
+
cache_key = get_cache_key(text, voice, speed, emotion, language)
|
117 |
+
|
118 |
+
if cache_dir:
|
119 |
+
cache_dir_path = Path(cache_dir)
|
120 |
+
cache_dir_path.mkdir(exist_ok=True)
|
121 |
+
cache_file = cache_dir_path / "audio_cache.json"
|
122 |
+
cache_data = load_cache(cache_file)
|
123 |
+
|
124 |
+
# Check if we have a cached URL for this request
|
125 |
+
if cache_key in cache_data:
|
126 |
+
audio_url = cache_data[cache_key]
|
127 |
+
logger.info(f"Using cached audio URL: {audio_url}")
|
128 |
+
|
129 |
+
# Download the audio from the cached URL
|
130 |
+
response = requests.get(audio_url)
|
131 |
+
if response.status_code == 200:
|
132 |
+
with open(output_file, "wb") as f:
|
133 |
+
f.write(response.content)
|
134 |
+
|
135 |
+
logger.info(f"Downloaded cached audio to {output_file}")
|
136 |
+
return True
|
137 |
+
else:
|
138 |
+
logger.warning(f"Cached URL failed, status: {response.status_code}")
|
139 |
+
# Continue with generation as the cached URL failed
|
140 |
+
|
141 |
+
# Set up voice settings
|
142 |
+
voice_setting = {"speed": speed, "emotion": emotion}
|
143 |
+
|
144 |
+
# Add custom voice ID if provided
|
145 |
+
if voice:
|
146 |
+
voice_setting["custom_voice_id"] = voice
|
147 |
+
|
148 |
+
def on_queue_update(update):
|
149 |
+
if isinstance(update, fal_client.InProgress):
|
150 |
+
for log in update.logs:
|
151 |
+
logger.debug(log["message"])
|
152 |
+
|
153 |
+
# Generate speech with FAL AI
|
154 |
+
logger.info(f"Generating speech with voice ID: {voice}")
|
155 |
+
|
156 |
+
result = fal_client.subscribe(
|
157 |
+
"fal-ai/minimax-tts/text-to-speech/turbo",
|
158 |
+
arguments={
|
159 |
+
"text": text,
|
160 |
+
"voice_setting": voice_setting,
|
161 |
+
"language_boost": language,
|
162 |
+
},
|
163 |
+
with_logs=True,
|
164 |
+
on_queue_update=on_queue_update,
|
165 |
+
)
|
166 |
+
|
167 |
+
# Download the audio file from the URL
|
168 |
+
if "audio" in result and "url" in result["audio"]:
|
169 |
+
audio_url = result["audio"]["url"]
|
170 |
+
logger.info(f"Downloading audio from {audio_url}")
|
171 |
+
|
172 |
+
# Cache the URL if caching is enabled
|
173 |
+
if cache_file:
|
174 |
+
cache_data[cache_key] = audio_url
|
175 |
+
save_cache(cache_data, cache_file)
|
176 |
+
logger.info(f"Cached audio URL for future use")
|
177 |
+
|
178 |
+
response = requests.get(audio_url)
|
179 |
+
if response.status_code == 200:
|
180 |
+
# Save the audio file
|
181 |
+
with open(output_file, "wb") as f:
|
182 |
+
f.write(response.content)
|
183 |
+
else:
|
184 |
+
logger.error(f"Failed to download audio: {response.status_code}")
|
185 |
+
return False
|
186 |
+
else:
|
187 |
+
logger.error(f"Unexpected response format: {result}")
|
188 |
+
return False
|
189 |
+
|
190 |
+
end_time = time.monotonic()
|
191 |
+
logger.info(
|
192 |
+
f"Generated audio in {end_time - start_time:.2f} seconds, "
|
193 |
+
f"saved to {output_file}"
|
194 |
+
)
|
195 |
+
|
196 |
+
return True
|
197 |
+
except Exception as e:
|
198 |
+
logger.error(f"Error generating audio: {e}")
|
199 |
+
return False
|
200 |
+
|
201 |
+
|
202 |
+
def process_presentation(
|
203 |
+
chapter_dir,
|
204 |
+
voice=None,
|
205 |
+
speed=1.0,
|
206 |
+
emotion="happy",
|
207 |
+
language="English",
|
208 |
+
cache_dir=None,
|
209 |
+
):
|
210 |
+
"""Process the presentation.md file in the given directory."""
|
211 |
+
# Construct paths
|
212 |
+
chapter_path = Path(chapter_dir)
|
213 |
+
presentation_file = chapter_path / "presentation.md"
|
214 |
+
audio_dir = chapter_path / "audio"
|
215 |
+
|
216 |
+
# Check if presentation file exists
|
217 |
+
if not presentation_file.exists():
|
218 |
+
logger.error(f"Presentation file not found: {presentation_file}")
|
219 |
+
return False
|
220 |
+
|
221 |
+
# Create audio directory if it doesn't exist
|
222 |
+
audio_dir.mkdir(exist_ok=True)
|
223 |
+
|
224 |
+
# Read the presentation file
|
225 |
+
with open(presentation_file, "r", encoding="utf-8") as file:
|
226 |
+
content = file.read()
|
227 |
+
|
228 |
+
# Extract speaker notes
|
229 |
+
notes = extract_speaker_notes(content)
|
230 |
+
|
231 |
+
if not notes:
|
232 |
+
logger.warning("No speaker notes found in the presentation file.")
|
233 |
+
return False
|
234 |
+
|
235 |
+
logger.info(f"Found {len(notes)} slides with speaker notes.")
|
236 |
+
|
237 |
+
# Generate audio files for each note
|
238 |
+
for i, note in enumerate(notes, 1):
|
239 |
+
if not note.strip():
|
240 |
+
logger.warning(f"Skipping empty note for slide {i}")
|
241 |
+
continue
|
242 |
+
|
243 |
+
output_file = audio_dir / f"{i}.wav"
|
244 |
+
logger.info(f"Generating audio for slide {i}")
|
245 |
+
|
246 |
+
success = text_to_speech(
|
247 |
+
note,
|
248 |
+
output_file,
|
249 |
+
voice,
|
250 |
+
speed,
|
251 |
+
emotion,
|
252 |
+
language,
|
253 |
+
cache_dir,
|
254 |
+
)
|
255 |
+
if success:
|
256 |
+
logger.info(f"Saved audio to {output_file}")
|
257 |
+
else:
|
258 |
+
logger.error(f"Failed to generate audio for slide {i}")
|
259 |
+
|
260 |
+
return True
|
261 |
+
|
262 |
+
|
263 |
+
def main():
|
264 |
+
parser = argparse.ArgumentParser(
|
265 |
+
description="Extract speaker notes from presentation.md and convert to"
|
266 |
+
" audio files."
|
267 |
+
)
|
268 |
+
parser.add_argument(
|
269 |
+
"chapter_dir", help="Path to the chapter directory containing presentation.md"
|
270 |
+
)
|
271 |
+
parser.add_argument(
|
272 |
+
"--voice",
|
273 |
+
default=VOICE_ID,
|
274 |
+
help="Voice ID to use (defaults to VOICE_ID from .env)",
|
275 |
+
)
|
276 |
+
parser.add_argument(
|
277 |
+
"--speed", type=float, default=1.0, help="Speech speed (0.5-2.0, default: 1.0)"
|
278 |
+
)
|
279 |
+
parser.add_argument(
|
280 |
+
"--emotion",
|
281 |
+
default="happy",
|
282 |
+
help="Emotion to apply (neutral, happy, sad, etc.)",
|
283 |
+
)
|
284 |
+
parser.add_argument(
|
285 |
+
"--language", default="English", help="Language for language boost"
|
286 |
+
)
|
287 |
+
parser.add_argument(
|
288 |
+
"--cache-dir",
|
289 |
+
default=".cache",
|
290 |
+
help="Directory to store cache files (default: .cache)",
|
291 |
+
)
|
292 |
+
parser.add_argument(
|
293 |
+
"--no-cache", action="store_true", help="Disable caching of audio URLs"
|
294 |
+
)
|
295 |
+
args = parser.parse_args()
|
296 |
+
|
297 |
+
# Determine cache directory
|
298 |
+
cache_dir = None if args.no_cache else args.cache_dir
|
299 |
+
|
300 |
+
logger.info(f"Processing presentation in {args.chapter_dir}")
|
301 |
+
success = process_presentation(
|
302 |
+
args.chapter_dir,
|
303 |
+
args.voice,
|
304 |
+
args.speed,
|
305 |
+
args.emotion,
|
306 |
+
args.language,
|
307 |
+
cache_dir,
|
308 |
+
)
|
309 |
+
|
310 |
+
if success:
|
311 |
+
logger.info("Audio generation completed successfully.")
|
312 |
+
else:
|
313 |
+
logger.error("Audio generation failed.")
|
314 |
+
sys.exit(1)
|
315 |
+
|
316 |
+
|
317 |
+
if __name__ == "__main__":
|
318 |
+
main()
|