---
license: apache-2.0
---

# flan-t5-small-capitalizer

This model is a fine-tuned version of [google/flan-t5-small](https://huggingface.co/google/flan-t5-small) on the [agentlans/c4-en-lowercased](https://huggingface.co/datasets/agentlans/c4-en-lowercased) dataset.

It restores proper noun and sentence capitalization to lowercased text. It builds on the FLAN-T5 small model, which is known for its strong performance across a wide range of natural language processing tasks.

## Intended uses & limitations

This model is intended for:
- Capitalizing lowercased text
- Restoring proper noun and sentence capitalization
- Text normalization

Limitations:
- English text only
- Focused on modern prose found on the Internet
- May not capitalize titles correctly
- Not guaranteed to apply a capitalization style consistently
- May have trouble with special terms and abbreviations that need to be capitalized
- Maximum input and output length of 1024 tokens (for longer texts, see the chunking sketch after the usage example)

Usage example:

```python
import torch
from transformers import pipeline

# Run on the first GPU if available, otherwise on the CPU.
device = 0 if torch.cuda.is_available() else -1
model_name = "agentlans/flan-t5-small-capitalizer"
flan_t5_pipeline = pipeline("text2text-generation", model=model_name, device=device)

input_text = "buzzfeed's 360-degree look at the aftermath of california's valley fire has been viewed more than 6 million times. plenty of viewers have been asking how we made it."

output = flan_t5_pipeline(input_text, max_length=1024)
print(output[0]["generated_text"])
# Expected output: Buzzfeed's 360-degree look at the aftermath of California's Valley Fire has been viewed more than 6 million times. Plenty of viewers have been asking how we made it.
```
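
Inputs longer than the 1024-token limit need to be split before capitalization. Here is a minimal sketch, assuming a naive sentence-boundary split is acceptable for your text; the `capitalize_long_text` helper and its character budget are illustrative, not part of the model:

```python
import re

def capitalize_long_text(text: str, max_chars: int = 2000) -> str:
    """Capitalize arbitrarily long text by chunking at sentence boundaries.

    `max_chars` is a rough character budget intended to keep each chunk
    under the model's 1024-token limit; tune it for your data.
    """
    # Naive split on ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    # Capitalize each chunk with the pipeline defined above, then rejoin.
    capitalized = [
        flan_t5_pipeline(chunk, max_length=1024)[0]["generated_text"]
        for chunk in chunks
    ]
    return " ".join(capitalized)
```
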
## Training and evaluation data

The model was trained on a subset of the C4 dataset's English configuration. The subset contains 125,000 rows, split into 100,000 for training and 25,000 for validation, where each row pairs the original text with its lowercased version.

The model achieves a final validation loss of 0.1338 after processing 56,941,616 input tokens.
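
Pairs like this can be generated mechanically from any English corpus, since the input is simply the lowercased target. A minimal sketch of how such pairs could be built with the `datasets` library (the `input`/`target` column names here are illustrative assumptions, not the published dataset's schema):

```python
from datasets import load_dataset

# Stream a slice of English C4 rather than downloading the full corpus.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

def make_pair(example):
    # The target is the original text; the input is its lowercased form.
    return {"input": example["text"].lower(), "target": example["text"]}

pairs = c4.map(make_pair)
print(next(iter(pairs))["input"][:80])  # peek at one lowercased input
```
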

### Training hyperparameters

The model was trained using the following key hyperparameters:
- Learning rate: 5e-05
- Batch size: 8
- Number of epochs: 1
- Optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- Learning rate scheduler: linear
- Maximum source and target length: 1024 tokens

Additional training arguments included bf16 precision, automatic batch size finding, and a sortish sampler.
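
As a rough guide to reproducing a similar run, the settings above map onto `Seq2SeqTrainingArguments` as sketched below; the output directory is a placeholder, and the exact arguments used for this model are not published:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-small-capitalizer",  # placeholder path
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    lr_scheduler_type="linear",
    adam_beta1=0.9,             # Adam betas and epsilon are the Trainer defaults
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    bf16=True,                  # bf16 precision
    auto_find_batch_size=True,  # automatic batch size finding
    sortish_sampler=True,       # batches examples of similar length together
)
```
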

### Training results

<details>
<summary>Click here for the table</summary>

| Training Loss | Epoch | Step | Validation Loss | Input Tokens Seen |
|:-------------:|:-----:|:-----:|:---------------:|:-----------------:|
| 0.2532 | 0.05 | 2500 | 0.1739 | 2824810 |
| 0.231 | 0.1 | 5000 | 0.1653 | 5702148 |
| 0.2163 | 0.15 | 7500 | 0.1571 | 8531178 |
| 0.1966 | 0.2 | 10000 | 0.1529 | 11350902 |
| 0.2013 | 0.25 | 12500 | 0.1491 | 14191502 |
| 0.1971 | 0.3 | 15000 | 0.1464 | 17050704 |
| 0.1791 | 0.35 | 17500 | 0.1447 | 19857804 |
| 0.193 | 0.4 | 20000 | 0.1424 | 22687180 |
| 0.1821 | 0.45 | 22500 | 0.1416 | 25532518 |
| 0.19 | 0.5 | 25000 | 0.1397 | 28423408 |
| 0.1753 | 0.55 | 27500 | 0.1388 | 31248170 |
| 0.184 | 0.6 | 30000 | 0.1378 | 34048604 |
| 0.1717 | 0.65 | 32500 | 0.1371 | 36903282 |
| 0.1693 | 0.7 | 35000 | 0.1359 | 39709784 |
| 0.1729 | 0.75 | 37500 | 0.1345 | 42614112 |
| 0.1711 | 0.8 | 40000 | 0.1344 | 45471178 |
| 0.1735 | 0.85 | 42500 | 0.1340 | 48355942 |
| 0.1797 | 0.9 | 45000 | 0.1340 | 51187066 |
| 0.1659 | 0.95 | 47500 | 0.1338 | 54074434 |
| 0.1658 | 1.0 | 50000 | 0.1338 | 56941616 |

</details>

### Framework versions

- Transformers 4.43.3
- PyTorch 2.3.0+cu121
- Datasets 3.2.0
- Tokenizers 0.19.1