---
license: apache-2.0
datasets:
- agentlans/c4-en-lowercased
language:
- en
base_model:
- google/flan-t5-small
tags:
- capitalization
---
# flan-t5-small-capitalizer
A fine-tuned version of [google/flan-t5-small](https://huggingface.co/google/flan-t5-small),
trained on the [agentlans/c4-en-lowercased](https://huggingface.co/datasets/agentlans/c4-en-lowercased) dataset to restore proper capitalization to lowercased English text.
## Key Features
- Restores proper noun and sentence capitalization
- Builds on FLAN-T5 small's robust NLP capabilities
- Designed for text normalization tasks
## Intended Uses
- Capitalizing lowercased text
- Sentence and proper noun capitalization
- Text normalization
## Usage Example
```python
import torch
from transformers import pipeline

# Use the first GPU if available, otherwise fall back to CPU
device = 0 if torch.cuda.is_available() else -1

model_name = "agentlans/flan-t5-small-capitalizer"
flan_t5_pipeline = pipeline("text2text-generation", model=model_name, device=device)

input_text = "buzzfeed's 360-degree look at the aftermath of california's valley fire has been viewed more than 6 million times. plenty of viewers have been asking how we made it."
output = flan_t5_pipeline(input_text, max_length=1024)
print(output[0]["generated_text"])
# Expected output:
# Buzzfeed's 360-degree look at the aftermath of California's Valley Fire has been viewed more than 6 million times. Plenty of viewers have been asking how we made it.
```
## Limitations
- **Language**: English only
- **Text Type**: Primarily modern prose found on the Internet
- **Capitalization Issues**:
- May not capitalize titles correctly
- Inconsistent capitalization style across texts
- Difficulty with special terms and abbreviations requiring capitalization
- **Input/Output Constraint**: Maximum length of 1024 tokens for both input and output
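For inputs longer than the 1024-token limit, one workaround is to split the text on sentence boundaries, capitalize each chunk, and rejoin the results. A minimal sketch, where the `capitalize_long_text` helper and its naive regex-based sentence splitting are illustrative assumptions rather than part of the model:

```python
import re


def capitalize_long_text(text, pipe, max_chars=2000):
    """Split text into sentence-aligned chunks, run each chunk through the
    capitalization pipeline, and rejoin. The character budget is a rough
    proxy for the 1024-token limit; tune it for your tokenizer."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for s in sentences:
        # Start a new chunk when adding this sentence would exceed the budget
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    outputs = pipe(chunks, max_length=1024)
    return " ".join(o["generated_text"] for o in outputs)
```

A sentence-aligned split keeps capitalization context intact at chunk boundaries, since the model only needs the current sentence to decide on sentence-initial capitals.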
## Training and evaluation data
The model was trained on a subset of the C4 dataset's English configuration.
This dataset contains 125,000 rows, split into 100,000 for training and 25,000 for validation. Each row includes the original text and its lowercased version.
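Such a pair can be derived directly from raw text by lowercasing it. A minimal sketch, where the field names `input` and `target` are illustrative and may not match the dataset's actual schema:

```python
def make_pair(text: str) -> dict:
    """Build a (lowercased -> original) training pair from raw text.
    Field names are illustrative; the dataset's real schema may differ."""
    return {"input": text.lower(), "target": text}


pair = make_pair("California's Valley Fire has been viewed 6 million times.")
# pair["input"] holds the fully lowercased text; pair["target"] keeps the
# original capitalization as the supervision signal.
```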
The model achieves a final validation loss of 0.1338 after processing 56,941,616 input tokens.
### Training hyperparameters
The model was trained using the following key hyperparameters:
- Learning rate: 5e-05
- Batch size: 8
- Number of epochs: 1
- Optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- Learning rate scheduler: Linear
- Maximum source and target length: 1024 tokens
Additional training arguments included bf16 precision, automatic batch size finding, and the use of a sortish sampler.
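The settings above roughly correspond to a `Seq2SeqTrainingArguments` configuration like the following sketch; `output_dir` is a placeholder, and any value not listed above is an assumption rather than a record of the original run:

```python
from transformers import Seq2SeqTrainingArguments

# Approximate reconstruction of the training configuration described above.
training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-small-capitalizer",  # placeholder path
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    num_train_epochs=1,
    lr_scheduler_type="linear",
    bf16=True,                   # mixed-precision training
    auto_find_batch_size=True,   # back off on CUDA OOM
    sortish_sampler=True,        # group similar-length sequences
)
```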
### Training results
<details>
<summary>Click here for table</summary>
| Training Loss | Epoch | Step | Validation Loss | Input Tokens Seen |
|:-------------:|:-----:|:-----:|:---------------:|:-----------------:|
| 0.2532 | 0.05 | 2500 | 0.1739 | 2824810 |
| 0.231 | 0.1 | 5000 | 0.1653 | 5702148 |
| 0.2163 | 0.15 | 7500 | 0.1571 | 8531178 |
| 0.1966 | 0.2 | 10000 | 0.1529 | 11350902 |
| 0.2013 | 0.25 | 12500 | 0.1491 | 14191502 |
| 0.1971 | 0.3 | 15000 | 0.1464 | 17050704 |
| 0.1791 | 0.35 | 17500 | 0.1447 | 19857804 |
| 0.193 | 0.4 | 20000 | 0.1424 | 22687180 |
| 0.1821 | 0.45 | 22500 | 0.1416 | 25532518 |
| 0.19 | 0.5 | 25000 | 0.1397 | 28423408 |
| 0.1753 | 0.55 | 27500 | 0.1388 | 31248170 |
| 0.184 | 0.6 | 30000 | 0.1378 | 34048604 |
| 0.1717 | 0.65 | 32500 | 0.1371 | 36903282 |
| 0.1693 | 0.7 | 35000 | 0.1359 | 39709784 |
| 0.1729 | 0.75 | 37500 | 0.1345 | 42614112 |
| 0.1711 | 0.8 | 40000 | 0.1344 | 45471178 |
| 0.1735 | 0.85 | 42500 | 0.1340 | 48355942 |
| 0.1797 | 0.9 | 45000 | 0.1340 | 51187066 |
| 0.1659 | 0.95 | 47500 | 0.1338 | 54074434 |
| 0.1658 | 1.0 | 50000 | 0.1338 | 56941616 |
</details>
### Framework versions
- Transformers 4.43.3
- PyTorch 2.3.0+cu121
- Datasets 3.2.0
- Tokenizers 0.19.1