---
license: apache-2.0
datasets:
- agentlans/c4-en-lowercased
language:
- en
base_model:
- google/flan-t5-small
tags:
- capitalization
---
# flan-t5-small-capitalizer
A fine-tuned version of [google/flan-t5-small](https://huggingface.co/google/flan-t5-small),
trained on the [agentlans/c4-en-lowercased](https://huggingface.co/datasets/agentlans/c4-en-lowercased) dataset to restore proper capitalization to lowercased English text.
## Key Features
- Restores proper noun and sentence capitalization
- Builds on FLAN-T5 small's robust NLP capabilities
- Designed for text normalization tasks
## Intended Uses
- Capitalizing lowercased text
- Sentence and proper noun capitalization
- Text normalization
## Usage Example
```python
import torch
from transformers import pipeline

# Use the first GPU if available, otherwise fall back to CPU
device = 0 if torch.cuda.is_available() else -1

model_name = "agentlans/flan-t5-small-capitalizer"
flan_t5_pipeline = pipeline("text2text-generation", model=model_name, device=device)

input_text = "buzzfeed's 360-degree look at the aftermath of california's valley fire has been viewed more than 6 million times. plenty of viewers have been asking how we made it."
output = flan_t5_pipeline(input_text, max_length=1024)
print(output[0]["generated_text"])
# Expected output:
# Buzzfeed's 360-degree look at the aftermath of California's Valley Fire has been viewed more than 6 million times. Plenty of viewers have been asking how we made it.
```
## Limitations
- **Language**: English only
- **Text Type**: Primarily modern prose found on the Internet
- **Capitalization Issues**:
- May not capitalize titles correctly
- Inconsistent capitalization style across texts
- Difficulty with special terms and abbreviations requiring capitalization
- **Input/Output Constraint**: Maximum length of 1024 tokens for both input and output
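For inputs longer than the 1024-token limit, one workaround is to split the text on sentence boundaries, capitalize each chunk, and rejoin the results. A minimal sketch, where the `capitalize_long_text` helper and its naive regex-based sentence splitting are illustrative assumptions rather than part of the model:

```python
import re


def capitalize_long_text(text, pipe, max_chars=2000):
    """Split text into sentence-aligned chunks, run each chunk through the
    capitalization pipeline, and rejoin. The character budget is a rough
    proxy for the 1024-token limit; tune it for your tokenizer."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for s in sentences:
        # Start a new chunk when adding this sentence would exceed the budget
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    outputs = pipe(chunks, max_length=1024)
    return " ".join(o["generated_text"] for o in outputs)
```

A sentence-aligned split keeps capitalization context intact at chunk boundaries, since the model only needs the current sentence to decide on sentence-initial capitals.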
## Training and evaluation data
The model was trained on a subset of the C4 dataset's English configuration.
This dataset contains 125,000 rows, split into 100,000 for training and 25,000 for validation. Each row includes the original text and its lowercased version.
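Such a pair can be derived directly from raw text by lowercasing it. A minimal sketch, where the field names `input` and `target` are illustrative and may not match the dataset's actual schema:

```python
def make_pair(text: str) -> dict:
    """Build a (lowercased -> original) training pair from raw text.
    Field names are illustrative; the dataset's real schema may differ."""
    return {"input": text.lower(), "target": text}


pair = make_pair("California's Valley Fire has been viewed 6 million times.")
# pair["input"] holds the fully lowercased text; pair["target"] keeps the
# original capitalization as the supervision signal.
```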
The model achieves a final validation loss of 0.1338 after processing 56,941,616 input tokens.
### Training hyperparameters
The model was trained using the following key hyperparameters:
- Learning rate: 5e-05
- Batch size: 8
- Number of epochs: 1
- Optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- Learning rate scheduler: Linear
- Maximum source and target length: 1024 tokens
Additional training arguments included bf16 precision, automatic batch size finding, and the use of a sortish sampler.
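The settings above roughly correspond to a `Seq2SeqTrainingArguments` configuration like the following sketch; `output_dir` is a placeholder, and any value not listed above is an assumption rather than a record of the original run:

```python
from transformers import Seq2SeqTrainingArguments

# Approximate reconstruction of the training configuration described above.
training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-small-capitalizer",  # placeholder path
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    num_train_epochs=1,
    lr_scheduler_type="linear",
    bf16=True,                   # mixed-precision training
    auto_find_batch_size=True,   # back off on CUDA OOM
    sortish_sampler=True,        # group similar-length sequences
)
```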
### Training results
<details>
<summary>Click here for table</summary>
| Training Loss | Epoch | Step | Validation Loss | Input Tokens Seen |
|:-------------:|:-----:|:-----:|:---------------:|:-----------------:|
| 0.2532 | 0.05 | 2500 | 0.1739 | 2824810 |
| 0.231 | 0.1 | 5000 | 0.1653 | 5702148 |
| 0.2163 | 0.15 | 7500 | 0.1571 | 8531178 |
| 0.1966 | 0.2 | 10000 | 0.1529 | 11350902 |
| 0.2013 | 0.25 | 12500 | 0.1491 | 14191502 |
| 0.1971 | 0.3 | 15000 | 0.1464 | 17050704 |
| 0.1791 | 0.35 | 17500 | 0.1447 | 19857804 |
| 0.193 | 0.4 | 20000 | 0.1424 | 22687180 |
| 0.1821 | 0.45 | 22500 | 0.1416 | 25532518 |
| 0.19 | 0.5 | 25000 | 0.1397 | 28423408 |
| 0.1753 | 0.55 | 27500 | 0.1388 | 31248170 |
| 0.184 | 0.6 | 30000 | 0.1378 | 34048604 |
| 0.1717 | 0.65 | 32500 | 0.1371 | 36903282 |
| 0.1693 | 0.7 | 35000 | 0.1359 | 39709784 |
| 0.1729 | 0.75 | 37500 | 0.1345 | 42614112 |
| 0.1711 | 0.8 | 40000 | 0.1344 | 45471178 |
| 0.1735 | 0.85 | 42500 | 0.1340 | 48355942 |
| 0.1797 | 0.9 | 45000 | 0.1340 | 51187066 |
| 0.1659 | 0.95 | 47500 | 0.1338 | 54074434 |
| 0.1658 | 1.0 | 50000 | 0.1338 | 56941616 |
</details>
### Framework versions
- Transformers 4.43.3
- PyTorch 2.3.0+cu121
- Datasets 3.2.0
- Tokenizers 0.19.1