---
license: apache-2.0
---

# flan-t5-small-capitalizer

This model is a fine-tuned version of [google/flan-t5-small](https://huggingface.co/google/flan-t5-small) on the [agentlans/c4-en-lowercased](https://huggingface.co/datasets/agentlans/c4-en-lowercased) dataset.

It restores proper noun and sentence capitalization to lowercased text. It builds on the FLAN-T5 small model, which is known for its strong performance across a wide range of natural language processing tasks.

## Intended uses & limitations

This model is intended for:
- Capitalizing lowercased text
- Restoring proper noun and sentence capitalization
- Text normalization

Limitations:
- English text only
- Focused on modern prose found on the Internet
- May not capitalize titles correctly
- Not guaranteed to apply a capitalization style consistently
- May have trouble with special terms and abbreviations that need to be capitalized
- Maximum input and output length of 1024 tokens (for longer texts, see the chunking sketch after the usage example)

Usage example:

```python
import torch
from transformers import pipeline

# Run on the first GPU if available, otherwise on the CPU.
device = 0 if torch.cuda.is_available() else -1
model_name = "agentlans/flan-t5-small-capitalizer"
flan_t5_pipeline = pipeline("text2text-generation", model=model_name, device=device)

input_text = "buzzfeed's 360-degree look at the aftermath of california's valley fire has been viewed more than 6 million times. plenty of viewers have been asking how we made it."

output = flan_t5_pipeline(input_text, max_length=1024)
print(output[0]["generated_text"])
# Expected output: Buzzfeed's 360-degree look at the aftermath of California's Valley Fire has been viewed more than 6 million times. Plenty of viewers have been asking how we made it.
```
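
Inputs longer than the 1024-token limit need to be split before capitalization. Here is a minimal sketch, assuming a naive sentence-boundary split is acceptable for your text; the `capitalize_long_text` helper and its character budget are illustrative, not part of the model:

```python
import re

def capitalize_long_text(text: str, max_chars: int = 2000) -> str:
    """Capitalize arbitrarily long text by chunking at sentence boundaries.

    `max_chars` is a rough character budget intended to keep each chunk
    under the model's 1024-token limit; tune it for your data.
    """
    # Naive split on ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    # Capitalize each chunk with the pipeline defined above, then rejoin.
    capitalized = [
        flan_t5_pipeline(chunk, max_length=1024)[0]["generated_text"]
        for chunk in chunks
    ]
    return " ".join(capitalized)
```
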
## Training and evaluation data

The model was trained on a subset of the C4 dataset's English configuration. The subset contains 125,000 rows, split into 100,000 for training and 25,000 for validation, where each row pairs the original text with its lowercased version.

The model achieves a final validation loss of 0.1338 after processing 56,941,616 input tokens.
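
Pairs like this can be generated mechanically from any English corpus, since the input is simply the lowercased target. A minimal sketch of how such pairs could be built with the `datasets` library (the `input`/`target` column names here are illustrative assumptions, not the published dataset's schema):

```python
from datasets import load_dataset

# Stream a slice of English C4 rather than downloading the full corpus.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

def make_pair(example):
    # The target is the original text; the input is its lowercased form.
    return {"input": example["text"].lower(), "target": example["text"]}

pairs = c4.map(make_pair)
print(next(iter(pairs))["input"][:80])  # peek at one lowercased input
```
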

### Training hyperparameters

The model was trained using the following key hyperparameters:
- Learning rate: 5e-05
- Batch size: 8
- Number of epochs: 1
- Optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- Learning rate scheduler: linear
- Maximum source and target length: 1024 tokens

Additional training arguments included bf16 precision, automatic batch size finding, and a sortish sampler.
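
As a rough guide to reproducing a similar run, the settings above map onto `Seq2SeqTrainingArguments` as sketched below; the output directory is a placeholder, and the exact arguments used for this model are not published:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-small-capitalizer",  # placeholder path
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    lr_scheduler_type="linear",
    adam_beta1=0.9,             # Adam betas and epsilon are the Trainer defaults
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    bf16=True,                  # bf16 precision
    auto_find_batch_size=True,  # automatic batch size finding
    sortish_sampler=True,       # batches examples of similar length together
)
```
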

### Training results

<details>
<summary>Click here for the table</summary>

| Training Loss | Epoch | Step | Validation Loss | Input Tokens Seen |
|:-------------:|:-----:|:-----:|:---------------:|:-----------------:|
| 0.2532 | 0.05 | 2500 | 0.1739 | 2824810 |
| 0.231 | 0.1 | 5000 | 0.1653 | 5702148 |
| 0.2163 | 0.15 | 7500 | 0.1571 | 8531178 |
| 0.1966 | 0.2 | 10000 | 0.1529 | 11350902 |
| 0.2013 | 0.25 | 12500 | 0.1491 | 14191502 |
| 0.1971 | 0.3 | 15000 | 0.1464 | 17050704 |
| 0.1791 | 0.35 | 17500 | 0.1447 | 19857804 |
| 0.193 | 0.4 | 20000 | 0.1424 | 22687180 |
| 0.1821 | 0.45 | 22500 | 0.1416 | 25532518 |
| 0.19 | 0.5 | 25000 | 0.1397 | 28423408 |
| 0.1753 | 0.55 | 27500 | 0.1388 | 31248170 |
| 0.184 | 0.6 | 30000 | 0.1378 | 34048604 |
| 0.1717 | 0.65 | 32500 | 0.1371 | 36903282 |
| 0.1693 | 0.7 | 35000 | 0.1359 | 39709784 |
| 0.1729 | 0.75 | 37500 | 0.1345 | 42614112 |
| 0.1711 | 0.8 | 40000 | 0.1344 | 45471178 |
| 0.1735 | 0.85 | 42500 | 0.1340 | 48355942 |
| 0.1797 | 0.9 | 45000 | 0.1340 | 51187066 |
| 0.1659 | 0.95 | 47500 | 0.1338 | 54074434 |
| 0.1658 | 1.0 | 50000 | 0.1338 | 56941616 |

</details>

### Framework versions

- Transformers 4.43.3
- PyTorch 2.3.0+cu121
- Datasets 3.2.0
- Tokenizers 0.19.1