---
license: apache-2.0
---

# flan-t5-small-capitalizer

This model is a fine-tuned version of [google/flan-t5-small](https://huggingface.co/google/flan-t5-small)
on the [agentlans/c4-en-lowercased](https://huggingface.co/datasets/agentlans/c4-en-lowercased) dataset.

It restores sentence and proper-noun capitalization to lowercased text.
It builds on FLAN-T5 small, which is known for its strong performance across a wide range of natural language processing tasks.

## Intended uses & limitations

This model is intended for:
- Capitalizing lowercased text
- Restoring proper noun and sentence capitalization
- Text normalization

Limitations:
- English only
- Focused on modern prose found on the Internet
- May not capitalize titles correctly
- Not guaranteed to apply a capitalization style consistently
- May struggle with special terms and abbreviations that require capitalization
- Maximum input and output length: 1024 tokens

Usage example:

```python
import torch
from transformers import pipeline

# Use the GPU if one is available, otherwise fall back to CPU
device = 0 if torch.cuda.is_available() else -1
model_name = "agentlans/flan-t5-small-capitalizer"
flan_t5_pipeline = pipeline("text2text-generation", model=model_name, device=device)

input_text = "buzzfeed's 360-degree look at the aftermath of california's valley fire has been viewed more than 6 million times. plenty of viewers have been asking how we made it."

output = flan_t5_pipeline(input_text, max_length=1024)
print(output[0]["generated_text"])
# Expected output: Buzzfeed's 360-degree look at the aftermath of California's Valley Fire has been viewed more than 6 million times. Plenty of viewers have been asking how we made it.
```

## Training and evaluation data

The model was trained on a subset of the C4 dataset's English configuration.
This subset contains 125,000 rows, split into 100,000 for training and 25,000 for validation. Each row includes the original text and its lowercased version.
The model achieves a final validation loss of 0.1338 after processing 56,941,616 input tokens.
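Training pairs like those in the dataset can be reproduced from any English corpus by pairing each passage's lowercased form (input) with the original (target). A toy sketch (the example texts and the `input`/`target` key names are illustrative, not the dataset's actual column names):

```python
# Toy stand-in for the C4 passages used in the dataset
texts = [
    "BuzzFeed published a report on the Valley Fire in California.",
    "Plenty of viewers have been asking how we made it.",
]

# Each example maps the lowercased text back to the original
# capitalization, mirroring agentlans/c4-en-lowercased.
pairs = [{"input": t.lower(), "target": t} for t in texts]

print(pairs[0]["input"])
# buzzfeed published a report on the valley fire in california.
```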

### Training hyperparameters

The model was trained with the following key hyperparameters:
- Learning rate: 5e-05
- Batch size: 8
- Number of epochs: 1
- Optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- Learning rate scheduler: linear
- Maximum source and target length: 1024 tokens

Additional training arguments included bf16 precision, automatic batch size finding, and a sortish sampler.
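The settings above roughly correspond to a `Seq2SeqTrainingArguments` configuration like the following (a sketch, not the exact training script; the output directory name is a placeholder, and the Adam betas/epsilon listed above are the Transformers defaults):

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of training arguments matching the hyperparameters listed above.
# The 1024-token source/target limit is applied during tokenization,
# not here.
training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-small-capitalizer",  # placeholder
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    lr_scheduler_type="linear",
    bf16=True,
    auto_find_batch_size=True,
    sortish_sampler=True,
)
```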

### Training results

<details>
<summary>Click here for the full table</summary>

| Training Loss | Epoch | Step | Validation Loss | Input Tokens Seen |
|:-------------:|:-----:|:-----:|:---------------:|:-----------------:|
| 0.2532 | 0.05 | 2500 | 0.1739 | 2824810 |
| 0.231 | 0.1 | 5000 | 0.1653 | 5702148 |
| 0.2163 | 0.15 | 7500 | 0.1571 | 8531178 |
| 0.1966 | 0.2 | 10000 | 0.1529 | 11350902 |
| 0.2013 | 0.25 | 12500 | 0.1491 | 14191502 |
| 0.1971 | 0.3 | 15000 | 0.1464 | 17050704 |
| 0.1791 | 0.35 | 17500 | 0.1447 | 19857804 |
| 0.193 | 0.4 | 20000 | 0.1424 | 22687180 |
| 0.1821 | 0.45 | 22500 | 0.1416 | 25532518 |
| 0.19 | 0.5 | 25000 | 0.1397 | 28423408 |
| 0.1753 | 0.55 | 27500 | 0.1388 | 31248170 |
| 0.184 | 0.6 | 30000 | 0.1378 | 34048604 |
| 0.1717 | 0.65 | 32500 | 0.1371 | 36903282 |
| 0.1693 | 0.7 | 35000 | 0.1359 | 39709784 |
| 0.1729 | 0.75 | 37500 | 0.1345 | 42614112 |
| 0.1711 | 0.8 | 40000 | 0.1344 | 45471178 |
| 0.1735 | 0.85 | 42500 | 0.1340 | 48355942 |
| 0.1797 | 0.9 | 45000 | 0.1340 | 51187066 |
| 0.1659 | 0.95 | 47500 | 0.1338 | 54074434 |
| 0.1658 | 1.0 | 50000 | 0.1338 | 56941616 |

</details>

### Framework versions

- Transformers 4.43.3
- PyTorch 2.3.0+cu121
- Datasets 3.2.0
- Tokenizers 0.19.1