agentlans committed · Commit 282775d (verified) · Parent: d3d5cc5

Update README.md

Files changed: README.md (+128 −3)
---
license: apache-2.0
datasets:
- agentlans/readability
language:
- en
base_model:
- microsoft/deberta-v3-base
pipeline_tag: text-classification
---
# DeBERTa V3 Base and XSmall for Readability Assessment

This is one of two fine-tuned versions of DeBERTa V3 (Base and XSmall) for assessing text readability.

## Model Details

- **Architecture:** DeBERTa V3 (Base and XSmall variants)
- **Task:** Regression (readability assessment)
- **Training Data:** 105,000 paragraphs from diverse sources
- **Input:** Text
- **Output:** Estimated U.S. grade level required to comprehend the text
  - Higher values indicate more complex text.

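Because the model outputs a continuous grade level, downstream code often needs to bucket the score. A minimal sketch of such a mapping (the band names and cutoffs here are illustrative assumptions, not part of the model):

```python
def grade_band(grade: float) -> str:
    """Map a predicted U.S. grade level to a coarse reading band.
    The cutoffs below are illustrative assumptions, not defined by the model."""
    if grade < 6:
        return "elementary"
    elif grade < 9:
        return "middle school"
    elif grade < 13:
        return "high school"
    else:
        return "college"

print(grade_band(2.34))   # elementary
print(grade_band(17.11))  # college
```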
## Performance

Root mean squared error (RMSE) on the 20% held-out validation set:

| Model  | RMSE   |
|--------|--------|
| Base   | 0.5038 |
| XSmall | 0.6296 |

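RMSE compares predicted grades against gold labels; a few lines suffice to compute it for your own evaluation set (the arrays below are made-up placeholders, not the actual validation data):

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences of grades."""
    assert len(predicted) == len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted))

# Toy example with made-up predictions vs. gold grade labels
print(round(rmse([2.3, 10.8, 17.1], [2.0, 11.0, 17.5]), 4))  # 0.3109
```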
## Training Data

The models were trained on a diverse dataset of 105,000 paragraphs with the following characteristics:

- Character length: 50 to 2,000
- Interquartile range (IQR) of readability grade estimates < 1

**Sources:**
- HuggingFace's Fineweb-Edu
- Ronen Eldan's TinyStories
- Wikipedia-2023-11-embed-multilingual-v3 (English only)
- ArXiv Abstracts-2021

For more details, please see [agentlans/readability](https://huggingface.co/datasets/agentlans/readability).

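The two filters above can be sketched as follows. This is a hedged illustration, not the actual pipeline (which is described in the dataset card): the `iqr` implementation and the idea of taking several per-formula grade estimates per paragraph are assumptions.

```python
def iqr(values):
    """Interquartile range via linear interpolation between sorted values."""
    xs = sorted(values)
    def quantile(q):
        pos = q * (len(xs) - 1)
        lo, hi = int(pos), min(int(pos) + 1, len(xs) - 1)
        return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)
    return quantile(0.75) - quantile(0.25)

def keep_paragraph(text, grade_estimates):
    """Apply the dataset's stated filters: 50-2,000 characters, and
    IQR of the paragraph's readability grade estimates below 1."""
    return 50 <= len(text) <= 2000 and iqr(grade_estimates) < 1

print(keep_paragraph("x" * 100, [5.0, 5.2, 5.5, 5.1]))  # True
print(keep_paragraph("x" * 10, [5.0, 5.2]))             # False (too short)
```

Paragraphs whose grade estimates disagree widely (IQR ≥ 1) are dropped, which keeps only texts whose difficulty label is stable.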
## Usage

Example of how to use the model:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "agentlans/deberta-v3-base-readability-v2"

# Put the model on the GPU if available, otherwise the CPU
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

def readability(text):
    """Processes the text using the model and returns its logits.
    Here the logit is the estimated reading grade level in years of education
    (the higher the number, the harder the text is to read)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)
    with torch.no_grad():
        logits = model(**inputs).logits.squeeze().cpu()
    return logits.tolist()

# Example usage
texts = [x.strip() for x in """
The cat sat on the mat.
I like to eat pizza and ice cream for dinner.
The quick brown fox jumps over the lazy dog.
Students must complete their homework before watching television.
The intricate ecosystem of the rainforest supports a diverse array of flora and fauna.
Quantum mechanics describes the behavior of matter and energy at the molecular, atomic, nuclear, and even smaller microscopic levels.
The socioeconomic ramifications of globalization have led to unprecedented levels of interconnectedness and cultural homogenization.
The ontological argument for the existence of God posits that the very concept of a maximally great being necessitates its existence in reality.
""".strip().split("\n")]

results = readability(texts)
for text, score in zip(texts, results):
    print(f"Text: {text}\nReadability grade: {round(score, 2)}\n")
```

Example output for the `base` size model:
```
Text: The cat sat on the mat.
Readability grade: 2.34

Text: I like to eat pizza and ice cream for dinner.
Readability grade: 3.56

Text: The quick brown fox jumps over the lazy dog.
Readability grade: 3.72

Text: Students must complete their homework before watching television.
Readability grade: 10.79

Text: The intricate ecosystem of the rainforest supports a diverse array of flora and fauna.
Readability grade: 11.1

Text: Quantum mechanics describes the behavior of matter and energy at the molecular, atomic, nuclear, and even smaller microscopic levels.
Readability grade: 17.11

Text: The socioeconomic ramifications of globalization have led to unprecedented levels of interconnectedness and cultural homogenization.
Readability grade: 19.53

Text: The ontological argument for the existence of God posits that the very concept of a maximally great being necessitates its existence in reality.
Readability grade: 16.8
```
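Note that `truncation=True` silently cuts off inputs longer than the tokenizer's maximum length, so very long documents are scored from their beginning only. One hedged workaround, sketched below (this helper is not part of the model card; `score_fn` stands for any batch scorer such as the `readability` function above), is to score sentence-boundary chunks and average the grades:

```python
import re

def chunked_grade(text, score_fn, max_chars=2000):
    """Split text into chunks of roughly max_chars at sentence boundaries,
    score each chunk with score_fn (a callable mapping a list of strings to
    a list of floats), and return the mean grade."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    scores = score_fn(chunks)
    return sum(scores) / len(scores)
```

With the model loaded as above, `chunked_grade(long_text, readability)` would then give a rough whole-document grade; averaging is a simplification, since chunk difficulty may vary widely within a document.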

## Limitations

- English language only
- Performance may vary for texts that differ significantly from the training data
- Output is based on statistical patterns and may not always align with human judgment
- Readability is assessed purely from textual features, without considering factors such as subject familiarity or cultural context

## Ethical Considerations

- Should not be used as the sole determinant of text suitability for specific audiences
- Results may reflect biases present in the training data sources
- Care should be taken when using these models in educational or publishing contexts