Update README.md
---
license: apache-2.0
datasets:
- agentlans/readability
language:
- en
base_model:
- microsoft/deberta-v3-base
pipeline_tag: text-classification
---

# DeBERTa V3 Base and XSmall for Readability Assessment

This is one of two fine-tuned versions of DeBERTa V3 (Base and XSmall) for assessing text readability.

## Model Details

- **Architecture:** DeBERTa V3 (Base and XSmall variants)
- **Task:** Regression (readability assessment)
- **Training Data:** 105,000 paragraphs from diverse sources
- **Input:** Text
- **Output:** Estimated U.S. grade level required to comprehend the text (higher values indicate more complex text)

## Performance

Root mean squared error (RMSE) on a 20% held-out validation set:

| Model  | RMSE   |
|--------|--------|
| Base   | 0.5038 |
| XSmall | 0.6296 |
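
For reference, RMSE here is the square root of the mean squared difference between predicted and reference grade levels. A minimal sketch of that computation (assuming parallel lists of predictions and gold labels; `val_texts` and `val_grades` are hypothetical names, and this is not the authors' evaluation script):

```python
import math

def rmse(predicted, reference):
    """Root mean squared error between two equal-length lists of grade levels."""
    assert len(predicted) == len(reference)
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(predicted))

# e.g. rmse(readability(val_texts), val_grades) on a held-out split
```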

## Training Data

The models were trained on a diverse dataset of 105,000 paragraphs with the following characteristics:

- Character length: 50 to 2,000
- Interquartile range (IQR) of readability grades < 1

**Sources:**

- HuggingFace's Fineweb-Edu
- Ronen Eldan's TinyStories
- Wikipedia-2023-11-embed-multilingual-v3 (English only)
- ArXiv Abstracts-2021

For more details, please see [agentlans/readability](https://huggingface.co/datasets/agentlans/readability).
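
As a rough illustration of the selection criteria above (a sketch only; the actual dataset construction is documented at the link, and the per-paragraph grade estimates are assumed here to come from several readability formulas):

```python
from statistics import quantiles

def keep_paragraph(text, grade_estimates):
    """Hypothetical filter: keep paragraphs of 50-2,000 characters whose
    readability estimates agree closely (IQR < 1)."""
    q1, _, q3 = quantiles(grade_estimates, n=4)  # quartile cut points
    return 50 <= len(text) <= 2000 and (q3 - q1) < 1
```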

## Usage

Example of how to use the model:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "agentlans/deberta-v3-base-readability-v2"

# Put the model on the GPU if available, else the CPU
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

def readability(text):
    """Processes the text using the model and returns its logits.
    In this case, it's the reading grade level in years of education
    (the higher the number, the harder the text is to read)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)
    with torch.no_grad():
        logits = model(**inputs).logits.squeeze().cpu()
    return logits.tolist()

# Example usage
texts = [x.strip() for x in """
The cat sat on the mat.
I like to eat pizza and ice cream for dinner.
The quick brown fox jumps over the lazy dog.
Students must complete their homework before watching television.
The intricate ecosystem of the rainforest supports a diverse array of flora and fauna.
Quantum mechanics describes the behavior of matter and energy at the molecular, atomic, nuclear, and even smaller microscopic levels.
The socioeconomic ramifications of globalization have led to unprecedented levels of interconnectedness and cultural homogenization.
The ontological argument for the existence of God posits that the very concept of a maximally great being necessitates its existence in reality.
""".strip().split("\n")]

result = readability(texts)
for x, s in zip(texts, result):
    print(f"Text: {x}\nReadability grade: {round(s, 2)}\n")
```

Example output for the `base` size model:

```
Text: The cat sat on the mat.
Readability grade: 2.34

Text: I like to eat pizza and ice cream for dinner.
Readability grade: 3.56

Text: The quick brown fox jumps over the lazy dog.
Readability grade: 3.72

Text: Students must complete their homework before watching television.
Readability grade: 10.79

Text: The intricate ecosystem of the rainforest supports a diverse array of flora and fauna.
Readability grade: 11.1

Text: Quantum mechanics describes the behavior of matter and energy at the molecular, atomic, nuclear, and even smaller microscopic levels.
Readability grade: 17.11

Text: The socioeconomic ramifications of globalization have led to unprecedented levels of interconnectedness and cultural homogenization.
Readability grade: 19.53

Text: The ontological argument for the existence of God posits that the very concept of a maximally great being necessitates its existence in reality.
Readability grade: 16.8
```
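
Note that inputs longer than the model's maximum sequence length are truncated, so only the beginning of a very long document is scored. One possible workaround (a sketch, not part of the model card's official API; `readability_long` is a hypothetical helper building on `readability` above) is to score a document paragraph by paragraph and average:

```python
def readability_long(document):
    """Hypothetical helper: estimate the grade level of a long document by
    averaging per-paragraph scores (paragraphs split on blank lines)."""
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    grades = readability(paragraphs)
    if isinstance(grades, float):  # squeeze() yields a scalar for a single paragraph
        grades = [grades]
    return sum(grades) / len(grades)
```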

## Limitations

- English language only
- Performance may vary for texts significantly different from the training data
- Output is based on statistical patterns and may not always align with human judgment
- Readability is assessed purely from textual features, without considering factors like subject familiarity or cultural context

## Ethical Considerations

- Should not be used as the sole determinant of text suitability for specific audiences
- Results may reflect biases present in the training data sources
- Care should be taken when using these models in educational or publishing contexts