---
license: apache-2.0
datasets:
- agentlans/readability
language:
- en
base_model:
- microsoft/deberta-v3-base
pipeline_tag: text-classification
---
|
# DeBERTa V3 Base and XSmall for Readability Assessment |
|
|
|
This is one of two fine-tuned versions of DeBERTa V3 (Base and XSmall) for assessing text readability. |
|
|
|
## Model Details |
|
|
|
- **Architecture:** DeBERTa V3 (Base and XSmall variants)
- **Task:** Regression (readability assessment)
- **Training Data:** 105,000 paragraphs from diverse sources
- **Input:** Text
- **Output:** Estimated U.S. grade level for text comprehension; higher values indicate more complex text (see the quick-start sketch below)
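
As a quick check of this input/output contract, the generic `transformers` text-classification pipeline can serve as a minimal sketch. Passing `function_to_apply="none"` returns the raw regression logit (the grade level) rather than a sigmoid-squashed score; the label name and score in the comment are illustrative, not verified output.

```python
from transformers import pipeline

# Minimal quick-start sketch: score one sentence with the text-classification
# pipeline. function_to_apply="none" returns the raw regression logit
# (the estimated grade level) instead of applying sigmoid/softmax.
scorer = pipeline("text-classification", model="agentlans/deberta-v3-base-readability-v2")
result = scorer("The cat sat on the mat.", function_to_apply="none")
print(result)  # e.g. [{'label': 'LABEL_0', 'score': 2.34}]
```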
|
|
|
## Performance |
|
|
|
Root mean squared error (RMSE) on a 20% held-out validation set:
|
|
|
| Model  | RMSE   |
|--------|--------|
| Base   | 0.5038 |
| XSmall | 0.6296 |
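
For reference, RMSE is the square root of the mean squared difference between predicted and reference grade levels, so the Base model's estimates typically land within about half a grade level of the labels. A minimal sketch of the metric (the numbers below are made up for illustration, not taken from the validation set):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between reference and predicted grade levels."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

print(rmse([2.0, 7.5, 12.0], [2.4, 7.1, 12.6]))  # ~0.48
```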
|
|
|
## Training Data |
|
|
|
The models were trained on a diverse dataset of 105,000 paragraphs with the following characteristics:
|
|
|
- Character length: 50 to 2,000
- Interquartile range (IQR) of readability grades < 1 (an illustrative filter is sketched below)
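
A minimal sketch of what these selection criteria could look like in code. The function and the assumption that each paragraph carries several readability grade estimates are hypothetical, not taken from the dataset's actual build script:

```python
import statistics

def keep_paragraph(text, grade_estimates):
    """Hypothetical filter for the stated criteria: 50-2,000 characters and
    an interquartile range (IQR) of readability grades below 1."""
    if not (50 <= len(text) <= 2000):
        return False
    q1, _, q3 = statistics.quantiles(grade_estimates, n=4)  # quartile cut points
    return (q3 - q1) < 1
```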
|
|
|
**Sources:**
- Hugging Face's FineWeb-Edu
- Ronen Eldan's TinyStories
- Wikipedia-2023-11-embed-multilingual-v3 (English only)
- ArXiv Abstracts-2021
|
|
|
For more details, please see [agentlans/readability](https://huggingface.co/datasets/agentlans/readability). |
|
|
|
## Usage |
|
|
|
Example of how to use the model:
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "agentlans/deberta-v3-base-readability-v2"

# Load the tokenizer and model, then move the model to the GPU if available
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

def readability(text):
    """Scores the text with the model and returns the raw logits.
    Here a logit is the estimated reading grade level in years of education
    (the higher the number, the harder the text is to read).
    Accepts a single string or a list of strings."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)
    with torch.no_grad():
        logits = model(**inputs).logits.squeeze().cpu()
    return logits.tolist()

# Example usage
texts = [x.strip() for x in """
The cat sat on the mat.
I like to eat pizza and ice cream for dinner.
The quick brown fox jumps over the lazy dog.
Students must complete their homework before watching television.
The intricate ecosystem of the rainforest supports a diverse array of flora and fauna.
Quantum mechanics describes the behavior of matter and energy at the molecular, atomic, nuclear, and even smaller microscopic levels.
The socioeconomic ramifications of globalization have led to unprecedented levels of interconnectedness and cultural homogenization.
The ontological argument for the existence of God posits that the very concept of a maximally great being necessitates its existence in reality.
""".strip().split("\n")]

result = readability(texts)
for x, s in zip(texts, result):
    print(f"Text: {x}\nReadability grade: {round(s, 2)}\n")
```
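
Note that `truncation=True` clips each input to the model's maximum sequence length, so very long texts are scored only on their leading tokens; for paragraph-sized inputs like those above, nothing is truncated.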
|
|
|
Example output for the `base` model:
|
```
Text: The cat sat on the mat.
Readability grade: 2.34

Text: I like to eat pizza and ice cream for dinner.
Readability grade: 3.56

Text: The quick brown fox jumps over the lazy dog.
Readability grade: 3.72

Text: Students must complete their homework before watching television.
Readability grade: 10.79

Text: The intricate ecosystem of the rainforest supports a diverse array of flora and fauna.
Readability grade: 11.1

Text: Quantum mechanics describes the behavior of matter and energy at the molecular, atomic, nuclear, and even smaller microscopic levels.
Readability grade: 17.11

Text: The socioeconomic ramifications of globalization have led to unprecedented levels of interconnectedness and cultural homogenization.
Readability grade: 19.53

Text: The ontological argument for the existence of God posits that the very concept of a maximally great being necessitates its existence in reality.
Readability grade: 16.8
```
|
|
|
## Limitations |
|
|
|
- English language only
- Performance may vary for texts significantly different from the training data
- Output is based on statistical patterns and may not always align with human judgment
- Readability is assessed purely from textual features and does not consider factors such as subject familiarity or cultural context
|
|
|
## Ethical Considerations |
|
|
|
- Should not be used as the sole determinant of text suitability for specific audiences
- Results may reflect biases present in the training data sources
- Care should be taken when using these models in educational or publishing contexts
|
|