Update README.md
---
license: apache-2.0
datasets:
- agentlans/readability
language:
- en
base_model:
- microsoft/deberta-v3-base
pipeline_tag: text-classification
---

# DeBERTa V3 Base and XSmall for Readability Assessment

This is one of two fine-tuned versions of DeBERTa V3 (Base and XSmall) for assessing text readability.

## Model Details

- **Architecture:** DeBERTa V3 (Base and XSmall variants)
- **Task:** Regression (readability assessment)
- **Training Data:** 105,000 paragraphs from diverse sources
- **Input:** Text
- **Output:** Estimated U.S. grade level required to comprehend the text (higher values indicate more complex text)

## Performance

Root mean squared error (RMSE) on a 20% held-out validation set:

| Model  | RMSE   |
|--------|--------|
| Base   | 0.5038 |
| XSmall | 0.6296 |
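
For reference, RMSE here is the square root of the mean squared difference between predicted and reference grade levels. A minimal sketch of that computation (assuming parallel lists of predictions and gold labels; `val_texts` and `val_grades` are hypothetical names, and this is not the authors' evaluation script):

```python
import math

def rmse(predicted, reference):
    """Root mean squared error between two equal-length lists of grade levels."""
    assert len(predicted) == len(reference)
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(predicted))

# e.g. rmse(readability(val_texts), val_grades) on a held-out split
```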

## Training Data

The models were trained on a diverse dataset of 105,000 paragraphs with the following characteristics:

- Character length: 50 to 2,000
- Interquartile range (IQR) of readability grades < 1

**Sources:**

- HuggingFace's Fineweb-Edu
- Ronen Eldan's TinyStories
- Wikipedia-2023-11-embed-multilingual-v3 (English only)
- ArXiv Abstracts-2021

For more details, please see [agentlans/readability](https://huggingface.co/datasets/agentlans/readability).
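
As a rough illustration of the selection criteria above (a sketch only; the actual dataset construction is documented at the link, and the per-paragraph grade estimates are assumed here to come from several readability formulas):

```python
from statistics import quantiles

def keep_paragraph(text, grade_estimates):
    """Hypothetical filter: keep paragraphs of 50-2,000 characters whose
    readability estimates agree closely (IQR < 1)."""
    q1, _, q3 = quantiles(grade_estimates, n=4)  # quartile cut points
    return 50 <= len(text) <= 2000 and (q3 - q1) < 1
```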

## Usage

Example of how to use the model:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "agentlans/deberta-v3-base-readability-v2"

# Put the model on the GPU if available, else the CPU
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

def readability(text):
    """Processes the text using the model and returns its logits.
    In this case, it's the reading grade level in years of education
    (the higher the number, the harder the text is to read)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)
    with torch.no_grad():
        logits = model(**inputs).logits.squeeze().cpu()
    return logits.tolist()

# Example usage
texts = [x.strip() for x in """
The cat sat on the mat.
I like to eat pizza and ice cream for dinner.
The quick brown fox jumps over the lazy dog.
Students must complete their homework before watching television.
The intricate ecosystem of the rainforest supports a diverse array of flora and fauna.
Quantum mechanics describes the behavior of matter and energy at the molecular, atomic, nuclear, and even smaller microscopic levels.
The socioeconomic ramifications of globalization have led to unprecedented levels of interconnectedness and cultural homogenization.
The ontological argument for the existence of God posits that the very concept of a maximally great being necessitates its existence in reality.
""".strip().split("\n")]

result = readability(texts)
for x, s in zip(texts, result):
    print(f"Text: {x}\nReadability grade: {round(s, 2)}\n")
```

Example output for the `base` size model:

```
Text: The cat sat on the mat.
Readability grade: 2.34

Text: I like to eat pizza and ice cream for dinner.
Readability grade: 3.56

Text: The quick brown fox jumps over the lazy dog.
Readability grade: 3.72

Text: Students must complete their homework before watching television.
Readability grade: 10.79

Text: The intricate ecosystem of the rainforest supports a diverse array of flora and fauna.
Readability grade: 11.1

Text: Quantum mechanics describes the behavior of matter and energy at the molecular, atomic, nuclear, and even smaller microscopic levels.
Readability grade: 17.11

Text: The socioeconomic ramifications of globalization have led to unprecedented levels of interconnectedness and cultural homogenization.
Readability grade: 19.53

Text: The ontological argument for the existence of God posits that the very concept of a maximally great being necessitates its existence in reality.
Readability grade: 16.8
```
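
Note that inputs longer than the model's maximum sequence length are truncated, so only the beginning of a very long document is scored. One possible workaround (a sketch, not part of the model card's official API; `readability_long` is a hypothetical helper building on `readability` above) is to score a document paragraph by paragraph and average:

```python
def readability_long(document):
    """Hypothetical helper: estimate the grade level of a long document by
    averaging per-paragraph scores (paragraphs split on blank lines)."""
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    grades = readability(paragraphs)
    if isinstance(grades, float):  # squeeze() yields a scalar for a single paragraph
        grades = [grades]
    return sum(grades) / len(grades)
```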

## Limitations

- English language only
- Performance may vary for texts significantly different from the training data
- Output is based on statistical patterns and may not always align with human judgment
- Readability is assessed purely from textual features, without considering factors like subject familiarity or cultural context

## Ethical Considerations

- Should not be used as the sole determinant of text suitability for specific audiences
- Results may reflect biases present in the training data sources
- Care should be taken when using these models in educational or publishing contexts