---
license: apache-2.0
datasets:
- agentlans/text-quality-v2
language:
- en
base_model:
- microsoft/deberta-v3-base
pipeline_tag: text-classification
---

# DeBERTa v3 for Text Quality Assessment

## Model Details

- **Model Architecture:** DeBERTa v3 (xsmall and base variants)
- **Task:** Text quality assessment (regression)
- **Training Data:** Text Quality Meta-Analysis Dataset at [agentlans/text-quality-v2](https://huggingface.co/datasets/agentlans/text-quality-v2)
- **Output:** Single continuous value representing text quality

## Intended Use

These models are designed to assess the quality of English text, where "quality" refers to legible sentences that are not spam and contain useful information. They can be used for:

- Content moderation
- Spam detection
- Information quality assessment
- Text filtering

## Usage

The models accept text input and return a single continuous value representing the assessed quality. Higher values indicate higher perceived quality. The following snippet loads the base model and scores a batch of example texts.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "agentlans/deberta-v3-base-quality-v2"

# Load the tokenizer and model, then move the model to the GPU if available, otherwise the CPU
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

def quality(text):
    """Processes the text using the model and returns its logits.
    In this case, the logits are interpreted as the combined quality score for each text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)
    with torch.no_grad():
        # squeeze(-1) drops only the label dimension, so a batch of one still yields a list
        logits = model(**inputs).logits.squeeze(-1).cpu()
    return logits.tolist()

# Example usage
text = [x.strip() for x in """
Congratulations! You've won a $1,000 gift card! Click here to claim your prize now!!!
Page 1 2 3 4 5 Next Last>>
Urgent: Your account has been compromised! Click this link to verify your identity and secure your account immediately!!!
Today marks a significant milestone in our journey towards sustainability! 🌍✨ We’re excited to announce our partnership with local organizations to plant 10,000 trees in our community this fall. Join us in making a positive impact on our environment!
In recent years, the impact of climate change has become increasingly evident, affecting ecosystems and human livelihoods across the globe.
The mitochondria is the powerhouse of the cell.
Exclusive discount on Super MitoMax Energy Boost! Recharge your mitochondria today!
Everyone is talking about this new diet that guarantees weight loss without exercise!
Discover five tips for improving your productivity while working from home.
""".strip().split("\n")]

result = quality(text)
for x, s in zip(text, result):
    print(f"Text: {x}\nQuality: {round(s, 2)}\n")
```

Example output for the `base` size model:

```
Text: Congratulations! You've won a $1,000 gift card! Click here to claim your prize now!!!
Quality: -1.25

Text: Page 1 2 3 4 5 Next Last>>
Quality: -1.54

Text: Urgent: Your account has been compromised! Click this link to verify your identity and secure your account immediately!!!
Quality: -2.01

Text: Today marks a significant milestone in our journey towards sustainability! 🌍✨ We’re excited to announce our partnership with local organizations to plant 10,000 trees in our community this fall. Join us in making a positive impact on our environment!
Quality: -1.72

Text: In recent years, the impact of climate change has become increasingly evident, affecting ecosystems and human livelihoods across the globe.
Quality: 0.45

Text: The mitochondria is the powerhouse of the cell.
Quality: 1.32

Text: Exclusive discount on Super MitoMax Energy Boost! Recharge your mitochondria today!
Quality: -1.16

Text: Everyone is talking about this new diet that guarantees weight loss without exercise!
Quality: -0.27

Text: Discover five tips for improving your productivity while working from home.
Quality: -0.42
```

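
One way to use these scores for text filtering is to keep only the texts whose score clears a cutoff. The following is a minimal sketch, not part of the model's API: the helper `filter_by_quality` and the cutoff of `0.0` are hypothetical and chosen only for illustration, and a sensible cutoff depends on your data. It works on scores that have already been computed (for instance by the `quality` function), so it runs without loading the model.

```python
def filter_by_quality(texts, scores, threshold=0.0):
    """Return the (text, score) pairs whose score is at least `threshold`.

    `threshold=0.0` is an illustrative assumption, not a recommended value.
    """
    if len(texts) != len(scores):
        raise ValueError("texts and scores must have the same length")
    return [(t, s) for t, s in zip(texts, scores) if s >= threshold]


# Three texts and their scores, copied from the example output for the base model
texts = [
    "Page 1 2 3 4 5 Next Last>>",
    "The mitochondria is the powerhouse of the cell.",
    "Discover five tips for improving your productivity while working from home.",
]
scores = [-1.54, 1.32, -0.42]

kept = filter_by_quality(texts, scores)
for t, s in kept:
    print(f"{s:+.2f}  {t}")
```

With this cutoff only the mitochondria sentence is kept. Lowering `threshold` admits more borderline text.
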
## Performance Metrics

Root mean squared error (RMSE) on a 20% held-out evaluation set (lower is better):

- **DeBERTa v3 xsmall:** 0.6296
- **DeBERTa v3 base:** 0.5038

The base model achieves a lower RMSE than the xsmall variant, so its predicted scores are closer to the reference scores on average.

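
For reference, RMSE is the square root of the mean squared difference between predicted and reference scores. The sketch below computes it in plain Python. The function name `rmse` and the sample numbers are made up for illustration and are not taken from the evaluation set.

```python
import math


def rmse(predictions, targets):
    """Root mean squared error between two equal-length sequences of numbers."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if len(predictions) == 0:
        raise ValueError("at least one prediction is required")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values for illustration only
predicted = [0.5, -1.0, 1.5, 0.0]
reference = [1.0, -1.0, 1.0, 1.0]
print(round(rmse(predicted, reference), 4))  # prints 0.6124
```

The same function can be applied to model scores and your own labelled scores to check how well the models fit a particular dataset.
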
## Limitations and Biases

- The models are trained on a specific dataset and may not generalize well to all types of text or domains.
- "Quality" is a subjective concept, and the models' assessments may not align with all human judgments.
- The models may exhibit biases present in the training data.
  - For example, there is a bias against self-help, promotional, and public relations material.
- They do not assess factual correctness or grammatical accuracy.

## Ethical Considerations

- These models should not be used as the sole determinant for content moderation or censorship.
- Care should be taken to avoid reinforcing existing biases in content selection or promotion.
- The models' outputs should be interpreted as suggestions rather than definitive judgments.

## Caveats and Recommendations

- Use these models in conjunction with other tools and human oversight for content moderation.
- Regularly evaluate the models' performance on your specific use case and data.
- Be aware that the models may not perform equally well across all text types or domains.
- Consider fine-tuning the models on domain-specific data for improved performance in specialized applications.
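
The first recommendation could be sketched as a simple triage step: clearly high or clearly low scores are labelled as such, and anything in between is routed to a human reviewer. This is a hypothetical illustration only. The function `triage`, its labels, and both cutoffs (`low=-1.0`, `high=0.5`) are assumptions, not values suggested by the model authors, and should be tuned on your own data.

```python
def triage(score, low=-1.0, high=0.5):
    """Map a quality score to 'low', 'review', or 'high'.

    `low` and `high` are illustrative cutoffs (assumptions), not recommended values.
    Scores between the two cutoffs, inclusive, are sent to human review.
    """
    if low > high:
        raise ValueError("low must not exceed high")
    if score < low:
        return "low"
    if score > high:
        return "high"
    return "review"


# Scores copied from the example output in the Usage section
for score in [-2.01, -0.27, 1.32]:
    print(score, triage(score))
```

Widening the gap between `low` and `high` sends more text to human review, which trades reviewer time for fewer automatic decisions.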