---
language: en
datasets:
- yahoo_answers_topics
tags:
- text-classification
- topic-classification
- yahoo-answers
- distilbert
- transformers
- pytorch
license: apache-2.0
model-index:
- name: DistilBERT Yahoo Answers Classifier
  results:
  - task:
      name: Topic Classification
      type: text-classification
    dataset:
      name: Yahoo Answers Topics
      type: yahoo_answers_topics
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.71
---

# DistilBERT Fine-Tuned on Yahoo Answers Topics

This is a fine-tuned [DistilBERT](https://huggingface.co/distilbert-base-uncased) model for **topic classification** on the [Yahoo Answers Topics dataset](https://huggingface.co/datasets/yahoo_answers_topics). It classifies questions into one of 10 predefined categories such as "Science & Mathematics", "Health", and "Business & Finance".

## 🧠 Model Details

- **Base model**: `distilbert-base-uncased`
- **Task**: Multi-class text classification (10 classes)
- **Dataset**: Yahoo Answers Topics
- **Training samples**: 50,000 (subset)
- **Evaluation samples**: 5,000 (subset)
- **Metric**: Accuracy

## 🧪 How to Use

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Koushim/distilbert-yahoo-answers")
model = AutoModelForSequenceClassification.from_pretrained("Koushim/distilbert-yahoo-answers")

text = "How do I improve my math skills for competitive exams?"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Index of the highest-scoring class (see the label list below)
predicted_class = outputs.logits.argmax(dim=1).item()
print("Predicted class:", predicted_class)
```
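
For quick experiments, the stock `pipeline` helper can load the same checkpoint. Note that the returned label strings depend on the `id2label` mapping stored in the model config; if that mapping was not customized, they appear as `LABEL_0` … `LABEL_9` (see the class list below).

```python
from transformers import pipeline

# Load the checkpoint through the high-level text-classification pipeline
classifier = pipeline("text-classification", model="Koushim/distilbert-yahoo-answers")
print(classifier("How do I improve my math skills for competitive exams?"))
```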

## 📊 Classes (Labels)

0. Society & Culture
1. Science & Mathematics
2. Health
3. Education & Reference
4. Computers & Internet
5. Sports
6. Business & Finance
7. Entertainment & Music
8. Family & Relationships
9. Politics & Government
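
If the checkpoint's config does not carry a human-readable `id2label` mapping, a small dictionary built from the list above can translate the integer index returned by the "How to Use" snippet. This is a minimal sketch; prefer `model.config.id2label` when it is populated with real topic names.

```python
# Mapping from class index to topic name, copied from the list above.
ID2LABEL = {
    0: "Society & Culture",
    1: "Science & Mathematics",
    2: "Health",
    3: "Education & Reference",
    4: "Computers & Internet",
    5: "Sports",
    6: "Business & Finance",
    7: "Entertainment & Music",
    8: "Family & Relationships",
    9: "Politics & Government",
}

example_index = 3  # e.g. the integer produced by the "How to Use" snippet
print("Predicted topic:", ID2LABEL[example_index])  # -> Education & Reference
```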

## 📦 Training Details

* Optimizer: AdamW
* Learning rate: 2e-5
* Batch size: 16 (train), 32 (eval)
* Epochs: 3
* Weight decay: 0.01
* Framework: PyTorch + 🤗 Transformers (see the sketch below)
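
The hyperparameters above translate directly into a `Trainer` run. The following is a minimal reconstruction, not the exact script behind the released checkpoint: the seeded subsampling, the use of question title plus body as input text, and the column names from the public `yahoo_answers_topics` schema are assumptions.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)

# Subset sizes follow this card; the exact sampling is an assumption.
dataset = load_dataset("yahoo_answers_topics")
train_ds = dataset["train"].shuffle(seed=42).select(range(50_000))
eval_ds = dataset["test"].shuffle(seed=42).select(range(5_000))

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess(batch):
    # Column names follow the yahoo_answers_topics schema; adjust if they differ.
    texts = [
        f"{title} {content}"
        for title, content in zip(batch["question_title"], batch["question_content"])
    ]
    enc = tokenizer(texts, truncation=True)
    enc["labels"] = batch["topic"]
    return enc

train_ds = train_ds.map(preprocess, batched=True, remove_columns=train_ds.column_names)
eval_ds = eval_ds.map(preprocess, batched=True, remove_columns=eval_ds.column_names)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=10
)

# Hyperparameters as listed above; AdamW is the Trainer default optimizer.
args = TrainingArguments(
    output_dir="distilbert-yahoo-answers",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()
```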

## 📁 Repository Structure

* `config.json` – Model config
* `pytorch_model.bin` – Trained model weights
* `tokenizer.json`, `vocab.txt` – Tokenizer files

## ✍️ Author

* Hugging Face Hub: [Koushim](https://huggingface.co/Koushim)
* Model trained using the `transformers.Trainer` API

## 📄 License

Apache 2.0