---
license: mit
datasets:
- dleemiller/wiki-sim
- sentence-transformers/stsb
language:
- en
metrics:
- spearmanr
- pearsonr
base_model:
- answerdotai/ModernBERT-base
pipeline_tag: sentence-similarity
library_name: sentence-transformers
tags:
- cross-encoder
- modernbert
- sts
- stsb
---
# ModernBERT Cross-Encoder: Semantic Similarity (STS)
Cross-encoders are high-performing encoder models that compare two texts and output a similarity score between 0 and 1.
I've found the `cross-encoders/roberta-large-stsb` model to be very useful for building evaluators of LLM outputs:
it's simple to use, fast, and accurate.
Like many people, I was excited about the architecture and training uplift of ModernBERT (`answerdotai/ModernBERT-base`),
so I've applied it to the STS-B cross-encoder task, which yields a very handy model. Additionally, I've added a
pretraining stage on my much larger semi-synthetic dataset `dleemiller/wiki-sim`, which targets the same objective.
---
## Features
- **High performing:** Achieves **Pearson: 0.9162** and **Spearman: 0.9122** on the STS-Benchmark test set.
- **Efficient architecture:** Based on the ModernBERT-base design (149M parameters), offering faster inference speeds.
- **Extended context length:** Processes sequences up to 8192 tokens, great for LLM output evals (see the long-input sketch after this list).
- **Diversified training:** Pretrained on `dleemiller/wiki-sim` and fine-tuned on `sentence-transformers/stsb`.
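Because the model inherits ModernBERT's long context window, long inputs can be scored directly. Here is a minimal sketch; the placeholder strings stand in for real long-form text, and `max_length` is the standard `sentence-transformers` `CrossEncoder` argument:

```python
from sentence_transformers import CrossEncoder

# Use the full 8192-token context so long LLM outputs are not truncated.
model = CrossEncoder("dleemiller/ModernCE-base-sts", max_length=8192)

# Placeholder strings standing in for long-form reference/candidate text.
reference = "A long reference passage or LLM output ..."
candidate = "A long candidate passage to compare against the reference ..."

score = model.predict([(reference, candidate)])[0]
print(f"similarity: {score:.3f}")
```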
---
## Performance
| Model                          | STS-B Test Pearson | STS-B Test Spearman | Context Length | Parameters | Speed   |
|--------------------------------|--------------------|---------------------|----------------|------------|---------|
| **ModernCE-base-sts** | **0.9162** | **0.9122** | **8192** | 149M | **Fast** |
| `roberta-large-stsb` | 0.9147 | 0.9115 | 512 | 355M | Slow |
| `distilroberta-base-stsb` | 0.8792 | 0.8765 | 512 | 66M | Fast |
---
## Usage
To use ModernCE for semantic similarity tasks, you can load the model with the Hugging Face `sentence-transformers` library:
```python
from sentence_transformers import CrossEncoder

# Load the ModernCE model
model = CrossEncoder("dleemiller/ModernCE-base-sts")

# Predict similarity scores for sentence pairs
sentence_pairs = [
    ("It's a wonderful day outside.", "It's so sunny today!"),
    ("It's a wonderful day outside.", "He drove to work earlier."),
]
scores = model.predict(sentence_pairs)

print(scores)  # Outputs: array([0.9184, 0.0123], dtype=float32)
```
### Output
The model returns similarity scores in the range `[0, 1]`, where higher scores indicate stronger semantic similarity.
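For example, when using the model as an evaluator for LLM outputs (as mentioned above), a simple threshold over the scores can gate acceptance. This is only an illustration; the reference, generations, and 0.8 cutoff below are made up:

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("dleemiller/ModernCE-base-sts")

# Hypothetical reference answer and LLM generations to check against it.
reference = "The capital of France is Paris."
generations = [
    "Paris is the capital city of France.",
    "France's capital is Lyon.",
]

scores = model.predict([(reference, g) for g in generations])

# The 0.8 cutoff is arbitrary; tune it on your own validation data.
accepted = [g for g, s in zip(generations, scores) if s >= 0.8]
print(accepted)
```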
---
## Training Details
### Pretraining
The model was pretrained on the `pair-score-sampled` subset of the [`dleemiller/wiki-sim`](https://huggingface.co/datasets/dleemiller/wiki-sim) dataset. This dataset provides diverse sentence pairs with semantic similarity scores, helping the model build a robust understanding of relationships between sentences.
- **Classifier Dropout:** 0.3, to introduce regularization and reduce overfitting.
- **Objective:** Regression against STS-B-style similarity scores produced by `roberta-large-stsb` (see the sketch below).
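To make the objective concrete, here is an illustrative sketch of the pseudo-labeling idea, not the actual pretraining code. The teacher checkpoint is assumed to be the public `cross-encoder/stsb-roberta-large` model referred to as `roberta-large-stsb` above:

```python
from sentence_transformers import CrossEncoder

# Assumed teacher: a strong STS-B cross-encoder whose scores become
# regression targets for the ModernBERT student during pretraining.
teacher = CrossEncoder("cross-encoder/stsb-roberta-large")

pairs = [
    ("The cat sat on the mat.", "A cat is resting on a rug."),
    ("He plays the guitar.", "The stock market fell sharply."),
]
pseudo_labels = teacher.predict(pairs)  # similarity targets in [0, 1]
```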
### Fine-Tuning
Fine-tuning was performed on the [`sentence-transformers/stsb`](https://huggingface.co/datasets/sentence-transformers/stsb) dataset.
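For reference, here is a minimal sketch of what this fine-tuning stage can look like with the classic `CrossEncoder.fit` API; the starting checkpoint and hyperparameters below are illustrative, not the exact ones used for this model:

```python
from datasets import load_dataset
from torch.utils.data import DataLoader
from sentence_transformers import InputExample
from sentence_transformers.cross_encoder import CrossEncoder
from sentence_transformers.cross_encoder.evaluation import CECorrelationEvaluator

# STS-B pairs with similarity scores already normalized to [0, 1].
stsb = load_dataset("sentence-transformers/stsb")

train_samples = [
    InputExample(texts=[row["sentence1"], row["sentence2"]], label=row["score"])
    for row in stsb["train"]
]
dev_samples = [
    InputExample(texts=[row["sentence1"], row["sentence2"]], label=row["score"])
    for row in stsb["validation"]
]
dev_evaluator = CECorrelationEvaluator.from_input_examples(dev_samples, name="stsb-dev")

# In practice the starting point is the wiki-sim-pretrained checkpoint;
# the base model is shown here to keep the example self-contained.
model = CrossEncoder("answerdotai/ModernBERT-base", num_labels=1)
model.fit(
    train_dataloader=DataLoader(train_samples, shuffle=True, batch_size=32),
    evaluator=dev_evaluator,
    epochs=4,
    warmup_steps=100,
    output_path="modernce-base-sts-finetuned",
)
```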
### Test Results
The model achieved the following performance on the STS-Benchmark test set after fine-tuning:
- **Pearson Correlation:** 0.9162
- **Spearman Correlation:** 0.9122

Training and evaluation logs are available in [sts-test-results.csv](output/eval/sts-test-results.csv).
---
## Applications
1. **Semantic Search:** Retrieve relevant documents or text passages based on query similarity (see the re-ranking sketch after this list).
2. **Retrieval-Augmented Generation (RAG):** Enhance generative models by providing contextually relevant information.
3. **Content Moderation:** Automatically classify or rank content based on similarity to predefined guidelines.
4. **Code Search:** Leverage the model's ability to understand code and natural language for large-scale programming tasks.
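As a concrete illustration of the semantic search and RAG use cases, a cross-encoder is typically used to re-rank a small candidate set retrieved by a cheaper first-stage method. The query and passages below are made up:

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("dleemiller/ModernCE-base-sts")

query = "How do I reset my router password?"
passages = [
    "To reset the router password, hold the reset button for ten seconds.",
    "The router ships with a two-year limited warranty.",
    "Password resets for the admin portal are handled under Settings > Security.",
]

# Score each (query, passage) pair and sort passages by similarity.
scores = model.predict([(query, p) for p in passages])
ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
for passage, score in ranked:
    print(f"{score:.3f}  {passage}")
```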
---
## Model Card
- **Architecture:** ModernBERT-base
- **Tokenizer:** Inherited from `answerdotai/ModernBERT-base`, built for long-context handling.
- **Pretraining Data:** `dleemiller/wiki-sim (pair-score-sampled)`
- **Fine-Tuning Data:** `sentence-transformers/stsb`
---
## Thank You
Thanks to the Answer.AI team for providing the ModernBERT models, and to the Sentence Transformers team for their leadership in transformer encoder models.
---
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{moderncestsb2025,
  author    = {Miller, D. Lee},
  title     = {ModernCE STS: An STS cross encoder model},
  year      = {2025},
  publisher = {Hugging Face Hub},
  url       = {https://huggingface.co/dleemiller/ModernCE-base-sts},
}
```
---
## License
This model is licensed under the [MIT License](LICENSE).