---
tags:
- sentence-transformers
- sentence-similarity
- dataset_size:40000
- loss:MSELoss
- multilingual
base_model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
widget:
- source_sentence: Who is filming along?
sentences:
- Wién filmt mat?
- >-
Weider huet den Tatarescu drop higewisen, datt Rumänien durch seng
krichsbedélegong op de 6eite vun den allie'erten 110.000 mann verluer hätt.
- Brambilla 130.08.03 St.
- source_sentence: 'Four potential scenarios could still play out: Jean Asselborn.'
sentences:
- >-
Dann ass nach eng Antenne hei um Kierchbierg virgesi Richtung RTL Gebai, do
gëtt jo een ganz neie Wunnquartier gebaut.
- >-
D'bedélegong un de wählen wir ganz stärk gewiéscht a munche ge'genden wor re
eso'gucr me' we' 90 prozent.
- Jean Asselborn gesäit 4 Méiglechkeeten, wéi et kéint virugoen.
- source_sentence: >-
Non-profit organisation Passerell, which provides legal council to refugees
in Luxembourg, announced that it has to make four employees redundant in
August due to a lack of funding.
sentences:
- Oetringen nach Remich....8.20» 215»
- >-
D'ASBL Passerell, déi sech ëm d'Berodung vu Refugiéeën a Saache Rechtsfroe
këmmert, wäert am August mussen hir véier fix Salariéen entloossen.
- D'Regierung huet allerdéngs "just" 180.041 Doudeger verzeechent.
- source_sentence: This regulation was temporarily lifted during the Covid pandemic.
sentences:
- Six Jours vu New-York si fir d’équipe Girgetti — Debacco
- Dës Reegelung gouf wärend der Covid-Pandemie ausgesat.
- ING-Marathon ouni gréisser Tëschefäll ofgelaf - 18 Leit hospitaliséiert.
- source_sentence: The cross-border workers should also receive more wages.
sentences:
- D'grenzarbechetr missten och me' lo'n kre'en.
- >-
De Néckel: Firun! Dât ass jo ailes, wèll 't get dach neischt un der Bréck
gemâcht!
- >-
D'Grande-Duchesse Josephine Charlotte an hir Ministeren hunn d'Land
verlooss, et war den Optakt vun der Zäit am Exil.
pipeline_tag: sentence-similarity
library_name: sentence-transformers
model-index:
- name: >-
SentenceTransformer based on
sentence-transformers/paraphrase-multilingual-mpnet-base-v2
results:
- task:
type: contemporary-lb
name: Contemporary-lb
dataset:
name: Contemporary-lb
type: contemporary-lb
metrics:
- type: accuracy
value: 0.594
name: SIB-200(LB) accuracy
- type: accuracy
value: 0.805
name: ParaLUX accuracy
- task:
type: bitext-mining
name: LBHistoricalBitextMining
dataset:
name: LBHistoricalBitextMining
type: lb-en
metrics:
- type: accuracy
value: 0.8932
name: LB<->FR accuracy
- type: accuracy
value: 0.8955
name: LB<->EN accuracy
- type: mean_accuracy
value: 0.9144
name: LB<->DE accuracy
license: agpl-3.0
datasets:
- impresso-project/HistLuxAlign
- fredxlpy/LuxAlign
language:
- lb
---
# Luxembourgish adaptation of sentence-transformers/paraphrase-multilingual-mpnet-base-v2
This is a [sentence-transformers](https://www.SBERT.net) model fine-tuned from [sentence-transformers/paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) and further adapted to support Historical and Contemporary Luxembourgish. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for (cross-lingual) semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
## Model Details
This model is specialised for cross-lingual semantic search to and from Historical/Contemporary Luxembourgish. It is particularly useful for libraries and archives that want to run semantic search and longitudinal studies within their collections.
It is a [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) model that was further adapted by Michail et al. (2025).
## Limitations
This model only supports inputs of up to 128 subtokens.
We also release a model that performs better (by 7.5pp) on Historical Bitext Mining and natively supports long context (8192 subtokens). For most use cases we recommend using [histlux-gte-multilingual-base](https://huggingface.co/impresso-project/histlux-gte-multilingual-base).
However, this model performs better (by 18pp) on the adversarial paraphrase discrimination task ParaLUX.
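
One common workaround for the 128-subtoken limit is to split long inputs into overlapping windows and embed each window separately. The sketch below is not part of the released model: `split_into_windows` is a hypothetical helper, and the budget of 126 tokens assumes two positions are reserved for special tokens. In practice the token list would come from the model's tokenizer (for example `model.tokenizer.tokenize(text)`); here placeholder tokens are used so the example runs without downloading anything.

```python
from typing import List, Sequence


def split_into_windows(
    tokens: Sequence[str], max_len: int = 126, stride: int = 96
) -> List[List[str]]:
    """Split a token sequence into overlapping windows of at most max_len tokens.

    max_len=126 is an assumption (128 minus two special tokens);
    stride controls how far each window advances, so the overlap
    between neighbouring windows is max_len - stride tokens.
    """
    if max_len <= 0 or stride <= 0 or stride > max_len:
        raise ValueError("require 0 < stride <= max_len")
    if not tokens:
        return []
    windows = []
    start = 0
    while True:
        windows.append(list(tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break  # the last window reached the end of the input
        start += stride
    return windows


# Placeholder tokens standing in for real subtokens.
tokens = [f"tok{i}" for i in range(300)]
windows = split_into_windows(tokens)
print([len(w) for w in windows])
```

Each window can then be turned back into text and embedded on its own; how the per-window embeddings are combined (averaging, max-pooling of scores, and so on) is a design choice this card does not prescribe.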
### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [sentence-transformers/paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2)
- **Maximum Sequence Length:** 128 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
- **Training Dataset:**
- LB-EN (Historical, Modern)
### Model Sources
- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
## Usage
### Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("impresso-project/histlux-paraphrase-multilingual-mpnet-base-v2")
# Run inference
sentences = [
'The cross-border workers should also receive more wages.',
"D'grenzarbechetr missten och me' lo'n kre'en.",
"De Néckel: Firun! Dât ass jo ailes, wèll 't get dach neischt un der Bréck gemâcht!",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
```
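
For semantic search, the similarity scores are typically used to rank candidate sentences against a query. The following is a minimal, dependency-free sketch of that ranking step. The helper names `cosine` and `rank_by_cosine` are hypothetical, and the toy 3-dimensional vectors stand in for the 768-dimensional embeddings that `model.encode` would return.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_cosine(
    query: Sequence[float], candidates: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (candidate index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine(query, c)) for i, c in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors: in practice these would be rows of model.encode(...).
query = [1.0, 0.0, 0.0]
candidates = [
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [0.9, 0.1, 0.0],   # close to the query
    [-1.0, 0.0, 0.0],  # opposite direction
]
ranking = rank_by_cosine(query, candidates)
print([index for index, _ in ranking])
```

With real embeddings, the query could be an English sentence and the candidates Luxembourgish sentences, since both are mapped into the same vector space.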
## Evaluation
### Metrics
See the introducing paper for details.

Historical Bitext Mining (Accuracy):
- LB -> FR: 88.6
- FR -> LB: 90.0
- LB -> EN: 88.7
- EN -> LB: 90.4
- LB -> DE: 91.1
- DE -> LB: 91.8

Contemporary LB (Accuracy):
- ParaLUX: 80.5
- SIB-200(LB): 59.4
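
Bitext mining accuracy is commonly computed as the fraction of source sentences whose most similar target sentence is the true translation, which is why each direction is reported separately. The sketch below illustrates that idea on a toy similarity matrix. It is an assumption about the general protocol, not a reproduction of the paper's evaluation code, and `bitext_mining_accuracy` is a hypothetical helper.

```python
from typing import List, Sequence


def bitext_mining_accuracy(sim: Sequence[Sequence[float]]) -> float:
    """Fraction of rows whose highest-scoring column is the gold match.

    sim[i][j] is the similarity between source sentence i and target
    sentence j; the gold translation of source i is assumed to be target i.
    """
    if not sim:
        return 0.0
    correct = 0
    for i, row in enumerate(sim):
        best = max(range(len(row)), key=row.__getitem__)
        if best == i:
            correct += 1
    return correct / len(sim)


# Toy similarity matrix (rows: source sentences, columns: target sentences).
sim = [
    [0.9, 0.2, 0.1],
    [0.3, 0.4, 0.8],
    [0.1, 0.2, 0.9],
]
forward = bitext_mining_accuracy(sim)
# The reverse direction searches from targets to sources, i.e. the transpose.
backward = bitext_mining_accuracy([list(col) for col in zip(*sim)])
print(forward, backward)
```

The two directions can disagree, as they do in this toy matrix, because nearest-neighbour search is not symmetric.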
## Training Details
### Training Dataset
#### LB-EN (Historical, Modern)
* Dataset: lb-en (mixed)
* Size: 40,000 training samples
* Columns: `english`, `luxembourgish`, and `label` (the teacher's English embeddings)
* Approximate statistics based on the first 1000 samples:
  |      | english | luxembourgish | label |
  |:-----|:--------|:--------------|:------|
  | type | string  | string        | list  |
* Samples:
  | english | luxembourgish | label |
  |:--------|:--------------|:------|
  | A lesson for the next year | Eng le’er fir dat anert joer | [0.08891881257295609, 0.20895496010780334, -0.10672671347856522, -0.03302554786205292, 0.049002278596162796, ...] |
  | On Easter, the Maquisards' northern section organizes their big spring ball in Willy Pintsch's hall at the station. | Op O'schteren organisieren d'Maquisard'eiii section Nord, hire gro'sse fre'joersbal am sali Willy Pintsch op der gare. | [-0.08668982982635498, -0.06969941407442093, -0.0036096556577831507, 0.1605304628610611, -0.041704729199409485, ...] |
  | The happiness, the peace is long gone now, | V ergângen ass nu läng dat gléck, de' fréd, | [0.07229219377040863, 0.3288629353046417, -0.012548360042273998, 0.06720984727144241, -0.02617395855486393, ...] |
* Loss: [MSELoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#mseloss)
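
The dataset columns and the loss suggest the multilingual knowledge distillation setup of Reimers and Gurevych (2020): the student's embeddings of the English sentence and of its Luxembourgish translation are both pulled toward the teacher's English embedding (the `label` column) with a mean squared error. The following is a dependency-free sketch of that objective for a single training pair. The function names are hypothetical, the 4-dimensional vectors are toy values, and the actual training used the library's `MSELoss` rather than this code.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equally sized vectors."""
    if len(pred) != len(target):
        raise ValueError("vectors must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def distillation_loss(
    student_en: Sequence[float],
    student_lb: Sequence[float],
    teacher_en: Sequence[float],
) -> float:
    """Average of the two MSE terms, both measured against the teacher's
    English embedding (assumed formulation, see lead-in)."""
    return 0.5 * (mse(student_en, teacher_en) + mse(student_lb, teacher_en))


# Toy embeddings standing in for 768-dimensional vectors.
teacher_en = [0.5, -0.25, 0.0, 1.0]
student_en = [0.5, -0.25, 0.0, 1.0]  # already matches the teacher
student_lb = [0.0, -0.25, 0.0, 0.0]  # still far from the teacher
loss = distillation_loss(student_en, student_lb, teacher_en)
print(loss)
```

Minimising this loss moves the Luxembourgish embedding toward the English one, which is what makes cross-lingual retrieval work in a shared vector space.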
### Training Hyperparameters
#### Non-Default Hyperparameters
- `learning_rate`: 2e-05
- `num_train_epochs`: 5
- `warmup_ratio`: 0.1
- `bf16`: True
- All other hyperparameters use their default values.
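
To illustrate how `learning_rate` and `warmup_ratio` interact, the sketch below computes the learning rate per step under a linear warmup followed by a linear decay to zero. This assumes the default linear scheduler was used, which the card implies but does not state. The total of 1000 steps is a hypothetical value chosen for readability, since the batch size is not listed here, and `linear_schedule_lr` is an illustrative helper rather than a library function.

```python
import math


def linear_schedule_lr(
    step: int, total_steps: int, peak_lr: float = 2e-05, warmup_ratio: float = 0.1
) -> float:
    """Learning rate at a given step: linear warmup, then linear decay to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up from 0 to peak_lr over the warmup phase.
        return peak_lr * step / max(1, warmup_steps)
    # Decay linearly from peak_lr at the end of warmup to 0 at the last step.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 1000  # hypothetical; depends on batch size and dataset size
for step in (0, 50, 100, 550, 1000):
    print(step, linear_schedule_lr(step, total_steps))
```

With these values the first 100 steps are warmup, the peak of 2e-05 is reached at step 100, and the rate falls back to zero by the final step.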
### Framework Versions
- Python: 3.11.11
- Sentence Transformers: 3.4.1
- Transformers: 4.49.0
- PyTorch: 2.6.0
- Accelerate: 1.4.0
- Datasets: 3.3.2
- Tokenizers: 0.21.0
## Citation
### BibTeX
#### Adapting Multilingual Embedding Models to Historical Luxembourgish (introducing paper)
```bibtex
@misc{michail2025adaptingmultilingualembeddingmodels,
title={Adapting Multilingual Embedding Models to Historical Luxembourgish},
author={Andrianos Michail and Corina Julia Raclé and Juri Opitz and Simon Clematide},
year={2025},
eprint={2502.07938},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.07938},
}
```
#### Multilingual Knowledge Distillation
```bibtex
@inproceedings{reimers-2020-multilingual-sentence-bert,
title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2020",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/2004.09813",
}
```