---
tags:
- sentence-transformers
- sentence-similarity
- dataset_size:40000
- loss:MSELoss
- multilingual
base_model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
widget:
- source_sentence: Who is filming along?
  sentences:
  - Wién filmt mat?
  - >-
    Weider huet den Tatarescu drop higewisen, datt Rumänien durch seng
    krichsbedélegong op de 6eite vun den allie'erten 110.000 mann verluer hätt.
  - Brambilla 130.08.03 St.
- source_sentence: 'Four potential scenarios could still play out: Jean Asselborn.'
  sentences:
  - >-
    Dann ass nach eng Antenne hei um Kierchbierg virgesi Richtung RTL Gebai, do
    gëtt jo een ganz neie Wunnquartier gebaut.
  - >-
    D'bedélegong un de wählen wir ganz stärk gewiéscht a munche ge'genden wor
    re eso'gucr me' we' 90 prozent.
  - Jean Asselborn gesäit 4 Méiglechkeeten, wéi et kéint virugoen.
- source_sentence: >-
    Non-profit organisation Passerell, which provides legal counsel to refugees
    in Luxembourg, announced that it has to make four employees redundant in
    August due to a lack of funding.
  sentences:
  - Oetringen nach Remich....8.20» 215»
  - >-
    D'ASBL Passerell, déi sech ëm d'Berodung vu Refugiéeën a Saache Rechtsfroe
    këmmert, wäert am August mussen hir véier fix Salariéen entloossen.
  - D'Regierung huet allerdéngs "just" 180.041 Doudeger verzeechent.
- source_sentence: This regulation was temporarily lifted during the Covid pandemic.
  sentences:
  - Six Jours vu New-York si fir d’équipe Girgetti — Debacco
  - Dës Reegelung gouf wärend der Covid-Pandemie ausgesat.
  - ING-Marathon ouni gréisser Tëschefäll ofgelaf - 18 Leit hospitaliséiert.
- source_sentence: The cross-border workers should also receive more wages.
  sentences:
  - D'grenzarbechetr missten och me' lo'n kre'en.
  - >-
    De Néckel: Firun! Dât ass jo ailes, wèll 't get dach neischt un der Bréck
    gemâcht!
  - >-
    D'Grande-Duchesse Josephine Charlotte an hir Ministeren hunn d'Land
    verlooss, et war den Optakt vun der Zäit am Exil.
pipeline_tag: sentence-similarity
library_name: sentence-transformers
model-index:
- name: >-
    SentenceTransformer based on
    sentence-transformers/paraphrase-multilingual-mpnet-base-v2
  results:
  - task:
      type: contemporary-lb
      name: Contemporary-lb
    dataset:
      name: Contemporary-lb
      type: contemporary-lb
    metrics:
    - type: accuracy
      value: 0.594
      name: SIB-200(LB) accuracy
    - type: accuracy
      value: 0.805
      name: ParaLUX accuracy
  - task:
      type: bitext-mining
      name: LBHistoricalBitextMining
    dataset:
      name: LBHistoricalBitextMining
      type: lb-en
    metrics:
    - type: accuracy
      value: 0.8932
      name: LB<->FR accuracy
    - type: accuracy
      value: 0.8955
      name: LB<->EN accuracy
    - type: mean_accuracy
      value: 0.9144
      name: LB<->DE accuracy
license: agpl-3.0
datasets:
- impresso-project/HistLuxAlign
- fredxlpy/LuxAlign
language:
- lb
---

# Luxembourgish adaptation of sentence-transformers/paraphrase-multilingual-mpnet-base-v2

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [sentence-transformers/paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2), further adapted to support Historical and Contemporary Luxembourgish. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for (cross-lingual) semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

This model is specialised in cross-lingual semantic search to and from Historical/Contemporary Luxembourgish. It is particularly useful for libraries and archives that want to perform semantic search and longitudinal studies within their collections. It is a [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) model that was further adapted by Michail et al. (2025).

## Limitations

This model only supports inputs of up to 128 subtokens.
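Inputs beyond the 128-subtoken limit are truncated, so very long passages lose information. A common workaround is to embed fixed-size chunks and average the resulting vectors. The following is a minimal, library-free sketch of that idea: `chunk_text` and `mean_pool` are hypothetical helpers, word counts only approximate subtoken counts, and in practice `model.encode(chunk)` would supply the real 768-dimensional vectors.

```python
# Sketch of a chunk-and-average workaround for the 128-subtoken limit.
# Each chunk would normally be embedded with model.encode(chunk);
# mean_pool simply averages whatever vectors it is given.

def chunk_text(words, max_words=100):
    """Split a word list into fixed-size chunks (a rough proxy for the subtoken limit)."""
    return [words[i:i + max_words] for i in range(0, len(words), max_words)]

def mean_pool(vectors):
    """Average equal-length embedding vectors into a single document vector."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

long_text = ("e laangen historeschen artikel " * 80).split()  # 400 words
chunks = chunk_text(long_text)                  # 4 chunks of <= 100 words
doc_vector = mean_pool([[1.0, 2.0], [3.0, 4.0]])  # stand-in chunk embeddings
```

Averaging chunk embeddings is a crude but serviceable baseline; for retrieval over long archival documents, the long-context sibling model mentioned below avoids the need for chunking altogether.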
We also release a model that performs better (by 7.5 pp) on Historical Bitext Mining and natively supports long inputs (8192 subtokens). For most use cases we recommend [histlux-gte-multilingual-base](https://huggingface.co/impresso-project/histlux-gte-multilingual-base). However, the model released here performs substantially better (by 18 pp) on the adversarial paraphrase discrimination task ParaLUX.

### Model Description

- **Model Type:** Sentence Transformer
- **Base model:** [sentence-transformers/paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2)
- **Maximum Sequence Length:** 128 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
- **Training Dataset:**
  - LB-EN (Historical, Modern)

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("impresso-project/histlux-paraphrase-multilingual-mpnet-base-v2")

# Run inference
sentences = [
    "The cross-border workers should also receive more wages.",
    "D'grenzarbechetr missten och me' lo'n kre'en.",
    "De Néckel: Firun! Dât ass jo ailes, wèll 't get dach neischt un der Bréck gemâcht!",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```

## Evaluation

### Metrics (see introducing paper)

Historical Bitext Mining (accuracy):

| Direction | Accuracy |
|:----------|---------:|
| LB -> FR  | 88.6 |
| FR -> LB  | 90.0 |
| LB -> EN  | 88.7 |
| EN -> LB  | 90.4 |
| LB -> DE  | 91.1 |
| DE -> LB  | 91.8 |

Contemporary LB (accuracy):

| Task | Accuracy |
|:-----|---------:|
| ParaLUX      | 80.5 |
| SIB-200 (LB) | 59.4 |

## Training Details

### Training Dataset

#### LB-EN (Historical, Modern)

- Dataset: lb-en (mixed)
- Size: 40,000 training samples
- Columns: `english`, `luxembourgish`, and `label` (the teacher's EN embeddings)
- Approximate statistics based on the first 1000 samples:

  |      | english | luxembourgish | label |
  |:-----|:--------|:--------------|:------|
  | type | string  | string        | list  |

- Samples:

  | english | luxembourgish | label |
  |:--------|:--------------|:------|
  | A lesson for the next year | Eng le’er fir dat anert joer | [0.08891881257295609, 0.20895496010780334, -0.10672671347856522, -0.03302554786205292, 0.049002278596162796, ...] |
  | On Easter, the Maquisards' northern section organizes their big spring ball in Willy Pintsch's hall at the station. | Op O'schteren organisieren d'Maquisard'eiii section Nord, hire gro'sse fre'joersbal am sali Willy Pintsch op der gare. | [-0.08668982982635498, -0.06969941407442093, -0.0036096556577831507, 0.1605304628610611, -0.041704729199409485, ...] |
  | The happiness, the peace is long gone now, | V ergângen ass nu läng dat gléck, de' fréd, | [0.07229219377040863, 0.3288629353046417, -0.012548360042273998, 0.06720984727144241, -0.02617395855486393, ...] |

- Loss: [MSELoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#mseloss)

### Training Hyperparameters

#### Non-Default Hyperparameters

- `learning_rate`: 2e-05
- `num_train_epochs`: 5
- `warmup_ratio`: 0.1
- `bf16`: True

All remaining hyperparameters use their default values.

### Framework Versions

- Python: 3.11.11
- Sentence Transformers: 3.4.1
- Transformers: 4.49.0
- PyTorch: 2.6.0
- Accelerate: 1.4.0
- Datasets: 3.3.2
- Tokenizers: 0.21.0

## Citation

### BibTeX

#### Adapting Multilingual Embedding Models to Historical Luxembourgish (introducing paper)

```bibtex
@misc{michail2025adaptingmultilingualembeddingmodels,
      title={Adapting Multilingual Embedding Models to Historical Luxembourgish},
      author={Andrianos Michail and Corina Julia Raclé and Juri Opitz and Simon Clematide},
      year={2025},
      eprint={2502.07938},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.07938},
}
```

#### Multilingual Knowledge Distillation

```bibtex
@inproceedings{reimers-2020-multilingual-sentence-bert,
    title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2004.09813",
}
```
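For reference, the knowledge-distillation objective from the Reimers & Gurevych (2020) citation above is the MSELoss used to train this model: the student is pushed to reproduce the teacher's embedding of the aligned English sentence when encoding the Luxembourgish side. The following is a minimal, library-free sketch of the loss computation only; plain Python lists stand in for tensors, the vector values are illustrative, and real training uses `torch.nn.MSELoss` inside sentence-transformers.

```python
# Sketch of the MSE distillation loss: the "label" column of the training
# data holds the teacher's English embedding; the student encodes the
# Luxembourgish sentence and is trained to minimise this distance.

def mse_loss(student_vec, teacher_vec):
    """Mean squared error between a student and a teacher embedding."""
    assert len(student_vec) == len(teacher_vec)
    return sum((s - t) ** 2 for s, t in zip(student_vec, teacher_vec)) / len(student_vec)

teacher = [0.0889, 0.2090, -0.1067]   # illustrative teacher (EN) embedding
student = [0.1000, 0.2000, -0.1000]   # illustrative student (LB) embedding
loss = mse_loss(student, teacher)
```

Because the teacher's English vectors are fixed targets, this objective aligns the Luxembourgish embedding space with the teacher's multilingual space rather than learning a new space from scratch.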