|
--- |
|
tags: |
|
- sentence-transformers |
|
- sentence-similarity |
|
- dataset_size:40000 |
|
- loss:MSELoss |
|
- multilingual |
|
base_model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2 |
|
widget: |
|
- source_sentence: Who is filming along? |
|
sentences: |
|
- Wién filmt mat? |
|
- >- |
|
Weider huet den Tatarescu drop higewisen, datt Rumänien durch seng |
|
krichsbedélegong op de 6eite vun den allie'erten 110.000 mann verluer hätt. |
|
- Brambilla 130.08.03 St. |
|
- source_sentence: 'Four potential scenarios could still play out: Jean Asselborn.' |
|
sentences: |
|
- >- |
|
Dann ass nach eng Antenne hei um Kierchbierg virgesi Richtung RTL Gebai, do |
|
gëtt jo een ganz neie Wunnquartier gebaut. |
|
- >- |
|
D'bedélegong un de wählen wir ganz stärk gewiéscht a munche ge'genden wor re |
|
eso'gucr me' we' 90 prozent. |
|
- Jean Asselborn gesäit 4 Méiglechkeeten, wéi et kéint virugoen. |
|
- source_sentence: >- |
|
Non-profit organisation Passerell, which provides legal council to refugees |
|
in Luxembourg, announced that it has to make four employees redundant in |
|
August due to a lack of funding. |
|
sentences: |
|
- Oetringen nach Remich....8.20» 215» |
|
- >- |
|
D'ASBL Passerell, déi sech ëm d'Berodung vu Refugiéeën a Saache Rechtsfroe |
|
këmmert, wäert am August mussen hir véier fix Salariéen entloossen. |
|
- D'Regierung huet allerdéngs "just" 180.041 Doudeger verzeechent. |
|
- source_sentence: This regulation was temporarily lifted during the Covid pandemic. |
|
sentences: |
|
- Six Jours vu New-York si fir d’équipe Girgetti — Debacco |
|
- Dës Reegelung gouf wärend der Covid-Pandemie ausgesat. |
|
- ING-Marathon ouni gréisser Tëschefäll ofgelaf - 18 Leit hospitaliséiert. |
|
- source_sentence: The cross-border workers should also receive more wages. |
|
sentences: |
|
- D'grenzarbechetr missten och me' lo'n kre'en. |
|
- >- |
|
De Néckel: Firun! Dât ass jo ailes, wèll 't get dach neischt un der Bréck |
|
gemâcht! |
|
- >- |
|
D'Grande-Duchesse Josephine Charlotte an hir Ministeren hunn d'Land |
|
verlooss, et war den Optakt vun der Zäit am Exil. |
|
pipeline_tag: sentence-similarity |
|
library_name: sentence-transformers |
|
model-index: |
|
- name: >- |
|
SentenceTransformer based on |
|
sentence-transformers/paraphrase-multilingual-mpnet-base-v2 |
|
results: |
|
- task: |
|
type: contemporary-lb |
|
name: Contemporary-lb |
|
dataset: |
|
name: Contemporary-lb |
|
type: contemporary-lb |
|
metrics: |
|
- type: accuracy |
|
value: 0.594 |
|
name: SIB-200(LB) accuracy |
|
- type: accuracy |
|
value: 0.805 |
|
name: ParaLUX accuracy |
|
- task: |
|
type: bitext-mining |
|
name: LBHistoricalBitextMining |
|
dataset: |
|
name: LBHistoricalBitextMining |
|
type: lb-en |
|
metrics: |
|
- type: accuracy |
|
value: 0.8932 |
|
name: LB<->FR accuracy |
|
- type: accuracy |
|
value: 0.8955 |
|
name: LB<->EN accuracy |
|
- type: accuracy
|
value: 0.9144 |
|
name: LB<->DE accuracy |
|
license: agpl-3.0 |
|
datasets: |
|
- impresso-project/HistLuxAlign |
|
- fredxlpy/LuxAlign |
|
language: |
|
- lb |
|
--- |
|
|
|
|
|
# Luxembourgish adaptation of sentence-transformers/paraphrase-multilingual-mpnet-base-v2 |
|
|
|
This is a [sentence-transformers](https://www.SBERT.net) model fine-tuned from [sentence-transformers/paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) and further adapted to support Historical and Contemporary Luxembourgish. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for (cross-lingual) semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
|
|
|
## Model Details |
|
|
|
This model is specialised for cross-lingual semantic search to and from Historical/Contemporary Luxembourgish. It is particularly useful for libraries and archives that want to perform semantic search and longitudinal studies within their collections.
|
|
|
This is a [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) model that was further adapted by Michail et al. (2025).
|
|
|
## Limitations |
|
|
|
This model only supports inputs of up to 128 subtokens; longer inputs are truncated.

We also release a model that performs better (by 7.5 pp) on Historical Bitext Mining and natively supports long contexts (8,192 subtokens). For most use cases, we recommend [histlux-gte-multilingual-base](https://huggingface.co/impresso-project/histlux-gte-multilingual-base).

However, this model exhibits superior performance (by 18 pp) on the adversarial paraphrase discrimination task ParaLUX.
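
If you need to verify or tighten the truncation limit in code, the effective cap is exposed via the model's `max_seq_length` attribute; a minimal check:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("impresso-project/histlux-paraphrase-multilingual-mpnet-base-v2")

# Inputs longer than this many subtokens are silently truncated at encode time.
print(model.max_seq_length)  # 128
```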
|
|
|
### Model Description |
|
- **Model Type:** Sentence Transformer |
|
- **Base model:** [sentence-transformers/paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) <!-- at revision 75c57757a97f90ad739aca51fa8bfea0e485a7f2 --> |
|
- **Maximum Sequence Length:** 128 tokens |
|
- **Output Dimensionality:** 768 dimensions |
|
- **Similarity Function:** Cosine Similarity |
|
- **Training Dataset:** |
|
- LB-EN (Historical, Modern) |
|
- **Language:** Luxembourgish (lb)

- **License:** agpl-3.0
|
|
|
### Model Sources |
|
|
|
- **Documentation:** [Sentence Transformers Documentation](https://sbert.net) |
|
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers) |
|
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers) |
|
|
|
|
|
## Usage |
|
|
|
### Direct Usage (Sentence Transformers) |
|
|
|
First install the Sentence Transformers library: |
|
|
|
```bash |
|
pip install -U sentence-transformers |
|
``` |
|
|
|
Then you can load this model and run inference. |
|
```python |
|
from sentence_transformers import SentenceTransformer |
|
|
|
# Download from the 🤗 Hub |
|
model = SentenceTransformer("impresso-project/histlux-paraphrase-multilingual-mpnet-base-v2") |
|
# Run inference |
|
sentences = [ |
|
'The cross-border workers should also receive more wages.', |
|
"D'grenzarbechetr missten och me' lo'n kre'en.", |
|
"De Néckel: Firun! Dât ass jo ailes, wèll 't get dach neischt un der Bréck gemâcht!", |
|
] |
|
embeddings = model.encode(sentences) |
|
print(embeddings.shape) |
|
# [3, 768] |
|
|
|
# Get the similarity scores for the embeddings |
|
similarities = model.similarity(embeddings, embeddings) |
|
print(similarities.shape) |
|
# [3, 3] |
|
``` |
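
Because the model is trained for cross-lingual retrieval, you can embed a (Historical) Luxembourgish corpus once and query it in another language. Below is a minimal sketch using the library's `semantic_search` utility, reusing sentences from the examples above (the corpus and query are illustrative):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("impresso-project/histlux-paraphrase-multilingual-mpnet-base-v2")

# Small Luxembourgish corpus, e.g. sentences from a digitised newspaper archive
corpus = [
    "Dës Reegelung gouf wärend der Covid-Pandemie ausgesat.",
    "D'grenzarbechetr missten och me' lo'n kre'en.",
    "Jean Asselborn gesäit 4 Méiglechkeeten, wéi et kéint virugoen.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# English query searched against the Luxembourgish corpus
query_embedding = model.encode(
    "The cross-border workers should also receive more wages.",
    convert_to_tensor=True,
)

hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(corpus[hit["corpus_id"]], f"(score: {hit['score']:.3f})")
```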
|
|
|
|
|
|
## Evaluation |
|
|
|
### Metrics |
|
|
|
(see the introducing paper for the full evaluation setup)

Historical Bitext Mining (accuracy):

| Direction | Accuracy |
|:----------|---------:|
| LB -> FR  | 88.6 |
| FR -> LB  | 90.0 |
| LB -> EN  | 88.7 |
| EN -> LB  | 90.4 |
| LB -> DE  | 91.1 |
| DE -> LB  | 91.8 |

Contemporary Luxembourgish (accuracy):

| Task | Accuracy |
|:-----|---------:|
| ParaLUX | 80.5 |
| SIB-200 (LB) | 59.4 |
|
|
|
## Training Details |
|
|
|
### Training Dataset |
|
|
|
#### LB-EN (Historical, Modern) |
|
|
|
* Dataset: lb-en (mixed) |
|
* Size: 40,000 training samples |
|
* Columns: <code>english</code>, <code>luxembourgish</code>, and <code>label</code> (the teacher's English embeddings)
|
* Approximate statistics based on the first 1000 samples: |
|
| | english | luxembourgish | label | |
|
|:--------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:-------------------------------------| |
|
| type | string | string | list | |
|
| details | <ul><li>min: 4 tokens</li><li>mean: 25.32 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 5 tokens</li><li>mean: 36.91 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>size: 768 elements</li></ul> | |
|
* Samples: |
|
| english | luxembourgish | label | |
|
|:---------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------| |
|
| <code>A lesson for the next year</code> | <code>Eng le’er fir dat anert joer</code> | <code>[0.08891881257295609, 0.20895496010780334, -0.10672671347856522, -0.03302554786205292, 0.049002278596162796, ...]</code> | |
|
| <code>On Easter, the Maquisards' northern section organizes their big spring ball in Willy Pintsch's hall at the station.</code> | <code>Op O'schteren organisieren d'Maquisard'eiii section Nord, hire gro'sse fre'joersbal am sali Willy Pintsch op der gare.</code> | <code>[-0.08668982982635498, -0.06969941407442093, -0.0036096556577831507, 0.1605304628610611, -0.041704729199409485, ...]</code> | |
|
| <code>The happiness, the peace is long gone now,</code> | <code>V ergângen ass nu läng dat gléck, de' fréd,</code> | <code>[0.07229219377040863, 0.3288629353046417, -0.012548360042273998, 0.06720984727144241, -0.02617395855486393, ...]</code> | |
|
* Loss: [<code>MSELoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#mseloss) |
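
Training follows the multilingual knowledge-distillation recipe of Reimers & Gurevych (2020): both the English sentence and its Luxembourgish counterpart are regressed, via MSE, onto the teacher's embedding of the English side. The following is a minimal sketch of that setup with the Sentence Transformers v3 trainer, not the exact training script; the inline one-row dataset is illustrative:

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

# Teacher produces the target embeddings; the student is adapted to Luxembourgish.
teacher = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
student = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

english = ["A lesson for the next year"]
luxembourgish = ["Eng le'er fir dat anert joer"]

# Teacher embeddings of the English side are the regression targets for
# BOTH text columns (english and luxembourgish).
train_dataset = Dataset.from_dict({
    "english": english,
    "luxembourgish": luxembourgish,
    "label": [emb.tolist() for emb in teacher.encode(english)],
})

trainer = SentenceTransformerTrainer(
    model=student,
    train_dataset=train_dataset,
    loss=losses.MSELoss(student),
)
trainer.train()
```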
|
|
|
### Training Hyperparameters

#### Non-Default Hyperparameters

- `learning_rate`: 2e-05

- `num_train_epochs`: 5

- `warmup_ratio`: 0.1

- `bf16`: True

All other hyperparameters were left at their defaults.
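
For reference, these map onto the trainer's arguments roughly as follows (a hedged sketch using the standard `SentenceTransformerTrainingArguments`; the output directory name is illustrative):

```python
from sentence_transformers import SentenceTransformerTrainingArguments

# Non-default values from the list above; everything else stays at its default.
args = SentenceTransformerTrainingArguments(
    output_dir="histlux-mpnet",  # illustrative path
    learning_rate=2e-5,
    num_train_epochs=5,
    warmup_ratio=0.1,
    bf16=True,
)
```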
|
|
|
### Framework Versions |
|
- Python: 3.11.11 |
|
- Sentence Transformers: 3.4.1 |
|
- Transformers: 4.49.0 |
|
- PyTorch: 2.6.0 |
|
- Accelerate: 1.4.0 |
|
- Datasets: 3.3.2 |
|
- Tokenizers: 0.21.0 |
|
|
|
## Citation |
|
|
|
### BibTeX |
|
|
|
#### Adapting Multilingual Embedding Models to Historical Luxembourgish (introducing paper) |
|
|
|
```bibtex |
|
@misc{michail2025adaptingmultilingualembeddingmodels, |
|
title={Adapting Multilingual Embedding Models to Historical Luxembourgish}, |
|
author={Andrianos Michail and Corina Julia Raclé and Juri Opitz and Simon Clematide}, |
|
year={2025}, |
|
eprint={2502.07938}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL}, |
|
url={https://arxiv.org/abs/2502.07938}, |
|
} |
|
``` |
|
|
|
#### Multilingual Knowledge Distillation |
|
|
|
```bibtex |
|
@inproceedings{reimers-2020-multilingual-sentence-bert, |
|
title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation", |
|
author = "Reimers, Nils and Gurevych, Iryna", |
|
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing", |
|
month = "11", |
|
year = "2020", |
|
publisher = "Association for Computational Linguistics", |
|
url = "https://arxiv.org/abs/2004.09813", |
|
} |
|
``` |
|
|