---
tags:
- sentence-transformers
- sentence-similarity
- dataset_size:120000
- multilingual
base_model: Alibaba-NLP/gte-multilingual-base
widget:
- source_sentence: Who is filming along?
sentences:
- Wién filmt mat?
- >-
Weider huet den Tatarescu drop higewisen, datt Rumänien durch seng
krichsbedélegong op de 6eite vun den allie'erten 110.000 mann verluer hätt.
- Brambilla 130.08.03 St.
- source_sentence: 'Four potential scenarios could still play out: Jean Asselborn.'
sentences:
- >-
Dann ass nach eng Antenne hei um Kierchbierg virgesi Richtung RTL Gebai, do
gëtt jo een ganz neie Wunnquartier gebaut.
- >-
D'bedélegong un de wählen wir ganz stärk gewiéscht a munche ge'genden wor re
eso'gucr me' we' 90 prozent.
- Jean Asselborn gesäit 4 Méiglechkeeten, wéi et kéint virugoen.
- source_sentence: >-
Non-profit organisation Passerell, which provides legal council to refugees
in Luxembourg, announced that it has to make four employees redundant in
August due to a lack of funding.
sentences:
- Oetringen nach Remich....8.20» 215»
- >-
D'ASBL Passerell, déi sech ëm d'Berodung vu Refugiéeën a Saache Rechtsfroe
këmmert, wäert am August mussen hir véier fix Salariéen entloossen.
- D'Regierung huet allerdéngs "just" 180.041 Doudeger verzeechent.
- source_sentence: This regulation was temporarily lifted during the Covid pandemic.
sentences:
- Six Jours vu New-York si fir d’équipe Girgetti Debacco
- Dës Reegelung gouf wärend der Covid-Pandemie ausgesat.
- ING-Marathon ouni gréisser Tëschefäll ofgelaf - 18 Leit hospitaliséiert.
- source_sentence: The cross-border workers should also receive more wages.
sentences:
- D'grenzarbechetr missten och me' lo'n kre'en.
- >-
De Néckel: Firun! Dât ass jo ailes, wèll 't get dach neischt un der Bréck
gemâcht!
- >-
D'Grande-Duchesse Josephine Charlotte an hir Ministeren hunn d'Land
verlooss, et war den Optakt vun der Zäit am Exil.
pipeline_tag: sentence-similarity
library_name: sentence-transformers
model-index:
- name: >-
SentenceTransformer based on
Alibaba-NLP/gte-multilingual-base
results:
- task:
type: contemporary-lb
name: Contemporary-lb
dataset:
name: Contemporary-lb
type: contemporary-lb
metrics:
- type: accuracy
value: 0.6216
name: SIB-200(LB) accuracy
- type: accuracy
value: 0.6282
name: ParaLUX accuracy
- task:
type: bitext-mining
name: LBHistoricalBitextMining
dataset:
name: LBHistoricalBitextMining
type: lb-en
metrics:
- type: accuracy
value: 0.9683
name: LB<->FR accuracy
- type: accuracy
value: 0.9715
name: LB<->EN accuracy
      - type: accuracy
value: 0.9793
name: LB<->DE accuracy
license: agpl-3.0
datasets:
- impresso-project/HistLuxAlign
- fredxlpy/LuxAlign
language:
- lb
---
# Luxembourgish adaptation of Alibaba-NLP/gte-multilingual-base
This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) and further adapted to support Historical and Contemporary Luxembourgish. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for (cross-lingual) semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
## Model Details
This model is specialised in cross-lingual semantic search to and from Historical/Contemporary Luxembourgish, making it particularly useful for libraries and archives that want to perform semantic search and longitudinal studies within their collections.
It is an [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) model that was further adapted by Michail et al. (2025).
## Limitations
We also release a model that performs considerably better (by 18 percentage points) on ParaLUX. If finding monolingual exact matches within adversarial collections is of utmost importance, please use [histlux-paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/impresso-project/histlux-paraphrase-multilingual-mpnet-base-v2).
### Model Description
- **Model Type:** GTE-Multilingual-Base
- **Base model:** [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base)
- **Maximum Sequence Length:** 8192 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
- **Training Dataset:**
- LB-EN (Historical, Modern)
## Usage (Sentence-Transformers)
Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
```bash
pip install -U sentence-transformers
```
Then you can use the model like this:
```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# trust_remote_code=True is required because the GTE models ship custom modeling code
model = SentenceTransformer('impresso-project/histlux-gte-multilingual-base', trust_remote_code=True)
embeddings = model.encode(sentences)  # shape: (2, 768)
print(embeddings)
```
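For cross-lingual semantic search, you can compare an English query against Luxembourgish candidates. Below is a minimal sketch using sentences taken from the widget examples above; it assumes a recent sentence-transformers version (v3+) that provides `model.similarity`, while older versions can use `sentence_transformers.util.cos_sim` instead:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('impresso-project/histlux-gte-multilingual-base', trust_remote_code=True)

# English query and Luxembourgish candidates (taken from the widget examples above)
query = "This regulation was temporarily lifted during the Covid pandemic."
candidates = [
    "Dës Reegelung gouf wärend der Covid-Pandemie ausgesat.",
    "ING-Marathon ouni gréisser Tëschefäll ofgelaf - 18 Leit hospitaliséiert.",
]

query_embedding = model.encode([query])
candidate_embeddings = model.encode(candidates)

# Cosine similarity between the query and each candidate;
# the first candidate (the actual translation) should score highest
scores = model.similarity(query_embedding, candidate_embeddings)
print(scores)
```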
## Evaluation Results
### Metrics
The evaluation protocol is described in the introducing paper (Michail et al., 2025).

Historical bitext mining (accuracy):

| Direction | Accuracy |
|-----------|----------|
| LB -> FR  | 96.8     |
| FR -> LB  | 96.9     |
| LB -> EN  | 97.2     |
| EN -> LB  | 97.2     |
| LB -> DE  | 98.0     |
| DE -> LB  | 91.8     |

Contemporary Luxembourgish (accuracy):

| Task         | Accuracy |
|--------------|----------|
| ParaLUX      | 62.82    |
| SIB-200 (LB) | 62.16    |
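The bitext mining scores can be reproduced in spirit with the built-in `TranslationEvaluator` of sentence-transformers, which reports how often the nearest neighbour of each sentence in the other language is its actual translation. A minimal sketch with a single placeholder pair from the widget examples; the paper's exact test data and protocol are described therein:
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import TranslationEvaluator

model = SentenceTransformer('impresso-project/histlux-gte-multilingual-base', trust_remote_code=True)

# Aligned lists: lb_sentences[i] must be the translation of en_sentences[i].
# Shown with one historical pair from the widget; the real evaluation uses full test sets.
lb_sentences = ["D'grenzarbechetr missten och me' lo'n kre'en."]
en_sentences = ["The cross-border workers should also receive more wages."]

evaluator = TranslationEvaluator(lb_sentences, en_sentences, name="lb-en")
print(evaluator(model))  # accuracy of nearest-neighbour retrieval in both directions
```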
## Training Details
### Training Dataset
The parallel-sentence data mix is the following:

[impresso-project/HistLuxAlign](https://huggingface.co/datasets/impresso-project/HistLuxAlign):
- LB-FR (20,000 pairs)
- LB-EN (20,000 pairs)
- LB-DE (20,000 pairs)

[fredxlpy/LuxAlign](https://huggingface.co/datasets/fredxlpy/LuxAlign):
- LB-FR (40,000 pairs)
- LB-EN (20,000 pairs)

Total: 120,000 sentence pairs, trained in mixed batches of size 8.
### Contrastive Training
The model was trained with the following parameters:

**Loss**: `sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss` with parameters:
```
{'scale': 20.0, 'similarity_fct': 'cos_sim'}
```

Parameters of the fit()-method:
```
{
    "epochs": 1,
    "evaluation_steps": 520,
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear"
}
```
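Assuming the legacy `fit()` API of sentence-transformers (v2.x), the configuration above maps onto a training script roughly as follows. The single pair shown is a placeholder taken from the widget examples, standing in for the 120,000-pair mix described above:
```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('Alibaba-NLP/gte-multilingual-base', trust_remote_code=True)

# Placeholder parallel pair; the actual training mix is the 120,000 pairs listed above
train_examples = [
    InputExample(texts=[
        "The cross-border workers should also receive more wages.",
        "D'grenzarbechetr missten och me' lo'n kre'en.",
    ]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=8)

# In-batch negatives: every other pair in a batch of 8 acts as a negative
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    evaluation_steps=520,
    max_grad_norm=1,
    optimizer_params={"lr": 2e-05},
    scheduler="WarmupLinear",
)
```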
## Citation
### BibTeX
#### Adapting Multilingual Embedding Models to Historical Luxembourgish (introducing paper)
```bibtex
@misc{michail2025adaptingmultilingualembeddingmodels,
title={Adapting Multilingual Embedding Models to Historical Luxembourgish},
author={Andrianos Michail and Corina Julia Raclé and Juri Opitz and Simon Clematide},
year={2025},
eprint={2502.07938},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.07938},
}
```
#### Original Multilingual GTE Model
```bibtex
@inproceedings{zhang2024mgte,
title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
author={Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Wen and Dai, Ziqi and Tang, Jialong and Lin, Huan and Yang, Baosong and Xie, Pengjun and Huang, Fei and others},
booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track},
pages={1393--1412},
year={2024}
}
```