---
tags:
- sentence-transformers
- sentence-similarity
- dataset_size:120000
- multilingual
base_model: Alibaba-NLP/gte-multilingual-base
widget:
- source_sentence: Who is filming along?
  sentences:
  - Wién filmt mat?
  - >-
    Weider huet den Tatarescu drop higewisen, datt Rumänien durch seng
    krichsbedélegong op de 6eite vun den allie'erten 110.000 mann verluer hätt.
  - Brambilla 130.08.03 St.
- source_sentence: 'Four potential scenarios could still play out: Jean Asselborn.'
  sentences:
  - >-
    Dann ass nach eng Antenne hei um Kierchbierg virgesi Richtung RTL Gebai, do
    gëtt jo een ganz neie Wunnquartier gebaut.
  - >-
    D'bedélegong un de wählen wir ganz stärk gewiéscht a munche ge'genden wor re
    eso'gucr me' we' 90 prozent.
  - Jean Asselborn gesäit 4 Méiglechkeeten, wéi et kéint virugoen.
- source_sentence: >-
    Non-profit organisation Passerell, which provides legal council to refugees
    in Luxembourg, announced that it has to make four employees redundant in
    August due to a lack of funding.
  sentences:
  - Oetringen nach Remich....8.20» 215»
  - >-
    D'ASBL Passerell, déi sech ëm d'Berodung vu Refugiéeën a Saache Rechtsfroe
    këmmert, wäert am August mussen hir véier fix Salariéen entloossen.
  - D'Regierung huet allerdéngs "just" 180.041 Doudeger verzeechent.
- source_sentence: This regulation was temporarily lifted during the Covid pandemic.
  sentences:
  - Six Jours vu New-York si fir d’équipe Girgetti — Debacco
  - Dës Reegelung gouf wärend der Covid-Pandemie ausgesat.
  - ING-Marathon ouni gréisser Tëschefäll ofgelaf - 18 Leit hospitaliséiert.
- source_sentence: The cross-border workers should also receive more wages.
  sentences:
  - D'grenzarbechetr missten och me' lo'n kre'en.
  - >-
    De Néckel: Firun! Dât ass jo ailes, wèll 't get dach neischt un der Bréck
    gemâcht!
  - >-
    D'Grande-Duchesse Josephine Charlotte an hir Ministeren hunn d'Land
    verlooss, et war den Optakt vun der Zäit am Exil.
pipeline_tag: sentence-similarity
library_name: sentence-transformers
model-index:
- name: SentenceTransformer based on Alibaba-NLP/gte-multilingual-base
  results:
  - task:
      type: contemporary-lb
      name: Contemporary-lb
    dataset:
      name: Contemporary-lb
      type: contemporary-lb
    metrics:
    - type: accuracy
      value: 0.6216
      name: SIB-200(LB) accuracy
    - type: accuracy
      value: 0.6282
      name: ParaLUX accuracy
  - task:
      type: bitext-mining
      name: LBHistoricalBitextMining
    dataset:
      name: LBHistoricalBitextMining
      type: lb-en
    metrics:
    - type: accuracy
      value: 0.9683
      name: LB<->FR accuracy
    - type: accuracy
      value: 0.9715
      name: LB<->EN accuracy
    - type: mean_accuracy
      value: 0.9793
      name: LB<->DE accuracy
license: agpl-3.0
datasets:
- impresso-project/HistLuxAlign
- fredxlpy/LuxAlign
language:
- lb
---

# Luxembourgish adaptation of Alibaba-NLP/gte-multilingual-base

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base), further adapted to support Historical and Contemporary Luxembourgish. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for (cross-lingual) semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

This model is specialised in cross-lingual semantic search to and from Historical/Contemporary Luxembourgish. It is particularly useful for libraries and archives that want to perform semantic search and longitudinal studies within their collections.

It is an [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) model that was further adapted by Michail et al. (2025).

## Limitations

We also release a model that performs better (by 18 percentage points) on ParaLUX. If finding monolingual exact matches within adversarial collections is of utmost importance, please use [histlux-paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/impresso-project/histlux-paraphrase-multilingual-mpnet-base-v2).

### Model Description

- **Model Type:** Sentence Transformer
- **Base model:** [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base)
- **Maximum Sequence Length:** 8192 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
- **Training Dataset:**
  - LB-FR, LB-EN and LB-DE parallel sentences (Historical and Contemporary)
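
These properties can be inspected directly on the loaded model; a minimal sketch using standard sentence-transformers attributes:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('impresso-project/histlux-gte-multilingual-base', trust_remote_code=True)

print(model.max_seq_length)                      # maximum sequence length in tokens
print(model.get_sentence_embedding_dimension())  # output dimensionality (768)
print(model.similarity_fn_name)                  # similarity function used by model.similarity()
```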

## Usage (Sentence-Transformers)

Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('impresso-project/histlux-gte-multilingual-base', trust_remote_code=True)
embeddings = model.encode(sentences)
print(embeddings)
```
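
For cross-lingual use, an English query can be compared against Luxembourgish text directly. A minimal sketch, reusing two of the widget sentences above as an illustrative "collection":

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('impresso-project/histlux-gte-multilingual-base', trust_remote_code=True)

query = "This regulation was temporarily lifted during the Covid pandemic."
collection = [
    "Dës Reegelung gouf wärend der Covid-Pandemie ausgesat.",
    "Jean Asselborn gesäit 4 Méiglechkeeten, wéi et kéint virugoen.",
]

query_emb = model.encode([query])
collection_emb = model.encode(collection)

# Cosine similarity between the query and each collection sentence;
# the Luxembourgish paraphrase of the query should score highest
scores = model.similarity(query_emb, collection_emb)
print(scores)
```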

## Evaluation Results

### Metrics

The results below are reported in the introducing paper (Michail et al., 2025).

Historical Bitext Mining (accuracy):

| Direction | Accuracy |
|:----------|---------:|
| LB → FR   | 96.8     |
| FR → LB   | 96.9     |
| LB → EN   | 97.2     |
| EN → LB   | 97.2     |
| LB → DE   | 98.0     |
| DE → LB   | 91.8     |

Contemporary Luxembourgish (accuracy):

| Benchmark    | Accuracy |
|:-------------|---------:|
| SIB-200 (LB) | 62.16    |
| ParaLUX      | 62.82    |
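
Bitext mining accuracy measures how often the nearest neighbour of a sentence, among all candidate sentences in the other language, is its actual translation. The paper's exact evaluation protocol may differ; a minimal sketch of this procedure, with two illustrative placeholder pairs:

```python
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('impresso-project/histlux-gte-multilingual-base', trust_remote_code=True)

# Aligned test lists: lb_sentences[i] is parallel to en_sentences[i]
lb_sentences = ["Wién filmt mat?", "Dës Reegelung gouf wärend der Covid-Pandemie ausgesat."]
en_sentences = ["Who is filming along?", "This regulation was temporarily lifted during the Covid pandemic."]

lb_emb = model.encode(lb_sentences, convert_to_tensor=True)
en_emb = model.encode(en_sentences, convert_to_tensor=True)

# For each Luxembourgish sentence, retrieve the most similar English sentence
scores = model.similarity(lb_emb, en_emb)  # (n, n) cosine similarity matrix
predictions = scores.argmax(dim=1)         # index of the best match per row
gold = torch.arange(len(lb_sentences), device=predictions.device)
accuracy = (predictions == gold).float().mean().item()
print(f"LB->EN accuracy: {accuracy:.4f}")
```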

## Training Details

### Training Dataset

The parallel sentence data mix is as follows:

[impresso-project/HistLuxAlign](https://huggingface.co/datasets/impresso-project/HistLuxAlign):
- LB-FR (x20,000)
- LB-EN (x20,000)
- LB-DE (x20,000)

[fredxlpy/LuxAlign](https://huggingface.co/datasets/fredxlpy/LuxAlign):
- LB-FR (x40,000)
- LB-EN (x20,000)

Total: 120,000 sentence pairs, presented in mixed batches of size 8 (see the training sketch below).

### Contrastive Training

The model was trained with the following parameters:

**Loss**:

`sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss` with parameters:

```
{'scale': 20.0, 'similarity_fct': 'cos_sim'}
```

Parameters of the fit() method:

```
{
    "epochs": 1,
    "evaluation_steps": 520,
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear"
}
```
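
For reference, these settings map onto the classic sentence-transformers `fit()` loop roughly as follows. This is a minimal sketch, not the authors' actual training script: the training pairs shown are hypothetical placeholders for the 120,000 parallel sentences described above.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('Alibaba-NLP/gte-multilingual-base', trust_remote_code=True)

# Placeholder: one InputExample per LB-FR / LB-EN / LB-DE parallel sentence pair
train_examples = [
    InputExample(texts=["Wién filmt mat?", "Who is filming along?"]),
    # ... remaining pairs from HistLuxAlign and LuxAlign
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=8)

# In-batch negatives with cosine similarity scaled by 20, as reported above
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    evaluation_steps=520,
    max_grad_norm=1,
    optimizer_params={"lr": 2e-05},
    scheduler="WarmupLinear",
)
```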

## Citation

### BibTeX

#### Adapting Multilingual Embedding Models to Historical Luxembourgish (introducing paper)

```bibtex
@misc{michail2025adaptingmultilingualembeddingmodels,
      title={Adapting Multilingual Embedding Models to Historical Luxembourgish},
      author={Andrianos Michail and Corina Julia Raclé and Juri Opitz and Simon Clematide},
      year={2025},
      eprint={2502.07938},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.07938},
}
```

#### Original Multilingual GTE Model

```bibtex
@inproceedings{zhang2024mgte,
  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
  author={Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Wen and Dai, Ziqi and Tang, Jialong and Lin, Huan and Yang, Baosong and Xie, Pengjun and Huang, Fei and others},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track},
  pages={1393--1412},
  year={2024}
}
```