metadata
tags:
  - sentence-transformers
  - sentence-similarity
  - dataset_size:120000
  - multilingual
base_model: Alibaba-NLP/gte-multilingual-base
widget:
  - source_sentence: Who is filming along?
    sentences:
      - Wién filmt mat?
      - >-
        Weider huet den Tatarescu drop higewisen, datt Rumänien durch seng
        krichsbedélegong op de 6eite vun den allie'erten 110.000 mann verluer
        hätt.
      - Brambilla 130.08.03 St.
  - source_sentence: 'Four potential scenarios could still play out: Jean Asselborn.'
    sentences:
      - >-
        Dann ass nach eng Antenne hei um Kierchbierg virgesi Richtung RTL Gebai,
        do gëtt jo een ganz neie Wunnquartier gebaut.
      - >-
        D'bedélegong un de wählen wir ganz stärk gewiéscht a munche ge'genden
        wor re eso'gucr me' we' 90 prozent.
      - Jean Asselborn gesäit 4 Méiglechkeeten, wéi et kéint virugoen.
  - source_sentence: >-
      Non-profit organisation Passerell, which provides legal council to
      refugees in Luxembourg, announced that it has to make four employees
      redundant in August due to a lack of funding.
    sentences:
      - Oetringen nach Remich....8.20» 215»
      - >-
        D'ASBL Passerell, déi sech ëm d'Berodung vu Refugiéeën a Saache
        Rechtsfroe këmmert, wäert am August mussen hir véier fix Salariéen
        entloossen.
      - D'Regierung huet allerdéngs "just" 180.041 Doudeger verzeechent.
  - source_sentence: This regulation was temporarily lifted during the Covid pandemic.
    sentences:
      - Six Jours vu New-York si fir d’équipe Girgetti  Debacco
      - Dës Reegelung gouf wärend der Covid-Pandemie ausgesat.
      - ING-Marathon ouni gréisser Tëschefäll ofgelaf - 18 Leit hospitaliséiert.
  - source_sentence: The cross-border workers should also receive more wages.
    sentences:
      - D'grenzarbechetr missten och me' lo'n kre'en.
      - >-
        De Néckel: Firun! Dât ass jo ailes, wèll 't get dach neischt un der
        Bréck gemâcht!
      - >-
        D'Grande-Duchesse Josephine Charlotte an hir Ministeren hunn d'Land
        verlooss, et war den Optakt vun der Zäit am Exil.
pipeline_tag: sentence-similarity
library_name: sentence-transformers
model-index:
  - name: SentenceTransformer based on Alibaba-NLP/gte-multilingual-base
    results:
      - task:
          type: contemporary-lb
          name: Contemporary-lb
        dataset:
          name: Contemporary-lb
          type: contemporary-lb
        metrics:
          - type: accuracy
            value: 0.6216
            name: SIB-200(LB) accuracy
          - type: accuracy
            value: 0.6282
            name: ParaLUX accuracy
      - task:
          type: bitext-mining
          name: LBHistoricalBitextMining
        dataset:
          name: LBHistoricalBitextMining
          type: lb-en
        metrics:
          - type: accuracy
            value: 0.9683
            name: LB<->FR accuracy
          - type: accuracy
            value: 0.9715
            name: LB<->EN accuracy
          - type: mean_accuracy
            value: 0.9793
            name: LB<->DE accuracy
license: agpl-3.0
datasets:
  - impresso-project/HistLuxAlign
  - fredxlpy/LuxAlign
language:
  - lb

Luxembourgish adaptation of Alibaba-NLP/gte-multilingual-base

This is a sentence-transformers model fine-tuned from Alibaba-NLP/gte-multilingual-base and further adapted to support Historical and Contemporary Luxembourgish. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for (cross-lingual) semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

This model is specialised for cross-lingual semantic search to and from Historical and Contemporary Luxembourgish. It is particularly useful for libraries and archives that want to run semantic search and longitudinal studies across their collections.

This is an Alibaba-NLP/gte-multilingual-base model that was further adapted by Michail et al. (2025).

Limitations

Whilst this model natively supports inputs of up to 8192 tokens, all of our evaluations are at sentence level, so there are no guarantees about its embedding quality on longer Historical Luxembourgish texts.

We also release a model that performs better (by 18 percentage points) on ParaLUX. If finding monolingual exact matches within adversarial collections is of utmost importance, please use histlux-paraphrase-multilingual-mpnet-base-v2.
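Because the evaluations cover sentence-level inputs only, one cautious option for longer documents is to embed sentence-sized pieces and average the vectors. The following is a minimal sketch under that assumption, not a method from the paper: `split_sentences`, `embed_long_text` and `toy_encode` are hypothetical helpers, and `toy_encode` only stands in for `model.encode` so the example runs without downloading the model.

```python
import re
import numpy as np

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def embed_long_text(text, encode):
    """Embed each sentence separately and average the resulting vectors."""
    sentences = split_sentences(text)
    vectors = np.asarray(encode(sentences))
    return vectors.mean(axis=0)

# Stand-in encoder for illustration only (assumption); in practice pass
# model.encode from the SentenceTransformer instance instead.
def toy_encode(sentences):
    return np.array([[len(s), s.count(" ") + 1] for s in sentences], dtype=float)

doc = "Dës Reegelung gouf ausgesat. Wién filmt mat? Jo."
doc_vector = embed_long_text(doc, toy_encode)
print(split_sentences(doc))
print(doc_vector.shape)
```

The regular-expression splitter is deliberately simple; historical newspaper text with abbreviations or OCR noise will likely need a more robust segmenter.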

Model Description

  • Model Type: GTE-Multilingual-Base
  • Base model: Alibaba-NLP/gte-multilingual-base
  • Maximum Sequence Length: 8192 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset:
    • LB-EN (Hist-TR, RTL-M)

Usage (Sentence-Transformers)

Using this model becomes easy when you have sentence-transformers installed:

pip install -U sentence-transformers

Then you can use the model like this:

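from sentence_transformers import SentenceTransformer

# Example inputs; any list of strings works.
sentences = ["This is an example sentence", "Each sentence is converted"]

# trust_remote_code=True is needed because the base model ships custom modelling code.
model = SentenceTransformer('impresso-project/histlux-gte-multilingual-base', trust_remote_code=True)
embeddings = model.encode(sentences)  # one 768-dimensional vector per sentence
print(embeddings)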
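For cross-lingual semantic search, the embeddings can be ranked by cosine similarity, the similarity function listed for this model. The sketch below shows the ranking step with NumPy only; `top_k_cosine` is a hypothetical helper, and the toy 3-dimensional vectors stand in for real 768-dimensional outputs of `model.encode`.

```python
import numpy as np

def top_k_cosine(query_emb, corpus_embs, k=3):
    """Return (indices, scores) of the k corpus rows most similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    scores = c @ q                      # cosine similarity per corpus row
    order = np.argsort(-scores)[:k]     # highest similarity first
    return order, scores[order]

# Toy vectors for illustration (assumption). With the real model one would use
#   corpus = model.encode(list_of_luxembourgish_sentences)
#   query = model.encode("This regulation was temporarily lifted.")
corpus = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.7, 0.7, 0.0]])
query = np.array([1.0, 0.1, 0.0])

indices, scores = top_k_cosine(query, corpus, k=2)
print(indices, scores)
```

Normalising both sides first means the dot product equals the cosine similarity, so the same function works whether or not the encoder returns unit-length vectors.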

Evaluation Results

Metrics

Historical Bitext Mining (accuracy, %; see the introducing paper):

  • LB -> FR: 96.8
  • FR -> LB: 96.9
  • LB -> EN: 97.2
  • EN -> LB: 97.2
  • LB -> DE: 98.0
  • DE -> LB: 91.8

Contemporary LB (accuracy, %):

  • ParaLUX: 62.82
  • SIB-200(LB): 62.16
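As a rough illustration of how a bitext mining accuracy can be computed, the sketch below assumes the common definition: a source sentence counts as correct when its nearest target by cosine similarity is its aligned translation. The exact protocol is described in the introducing paper and may differ; `bitext_mining_accuracy` and the toy 2-dimensional vectors are hypothetical.

```python
import numpy as np

def bitext_mining_accuracy(src_embs, tgt_embs):
    """Fraction of source rows whose nearest target row (by cosine) has the same index."""
    s = src_embs / np.linalg.norm(src_embs, axis=1, keepdims=True)
    t = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
    nearest = np.argmax(s @ t.T, axis=1)
    return float(np.mean(nearest == np.arange(len(s))))

# Toy aligned "embeddings": row i of src is the translation of row i of tgt.
src = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
tgt = np.array([[0.9, 0.1], [0.2, 0.8], [1.0, 0.0]])

forward = bitext_mining_accuracy(src, tgt)   # src -> tgt direction
backward = bitext_mining_accuracy(tgt, src)  # tgt -> src direction
print(forward, backward)
```

On this toy data the two directions give different accuracies (1/3 and 2/3), which is why each direction is reported separately.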

Training Details

Training Dataset

The parallel sentences data mix is the following:

impresso-project/HistLuxAlign:

  • LB-FR (x20,000)
  • LB-EN (x20,000)
  • LB-DE (x20,000)

fredxlpy/LuxAlign:

  • LB-FR (x40,000)
  • LB-EN (x20,000)

Total: 120,000 sentence pairs, in mixed batches of size 8
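The arithmetic behind the data mix can be checked with a few lines. The pair counts and the batch size are taken from the list above; the variable names are only illustrative.

```python
# Sentence-pair counts per dataset and language pair, as listed above.
data_mix = {
    "impresso-project/HistLuxAlign": {"LB-FR": 20_000, "LB-EN": 20_000, "LB-DE": 20_000},
    "fredxlpy/LuxAlign": {"LB-FR": 40_000, "LB-EN": 20_000},
}
batch_size = 8

pairs_per_dataset = {name: sum(pairs.values()) for name, pairs in data_mix.items()}
total_pairs = sum(pairs_per_dataset.values())
steps_per_epoch = total_pairs // batch_size

# Aggregate across datasets to see the language-pair balance.
pairs_per_language = {}
for pairs in data_mix.values():
    for lang_pair, count in pairs.items():
        pairs_per_language[lang_pair] = pairs_per_language.get(lang_pair, 0) + count

print(total_pairs)         # 120000
print(steps_per_epoch)     # 15000
print(pairs_per_language)  # {'LB-FR': 60000, 'LB-EN': 40000, 'LB-DE': 20000}
```

Historical and contemporary data each contribute 60,000 pairs, and LB-FR makes up half of the mix.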

Contrastive Training

The model was trained with the parameters:

**Loss**:

`sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss` with parameters:

{'scale': 20.0, 'similarity_fct': 'cos_sim'}
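To make the loss concrete, here is a minimal NumPy sketch of the multiple negatives ranking objective with the parameters above: cosine similarities between every anchor and every positive in the batch are multiplied by `scale`, and cross-entropy is applied with the aligned pair as the correct class. It is a simplified re-implementation for illustration, not the library code; `mnr_loss` is a hypothetical name.

```python
import numpy as np

def mnr_loss(anchor_embs, positive_embs, scale=20.0):
    """In-batch contrastive loss: row i of anchors should match row i of positives."""
    a = anchor_embs / np.linalg.norm(anchor_embs, axis=1, keepdims=True)
    p = positive_embs / np.linalg.norm(positive_embs, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                            # scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # cross-entropy on the diagonal

anchors = np.eye(3)
aligned = mnr_loss(anchors, np.eye(3))                # every pair matches perfectly
shuffled = mnr_loss(anchors, np.eye(3)[[1, 2, 0]])    # every pair is mismatched
print(aligned, shuffled)
```

With a batch of 8 pairs, each anchor is contrasted against the 7 other positives in the batch, which serve as negatives.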


Parameters of the fit()-Method:

{
    "epochs": 1,
    "evaluation_steps": 520,
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear"
}
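The `WarmupLinear` scheduler raises the learning rate linearly from zero to its peak during warmup and then decays it linearly to zero. The sketch below reproduces that shape in plain Python. The peak of 2e-05 comes from the parameters above, and 15,000 total steps follows from 120,000 pairs at batch size 8 for one epoch; the number of warmup steps is not stated in this card, so `warmup_steps=1500` is purely an assumption, and `warmup_linear_lr` is a hypothetical helper.

```python
def warmup_linear_lr(step, total_steps, warmup_steps, base_lr=2e-5):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

total_steps = 15_000   # 120,000 pairs / batch size 8, one epoch
warmup_steps = 1_500   # assumption for illustration; not given in this card

for step in (0, 750, 1_500, 8_250, 15_000):
    print(step, warmup_linear_lr(step, total_steps, warmup_steps))
```

With evaluation every 520 steps, a run of this length would be evaluated 28 times during training.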


Citation

BibTeX

Adapting Multilingual Embedding Models to Historical Luxembourgish (introducing paper)

@misc{michail2025adaptingmultilingualembeddingmodels,
      title={Adapting Multilingual Embedding Models to Historical Luxembourgish}, 
      author={Andrianos Michail and Corina Julia Raclé and Juri Opitz and Simon Clematide},
      year={2025},
      eprint={2502.07938},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.07938}, 
}

Original Multilingual GTE Model

@inproceedings{zhang2024mgte,
  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
  author={Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Wen and Dai, Ziqi and Tang, Jialong and Lin, Huan and Yang, Baosong and Xie, Pengjun and Huang, Fei and others},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track},
  pages={1393--1412},
  year={2024}
}