---
tags:
- sentence-transformers
- sentence-similarity
- dataset_size:120000
- multilingual
base_model: Alibaba-NLP/gte-multilingual-base
widget:
- source_sentence: Who is filming along?
  sentences:
  - Wién filmt mat?
  - >-
    Weider huet den Tatarescu drop higewisen, datt Rumänien durch seng
    krichsbedélegong op de 6eite vun den allie'erten 110.000 mann verluer hätt.
  - Brambilla 130.08.03 St.
- source_sentence: 'Four potential scenarios could still play out: Jean Asselborn.'
  sentences:
  - >-
    Dann ass nach eng Antenne hei um Kierchbierg virgesi Richtung RTL Gebai, do
    gëtt jo een ganz neie Wunnquartier gebaut.
  - >-
    D'bedélegong un de wählen wir ganz stärk gewiéscht a munche ge'genden wor re
    eso'gucr me' we' 90 prozent.
  - Jean Asselborn gesäit 4 Méiglechkeeten, wéi et kéint virugoen.
- source_sentence: >-
    Non-profit organisation Passerell, which provides legal council to refugees
    in Luxembourg, announced that it has to make four employees redundant in
    August due to a lack of funding.
  sentences:
  - Oetringen nach Remich....8.20» 215»
  - >-
    D'ASBL Passerell, déi sech ëm d'Berodung vu Refugiéeën a Saache Rechtsfroe
    këmmert, wäert am August mussen hir véier fix Salariéen entloossen.
  - D'Regierung huet allerdéngs "just" 180.041 Doudeger verzeechent.
- source_sentence: This regulation was temporarily lifted during the Covid pandemic.
  sentences:
  - Six Jours vu New-York si fir d’équipe Girgetti — Debacco
  - Dës Reegelung gouf wärend der Covid-Pandemie ausgesat.
  - ING-Marathon ouni gréisser Tëschefäll ofgelaf - 18 Leit hospitaliséiert.
- source_sentence: The cross-border workers should also receive more wages.
  sentences:
  - D'grenzarbechetr missten och me' lo'n kre'en.
  - >-
    De Néckel: Firun! Dât ass jo ailes, wèll 't get dach neischt un der Bréck
    gemâcht!
  - >-
    D'Grande-Duchesse Josephine Charlotte an hir Ministeren hunn d'Land
    verlooss, et war den Optakt vun der Zäit am Exil.
pipeline_tag: sentence-similarity
library_name: sentence-transformers
model-index:
- name: SentenceTransformer based on Alibaba-NLP/gte-multilingual-base
  results:
  - task:
      type: contemporary-lb
      name: Contemporary-lb
    dataset:
      name: Contemporary-lb
      type: contemporary-lb
    metrics:
    - type: accuracy
      value: 0.6216
      name: SIB-200(LB) accuracy
    - type: accuracy
      value: 0.6282
      name: ParaLUX accuracy
  - task:
      type: bitext-mining
      name: LBHistoricalBitextMining
    dataset:
      name: LBHistoricalBitextMining
      type: lb-en
    metrics:
    - type: accuracy
      value: 0.9683
      name: LB<->FR accuracy
    - type: accuracy
      value: 0.9715
      name: LB<->EN accuracy
    - type: mean_accuracy
      value: 0.9793
      name: LB<->DE accuracy
license: agpl-3.0
datasets:
- impresso-project/HistLuxAlign
- fredxlpy/LuxAlign
language:
- lb
---

# Luxembourgish adaptation of Alibaba-NLP/gte-multilingual-base

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base), further adapted to support Historical and Contemporary Luxembourgish. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for (cross-lingual) semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

This model is specialised in cross-lingual semantic search to and from Historical/Contemporary Luxembourgish. It is particularly useful for libraries and archives that want to perform semantic search and longitudinal studies within their collections.

It is an [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) model that was further adapted by Michail et al. (2025).

## Limitations

We also release a model that performs better (by 18 percentage points) on ParaLUX. If finding monolingual exact matches within adversarial collections is of utmost importance, please use [histlux-paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/impresso-project/histlux-paraphrase-multilingual-mpnet-base-v2).

### Model Description

- **Model Type:** Sentence Transformer
- **Base model:** [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base)
- **Maximum Sequence Length:** 8192 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
- **Training Dataset:**
  - LB-FR, LB-EN and LB-DE parallel sentences (Historical and Contemporary)
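
These properties can be inspected directly on the loaded model; a minimal sketch using standard sentence-transformers attributes:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('impresso-project/histlux-gte-multilingual-base', trust_remote_code=True)

print(model.max_seq_length)                      # maximum sequence length in tokens
print(model.get_sentence_embedding_dimension())  # output dimensionality (768)
print(model.similarity_fn_name)                  # similarity function used by model.similarity()
```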

## Usage (Sentence-Transformers)

Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('impresso-project/histlux-gte-multilingual-base', trust_remote_code=True)
embeddings = model.encode(sentences)
print(embeddings)
```
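
For cross-lingual use, an English query can be compared against Luxembourgish text directly. A minimal sketch, reusing two of the widget sentences above as an illustrative "collection":

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('impresso-project/histlux-gte-multilingual-base', trust_remote_code=True)

query = "This regulation was temporarily lifted during the Covid pandemic."
collection = [
    "Dës Reegelung gouf wärend der Covid-Pandemie ausgesat.",
    "Jean Asselborn gesäit 4 Méiglechkeeten, wéi et kéint virugoen.",
]

query_emb = model.encode([query])
collection_emb = model.encode(collection)

# Cosine similarity between the query and each collection sentence;
# the Luxembourgish paraphrase of the query should score highest
scores = model.similarity(query_emb, collection_emb)
print(scores)
```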

## Evaluation Results

### Metrics

The results below are reported in the introducing paper (Michail et al., 2025).

Historical Bitext Mining (accuracy):

| Direction | Accuracy |
|:----------|---------:|
| LB → FR   | 96.8     |
| FR → LB   | 96.9     |
| LB → EN   | 97.2     |
| EN → LB   | 97.2     |
| LB → DE   | 98.0     |
| DE → LB   | 91.8     |

Contemporary Luxembourgish (accuracy):

| Benchmark    | Accuracy |
|:-------------|---------:|
| SIB-200 (LB) | 62.16    |
| ParaLUX      | 62.82    |
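
Bitext mining accuracy measures how often the nearest neighbour of a sentence, among all candidate sentences in the other language, is its actual translation. The paper's exact evaluation protocol may differ; a minimal sketch of this procedure, with two illustrative placeholder pairs:

```python
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('impresso-project/histlux-gte-multilingual-base', trust_remote_code=True)

# Aligned test lists: lb_sentences[i] is parallel to en_sentences[i]
lb_sentences = ["Wién filmt mat?", "Dës Reegelung gouf wärend der Covid-Pandemie ausgesat."]
en_sentences = ["Who is filming along?", "This regulation was temporarily lifted during the Covid pandemic."]

lb_emb = model.encode(lb_sentences, convert_to_tensor=True)
en_emb = model.encode(en_sentences, convert_to_tensor=True)

# For each Luxembourgish sentence, retrieve the most similar English sentence
scores = model.similarity(lb_emb, en_emb)  # (n, n) cosine similarity matrix
predictions = scores.argmax(dim=1)         # index of the best match per row
gold = torch.arange(len(lb_sentences), device=predictions.device)
accuracy = (predictions == gold).float().mean().item()
print(f"LB->EN accuracy: {accuracy:.4f}")
```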

## Training Details

### Training Dataset

The parallel sentence data mix is as follows:

[impresso-project/HistLuxAlign](https://huggingface.co/datasets/impresso-project/HistLuxAlign):
- LB-FR (x20,000)
- LB-EN (x20,000)
- LB-DE (x20,000)

[fredxlpy/LuxAlign](https://huggingface.co/datasets/fredxlpy/LuxAlign):
- LB-FR (x40,000)
- LB-EN (x20,000)

Total: 120,000 sentence pairs, presented in mixed batches of size 8 (see the training sketch below).

### Contrastive Training

The model was trained with the following parameters:

**Loss**:

`sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss` with parameters:

```
{'scale': 20.0, 'similarity_fct': 'cos_sim'}
```

Parameters of the fit() method:

```
{
    "epochs": 1,
    "evaluation_steps": 520,
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear"
}
```
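
For reference, these settings map onto the classic sentence-transformers `fit()` loop roughly as follows. This is a minimal sketch, not the authors' actual training script: the training pairs shown are hypothetical placeholders for the 120,000 parallel sentences described above.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('Alibaba-NLP/gte-multilingual-base', trust_remote_code=True)

# Placeholder: one InputExample per LB-FR / LB-EN / LB-DE parallel sentence pair
train_examples = [
    InputExample(texts=["Wién filmt mat?", "Who is filming along?"]),
    # ... remaining pairs from HistLuxAlign and LuxAlign
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=8)

# In-batch negatives with cosine similarity scaled by 20, as reported above
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    evaluation_steps=520,
    max_grad_norm=1,
    optimizer_params={"lr": 2e-05},
    scheduler="WarmupLinear",
)
```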

## Citation

### BibTeX

#### Adapting Multilingual Embedding Models to Historical Luxembourgish (introducing paper)

```bibtex
@misc{michail2025adaptingmultilingualembeddingmodels,
      title={Adapting Multilingual Embedding Models to Historical Luxembourgish},
      author={Andrianos Michail and Corina Julia Raclé and Juri Opitz and Simon Clematide},
      year={2025},
      eprint={2502.07938},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.07938},
}
```

#### Original Multilingual GTE Model

```bibtex
@inproceedings{zhang2024mgte,
  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
  author={Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Wen and Dai, Ziqi and Tang, Jialong and Lin, Huan and Yang, Baosong and Xie, Pengjun and Huang, Fei and others},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track},
  pages={1393--1412},
  year={2024}
}
```