|
--- |
|
tags: |
|
- sentence-transformers |
|
- sentence-similarity |
|
- dataset_size:40000 |
|
- loss:MSELoss |
|
- multilingual |
|
base_model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2 |
|
widget: |
|
- source_sentence: Who is filming along? |
|
sentences: |
|
- Wién filmt mat? |
|
- >- |
|
Weider huet den Tatarescu drop higewisen, datt Rumänien durch seng |
|
krichsbedélegong op de 6eite vun den allie'erten 110.000 mann verluer hätt. |
|
- Brambilla 130.08.03 St. |
|
- source_sentence: 'Four potential scenarios could still play out: Jean Asselborn.' |
|
sentences: |
|
- >- |
|
Dann ass nach eng Antenne hei um Kierchbierg virgesi Richtung RTL Gebai, do |
|
gëtt jo een ganz neie Wunnquartier gebaut. |
|
- >- |
|
D'bedélegong un de wählen wir ganz stärk gewiéscht a munche ge'genden wor re |
|
eso'gucr me' we' 90 prozent. |
|
- Jean Asselborn gesäit 4 Méiglechkeeten, wéi et kéint virugoen. |
|
- source_sentence: >- |
|
Non-profit organisation Passerell, which provides legal council to refugees |
|
in Luxembourg, announced that it has to make four employees redundant in |
|
August due to a lack of funding. |
|
sentences: |
|
- Oetringen nach Remich....8.20» 215» |
|
- >- |
|
D'ASBL Passerell, déi sech ëm d'Berodung vu Refugiéeën a Saache Rechtsfroe |
|
këmmert, wäert am August mussen hir véier fix Salariéen entloossen. |
|
- D'Regierung huet allerdéngs "just" 180.041 Doudeger verzeechent. |
|
- source_sentence: This regulation was temporarily lifted during the Covid pandemic. |
|
sentences: |
|
- Six Jours vu New-York si fir d’équipe Girgetti — Debacco |
|
- Dës Reegelung gouf wärend der Covid-Pandemie ausgesat. |
|
- ING-Marathon ouni gréisser Tëschefäll ofgelaf - 18 Leit hospitaliséiert. |
|
- source_sentence: The cross-border workers should also receive more wages. |
|
sentences: |
|
- D'grenzarbechetr missten och me' lo'n kre'en. |
|
- >- |
|
De Néckel: Firun! Dât ass jo ailes, wèll 't get dach neischt un der Bréck |
|
gemâcht! |
|
- >- |
|
D'Grande-Duchesse Josephine Charlotte an hir Ministeren hunn d'Land |
|
verlooss, et war den Optakt vun der Zäit am Exil. |
|
pipeline_tag: sentence-similarity |
|
library_name: sentence-transformers |
|
model-index: |
|
- name: >- |
|
SentenceTransformer based on |
|
sentence-transformers/paraphrase-multilingual-mpnet-base-v2 |
|
results: |
|
- task: |
|
type: contemporary-lb |
|
name: Contemporary-lb |
|
dataset: |
|
name: Contemporary-lb |
|
type: contemporary-lb |
|
metrics: |
|
- type: accuracy |
|
value: 0.594 |
|
name: SIB-200(LB) accuracy |
|
- type: accuracy |
|
value: 0.805 |
|
name: ParaLUX accuracy |
|
- task: |
|
type: bitext-mining |
|
name: LBHistoricalBitextMining |
|
dataset: |
|
name: LBHistoricalBitextMining |
|
type: lb-en |
|
metrics: |
|
- type: accuracy |
|
value: 0.8932 |
|
name: LB<->FR accuracy |
|
- type: accuracy |
|
value: 0.8955 |
|
name: LB<->EN accuracy |
|
- type: accuracy
|
value: 0.9144 |
|
name: LB<->DE accuracy |
|
license: agpl-3.0 |
|
datasets: |
|
- impresso-project/HistLuxAlign |
|
- fredxlpy/LuxAlign |
|
language: |
|
- lb |
|
--- |
|
|
|
|
|
# Luxembourgish adaptation of sentence-transformers/paraphrase-multilingual-mpnet-base-v2 |
|
|
|
This is a [sentence-transformers](https://www.SBERT.net) model fine-tuned from [sentence-transformers/paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) and further adapted to support Historical and Contemporary Luxembourgish. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for (cross-lingual) semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
|
|
|
## Model Details |
|
|
|
This model is specialised for cross-lingual semantic search to and from Historical/Contemporary Luxembourgish. It is particularly useful for libraries and archives that want to perform semantic search and longitudinal studies within their collections.
|
|
|
This is a [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) model that was further adapted by Michail et al. (2025).
|
|
|
## Limitations |
|
|
|
This model only supports inputs of up to 128 subtokens; longer inputs are truncated.

We also release a model that performs better (by 7.5 pp) on Historical Bitext Mining and natively supports long contexts (8,192 subtokens). For most use cases, we recommend [histlux-gte-multilingual-base](https://huggingface.co/impresso-project/histlux-gte-multilingual-base).

However, this model exhibits superior performance (by 18 pp) on the adversarial paraphrase discrimination task ParaLUX.
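
If you need to verify or tighten the truncation limit in code, the effective cap is exposed via the model's `max_seq_length` attribute; a minimal check:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("impresso-project/histlux-paraphrase-multilingual-mpnet-base-v2")

# Inputs longer than this many subtokens are silently truncated at encode time.
print(model.max_seq_length)  # 128
```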
|
|
|
### Model Description |
|
- **Model Type:** Sentence Transformer |
|
- **Base model:** [sentence-transformers/paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) <!-- at revision 75c57757a97f90ad739aca51fa8bfea0e485a7f2 --> |
|
- **Maximum Sequence Length:** 128 tokens |
|
- **Output Dimensionality:** 768 dimensions |
|
- **Similarity Function:** Cosine Similarity |
|
- **Training Dataset:** |
|
- LB-EN (Historical, Modern) |
|
- **Language:** Luxembourgish (lb)

- **License:** agpl-3.0
|
|
|
### Model Sources |
|
|
|
- **Documentation:** [Sentence Transformers Documentation](https://sbert.net) |
|
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers) |
|
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers) |
|
|
|
|
|
## Usage |
|
|
|
### Direct Usage (Sentence Transformers) |
|
|
|
First install the Sentence Transformers library: |
|
|
|
```bash |
|
pip install -U sentence-transformers |
|
``` |
|
|
|
Then you can load this model and run inference. |
|
```python |
|
from sentence_transformers import SentenceTransformer |
|
|
|
# Download from the 🤗 Hub |
|
model = SentenceTransformer("impresso-project/histlux-paraphrase-multilingual-mpnet-base-v2") |
|
# Run inference |
|
sentences = [ |
|
'The cross-border workers should also receive more wages.', |
|
"D'grenzarbechetr missten och me' lo'n kre'en.", |
|
"De Néckel: Firun! Dât ass jo ailes, wèll 't get dach neischt un der Bréck gemâcht!", |
|
] |
|
embeddings = model.encode(sentences) |
|
print(embeddings.shape) |
|
# [3, 768] |
|
|
|
# Get the similarity scores for the embeddings |
|
similarities = model.similarity(embeddings, embeddings) |
|
print(similarities.shape) |
|
# [3, 3] |
|
``` |
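
Because the model is trained for cross-lingual retrieval, you can embed a (Historical) Luxembourgish corpus once and query it in another language. Below is a minimal sketch using the library's `semantic_search` utility, reusing sentences from the examples above (the corpus and query are illustrative):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("impresso-project/histlux-paraphrase-multilingual-mpnet-base-v2")

# Small Luxembourgish corpus, e.g. sentences from a digitised newspaper archive
corpus = [
    "Dës Reegelung gouf wärend der Covid-Pandemie ausgesat.",
    "D'grenzarbechetr missten och me' lo'n kre'en.",
    "Jean Asselborn gesäit 4 Méiglechkeeten, wéi et kéint virugoen.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# English query searched against the Luxembourgish corpus
query_embedding = model.encode(
    "The cross-border workers should also receive more wages.",
    convert_to_tensor=True,
)

hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(corpus[hit["corpus_id"]], f"(score: {hit['score']:.3f})")
```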
|
|
|
|
|
|
## Evaluation |
|
|
|
### Metrics |
|
|
|
(see the introducing paper for the full evaluation setup)

Historical Bitext Mining (accuracy):

| Direction | Accuracy |
|:----------|---------:|
| LB -> FR  | 88.6 |
| FR -> LB  | 90.0 |
| LB -> EN  | 88.7 |
| EN -> LB  | 90.4 |
| LB -> DE  | 91.1 |
| DE -> LB  | 91.8 |

Contemporary Luxembourgish (accuracy):

| Task | Accuracy |
|:-----|---------:|
| ParaLUX | 80.5 |
| SIB-200 (LB) | 59.4 |
|
|
|
## Training Details |
|
|
|
### Training Dataset |
|
|
|
#### LB-EN (Historical, Modern) |
|
|
|
* Dataset: lb-en (mixed) |
|
* Size: 40,000 training samples |
|
* Columns: <code>english</code>, <code>luxembourgish</code>, and <code>label</code> (the teacher's English embeddings)
|
* Approximate statistics based on the first 1000 samples: |
|
| | english | luxembourgish | label | |
|
|:--------|:-----------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:-------------------------------------| |
|
| type | string | string | list | |
|
| details | <ul><li>min: 4 tokens</li><li>mean: 25.32 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>min: 5 tokens</li><li>mean: 36.91 tokens</li><li>max: 128 tokens</li></ul> | <ul><li>size: 768 elements</li></ul> | |
|
* Samples: |
|
| english | luxembourgish | label | |
|
|:---------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------| |
|
| <code>A lesson for the next year</code> | <code>Eng le’er fir dat anert joer</code> | <code>[0.08891881257295609, 0.20895496010780334, -0.10672671347856522, -0.03302554786205292, 0.049002278596162796, ...]</code> | |
|
| <code>On Easter, the Maquisards' northern section organizes their big spring ball in Willy Pintsch's hall at the station.</code> | <code>Op O'schteren organisieren d'Maquisard'eiii section Nord, hire gro'sse fre'joersbal am sali Willy Pintsch op der gare.</code> | <code>[-0.08668982982635498, -0.06969941407442093, -0.0036096556577831507, 0.1605304628610611, -0.041704729199409485, ...]</code> | |
|
| <code>The happiness, the peace is long gone now,</code> | <code>V ergângen ass nu läng dat gléck, de' fréd,</code> | <code>[0.07229219377040863, 0.3288629353046417, -0.012548360042273998, 0.06720984727144241, -0.02617395855486393, ...]</code> | |
|
* Loss: [<code>MSELoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#mseloss) |
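
Training follows the multilingual knowledge-distillation recipe of Reimers & Gurevych (2020): both the English sentence and its Luxembourgish counterpart are regressed, via MSE, onto the teacher's embedding of the English side. The following is a minimal sketch of that setup with the Sentence Transformers v3 trainer, not the exact training script; the inline one-row dataset is illustrative:

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

# Teacher produces the target embeddings; the student is adapted to Luxembourgish.
teacher = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
student = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

english = ["A lesson for the next year"]
luxembourgish = ["Eng le'er fir dat anert joer"]

# Teacher embeddings of the English side are the regression targets for
# BOTH text columns (english and luxembourgish).
train_dataset = Dataset.from_dict({
    "english": english,
    "luxembourgish": luxembourgish,
    "label": [emb.tolist() for emb in teacher.encode(english)],
})

trainer = SentenceTransformerTrainer(
    model=student,
    train_dataset=train_dataset,
    loss=losses.MSELoss(student),
)
trainer.train()
```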
|
|
|
### Training Hyperparameters

#### Non-Default Hyperparameters

- `learning_rate`: 2e-05

- `num_train_epochs`: 5

- `warmup_ratio`: 0.1

- `bf16`: True

All other hyperparameters were left at their defaults.
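
For reference, these map onto the trainer's arguments roughly as follows (a hedged sketch using the standard `SentenceTransformerTrainingArguments`; the output directory name is illustrative):

```python
from sentence_transformers import SentenceTransformerTrainingArguments

# Non-default values from the list above; everything else stays at its default.
args = SentenceTransformerTrainingArguments(
    output_dir="histlux-mpnet",  # illustrative path
    learning_rate=2e-5,
    num_train_epochs=5,
    warmup_ratio=0.1,
    bf16=True,
)
```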
|
|
|
### Framework Versions |
|
- Python: 3.11.11 |
|
- Sentence Transformers: 3.4.1 |
|
- Transformers: 4.49.0 |
|
- PyTorch: 2.6.0 |
|
- Accelerate: 1.4.0 |
|
- Datasets: 3.3.2 |
|
- Tokenizers: 0.21.0 |
|
|
|
## Citation |
|
|
|
### BibTeX |
|
|
|
#### Adapting Multilingual Embedding Models to Historical Luxembourgish (introducing paper) |
|
|
|
```bibtex |
|
@misc{michail2025adaptingmultilingualembeddingmodels, |
|
title={Adapting Multilingual Embedding Models to Historical Luxembourgish}, |
|
author={Andrianos Michail and Corina Julia Raclé and Juri Opitz and Simon Clematide}, |
|
year={2025}, |
|
eprint={2502.07938}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL}, |
|
url={https://arxiv.org/abs/2502.07938}, |
|
} |
|
``` |
|
|
|
#### Multilingual Knowledge Distillation |
|
|
|
```bibtex |
|
@inproceedings{reimers-2020-multilingual-sentence-bert, |
|
title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation", |
|
author = "Reimers, Nils and Gurevych, Iryna", |
|
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing", |
|
month = "11", |
|
year = "2020", |
|
publisher = "Association for Computational Linguistics", |
|
url = "https://arxiv.org/abs/2004.09813", |
|
} |
|
``` |
|
|