---
tags:
- sentence-transformers
- sentence-similarity
- dataset_size:40000
- loss:MSELoss
- multilingual
base_model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
widget:
- source_sentence: Who is filming along?
  sentences:
  - Wién filmt mat?
  - >-
    Weider huet den Tatarescu drop higewisen, datt Rumänien durch seng
    krichsbedélegong op de 6eite vun den allie'erten 110.000 mann verluer hätt.
  - Brambilla 130.08.03 St.
- source_sentence: 'Four potential scenarios could still play out: Jean Asselborn.'
  sentences:
  - >-
    Dann ass nach eng Antenne hei um Kierchbierg virgesi Richtung RTL Gebai, do
    gëtt jo een ganz neie Wunnquartier gebaut.
  - >-
    D'bedélegong un de wählen wir ganz stärk gewiéscht a munche ge'genden wor
    re eso'gucr me' we' 90 prozent.
  - Jean Asselborn gesäit 4 Méiglechkeeten, wéi et kéint virugoen.
- source_sentence: >-
    Non-profit organisation Passerell, which provides legal counsel to refugees
    in Luxembourg, announced that it has to make four employees redundant in
    August due to a lack of funding.
  sentences:
  - Oetringen nach Remich....8.20» 215»
  - >-
    D'ASBL Passerell, déi sech ëm d'Berodung vu Refugiéeën a Saache Rechtsfroe
    këmmert, wäert am August mussen hir véier fix Salariéen entloossen.
  - D'Regierung huet allerdéngs "just" 180.041 Doudeger verzeechent.
- source_sentence: This regulation was temporarily lifted during the Covid pandemic.
  sentences:
  - Six Jours vu New-York si fir d’équipe Girgetti — Debacco
  - Dës Reegelung gouf wärend der Covid-Pandemie ausgesat.
  - ING-Marathon ouni gréisser Tëschefäll ofgelaf - 18 Leit hospitaliséiert.
- source_sentence: The cross-border workers should also receive more wages.
  sentences:
  - D'grenzarbechetr missten och me' lo'n kre'en.
  - >-
    De Néckel: Firun! Dât ass jo ailes, wèll 't get dach neischt un der Bréck
    gemâcht!
  - >-
    D'Grande-Duchesse Josephine Charlotte an hir Ministeren hunn d'Land
    verlooss, et war den Optakt vun der Zäit am Exil.
pipeline_tag: sentence-similarity
library_name: sentence-transformers
model-index:
- name: >-
    SentenceTransformer based on
    sentence-transformers/paraphrase-multilingual-mpnet-base-v2
  results:
  - task:
      type: contemporary-lb
      name: Contemporary-lb
    dataset:
      name: Contemporary-lb
      type: contemporary-lb
    metrics:
    - type: accuracy
      value: 0.594
      name: SIB-200(LB) accuracy
    - type: accuracy
      value: 0.805
      name: ParaLUX accuracy
  - task:
      type: bitext-mining
      name: LBHistoricalBitextMining
    dataset:
      name: LBHistoricalBitextMining
      type: lb-en
    metrics:
    - type: accuracy
      value: 0.8932
      name: LB<->FR accuracy
    - type: accuracy
      value: 0.8955
      name: LB<->EN accuracy
    - type: mean_accuracy
      value: 0.9144
      name: LB<->DE accuracy
license: agpl-3.0
datasets:
- impresso-project/HistLuxAlign
- fredxlpy/LuxAlign
language:
- lb
---

# Luxembourgish adaptation of sentence-transformers/paraphrase-multilingual-mpnet-base-v2

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [sentence-transformers/paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2), further adapted to support Historical and Contemporary Luxembourgish. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for (cross-lingual) semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

This model is specialised in cross-lingual semantic search to and from Historical/Contemporary Luxembourgish. It is particularly useful for libraries and archives that want to perform semantic search and longitudinal studies within their collections. It is a [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) model that was further adapted by Michail et al. (2025).

## Limitations

This model only supports inputs of up to 128 subtokens.
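Inputs beyond the 128-subtoken limit are truncated, so very long passages lose information. A common workaround is to embed fixed-size chunks and average the resulting vectors. The following is a minimal, library-free sketch of that idea: `chunk_text` and `mean_pool` are hypothetical helpers, word counts only approximate subtoken counts, and in practice `model.encode(chunk)` would supply the real 768-dimensional vectors.

```python
# Sketch of a chunk-and-average workaround for the 128-subtoken limit.
# Each chunk would normally be embedded with model.encode(chunk);
# mean_pool simply averages whatever vectors it is given.

def chunk_text(words, max_words=100):
    """Split a word list into fixed-size chunks (a rough proxy for the subtoken limit)."""
    return [words[i:i + max_words] for i in range(0, len(words), max_words)]

def mean_pool(vectors):
    """Average equal-length embedding vectors into a single document vector."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

long_text = ("e laangen historeschen artikel " * 80).split()  # 400 words
chunks = chunk_text(long_text)                  # 4 chunks of <= 100 words
doc_vector = mean_pool([[1.0, 2.0], [3.0, 4.0]])  # stand-in chunk embeddings
```

Averaging chunk embeddings is a crude but serviceable baseline; for retrieval over long archival documents, the long-context sibling model mentioned below avoids the need for chunking altogether.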
We also release a model that performs better (by 7.5 pp) on Historical Bitext Mining and natively supports long inputs (8192 subtokens). For most use cases we recommend [histlux-gte-multilingual-base](https://huggingface.co/impresso-project/histlux-gte-multilingual-base). However, the model released here performs substantially better (by 18 pp) on the adversarial paraphrase discrimination task ParaLUX.

### Model Description

- **Model Type:** Sentence Transformer
- **Base model:** [sentence-transformers/paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2)
- **Maximum Sequence Length:** 128 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
- **Training Dataset:**
  - LB-EN (Historical, Modern)

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("impresso-project/histlux-paraphrase-multilingual-mpnet-base-v2")

# Run inference
sentences = [
    "The cross-border workers should also receive more wages.",
    "D'grenzarbechetr missten och me' lo'n kre'en.",
    "De Néckel: Firun! Dât ass jo ailes, wèll 't get dach neischt un der Bréck gemâcht!",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```

## Evaluation

### Metrics (see introducing paper)

Historical Bitext Mining (accuracy):

| Direction | Accuracy |
|:----------|---------:|
| LB -> FR  | 88.6 |
| FR -> LB  | 90.0 |
| LB -> EN  | 88.7 |
| EN -> LB  | 90.4 |
| LB -> DE  | 91.1 |
| DE -> LB  | 91.8 |

Contemporary LB (accuracy):

| Task | Accuracy |
|:-----|---------:|
| ParaLUX      | 80.5 |
| SIB-200 (LB) | 59.4 |

## Training Details

### Training Dataset

#### LB-EN (Historical, Modern)

- Dataset: lb-en (mixed)
- Size: 40,000 training samples
- Columns: `english`, `luxembourgish`, and `label` (the teacher's EN embeddings)
- Approximate statistics based on the first 1000 samples:

  |      | english | luxembourgish | label |
  |:-----|:--------|:--------------|:------|
  | type | string  | string        | list  |

- Samples:

  | english | luxembourgish | label |
  |:--------|:--------------|:------|
  | A lesson for the next year | Eng le’er fir dat anert joer | [0.08891881257295609, 0.20895496010780334, -0.10672671347856522, -0.03302554786205292, 0.049002278596162796, ...] |
  | On Easter, the Maquisards' northern section organizes their big spring ball in Willy Pintsch's hall at the station. | Op O'schteren organisieren d'Maquisard'eiii section Nord, hire gro'sse fre'joersbal am sali Willy Pintsch op der gare. | [-0.08668982982635498, -0.06969941407442093, -0.0036096556577831507, 0.1605304628610611, -0.041704729199409485, ...] |
  | The happiness, the peace is long gone now, | V ergângen ass nu läng dat gléck, de' fréd, | [0.07229219377040863, 0.3288629353046417, -0.012548360042273998, 0.06720984727144241, -0.02617395855486393, ...] |

- Loss: [MSELoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#mseloss)

### Training Hyperparameters

#### Non-Default Hyperparameters

- `learning_rate`: 2e-05
- `num_train_epochs`: 5
- `warmup_ratio`: 0.1
- `bf16`: True

All remaining hyperparameters use their default values.

### Framework Versions

- Python: 3.11.11
- Sentence Transformers: 3.4.1
- Transformers: 4.49.0
- PyTorch: 2.6.0
- Accelerate: 1.4.0
- Datasets: 3.3.2
- Tokenizers: 0.21.0

## Citation

### BibTeX

#### Adapting Multilingual Embedding Models to Historical Luxembourgish (introducing paper)

```bibtex
@misc{michail2025adaptingmultilingualembeddingmodels,
      title={Adapting Multilingual Embedding Models to Historical Luxembourgish},
      author={Andrianos Michail and Corina Julia Raclé and Juri Opitz and Simon Clematide},
      year={2025},
      eprint={2502.07938},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.07938},
}
```

#### Multilingual Knowledge Distillation

```bibtex
@inproceedings{reimers-2020-multilingual-sentence-bert,
    title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2004.09813",
}
```
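For reference, the knowledge-distillation objective from the Reimers & Gurevych (2020) citation above is the MSELoss used to train this model: the student is pushed to reproduce the teacher's embedding of the aligned English sentence when encoding the Luxembourgish side. The following is a minimal, library-free sketch of the loss computation only; plain Python lists stand in for tensors, the vector values are illustrative, and real training uses `torch.nn.MSELoss` inside sentence-transformers.

```python
# Sketch of the MSE distillation loss: the "label" column of the training
# data holds the teacher's English embedding; the student encodes the
# Luxembourgish sentence and is trained to minimise this distance.

def mse_loss(student_vec, teacher_vec):
    """Mean squared error between a student and a teacher embedding."""
    assert len(student_vec) == len(teacher_vec)
    return sum((s - t) ** 2 for s, t in zip(student_vec, teacher_vec)) / len(student_vec)

teacher = [0.0889, 0.2090, -0.1067]   # illustrative teacher (EN) embedding
student = [0.1000, 0.2000, -0.1000]   # illustrative student (LB) embedding
loss = mse_loss(student, teacher)
```

Because the teacher's English vectors are fixed targets, this objective aligns the Luxembourgish embedding space with the teacher's multilingual space rather than learning a new space from scratch.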