# Semantic Arabic Qwen Embeddings
A set of optimized Qwen-based embedding models fine-tuned for Arabic semantic understanding.
This is a sentence-transformers model fine-tuned from Qwen/Qwen3-Embedding-0.6B on STS tasks for better Arabic semantic understanding. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
This model has been evaluated on Arabic semantic similarity benchmarks using the MTEB framework. Below are Spearman correlation scores for two tasks, STS17 and STS22.v2, along with their average; a sketch for reproducing these scores follows the task notes below.
| Model | STS17 (Spearman) | STS22.v2 (Spearman) | Average |
|---|---|---|---|
| Qwen3 Embeddings 0.6B | 0.7505 | 0.6520 | 0.7013 |
| Qwen3 Embeddings 4B | 0.7912 | 0.6669 | 0.7291 |
| Qwen3 Embeddings 8B | 0.8220 | 0.6680 | 0.7450 |
| Semantic-Ar-Qwen-Embed-V0.1 | 0.8300 | 0.6130 | 0.7215 |
✅ STS17: Sentence similarity from classical Arabic benchmarks
🧪 STS22.v2: Diverse, multi-domain Arabic similarity pairs
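The scores above can be reproduced with the `mteb` package. A minimal sketch, assuming the current `mteb` task API and that the benchmarks are addressed by the task names `STS17` and `STS22.v2` with an Arabic (`ara`) language filter:

```python
import mteb
from sentence_transformers import SentenceTransformer

# Load the fine-tuned model as a standard SentenceTransformer.
model = SentenceTransformer("Omartificial-Intelligence-Space/Semantic-Ar-Qwen-Embed-0.6B")

# Select the two STS tasks reported above, restricted to Arabic pairs (assumed task names).
tasks = mteb.get_tasks(tasks=["STS17", "STS22.v2"], languages=["ara"])

# Run the evaluation; per-task Spearman scores are written to the output folder.
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/Semantic-Ar-Qwen-Embed-0.6B")
```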
Full model architecture:

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 32768, 'do_lower_case': False}) with Transformer model: Qwen3Model
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
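The Pooling block averages token embeddings (`pooling_mode_mean_tokens: True`). For intuition, here is a minimal sketch of the same mean pooling done by hand with `transformers`; loading this checkpoint via `AutoModel` and the short Arabic sample are assumptions for illustration:

```python
import torch
from transformers import AutoModel, AutoTokenizer

repo = "Omartificial-Intelligence-Space/Semantic-Ar-Qwen-Embed-0.6B"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModel.from_pretrained(repo)

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Mask out padding tokens, then average the remaining token embeddings.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

batch = tokenizer(["الجانب الأيسر من محرك قطار فضي."], padding=True, return_tensors="pt")
with torch.no_grad():
    output = model(**batch)

embedding = mean_pool(output.last_hidden_state, batch["attention_mask"])
print(embedding.shape)  # torch.Size([1, 1024])
```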
First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

import numpy as np
import pandas as pd

# Load model from Hugging Face Hub
model = SentenceTransformer("Omartificial-Intelligence-Space/Semantic-Ar-Qwen-Embed-0.6B")

# Sentences for embedding (English + Arabic)
sentences = [
    "Left side of a silver train engine.",
    "A close-up of a black train engine.",
    "One idea that's been going around at least since the 80s is that you can distinguish between Holds and Moves.",
    "الجانب الأيسر من محرك قطار فضي.",
    "صورة مقربة لمحرك قطار أسود.",
    "إحدى الأفكار المتداولة منذ الثمانينات هي إمكانية التمييز بين الثبات والحركة.",
]

# Generate embeddings
embeddings = model.encode(sentences)
print("Embedding shape:", embeddings.shape)
# Output: (6, 1024)

# Compute similarity matrix (returns a torch.Tensor)
similarities = model.similarity(embeddings, embeddings)
print("Similarity shape:", similarities.shape)
# Output: torch.Size([6, 6])

# Optionally print similarity scores
df = pd.DataFrame(np.round(similarities.cpu().numpy(), 3), index=sentences, columns=sentences)
print("\nSimilarity matrix:\n")
print(df)
```
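Beyond pairwise similarity, the same embeddings drive semantic search. A minimal sketch using `sentence_transformers.util.semantic_search`, reusing `model` from the snippet above; the Arabic mini-corpus and query are hypothetical examples:

```python
from sentence_transformers import util

# Hypothetical corpus and query for illustration.
corpus = [
    "القاهرة هي عاصمة مصر.",            # "Cairo is the capital of Egypt."
    "نهر النيل هو أطول أنهار العالم.",   # "The Nile is the longest river in the world."
    "تشتهر اليابان بصناعة السيارات.",    # "Japan is famous for car manufacturing."
]
query = "ما هي عاصمة مصر؟"              # "What is the capital of Egypt?"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Returns one ranked hit list per query: [{'corpus_id': ..., 'score': ...}, ...]
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))
```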
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

@misc{kusupati2024matryoshka,
    title = {Matryoshka Representation Learning},
    author = {Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year = {2024},
    eprint = {2205.13147},
    archivePrefix = {arXiv},
    primaryClass = {cs.LG}
}

@misc{henderson2017efficient,
    title = {Efficient Natural Language Response Suggestion for Smart Reply},
    author = {Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year = {2017},
    eprint = {1705.00652},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}
```