---
language:
- ar
- en
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
base_model: Qwen/Qwen3-Embedding-0.6B
widget:
- source_sentence: >-
    أقترح أن تجد بنكًا في بلدك المحلي، وأن تفكر في فتح حساب مصرفي مقوم باليورو
    لديهم.
  sentences:
  - يمكنك مزج هذه الأمور، ولكن من تجربتي، سيكون الأمر صعبًا جدًا في البداية.
  - المرأة تضع ظلال العيون بقلم.
  - لست متأكدًا مما إذا كان بإمكانك فتح حساب مصرفي في فرنسا إذا لم تكن مقيمًا.
- source_sentence: صورة بالأبيض والأسود لموجة تتحطم في المحيط.
  sentences:
  - كلب صغير أسود في المحيط مع بعض الصخور في الخلفية.
  - امرأة تركب فيلًا.
  - طائر أصفر وبرتقالي متمسك بجانب قفص.
- source_sentence: >-
    إذا تمكنت من تجاوز "عامل الاشمئزاز"، فسيكون لديك مصدر سهل الاستخدام من
    السماد العضوي النيتروجيني.
  sentences:
  - أرقام NPK على السماد تمثل النسبة المئوية، بالوزن، للنيتروجين وP2O5 وK2O.
  - تجميع ويكيبيديا لقواعد السفر عبر الزمن هو مصدر جيد لفهم هذا الموضوع.
  - رجل يعزف على الناي.
- source_sentence: رجل يتحدث.
  sentences:
  - رجل يرقص.
  - أسد الجبل يطارد دبًا.
  - >-
    لأغراض الشمول، يحتوي برنامج Pages من Apple على العديد من قوالب الملصقات
    الجيدة.
- source_sentence: الجانب الأيسر من محرك قطار فضي.
  sentences:
  - قرد يركب حافلة.
  - >-
    إحدى الأفكار التي كانت تُطرح منذ الثمانينات هي أنه يمكنك التمييز بين
    "الحركات" و"الثبات".
pipeline_tag: sentence-similarity
library_name: sentence-transformers
license: apache-2.0
---

# Semantic-Ar-Qwen-Embed-0.6B

This is a [sentence-transformers](https://www.SBERT.net) model fine-tuned from [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) on semantic textual similarity (STS) tasks to improve Arabic semantic understanding.
It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) <!-- at revision a579a21d7aff542145eebef8d60ed73ec281a0b4 -->
- **Maximum Sequence Length:** 32768 tokens
- **Output Dimensionality:** 1024 dimensions
- **Similarity Function:** Cosine Similarity
- **Language:** ar

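
The similarity function listed above can be sketched in a few lines. This is a minimal standard-library illustration using made-up 3-dimensional vectors in place of real 1024-dimensional embeddings; `cosine_similarity` is a hypothetical helper name, not part of the library.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors standing in for embeddings (illustrative values only).
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # points in the same direction as a
c = [-3.0, 0.0, 1.0]  # orthogonal to a

print("a vs b:", cosine_similarity(a, b))
print("a vs c:", cosine_similarity(a, c))
```

Because the score depends only on direction, scaling a vector (as with `b`) leaves the similarity unchanged.
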

### 📊 Performance Evaluation

This model has been evaluated on Arabic semantic similarity benchmarks using the [MTEB](https://github.com/embeddings-benchmark/mteb) framework. The table below reports **Spearman correlation scores** on two tasks, **STS17** and **STS22.v2**, together with their average.

| **Model**                        | **STS17 (Spearman)** | **STS22.v2 (Spearman)** | **Average** |
|----------------------------------|----------------------|--------------------------|-------------|
| Qwen3 Embeddings 0.6B            | 0.7505               | 0.6520                   | 0.7013      |
| Qwen3 Embeddings 4B              | 0.7912               | 0.6669                   | 0.7291      |
| Qwen3 Embeddings 8B              | 0.8220               | **0.6680**               | **0.7450**  |
| Semantic-Ar-Qwen-Embed-V0.1      | **0.8300**           | 0.6130                   | 0.7215      |

> ✅ **STS17**: Sentence similarity from classical Arabic benchmarks
> 🧪 **STS22.v2**: Diverse, multi-domain Arabic similarity pairs

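
For readers unfamiliar with the metric, the following is a minimal sketch of Spearman correlation: it ranks the gold similarity labels and the predicted similarities, then computes the Pearson correlation of the ranks. The `gold` and `pred` values are invented for illustration, and the helper names are hypothetical; MTEB performs its own computation.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented example: human similarity labels vs. model cosine similarities.
gold = [5.0, 3.2, 0.5, 4.1, 1.0]
pred = [0.91, 0.62, 0.35, 0.70, 0.20]

print("Spearman:", round(spearman(gold, pred), 4))
```

Only the ordering of the predictions matters here, which is why the metric suits embedding models whose cosine scores are not calibrated to the label scale.
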

### 📌 Insights
- **Semantic-Ar-Qwen-Embed-V0.1** leads on **STS17** (0.8300) but trails the other models on **STS22.v2** (0.6130), which suggests specialization toward STS17-style sentence pairs.
- **Qwen3 8B** achieves the **highest average** (0.7450) and the **top STS22.v2** score (0.6680), making it the best all-rounder.
- Among the Qwen3 variants, both scores rise with model size (0.6B < 4B < 8B).

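
The Average column and the rankings discussed above can be rechecked with a short script. The scores are copied from the table; the variable names are illustrative.

```python
# (STS17, STS22.v2) Spearman scores copied from the evaluation table.
scores = {
    "Qwen3 Embeddings 0.6B": (0.7505, 0.6520),
    "Qwen3 Embeddings 4B": (0.7912, 0.6669),
    "Qwen3 Embeddings 8B": (0.8220, 0.6680),
    "Semantic-Ar-Qwen-Embed-V0.1": (0.8300, 0.6130),
}

# Unweighted mean of the two task scores for each model.
averages = {name: (sts17 + sts22) / 2 for name, (sts17, sts22) in scores.items()}

best_sts17 = max(scores, key=lambda name: scores[name][0])
best_sts22 = max(scores, key=lambda name: scores[name][1])
best_average = max(averages, key=averages.get)

for name, value in averages.items():
    print(f"{name}: average = {value:.4f}")
print("Best STS17:", best_sts17)
print("Best STS22.v2:", best_sts22)
print("Best average:", best_average)
```
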

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 32768, 'do_lower_case': False}) with Transformer model: Qwen3Model
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
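
The pooling module above has `pooling_mode_mean_tokens` set to `True`, meaning the sentence vector is the average of the token vectors. A simplified sketch of masked mean pooling follows; it uses plain Python lists and a hypothetical `mean_pool` helper, and it is not the library's implementation.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [total / count for total in summed]


# Three token vectors of dimension 2; the last one is padding (mask = 0).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)
```

The mask matters in batched inference: without it, padding tokens would pull the average toward arbitrary values.
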

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Load model from Hugging Face Hub
model = SentenceTransformer("Omartificial-Intelligence-Space/Semantic-Ar-Qwen-Embed-0.6B")

# Sentences for embedding (English + Arabic)
sentences = [
    'Left side of a silver train engine.',
    'A close-up of a black train engine.',
    "One idea that's been going around at least since the 80s is that you can distinguish between Holds and Moves.",

    "الجانب الأيسر من محرك قطار فضي.",
    "صورة مقربة لمحرك قطار أسود.",
    "إحدى الأفكار المتداولة منذ الثمانينات هي إمكانية التمييز بين الثبات والحركة.",
]

# Generate embeddings
embeddings = model.encode(sentences)
print("Embedding shape:", embeddings.shape)
# Output: (6, 1024)

# Compute similarity matrix
similarities = model.similarity(embeddings, embeddings)
print("Similarity shape:", similarities.shape)
# Output: torch.Size([6, 6])

# Optionally print similarity scores as a labeled table
# (pandas is needed only for this step and must be installed separately)
import numpy as np
import pandas as pd

# model.similarity returns a torch.Tensor; convert it to NumPy before rounding
scores = similarities.cpu().numpy()
df = pd.DataFrame(np.round(scores, 3), index=sentences, columns=sentences)
print("\nSimilarity matrix:\n")
print(df)
```
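
The model's tags list `loss:MatryoshkaLoss`, a training objective intended to keep the leading dimensions of an embedding useful on their own. The dimensions used during training are not stated in this card, so the sizes below are assumptions for illustration. A minimal sketch of shortening an embedding and re-normalizing it, with a hypothetical helper name and toy values:

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and the embedding length")
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


# Toy 4-dimensional vector standing in for a 1024-dimensional embedding.
full = [3.0, 4.0, 12.0, 0.5]
short = truncate_and_normalize(full, 2)
print(short)
```

With real embeddings, the input would be a row of `model.encode(sentences)` converted with `.tolist()`. Recent sentence-transformers releases also accept a `truncate_dim` argument when constructing `SentenceTransformer`; check the documentation of the installed version before relying on it, and evaluate retrieval quality at the reduced size.
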

## Citation

### BibTeX

#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### MatryoshkaLoss
```bibtex
@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```

#### MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```