SentenceTransformer based on Snowflake/snowflake-arctic-embed-m-v2.0
This is a sentence-transformers model finetuned from Snowflake/snowflake-arctic-embed-m-v2.0. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: Snowflake/snowflake-arctic-embed-m-v2.0
- Maximum Sequence Length: 8192 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
Model Sources
- Documentation: [Sentence Transformers Documentation](https://www.sbert.net)
- Repository: [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- Hugging Face: [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: GteModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
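The Pooling module takes the CLS token embedding (pooling_mode_cls_token is True), and the trailing Normalize() module rescales every vector to unit length, which is why cosine similarity is interchangeable with a plain dot product for this model. A tiny self-contained illustration of that equivalence:

```python
import numpy as np

# Any two unit-length vectors (what Normalize() guarantees for real embeddings).
a = np.array([0.6, 0.8])
b = np.array([0.8, 0.6])

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
dot = a @ b
assert np.isclose(cosine, dot)  # both norms are 1, so the denominator is 1
```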
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("amentaphd/snowflake-artic-embed-m-v2.0")
# Run inference
sentences = [
    'What conditions must a new registrant meet in order to refer to previously submitted study summaries for a substance that has already been registered?',
    '5.\n\nIf a substance has already been registered, a new registrant shall be entitled to refer to the study summaries or robust study summaries, for the same substance submitted earlier, provided that he can show that the substance that he is now registering is the same as the one previously registered, including the degree of purity and the nature of impurities, and that the previous registrant(s) have given permission to refer to the full study reports for the purpose of registration.\n\nA new registrant shall not refer to such studies in order to provide the information required in Section 2 of Annex VI.\n\nArticle 14\n\nChemical safety report and duty to apply and recommend risk reduction measures\n\n1.',
    'of high boiling fractions from bituminous coal high temperature tar and/or pitch coke oil, with a softening point of 140 to 170 °C according to DIN 52025. Composed primarily of tri- and polynuclear aromatic compounds which also contain heteroatoms.) 648-057-00-6 302-650-3 94114-13-3 M Residues (coal tar), pitch distillation; Pitch redistillate (Residue from the fractional distillation of pitch distillate boiling in the range of approximately 400 to 470 °C. Composed primarily of polynuclear aromatic hydrocarbons, and heterocyclic compounds.) 648-058-00-1 295-507-9 92061-94-4 M Tar, coal, high-temperature, distillation and storage residues; Coal tar solids residue (Coke- and ash-containing solid residues that separate on distillation and thermal treatment of bituminous coal high temperature tar in distillation installations and storage vessels. Consists predominantly of carbon and contains a small quantity of hetero compounds as well as ash components.) 648-059-00-7 295-535-1 92062-20-9 M Tar, coal, storage residues; Coal tar solids residue (The deposit removed from crude coal tar storages. Composed primarily of coal tar and carbonaceous particulate matter.) 648-060-00-2 293-764-1 91082-50-7 M Tar, coal, high-temperature, residues; Coal tar solids residue (Solids formed during the coking of bituminous coal to produce crude bituminous coal high temperature tar. Composed primarily of coke and coal particles, highly aromatised compounds and mineral substances.) 648-061-00-8 309-726-5 100684-51-3 M Tar, coal, high-temperature, high-solids; Coal tar solids residue (The condensation product obtained by cooling, to approximately ambient temperature, the gas evolved in the high temperature (greater than 700 °C) destructive distillation of coal. Composed primarily of a complex mixture of condensed ring aromatic hydrocarbons with a high solid content of coal-type materials.) 648-062-00-3 273-615-7 68990-61-4 M Waste solids, coal-tar pitch coking; Coal tar solids residue (The combination of wastes formed by the coking of bituminous coal tar pitch. It consists predominantly of carbon.) 648-063-00-9 295-549-8 92062-34-5 M Extract residues (coal), brown; Coal tar extract (The residue from extraction of dried coal.)',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
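The base Arctic Embed v2.0 checkpoint defines a "query" prompt for the query side of retrieval. Assuming this fine-tune inherits that prompt configuration (the card does not say otherwise), queries and documents would be encoded asymmetrically; continuing the snippet above:

```python
# Assumes the base model's "query" prompt is inherited by this fine-tune.
query_embeddings = model.encode(
    ["What conditions must a new registrant meet?"],
    prompt_name="query",
)
doc_embeddings = model.encode(sentences[1:])  # documents are encoded without a prompt
scores = model.similarity(query_embeddings, doc_embeddings)
```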
Evaluation
Metrics
Information Retrieval
- Evaluated with `InformationRetrievalEvaluator`
Metric | Value |
---|---|
cosine_accuracy@1 | 0.7204 |
cosine_accuracy@3 | 0.9237 |
cosine_accuracy@5 | 0.9555 |
cosine_accuracy@10 | 0.9796 |
cosine_precision@1 | 0.7204 |
cosine_precision@3 | 0.3079 |
cosine_precision@5 | 0.1911 |
cosine_precision@10 | 0.098 |
cosine_recall@1 | 0.7204 |
cosine_recall@3 | 0.9237 |
cosine_recall@5 | 0.9555 |
cosine_recall@10 | 0.9796 |
cosine_ndcg@10 | 0.8635 |
cosine_mrr@10 | 0.8248 |
cosine_map@100 | 0.8257 |
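These figures were produced with the library's InformationRetrievalEvaluator on the model author's own query/passage split, which is not published with the card. A minimal sketch of how such an evaluation is wired up, using hypothetical toy data:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("amentaphd/snowflake-artic-embed-m-v2.0")

# Hypothetical toy data; the card's numbers come from a much larger held-out split.
queries = {"q1": "What is a chemical safety report?"}
corpus = {
    "d1": "Article 14: Chemical safety report and duty to apply risk reduction measures.",
    "d2": "Residues (coal tar), pitch distillation; Pitch redistillate.",
}
relevant_docs = {"q1": {"d1"}}  # maps query id -> set of relevant corpus ids

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="toy")
results = evaluator(model)  # dict of metrics, e.g. results["toy_cosine_ndcg@10"]
print(results)
```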
Training Details
Training Dataset
Unnamed Dataset
- Size: 46,338 training samples
- Columns: `sentence_0` and `sentence_1`
- Approximate statistics based on the first 1000 samples:

 | sentence_0 | sentence_1 |
---|---|---|
type | string | string |
details | min: 12 tokens<br>mean: 42.34 tokens<br>max: 246 tokens | min: 6 tokens<br>mean: 260.86 tokens<br>max: 2053 tokens |
- Samples:

sentence_0 | sentence_1 |
---|---|
What actions can a Member State take if it believes urgent measures are necessary to protect human health or the environment regarding a substance, and what are the requirements for informing other entities about these actions? | 1.<br>Where a Member State has justifiable grounds for believing that urgent action is essential to protect human health or the environment in respect of a substance, on its own, in a ►M3 mixture ◄ or in an article, even if satisfying the requirements of this Regulation, it may take appropriate provisional measures. The Member State shall immediately inform the Commission, the Agency and the other Member States thereof, giving reasons for its decision and submitting the scientific or technical information on which the provisional measure is based.<br>2. |
Under what circumstances can Member States extend the time limits for the permit-granting process for Strategic Projects, and what is the maximum extension period allowed? | (b)<br>12 months for Strategic Projects involving only processing or recycling.<br>3.<br>Where an environmental impact assessment is required pursuant to Directive 2011/92/EU, the step of the assessment referred to in Article 1(2), point (g)(i), of that Directive shall not be included in the duration for permit-granting process referred to in paragraphs 1 and 2 of this Article.<br>4.<br>In exceptional cases, where the nature, complexity, location or size of the Strategic Project so require, Member States may extend, before their expiry and on a case-by-case basis, the time limits referred to in:<br>(a)<br>paragraph 1, point (a), and paragraph 2, point (a), by a maximum of six months;<br>(b) |
What types of compounds primarily compose the distillates mentioned in the context? | (86 °F to 572 °F). Composed primarily of partly hydrogenated condensed-ring aromatic hydrocarbons, aromatic compounds containing nitrogen, oxygen and sulfur, and their alkyl derivatives having carbon numbers predominantly in the range of C4 through C14.] 648-148-00-0 302-688-0 94114-52-0 J Distillates (coal), solvent extn., hydrocracked; [Distillate obtained by hydrocracking of coal extract or solution produced by the liquid solvent extraction or supercritical gas extraction processes and boiling in the range of approximately 30 °C to 300 °C (86 °F to 572 °F). Composed primarily of aromatic, hydrogenated aromatic and naphthenic compounds, their alkyl derivatives and alkanes with carbon numbers predominantly in the range of C4 through C14. |
- Loss: MatryoshkaLoss with these parameters:

```json
{
    "loss": "MultipleNegativesRankingLoss",
    "matryoshka_dims": [768, 512, 256, 128, 64],
    "matryoshka_weights": [1, 1, 1, 1, 1],
    "n_dims_per_step": -1
}
```
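Because MatryoshkaLoss also supervises the leading 512-, 256-, 128- and 64-dimensional prefixes of each embedding, the vectors can be truncated at inference time with only a modest quality drop. A sketch using the library's truncate_dim option (256 here is just one of the trained sizes):

```python
from sentence_transformers import SentenceTransformer

# Load the model so that encode() returns only the first 256 dimensions.
model_256 = SentenceTransformer("amentaphd/snowflake-artic-embed-m-v2.0", truncate_dim=256)
embeddings = model_256.encode(["chemical safety report"])
print(embeddings.shape)  # (1, 256)
```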
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: steps
- multi_dataset_batch_sampler: round_robin
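For context, a minimal sketch of a training setup matching these hyperparameters; the dataset rows and output path are placeholders, since the actual 46,338-pair dataset is unnamed and not published with the card:

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

# Placeholder pairs standing in for the real (query, passage) training data.
train_dataset = Dataset.from_dict({
    "sentence_0": ["What is a chemical safety report?",
                   "What does pitch redistillate consist of?"],
    "sentence_1": ["Article 14 lays down the chemical safety report duty.",
                   "Composed primarily of polynuclear aromatic hydrocarbons."],
})

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-m-v2.0")
loss = MatryoshkaLoss(
    model,
    MultipleNegativesRankingLoss(model),
    matryoshka_dims=[768, 512, 256, 128, 64],
)

args = SentenceTransformerTrainingArguments(
    output_dir="output",  # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=8,
    multi_dataset_batch_sampler="round_robin",
    # eval_strategy="steps" was used together with an evaluator (not shown here).
)

trainer = SentenceTransformerTrainer(
    model=model, args=args, train_dataset=train_dataset, loss=loss
)
trainer.train()
```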
All Hyperparameters
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 8
- per_device_eval_batch_size: 8
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1
- num_train_epochs: 3
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.0
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: round_robin
Training Logs
Epoch | Step | Training Loss | cosine_ndcg@10 |
---|---|---|---|
0.0863 | 500 | 0.2243 | 0.8154 |
0.1726 | 1000 | 0.1242 | 0.8270 |
0.2589 | 1500 | 0.0877 | 0.8298 |
0.3452 | 2000 | 0.0823 | 0.8284 |
0.4316 | 2500 | 0.0627 | 0.8351 |
0.5179 | 3000 | 0.0636 | 0.8385 |
0.6042 | 3500 | 0.0587 | 0.8356 |
0.6905 | 4000 | 0.0746 | 0.8398 |
0.7768 | 4500 | 0.05 | 0.8440 |
0.8631 | 5000 | 0.0495 | 0.8441 |
0.9494 | 5500 | 0.0569 | 0.8451 |
1.0 | 5793 | - | 0.8432 |
1.0357 | 6000 | 0.0368 | 0.8458 |
1.1220 | 6500 | 0.0267 | 0.8501 |
1.2084 | 7000 | 0.0402 | 0.8451 |
1.2947 | 7500 | 0.0261 | 0.8524 |
1.3810 | 8000 | 0.0304 | 0.8503 |
1.4673 | 8500 | 0.0345 | 0.8521 |
1.5536 | 9000 | 0.0337 | 0.8551 |
1.6399 | 9500 | 0.0221 | 0.8525 |
1.7262 | 10000 | 0.0287 | 0.8560 |
1.8125 | 10500 | 0.0291 | 0.8549 |
1.8988 | 11000 | 0.0315 | 0.8577 |
1.9852 | 11500 | 0.0226 | 0.8577 |
2.0 | 11586 | - | 0.8578 |
2.0715 | 12000 | 0.0162 | 0.8552 |
2.1578 | 12500 | 0.0161 | 0.8561 |
2.2441 | 13000 | 0.0224 | 0.8550 |
2.3304 | 13500 | 0.0277 | 0.8601 |
2.4167 | 14000 | 0.0238 | 0.8591 |
2.5030 | 14500 | 0.0155 | 0.8593 |
2.5893 | 15000 | 0.0164 | 0.8598 |
2.6756 | 15500 | 0.0259 | 0.8624 |
2.7620 | 16000 | 0.0114 | 0.8617 |
2.8483 | 16500 | 0.025 | 0.8635 |
Framework Versions
- Python: 3.10.15
- Sentence Transformers: 3.4.1
- Transformers: 4.49.0
- PyTorch: 2.6.0+cu126
- Accelerate: 1.5.2
- Datasets: 3.4.1
- Tokenizers: 0.21.1
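To approximate this environment, the listed versions can be pinned (a sketch; the PyTorch CUDA build suffix will vary by platform):

```bash
pip install "sentence-transformers==3.4.1" "transformers==4.49.0" \
  "torch==2.6.0" "accelerate==1.5.2" "datasets==3.4.1" "tokenizers==0.21.1"
```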
Citation
BibTeX
Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```
MatryoshkaLoss
```bibtex
@misc{kusupati2024matryoshka,
    title = {Matryoshka Representation Learning},
    author = {Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year = {2024},
    eprint = {2205.13147},
    archivePrefix = {arXiv},
    primaryClass = {cs.LG}
}
```
MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
    title = {Efficient Natural Language Response Suggestion for Smart Reply},
    author = {Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year = {2017},
    eprint = {1705.00652},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}
```