metadata
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:46338
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
base_model: Snowflake/snowflake-arctic-embed-l
widget:
- source_sentence: >-
What criteria must Member States consider when establishing penalties for
infringements of the specified Regulation, and what is the deadline for
notifying the Commission about these rules?
sentences:
- >-
Enforcement
1.
Member States shall lay down the rules on penalties applicable to
infringements of this Regulation and shall take all measures necessary
to ensure that they are implemented. The penalties provided for must be
effective, proportionate and dissuasive taking into account, in
particular, the nature, duration, recurrence and gravity of the
infringement. Member States shall, by 31 December 2024, notify the
Commission of those rules and of those measures and shall notify it
without delay of any subsequent amendment affecting them.
2.
- >-
Within the transitional periods established, Member States shall
progressively reduce their respective gaps with regard to the new
minimum levels of taxation. However, where the difference between the
national level and the minimum level does not exceed 3 % of that minimum
level, the Member State concerned may wait until the end of the period
to adjust its national level.
- >-
AR 10. ‘Indirect political contribution’ refers to those political
contributions made through an intermediary organisation such as a
lobbyist or charity, or support given to an organisation such as a think
tank or trade association linked to or supporting particular political
parties or causes.
AR 11. When determining ‘comparable position’ in this standard, the
undertaking shall consider various factors, including level of
responsibility and scope of activities undertaken.
AR 12. The undertaking may provide the following information on its
financial or in-kind contributions with regard to its lobbying expenses:
(a)
the total monetary amount of such internal and external expenses; and
(b)
- source_sentence: >-
How does the use of AI systems impact access to essential public
assistance benefits and services?
sentences:
- >-
X. Among these substances there are ‘priority hazardous substances’
which means substances identified in accordance with Article 16(3) and
(6) for which measures have to be taken in accordance with Article 16(1)
and (8). --- --- 31. ‘Pollutant’means any substance liable to cause
pollution, in particular those listed in Annex VIII. --- --- 32. ‘Direct
discharge to groundwater’means discharge of pollutants into groundwater
without percolation throughout the soil or subsoil. --- --- 33.
‘Pollution’means the direct or indirect introduction, as a result of
human activity, of substances or heat into the air, water or land which
may be harmful to human health or the quality of aquatic ecosystems or
terrestrial ecosystems directly depending on
- >-
The competent authorities shall inform the requesting competent
authorities of any decision taken under the first subparagraph, stating
the reasons therefor.
4.
In order to ensure uniform application of this Article, ESMA may develop
draft implementing technical standards to establish common procedures
for competent authorities to cooperate in on-the-spot verifications and
investigations.
Power is conferred on the Commission to adopt the implementing technical
standards referred to in the first subparagraph in accordance with
Article 15 of Regulation (EU) No 1095/2010.
Article 55
Dispute settlement
- >-
(58) Another area in which the use of AI systems deserves special
consideration is the access to and enjoyment of certain essential
private and public services and benefits necessary for people to fully
participate in society or to improve one’s standard of living. In
particular, natural persons applying for or receiving essential public
assistance benefits and services from public authorities namely
healthcare services, social security benefits, social services providing
protection in cases such as maternity, illness, industrial accidents,
dependency or old age and loss of employment and social and housing
assistance, are typically dependent on those benefits and services and
in a vulnerable position in relation to the responsible
- source_sentence: >-
How does the context suggest promoting vulnerable customers' active
engagement in the energy market?
sentences:
- >-
energy efficiency improvement measures as priority actions; --- --- (c)
carry out early, forward-looking investments in energy efficiency
improvement measures before distributional impacts from other policies
and measures show their effect; --- --- (d) foster technical assistance
and the roll-out of enabling funding and financial tools, such as on-bill
schemes, local loan-loss reserve, guarantee funds, funds targeting deep
renovations and renovations with minimum energy gains; --- --- (e)
foster technical assistance for social actors to promote vulnerable
customer’s active engagement in the energy market, and positive changes
in their energy consumption behaviour; --- --- (f) ensure access to
finance, grants or subsidies bound to minimum
- >-
4.
To the extent that the tasks relating to the implementation of the
Innovation Fund are not delegated to an implementing body, the
Commission shall carry out those tasks.
Article 18
Tasks of the implementing body
►M2 The implementing body designated in accordance with Article 17(1) of
this Regulation to implement the Innovation Fund in accordance with
Article 17(2) may be entrusted with the overall management of the calls
for proposals, the disbursement of the Innovation Fund support and the
monitoring of the implementation of selected projects. ◄ For that
purpose, the implementing body may be entrusted with the following
tasks:
(a)
organising the call for proposals;
(b)
- >-
Calculation
Calculations of emissions shall be performed using the formula:
Activity data × Emission factor × Oxidation factor
Activity data (fuel used, production rate etc.) shall be monitored on
the basis of supply data or measurement.
- source_sentence: >-
What is the purpose of Directive 2004/109/EC of the European Parliament
and of the Council of 15 December 2004?
sentences:
- >-
3.7. Uses advised against ►M7 (see Section 1 of the safety data sheet) ◄
Where applicable, an indication of the uses which the registrant advises
against and why (i.e. non-statutory recommendations by supplier). This
need not be an exhaustive list.
4. CLASSIFICATION AND LABELLING
▼M3
4.1 The hazard classification of the substance(s), resulting from the
application of Title I and II of Regulation (EC) No 1272/2008 for all
hazard classes and categories in that Regulation,
In addition, for each entry, the reasons why no classification is given
for a hazard class or differentiation of a hazard class should be
provided (i.e. if data are lacking, inconclusive, or conclusive but not
sufficient for classification),
- >-
(b)
operations by which the user of an energy product makes its reuse
possible in his own undertaking provided that the taxation already paid
on such product is not less than the taxation which would be due if the
reused energy product were again to be liable to taxation;
(c)
an operation consisting of mixing, outside a production establishment or
a tax warehouse, energy products with other energy products or other
materials, provided that:
(i)
taxation on the components has been paid previously; and
(ii)
the amount paid is not less than the amount of the tax which would be
chargeable on the mixture.
The condition under (i) shall not apply where the mixture is exempted
for a specific use.
Article 22
- >-
( 15 ) Directive 2004/109/EC of the European Parliament and of the
Council of 15 December 2004 on the harmonisation of transparency
requirements in relation to information about issuers whose securities
are admitted to trading on a regulated market and amending Directive
2001/34/EC (OJ L 390, 31.12.2004, p. 38).
( 16 ) Regulation (EU) 2020/852 of the European Parliament and of the
Council of 18 June 2020 on the establishment of a framework to
facilitate sustainable investment, and amending Regulation (EU)
2019/2088 (OJ L 198, 22.6.2020, p. 13).
( 17 ) OJ L 142, 30.4.2004, p. 12.
( 18 ) OJ L 340, 22.12.2007, p. 66.
- source_sentence: >-
What are the main objectives of the directives mentioned in the text
regarding greenhouse gas emissions and carbon dioxide storage, and how do
they relate to environmental protection and sustainability within the
European Union?
sentences:
- >-
(24) Directive 2003/87/EC of the European Parliament and of the Council
of 13 October 2003 establishing a scheme for greenhouse gas emission
allowance trading within the Union and amending Council Directive
96/61/EC (OJ L 275, 25.10.2003, p. 32).
(25) Directive 2009/31/EC of the European Parliament and of the Council
of 23 April 2009 on the geological storage of carbon dioxide and
amending Council Directive 85/337/EEC, European Parliament and Council
Directives 2000/60/EC, 2001/80/EC, 2004/35/EC, 2006/12/EC, 2008/1/EC and
Regulation (EC) No 1013/2006 (OJ L 140, 5.6.2009, p. 114).
(26) Directive 2014/23/EU of the European Parliament and of the Council
of 26 February 2014 on the award of concession contracts (OJ L 94,
28.3.2014, p. 1).
- >-
Article 33
Responsibility and liability for drawing up and publishing the financial
statements and the management report
▼M4
1.
- >-
(b)
risks related to the undertaking’s dependencies on consumers and/or
end-users may include the loss of business continuity where an economic
crisis makes consumers unable to afford certain products or services;
(c)
►C1 opportunities related to the undertaking’s impacts on consumers
and/or end- users may include market differentiation and greater
customer appeal from offering safe products or privacy-respecting
services; and ◄
(d)
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- cosine_accuracy@1
- cosine_accuracy@3
- cosine_accuracy@5
- cosine_accuracy@10
- cosine_precision@1
- cosine_precision@3
- cosine_precision@5
- cosine_precision@10
- cosine_recall@1
- cosine_recall@3
- cosine_recall@5
- cosine_recall@10
- cosine_ndcg@10
- cosine_mrr@10
- cosine_map@100
model-index:
- name: SentenceTransformer based on Snowflake/snowflake-arctic-embed-l
results:
- task:
type: information-retrieval
name: Information Retrieval
dataset:
name: Unknown
type: unknown
metrics:
- type: cosine_accuracy@1
value: 0.6659761781460383
name: Cosine Accuracy@1
- type: cosine_accuracy@3
value: 0.8841705506645952
name: Cosine Accuracy@3
- type: cosine_accuracy@5
value: 0.9312963921974797
name: Cosine Accuracy@5
- type: cosine_accuracy@10
value: 0.9672017952701536
name: Cosine Accuracy@10
- type: cosine_precision@1
value: 0.6659761781460383
name: Cosine Precision@1
- type: cosine_precision@3
value: 0.29472351688819837
name: Cosine Precision@3
- type: cosine_precision@5
value: 0.1862592784394959
name: Cosine Precision@5
- type: cosine_precision@10
value: 0.09672017952701535
name: Cosine Precision@10
- type: cosine_recall@1
value: 0.6659761781460383
name: Cosine Recall@1
- type: cosine_recall@3
value: 0.8841705506645952
name: Cosine Recall@3
- type: cosine_recall@5
value: 0.9312963921974797
name: Cosine Recall@5
- type: cosine_recall@10
value: 0.9672017952701536
name: Cosine Recall@10
- type: cosine_ndcg@10
value: 0.8278291318026204
name: Cosine Ndcg@10
- type: cosine_mrr@10
value: 0.7818480980055302
name: Cosine Mrr@10
- type: cosine_map@100
value: 0.783515504381956
name: Cosine Map@100
SentenceTransformer based on Snowflake/snowflake-arctic-embed-l
This is a sentence-transformers model fine-tuned from Snowflake/snowflake-arctic-embed-l. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: Snowflake/snowflake-arctic-embed-l
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 1024 dimensions
- Similarity Function: Cosine Similarity
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
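The three modules listed above amount to: run the transformer, keep the vector of the first ([CLS]) token, and scale it to unit length. The following is a minimal pure-Python sketch of the last two steps. The helper names `cls_pool` and `l2_normalize` are hypothetical (not part of the library), and the 4-dimensional token vectors are made-up stand-ins for real 1024-dimensional BertModel outputs.

```python
import math

def cls_pool(token_embeddings):
    """Return the first token's vector, mirroring pooling_mode_cls_token=True."""
    return list(token_embeddings[0])

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length, as a Normalize() step would."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)

# Toy transformer output: 3 tokens, 4 dimensions each (illustrative values only).
token_embeddings = [
    [3.0, 0.0, 4.0, 0.0],  # [CLS] token
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 2.0, 0.0, 1.0],
]

sentence_embedding = l2_normalize(cls_pool(token_embeddings))
print(sentence_embedding)
```

Because the output has unit length, the cosine similarity between two such embeddings reduces to their dot product.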
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
'What are the main objectives of the directives mentioned in the text regarding greenhouse gas emissions and carbon dioxide storage, and how do they relate to environmental protection and sustainability within the European Union?',
'(24) Directive 2003/87/EC of the European Parliament and of the Council of 13 October 2003 establishing a scheme for greenhouse gas emission allowance trading within the Union and amending Council Directive 96/61/EC (OJ L 275, 25.10.2003, p. 32).\n\n(25) Directive 2009/31/EC of the European Parliament and of the Council of 23 April 2009 on the geological storage of carbon dioxide and amending Council Directive 85/337/EEC, European Parliament and Council Directives 2000/60/EC, 2001/80/EC, 2004/35/EC, 2006/12/EC, 2008/1/EC and Regulation (EC) No 1013/2006 (OJ L 140, 5.6.2009, p. 114).\n\n(26) Directive 2014/23/EU of the European Parliament and of the Council of 26 February 2014 on the award of concession contracts (OJ L 94, 28.3.2014, p. 1).',
'Article 33\n\nResponsibility and liability for drawing up and publishing the financial statements and the management report\n\n▼M4\n\n1.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
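Since the model was trained with MatryoshkaLoss over the dimensions 1024, 768, 512, 256, 128 and 64, a prefix of each embedding is intended to remain usable on its own. Below is a hedged pure-Python sketch of truncating an embedding and re-normalizing it before comparison. The function names and the 8-dimensional toy vectors are assumptions for illustration; real embeddings from this model have 1024 values.

```python
import math

def truncate_and_renormalize(embedding, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else list(head)

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 8-dimensional "embeddings" (illustrative values only).
query = [0.5, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0]
passage = [0.5, 0.5, 0.5, -0.5, 0.0, 0.0, 0.0, 0.0]

for dim in (8, 4, 2):
    q = truncate_and_renormalize(query, dim)
    p = truncate_and_renormalize(passage, dim)
    print(dim, round(cosine(q, p), 4))
```

Smaller dimensions trade some retrieval quality for less storage and faster search. Sentence Transformers also accepts a `truncate_dim` argument when loading a model, which is likely the more convenient route in practice.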
Evaluation
Metrics
Information Retrieval
- Evaluated with InformationRetrievalEvaluator
Metric | Value |
---|---|
cosine_accuracy@1 | 0.666 |
cosine_accuracy@3 | 0.8842 |
cosine_accuracy@5 | 0.9313 |
cosine_accuracy@10 | 0.9672 |
cosine_precision@1 | 0.666 |
cosine_precision@3 | 0.2947 |
cosine_precision@5 | 0.1863 |
cosine_precision@10 | 0.0967 |
cosine_recall@1 | 0.666 |
cosine_recall@3 | 0.8842 |
cosine_recall@5 | 0.9313 |
cosine_recall@10 | 0.9672 |
cosine_ndcg@10 | 0.8278 |
cosine_mrr@10 | 0.7818 |
cosine_map@100 | 0.7835 |
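The values above appear consistent with one relevant passage per query: recall@k equals accuracy@k, and precision@k equals accuracy@k divided by k (for example 0.8842 / 3 ≈ 0.2947). Under that assumption, the per-query metrics can be sketched as follows. The function names and toy rankings are hypothetical and this is not the evaluator's own code.

```python
import math

def metrics_at_k(ranked_ids, relevant_id, k):
    """Metrics at cutoff k for one query that has exactly one relevant passage."""
    top_k = ranked_ids[:k]
    hit = relevant_id in top_k
    rank = top_k.index(relevant_id) + 1 if hit else None
    return {
        "accuracy": 1.0 if hit else 0.0,
        "precision": (1.0 / k) if hit else 0.0,
        "recall": 1.0 if hit else 0.0,
        "mrr": (1.0 / rank) if hit else 0.0,
        # With a single relevant passage the ideal DCG is 1,
        # so NDCG reduces to 1 / log2(rank + 1).
        "ndcg": (1.0 / math.log2(rank + 1)) if hit else 0.0,
    }

def mean_metrics(queries, k):
    """Average the per-query metrics over (ranking, relevant_id) pairs."""
    per_query = [metrics_at_k(ranking, relevant, k) for ranking, relevant in queries]
    return {name: sum(m[name] for m in per_query) / len(per_query)
            for name in per_query[0]}

# Toy rankings: relevant passage at rank 1, at rank 2, and missing.
queries = [
    (["d1", "d2", "d3"], "d1"),
    (["d4", "d5", "d6"], "d5"),
    (["d7", "d8", "d9"], "d0"),
]
print(mean_metrics(queries, k=3))
```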
Training Details
Training Dataset
Unnamed Dataset
- Size: 46,338 training samples
- Columns: sentence_0 and sentence_1
- Approximate statistics based on the first 1000 samples:
 | sentence_0 | sentence_1 |
---|---|---|
type | string | string |
details | min: 11 tokens; mean: 35.24 tokens; max: 206 tokens | min: 4 tokens; mean: 193.39 tokens; max: 512 tokens |
- Samples:
sentence_0 | sentence_1 |
---|---|
How is materiality defined in the context of an entity's sustainability reporting as per QC 4? | QC 4. Materiality is an entity-specific aspect of relevance based on the nature or magnitude, or both, of the items to which the information relates, as assessed in the context of the undertaking’s sustainability reporting (see chapter 3 of this Standard). Faithful representation QC 5. To be useful, the information must not only represent relevant phenomena, it must also faithfully represent the substance of the phenomena that it purports to represent. Faithful representation requires information to be (i) complete, (ii) neutral and (iii) accurate. |
What procedure must be followed for the adoption of implementing acts as mentioned in the text? | Those implementing acts shall be adopted in accordance with the examination procedure referred to in Article 22a(2). 3. Articles 9, 9a and 10 shall apply to maritime transport activities in the same manner as they apply to other activities covered by the EU ETS with the following exception with regard to the application of Article 10. |
How should monitoring points be distributed for groundwater bodies that flow across Member State boundaries to effectively estimate groundwater flow? | The network shall include sufficient representative monitoring points to estimate the groundwater level in each groundwater body or group of bodies taking into account short and long-term variations in recharge and in particular: — for groundwater bodies identified as being at risk of failing to achieve environmental objectives under Article 4, ensure sufficient density of monitoring points to assess the impact of abstractions and discharges on the groundwater level, — for groundwater bodies within which groundwater flows across a Member State boundary, ensure sufficient monitoring points are provided to estimate the direction and rate of groundwater flow across the Member State boundary. 2.2.3. Monitoring frequency |
- Loss: MatryoshkaLoss with these parameters:
{
  "loss": "MultipleNegativesRankingLoss",
  "matryoshka_dims": [1024, 768, 512, 256, 128, 64],
  "matryoshka_weights": [1, 1, 1, 1, 1, 1],
  "n_dims_per_step": -1
}
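To make the loss parameters above concrete, here is a rough pure-Python sketch of the idea: a ranking loss with in-batch negatives is computed on several truncated prefixes of the embeddings and the weighted results are summed. It illustrates the technique and is not the library's implementation. The function names, the toy 4-dimensional vectors and the `scale=20.0` value are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mnr_loss(anchors, positives, scale=20.0):
    """Cross-entropy where anchor i should score highest against positive i;
    every other positive in the batch acts as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the ranking loss over truncated embedding prefixes."""
    loss = 0.0
    for dim, weight in zip(dims, weights):
        truncated_anchors = [v[:dim] for v in anchors]
        truncated_positives = [v[:dim] for v in positives]
        loss += weight * mnr_loss(truncated_anchors, truncated_positives)
    return loss

# Toy batch of two (question, passage) pairs with 4-dimensional embeddings.
anchors = [[1.0, 0.0, 0.2, 0.1], [0.0, 1.0, 0.1, 0.3]]
positives = [[0.9, 0.1, 0.3, 0.0], [0.1, 0.8, 0.0, 0.4]]
print(matryoshka_loss(anchors, positives, dims=[4, 2], weights=[1, 1]))
```

With equal weights, as configured for this model, each truncated size contributes equally, which pushes the leading dimensions to carry most of the ranking signal.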
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: steps
- multi_dataset_batch_sampler: round_robin
All Hyperparameters
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 8
- per_device_eval_batch_size: 8
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1
- num_train_epochs: 3
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.0
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: round_robin
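The combination of learning_rate 5e-05, lr_scheduler_type linear and warmup_steps 0 implies a learning rate that starts at its peak and decays in a straight line to zero by the final step. A small sketch of that schedule follows. The function `linear_lr` is a hypothetical helper, and the total of 17379 steps is taken from the last row of the training logs.

```python
def linear_lr(step, base_lr=5e-05, total_steps=17379, warmup_steps=0):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr (unused here because warmup_steps is 0).
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Learning rate at the start and at the end of each of the three epochs.
for step in (0, 5793, 11586, 17379):
    print(step, linear_lr(step))
```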
Training Logs
Epoch | Step | Training Loss | cosine_ndcg@10 |
---|---|---|---|
0.0863 | 500 | 0.938 | - |
0.1726 | 1000 | 0.2188 | - |
0.2589 | 1500 | 0.1998 | - |
0.3452 | 2000 | 0.2162 | 0.7843 |
0.4316 | 2500 | 0.1921 | - |
0.5179 | 3000 | 0.1749 | - |
0.6042 | 3500 | 0.1741 | - |
0.6905 | 4000 | 0.2007 | 0.7779 |
0.7768 | 4500 | 0.1456 | - |
0.8631 | 5000 | 0.1034 | - |
0.9494 | 5500 | 0.1285 | - |
1.0 | 5793 | - | 0.7806 |
1.0357 | 6000 | 0.1011 | 0.7879 |
1.1220 | 6500 | 0.065 | - |
1.2084 | 7000 | 0.0754 | - |
1.2947 | 7500 | 0.067 | - |
1.3810 | 8000 | 0.059 | 0.7953 |
1.4673 | 8500 | 0.0644 | - |
1.5536 | 9000 | 0.0705 | - |
1.6399 | 9500 | 0.0425 | - |
1.7262 | 10000 | 0.0515 | 0.8171 |
1.8125 | 10500 | 0.0358 | - |
1.8988 | 11000 | 0.0515 | - |
1.9852 | 11500 | 0.043 | - |
2.0 | 11586 | - | 0.8201 |
2.0715 | 12000 | 0.0257 | 0.8208 |
2.1578 | 12500 | 0.0343 | - |
2.2441 | 13000 | 0.0307 | - |
2.3304 | 13500 | 0.0324 | - |
2.4167 | 14000 | 0.0225 | 0.8236 |
2.5030 | 14500 | 0.0362 | - |
2.5893 | 15000 | 0.0255 | - |
2.6756 | 15500 | 0.0203 | - |
2.7620 | 16000 | 0.0244 | 0.8240 |
2.8483 | 16500 | 0.0461 | - |
2.9346 | 17000 | 0.0226 | - |
3.0 | 17379 | - | 0.8278 |
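The Epoch and Step columns can be cross-checked against the dataset size and batch size reported earlier. The sketch below assumes a single device and no gradient accumulation (both consistent with the listed hyperparameters, though the device count is not stated), and keeps the final partial batch because dataloader_drop_last is False. The helper `epoch_at` is a hypothetical name.

```python
import math

train_samples = 46338  # training set size
batch_size = 8         # per_device_train_batch_size
epochs = 3             # num_train_epochs

# The last, partially filled batch still counts as a step.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

def epoch_at(step):
    """Fractional epoch reached after `step` optimizer steps."""
    return round(step / steps_per_epoch, 4)

print(steps_per_epoch, total_steps)
print(epoch_at(500), epoch_at(2000), epoch_at(17379))
```

These figures line up with the table: 5793 steps close the first epoch, 17379 steps close the third, and step 500 falls at epoch 0.0863.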
Framework Versions
- Python: 3.10.15
- Sentence Transformers: 3.4.1
- Transformers: 4.49.0
- PyTorch: 2.6.0+cu126
- Accelerate: 1.5.2
- Datasets: 3.4.1
- Tokenizers: 0.21.1
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MatryoshkaLoss
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
MultipleNegativesRankingLoss
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}