|
| Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr| |
|
|------------------|-------|----------------|-----:|-----------------------|---|-----:|---|------| |
|
|gsm8k | 3|flexible-extract| 5|exact_match |↑ |0.8400|± |0.0101| |
|
| | |strict-match | 5|exact_match |↑ |0.8378|± |0.0102| |
|
|hrm8k | N/A| | | | | | | | |
|
| - hrm8k_gsm8k | 1|none | 0|exact_match |↑ |0.8196|± |0.0106| |
|
| - hrm8k_ksm | 1|none | 0|exact_match |↑ |0.0511|± |0.0058| |
|
| - hrm8k_math | 1|none | 0|exact_match |↑ |0.5539|± |0.0093| |
|
| - hrm8k_mmmlu | 1|none | 0|exact_match |↑ |0.5362|± |0.0230| |
|
| - hrm8k_omni_math| 1|none | 0|exact_match |↑ |0.1812|± |0.0088| |
|
|ifeval | 4|none | 0|inst_level_loose_acc |↑ |0.8753|± | N/A| |
|
| | |none | 0|inst_level_strict_acc |↑ |0.8609|± | N/A| |
|
| | |none | 0|prompt_level_loose_acc |↑ |0.8244|± |0.0164| |
|
| | |none | 0|prompt_level_strict_acc|↑ |0.8078|± |0.0170| |
|
|
|
|
|
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr| |
|
|-------------------------------|------:|------|-----:|--------|---|-----:|---|------| |
|
|haerae | 1|none | |acc |↑ |0.6654|± |0.0140| |
|
| | |none | |acc_norm|↑ |0.6654|± |0.0140| |
|
| - haerae_general_knowledge | 1|none | 5|acc |↑ |0.4943|± |0.0378| |
|
| | |none | 5|acc_norm|↑ |0.4943|± |0.0378| |
|
| - haerae_history | 1|none | 5|acc |↑ |0.5585|± |0.0363| |
|
| | |none | 5|acc_norm|↑ |0.5585|± |0.0363| |
|
| - haerae_loan_word | 1|none | 5|acc |↑ |0.7456|± |0.0336| |
|
| | |none | 5|acc_norm|↑ |0.7456|± |0.0336| |
|
| - haerae_rare_word | 1|none | 5|acc |↑ |0.7160|± |0.0224| |
|
| | |none | 5|acc_norm|↑ |0.7160|± |0.0224| |
|
| - haerae_standard_nomenclature| 1|none | 5|acc |↑ |0.7712|± |0.0341| |
|
| | |none | 5|acc_norm|↑ |0.7712|± |0.0341| |
|
|kobest | 1|none | |acc |↑ |0.7768|± |0.0057| |
|
| | |none | |acc_norm|↑ |0.5880|± |0.0220| |
|
| | |none | |f1 |↑ |0.7764|± | N/A| |
|
| - kobest_boolq | 1|none | 5|acc |↑ |0.9252|± |0.0070| |
|
| | |none | 5|f1 |↑ |0.9252|± | N/A| |
|
| - kobest_copa | 1|none | 5|acc |↑ |0.6980|± |0.0145| |
|
| | |none | 5|f1 |↑ |0.6975|± | N/A| |
|
| - kobest_hellaswag | 1|none | 5|acc |↑ |0.4440|± |0.0222| |
|
| | |none | 5|acc_norm|↑ |0.5880|± |0.0220| |
|
| | |none | 5|f1 |↑ |0.4419|± | N/A| |
|
| - kobest_sentineg | 1|none | 5|acc |↑ |0.9622|± |0.0096| |
|
| | |none | 5|f1 |↑ |0.9622|± | N/A| |
|
| - kobest_wic | 1|none | 5|acc |↑ |0.7476|± |0.0122| |
|
| | |none | 5|f1 |↑ |0.7473|± | N/A| |
|
|
|
|Groups|Version|Filter|n-shot| Metric | |Value | |Stderr| |
|
|------|------:|------|------|--------|---|-----:|---|------| |
|
|haerae| 1|none | |acc |↑ |0.6654|± |0.0140| |
|
| | |none | |acc_norm|↑ |0.6654|± |0.0140| |
|
|kobest| 1|none | |acc |↑ |0.7768|± |0.0057| |
|
| | |none | |acc_norm|↑ |0.5880|± |0.0220| |
|
| | |none | |f1 |↑ |0.7764|± | N/A| |
|
|
|
|
|
vllm (pretrained=/data/public/glim/workspace/mergekit/output,max_model_len=8192,tensor_parallel_size=2,enable_chunked_prefill=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto |
|
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr| |
|
|-----------------------------------------------------------|------:|------|-----:|-----------|---|-----:|---|-----:| |
|
|kmmlu_direct | 2|none | |exact_match|↑ |0.5212|± |0.0026| |
|
| - kmmlu_direct_applied_science | 2|none | |exact_match|↑ |0.4997|± |0.0046| |
|
| - kmmlu_direct_aviation_engineering_and_maintenance | 2|none | 5|exact_match|↑ |0.5190|± |0.0158| |
|
| - kmmlu_direct_electronics_engineering | 2|none | 5|exact_match|↑ |0.6420|± |0.0152| |
|
| - kmmlu_direct_energy_management | 2|none | 5|exact_match|↑ |0.4160|± |0.0156| |
|
| - kmmlu_direct_environmental_science | 2|none | 5|exact_match|↑ |0.3710|± |0.0153| |
|
| - kmmlu_direct_gas_technology_and_engineering | 2|none | 5|exact_match|↑ |0.4240|± |0.0156| |
|
| - kmmlu_direct_geomatics | 2|none | 5|exact_match|↑ |0.4580|± |0.0158| |
|
| - kmmlu_direct_industrial_engineer | 2|none | 5|exact_match|↑ |0.5080|± |0.0158| |
|
| - kmmlu_direct_machine_design_and_manufacturing | 2|none | 5|exact_match|↑ |0.5360|± |0.0158| |
|
| - kmmlu_direct_maritime_engineering | 2|none | 5|exact_match|↑ |0.5117|± |0.0204| |
|
| - kmmlu_direct_nondestructive_testing | 2|none | 5|exact_match|↑ |0.5460|± |0.0158| |
|
| - kmmlu_direct_railway_and_automotive_engineering | 2|none | 5|exact_match|↑ |0.4310|± |0.0157| |
|
| - kmmlu_direct_telecommunications_and_wireless_technology| 2|none | 5|exact_match|↑ |0.6380|± |0.0152| |
|
| - kmmlu_direct_humss | 2|none | |exact_match|↑ |0.5365|± |0.0068| |
|
| - kmmlu_direct_accounting | 2|none | 5|exact_match|↑ |0.5400|± |0.0501| |
|
| - kmmlu_direct_criminal_law | 2|none | 5|exact_match|↑ |0.3800|± |0.0344| |
|
| - kmmlu_direct_economics | 2|none | 5|exact_match|↑ |0.5462|± |0.0438| |
|
| - kmmlu_direct_education | 2|none | 5|exact_match|↑ |0.6200|± |0.0488| |
|
| - kmmlu_direct_korean_history | 2|none | 5|exact_match|↑ |0.2700|± |0.0446| |
|
| - kmmlu_direct_law | 2|none | 5|exact_match|↑ |0.4730|± |0.0158| |
|
| - kmmlu_direct_management | 2|none | 5|exact_match|↑ |0.6120|± |0.0154| |
|
| - kmmlu_direct_political_science_and_sociology | 2|none | 5|exact_match|↑ |0.5500|± |0.0288| |
|
| - kmmlu_direct_psychology | 2|none | 5|exact_match|↑ |0.4590|± |0.0158| |
|
| - kmmlu_direct_social_welfare | 2|none | 5|exact_match|↑ |0.6710|± |0.0149| |
|
| - kmmlu_direct_taxation | 2|none | 5|exact_match|↑ |0.4100|± |0.0349| |
|
| - kmmlu_direct_other | 2|none | |exact_match|↑ |0.5130|± |0.0053| |
|
| - kmmlu_direct_agricultural_sciences | 2|none | 5|exact_match|↑ |0.3920|± |0.0154| |
|
| - kmmlu_direct_construction | 2|none | 5|exact_match|↑ |0.4320|± |0.0157| |
|
| - kmmlu_direct_fashion | 2|none | 5|exact_match|↑ |0.5010|± |0.0158| |
|
| - kmmlu_direct_food_processing | 2|none | 5|exact_match|↑ |0.4700|± |0.0158| |
|
| - kmmlu_direct_health | 2|none | 5|exact_match|↑ |0.6800|± |0.0469| |
|
| - kmmlu_direct_interior_architecture_and_design | 2|none | 5|exact_match|↑ |0.6110|± |0.0154| |
|
| - kmmlu_direct_marketing | 2|none | 5|exact_match|↑ |0.8010|± |0.0126| |
|
| - kmmlu_direct_patent | 2|none | 5|exact_match|↑ |0.3900|± |0.0490| |
|
| - kmmlu_direct_public_safety | 2|none | 5|exact_match|↑ |0.4660|± |0.0158| |
|
| - kmmlu_direct_real_estate | 2|none | 5|exact_match|↑ |0.4850|± |0.0354| |
|
| - kmmlu_direct_refrigerating_machinery | 2|none | 5|exact_match|↑ |0.4320|± |0.0157| |
|
| - kmmlu_direct_stem | 2|none | |exact_match|↑ |0.5455|± |0.0048| |
|
| - kmmlu_direct_biology | 2|none | 5|exact_match|↑ |0.4140|± |0.0156| |
|
| - kmmlu_direct_chemical_engineering | 2|none | 5|exact_match|↑ |0.5580|± |0.0157| |
|
| - kmmlu_direct_chemistry | 2|none | 5|exact_match|↑ |0.5550|± |0.0203| |
|
| - kmmlu_direct_civil_engineering | 2|none | 5|exact_match|↑ |0.4750|± |0.0158| |
|
| - kmmlu_direct_computer_science | 2|none | 5|exact_match|↑ |0.7780|± |0.0131| |
|
| - kmmlu_direct_ecology | 2|none | 5|exact_match|↑ |0.5430|± |0.0158| |
|
| - kmmlu_direct_electrical_engineering | 2|none | 5|exact_match|↑ |0.4100|± |0.0156| |
|
| - kmmlu_direct_information_technology | 2|none | 5|exact_match|↑ |0.7860|± |0.0130| |
|
| - kmmlu_direct_materials_engineering | 2|none | 5|exact_match|↑ |0.5250|± |0.0158| |
|
| - kmmlu_direct_math | 2|none | 5|exact_match|↑ |0.3267|± |0.0271| |
|
| - kmmlu_direct_mechanical_engineering | 2|none | 5|exact_match|↑ |0.4800|± |0.0158| |
|
|
|
| Groups |Version|Filter|n-shot| Metric | |Value | |Stderr| |
|
|-------------------------------|------:|------|------|-----------|---|-----:|---|-----:| |
|
|kmmlu_direct | 2|none | |exact_match|↑ |0.5212|± |0.0026| |
|
| - kmmlu_direct_applied_science| 2|none | |exact_match|↑ |0.4997|± |0.0046| |
|
| - kmmlu_direct_humss | 2|none | |exact_match|↑ |0.5365|± |0.0068| |
|
| - kmmlu_direct_other | 2|none | |exact_match|↑ |0.5130|± |0.0053| |
|
| - kmmlu_direct_stem | 2|none | |exact_match|↑ |0.5455|± |0.0048| |