core_leaderboard / evals_live

Commit History

Upload swebench_verified_Agentless_gpt-4o-mini-2024-07-18_50_Instances_1723916965.json
01fb261
verified

benediktstroebl committed on

Delete evals_live/swebench_verified_Agentless_gpt-4o-2024-07-18_50_Instances_1723916965.json
e23eddc
verified

benediktstroebl committed on

Upload swebench_verified_Agentless_gpt-4o-2024-07-18_50_Instances_1723916965.json
a2d5cb2
verified

benediktstroebl committed on

Merge branch 'main' of https://huggingface.co/spaces/agent-evals/leaderboard
3427022

benediktstroebl committed on

added failure report and two new swebench variants
5a7e21a

benediktstroebl committed on

Upload usaco_USACO_Reflexion__Episodic__Semantic_gpt-4o-mini-2024-07-18_1723558382.json
974935f
verified

benediktstroebl committed on

fixed broken fle
9ed6519

benediktstroebl committed on

update
addf4e7

benediktstroebl committed on

Merge branch 'main' of https://huggingface.co/spaces/agent-evals/leaderboard
b585234

benediktstroebl committed on

Upload usaco_USACO_Episodic_gpt-4o-mini-2024-07-18_1723429624.json
19f1cd0
unverified

benediktstroebl committed on

Delete evals_live/usaco_USACO_Episodic_gpt-4o-mini-2024-07-18_1723429624.json
4cf2b30
unverified

benediktstroebl committed on

Upload usaco_USACO_Semantic_gpt-4o-mini-2024-07-18_1723431631.json
7380536
unverified

benediktstroebl committed on

Delete evals_live/usaco_usaco_test_172306727812321123.json
d3e9bdb
unverified

benediktstroebl committed on

Delete evals_live/usaco_usaco_example_agent_1722871527.json
73428db
unverified

benediktstroebl committed on

Delete evals_live/usaco_usaco_example_agent_1722871.json
317b884
unverified

benediktstroebl committed on

Upload usaco_USACO_Episodic_gpt-4o-mini-2024-07-18_1723429624.json
3ee1461
unverified

benediktstroebl committed on

Upload usaco_USACO_Zero-shot_gpt-4o-mini-2024-07-18_1723417375.json
b0b576a
unverified

benediktstroebl committed on

added monitoring results
86a15ac

benediktstroebl committed on

layout update
066588c

benediktstroebl committed on

data reformatting for demo
766750f

benediktstroebl committed on

big update with raw predictions section and dropdowns that dynamically parse agents of current leaderboard
ca89148

benediktstroebl committed on