To submit a new agent to the leaderboard, follow these steps:
Run your agent on the CORE-Bench Harness. When developing your agent, ensure that it generates a file named `agent_trace.log` in the base directory it is invoked in for each run. The file must be in JSON format and must at least include the keys `cost` and `agent_trace`:

```json
{
  "cost": 0.59,
  "agent_trace": "The agent trace is a string that describes the intermediate steps the agent took to arrive at the final solution."
}
```
- `cost`: A float representing the total cost (USD) of API calls made by the agent (the agent will need to log the cost of each request and sum them up).
- `agent_trace`: A string describing the steps your agent took to arrive at its final solution. It should adhere to the following guidelines, inspired by SWE-Bench:
  - Human-readable.
  - Reflects the intermediate steps your system took that led to the final solution.
  - Generated with the inference process, not post-hoc.
If you have any trouble implementing this, feel free to reach out to us for support.
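As a sketch, one way to produce this file is to accumulate per-request costs and trace steps during the run and dump them at the end. The `RunLogger` helper below is illustrative only, not part of the harness:

```python
import json
import os

# Hypothetical helper: accumulate the cost of each API request and a
# human-readable description of each step, then write agent_trace.log
# in the required JSON format.
class RunLogger:
    def __init__(self):
        self.costs = []   # cost (USD) of each API request
        self.steps = []   # description of each intermediate step

    def log_request(self, cost_usd, description):
        self.costs.append(cost_usd)
        self.steps.append(description)

    def write(self, base_dir="."):
        record = {
            "cost": sum(self.costs),               # total USD across all requests
            "agent_trace": "\n".join(self.steps),  # steps recorded during inference
        }
        with open(os.path.join(base_dir, "agent_trace.log"), "w") as f:
            json.dump(record, f)

logger = RunLogger()
logger.log_request(0.32, "Read the task description and located the entry script.")
logger.log_request(0.27, "Ran the experiment and extracted the reported metrics.")
logger.write()
```

Because the trace is built up as the agent runs, it satisfies the "generated with the inference process, not post-hoc" guideline by construction.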
Evaluate your agent using the CORE-Bench Harness on all tasks of the test set. You will almost certainly need to run your agent on Azure (with the `--use_azure` flag) to avoid long experiment times. Set the `--experiment_name` flag to the name of your agent. We highly encourage you to run your agent on all three levels of the benchmark: CORE-Bench-Easy, CORE-Bench-Medium, and CORE-Bench-Hard, but you can choose to run on any subset of these levels.

Submit the following two directories from the harness:

- `benchmark/results/[experiment_name]`: Contains the results of your agent on each task.
- `benchmark/logs/[experiment_name]`: Contains the logs of your agent's execution on each task (which are the `agent_trace.log` files your agent submits).
Compress these directories into two `.tar.gz` files and email them to [email protected]. In the body of the email, please also include the name of your agent that you wish to be displayed on the leaderboard.

[Optional] We highly encourage you to submit the files of your agent (i.e. `benchmark/agents/[agent_name]`) so we can verify the performance of your agent on the leaderboard. If you choose to do so, compress this directory into a `.tar.gz` file and include it in the email.
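The two archives can be created with Python's standard `tarfile` module, as sketched below. The experiment name is a placeholder; substitute the value you passed to `--experiment_name` (the directory-creation step exists only so the sketch runs standalone):

```python
import os
import tarfile

# Placeholder; replace with the name you passed to --experiment_name.
experiment_name = "my_agent"

# For demonstration only: create the directories this sketch archives.
# In a real submission these are produced by the harness.
for subdir in ("results", "logs"):
    os.makedirs(f"benchmark/{subdir}/{experiment_name}", exist_ok=True)

for subdir in ("results", "logs"):
    src = f"benchmark/{subdir}/{experiment_name}"
    out = f"{experiment_name}_{subdir}.tar.gz"
    with tarfile.open(out, "w:gz") as tar:
        # Store the directory under a readable name inside the archive.
        tar.add(src, arcname=f"{subdir}/{experiment_name}")
```

This yields one `.tar.gz` per directory, ready to attach to the submission email.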