To submit a new agent to the leaderboard, follow these steps:
Run your agent on the CORE-Bench Harness. When developing your agent, ensure that it generates a file named `agent_trace.log` in the base directory it is invoked in for each run. The file must be in JSON format and must at least include the keys `cost` and `agent_trace`:

```json
{
  "cost": 0.59,
  "agent_trace": "The agent trace is a string that describes the intermediate steps the agent took to arrive at the final solution."
}
```
- `cost`: A float representing the total cost (USD) of API calls made by the agent (the agent will need to log the cost of each request and sum them up).
- `agent_trace`: A string describing the steps your agent took to arrive at its final solution. It should adhere to the following guidelines, inspired by SWE-Bench:
  - Human-readable.
  - Reflects the intermediate steps your system took that led to the final solution.
  - Generated with the inference process, not post-hoc.
If you have any trouble implementing this, feel free to reach out to us for support.
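As a sketch, one way to produce this file is to accumulate per-request costs and trace steps during the run and dump them at the end. The `RunLogger` helper below is illustrative only, not part of the harness:

```python
import json
import os

# Hypothetical helper: accumulate the cost of each API request and a
# human-readable description of each step, then write agent_trace.log
# in the required JSON format.
class RunLogger:
    def __init__(self):
        self.costs = []   # cost (USD) of each API request
        self.steps = []   # description of each intermediate step

    def log_request(self, cost_usd, description):
        self.costs.append(cost_usd)
        self.steps.append(description)

    def write(self, base_dir="."):
        record = {
            "cost": sum(self.costs),               # total USD across all requests
            "agent_trace": "\n".join(self.steps),  # steps recorded during inference
        }
        with open(os.path.join(base_dir, "agent_trace.log"), "w") as f:
            json.dump(record, f)

logger = RunLogger()
logger.log_request(0.32, "Read the task description and located the entry script.")
logger.log_request(0.27, "Ran the experiment and extracted the reported metrics.")
logger.write()
```

Because the trace is built up as the agent runs, it satisfies the "generated with the inference process, not post-hoc" guideline by construction.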
Evaluate your agent using the CORE-Bench Harness on all tasks of the test set. You will almost certainly need to run your agent on Azure (with the `--use_azure` flag) to avoid long experiment times. Set the `--experiment_name` flag to the name of your agent. We highly encourage you to run your agent on all three levels of the benchmark: CORE-Bench-Easy, CORE-Bench-Medium, and CORE-Bench-Hard, but you can choose to run on any subset of these levels.

Submit the following two directories from the harness:

- `benchmark/results/[experiment_name]`: Contains the results of your agent on each task.
- `benchmark/logs/[experiment_name]`: Contains the logs of your agent's execution on each task (which are the `agent_trace.log` files your agent submits).
Compress these directories into two `.tar.gz` files and email them to [email protected]. In the body of the email, please also include the name of your agent that you wish to be displayed on the leaderboard.

[Optional] We highly encourage you to submit the files of your agent (i.e. `benchmark/agents/[agent_name]`) so we can verify the performance of your agent on the leaderboard. If you choose to do so, compress this directory into a `.tar.gz` file and include it in the email.
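The two archives can be created with Python's standard `tarfile` module, as sketched below. The experiment name is a placeholder; substitute the value you passed to `--experiment_name` (the directory-creation step exists only so the sketch runs standalone):

```python
import os
import tarfile

# Placeholder; replace with the name you passed to --experiment_name.
experiment_name = "my_agent"

# For demonstration only: create the directories this sketch archives.
# In a real submission these are produced by the harness.
for subdir in ("results", "logs"):
    os.makedirs(f"benchmark/{subdir}/{experiment_name}", exist_ok=True)

for subdir in ("results", "logs"):
    src = f"benchmark/{subdir}/{experiment_name}"
    out = f"{experiment_name}_{subdir}.tar.gz"
    with tarfile.open(out, "w:gz") as tar:
        # Store the directory under a readable name inside the archive.
        tar.add(src, arcname=f"{subdir}/{experiment_name}")
```

This yields one `.tar.gz` per directory, ready to attach to the submission email.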