core_leaderboard / agent_submission.md
Zachary Siegel
agent submission instructions
2c91b5e

To submit a new agent to the leaderboard, follow these steps:

  1. Run your agent on the CORE-Bench Harness. When developing your agent, ensure that it generates a file named agent_trace.log in the base directory from which it is invoked on each run. The file must be in JSON format and include at least the keys cost and agent_trace:

    {
        "cost": 0.59,
        "agent_trace": "The agent trace is a string that describes the intermediate steps the agent took to arrive at the final solution."
    }
    
    • cost: A float representing the total cost in USD of all API calls made by the agent (your agent will need to log the cost of each request and sum them).
    • agent_trace: A string describing the steps your agent took to arrive at its final solution. It should adhere to the following guidelines inspired by SWE-Bench:
      • Human-readable.
      • Reflects the intermediate steps your system took that led to the final solution.
      • Generated with the inference process, not post-hoc.

    If you have any trouble implementing this, feel free to reach out to us for support.
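The logging described above can be sketched as follows. This is a minimal illustration, assuming your agent is written in Python; the TraceLogger class and its method names are hypothetical conveniences, not part of the harness:

```python
import json


class TraceLogger:
    """Accumulates per-request API costs and intermediate steps,
    then writes the agent_trace.log file the harness expects."""

    def __init__(self):
        self.total_cost = 0.0
        self.steps = []

    def log_request(self, cost_usd, description):
        # Record the USD cost of one API call plus a human-readable
        # description of the step, captured during inference (not post-hoc).
        self.total_cost += cost_usd
        self.steps.append(description)

    def write(self, path="agent_trace.log"):
        # The file must be JSON and contain at least "cost" and "agent_trace".
        with open(path, "w") as f:
            json.dump({
                "cost": round(self.total_cost, 4),
                "agent_trace": "\n".join(self.steps),
            }, f)


logger = TraceLogger()
logger.log_request(0.31, "Read the task README and located the entry script.")
logger.log_request(0.28, "Ran the analysis script and extracted the results.")
logger.write()
```

Call log_request after every API call your agent makes, and call write once before the agent exits so the file lands in the base directory.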

  2. Evaluate your agent with the CORE-Bench Harness on all tasks of the test set. You will almost certainly need to run your agent on Azure (with the --use_azure flag) to keep experiment times manageable. Set the --experiment_name flag to the name of your agent. We highly encourage you to run your agent on all three levels of the benchmark: CORE-Bench-Easy, CORE-Bench-Medium, and CORE-Bench-Hard, but you may run on any subset of them.

  3. Submit the following two directories from the harness:

    • benchmark/results/[experiment_name]: Contains the results of your agent on each task.
    • benchmark/logs/[experiment_name]: Contains the logs of your agent's execution on each task (which are the agent_trace.log files your agent submits).

    Compress these directories into two .tar.gz files and email them to [email protected]. In the body of the email, please also include the agent name you wish to have displayed on the leaderboard.
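For example, the two archives can be created like this, assuming you run the commands from the harness root directory (the experiment name "my-agent" is a placeholder; substitute the value you passed to --experiment_name):

```shell
# Placeholder experiment name; use your own.
EXPERIMENT_NAME="my-agent"

# Archive the results and logs directories produced by the harness.
tar -czf "${EXPERIMENT_NAME}_results.tar.gz" "benchmark/results/${EXPERIMENT_NAME}"
tar -czf "${EXPERIMENT_NAME}_logs.tar.gz" "benchmark/logs/${EXPERIMENT_NAME}"
```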

  4. [Optional] We highly encourage you to submit your agent's files (i.e., benchmark/agents/[agent_name]) so we can verify your agent's performance on the leaderboard. If you choose to do so, compress this directory into a .tar.gz file and include it in the email.