Zachary Siegel committed
Commit 2c91b5e · 1 Parent(s): abf78cc

agent submission instructions

Files changed (2)
  1. agent_submission.md +25 -9
  2. app.py +1 -3
agent_submission.md CHANGED
@@ -1,12 +1,28 @@
- To submit **a new agent** for evaluation, developers should only need to:
-
- 1. Adhere to Standardized I/O Format: Ensure the agent run file complies with the benchmark-specific I/O format. Depending on HAL's implementation, this could involve:
- * Providing a specific entry point to the agent (e.g., a Python script or function)
- * Correctly handling instructions and the submission process. For example, in METR's Vivaria, this can mean supplying a *main.py* file as the entry point and managing *instructions.txt* and *submission.txt* files.
-
- 2. Integrate logging by wrapping all LLM API calls to report cost, latency, and relevant parameters.
- * For our own evaluations, we have been relying on [Weights & Biases' Weave](https://wandb.github.io/weave/), which provides integrations for a number of LLM providers.
- * Both [Vivaria](https://github.com/METR/vivaria) and UK AISI's [Inspect](https://github.com/UKGovernmentBEIS/inspect_ai) provide logging functionality.
- * However, there are some missing pieces we are interested in, such as latency and parameters of LLM calls. Weave provides a minimum-effort solution.
-
- 3. Use our CLI to run evaluations and upload the results. The same CLI can also be used to rerun existing agent-benchmark pairs from the leaderboard.
+ ## To submit **a new agent** to the leaderboard, follow these steps:
+
+ 1. **Run your agent on the [CORE-Bench Harness](https://github.com/siegelz/core-bench).** When developing your agent, ensure that it generates a file named `agent_trace.log` in the base directory from which it is invoked for each run. The file must be in JSON format and **at least** include the keys `cost` and `agent_trace`:
+
+ ```json
+ {
+     "cost": 0.59,
+     "agent_trace": "The agent trace is a string that describes the intermediate steps the agent took to arrive at the final solution."
+ }
+ ```
+
+ - **`cost`**: A float representing the total cost (USD) of API calls made by the agent (the agent will need to log the cost of each request and sum them up; see the sketch below).
+ - **`agent_trace`**: A string describing the steps your agent took to arrive at its final solution. It should adhere to the following guidelines, inspired by [SWE-Bench](https://www.swebench.com/submit.html):
+     - Human-readable.
+     - Reflects the intermediate steps your system took that led to the final solution.
+     - Generated with the inference process, not post-hoc.
+
+ If you have any trouble implementing this, feel free to reach out to us for support.
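+
+ For illustration, here is a minimal sketch (not part of the harness; the pricing constants and helper name are placeholders) of how an agent might accumulate per-request costs and build its trace before writing `agent_trace.log`:
+
+ ```python
+ import json
+
+ # Illustrative per-million-token prices in USD; substitute your provider's actual rates.
+ PRICE_PER_M_INPUT = 2.50
+ PRICE_PER_M_OUTPUT = 10.00
+
+ total_cost = 0.0
+ trace_steps = []
+
+ def record_call(step_description, input_tokens, output_tokens):
+     """Add the cost of one LLM call to the running total and note what the agent did."""
+     global total_cost
+     total_cost += (input_tokens * PRICE_PER_M_INPUT + output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000
+     trace_steps.append(step_description)
+
+ # Call record_call(...) after every LLM request your agent makes, e.g.:
+ record_call("Read the task instructions and planned the reproduction steps.", 1200, 350)
+
+ # Write the log in the base directory the agent was invoked from.
+ with open("agent_trace.log", "w") as f:
+     json.dump({"cost": round(total_cost, 4), "agent_trace": " ".join(trace_steps)}, f, indent=2)
+ ```
+
+ The only hard requirement is that the resulting file is valid JSON containing at least the `cost` and `agent_trace` keys.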
+
+ 2. **Evaluate your agent using the [CORE-Bench Harness](https://github.com/siegelz/core-bench)** on all tasks of the test set. You will almost certainly need to run your agent on Azure (with the `--use_azure` flag) to avoid long experiment times. Set the `--experiment_name` flag to the name of your agent. We highly encourage you to run your agent on all three levels of the benchmark: CORE-Bench-Easy, CORE-Bench-Medium, and CORE-Bench-Hard, but you can choose to run on any subset of these levels.
+
+ 3. **Submit the following two directories from the harness**:
+     - `benchmark/results/[experiment_name]`: Contains the results of your agent on each task.
+     - `benchmark/logs/[experiment_name]`: Contains the logs of your agent's execution on each task (which are the `agent_trace.log` files your agent submits).
+
+ Compress these directories into two `.tar.gz` files and email them to [[email protected]](mailto:[email protected]). **In the body of the email, please also include the name of your agent that you wish to be displayed on the leaderboard.**
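+
+ For example, the two archives could be created with a short Python snippet like the following sketch (the experiment name and output filenames are placeholders; an equivalent `tar -czf` command works just as well):
+
+ ```python
+ import tarfile
+ from pathlib import Path
+
+ experiment_name = "my_agent"  # placeholder: the value passed to --experiment_name
+
+ # Archive benchmark/results/[experiment_name] and benchmark/logs/[experiment_name].
+ for sub in ("results", "logs"):
+     src = Path("benchmark") / sub / experiment_name
+     with tarfile.open(f"{experiment_name}_{sub}.tar.gz", "w:gz") as tar:
+         tar.add(str(src), arcname=f"{sub}/{experiment_name}")
+ ```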
+
+ 4. **[Optional]** We highly encourage you to also submit your agent's files (i.e., `benchmark/agents/[agent_name]`) so that we can verify its performance on the leaderboard. If you choose to do so, compress this directory into a `.tar.gz` file and include it in the email.
app.py CHANGED
@@ -442,10 +442,8 @@ with gr.Blocks(theme=my_theme, css='css.css', title="HAL: Holistic Agent Leaderb
             }, 100);
         }
     """)
-    gr.HTML("""<h2 class="section-heading" id="agent-submission">How to add an agent?</h2>""")
+    gr.HTML("""<h2 class="section-heading" id="agent-submission">How can I submit agents to the leaderboard?</h2>""")
     gr.Markdown((Path(__file__).parent / "agent_submission.md").read_text())
-    gr.HTML("""<h2 class="section-heading" id="reproduction-guide">How can I run evaluations?</h2>""")
-    gr.Markdown("""Coming soon...""")

 async def main():
     # Preprocess traces