Zachary Siegel committed
Commit · 2c91b5e · 1 Parent(s): abf78cc
agent submission instructions

Files changed:
- agent_submission.md +25 -9
- app.py +1 -3
agent_submission.md
CHANGED
@@ -1,12 +1,28 @@
- To submit **a new agent**
-
- 1.
- * Providing a specific entry point to the agent (e.g., a Python script or function)
- * Correctly handling instructions and the submission process. For example, in METR's Vivaria, this can mean supplying a *main.py* file as the entry point and managing *instructions.txt* and *submission.txt* files.
+ ## To submit **a new agent** to the leaderboard, follow these steps:
+
+ 1. **Run your agent on the [CORE-Bench Harness](https://github.com/siegelz/core-bench).** When developing your agent, ensure that it generates a file named `agent_trace.log` in the base directory from which it is invoked for each run. The file must be in JSON format and include **at least** the keys `cost` and `agent_trace`:
+
+     ```json
+     {
+         "cost": 0.59,
+         "agent_trace": "The agent trace is a string that describes the intermediate steps the agent took to arrive at the final solution."
+     }
+     ```
+
+     - **`cost`**: A float representing the total cost (USD) of API calls made by the agent (the agent will need to log the cost of each request and sum them up).
+     - **`agent_trace`**: A string describing the steps your agent took to arrive at its final solution. It should adhere to the following guidelines, inspired by [SWE-Bench](https://www.swebench.com/submit.html):
+         - Human-readable.
+         - Reflects the intermediate steps your system took that led to the final solution.
+         - Generated with the inference process, not post-hoc.
+
+     If you have any trouble implementing this, feel free to reach out to us for support.
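As an illustrative sketch (not part of this commit), an agent could satisfy the requirements above by accumulating per-request costs and trace steps as it runs, then writing the log at the end; the `TraceLogger` class and its method names here are hypothetical:

```python
import json
from pathlib import Path

class TraceLogger:
    """Accumulates per-request API costs and intermediate steps,
    then writes the agent_trace.log file described above."""

    def __init__(self):
        self.cost = 0.0
        self.steps = []

    def record(self, step_description, request_cost):
        # Log each step as it happens, so the trace is generated
        # with the inference process rather than post-hoc.
        self.steps.append(step_description)
        self.cost += request_cost

    def write(self, base_dir="."):
        # Write the JSON log with the two required keys into the
        # base directory the agent was invoked from.
        log = {
            "cost": round(self.cost, 4),
            "agent_trace": "\n".join(self.steps),
        }
        Path(base_dir, "agent_trace.log").write_text(json.dumps(log, indent=2))
        return log

logger = TraceLogger()
logger.record("Read task instructions and inspected the repository.", 0.12)
logger.record("Ran the analysis script and extracted the answer.", 0.47)
result = logger.write()
```

Additional keys beyond `cost` and `agent_trace` are allowed, since the file only needs to contain those two at a minimum.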
+
+ 2. **Evaluate your agent using the [CORE-Bench Harness](https://github.com/siegelz/core-bench)** on all tasks of the test set. You will almost certainly need to run your agent on Azure (with the `--use_azure` flag) to avoid long experiment times. Set the `--experiment_name` flag to the name of your agent. We highly encourage you to run your agent on all three levels of the benchmark: CORE-Bench-Easy, CORE-Bench-Medium, and CORE-Bench-Hard, but you may choose to run on any subset of these levels.
+
+ 3. **Submit the following two directories from the harness**:
+     - `benchmark/results/[experiment_name]`: Contains the results of your agent on each task.
+     - `benchmark/logs/[experiment_name]`: Contains the logs of your agent's execution on each task (which are the `agent_trace.log` files your agent submits).
+
+     Compress these directories into two `.tar.gz` files and email them to [[email protected]](mailto:[email protected]). **In the body of the email, please also include the name of your agent that you wish to be displayed on the leaderboard.**
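For example (a sketch, with `my_agent` standing in for your `--experiment_name` value), the two archives could be created from the harness root with standard `tar`:

```shell
# "my_agent" is a placeholder for your --experiment_name value.
# The mkdir line only stands in for directories the harness would
# have created during evaluation, so this sketch runs end to end.
mkdir -p benchmark/results/my_agent benchmark/logs/my_agent
tar -czf results_my_agent.tar.gz -C benchmark/results my_agent
tar -czf logs_my_agent.tar.gz -C benchmark/logs my_agent
```

Using `-C` keeps the archive paths relative to `benchmark/results` and `benchmark/logs`, so each archive unpacks to just the experiment directory.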
+
+ 4. [Optional] We highly encourage you to submit the files of your agent (i.e., `benchmark/agents/[agent_name]`) so we can verify the performance of your agent on the leaderboard. If you choose to do so, compress this directory into a `.tar.gz` file and include it in the email.
|
app.py
CHANGED
@@ -442,10 +442,8 @@ with gr.Blocks(theme=my_theme, css='css.css', title="HAL: Holistic Agent Leaderb
      }, 100);
  }
  """)
- gr.HTML("""<h2 class="section-heading" id="agent-submission">How to
+ gr.HTML("""<h2 class="section-heading" id="agent-submission">How can I submit agents to the leaderboard?</h2>""")
  gr.Markdown((Path(__file__).parent / "agent_submission.md").read_text())
- gr.HTML("""<h2 class="section-heading" id="reproduction-guide">How can I run evaluations?</h2>""")
- gr.Markdown("""Coming soon...""")

  async def main():
      # Preprocess traces