Judge,Overall,Recall,F1,AB,VWA,WA,Work,Work++,URL,Author
Rule-based,83.8,55.9,67.1,25.0,85.2,79.0,100.0,83.3,https://arxiv.org/abs/2504.08942,Lù et al.
WebJudge,73.7,N/A,66.7,69.8,72.6,92.3,75.0,https://arxiv.org/pdf/2504.01382,Xue et al.
AER-C (GPT-4o),67.7,71.9,69.7,83.3,56.0,68.8,100.0,66.7,https://arxiv.org/abs/2504.08942,Lù et al.
AER-V (GPT-4o),67.6,71.5,69.5,83.3,61.2,67.6,96.4,59.3,https://arxiv.org/abs/2504.08942,Lù et al.
NNetNav (Llama-3.3 70B),52.5,82.4,64.1,20.8,54.5,54.3,77.3,43.2,https://arxiv.org/abs/2504.08942,Lù et al.
Claude 3.7 S. (A),68.8,81.6,74.7,87.5,61.0,69.3,85.0,66.7,https://arxiv.org/abs/2504.08942,Lù et al.
GPT-4o (A),69.8,83.1,75.9,77.8,63.0,70.2,94.6,63.0,https://arxiv.org/abs/2504.08942,Lù et al.
GPT-4o Mini (A),61.5,86.1,71.7,80.0,57.9,63.5,84.2,49.4,https://arxiv.org/abs/2504.08942,Lù et al.
Llama 3.3 (A),67.7,79.0,72.9,75.0,59.6,68.2,94.3,62.7,https://arxiv.org/abs/2504.08942,Lù et al.
Qwen2.5-VL (A),64.3,89.8,75.0,72.7,59.3,63.6,87.2,60.3,https://arxiv.org/abs/2504.08942,Lù et al.
Claude 3.7 S. (S),69.4,76.3,72.7,71.4,64.8,69.3,85.3,66.7,https://arxiv.org/abs/2504.08942,Lù et al.
GPT-4o (S),68.1,80.3,73.7,77.8,60.7,69.9,93.8,59.6,https://arxiv.org/abs/2504.08942,Lù et al.
GPT-4o Mini (S),64.5,78.3,70.8,80.0,57.4,66.9,90.3,54.8,https://arxiv.org/abs/2504.08942,Lù et al.
Qwen2.5-VL (S),64.5,86.1,73.7,70.0,58.5,62.9,93.8,64.4,https://arxiv.org/abs/2504.08942,Lù et al.