
Branches · McGill-NLP/safearena · GitHub
SafeArena is a benchmark for assessing the harmful capabilities of web agents - Branches · McGill-NLP/safearena
safearena/README.md at main · McGill-NLP/safearena · GitHub
Releases · McGill-NLP/safearena · GitHub
Could the author share the AMI of the container used in SafeArena ...
Jun 15, 2025 · Hi, I noticed that SafeArena uses Docker containers that are different from those used in WebArena. Would it be possible to share the specific AMI or other setup details used for the …
safearena/utils at main · McGill-NLP/safearena · GitHub
Network Graph · McGill-NLP/safearena · GitHub
Code frequency · McGill-NLP/safearena · GitHub
Executing tasks in different order or resetting the dockers after each ...
Executing tasks in a different order, or resetting the Docker containers after each task, will give different evaluation scores. This is expected behavior in WebArena and VisualWebArena. Note that resetting docker …
Community Standards · GitHub
Timeout error · Issue #11 · McGill-NLP/safearena · GitHub
Jun 27, 2025 · Hi there, thanks for the great work! When I was running evaluation I found a lot of TimeoutErrors like the following one. Do you have some intuition about why this is happening? If this …