  1. Branches · McGill-NLP/safearena · GitHub

    SafeArena is a benchmark for assessing the harmful capabilities of web agents - Branches · McGill-NLP/safearena

  2. safearena/README.md at main · McGill-NLP/safearena · GitHub

    SafeArena is a benchmark for assessing the harmful capabilities of web agents - safearena/README.md at main · McGill-NLP/safearena

  3. Releases · McGill-NLP/safearena · GitHub

    SafeArena is a benchmark for assessing the harmful capabilities of web agents - Releases · McGill-NLP/safearena

  4. Could the author share the AMI of the container used in SafeArena ...

    Jun 15, 2025 · Hi, I noticed that SafeArena uses Docker containers that are different from those used in WebArena. Would it be possible to share the specific AMI or other setup details used for the …

  5. safearena/utils at main · McGill-NLP/safearena · GitHub

    SafeArena is a benchmark for assessing the harmful capabilities of web agents - safearena/utils at main · McGill-NLP/safearena

  6. Network Graph · McGill-NLP/safearena · GitHub

    SafeArena is a benchmark for assessing the harmful capabilities of web agents - Network Graph · McGill-NLP/safearena

  7. Code frequency · McGill-NLP/safearena · GitHub

    SafeArena is a benchmark for assessing the harmful capabilities of web agents - Code frequency · McGill-NLP/safearena

  8. Executing tasks in different order or resetting the dockers after each ...

    Executing tasks in a different order, or resetting the Docker containers after each task, will give different evaluation scores. This is expected behavior from WebArena and VisualWebArena. Note that resetting docker …

  9. Community Standards · GitHub

    SafeArena is a benchmark for assessing the harmful capabilities of web agents - Community Standards · McGill-NLP/safearena

  10. Timeout error · Issue #11 · McGill-NLP/safearena · GitHub

    Jun 27, 2025 · Hi there, thanks for the great work! When I was running evaluation I found a lot of TimeoutErrors like the following one. Do you have some intuition about why this is happening? If this …