Show HN: Skyvern 2.0 – open-source AI Browser Agent scoring 85.8% on WebVoyager
Hey HN,

We’re Suchintan and Shu from Skyvern (https://www.skyvern.com). We’re building an open source AI Agent that can browse the web and take actions. Our open source repo can be found at https://github.com/Skyvern-AI/Skyvern.

We’ve re-built Skyvern with a Planner-Actor-Validator agent architecture and achieved a new state of the art (SOTA) of 85.8% on the WebVoyager benchmark. You can see the results for yourself here: https://eval.skyvern.com/

For reference, here were the previous SOTA results:

83.5% - Google Mariner (https://deepmind.google/technologies/project-mariner/)
73.1% - AgentE (https://arxiv.org/html/2407.13032v1)
67.0% - HCompany (https://www.hcompany.ai/blog/a-research-update)
59.1% - WebVoyager (https://arxiv.org/html/2401.13919v4)
52.6% - WILBUR (https://arxiv.org/html/2404.05902v1)
52.0% - Claude Computer Use (https://docs.anthropic.com/en/docs/build-with-claude/compute...)

Achieving this SOTA result required expanding Skyvern’s original architecture. Skyvern 1.0 used a single prompt operating in a loop, both making decisions and taking actions on a website. This approach was a good starting point, but it scored only ~45% on the WebVoyager benchmark because it kept insufficient memory of previous actions and couldn’t do complex reasoning.
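To make the contrast concrete, here’s a rough sketch of that 1.0-style loop. It is purely illustrative and not our actual code: llm() and Browser are placeholder stubs standing in for whatever model calls and browser automation you’d plug in.

    # Illustrative sketch only -- not Skyvern's actual code. llm() and Browser
    # are hypothetical stubs for a model call and a browser automation layer.

    def llm(role: str, **context) -> dict:
        """Stub: send a role-specific prompt plus context to a model, parse a structured reply."""
        raise NotImplementedError("wire up your model provider here")

    class Browser:
        """Stub: a (remotely hosted) browser session."""
        def state(self) -> str: ...                     # current page summary (DOM / screenshot)
        def execute(self, actions: list) -> dict: ...   # click, type, navigate, ...

    def run_task_v1(goal: str, browser: Browser, max_steps: int = 30) -> None:
        # One prompt both decides what to do next and emits the actions,
        # with no separate planning memory or validation step.
        for _ in range(max_steps):
            step = llm("decide-and-act", goal=goal, page=browser.state())
            if step.get("complete"):
                return
            browser.execute(step["actions"])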

We re-built this using a Planner-Actor-Validator agent architecture:

1. Planner - decides which goals to accomplish on a website, and maintains a working memory of the overall goal and progress towards it
2. Actor - given a narrowly scoped goal, executes it on the website and reports back
3. Validator - asserts whether the goal was successfully achieved and passes feedback back to the Actor + Planner
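Here is how the same stubs from the sketch above compose into a Planner-Actor-Validator loop. Again, this is an illustrative sketch of the idea, not the code that produced the benchmark numbers:

    def run_task_v2(overall_goal: str, browser: Browser, max_steps: int = 30) -> list:
        memory: list[dict] = []  # Planner's working memory: subgoals tried + outcomes
        for _ in range(max_steps):
            # Planner: pick the next narrowly scoped goal given overall progress.
            plan = llm("planner", goal=overall_goal, memory=memory, page=browser.state())
            if plan.get("complete"):
                break
            subgoal = plan["subgoal"]

            # Actor: turn the subgoal into concrete actions and execute them.
            actions = llm("actor", subgoal=subgoal, page=browser.state())["actions"]
            result = browser.execute(actions)

            # Validator: assert whether the subgoal succeeded; the verdict goes into
            # memory, steering both the Actor (retries) and the Planner (progress).
            verdict = llm("validator", subgoal=subgoal, result=result, page=browser.state())
            memory.append({"subgoal": subgoal, "result": result, "verdict": verdict})
        return memory

The key design point is that the Planner only ever sees summarized progress (the memory), while the Actor only sees one narrowly scoped goal at a time, which keeps each prompt small and focused.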

We ran the benchmark on Skyvern cloud to test Skyvern 2.0 in a real-world environment – autonomously navigating the web in a remotely hosted browser without any human involvement.

In keeping with our open source mission, we decided to publish the benchmark, our modifications, and the final results for anyone to review. This matters because we’re seeing an increasing trend of companies publishing benchmark numbers with no way to inspect the underlying results, so we’ve made everything public.

[1] Eval dataset: https://github.com/Skyvern-AI/skyvern/tree/main/evaluation/d...
[2] Modifications: https://github.com/Skyvern-AI/skyvern/pull/1576/commits/60dc...
[3] Each run (incl. prompts + responses) can be inspected here: https://eval.skyvern.com/

The full report (incl an architecture diagram) can be found here: https://blog.skyvern.com/skyvern-2-0-state-of-the-art-web-na...

If you’d like to give Skyvern a try, you can grab the open source version (https://github.com/Skyvern-AI/Skyvern) or the cloud version (https://app.skyvern.com/), give it a go, and share any feedback with us. We look forward to all of your comments!
