Thurs Mar 14, 2024 4:31pm PST

Show HN: Skyvern – Browser automation using LLMs and computer vision

Hey HN, we're building Skyvern (https://www.skyvern.com), an open-source tool that uses LLMs and computer vision to help companies automate browser-based workflows. You can see some examples here: https://github.com/Skyvern-AI/skyvern#real-world-examples-of... and there's a demo video at https://github.com/Skyvern-AI/skyvern#demo, along with some instructions on running it locally.

We provide a natural-language API to automate the repetitive manual workflows that happen in companies' back offices. You can check out our code and play with Skyvern here: https://github.com/Skyvern-AI/Skyvern

We talked to hundreds of companies about things they do in the background and found that most of them depend on repetitive manual workflows. The breadth of these workflows surprised us – most companies started off doing things manually, and eventually either hired people to scale the manual work, or wrote scripts using Selenium-like browser automation libraries.

In these conversations, one common point stood out: scaling is a pain either way. Companies relying on hiring struggled to adjust team sizes with fluctuating demand. Companies using Selenium and similar tools had a different problem: it can take days or even weeks to get a new workflow automated, and the result then requires ongoing maintenance any time the underlying websites change, because the XPath-based interaction logic suddenly becomes invalid.
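To make the XPath maintenance problem concrete, here is a minimal sketch using only Python's standard library. The two page snippets and the path are made up for illustration; they are not taken from any real site or from Skyvern's code. The same positional path finds the submit button before a redesign and finds nothing after it.

```python
import xml.etree.ElementTree as ET

# Hypothetical page before a redesign.
OLD_PAGE = """
<html><body>
  <div><form><input name="email"/><button>Submit</button></form></div>
</body></html>
"""

# Hypothetical page after a redesign: a banner was added and the form was wrapped.
NEW_PAGE = """
<html><body>
  <div>Banner</div>
  <section><div><form><input name="email"/><button>Submit</button></form></div></section>
</body></html>
"""

# A positional path of the kind hand-written automation scripts tend to hard-code.
POSITIONAL_PATH = "./body/div[1]/form/button"


def find_submit(page: str):
    """Return the element matched by the hard-coded path, or None if it no longer matches."""
    root = ET.fromstring(page.strip())
    return root.find(POSITIONAL_PATH)


old_hit = find_submit(OLD_PAGE)  # matches the Submit button
new_hit = find_submit(NEW_PAGE)  # None: div[1] is now the banner, which has no form
```

The button is still on the page and still says "Submit"; only its position in the tree moved, which is enough to break the script.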

We felt like there was a way to get the best of both worlds with LLMs. We could use LLMs to reason through a website’s layout, while preserving the main advantage of traditional browser automation: it scales alongside demand. This led us to build Skyvern with a few core functionalities:

1. Skyvern can operate on websites it’s never seen before by connecting visible elements with the natural language instructions provided to us. We use a blend of computer vision and DOM parsing to identify a set of possible actions on a website, and multi-modal LLMs to map the natural language instructions to the available actions on the page.

2. Skyvern is resistant to website layout changes, as it doesn’t depend on any predetermined XPaths or other selectors. If a layout ever changes, we can leverage the methodology in #1 to complete the user-specified goal.

3. Skyvern accepts a blob of information when navigating workflows (basically just a JSON blob of whatever information you want to include), and then we use LLMs to map that to information on the screen. For example: if you're generating a quote from Geico, they commonly ask “Were you eligible to drive at 21?”. The answer could be inferred from the driver receiving their license in 2012, and having a birth date of 1996.
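The Geico example in #3 boils down to a small piece of arithmetic that the LLM has to get right. Here is a rough sketch of that inference in plain Python. The field names `birth_year` and `license_year` and the helper `licensed_by_age` are hypothetical, chosen for illustration; Skyvern itself relies on an LLM to do this mapping rather than hand-written rules like these.

```python
# Hypothetical payload; the field names are assumptions, not Skyvern's schema.
payload = {"birth_year": 1996, "license_year": 2012}


def licensed_by_age(data: dict, age: int) -> bool:
    """Return True if the license was issued no later than the year the driver turned `age`."""
    year_turned_age = data["birth_year"] + age
    return data["license_year"] <= year_turned_age


# The form asks "Were you eligible to drive at 21?"
# The driver turned 21 in 1996 + 21 = 2017 and was licensed in 2012.
answer = licensed_by_age(payload, 21)
```

The point is that the form's question never appears in the payload; the answer has to be derived from the fields that do.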

The above strategy adapts well to a number of use cases that Skyvern is helping companies with today: (1) Automating materials procurement by searching for products, adding them to a cart, and purchasing them through vendor websites that don’t have APIs; (2) Registering accounts, filing forms, and searching for information on government websites (e.g. registering franchise tax information for Delaware C-corps); (3) Generating insurance quotes by completing multi-step dynamic forms on insurance websites; (4) Automating the job application process by mapping user-specified information (such as a resume) to a job posting.

And here are some use cases we’re actively looking to expand into: (1) Automating post-checkup data entry with patient data inside medical EHR systems (i.e. submitting billing codes, adding notes, etc.), and (2) Doing customer research ahead of discovery calls by analyzing landing pages and other metadata about a specific business.