I'm happy to open-source AI Employe: the first reliable GPT-4-Vision-powered browser automation, which outperforms Adept.ai in our comparisons.
Demo 1: Automatically log your budget from email into your expense tracker
https://www.loom.com/share/f8dbe36b7e824e8c9b5e96772826de03
Demo 2: Automatically log details from a PDF receipt into your expense tracker
https://www.loom.com/share/2caf488bbb76411993f9a7cdfeb80cd7
Comparison with Adept.ai
https://www.loom.com/share/27d1f8983572429a8a08efdb2c336fe8
Our stack
Next.js, Rust, Postgres, MeiliSearch, and Firebase for authentication.
How it Works
Current browser agents suffer from several problems. Here, we explain those problems and how we solved them.
Problem 1: Finding the Right Element
There are several known techniques for this: sending a shortened form of the HTML to GPT-3, drawing bounding boxes with IDs on a screenshot and sending it to GPT-4-Vision, or asking GPT-4-Vision directly for the X and Y coordinates of the element. None of these proved reliable; they all led to hallucinations.
To address this, we developed a new technique: we index the entire DOM in MeiliSearch, and GPT-4-Vision generates commands describing which element's inner text to click, copy, or otherwise act on. We then search the index with the generated text, retrieve the matching element's ID, and send it back to the browser to perform the action. This approach has a few limitations, but we have implemented techniques to overcome some of them, such as disambiguating when the same text appears in multiple elements, or clicking on icons that have no text (still a work in progress).
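The lookup step can be sketched as follows. This is a simplified in-memory stand-in for the MeiliSearch index: the element IDs, the command shape, and the token-overlap scoring are illustrative assumptions, while the real system relies on MeiliSearch's typo-tolerant full-text ranking.

```python
from dataclasses import dataclass

@dataclass
class DomElement:
    element_id: str  # ID assigned to the element when the DOM was indexed
    inner_text: str  # visible text of the element

def index_dom(elements):
    """Build a tiny in-memory 'index' of DOM elements (stand-in for MeiliSearch)."""
    return list(elements)

def find_element(index, query):
    """Return the ID of the element whose inner text best matches the
    model's command. MeiliSearch does typo-tolerant ranking; here we
    approximate it with a simple token-overlap score."""
    q_tokens = set(query.lower().split())

    def score(el):
        t_tokens = set(el.inner_text.lower().split())
        return len(q_tokens & t_tokens) / (len(t_tokens) or 1)

    best = max(index, key=score)
    return best.element_id if score(best) > 0 else None

# GPT-4-Vision emits something like: {"action": "click", "text": "Sign in"}.
# We search the index with that text and hand the element ID to the browser.
index = index_dom([
    DomElement("el-17", "Sign in"),
    DomElement("el-42", "Create account"),
])
print(find_element(index, "sign in"))  # → el-17
```

The key design point is that the model never guesses coordinates or raw selectors; it only produces human-readable text, and the search index resolves that text to a concrete element.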
Intuitively, imagine guiding your grandmother through a website: you might say, "Click on the button that says 'sign in'." We are doing something similar here.
Problem 2: GPT Derailing from Workflow
To prevent GPT from derailing from tasks, we use a technique akin to retrieval-augmented generation, which we call Actions Augmented Generation. When a user creates a workflow, we don't record the screen, microphone, or camera; we only record the DOM element changes for every action (clicking, typing, etc.) the user takes. We then use the workflow title, objective, and recorded actions to generate a set of tasks. Whenever we execute a task, we embed all the actions the user took on that particular domain into the prompt. This keeps GPT on track: even if the user provides only a brief title and objective, their recorded actions guide GPT to complete the task.
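The recording and prompt-embedding steps can be sketched like this. The function names, action schema, and prompt wording are illustrative assumptions, not the actual implementation; the point is that recorded per-domain actions are serialized into the prompt for every task execution.

```python
import json

def record_action(log, domain, action, element_text, value=None):
    """Append one DOM-level action (click, type, ...) observed while the
    user demonstrated the workflow. No screen/mic/camera is recorded —
    only the DOM changes the action produced."""
    log.setdefault(domain, []).append(
        {"action": action, "element": element_text, "value": value}
    )

def build_prompt(title, objective, domain, log):
    """Embed the user's recorded actions for this domain into the prompt,
    anchoring the model to the demonstrated workflow."""
    actions = log.get(domain, [])
    return (
        f"Workflow: {title}\n"
        f"Objective: {objective}\n"
        f"Actions the user previously took on {domain}:\n"
        + "\n".join(json.dumps(a) for a in actions)
        + "\nContinue the workflow, staying consistent with these actions."
    )

# Example: the user demonstrated adding an expense once; later runs
# embed that demonstration in every prompt.
log = {}
record_action(log, "tracker.example.com", "click", "Add expense")
record_action(log, "tracker.example.com", "type", "Amount", "42.00")
print(build_prompt("Log receipt", "Copy a PDF receipt into the tracker",
                   "tracker.example.com", log))
```

Because the demonstration is replayed into the context on every step, a vague objective like "log receipt" still resolves to the concrete buttons and fields the user actually touched.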
We'd love to hear your feedback.