Sonata is a tool I've been working on to make web scraping easier. The idea is to make web scraping declarative, i.e. to provide a service that lets you say "get me this data from these URLs", rather than imperative, i.e. writing a Puppeteer script that spells out the steps to do it.
What this means in practice is that you give Sonata a few URLs and a JSON schema describing the data you want, and under the hood we use LLMs to turn your input into a compiled scraper (basically a Python script) that captures your data from those URLs. You can then run it on other similar URLs. For example, if you wanted to scrape product information from an ecommerce site, you could give it three URLs from that site plus a JSON schema, and then use the resulting scraper on other URLs from the same site.
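To make that concrete, here's roughly what the schema for that product example might look like (shown as a Python dict; the field names are made up for illustration, not a fixed format Sonata requires):

    # Illustrative JSON schema for product data; field names are hypothetical.
    product_schema = {
        "type": "object",
        "properties": {
            "name":     {"type": "string"},
            "price":    {"type": "number"},
            "in_stock": {"type": "boolean"},
        },
        "required": ["name", "price"],
    }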
The advantages of this approach are that:
- You don't have to faff about with Puppeteer, hand-written scraping scripts, etc.
- The scrapers we generate are self-healing, i.e. if the website changes we recompile the scraper for you without you having to worry about it.
- Compiling a scraper takes a few minutes, but once it's compiled there's no waiting on LLMs; it's as fast as regular Python + HTTP calls (see the sketch after this list).
- We also handle proxies, scheduling, and all the other normal scraping stuff.
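To give a flavour of what "compiled scraper" means, here's a minimal sketch of that kind of script, assuming a requests + BeautifulSoup stack and invented CSS selectors (the real compiled output will differ site to site):

    # Minimal sketch of a "compiled scraper": plain Python + HTTP calls,
    # no LLM in the loop at runtime. Selectors are invented for illustration.
    import requests
    from bs4 import BeautifulSoup

    def scrape_product(url: str) -> dict:
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        return {
            "name": soup.select_one("h1.product-title").get_text(strip=True),
            "price": float(soup.select_one("span.price").get_text(strip=True).lstrip("$")),
            "in_stock": soup.select_one("button.add-to-cart") is not None,
        }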
We have a few users at the moment, but we'd love feedback on the value prop and whether this sounds useful to you!
Thanks,
Cameron