Mon Dec 23, 2024 6:41am PST
Show HN: A faster and compliant way to scrape the SEC
The SEC rate limits you to 5 requests per second. EDGAR holds more than 16 million submissions, and each submission contains multiple documents; Tesla's 2021 10-K page, for example, lists 14 documents.

Scraping every SEC submission document by document would take about 200 days.
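A quick back-of-the-envelope check of that figure (the ~5.4 documents-per-submission average is my assumption, chosen so the totals line up; it is not an SEC statistic):

```python
# Back-of-the-envelope for the document-by-document approach.
# Assumptions: ~16 million submissions, an average of ~5.4 documents
# per submission (assumed; Tesla's 2021 10-K alone has 14), and the
# 5 requests/second rate limit from above.
SUBMISSIONS = 16_000_000
DOCS_PER_SUBMISSION = 5.4
REQUESTS_PER_SECOND = 5

total_requests = SUBMISSIONS * DOCS_PER_SUBMISSION
days = total_requests / REQUESTS_PER_SECOND / 86_400
print(round(days))  # ~200 days
```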

There is a faster approach. Each submission is stored as a single Standard Generalized Markup Language (SGML) file, so with an SGML parser you only need to download one .txt file per submission. Scraping every SEC submission this way takes about 40 days.
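A minimal sketch of the one-file-per-submission download, assuming the standard EDGAR /Archives/edgar/data/{cik}/{accession}/ layout (the CIK and accession number below are illustrative):

```python
import time
from urllib.request import Request, urlopen

SEC_RATE_LIMIT = 5  # requests per second, per the limit above

def submission_txt_url(cik: int, accession: str) -> str:
    """Build the URL of a full-submission SGML .txt file.

    EDGAR serves every document in a submission concatenated into one
    SGML-wrapped .txt file named after the accession number.
    """
    folder = accession.replace("-", "")
    return (f"https://www.sec.gov/Archives/edgar/data/"
            f"{cik}/{folder}/{accession}.txt")

def fetch_submission(cik: int, accession: str, user_agent: str) -> str:
    # The SEC asks clients to identify themselves with a User-Agent.
    req = Request(submission_txt_url(cik, accession),
                  headers={"User-Agent": user_agent})
    with urlopen(req) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    time.sleep(1 / SEC_RATE_LIMIT)  # crude pacing to stay under the cap
    return body
```

At 5 requests/second, 16 million such files is 16,000,000 / 5 / 86,400 ≈ 37 days, consistent with the ~40-day estimate.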

There is a much faster approach. The SEC publishes daily SGML archives going back to 1994, accessible here: https://www.sec.gov/Archives/edgar/Feed/1995/. Since the SEC caps egress at 30 MB/s and the archives total about 3 TB, downloading every archive takes slightly more than a day.
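The bandwidth math, plus a sketch of one daily-archive URL (the QTR subdirectory and .nc.tar.gz naming are my assumptions inferred from browsing the linked Feed index, not documented guarantees):

```python
# Time to mirror the full daily-archive feed at the stated egress cap.
# Assumptions: ~3 TB of archives, 30 MB/s sustained throughput.
TOTAL_BYTES = 3 * 10**12
BYTES_PER_SECOND = 30 * 10**6
days = TOTAL_BYTES / BYTES_PER_SECOND / 86_400  # ~1.16 days

def feed_archive_url(yyyymmdd: str) -> str:
    """Guess the URL of one daily archive under /Archives/edgar/Feed/.

    Assumed layout: Feed/{year}/QTR{n}/{date}.nc.tar.gz; verify against
    the live directory listing before relying on it.
    """
    year = yyyymmdd[:4]
    quarter = (int(yyyymmdd[4:6]) - 1) // 3 + 1
    return (f"https://www.sec.gov/Archives/edgar/Feed/"
            f"{year}/QTR{quarter}/{yyyymmdd}.nc.tar.gz")
```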

I've written a fast SGML parser using Cython: https://github.com/john-friedman/datamule-python
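For readers who want to see what parsing involves, here is a pure-Python stand-in (my sketch of EDGAR's <DOCUMENT>/<TYPE>/<FILENAME>/<TEXT> wrapping, not the repo's Cython implementation):

```python
import re

def split_documents(submission_text: str):
    """Split an EDGAR full-submission .txt into its component documents.

    EDGAR wraps each document in <DOCUMENT>...</DOCUMENT> tags, with
    <TYPE> and <FILENAME> headers and the content inside <TEXT>.
    """
    docs = []
    for block in re.findall(r"<DOCUMENT>(.*?)</DOCUMENT>",
                            submission_text, re.S):
        type_m = re.search(r"<TYPE>([^\n<]*)", block)
        name_m = re.search(r"<FILENAME>([^\n<]*)", block)
        text_m = re.search(r"<TEXT>(.*?)</TEXT>", block, re.S)
        docs.append({
            "type": type_m.group(1).strip() if type_m else "",
            "filename": name_m.group(1).strip() if name_m else "",
            "text": text_m.group(1) if text_m else "",
        })
    return docs
```

A regex pass like this is slow on millions of multi-megabyte files, which is the motivation for the Cython version.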
