Ask HN: Scaling a targeted web crawler beyond 500M pages/day
I've been reading up on crawler architecture. The two most useful sources I've found are the blog post "Crawling a billion web pages in just over 24 hours, in 2025" and the Mercator paper ("Mercator: A Scalable, Extensible Web Crawler"). Both of these, and most other material I've come across, focus on crawling the broad open web rather than a targeted set of domains. Our use case, tracking product prices, is the latter.

Mercator calls out DNS resolution as a major bottleneck, for example, but when you're only hitting a few hundred domains that isn't really a concern: you can resolve every hostname once up front and serve everything else from memory.
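Roughly, for a fixed domain list, something like this already takes DNS out of the hot path (placeholder names; a real setup would re-resolve periodically to honor TTLs):

    import socket

    # Placeholder domain list; ours is a few hundred entries.
    DOMAINS = ["shop-a.example", "shop-b.example"]

    # hostname -> resolved IPs; refresh periodically to honor TTLs
    dns_cache: dict[str, list[str]] = {}

    def warm_dns_cache() -> None:
        for domain in DOMAINS:
            infos = socket.getaddrinfo(domain, 443, proto=socket.IPPROTO_TCP)
            dns_cache[domain] = sorted({info[4][0] for info in infos})

    warm_dns_cache()  # after this, fetchers never need to touch the resolver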
The other gap is that both assume static HTML. For our use case we need a headless browser, and we also have to get past Cloudflare and similar anti-bot systems.
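For reference, the per-page pattern I'm describing looks roughly like this with Playwright (placeholder URL, and the anti-bot side is deliberately left out; this is a sketch, not our production setup):

    from playwright.sync_api import sync_playwright

    URLS = ["https://shop-a.example/p/123"]  # placeholder product URLs

    with sync_playwright() as p:
        # One long-lived browser; per-page contexts give session isolation
        # without paying browser startup cost on every fetch.
        browser = p.chromium.launch(headless=True)
        for url in URLS:
            context = browser.new_context()
            page = context.new_page()
            page.goto(url, wait_until="networkidle")
            html = page.content()  # rendered DOM, prices included
            # ... price extraction goes here ...
            context.close()
        browser.close()

Even with browser and context reuse, a rendered fetch costs far more CPU and memory than a plain HTTP GET, which is what makes the headless requirement change the scaling picture so much.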
For product prices specifically, a lot of sites publish price feeds, which simplifies things, but plenty don't, and getting good coverage still requires scraping.
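By "price feed" I mean something like a Google Shopping-style RSS feed, which you can consume with no browser at all. A rough sketch (placeholder feed URL):

    import urllib.request
    import xml.etree.ElementTree as ET

    FEED_URL = "https://shop-a.example/feed.xml"  # placeholder
    G = "{http://base.google.com/ns/1.0}"  # Google Merchant namespace

    with urllib.request.urlopen(FEED_URL) as resp:
        root = ET.parse(resp).getroot()

    # RSS 2.0 layout: <rss><channel><item>...</item></channel></rss>
    for item in root.iter("item"):
        print(item.findtext("title"), item.findtext(G + "price"))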
Our current system does about 500M pages/day and we're looking to push beyond that. Does anyone here have experience in this space, or know of articles/blog posts on scaling targeted (rather than broad) crawlers with headless browsers? Any pointers appreciated.