Crawling a billion web pages in just over 24 hours
Discussion on r/programming.
tl;dr:
1.005 billion web pages
25.5 hours
$462
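To put those three figures in perspective, here is a rough back-of-the-envelope sketch of the average throughput and unit cost they imply. The function and constant names are illustrative and not from the original post, and the result is only an average over the whole run; it says nothing about peak rates or how the load was spread across machines.

```python
# Headline figures from the tl;dr above.
PAGES = 1_005_000_000   # pages crawled
HOURS = 25.5            # wall-clock duration of the crawl
COST_USD = 462.0        # total cost in US dollars


def pages_per_second(pages: float, hours: float) -> float:
    """Average crawl rate over the whole run, in pages per second."""
    return pages / (hours * 3600)


def cost_per_million_pages(cost_usd: float, pages: float) -> float:
    """Average cost in US dollars for every million pages crawled."""
    return cost_usd / (pages / 1_000_000)


if __name__ == "__main__":
    print(f"{pages_per_second(PAGES, HOURS):,.0f} pages/s")
    print(f"${cost_per_million_pages(COST_USD, PAGES):.2f} per million pages")
```

That works out to roughly 10,900 pages per second on average, at about 46 cents per million pages.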
For some reason, nobody has written in a while about what it takes to crawl a big chunk of the web: the last point of reference I saw was Michael Nielsen's post from 2012.
Obviously lots of things have changed since then. Mostly bigger, better, faster: CPUs have gotten a lot more cores, spinning disks have been replaced by NVMe solid state drives with near-RAM I/O bandwidth, network pipe widths have exploded, EC2 ha...
Read more at andrewkchan.dev