Web scraping

Turns out we can get information off the internet

2016-06-16 — 2023-01-24

Wherein web pages are parsed for structured data by configurable parsers, the parsed output is converted into RSS feeds, and deployments are orchestrated across cloud services to run the extraction at scale.

Tags: browser, computers are awful together, confidentiality, diy, doing internet, faster pussycat
Figure 1: Services to extract information from web pages.

Some of these use browser automation, although that is kind of its own thing.

1 Scrapy

Scrapy is a Python library for exactly that. The companion project scrapy-rss converts my parsed output into RSS feeds.
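As a sketch of what that looks like: a minimal spider in the style of the official Scrapy tutorial. The target site and CSS selectors (quotes.toscrape.com) are illustrative placeholders rather than anything I actually scrape.

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    """Scrape quotations and follow pagination links."""

    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Each div.quote holds one quotation; yield it as a structured item.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the "next page" link, if any, and parse it the same way.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Running it with `scrapy runspider quotes_spider.py -o quotes.jsonl` dumps the items as JSON lines; scrapy-rss then provides the machinery to emit items like these as an RSS feed instead.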

There is also a dedicated cloud service, Scrapinghub, that will deploy it for you at scale if you want.

Scrapoxy automates spinning up a distributed pool of cloud proxies for this purpose and routing scraper traffic through them.
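On the Scrapy side, pointing a spider at such a proxy front-end is just a matter of setting the `proxy` request meta key, which Scrapy's built-in HttpProxyMiddleware honours. A sketch, assuming a Scrapoxy-style endpoint on localhost port 8888 (the address is an assumption; substitute whatever your deployment exposes):

```python
import scrapy


class ProxiedSpider(scrapy.Spider):
    """Route every request through a single proxy endpoint, e.g. Scrapoxy."""

    name = "proxied"
    start_urls = ["https://example.com/"]

    # Assumed address of the proxy front-end; adjust to your deployment.
    proxy_url = "http://127.0.0.1:8888"

    def start_requests(self):
        for url in self.start_urls:
            # The 'proxy' meta key is picked up by HttpProxyMiddleware.
            yield scrapy.Request(url, meta={"proxy": self.proxy_url})

    def parse(self, response):
        yield {"url": response.url, "status": response.status}
```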

2 Incoming