- I need to snapshot a web page
- I need to download a thing from one social network and post it to my blog or whatever
- I want to get data from some plain public HTML site with minimal pain
- But I need to automate some API interaction
- No but it is a complicated one from a hostile walled garden
The attention economy of late capitalism demands I spend time clicking on a browser window to do things, rather than automating the world the way we thought we would have worked out by now.
I need to snapshot a web page
wkhtmltopdf and wkhtmltoimage are open source (LGPLv3) command line tools to render HTML into PDF and various image formats using the Qt WebKit rendering engine. These run entirely “headless” and do not require a display or display service.
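A minimal sketch of driving these tools from a script, assuming the binaries are already installed and on your PATH. The function names here are mine, not part of wkhtmltopdf, and I only rely on the basic "input, then output" calling convention; the output image format is taken from the file extension.

```python
import shutil
import subprocess


def snapshot_command(url, output_path, tool="wkhtmltopdf"):
    """Build the argument list for a wkhtmltopdf or wkhtmltoimage run."""
    if tool not in ("wkhtmltopdf", "wkhtmltoimage"):
        raise ValueError(f"unknown tool: {tool}")
    # Both tools take the input (URL or local file) followed by the output path.
    return [tool, url, output_path]


def snapshot(url, output_path, tool="wkhtmltopdf", timeout=120):
    """Render url to output_path; raises if the tool is missing or exits non-zero."""
    if shutil.which(tool) is None:
        raise FileNotFoundError(f"{tool} is not on PATH")
    subprocess.run(snapshot_command(url, output_path, tool),
                   check=True, timeout=timeout)
    return output_path
```

Calling `snapshot("https://example.com/", "example.pdf")` should leave a PDF in the working directory, and passing `tool="wkhtmltoimage"` with a `.png` path should give an image instead.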
I want to get data from some plain public HTML site with minimal pain
There is also a custom cloud service (scrapinghub) that will deploy your scraper for you on a massive scale if you want.
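For a genuinely plain public page you may not need a framework at all. Here is a rough sketch using only the Python standard library that pulls every link and its text out of a page. The class and function names are my own invention, and a real scraper would want politeness delays, error handling and so on.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect (absolute href, link text) pairs from anchor tags."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # href of the anchor we are currently inside, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html, base_url=""):
    """Parse an HTML string and return its links."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def fetch_links(url):
    """Download a page and return its links; needs network access."""
    with urlopen(url, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return extract_links(html, base_url=url)
```

Keeping the parsing in `extract_links` separate from the download in `fetch_links` means you can test the parsing against saved HTML without hammering anyone's server.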
But I need to automate some API interaction
No problem, there are too many automation tools here if anything.
- IFTTT is the classic here, although they’ve been a bunch of cocks recently.
- nodered, the IoT graphical flow-based programming system, includes various boilerplate for internet automation. Here, for example, is a chatbot.
- Huginn is a self-hosted open source IFTTT replacement.
- Botize might be lesser cocks than IFTTT? Their interface is nicer to my eyes.
- Zapier might be lesser cocks than IFTTT? They cost money, but come more highly recommended, e.g. by NGO coryphée Joe Moran.
- trigger-happy is an open source one and it supports pelican.
- conditionalactionprogrammer has hot AI tech to do this, by Microsoft, and an awful name as a reminder of how stupid this whole venture is
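Underneath, most of these services do the same trigger-and-action dance: notice a new thing, then POST some JSON at a webhook. Here is a hedged sketch of that pattern in plain Python. The item shape (a dict with an `id` key), the function names and whatever webhook URL you pass in are all hypothetical and not the API of any service listed above.

```python
import json
from urllib.request import Request, urlopen


def build_webhook_request(url, payload):
    """Wrap a JSON-serialisable payload in a POST request (nothing is sent yet)."""
    body = json.dumps(payload).encode("utf-8")
    return Request(url, data=body,
                   headers={"Content-Type": "application/json"},
                   method="POST")


def post(request, timeout=30):
    """Actually send the request; needs network access."""
    with urlopen(request, timeout=timeout) as response:
        return response.status


def run_rule(items, seen, trigger, webhook_url, send=post):
    """Fire the webhook once for each unseen item that matches the trigger."""
    fired = []
    for item in items:
        if item["id"] in seen or not trigger(item):
            continue
        send(build_webhook_request(webhook_url, item))
        seen.add(item["id"])  # remember it so the next poll does not re-fire
        fired.append(item["id"])
    return fired
```

Run `run_rule` from cron with a persistent `seen` set and you have a very small, very self-hosted automation rule.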
No but it is a complicated one from a hostile walled garden
At this point in history, where we are using billions of dollars of technological infrastructure to perform ritual social behaviour, I find I’d prefer to just pick lice out of the pelts of my audience the old-fashioned way. But maybe this is not an option for you? If so, here is some stuff I read before realising I wasn’t being paid enough.
There are some good tips in karlicoss’s post on data liberation.
Turns out you can automate your local Firefox to do this in an easy way, if not a scalable one, thanks to Ian Bicking. If you want something more full-featured, read on.
Chromeless, which drives headless Chrome, seems to be a hip thing here for certain types of automation. And it has various easy cloud-deployment options.
browserless is a web-service that allows for remote clients to connect, drive, and execute headless work; all inside of docker. It offers first-class integrations for puppeteer, selenium’s webdriver, and a slew of handy REST APIs for doing more common work. On top of all that it takes care of other common issues such as missing system-fonts, missing external libraries, and performance improvements. We even handle edge-cases like downloading files, managing sessions, and have a fully-fledged documentation site.
Selenium is a browser testing and automation tool that seems like it could be made to automate real work on the web. But how can one automate its deployment and a bunch of user credentials with some degree of security and yet the absolute minimum of thought or effort? I do not yet know. To be continued. If absolutely necessary.
- guru99’s tutorials on this
- testing a facebook application using selenium
- Facebook login with selenium
- webdriver docs
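I have not solved the credentials question, but the least-effort option I can think of is to keep logins in environment variables and out of the script. A sketch, assuming the Python `selenium` package (version 4) and Firefox are installed. The `SCRAPER_` variable prefix and the `email`/`pass` form field names are placeholders I made up; inspect the real login form to find the right ones.

```python
import os


def load_credentials(env=os.environ, prefix="SCRAPER_"):
    """Read login details from environment variables rather than source code."""
    try:
        return env[prefix + "USER"], env[prefix + "PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(f"set the environment variable {missing.args[0]}") from None


def make_headless_firefox():
    """Start Firefox without a window; imports selenium only when called."""
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")
    return webdriver.Firefox(options=options)


def log_in(driver, login_url, user, password,
           user_field="email", password_field="pass"):
    """Fill in and submit a login form, locating the inputs by name attribute."""
    driver.get(login_url)
    # "name" is the string value of selenium's By.NAME locator strategy.
    driver.find_element("name", user_field).send_keys(user)
    password_box = driver.find_element("name", password_field)
    password_box.send_keys(password)
    password_box.submit()
```

In use, you would call `make_headless_firefox()`, pass the driver and the result of `load_credentials()` to `log_in`, do your business, and call `driver.quit()` in a `finally` block so stray browsers do not pile up.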
SeLite automates browser navigation and testing. It extends Selenium. It
- improves Selenium (API, syntax and visual interface),
- enables reuse,
- supports reporting and interaction, …
SeLite enables DB-driven navigation with SQLite
You might also get some mileage out of the Firefox CLI, mozrepl.