Browse the internet for me

Introducing modern automation to the ancient craft of serfing the web and tilling the clickfarm

The attention economy of late capitalism demands that I spend time clicking on a browser window to do things, rather than automating the world the way we thought we would have worked out by now.

I need to snapshot a web page

wkhtmltopdf

wkhtmltopdf and wkhtmltoimage are open source (LGPLv3) command line tools to render HTML into PDF and various image formats using the Qt WebKit rendering engine. These run entirely “headless” and do not require a display or display service.
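
If you would rather script this than type it, here is a minimal sketch of wrapping the tool from Python. It assumes wkhtmltopdf is installed and on your PATH, and uses only the basic `wkhtmltopdf <input> <output>` invocation; the function names are mine, not part of any library.

```python
import shutil
import subprocess


def build_snapshot_command(url, output_path, binary="wkhtmltopdf"):
    """Return the argument list for the basic `<binary> <url> <output>` call."""
    return [binary, url, output_path]


def snapshot(url, output_path, binary="wkhtmltopdf"):
    """Render `url` to `output_path`; raise if the tool is missing or fails."""
    # Assumption: the binary is installed separately and discoverable on PATH.
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary} not found on PATH")
    # check=True turns a non-zero exit status into CalledProcessError.
    subprocess.run(build_snapshot_command(url, output_path, binary), check=True)
    return output_path
```

Calling snapshot("https://example.com", "example.pdf") should then leave a PDF in the working directory. Passing binary="wkhtmltoimage" with a .png output path is the image equivalent.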

I need to download a thing from one social network and post it to my blog or whatever

You don’t need a web browser for that; just use their API or an automation service.

Of course, Facebook probably doesn’t rank this as highly as something manually uploaded, which you wrote while obediently staring at advertisements, so it’s up to you whether it’s worth your time being a clickmonkey for them.

I want to get data from some plain public HTML site with minimal pain

Scrapy is a Python library to do that. Companion project scrapy-rss converts your scraped items into RSS feeds.
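
To make the scrape-then-syndicate idea concrete, here is a standard-library-only sketch that pulls links out of plain HTML and wraps them in a bare-bones RSS 2.0 feed. It is emphatically not the Scrapy or scrapy-rss API, just an illustration of the shape of the job those tools do properly; the class and function names are hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the anchor currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside an anchor with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def links_to_rss(html, feed_title, feed_link):
    """Return an RSS 2.0 document (as a string) with one item per link."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = "Scraped links"
    for href, text in parser.links:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = text or href
        ET.SubElement(item, "link").text = href
    return ET.tostring(rss, encoding="unicode")
```

Feed it the HTML of a page you fetched by whatever means and you get a feed string back. Anything beyond toy scale (crawling, retries, politeness, relative URLs) is where the real libraries earn their keep.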

Also there is a custom cloud service (scrapinghub) that will deploy it for you on a massive scale if you want.

Wait but I have to log in to get my data and I’m too lazy to configure that

Turns out you can automate your local Firefox to do this in an easy way, although not a scalable one. Thanks, Ian Bicking.

Nah, but I actually need to automate some arbitrary interaction with a javascript-heavy site

Oh dear you aren’t trying to fake being on social media for weaponised mass opinion inception are you? Well, at least that pays, I hope.

At this point in history, where we are using billions of dollars of technological infrastructure to perform ritual social behavior, I find I’d prefer to just pick lice out of the pelts of my audience the old-fashioned way. But maybe this is not an option for you? If so, here is some stuff I read before realising I wasn’t being paid enough.

Nickjs

nickjs is a JavaScript library for browser automation. If this is something you are doing for money, it might be worth your while paying phantombuster to host Nickjs for you. See their explanatory blog post. (TODO check security guarantees)

Chromeless

Chromeless, a library for driving headless Chrome, seems to be a hip thing here for certain types of automation. It also has various easy cloud-deployment options.

Browserless

Browserless is containerized browsers with an API, I think.

browserless is a web-service that allows for remote clients to connect, drive, and execute headless work; all inside of docker. It offers first-class integrations for puppeteer, selenium’s webdriver, and a slew of handy REST APIs for doing more common work. On top of all that it takes care of other common issues such as missing system-fonts, missing external libraries, and performance improvements. We even handle edge-cases like downloading files, managing sessions, and have a fully-fledged documentation site.
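
As a rough sketch of what one of those REST calls might look like from Python: the code below assumes a browserless container listening at http://localhost:3000 and a /pdf endpoint that accepts a JSON body with a "url" field. Both of those are my assumptions, so check the endpoint, port, and any authentication token against their documentation before trusting it. The function names are hypothetical.

```python
import json
import urllib.request


def build_pdf_request(base_url, page_url):
    """Build (but do not send) a POST asking the service to render a PDF."""
    # Assumption: the service exposes POST <base_url>/pdf taking {"url": ...}.
    endpoint = base_url.rstrip("/") + "/pdf"
    body = json.dumps({"url": page_url}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_pdf(base_url, page_url, output_path):
    """Send the request and write the response bytes to `output_path`."""
    request = build_pdf_request(base_url, page_url)
    with urllib.request.urlopen(request) as response:
        data = response.read()
    with open(output_path, "wb") as handle:
        handle.write(data)
    return output_path
```

With a container running, fetch_pdf("http://localhost:3000", "https://example.com", "example.pdf") would be the whole job. Splitting request construction from sending keeps the part you can inspect separate from the part that needs a live service.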

Selenium

Selenium is a browser testing and automation tool that seems like it could be made to automate real work on the web. But how can one automate its deployment and a bunch of user credentials with some degree of security and yet the absolute minimum of thought or effort? I do not yet know. To be continued. If absolutely necessary.

You might also get some mileage out of the Firefox CLI, mozrepl.

iMacros

A commercial offering for Windows, scripting your browser for e.g. data extraction. USD99-USD995 depending on features desired.