Link rot, mitigating
May 27, 2020
Link rot is the gradual decay of links from your site.
Here are some tricks I use to detect and manage that.
1 Link checking
1.1 hyperlink
hyperlink detects invalid and inefficient links on your site. It works with local files or websites, on the command line and as a Node library. There is also a Netlify plugin, although a remote server seems to me an inconvenient place to do the checking.
Running
hyperlink path/to/index.html --canonicalroot https://deployed.website.com/ -r --internal path/to/index.html
will recursively explore the internal links of your website to ensure internal integrity. It is recommended to make this part of your build pipeline and block on errors, since any error is very likely to be user-facing once your page is deployed.
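hyperlink emits TAP output, and failures show up as lines starting with not ok, so one possible way to block a build on them (a sketch, not an official hyperlink recipe; the paths and URL are the same placeholders as above) is:
hyperlink path/to/index.html \
  --canonicalroot https://deployed.website.com/ -r \
  --internal path/to/index.html | tee linkcheck.tap
# fail the build if any TAP line reports a broken link
if grep -q "^not ok" linkcheck.tap; then
  echo "broken internal links found" >&2
  exit 1
fi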
For this blog, a list of all broken internal links (excluding my problematic relative-links-in-RSS-feed issue) would be
hyperlink public/index.html \
--canonicalroot https://danmackinlay.name/ -r \
--internal --skip index.xml public/index.html |\
grep -v -e "^ok " > linkcheck.tap
Running
hyperlink path/to/index.html --canonicalroot https://deployed.website.com/ -r path/to/index.html
will recursively explore all links of your website, internal and external, to ensure that you aren’t linking to external resources that have been removed or are otherwise failing. It is not recommended to block your build pipeline on a failure of external links, since they are out of your control. Run in this mode in a non-blocking way and fix the errors in the report at your leisure.
“Leisure”? Haha oh you.
I can check external links with hyperlink too, but it is excessively enthusiastic and comes up with lots of errors I do not care about — do I really care about CSS typos in the sites I link to? — plus it triggers anti-DoS throttling on lots of sites, leading to more errors such as HTTP 429 Too Many Requests, ETIMEDOUT and socket hang up. Also, some sites are bad internet citizens and instead of giving you an honest 404 error, they send you to weird spam pages or spurious search pages (I'm looking at you, viagra merchants and Google TensorFlow documentation team). The chaos can be ameliorated by skipping over certain skittish or messy domains and reducing concurrency:
hyperlink public/index.html \
--canonicalroot https://danmackinlay.name/ -r \
--skip index.xml \
--skip https://github.com \
--skip https://cloud.google.com/ \
--skip http://madoko.org/ \
--skip https://www.azimuthproject.org/ \
--skip https://ubuntuforums.org/ \
--skip https://www.tensorflow.org/ \
--skip https://matrix.org/ \
--skip https://summerofcode.withgoogle.com/ \
-c 8 \
public/index.html | grep -v -e "^ok " | tee linkcheck_external.tap
It is still neurotically detailed, though. For this blog, for example, it claims that it needs to run 28352 tests as of 2020-05-20. Easy there, buddy. I am but one blogger. There are not that many links I feel responsible for.
1.2 linkchecker
A classic (which is to say, creaky and old) alternative link checker is linkchecker.
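For reference, the most basic invocation just points it at a deployed site; if I recall correctly, --check-extern additionally probes outbound links:
# check the deployed site's internal links
linkchecker https://danmackinlay.name/
# optionally also probe outbound links (slow)
linkchecker --check-extern https://danmackinlay.name/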
1.3 linkcheck
Fashionable newish linkcheck seems good. Needs a localhost dev server rather than operating on files on disk (which is fine, IMO, since it is simpler and I need such a server anyway).
Here is how to invoke linkcheck for a hugo site:
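Something like the following works, I think, assuming hugo's dev server on its default port 1313:
# serve the site locally (hugo's dev server defaults to port 1313)
hugo server
# in another terminal, point linkcheck at the local server
linkcheck :1313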
Installation on macOS is weird; the binary download did not Work For Me™. A source build does.
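The source route goes through the Dart toolchain, something like this (assuming a Dart SDK is installed; the package is published as linkcheck):
dart pub global activate linkcheck
# older Dart SDKs: pub global activate linkcheck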
1.4 To audition
2 Link archiving
ArchiveBox takes a list of website URLs you want to archive, and creates a local, static, browsable HTML clone of the content from those websites (it saves HTML, JS, media files, PDFs, images and more).
You can use it to preserve access to websites you care about by storing them locally offline. ArchiveBox imports lists of URLs, renders the pages in a headless, authenticated, user-scriptable browser, and then archives the content in multiple redundant common formats (HTML, PDF, PNG, WARC) that will last long after the originals disappear off the internet. It automatically extracts assets and media from pages and saves them in easily-accessible folders, with out-of-the-box support for extracting git repositories, audio, video, subtitles, images, PDFs, and more.
Because modern websites are complicated and often rely on dynamic content, ArchiveBox archives the sites in several different formats beyond what public archiving services like Archive.org and Archive.is are capable of saving. Using multiple methods and the market-dominant browser to execute JS ensures we can save even the most complex, finicky websites in at least a few high-quality, long-term data formats.
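As I understand the ArchiveBox CLI, the basic workflow is to initialise a data directory and then feed it URLs (the paths and URLs here are illustrative):
mkdir ~/archive && cd ~/archive
# set up the ArchiveBox data directory
archivebox init
# archive a single page
archivebox add 'https://example.com/some/fragile/page'
# or a whole list of URLs, one per line
archivebox add < list_of_urls.txt
# browse the resulting local archive in a web UI
archivebox server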
perma.cc is an archiving service focussing on academic citations? 🤷‍♂️
Academic institutions and courts can become registrars of Perma.cc for free, and can provide accounts to their users for free as well.
Organizations (such as law firms, publishers, non-profits and others) or individuals not associated with an academic institution or court are both able to use Perma via paid subscription:
- Organizations can administer unlimited Perma accounts for their users for a monthly flat group rate. These registrar accounts also include collaboration tools and administrative controls.
- Individuals can access Perma via tiered subscriptions that fit their particular needs.
Subscription status does not affect the preservation, access, or visibility of already-made links, just the amount of Perma Links a user can create in a given month.