How I set up automated archiving of URLs using the Internet Archive's Wayback Machine and other tools
status: Finished
·
certainty: certain
·
importance: 5/10
Without libraries, our history would be like a house without windows or doors.
— Norman Cousins
Toward a Mindset of Profusion
Since circa 1945, we have been in the "Age of Information", and as with anything in abundance, its value has dwindled. People take the information at their fingertips for granted; they assume it will always be there when they need it. I, too, once held such a mindset. We are now moving into the recently coined "Age of Intelligence", a reference to the new AI wave, which has massive potential to change the way society at large thinks. People will no longer use search engines, because a chatbot is more convenient, faster, and a lower-friction experience.
This is innately dangerous. These megacorps lure people in under a false sense of comfort, then pull the rug out from under them, and by that point most are far too complacent to do anything about it. Bans, silent modification, censorship of information: is it all "ok"? No. This mindset is both ignorant and one of scarcity. The average middle-wage American makes more than enough to afford a couple of 8 TB HDDs. That is naturally far more storage than most would ever need, and enough to take the power back: to be proactive in preserving valuable information that is being taken down, censored, and hidden from the public.
The price of apathy towards public affairs is to be ruled by evil men.
— Plato
The above sentiment echoes Proverbs 21:25-26 perfectly. Those who refuse to take an active role in preserving valuable information will soon find a lack of it. A recent example is the Kindle controversy: Amazon has banned users without explanation, silently tampered with books customers had already purchased, and committed many other offenses against digital ownership.
Automated URL Archiving
The way I have settled on accomplishing this is a script that runs on my server once a day. It sifts through my .mdx files, collects all the links, and filters out any that belong to this site, producing a list of URLs from external sources needed to preserve the context of an essay.
In any decently sized essay it becomes fairly difficult to stay self-contained. With this system and the [doc][/doc] route, however, I can apply principles of progressive enhancement to the essays. For example, given an essay titled "Spatial Reasoning and Problem Solving in Daubentonia madagascariensis", a reader might wonder several things. What is the definition of spatial reasoning? What is Daubentonia madagascariensis (the aye-aye)? More importantly, who cares about spatial reasoning in the aye-aye? Rather than writing expansive sections explaining the history of a topic, its major advancements, or the biological facts of a given species and why it is useful for certain cognitive studies, it is much easier to let my popups handle this via external links to well-written wiki articles, blog posts, or papers that explain the topic in enough depth for the reader to continue the essay.
As dense as essays can get, however, with field-specific terminology, past research milestones, key literature, etc., these URLs lead to a significant pile-up and in time are at great risk of link rot.
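The collect-and-filter step described above can be sketched in a few lines. This is a minimal Python stand-in, not the actual script: it scans Markdown/MDX text for inline link targets and keeps only external URLs. The function names and the `example.com` host are illustrative assumptions.

```python
import re

# Match the URL inside inline Markdown links: [text](url).
# A real pipeline might use a proper Markdown parser instead of a regex.
LINK_RE = re.compile(r"\]\(([^)\s]+)\)")

def extract_links(markdown: str) -> list[str]:
    """Collect every inline link target from a Markdown/MDX document."""
    return LINK_RE.findall(markdown)

def external_only(urls: list[str], site_host: str) -> list[str]:
    """Drop relative paths, anchors, and links back to the site itself."""
    return [u for u in urls
            if u.startswith("http") and site_host not in u]
```

Run daily over every .mdx file, the concatenated output of `external_only` becomes the candidate list handed to the archiver.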
On the Automated Archiving of URLs
As mentioned in the previous section, URL pile-up is a serious issue, and not one to be fixed by hand. Of course, I could allocate time to manually sifting through my markdown documents to find stray links, then determining key information such as the author, date, and context of each article, and the method I want to save it in. Or I could automate the boring stuff. The inspiration for the script was Gwern's link-extractor.hs, a script made specifically for parsing markdown files to identify links to archive.
#!/usr/bin/env runghc
{-# LANGUAGE OverloadedStrings #-}
-- dependencies: libghc-pandoc-dev
-- usage: 'link-extractor.hs [--print-filenames] [file]'; prints out a newline-delimited list of hyperlinks found in
-- targeted Pandoc Markdown .md files (or simple Pandoc-readable HTML .html files) when parsed.
-- Local anchor links are rewritten assuming Gwern.net-style paths of Markdown .md files (ie. a link like `[discriminator ranking](#discriminator-ranking)` in ~/wiki/face.md will be parsed to `/face#discriminator-ranking`). Interwiki links are rewritten to their full URLs.
--
-- If no filename arguments, link-extractor will instead read stdin as Markdown and attempt to parse that instead (falling back to HTML if no URLs are parsed).
-- This makes it easy to pipe in arbitrary sections of pages or annotations, such as `$ xclip -o | runghc -i/home/gwern/wiki/static/build/ /home/gwern/wiki/static/build/link-extractor.hs`.
--
-- Hyperlinks are not necessarily to the WWW but can be internal or interwiki hyperlinks (eg.
-- '/local/file.pdf' or '!W').
-- This reads multiple files and processes them one by one, so is not parallelized, but you can parallelize it at the process level with eg. `parallel --max-args=500 --jobs 30` to use 30 cores, more or less.
module Main where
import Control.Monad (unless)
import Data.List (isSuffixOf)
import qualified Data.Text as T (append, head, pack, unlines)
import qualified Data.Text.IO as TIO (getContents, readFile, putStr, putStrLn)
import System.Directory (doesFileExist)
import System.Environment (getArgs)
import System.FilePath (takeBaseName)
import Query (extractLinks)
-- | Map over the filenames
main :: IO ()
main = do
  fs <- getArgs
  let printfilename = take 1 fs == ["--print-filenames"]
  let fs' = if printfilename then Prelude.drop 1 fs else fs
  if null fs
    then do stdin <- TIO.getContents
            let links = extractLinks True stdin
            let links' = if links /= [] then links else extractLinks False stdin
            mapM_ TIO.putStrLn links'
    else mapM_ (printURLs printfilename) fs'
-- | Read 1 file and print out its URLs
printURLs :: Bool -> FilePath -> IO ()
printURLs printfilename file = do
  exists <- doesFileExist file
  unless exists $ error ("A specified file argument is invalid/does not exist? Arguments: " ++ show printfilename ++ " : " ++ file)
  input <- TIO.readFile file
  let converted = extractLinks (".md" `isSuffixOf` file) input
  -- rewrite self-links like "#discriminator-ranking" → "/face#discriminator-ranking" by prefixing the original Markdown filename's absolute-ized base-name;
  -- this makes frequency counts more informative, eg. for deciding what sections to refactor out into standalone pages (because heavy cross-referencing
  -- *inside* a page is an important indicator of a section being 'too big', just like cross-page references are).
  let converted' = map (\u -> if T.head u /= '#' then u else "/" `T.append` T.pack (takeBaseName file) `T.append` u) converted
  if printfilename
    then TIO.putStr $ T.unlines $ Prelude.map (\url -> T.pack file `T.append` ":" `T.append` url) converted'
    else TIO.putStr $ T.unlines converted'
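Once extracted, the URLs still need to reach the Wayback Machine. Its Save Page Now feature accepts a request of the form `https://web.archive.org/save/<url>`. A Python sketch of the daily submission step, which diffs today's links against a plain-text log of already-archived ones; the log format and function names are my own illustrative assumptions, not part of Gwern's script:

```python
from pathlib import Path

def new_urls(found: list[str], seen_log: Path) -> list[str]:
    """Return only URLs not yet recorded in the plain-text archive log."""
    seen = set(seen_log.read_text().split()) if seen_log.exists() else set()
    return sorted(set(found) - seen)

def save_page_now(url: str) -> str:
    """Build the Wayback Machine Save Page Now request URL for one link."""
    return "https://web.archive.org/save/" + url
```

Each new URL's `save_page_now` address can then be fetched from the daily cron job with any HTTP client (or `curl`), and appended to the log on success so it is never submitted twice.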