Automated URL Archiving

how I set up automated archiving of URLs using the Internet Archive's Wayback Machine and other tools

status: Finished · certainty: certain · importance: 5/10
Without libraries, our history would be like a house without windows or doors.

-- Norman Cousins

Toward a Mindset of Profusion

Since roughly 1945 we have been in the "Age of Information", and as with anything in abundance, its value has dwindled. People take the information at their fingertips for granted, assuming it will always be there when they need it. I, too, once held such a mindset. We are now moving into the recently coined "Age of Intelligence", a reference to the new AI wave, which has massive potential to change the way society at large thinks. People will no longer use search engines, because a chatbot is more convenient, faster, and lower-friction. This is innately dangerous.

These megacorps lure people in under a false sense of comfort, then pull the rug out from under them. By that point, however, most are far too complacent to do anything about it. Information gets banned, modified, and censored, and it is all supposed to be "ok". It is not. That mindset is both ignorant and one of scarcity. The average middle-wage American makes more than enough to afford a couple of 8TB HDDs. That is far more storage than most people would ever need, and enough to take the power back: to be proactive in preserving valuable information that is being taken down, censored, and hidden from the public.

The price of apathy towards public affairs is to be ruled by evil men.

-- Plato

The above sentiment echoes Proverbs 21:25-26 perfectly. Those who refuse to take an active role in preserving valuable information will soon find themselves without it. A recent controversy, for example, is Amazon's handling of the Kindle: banning users without explanation, silently tampering with books customers had already purchased, and otherwise trampling on digital ownership.

Automated URL Archiving

The way I have settled on accomplishing this is a script that runs on my server once a day. It sifts through my .mdx files and collects all the links, then filters out any that belong to this site, producing a list of URLs from external sources needed to preserve the context of an essay. In any decently sized essay it becomes fairly difficult to stay self-contained; with this system and the [doc][/doc] route, however, I can apply principles of progressive enhancement to the essays.

For example, given an essay titled "Spatial Reasoning and Problem Solving in Daubentonia madagascariensis", a reader might wonder several things. What is the definition of spatial reasoning? What is Daubentonia madagascariensis (the aye-aye)? More importantly, who cares about spatial reasoning in the aye-aye? Rather than writing expansive sections on the history of a topic, its major advancements, or the biology of a given species and why it is useful for certain cognitive studies, it is much easier to let my popups handle this via external links to well-written wiki articles, blog posts, or papers that explain the topic in enough depth for the reader to continue the essay. But as dense as essays can get, with field-specific terminology, past research milestones, key literature, etc., these URLs pile up quickly and, in time, are at great risk of link rot.
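The collection step described above can be sketched in a few lines of Go. This is a minimal illustration, not the actual url-archiver.go: the regex, the function names, and the krisyotam.com host check are my own assumptions about how such a filter might look.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
)

// mdLink matches inline Markdown links of the form [text](url)
// and captures the URL portion.
var mdLink = regexp.MustCompile(`\[[^\]]*\]\(([^)\s]+)\)`)

// externalLinks returns every absolute link in the Markdown source
// whose host differs from ownHost, deduplicated in order of appearance.
func externalLinks(src, ownHost string) []string {
	seen := map[string]bool{}
	var out []string
	for _, m := range mdLink.FindAllStringSubmatch(src, -1) {
		u, err := url.Parse(m[1])
		if err != nil || u.Host == "" || u.Host == ownHost {
			continue // skip unparsable, relative, and same-site links
		}
		if !seen[m[1]] {
			seen[m[1]] = true
			out = append(out, m[1])
		}
	}
	return out
}

func main() {
	doc := "See [my note](/blog/x) and [Gwern](https://gwern.net/archiving)."
	fmt.Println(externalLinks(doc, "krisyotam.com"))
	// → [https://gwern.net/archiving]
}
```

Running this once per .mdx file and concatenating the results yields the day's list of archiving candidates; a cron entry handles the "once a day" part.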

See url-archiver.go on GitHub.

On the Automated Archiving of URLs

As mentioned in the previous section, URL pile-up is a serious issue, and not one to be fixed by hand. Of course, I could allocate time to manually sifting through my markdown documents for stray links, then determining key details such as the author, date, and context of each article and the format I want to save it in. Or I could automate the boring stuff. The inspiration for the script was Gwern's link-extractor.hs, a script made specifically to parse Markdown files and identify links to archive.

#!/usr/bin/env runghc
{-# LANGUAGE OverloadedStrings #-}
-- dependencies: libghc-pandoc-dev
--
-- usage: link-extractor.hs [--print-filenames] [file]
-- Prints hyperlinks found in Pandoc Markdown or HTML.
-- Rewrites local anchors to absolute paths.
-- Rewrites interwiki links to full URLs.

module Main where

import Control.Monad (unless)
import Data.List (isSuffixOf)
import qualified Data.Text as T
  (append, head, pack, unlines)
import qualified Data.Text.IO as TIO
  (getContents, readFile, putStr, putStrLn)
import System.Directory (doesFileExist)
import System.Environment (getArgs)
import System.FilePath (takeBaseName)
import Query (extractLinks)

main :: IO ()
main = do
  fs <- getArgs
  let pf = take 1 fs == ["--print-filenames"]
  let fs' = if pf then drop 1 fs else fs
  if null fs'
    then do
      stdin <- TIO.getContents
      let links = extractLinks True stdin
      let links' = if null links
            then extractLinks False stdin
            else links
      mapM_ TIO.putStrLn links'
    else mapM_ (printURLs pf) fs'

printURLs :: Bool -> FilePath -> IO ()
printURLs pf file = do
  exists <- doesFileExist file
  unless exists $
    error $ "File does not exist: " ++ file
  input <- TIO.readFile file
  let links = extractLinks
        (".md" `isSuffixOf` file) input
  -- rewrite "#anchor" to "/page#anchor"
  let links' = map rewrite links
        where
          base = T.pack (takeBaseName file)
          rewrite u
            | T.head u /= '#' = u
            | otherwise =
                "/" `T.append` base `T.append` u
  if pf
    then TIO.putStr $ T.unlines $
      map (\u -> T.pack file
        `T.append` ":" `T.append` u) links'
    else TIO.putStr $ T.unlines links'
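Once the external URLs are collected, submitting each one to the Wayback Machine is a single HTTP request to the public Save Page Now endpoint, https://web.archive.org/save/<url>. The sketch below shows how that submission step might look in Go; the function names and structure are my own assumptions, not taken from url-archiver.go.

```go
package main

import (
	"fmt"
	"net/http"
)

// saveEndpoint builds the Wayback Machine "Save Page Now" URL for u:
// a plain GET to https://web.archive.org/save/<url> asks the
// Internet Archive to crawl and snapshot the page.
func saveEndpoint(u string) string {
	return "https://web.archive.org/save/" + u
}

// archive submits one URL for archiving and reports non-2xx responses.
func archive(client *http.Client, u string) error {
	resp, err := client.Get(saveEndpoint(u))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 400 {
		return fmt.Errorf("save failed for %s: %s", u, resp.Status)
	}
	return nil
}

func main() {
	fmt.Println(saveEndpoint("https://example.org/"))
	// → https://web.archive.org/save/https://example.org/
	// In the real pipeline: build an http.Client with a generous timeout,
	// call archive(client, u) for each collected URL, and sleep a few
	// seconds between requests to be polite to the Archive.
}
```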

Citation
Yotam, Kris · Jun 2025

Yotam, Kris. (Jun 2025). Automated URL Archiving. krisyotam.com. https://krisyotam.com/blog/technology/automated-url-archiving

@article{yotam2025automated-url-archiving,
  title   = "Automated URL Archiving",
  author  = "Yotam, Kris",
  journal = "krisyotam.com",
  year    = "2025",
  month   = "Jun",
  url     = "https://krisyotam.com/blog/technology/automated-url-archiving"
}
