sbot

Simple web archiver — self-contained GWTAR archives
git clone https://git.krisyotam.com/krisyotam/sbot.git

README.md (6884B)


      1 # sbot
      2 
      3 **Simple Archiver Bot** -- a suckless web archiver written in C.
      4 
      5 sbot creates self-contained archives of web pages and entire websites.
      6 Every resource -- CSS, images, fonts, scripts -- is fetched and inlined
      7 directly into the HTML as base64 data URIs. The result is a single file
      8 (or directory of files) that renders perfectly offline, with no external
      9 dependencies, forever.
     10 
     11 ## Why
     12 
     13 Web pages disappear. Link rot is real. The average web page has a
     14 half-life of about two years. Bookmarks break, articles vanish,
     15 references evaporate.
     16 
     17 sbot solves this by creating archives that are:
     18 
     19 - **Self-contained.** Everything is inlined. No external requests needed.
     20 - **Human-readable.** Output is standard HTML. Open it in any browser.
     21 - **Permanent.** No database, no server, no special viewer. Just files.
     22 - **Metadata-rich.** GWTAR headers record provenance, date, and source.
     23 
     24 ## Modes
     25 
     26 ### Single Page Archive
     27 
     28 ```sh
     29 sbot https://example.com/article
     30 ```
     31 
     32 Archives a single page in **GWTAR format** (Gwern Web Tar Archive). This
     33 is the default mode and the most common use case. The output is one
     34 `.gwtar.html` file containing:
     35 
     36 - A GWTAR metadata header (HTML comment) with title, source URL, domain,
     37   author, archive date, and generator version
     38 - The full HTML with all CSS stylesheets inlined as `<style>` blocks (a minimal splice sketch follows this list)
     39 - All images, fonts, and media encoded as `data:` URIs
     40 - A completely self-contained document that renders identically to the
     41   original
     42 
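In practice the CSS step above is a fetch-and-splice: the stylesheet body replaces its `<link>` tag as an inline `<style>` block. A minimal sketch of that splice using only the standard library (illustrative; the real logic in archiver.c may differ):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Replace the first occurrence of `tag` (a matched <link rel="stylesheet">
 * tag) in `html` with a <style> block containing the fetched `css`.
 * Returns a newly allocated string, or NULL on failure. */
static char *inline_css(const char *html, const char *tag, const char *css)
{
    const char *hit = strstr(html, tag);
    if (!hit)
        return NULL;

    size_t prefix = (size_t)(hit - html);
    size_t newlen = strlen(html) - strlen(tag)
                  + strlen("<style>\n\n</style>") + strlen(css);
    char *out = malloc(newlen + 1);
    if (!out)
        return NULL;

    memcpy(out, html, prefix);
    sprintf(out + prefix, "<style>\n%s\n</style>%s", css, hit + strlen(tag));
    return out;
}

int main(void)
{
    const char *page = "<head><link rel=\"stylesheet\" href=\"a.css\"></head>";
    const char *tag  = "<link rel=\"stylesheet\" href=\"a.css\">";
    const char *css  = "body { margin: 0; }";   /* pretend fetch.c returned this */

    char *archived = inline_css(page, tag, css);
    if (archived) {
        puts(archived);
        free(archived);
    }
    return 0;
}
```
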
     43 GWTAR format is ideal for:
     44 
     45 - Archiving individual articles, blog posts, and essays
     46 - Preserving references and citations
     47 - Building a personal web archive / digital library
     48 - Saving pages before they disappear behind paywalls or get deleted
     49 
     50 ### Whole Site Archive
     51 
     52 ```sh
     53 sbot -r https://example.com
     54 ```
     55 
     56 Recursively crawls an entire website and archives every page. The output
     57 is a directory tree that mirrors the site structure, with each page saved
     58 as a self-contained HTML file. Internal links are rewritten to relative
     59 paths so navigation works offline.
     60 
     61 Features:
     62 
     63 - **BFS crawl order.** Breadth-first traversal ensures important pages
     64   (closer to root) are archived first; a minimal queue sketch follows this list.
     65 - **Same-domain only.** Never follows links to external sites.
     66 - **robots.txt compliance.** Respects Disallow rules and Crawl-delay
     67   directives by default. Override with `-R`.
     68 - **Depth control.** Set maximum crawl depth with `-d` to limit scope.
     69 - **Rate limiting.** Configurable delay between requests to be polite to
     70   servers (default: 1 second).
     71 - **Progress reporting.** Periodic status lines showing pages archived,
     72   queue depth, and elapsed time.
     73 - **Graceful degradation.** Failed resources are skipped; the crawl
     74   continues.
     75 
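The breadth-first ordering above comes down to a plain FIFO queue plus a visited set: the root goes in first, and every newly discovered same-domain link is appended behind the current level. A minimal sketch, with illustrative names rather than crawl.c's actual ones:

```c
/* Sketch: BFS crawl order with a FIFO queue and a visited list. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_URLS 4096

static char *queue[MAX_URLS];     /* FIFO: entries head..tail-1 are pending */
static size_t head, tail;

static char *visited[MAX_URLS];   /* every URL ever enqueued */
static size_t nvisited;

static int seen(const char *url)
{
    for (size_t i = 0; i < nvisited; i++)
        if (strcmp(visited[i], url) == 0)
            return 1;
    return 0;
}

static void enqueue(const char *url)
{
    if (tail == MAX_URLS || seen(url))
        return;
    visited[nvisited++] = queue[tail++] = strdup(url);
}

int main(void)
{
    enqueue("https://example.com/");        /* depth 0: the root page */

    while (head < tail) {
        char *url = queue[head++];
        printf("archiving %s\n", url);
        /* parse the page here and enqueue() every same-domain link;
         * children always land behind the current level, giving BFS order */
    }
    return 0;
}
```
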
     76 This mode is ideal for:
     77 
     78 - Archiving entire blogs or documentation sites
     79 - Creating offline mirrors of reference material
     80 - Preserving small-to-medium websites wholesale
     81 - Building browseable offline copies of sites you depend on
     82 
     83 ## Usage
     84 
     85 ```
     86 usage: sbot [-vrR] [-d depth] [-o dir] [-a author] url
     87 
     88   -v          verbose output
     89   -r          recursive (crawl entire site)
     90   -R          ignore robots.txt
     91   -d depth    max crawl depth (default: 5)
     92   -o dir      output directory
     93   -a author   site author name
     94 ```
     95 
     96 ### Examples
     97 
     98 ```sh
     99 # Archive a single article
    100 sbot https://example.com/blog/post
    101 
    102 # Archive with author metadata
    103 sbot -a "John Doe" https://example.com/article
    104 
    105 # Crawl a blog, max depth 3
    106 sbot -r -d 3 https://blog.example.com
    107 
    108 # Verbose crawl to custom directory
    109 sbot -v -r -o ./my-archive https://docs.example.com
    110 
    111 # Crawl ignoring robots.txt restrictions
    112 sbot -r -R https://example.com
    113 ```
    114 
    115 ## GWTAR Format
    116 
    117 Every archived page includes a GWTAR (Gwern Web Tar Archive) metadata
    118 header as an HTML comment at the top of the file:
    119 
    120 ```
    121 <!--
    122 ================================================================
    123   GWTAR ARCHIVE
    124 ================================================================
    125 
    126   Title:        Example Article
    127   Source URL:   https://example.com/article
    128   Domain:       example.com
    129   Author:       John Doe
    130 
    131   Archived by:  Kris Yotam
    132   Archived on:  krisyotam.com
    133   Archive date: 2026-02-14
    134 
    135   Generator:    sbot/0.3.0
    136   Format:       GWTAR (Gwern Web Tar Archive)
    137 
    138 ================================================================
    139 -->
    140 ```
    141 
    142 This header provides full provenance tracking: what was archived, where
    143 it came from, who archived it, and when.
    144 
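Because the header is an ordinary HTML comment, emitting it is plain formatted output. A minimal sketch of a writer for the fields shown above (the function name and signature are illustrative, not archiver.c's actual API):

```c
#include <stdio.h>

/* Sketch: emit a GWTAR header comment at the top of an archive file. */
static void write_gwtar_header(FILE *out, const char *title,
                               const char *url, const char *domain,
                               const char *author, const char *date)
{
    const char *rule =
        "================================================================";

    fprintf(out, "<!--\n%s\n  GWTAR ARCHIVE\n%s\n\n", rule, rule);
    fprintf(out, "  Title:        %s\n", title);
    fprintf(out, "  Source URL:   %s\n", url);
    fprintf(out, "  Domain:       %s\n", domain);
    fprintf(out, "  Author:       %s\n\n", author);
    fprintf(out, "  Archived by:  Kris Yotam\n");
    fprintf(out, "  Archived on:  krisyotam.com\n");
    fprintf(out, "  Archive date: %s\n\n", date);
    fprintf(out, "  Generator:    sbot/0.3.0\n");
    fprintf(out, "  Format:       GWTAR (Gwern Web Tar Archive)\n\n%s\n-->\n",
            rule);
}

int main(void)
{
    write_gwtar_header(stdout, "Example Article",
                       "https://example.com/article", "example.com",
                       "John Doe", "2026-02-14");
    return 0;
}
```
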
    145 ## Resource Inlining
    146 
    147 sbot inlines all resources to create truly self-contained archives:
    148 
    149 | Resource Type | Inlining Method |
    150 |---------------|-----------------|
    151 | CSS stylesheets | Fetched and inserted as `<style>` blocks |
    152 | Images | Base64-encoded as `data:image/*` URIs |
    153 | Fonts | Base64-encoded as `data:font/*` URIs |
    154 | Other media | Base64-encoded with appropriate MIME type |
    155 
    156 Resources that fail to fetch are silently skipped -- the archive
    157 degrades gracefully rather than failing entirely.
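
The `data:` URI encoding in the table is ordinary base64 plus a MIME prefix. util.c is described as providing base64 and MIME helpers; the standalone sketch below just illustrates the transformation:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const char b64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Encode `len` bytes of `src`; returns a malloc'd NUL-terminated string. */
static char *base64_encode(const unsigned char *src, size_t len)
{
    size_t outlen = 4 * ((len + 2) / 3);
    char *out = malloc(outlen + 1), *p = out;

    if (!out)
        return NULL;
    for (size_t i = 0; i < len; i += 3) {
        unsigned v = src[i] << 16;
        if (i + 1 < len) v |= src[i + 1] << 8;
        if (i + 2 < len) v |= src[i + 2];

        *p++ = b64[(v >> 18) & 63];
        *p++ = b64[(v >> 12) & 63];
        *p++ = i + 1 < len ? b64[(v >> 6) & 63] : '=';
        *p++ = i + 2 < len ? b64[v & 63] : '=';
    }
    *p = '\0';
    return out;
}

/* Build "data:<mime>;base64,<payload>" for an inlined resource. */
static char *data_uri(const char *mime, const unsigned char *buf, size_t len)
{
    char *payload = base64_encode(buf, len);
    if (!payload)
        return NULL;

    char *uri = malloc(strlen("data:;base64,") + strlen(mime)
                       + strlen(payload) + 1);
    if (uri)
        sprintf(uri, "data:%s;base64,%s", mime, payload);
    free(payload);
    return uri;
}

int main(void)
{
    /* Pretend these bytes came back from fetch.c for a tiny image. */
    const unsigned char png[] = { 0x89, 'P', 'N', 'G' };
    char *uri = data_uri("image/png", png, sizeof png);

    if (uri) {
        printf("%s\n", uri);   /* data:image/png;base64,iVBORw== */
        free(uri);
    }
    return 0;
}
```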
    158 
    159 ## Build
    160 
    161 Requires `libcurl` development headers.
    162 
    163 ```sh
    164 # Arch Linux
    165 sudo pacman -S curl
    166 
    167 # Debian/Ubuntu
    168 sudo apt install libcurl4-openssl-dev
    169 
    170 # Build
    171 make
    172 
    173 # Install to /usr/local/bin
    174 sudo make install
    175 
    176 # Clean
    177 make clean
    178 ```
    179 
    180 ## Configuration
    181 
    182 All configuration is compile-time via `config.h`:
    183 
    184 | Setting | Default | Description |
    185 |---------|---------|-------------|
    186 | `USER_AGENT` | `sbot/0.3` | HTTP User-Agent string |
    187 | `CONNECT_TIMEOUT` | 30s | Connection timeout |
    188 | `REQUEST_TIMEOUT` | 60s | Total request timeout |
    189 | `MAX_REDIRECTS` | 10 | Maximum HTTP redirects to follow |
    190 | `MAX_DEPTH` | 5 | Default recursive crawl depth |
    191 | `RATE_LIMIT_MS` | 1000ms | Delay between requests |
    192 | `MAX_FILE_SIZE` | 50 MB | Maximum size per resource |
    193 | `OUTPUT_EXT` | `.gwtar.html` | File extension for archives |
    194 
    195 Edit `config.h` and rebuild to change any setting. This is the suckless
    196 way -- no runtime configuration files, no environment variables, no
    197 hidden defaults.
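
Based on the table above, a `config.h` would look roughly like this (values taken from the table; the actual header may differ in names and detail):

```c
/* config.h -- compile-time configuration (sketch based on the table above) */

#define USER_AGENT       "sbot/0.3"           /* HTTP User-Agent string        */
#define CONNECT_TIMEOUT  30L                  /* connection timeout, seconds   */
#define REQUEST_TIMEOUT  60L                  /* total request timeout, seconds*/
#define MAX_REDIRECTS    10L                  /* HTTP redirects to follow      */
#define MAX_DEPTH        5                    /* default recursive crawl depth */
#define RATE_LIMIT_MS    1000                 /* delay between requests, ms    */
#define MAX_FILE_SIZE    (50L * 1024 * 1024)  /* per-resource cap, bytes       */
#define OUTPUT_EXT       ".gwtar.html"        /* archive file extension        */
```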
    198 
    199 ## Architecture
    200 
    201 ```
    202 archiver.c   Main entry, page archiving, CSS inlining, link rewriting
    203 crawl.c      URL queue (BFS), visited set, URL normalization
    204 fetch.c      HTTP fetching via libcurl
    205 parse.c      HTML parsing, resource extraction, image inlining
    206 robots.c     robots.txt fetching, parsing, rule matching
    207 util.c       Memory wrappers, string ops, base64, MIME types
    208 config.h     Compile-time constants
    209 ```
    210 
    211 Single external dependency: libcurl. No XML parsers, no HTML5 parsers,
    212 no JavaScript engines. The HTML parsing is deliberately simple --
    213 regex-based extraction of `src`, `href`, and `url()` references. This
    214 handles the vast majority of real-world pages and keeps the codebase
    215 small and auditable.
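
In that spirit, extracting `src` and `href` references needs nothing more than POSIX regular expressions; `url()` values in CSS can be handled the same way. A minimal sketch (parse.c's actual patterns and quote handling may differ):

```c
#include <regex.h>
#include <stdio.h>

int main(void)
{
    const char *html =
        "<img src=\"logo.png\"><a href=\"/about\">about</a>"
        "<link href=\"style.css\" rel=\"stylesheet\">";

    regex_t re;
    /* capture group 2 is the quoted URL following src= or href= */
    if (regcomp(&re, "(src|href)=\"([^\"]*)\"", REG_EXTENDED | REG_ICASE) != 0)
        return 1;

    const char *p = html;
    regmatch_t g[3];
    while (regexec(&re, p, 3, g, 0) == 0) {
        printf("%.*s\n", (int)(g[2].rm_eo - g[2].rm_so), p + g[2].rm_so);
        p += g[0].rm_eo;               /* continue scanning after this match */
    }

    regfree(&re);
    return 0;
}
```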
    216 
    217 ## Philosophy
    218 
    219 sbot follows the [suckless](https://suckless.org) philosophy:
    220 
    221 - Written in C99 with POSIX.1-2008
    222 - Minimal dependencies (libcurl only)
    223 - Configuration through `config.h` (edit and recompile)
    224 - Small, readable codebase
    225 - Does one thing well
    226 
    227 ## License
    228 
    229 MIT/X Consortium License. See [LICENSE](LICENSE) for details.