# sbot

**Simple Archiver Bot** -- a suckless web archiver written in C.

sbot creates self-contained archives of web pages and entire websites.
Every resource -- CSS, images, fonts, scripts -- is fetched and inlined
directly into the HTML as base64 data URIs. The result is a single file
(or directory of files) that renders perfectly offline, with no external
dependencies, forever.

## Why

Web pages disappear. Link rot is real. The average web page has a
half-life of about two years. Bookmarks break, articles vanish,
references evaporate.

sbot solves this by creating archives that are:

- **Self-contained.** Everything is inlined. No external requests needed.
- **Human-readable.** Output is standard HTML. Open it in any browser.
- **Permanent.** No database, no server, no special viewer. Just files.
- **Metadata-rich.** GWTAR headers record provenance, date, and source.

## Modes

### Single Page Archive

```sh
sbot https://example.com/article
```

Archives a single page in **GWTAR format** (Gwern Web Tar Archive). This
is the default mode and the most common use case. The output is one
`.gwtar.html` file containing:

- A GWTAR metadata header (HTML comment) with title, source URL, domain,
  author, archive date, and generator version
- The full HTML with all CSS stylesheets inlined as `<style>` blocks
- All images, fonts, and media encoded as `data:` URIs
- A completely self-contained document that renders identically to the
  original

GWTAR format is ideal for:

- Archiving individual articles, blog posts, and essays
- Preserving references and citations
- Building a personal web archive / digital library
- Saving pages before they disappear behind paywalls or get deleted

### Whole Site Archive

```sh
sbot -r https://example.com
```

Recursively crawls an entire website and archives every page. The output
is a directory tree that mirrors the site structure, with each page saved
as a self-contained HTML file. Internal links are rewritten to relative
paths so navigation works offline.

Features:

- **BFS crawl order.** Breadth-first traversal ensures important pages
  (closer to the root) are archived first; see the sketch after this
  list.
- **Same-domain only.** Never follows links to external sites.
- **robots.txt compliance.** Respects Disallow rules and Crawl-delay
  directives by default. Override with `-R`.
- **Depth control.** Set the maximum crawl depth with `-d` to limit
  scope.
- **Rate limiting.** Configurable delay between requests to be polite to
  servers (default: 1 second).
- **Progress reporting.** Periodic status lines showing pages archived,
  queue depth, and elapsed time.
- **Graceful degradation.** Failed resources are skipped; the crawl
  continues.
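The crawl machinery needs nothing fancier than a FIFO queue and a
visited set. The following minimal, self-contained sketch shows that
loop -- the structure names, capacities, and stubbed archiving step are
illustrative assumptions, not code from `crawl.c`:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative shapes only -- names and capacities are made up. */
struct page { char *url; int depth; };

#define QCAP 4096

static struct page queue[QCAP]; /* FIFO: pop from head, push at tail */
static size_t qhead, qtail;
static char *visited[QCAP];     /* flat visited set, linear lookup */
static size_t nvisited;

static int seen(const char *url)
{
    size_t i;

    for (i = 0; i < nvisited; i++)
        if (strcmp(visited[i], url) == 0)
            return 1;
    return 0;
}

/* Enqueue a URL unless it is too deep, already visited, or the queue
 * is full. depth is the link distance from the root URL. */
static void enqueue(const char *url, int depth, int maxdepth)
{
    if (depth > maxdepth || seen(url) || qtail == QCAP || nvisited == QCAP)
        return;
    visited[nvisited++] = strdup(url);
    queue[qtail].url = strdup(url);
    queue[qtail].depth = depth;
    qtail++;
}

int main(void)
{
    int maxdepth = 5;           /* cf. MAX_DEPTH and the -d flag */

    enqueue("https://example.com/", 0, maxdepth);
    while (qhead < qtail) {     /* BFS: depth d drains before depth d+1 */
        struct page p = queue[qhead++];

        printf("archiving %s (depth %d)\n", p.url, p.depth);
        /* A real crawl would fetch p.url here, archive it, and call
         * enqueue(link, p.depth + 1, maxdepth) for every same-domain
         * link extracted from the page. */
        free(p.url);
    }
    return 0;
}
```

Because the queue drains in FIFO order, every page at depth *d* is
archived before the first page at depth *d + 1*, which is what makes
the `-d` limit a meaningful scope control.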
This mode is ideal for:

- Archiving entire blogs or documentation sites
- Creating offline mirrors of reference material
- Preserving small-to-medium websites wholesale
- Building browseable offline copies of sites you depend on

## Usage

```
usage: sbot [-vrR] [-d depth] [-o dir] [-a author] url

    -v          verbose output
    -r          recursive (crawl entire site)
    -R          ignore robots.txt
    -d depth    max crawl depth (default: 5)
    -o dir      output directory
    -a author   site author name
```

### Examples

```sh
# Archive a single article
sbot https://example.com/blog/post

# Archive with author metadata
sbot -a "John Doe" https://example.com/article

# Crawl a blog, max depth 3
sbot -r -d 3 https://blog.example.com

# Verbose crawl to custom directory
sbot -v -r -o ./my-archive https://docs.example.com

# Crawl ignoring robots.txt restrictions
sbot -r -R https://example.com
```

## GWTAR Format

Every archived page includes a GWTAR (Gwern Web Tar Archive) metadata
header as an HTML comment at the top of the file:

```
<!--
================================================================
GWTAR ARCHIVE
================================================================

Title:         Example Article
Source URL:    https://example.com/article
Domain:        example.com
Author:        John Doe

Archived by:   Kris Yotam
Archived on:   krisyotam.com
Archive date:  2026-02-14

Generator:     sbot/0.3.0
Format:        GWTAR (Gwern Web Tar Archive)

================================================================
-->
```

This header provides full provenance tracking: what was archived, where
it came from, who archived it, and when.

## Resource Inlining

sbot inlines all resources to create truly self-contained archives:

| Resource Type | Inlining Method |
|---------------|-----------------|
| CSS stylesheets | Fetched and inserted as `<style>` blocks |
| Images | Base64-encoded as `data:image/*` URIs |
| Fonts | Base64-encoded as `data:font/*` URIs |
| Other media | Base64-encoded with appropriate MIME type |

Resources that fail to fetch are silently skipped -- the archive
degrades gracefully rather than failing entirely.
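In C terms, inlining a resource reduces to base64-encoding the fetched
bytes and splicing them into a `data:` URI with the resource's MIME
type. The sketch below is illustrative only -- `util.c` has its own
base64 and MIME helpers, and the names here are assumptions:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Minimal base64 encoder -- util.c presumably has its own. */
static char *base64_encode(const unsigned char *src, size_t len)
{
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    size_t olen = 4 * ((len + 2) / 3);
    char *out = malloc(olen + 1);
    char *p = out;
    size_t i;

    if (!out)
        return NULL;
    for (i = 0; i + 2 < len; i += 3) {
        unsigned v = (unsigned)src[i] << 16 | src[i+1] << 8 | src[i+2];
        *p++ = tbl[v >> 18];
        *p++ = tbl[v >> 12 & 63];
        *p++ = tbl[v >> 6 & 63];
        *p++ = tbl[v & 63];
    }
    if (i < len) {              /* 1 or 2 trailing bytes, pad with '=' */
        unsigned v = (unsigned)src[i] << 16;
        if (i + 1 < len)
            v |= src[i+1] << 8;
        *p++ = tbl[v >> 18];
        *p++ = tbl[v >> 12 & 63];
        *p++ = (i + 1 < len) ? tbl[v >> 6 & 63] : '=';
        *p++ = '=';
    }
    *p = '\0';
    return out;
}

/* Build "data:<mime>;base64,<payload>" from raw resource bytes. */
static char *data_uri(const char *mime, const unsigned char *buf, size_t len)
{
    char *b64 = base64_encode(buf, len);
    char *uri;
    size_t n;

    if (!b64)
        return NULL;
    n = strlen("data:;base64,") + strlen(mime) + strlen(b64) + 1;
    uri = malloc(n);
    if (uri)
        snprintf(uri, n, "data:%s;base64,%s", mime, b64);
    free(b64);
    return uri;
}

int main(void)
{
    const unsigned char png[] = { 0x89, 'P', 'N', 'G' };
    char *uri = data_uri("image/png", png, sizeof png);

    if (uri)
        printf("%s\n", uri);    /* data:image/png;base64,iVBORw== */
    free(uri);
    return 0;
}
```

In the archived file, an `<img src="logo.png">` thus becomes
`<img src="data:image/png;base64,...">`, with no network request needed
to render it.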
## Build

Requires `libcurl` development headers.

```sh
# Arch Linux
sudo pacman -S curl

# Debian/Ubuntu
sudo apt install libcurl4-openssl-dev

# Build
make

# Install to /usr/local/bin
sudo make install

# Clean
make clean
```

## Configuration

All configuration is compile-time via `config.h`:

| Setting | Default | Description |
|---------|---------|-------------|
| `USER_AGENT` | `sbot/0.3` | HTTP User-Agent string |
| `CONNECT_TIMEOUT` | 30s | Connection timeout |
| `REQUEST_TIMEOUT` | 60s | Total request timeout |
| `MAX_REDIRECTS` | 10 | Maximum HTTP redirects to follow |
| `MAX_DEPTH` | 5 | Default recursive crawl depth |
| `RATE_LIMIT_MS` | 1000ms | Delay between requests |
| `MAX_FILE_SIZE` | 50 MB | Maximum size per resource |
| `OUTPUT_EXT` | `.gwtar.html` | File extension for archives |

Edit `config.h` and rebuild to change any setting. This is the suckless
way -- no runtime configuration files, no environment variables, no
hidden defaults.

## Architecture

```
archiver.c   Main entry, page archiving, CSS inlining, link rewriting
crawl.c      URL queue (BFS), visited set, URL normalization
fetch.c      HTTP fetching via libcurl
parse.c      HTML parsing, resource extraction, image inlining
robots.c     robots.txt fetching, parsing, rule matching
util.c       Memory wrappers, string ops, base64, MIME types
config.h     Compile-time constants
```

Single external dependency: libcurl. No XML parsers, no HTML5 parsers,
no JavaScript engines. The HTML parsing is deliberately simple --
regex-based extraction of `src`, `href`, and `url()` references. This
handles the vast majority of real-world pages and keeps the codebase
small and auditable.

## Philosophy

sbot follows the [suckless](https://suckless.org) philosophy:

- Written in C99 with POSIX.1-2008
- Minimal dependencies (libcurl only)
- Configuration through `config.h` (edit and recompile)
- Small, readable codebase
- Does one thing well

## License

MIT/X Consortium License. See [LICENSE](LICENSE) for details.