sbot

Simple web archiver — self-contained GWTAR archives
git clone https://git.krisyotam.com/krisyotam/sbot.git

# sbot — CLAUDE.md

## Project

sbot (Simple Archiver Bot) is a suckless web archiver written in C. It
creates self-contained archives of websites with all resources (CSS,
images, fonts) inlined as data URIs. It supports single-page archival in
GWTAR (Gwern Web Tar Archive) format and recursive whole-site archival
with a navigable directory structure.

## Coding Standards — Suckless C Style

All code in this project MUST follow the suckless.org coding style:

### Language
- C99 (ISO/IEC 9899:1999), no extensions
- POSIX.1-2008 (`_POSIX_C_SOURCE 200809L`)

### Indentation & Whitespace
- Tabs for indentation (1 tab = 1 level)
- Spaces for alignment only, never for indentation
- No tabs except at the beginning of a line
- Maximum line length: 79 characters

### Comments
- Use `/* */` only, never `//`
- Comment fallthrough cases in switch statements

### Variables
- All declarations at the top of the block
- Pointer `*` adjacent to variable name: `char *p`, not `char* p`
- No C99 `bool`; use `int` (0/1)
- Variables not used outside their translation unit must be `static`
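
As a rough illustration of the variable rules above, the sketch below uses a hypothetical `url_is_http()` helper and a file-local `schemes` table. Neither name is taken from the sbot source; they exist only to show the layout.

```c
#include <string.h>

/* Hypothetical table; static because nothing outside this file uses it. */
static const char *schemes[] = { "http://", "https://" };

/* Returns 1 if url starts with a known scheme, else 0 (int, not bool). */
int
url_is_http(const char *url)
{
	/* all declarations come first, with `*` next to the name */
	const char *s;
	size_t i, n;
	int found;

	found = 0;
	for (i = 0; i < sizeof(schemes) / sizeof(schemes[0]); i++) {
		s = schemes[i];
		n = strlen(s);
		if (strncmp(url, s, n) == 0)
			found = 1;
	}
	return found;
}
```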

### Functions
- Return type on its own line
- Function name at column 0 on next line (enables `grep ^funcname`)
- Opening `{` on its own line for functions
- Functions not used outside their file: `static`

```c
static void
usage(void)
{
	fprintf(stderr, "usage: sbot [-v] [-r] url\n");
	exit(1);
}
```

### Braces
- Opening `{` on same line for control flow (if, for, while, switch)
- Closing `}` on its own line unless continuing (else, do-while)
- Use braces even for single statements when sibling branches use them

### Naming
- lowercase_with_underscores for functions and variables
- UPPERCASE for macros and constants
- CamelCase for typedef'd struct types
- No `_t` suffix (reserved by POSIX)
- Prefix module functions with module name

### Control Flow
- Space after `if`, `for`, `while`, `switch`
- No space after `(` or before `)`
- Use `goto` for cleanup/unwind, not nested ifs
- Return/exit early on failure
- Test against 0, not -1: `if (func() < 0)`

### Error Handling
- Check every allocation; `goto` cleanup on failure
- `die()` for fatal errors (prints message, exits)
- `warn()` for recoverable errors (prints, continues)
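
The control-flow and error-handling rules above could look roughly like the sketch below. The printf-style signatures of `die()` and `warn()` and the "sbot: " message prefix are assumptions, and `file_read()` is a hypothetical helper invented for this example; the real util.c may differ.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Assumed signature: print a message and keep going. */
void
warn(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	fputs("sbot: ", stderr);
	vfprintf(stderr, fmt, ap);
	fputc('\n', stderr);
	va_end(ap);
}

/* Assumed signature: print a message and exit with status 1. */
void
die(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	fputs("sbot: ", stderr);
	vfprintf(stderr, fmt, ap);
	fputc('\n', stderr);
	va_end(ap);
	exit(1);
}

/* Hypothetical helper: read a whole file into a malloc'd, */
/* NUL-terminated buffer. Returns NULL on any failure. */
char *
file_read(const char *path, size_t *len)
{
	FILE *fp;
	char *buf = NULL;
	char *ret = NULL;
	long size;

	if (!(fp = fopen(path, "rb"))) {
		warn("cannot open %s", path);
		return NULL; /* early return: nothing to clean up yet */
	}
	if (fseek(fp, 0, SEEK_END) < 0)
		goto cleanup;
	if ((size = ftell(fp)) < 0)
		goto cleanup;
	rewind(fp);
	if (!(buf = malloc((size_t)size + 1)))
		goto cleanup;
	if (fread(buf, 1, (size_t)size, fp) != (size_t)size)
		goto cleanup;
	buf[size] = '\0';
	*len = (size_t)size;
	ret = buf;
	buf = NULL; /* ownership passes to the caller */
cleanup:
	free(buf); /* no-op on success, releases the buffer on failure */
	fclose(fp);
	return ret;
}
```

Every failure path after `fopen()` jumps to the single `cleanup` label, so the file handle and buffer are released in one place instead of in nested `if` blocks.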

### File Organization Order
1. License header
2. System includes (alphabetical)
3. Local includes
4. Macros
5. Type definitions
6. Function declarations
7. Global variables
8. Function definitions (same order as declarations)
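
A file laid out in that order might look like the skeleton below. It borrows the `reslist_` prefix from the Parser module, but the `ResList` type, its fields, the two constants, and all three functions are assumptions made for illustration, not the contents of parse.c.

```c
/* 1. License header */
/* See LICENSE file for copyright and license details. */

/* 2. System includes (alphabetical) */
#include <stdlib.h>
#include <string.h>

/* 3. Local includes would follow here, after a blank line */

/* 4. Macros */
#define RESLIST_INIT_CAP 8

/* 5. Type definitions (CamelCase, no _t suffix) */
typedef struct {
	char **urls;
	size_t len;
	size_t cap;
} ResList;

/* 6. Function declarations */
int reslist_add(ResList *rl, const char *url);
void reslist_free(ResList *rl);
static int reslist_grow(ResList *rl);

/* 7. Global variables */
static const size_t reslist_max = 4096; /* assumed upper bound */

/* 8. Function definitions (same order as declarations) */
int
reslist_add(ResList *rl, const char *url)
{
	char *copy;
	size_t n;

	if (rl->len >= reslist_max)
		return -1;
	if (rl->len == rl->cap && reslist_grow(rl) < 0)
		return -1;
	n = strlen(url) + 1;
	if (!(copy = malloc(n)))
		return -1;
	memcpy(copy, url, n);
	rl->urls[rl->len++] = copy;
	return 0;
}

void
reslist_free(ResList *rl)
{
	size_t i;

	for (i = 0; i < rl->len; i++)
		free(rl->urls[i]);
	free(rl->urls);
	rl->urls = NULL;
	rl->len = rl->cap = 0;
}

static int
reslist_grow(ResList *rl)
{
	char **tmp;
	size_t cap;

	cap = rl->cap ? rl->cap * 2 : RESLIST_INIT_CAP;
	if (!(tmp = realloc(rl->urls, cap * sizeof(*tmp))))
		return -1;
	rl->urls = tmp;
	rl->cap = cap;
	return 0;
}
```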

### Headers
- System headers first, alphabetical
- Local headers after blank line
- No cyclic dependencies
- Include only what is needed

## Architecture

### Module Layout

| Module | Prefix | File | Responsibility |
|--------|--------|------|----------------|
| Main | — | archiver.c | Entry point, page archiving, CSS inlining, link rewriting, crawl orchestration |
| Crawler | `queue_`, `visited_` | crawl.c | URL queue (BFS), visited set, URL normalization, path conversion |
| Fetcher | `fetch_` | fetch.c | HTTP fetching via libcurl, response management |
| Parser | `reslist_`, `parse_` | parse.c | HTML parsing, resource extraction, image inlining |
| Robots | `robots_` | robots.c | robots.txt fetching, parsing, and rule matching |
| Detect | `detect_`, `siteinfo_` | detect.c | CMS/framework detection (WordPress, Blogger, Hugo, Jekyll, Ghost, Drupal, MediaWiki) |
| Utilities | `die`, `warn`, `x*`, `str_*`, `url_*` | util.c | Memory wrappers, string ops, URL helpers, base64, MIME types |
| Config | — | config.h | Compile-time constants (timeouts, limits, user agent) |

### Architecture Rules
- **Separate compilation.** Every .c file compiles independently.
- **No dynamic loading.** All features compiled in.
- **libcurl only.** Single external dependency for HTTP.
- **No `system()` calls.** Direct file I/O and libcurl only.
- **Data URIs for inlining.** Resources encoded as base64 data URIs.
- **Stateless functions preferred.** Minimize mutable global state.
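
To make the data URI rule concrete, here is a minimal sketch of how a fetched resource could be turned into a `data:<mime>;base64,<payload>` string. The function name `data_uri()` and its signature are assumptions for this example; sbot's own base64 and MIME helpers in util.c may be organized differently.

```c
#include <stdlib.h>
#include <string.h>

static const char b64tab[] =
	"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Hypothetical helper: returns a malloc'd data URI, or NULL. */
char *
data_uri(const char *mime, const unsigned char *src, size_t len)
{
	static const char mid[] = ";base64,";
	size_t mlen, i, o;
	unsigned long v;
	char *out;

	mlen = strlen(mime);
	/* "data:" + mime + ";base64," + 4 chars per 3 bytes + NUL */
	out = malloc(5 + mlen + sizeof(mid) - 1 + 4 * ((len + 2) / 3) + 1);
	if (!out)
		return NULL;
	memcpy(out, "data:", 5);
	memcpy(out + 5, mime, mlen);
	memcpy(out + 5 + mlen, mid, sizeof(mid) - 1);
	o = 5 + mlen + sizeof(mid) - 1;
	for (i = 0; i < len; i += 3) {
		/* pack up to three bytes into one 24-bit group */
		v = (unsigned long)src[i] << 16;
		if (i + 1 < len)
			v |= (unsigned long)src[i + 1] << 8;
		if (i + 2 < len)
			v |= src[i + 2];
		/* emit four 6-bit symbols, padding short groups with '=' */
		out[o++] = b64tab[(v >> 18) & 63];
		out[o++] = b64tab[(v >> 12) & 63];
		out[o++] = i + 1 < len ? b64tab[(v >> 6) & 63] : '=';
		out[o++] = i + 2 < len ? b64tab[v & 63] : '=';
	}
	out[o] = '\0';
	return out;
}
```

The caller owns the returned string and frees it once the rewritten `src` or `url(...)` attribute has been written out. Because the helper takes all of its inputs as arguments and keeps no state, it also fits the preference for stateless functions.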

### Crawler Design Principles
- **BFS traversal.** URL queue processes breadth-first by depth level.
- **Same-domain only.** Never follow links to external domains.
- **Politeness.** Rate limiting between requests (configurable).
- **Depth control.** Hard limit on crawl depth to prevent runaway.
- **URL normalization.** Canonical form for deduplication.
- **Graceful degradation.** Skip failed resources, continue crawling.
- **robots.txt compliance.** Respects Disallow/Allow rules and Crawl-delay.
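
The BFS, deduplication, and depth-control principles can be sketched with a single linked list that doubles as the visited set. This is only one possible design: the `Queue` and `QueueItem` types, the `MAX_DEPTH` value, and the function signatures are assumptions, and crawl.c keeps a separate `visited_` structure according to the module table. URLs are assumed to be normalized before they reach `queue_push()`.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_DEPTH 3 /* assumed value; the real limit lives in config.h */

typedef struct QueueItem QueueItem;
struct QueueItem {
	char *url;
	int depth;
	QueueItem *next;
};

typedef struct {
	QueueItem *first; /* every URL ever accepted (the visited set) */
	QueueItem *last;
	QueueItem *cur;   /* next URL to crawl */
} Queue;

/* Returns 1 if queued, 0 if skipped (seen or too deep), -1 on error. */
int
queue_push(Queue *q, const char *url, int depth)
{
	QueueItem *it;
	size_t n;

	if (depth > MAX_DEPTH)
		return 0;
	for (it = q->first; it; it = it->next)
		if (strcmp(it->url, url) == 0)
			return 0;
	if (!(it = malloc(sizeof(*it))))
		return -1;
	n = strlen(url) + 1;
	if (!(it->url = malloc(n))) {
		free(it);
		return -1;
	}
	memcpy(it->url, url, n);
	it->depth = depth;
	it->next = NULL;
	if (q->last)
		q->last->next = it;
	else
		q->first = it;
	q->last = it;
	if (!q->cur)
		q->cur = it;
	return 1;
}

/* Returns the next item in FIFO order, or NULL when the queue is empty. */
const QueueItem *
queue_pop(Queue *q)
{
	QueueItem *it;

	it = q->cur;
	if (it)
		q->cur = it->next;
	return it;
}

void
queue_free(Queue *q)
{
	QueueItem *it, *next;

	for (it = q->first; it; it = next) {
		next = it->next;
		free(it->url);
		free(it);
	}
	q->first = q->last = q->cur = NULL;
}
```

Because items are appended at the tail and consumed from `cur`, pages are visited in order of discovery, which gives breadth-first traversal by depth level. The linear duplicate scan keeps the sketch short; a hash set would be the obvious replacement for large sites.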

## Build

```sh
make            # build sbot binary
make clean      # remove build artifacts
make install    # install to /usr/local/bin
```

Dependencies: `libcurl` (via pkg-config)

## Usage

```sh
# Single page archive (GWTAR format)
sbot https://example.com/article

# Whole site (recursive, depth 3)
sbot -r -d 3 https://example.com

# Verbose with custom output dir
sbot -v -r -o ./archive https://example.com
```

## Git Conventions

- No `Co-Authored-By: Claude` lines
- Commit messages: imperative mood, under 72 characters, no trailing period
- One logical change per commit