# sbot — CLAUDE.md

## Project

sbot (Simple Archiver Bot) is a suckless web archiver written in C. It
creates self-contained archives of websites with all resources (CSS,
images, fonts) inlined as data URIs. It supports single-page archival in
GWTAR (Gwern Web Tar Archive) format and recursive whole-site archival
with a navigable directory structure.

## Coding Standards — Suckless C Style

All code in this project MUST follow the suckless.org coding style:

### Language
- C99 (ISO/IEC 9899:1999), no extensions
- POSIX.1-2008 (`_POSIX_C_SOURCE 200809L`)

### Indentation & Whitespace
- Tabs for indentation (1 tab = 1 level)
- Spaces for alignment only, never for indentation
- No tabs except at the beginning of a line
- Maximum line length: 79 characters

### Comments
- Use `/* */` only, never `//`
- Comment fallthrough cases in switch statements

### Variables
- All declarations at the top of the block
- Pointer `*` adjacent to the variable name: `char *p`, not `char* p`
- No C99 `bool`; use `int` (0/1)
- Globals/statics not used outside their translation unit must be `static`

### Functions
- Return type on its own line
- Function name at column 0 on the next line (enables `grep ^funcname`)
- Opening `{` on its own line for functions
- Functions not used outside their file: `static`

```c
static void
usage(void)
{
	fprintf(stderr, "usage: sbot [-v] [-r] url\n");
	exit(1);
}
```

### Braces
- Opening `{` on the same line for control flow (if, for, while, switch)
- Closing `}` on its own line unless continuing (else, do-while)
- Use braces even for single statements when sibling branches use them

### Naming
- lowercase_with_underscores for functions and variables
- UPPERCASE for macros and constants
- CamelCase for typedef'd struct types
- No `_t` suffix (reserved by POSIX)
- Prefix module functions with the module name

### Control Flow
- Space after `if`, `for`, `while`, `switch`
- No space after `(` or before `)`
- Use `goto` for cleanup/unwind, not nested ifs
- Return/exit early on failure
- Test against 0, not -1: `if (func() < 0)`, not `if (func() == -1)`

### Error Handling
- All allocations checked; goto cleanup on failure
- `die()` for fatal errors (prints message, exits)
- `warn()` for recoverable errors (prints, continues)

### File Organization Order
1. License header
2. System includes (alphabetical)
3. Local includes
4. Macros
5. Type definitions
6. Function declarations
7. Global variables
8. Function definitions (same order as declarations)

### Headers
- System headers first, alphabetical
- Local headers after a blank line
- No cyclic dependencies
- Include only what is needed

## Architecture

### Module Layout

| Module | Prefix | File | Responsibility |
|--------|--------|------|----------------|
| Main | — | archiver.c | Entry point, page archiving, CSS inlining, link rewriting, crawl orchestration |
| Crawler | `queue_`, `visited_` | crawl.c | URL queue (BFS), visited set, URL normalization, path conversion |
| Fetcher | `fetch_` | fetch.c | HTTP fetching via libcurl, response management |
| Parser | `reslist_`, `parse_` | parse.c | HTML parsing, resource extraction, image inlining |
| Robots | `robots_` | robots.c | robots.txt fetching, parsing, and rule matching |
| Detect | `detect_`, `siteinfo_` | detect.c | CMS/framework detection (WordPress, Blogger, Hugo, Jekyll, Ghost, Drupal, MediaWiki) |
| Utilities | `die`, `warn`, `x*`, `str_*`, `url_*` | util.c | Memory wrappers, string ops, URL helpers, base64, MIME types |
| Config | — | config.h | Compile-time constants (timeouts, limits, user agent) |

### Architecture Rules
- **Separate compilation.** Every .c file compiles independently.
- **No dynamic loading.** All features compiled in.
- **libcurl only.** Single external dependency for HTTP.
- **No `system()` calls.** Direct file I/O and libcurl only.
- **Data URIs for inlining.** Resources encoded as base64 data URIs.
- **Stateless functions preferred.** Minimize mutable global state.

### Crawler Design Principles
- **BFS traversal.** URL queue processes breadth-first by depth level.
- **Same-domain only.** Never follow links to external domains.
- **Politeness.** Rate limiting between requests (configurable).
- **Depth control.** Hard limit on crawl depth to prevent runaway.
- **URL normalization.** Canonical form for deduplication.
- **Graceful degradation.** Skip failed resources, continue crawling.
- **robots.txt compliance.** Respects Disallow/Allow rules and Crawl-delay.

## Build

```sh
make         # build sbot binary
make clean   # remove build artifacts
make install # install to /usr/local/bin
```

Dependencies: `libcurl` (via pkg-config)

## Usage

```sh
# Single-page archive (GWTAR format)
sbot https://example.com/article

# Whole site (recursive, depth 3)
sbot -r -d 3 https://example.com

# Verbose with custom output dir
sbot -v -r -o ./archive https://example.com
```

## Git Conventions

- No `Co-Authored-By: Claude` lines
- Commit messages: imperative, <72 chars, no period
- One logical change per commit
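## Style Sketches

The error-handling rules above (goto unwind, `warn()`, test against 0)
fit together as in this sketch. `str_pair()` and this minimal `warn()`
are hypothetical illustrations, not sbot's actual util.c code:

```c
#define _POSIX_C_SOURCE 200809L /* strdup */

#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* recoverable error: print a message and continue */
static void
warn(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	vfprintf(stderr, fmt, ap);
	va_end(ap);
	fputc('\n', stderr);
}

/* duplicate two strings; on failure unwind via goto and return -1 */
static int
str_pair(const char *a, const char *b, char **pa, char **pb)
{
	char *ca, *cb;

	ca = strdup(a);
	if (ca == NULL)
		goto err0;
	cb = strdup(b);
	if (cb == NULL)
		goto err1;
	*pa = ca;
	*pb = cb;
	return 0;
err1:
	free(ca);
err0:
	warn("str_pair: out of memory");
	return -1;
}
```

A caller checks `if (str_pair(a, b, &x, &y) < 0)` and bails out early,
per the control-flow rules.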
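The BFS queue plus visited set from the crawler design notes can be
sketched in a few lines. The names echo the crawler's `queue_` and
`visited_` prefixes, but this fixed-size array version is a hypothetical
illustration, not crawl.c:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAXURLS 64

/* every URL ever enqueued stays in the array, so the prefix
 * [0, tail) doubles as the visited set for deduplication */
static const char *urls[MAXURLS];
static size_t head, tail;

static int
visited_has(const char *url)
{
	size_t i;

	for (i = 0; i < tail; i++)
		if (strcmp(urls[i], url) == 0)
			return 1;
	return 0;
}

/* enqueue a URL unless already seen; 0 on success, -1 otherwise */
static int
queue_push(const char *url)
{
	if (tail >= MAXURLS || visited_has(url))
		return -1;
	urls[tail++] = url;
	return 0;
}

/* dequeue in FIFO order (breadth-first); NULL when empty */
static const char *
queue_pop(void)
{
	if (head >= tail)
		return NULL;
	return urls[head++];
}
```

In the real crawler, URLs would be normalized to canonical form before
the visited check, per the URL-normalization rule.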
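Resource inlining boils down to wrapping base64-encoded bytes in a
`data:` URI (RFC 2397). A sketch of that encoding follows; sbot's real
base64 helper lives in util.c and may differ:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const char b64[] =
	"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
	"0123456789+/";

/* encode buf as "data:<mime>;base64,<payload>"; caller frees */
static char *
data_uri(const char *mime, const unsigned char *buf, size_t len)
{
	char *out, *p;
	size_t i;

	out = malloc(strlen("data:;base64,") + strlen(mime) +
	    ((len + 2) / 3) * 4 + 1);
	if (out == NULL)
		return NULL;
	p = out + sprintf(out, "data:%s;base64,", mime);
	for (i = 0; i + 2 < len; i += 3) {
		*p++ = b64[buf[i] >> 2];
		*p++ = b64[((buf[i] & 0x03) << 4) | (buf[i + 1] >> 4)];
		*p++ = b64[((buf[i + 1] & 0x0f) << 2) | (buf[i + 2] >> 6)];
		*p++ = b64[buf[i + 2] & 0x3f];
	}
	if (i < len) { /* 1 or 2 trailing bytes: pad with '=' */
		*p++ = b64[buf[i] >> 2];
		if (i + 1 < len) {
			*p++ = b64[((buf[i] & 0x03) << 4) |
			    (buf[i + 1] >> 4)];
			*p++ = b64[(buf[i + 1] & 0x0f) << 2];
		} else {
			*p++ = b64[(buf[i] & 0x03) << 4];
			*p++ = '=';
		}
		*p++ = '=';
	}
	*p = '\0';
	return out;
}
```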