
Large File Storage

Implementation details of large file storage solutions for the website

status: Finished

Status Indicator

The status indicator reflects the current state of the work:

- Abandoned: Work that has been discontinued
- Notes: Initial collections of thoughts and references
- Draft: Early structured version with a central thesis
- In Progress: Well-developed work actively being refined
- Finished: Completed work with no planned major changes

This helps readers understand the maturity and completeness of the content.

·
certainty: certain

Confidence Rating

The confidence tag expresses how well-supported the content is, or how likely its overall ideas are to be right. This uses a scale from "impossible" to "certain", based on the Kesselman List of Estimative Words:

1. "certain"
2. "highly likely"
3. "likely"
4. "possible"
5. "unlikely"
6. "highly unlikely"
7. "remote"
8. "impossible"

Even ideas that seem unlikely may be worth exploring if their potential impact is significant enough.

·
importance: 8/10

Importance Rating

The importance rating distinguishes between trivial topics and those which might change your life. Using a scale from 0-10, content is ranked based on its potential impact on:

- the reader
- the intended audience
- the world at large

For example, topics about fundamental research or transformative technologies would rank 9-10, while personal reflections or minor experiments might rank 0-1.

As of 4/28/2025, krisyotam.com now has support for large files (>200MB). The post here from gwern.net details much of the necessity of this for a "long-now" focused site. For those who would rather not read it, I will provide a synopsis here.

Why Large File Storage

Lots of people make use of services like OneDrive, Google Drive, iCloud Drive, and Mega for cloud storage. These platforms tend to be fairly straightforward and designed for ease of use. That does come with plenty of drawbacks, however. With these platforms you get high-level abstraction (you interact with apps, not servers), freemium models (small free tiers that scale prices terribly), shared hosting (your files live with a million others), integrated ecosystems (locked into Apple, Microsoft, etc.), and severe privacy concerns. So how are Hetzner Storage Boxes different? HSB is infrastructure-focused (designed for developers and power users) and provides low-level control (access via FTP, SFTP, SCP, rsync, etc.). It can also act as a system hard drive (a feature the consumer providers above don't really offer). And it has no ecosystem lock-in: it works with any OS, tool, or client.

Need for Large-File Support

It's no secret that this site will have tons of expository work done over the years. That comes with thousands of references, resources, and citations, a number that easily multiplies itself thanks to the ease of using tools like Perplexity to discover sources I would otherwise not have found. With sources, I find the more the better. Why lean toward a scarcity mindset? Would it not be more convincing to see 10, 20, or even 30 studies with reproducible results on a certain topic than to be motivated to make an otherwise significant change in your life based off a one-off source? So this first reason is less a need to store singular "large files" than a need for the elephantine storage sizes caused by the consistent accumulation of small PDFs, MP3s, MP4s, etc. See also my method of countering this via my automated URL-archiving script.

Beyond the simple things, my main need for this implementation is the preservation of sites, such as that of geologist and independent researcher Leuren Moret. Her site recently went down, sometime after I last visited it in mid-2024. There's no telling when valuable assets of information may go down. I have more thoughts on the extended preservation of access in situations like these, such as retaining ETH domains like leurenmoret.eth and pinning her content via IPFS, or even using off-shore hosting providers, which I have been deliberating on switching to before the content here starts to get more serious. For now at least, using a viable LFS provider allows me to take my time weighing my options while retaining offline access to such information.

I am still weighing the value of MP4s for long-term storage. They may be pleasing and add visual stimuli, but for most content with great information density it is just not necessary. I would rather sacrifice the visual fidelity of an MP4 for the extra storage gained by using MP3s. In such situations I am also thinking about ideas for transcribing videos with tools like OpenAI Whisper. Maybe getting an extra Hetzner server for such purposes would not be a bad idea.

The opposite of courage in our society is not cowardice... it is conformity.
Rollo May

Drawbacks of Implementation

I have deliberated for a while about plenty of alternative solutions to this problem, as mentioned above in (#Why Large File Storage). There were several drawbacks to consider, such as privacy, security, ease of access, long-term pricing, portability, etc. A number of these factors outright eliminated the more traditional options such as OneDrive, Google Drive (G-Suite), iCloud Drive, and even Mega as a main source for LFS, though I am still particularly fond of Mega and retain my subscription due to the not-infrequent use of receiving massive amounts of data from people via the platform. The biggest drawback most of these platforms share was the inability to download via HTTPS. I was also drastically influenced by the massive pricing difference, which we will discuss next.

Competitor Price Comparison

Unlike the popular consumer cloud providers above, there are a number of viable alternatives to Hetzner, including Wasabi, Scaleway, Storj, and my own system-backup provider, Backblaze B2. As you can see below, however, there is simply no comparison when it comes to pricing models.

Hetzner Competitor Price Comparison
As of 4/28/2025, a comparison between Hetzner, Backblaze B2, Wasabi, Scaleway, and Storj.

Use Cases

The immediate use case that comes to mind for the newly integrated Large File Storage (LFS) is the completion of the formerly delayed archive page. It should be no surprise that the future of this blog will discuss at length a variety of topics that must be heavily supported with documentation: the Panama Papers, Paradise Papers, Offshore Leaks, Silk Road archives, RaidForums dumps, and many more datasets. It will also include the storage of YouTube videos I think are in danger of being silenced. Things like this, as well as research papers, historical documents, etc., are to be stored for reference here on the site, with the archive page made available to people at my discretion.

More fun things to store include the GeoCities Archives, which represent an important piece of early internet history that would otherwise be lost to time.

Implementation

The site runs as a Next.js app behind nginx on a Hetzner server. Large files don’t go through Next.js at all. They sit on a 1TB drive mounted at /mnt/storage on the same box, and nginx serves them straight from disk. Two directories do all the work: /doc and /cdn.

/doc

This is the document archive. Mostly research papers and PDFs. Some leaked datasets, some transcripts. If I cite something in a post, I keep a copy here so the link can’t rot out from under me. The files are sorted by subject: /doc/mathematics/, /doc/psychology/, /doc/history/, etc. When someone requests a file like /doc/philosophy/some-paper.pdf, nginx matches the extension and hands it back with a 7-day cache header. Browsing /doc/ without a specific file falls through to the Next.js app, which renders a listing page.

/cdn

This is the media side. Cover images, portraits, audio files. Anything the site needs to display but that would be insane to commit to the git repo. The directory structure follows the site’s content categories, so you get paths like /cdn/images/people/authors/ and /cdn/audio/. nginx has autoindex turned on here so you can browse it as a file listing. I inject a custom theme into those listing pages through nginx’s sub_filter module.

Both directories are symlinks from the home directory to the storage drive:

~/cdn -> /mnt/storage/cdn
~/doc -> /mnt/storage/doc
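A minimal way to recreate that layout, sketched here in a throwaway temp directory so it can run anywhere; on the real server the link targets would be /mnt/storage/cdn and /mnt/storage/doc:

```shell
# Recreate the symlink layout in a sandbox directory. The directory
# names mirror the server's /mnt/storage and home-directory paths.
base=$(mktemp -d)
mkdir -p "$base/mnt/storage/cdn" "$base/mnt/storage/doc" "$base/home"
ln -s "$base/mnt/storage/cdn" "$base/home/cdn"
ln -s "$base/mnt/storage/doc" "$base/home/doc"
ls -l "$base/home"
```

Because nginx serves through the symlinks, the storage drive can be swapped or remounted without touching any nginx paths.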

How the routing works

nginx does all of it. For /doc, a regex location block catches requests ending in known file extensions (.pdf, .epub, .mp3, etc.) and serves them from /home/krisyotam/doc/ using root. That follows the symlink down to /mnt/storage/doc/. Anything that doesn’t match a file extension passes through to the Next.js app on 127.0.0.1:3080.
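A sketch of what that pair of location blocks might look like. The root directory, extensions, cache lifetime, and upstream port come from this post; the exact directive arrangement is an assumption, and the extension list is abbreviated:

```nginx
# Known file extensions under /doc are served straight from disk,
# following the ~/doc symlink down to /mnt/storage/doc.
location ~* ^/doc/.+\.(pdf|epub|mp3|mp4)$ {
    root /home/krisyotam;   # request path /doc/... is appended to root
    expires 7d;             # the 7-day cache header
}

# Anything else under /doc (e.g. a bare directory) falls through
# to the Next.js app, which renders the listing page.
location /doc {
    proxy_pass http://127.0.0.1:3080;
    proxy_set_header Host $host;
}
```

nginx prefers the regex match for file requests, so only non-file paths ever reach the app.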

For /cdn, an alias directive points the URL path to the storage directory. Autoindex handles the file browser. sub_filter injects the theme files from /doc-theme/.
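A corresponding sketch for /cdn; the alias target and theme stylesheet name are illustrative, and sub_filter requires nginx built with the http_sub_module:

```nginx
# Browsable media listing, with a custom theme injected into the
# autoindex HTML via sub_filter.
location /cdn/ {
    alias /mnt/storage/cdn/;
    autoindex on;
    sub_filter '</head>' '<link rel="stylesheet" href="/doc-theme/theme.css"></head>';
    sub_filter_once on;
}
```

Since autoindex emits plain HTML with a normal head element, the sub_filter rewrite is enough to restyle every listing page without templating anything server-side.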

Uploads happen over SSH:

scp paper.pdf server:/mnt/storage/doc/mathematics/
scp portrait.jpg server:/mnt/storage/cdn/images/people/authors/

For bigger transfers I use rsync. TLS comes from Let’s Encrypt via certbot, with nginx terminating SSL on 443.

Why I did it this way

No S3. No object storage API. No monthly metering. The files sit on a drive I own, served by nginx which is already running for the site. The 1TB drive was a one-time cost. If I run out of space I add another one. If I move servers I rsync the whole thing. The URLs are just paths on my own domain, so there’s no external service involved that could change its pricing or disappear.

Citation
Citation
Yotam, Kris · Jun 2025

Yotam, Kris. (Jun 2025). Large File Storage. krisyotam.com. https://krisyotam.com/notes/website/large-file-storage

@article{yotam2025large-file-storage,
  title   = "Large File Storage",
  author  = "Yotam, Kris",
  journal = "krisyotam.com",
  year    = "2025",
  month   = "Jun",
  url     = "https://krisyotam.com/notes/website/large-file-storage"
}
