Case Study
Substack & Beehiiv Downloader
Download any Substack or Beehiiv publication, including paywalled content, as Markdown, DOCX, or PDF. No account required. Zero server-side persistence.
Why it exists
You paid for it. You should be able to keep it.
I was saving Substack articles to Notion and Readwise, then watching them become inaccessible when subscriptions lapsed. The content was gone. I had paid to read it; I hadn't paid to own it. That asymmetry felt wrong, and it applies equally to Beehiiv newsletters.
The technical insight that made the product possible: Substack has a public API that returns the body_html of a post. For paywalled posts, it returns the full body only if you pass a valid session cookie (connect.sid or substack.sid) from an active subscription. Beehiiv works similarly: authenticated requests use a JWT token cookie. In both cases, the credential is already in your browser. You just need a tool that knows how to use it.
Existing alternatives either required a clunky browser extension, didn't handle paywalled content, didn't support bulk export, covered only one platform, or stored your data on their servers, which is precisely the thing you shouldn't do with someone's authenticated session cookie. The design constraint was privacy-first from the start: no database, no accounts, nothing that touches a server beyond the request it's processing right now.
System Architecture
Stateless by design
There is no database. No user accounts. No server-side state between requests. The session cookie lives in the browser's sessionStorage and is passed directly to each API call; it never persists beyond the tab that holds it.
Content extraction priority: For Substack, the system tries three sources for the post body: (1) the Substack REST API body_html, (2) the window._preloads JSON in the page HTML (subscriber-specific content), and (3) DOM selectors on the rendered page. The longest version by word count wins. This "longest content" heuristic exists because paywalled content is sometimes in the preloads but not the API response, and occasionally only in the rendered DOM. For Beehiiv, the system uses the authenticated posts endpoint with the JWT token cookie, falling back to DOM extraction for custom-domain publications.
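A minimal sketch of the "longest content" selection, assuming each source has already been fetched into an HTML string. The Candidate type, the function names, and the regex-based tag stripping are illustrative assumptions, not the tool's actual code.

```typescript
// Hypothetical shape: one candidate per extraction source.
type Candidate = { source: "api" | "preloads" | "dom"; html: string };

// Count words after stripping tags. A rough regex strip is enough here,
// because the counts are only compared against each other.
function countWords(html: string): number {
  const text = html.replace(/<[^>]*>/g, " ").trim();
  return text === "" ? 0 : text.split(/\s+/).length;
}

// Return the candidate with the most words. On a tie, the earlier
// candidate in the array (the higher-priority source) is kept.
function pickLongest(candidates: Candidate[]): Candidate | undefined {
  let best: Candidate | undefined;
  let bestWords = -1;
  for (const candidate of candidates) {
    const words = countWords(candidate.html);
    if (words > bestWords) {
      best = candidate;
      bestWords = words;
    }
  }
  return best;
}
```

Passing the candidates in priority order (API, preloads, DOM) means the API response is preferred whenever it is at least as long as the alternatives.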
Word count discrepancy warning: After export, the system compares the exported word count to the platform's reported word count. If the exported content is less than 72% of the reported length (and the article is over 400 words), the user sees a warning that the export may be incomplete, usually indicating a paywall that the session cookie didn't fully bypass.
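The check above can be sketched as a small pure function. The 72% and 400-word thresholds come from the description; the function name is hypothetical, and applying the 400-word floor to the platform's reported count (rather than the exported count) is an assumption.

```typescript
const MIN_RATIO = 0.72; // exported / reported below this triggers a warning
const MIN_WORDS = 400;  // shorter articles are never flagged

// Returns true when the export looks incomplete.
// Assumption: "over 400 words" refers to the platform's reported count.
function shouldWarnIncomplete(exportedWords: number, reportedWords: number): boolean {
  if (reportedWords <= MIN_WORDS) return false;
  return exportedWords / reportedWords < MIN_RATIO;
}
```

For example, a 1,000-word article that exports as 700 words would be flagged, while the same article exporting as 800 words would not.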
Design Decisions
The calls that shaped it
Zero persistence, by principle
Users pass their Substack session cookie (connect.sid / substack.sid) or Beehiiv JWT token to this tool. Those credentials give access to all their paid subscriptions on each platform. Storing either, even temporarily, would be a meaningful security risk and a meaningful trust violation. The credential lives in sessionStorage (not localStorage), is never sent to any logging or analytics service, and is cleared when the tab closes.
Tradeoff: Users have to re-paste the cookie each new session. The "Keep me signed in" checkbox just prevents the field from clearing on page reload; it still clears on tab close. Slightly more friction, significantly more trustworthy.
Folder export over ZIP for large libraries
A ZIP export is a single atomic request; it either succeeds or fails entirely. For a publication with 200 posts, one network hiccup at post 180 loses everything. Folder export (via the File System Access API) is N sequential requests that write files progressively. A failure at post 180 means you have 179 files and a manifest explaining what's missing. On resume, it skips files that already exist.
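A rough sketch of the progressive, resumable loop, written against a small hypothetical FolderWriter interface so it does not depend on browser APIs. In the browser, that interface would presumably be backed by a FileSystemDirectoryHandle. The type names, the manifest shape, and the choice to continue past a failed post instead of stopping are all assumptions.

```typescript
// Hypothetical abstraction over the destination folder.
interface FolderWriter {
  exists(name: string): Promise<boolean>;
  write(name: string, content: string): Promise<void>;
}

type Post = { filename: string; url: string };
type Manifest = {
  written: string[];
  skipped: string[];
  failed: { filename: string; reason: string }[];
};

// Export posts one at a time. Files already on disk are skipped, so
// re-running after an interruption only fetches what is missing.
async function exportToFolder(
  posts: Post[],
  folder: FolderWriter,
  fetchPost: (url: string) => Promise<string>,
): Promise<Manifest> {
  const manifest: Manifest = { written: [], skipped: [], failed: [] };
  for (const post of posts) {
    if (await folder.exists(post.filename)) {
      manifest.skipped.push(post.filename);
      continue;
    }
    try {
      const content = await fetchPost(post.url);
      await folder.write(post.filename, content);
      manifest.written.push(post.filename);
    } catch (err) {
      // Record the failure and keep going; earlier files stay on disk.
      const reason = err instanceof Error ? err.message : String(err);
      manifest.failed.push({ filename: post.filename, reason });
    }
  }
  return manifest;
}
```

Because each file is written as soon as it is fetched, a failure affects only the post that failed, and the returned manifest lists what still needs to be retried.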
Tradeoff: File System Access API only works on Chrome and Edge, not Firefox, not Safari. ZIP export is the fallback for other browsers. The folder mode is clearly labelled with browser requirements in the UI.
SSRF protection from day one
The tool fetches URLs that users supply. Without validation, a malicious user could point it at an internal network resource: a metadata endpoint, a private API, a local admin panel. The validator blocks HTTP (HTTPS only), blocks all RFC 1918 private IP ranges, and blocks common internal hostnames (*.local, *.internal, *.corp, etc.).
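A simplified sketch of such a validator, covering only what can be decided from the URL string. The function names and the suffix list are illustrative. The loopback and link-local checks go beyond the RFC 1918 ranges named above and are an added assumption, as is the omission of IPv6. A production validator would also need to check the addresses a hostname resolves to, which this sketch does not do.

```typescript
// Illustrative list; the real validator's list may differ.
const BLOCKED_SUFFIXES = [".local", ".internal", ".corp"];

function isBlockedIPv4(host: string): boolean {
  const m = host.match(/^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/);
  if (!m) return false;
  const a = Number(m[1]);
  const b = Number(m[2]);
  return (
    a === 10 ||                          // 10.0.0.0/8 (RFC 1918)
    (a === 172 && b >= 16 && b <= 31) || // 172.16.0.0/12 (RFC 1918)
    (a === 192 && b === 168) ||          // 192.168.0.0/16 (RFC 1918)
    a === 127 ||                         // loopback (assumption, not RFC 1918)
    (a === 169 && b === 254)             // link-local (assumption, not RFC 1918)
  );
}

// Returns { ok: true } for an acceptable URL, or { ok: false, reason } otherwise.
function validateTargetUrl(raw: string): { ok: boolean; reason?: string } {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return { ok: false, reason: "not a valid URL" };
  }
  if (url.protocol !== "https:") return { ok: false, reason: "HTTPS only" };
  const host = url.hostname.toLowerCase();
  if (host === "localhost" || BLOCKED_SUFFIXES.some((s) => host.endsWith(s))) {
    return { ok: false, reason: "internal hostname" };
  }
  if (isBlockedIPv4(host)) return { ok: false, reason: "private IP range" };
  return { ok: true };
}
```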
Tradeoff: None worth mentioning. SSRF protection is non-negotiable for any tool that fetches user-supplied URLs. The only cost is a few milliseconds of DNS and IP validation per request.
Deterministic filenames for resumable exports
Exported files are named YYYY-MM-DD-slug.md, date plus slug, derived from the post metadata. This means the same post always produces the same filename. Resuming an interrupted bulk export skips files that already exist. Re-running an export after new posts are published adds only the new files without touching existing ones.
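One way the naming scheme could look in code. The PostMeta field names, the use of the UTC calendar date, and the slug sanitization step are assumptions made for illustration.

```typescript
// Hypothetical metadata shape; publishedAt is an ISO 8601 timestamp.
type PostMeta = { slug: string; publishedAt: string };

// Build YYYY-MM-DD-slug.ext from post metadata only, so the same post
// always maps to the same filename regardless of when the export runs.
function exportFilename(post: PostMeta, ext: string = "md"): string {
  // Assumption: the date part is the UTC calendar date of publication.
  const date = new Date(post.publishedAt).toISOString().slice(0, 10);
  // Keep the slug filesystem-safe: lowercase letters, digits, and hyphens.
  const safeSlug = post.slug
    .toLowerCase()
    .replace(/[^a-z0-9-]+/g, "-")
    .replace(/^-+|-+$/g, "");
  return `${date}-${safeSlug}.${ext}`;
}
```

Since nothing in the function depends on the current time or on the post title, a resumed export can check for existing files by name alone.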
Tradeoff: If a title changes after publication (both Substack and Beehiiv allow this), the file keeps the name derived from the original slug, not the updated title. That is acceptable because the slug is a more stable key than the title.
Operational Thinking
Zero-infra product at real usage
AI System Thinking
Why this product has no AI, and why that's right
This tool has no AI. That's not a limitation; it's a design choice that took some conscious resistance to make.
The problem here is extraction fidelity, not generation. The user wants their content transferred from Substack's or Beehiiv's servers to their own files, accurately, completely, without distortion. Adding an LLM to "summarize" or "enhance" the content would defeat the entire purpose: the user wants the original words, not an interpretation of them.
The "intelligence" in the system is in three places: the longest-content selection heuristic (picking the most complete version of the HTML from multiple sources across both platforms), the word count discrepancy math (flagging incomplete exports), and the credential normalization logic (handling the different formats a Substack cookie or Beehiiv JWT token might arrive in from browser DevTools). These are deterministic algorithms, not probabilistic models. They're right 100% of the time or they're clearly wrong: there's no "mostly right."
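As an illustration of the credential normalization idea, here is a sketch for the Substack case only. The accepted input forms (a bare value, a name=value pair, or a full Cookie header copied from DevTools), the function name, and the default cookie name used for a bare value are all assumptions, not the tool's documented behavior.

```typescript
const KNOWN_NAMES = ["substack.sid", "connect.sid"];

// Normalize pasted input to a single "name=value" pair, or null if
// no usable session cookie can be found.
function normalizeSubstackCookie(input: string): string | null {
  // Strip whitespace, surrounding quotes, and a leading "Cookie:" label.
  const cleaned = input
    .trim()
    .replace(/^["']|["']$/g, "")
    .replace(/^cookie:\s*/i, "")
    .trim();
  if (cleaned === "") return null;

  // Full cookie string: return the first known session pair.
  for (const part of cleaned.split(";")) {
    const eq = part.indexOf("=");
    if (eq === -1) continue;
    const name = part.slice(0, eq).trim();
    const value = part.slice(eq + 1).trim();
    if (KNOWN_NAMES.includes(name) && value !== "") return `${name}=${value}`;
  }

  // Bare value with no name. Assumption: default to substack.sid.
  if (!cleaned.includes("=") && !cleaned.includes(";")) {
    return `substack.sid=${cleaned}`;
  }
  return null;
}
```

The function is deterministic in the sense described above: a given pasted string always yields the same pair or the same rejection.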
A possible AI addition that would actually add value: an LLM-powered "export health" check that reads the exported Markdown and identifies if it looks like article content or like a 403/paywall page rendered as text. Right now, a user whose session cookie expired mid-export gets 50 files that say "Subscribe to read" in them. Detecting that without manual inspection would be genuinely useful. Not on the roadmap yet, but worth the build.