Substack & Beehiiv Downloader: How I Built It

Why it exists

You paid for it. You should be able to keep it.

I was saving Substack articles to Notion and Readwise, then watching them become inaccessible when subscriptions lapsed. The content was gone. I had paid to read it; I hadn't paid to own it. That asymmetry felt wrong, and it applies equally to Beehiiv newsletters.

The technical insight that made the product possible: Substack has a public API that returns the full body_html of any public post. For paywalled content, it returns the full body only if you pass a valid session cookie (connect.sid or substack.sid) from an active subscription. Beehiiv works similarly: authenticated requests carry a JWT in a cookie. In both cases, the credential is already in your browser. You just need a tool that knows how to use it.
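As a rough illustration of that request, here is a minimal sketch that turns a post URL into a call against the /api/v1/posts/{slug} endpoint. The helper name, the return shape, and the assumption that post pages live under /p/{slug} are mine, not taken from the tool's source.

```typescript
// Hypothetical helper: name and return shape are illustrative only.
interface PostRequest {
  url: string;
  headers: Record<string, string>;
}

function buildSubstackPostRequest(postUrl: string, sessionCookie?: string): PostRequest {
  const parsed = new URL(postUrl);
  // Assumption: post pages look like https://<publication>/p/<slug>
  const match = parsed.pathname.match(/^\/p\/([^/]+)\/?$/);
  const slug = match?.[1];
  if (!slug) {
    throw new Error(`Not a recognisable post URL: ${postUrl}`);
  }
  const headers: Record<string, string> = { Accept: "application/json" };
  if (sessionCookie) {
    // Assumption: the caller passes a ready "name=value" pair, e.g. "substack.sid=...".
    headers["Cookie"] = sessionCookie;
  }
  return { url: `${parsed.origin}/api/v1/posts/${slug}`, headers };
}
```

The result can be handed to a server-side fetch(url, { headers }); browsers do not let page scripts set the Cookie header themselves, which is why the cookie has to travel to a function that makes the request on the user's behalf.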

Existing alternatives each fell short somewhere: they required a clunky browser extension, didn't handle paywalled content, didn't support bulk export, covered only one platform, or stored your data on their servers, which is precisely what you shouldn't do with someone's authenticated session cookie. The design constraint was privacy-first from the start: no database, no accounts, and nothing held on a server beyond the request it is processing right now.

System Architecture

Stateless by design

There is no database. No user accounts. No server-side state between requests. The session cookie lives in the browser's sessionStorage and is passed directly to each API call; it never persists beyond the tab that holds it.

Single article export

User: URL + optional session cookie

SSRF Validation: block private IPs · HTTPS only

Platform API: Substack /api/v1/posts/{slug} · Beehiiv posts endpoint

Format Conversion: Turndown → MD · docx → DOCX · pdf-lib → PDF

Download: direct to browser

Bulk export (folder mode)

Publication URL + session cookie

List All Posts: 25/page · loop until done

Folder Picker: File System Access API

Sequential Export: skip existing · manifest.json · YAML frontmatter
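The "25/page · loop until done" step can be sketched as a simple pagination loop. This is an illustration under assumptions: the offset/limit style of paging, the PostSummary shape, and the rule "stop on a short page" are mine, and the page fetcher is injected so the loop itself stays independent of either platform's API.

```typescript
// Hypothetical shapes: the real post-list fields are not specified in this write-up.
interface PostSummary {
  slug: string;
  publishedAt: string;
}

// Assumption: the listing endpoint accepts an offset and a page size.
type FetchPage = (offset: number, limit: number) => Promise<PostSummary[]>;

async function listAllPosts(fetchPage: FetchPage, pageSize = 25): Promise<PostSummary[]> {
  const all: PostSummary[] = [];
  for (let offset = 0; ; offset += pageSize) {
    const page = await fetchPage(offset, pageSize);
    all.push(...page);
    // A short (or empty) page means there is nothing left to fetch.
    if (page.length < pageSize) break;
  }
  return all;
}
```

When the total is an exact multiple of the page size, the loop makes one extra request that returns an empty page, which is the price of not needing a total count up front.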

Content extraction priority: For Substack, the system attempts to get the post body from three sources in order: (1) Substack REST API body_html, (2) window._preloads JSON in the page HTML (subscriber-specific content), (3) DOM selectors on the rendered page. The longest version by word count wins. For Beehiiv, the system uses the authenticated posts endpoint with the JWT token cookie, falling back to DOM extraction for custom-domain publications. This "longest content" heuristic exists because paywalled content is sometimes in the preloads but not the API response, and occasionally only in the rendered DOM.
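A minimal sketch of the "longest version by word count wins" rule follows. The type names and the tag-stripping word counter are illustrative assumptions; the real extractor may count words differently.

```typescript
type Source = "api" | "preloads" | "dom";

interface Candidate {
  source: Source;
  html: string | null; // null when that source yielded nothing
}

// Crude word count: strip tags, then split on whitespace.
function countWords(html: string): number {
  const text = html.replace(/<[^>]*>/g, " ");
  return text.trim().split(/\s+/).filter((w) => w.length > 0).length;
}

// Candidates are passed in priority order; a strict ">" keeps the
// earlier source when two candidates have the same word count.
function pickLongest(candidates: Candidate[]): Candidate | null {
  let best: Candidate | null = null;
  let bestWords = 0;
  for (const candidate of candidates) {
    if (!candidate.html) continue;
    const words = countWords(candidate.html);
    if (words > bestWords) {
      best = candidate;
      bestWords = words;
    }
  }
  return best;
}
```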

Word count discrepancy warning: After export, the system compares the exported word count to the platform's reported word count. If the exported content is less than 72% of the reported length (and the article is over 400 words), the user sees a warning that the export may be incomplete, usually indicating a paywall that the session cookie didn't fully bypass.
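The check itself is small arithmetic. In this sketch the function and constant names are hypothetical, and I am assuming that "over 400 words" refers to the platform's reported count rather than the exported one.

```typescript
const MIN_RATIO = 0.72; // threshold from the text
const MIN_WORDS = 400;  // short articles are exempt from the check

// Assumption: the 400-word floor applies to the platform's reported count.
function looksIncomplete(exportedWords: number, reportedWords: number): boolean {
  if (reportedWords <= MIN_WORDS) return false;
  return exportedWords / reportedWords < MIN_RATIO;
}
```

For example, an export of 700 words against a reported 1,000 gives a ratio of 0.70 and triggers the warning, while 750 of 1,000 (0.75) does not.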

Design Decisions

The calls that shaped it

Zero persistence, by principle

Users pass their Substack session cookie (connect.sid / substack.sid) or Beehiiv JWT token to this tool. Those credentials give access to all their paid subscriptions on each platform. Storing either, even temporarily, would be a meaningful security risk and a meaningful trust violation. The credential lives in sessionStorage (not localStorage), is never sent to any logging or analytics service, and is cleared when the tab closes.

Tradeoff: Users have to re-paste the cookie each new session. The "Keep me signed in" checkbox just prevents the field from clearing on page reload; it still clears on tab close. Slightly more friction, significantly more trustworthy.
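One possible reading of the "Keep me signed in" behaviour, sketched against a minimal storage interface that window.sessionStorage satisfies. The key name and both function names are hypothetical, and the exact semantics of the checkbox in the real tool may differ.

```typescript
// Structural subset of the Web Storage API; window.sessionStorage matches it.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

const CREDENTIAL_KEY = "export-credential"; // hypothetical key name

// Assumption: with the box ticked the value is kept for reloads of this tab;
// with it unticked nothing is stored, so a reload clears the field.
function saveCredential(store: KeyValueStore, value: string, keepOnReload: boolean): void {
  if (keepOnReload) {
    store.setItem(CREDENTIAL_KEY, value);
  } else {
    store.removeItem(CREDENTIAL_KEY);
  }
}

// Called on page load to refill the field; empty string when nothing was kept.
function restoreCredential(store: KeyValueStore): string {
  return store.getItem(CREDENTIAL_KEY) ?? "";
}
```

In the browser this would be called as saveCredential(window.sessionStorage, value, checked). Because sessionStorage is scoped to the tab, the value disappears when the tab closes regardless of the checkbox.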

Folder export over ZIP for large libraries

A ZIP export is a single atomic request; it either succeeds or fails entirely. For a publication with 200 posts, one network hiccup at post 180 loses everything. Folder export (via the File System Access API) is N sequential requests that write files progressively. A failure at post 180 means you have 179 files and a manifest explaining what's missing. When the export is resumed, it skips files that already exist.

Tradeoff: File System Access API only works on Chrome and Edge, not Firefox, not Safari. ZIP export is the fallback for other browsers. The folder mode is clearly labelled with browser requirements in the UI.
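The skip-existing write can be sketched as below. The interfaces are structural stand-ins for the parts of the File System Access API involved (getFileHandle, createWritable, write, close); the function names are hypothetical and the manifest is left out.

```typescript
// Minimal structural types mirroring the File System Access API calls used here.
interface WritableLike {
  write(data: string): Promise<void>;
  close(): Promise<void>;
}
interface FileHandleLike {
  createWritable(): Promise<WritableLike>;
}
interface DirectoryHandleLike {
  getFileHandle(name: string, options?: { create?: boolean }): Promise<FileHandleLike>;
}

async function fileExists(dir: DirectoryHandleLike, name: string): Promise<boolean> {
  try {
    // Without { create: true } the call rejects when the file is absent.
    await dir.getFileHandle(name);
    return true;
  } catch {
    // Simplification: any error is treated as "not there".
    return false;
  }
}

// render() is only invoked when the file is missing, so a resumed export
// does not re-fetch posts that were already written.
async function writeIfMissing(
  dir: DirectoryHandleLike,
  name: string,
  render: () => Promise<string>,
): Promise<"written" | "skipped"> {
  if (await fileExists(dir, name)) return "skipped";
  const handle = await dir.getFileHandle(name, { create: true });
  const writable = await handle.createWritable();
  await writable.write(await render());
  await writable.close();
  return "written";
}
```

In a supporting browser the directory handle would come from window.showDirectoryPicker({ mode: "readwrite" }), and a check such as "showDirectoryPicker" in window is one way to decide whether to offer folder mode or fall back to ZIP.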

SSRF protection from day one

The tool fetches URLs that users supply. Without validation, a malicious user could point it at an internal network resource: a metadata endpoint, a private API, a local admin panel. The validator blocks HTTP (HTTPS only), blocks all RFC 1918 private IP ranges, and blocks common internal hostnames (*.local, *.internal, *.corp, etc.).

Tradeoff: None worth mentioning. SSRF protection is non-negotiable for any tool that fetches user-supplied URLs. The only cost is a few milliseconds of DNS and IP validation per request.
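A minimal sketch of the URL-level part of such a validator follows. It is deliberately incomplete: it checks only the literal hostname, so it does not cover IPv6 or hostnames that resolve to private addresses, which is what the DNS step mentioned above would handle. The loopback and link-local checks go beyond RFC 1918 and are my assumption, added because cloud metadata endpoints live in the link-local range.

```typescript
// Suffixes named in the text; the real list may be longer.
const BLOCKED_SUFFIXES = [".local", ".internal", ".corp"];

function isPrivateIPv4(host: string): boolean {
  const m = host.match(/^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/);
  if (!m) return false;
  const a = Number(m[1]);
  const b = Number(m[2]);
  return (
    a === 10 ||                          // 10.0.0.0/8 (RFC 1918)
    (a === 172 && b >= 16 && b <= 31) || // 172.16.0.0/12 (RFC 1918)
    (a === 192 && b === 168) ||          // 192.168.0.0/16 (RFC 1918)
    a === 127 ||                         // loopback (assumption, not RFC 1918)
    (a === 169 && b === 254)             // link-local, e.g. metadata endpoints (assumption)
  );
}

// Returns the parsed URL when it passes, throws otherwise.
function validateTargetUrl(raw: string): URL {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    throw new Error("Invalid URL");
  }
  if (url.protocol !== "https:") {
    throw new Error("Only HTTPS URLs are allowed");
  }
  const host = url.hostname.toLowerCase();
  if (host === "localhost" || BLOCKED_SUFFIXES.some((s) => host.endsWith(s))) {
    throw new Error("Internal hostnames are not allowed");
  }
  if (isPrivateIPv4(host)) {
    throw new Error("Private IP addresses are not allowed");
  }
  return url;
}
```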

Deterministic filenames for resumable exports

Exported files are named YYYY-MM-DD-slug.md, date plus slug, derived from the post metadata. This means the same post always produces the same filename. Resuming an interrupted bulk export skips files that already exist. Re-running an export after new posts are published adds only the new files without touching existing ones.

Tradeoff: If a title changes after publication (both Substack and Beehiiv allow this), the filename stays the same, because it is keyed on the slug rather than the title. The slug is more stable than the title as a key.
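A sketch of the naming rule, with two assumptions of mine: the date is taken in UTC from an ISO timestamp, and the slug is lightly sanitised before use. The function name is hypothetical.

```typescript
// Builds YYYY-MM-DD-slug.ext from post metadata.
function exportFilename(publishedAt: string, slug: string, ext = "md"): string {
  const date = new Date(publishedAt);
  if (Number.isNaN(date.getTime())) {
    throw new Error(`Unparseable publish date: ${publishedAt}`);
  }
  // Assumption: use the UTC calendar day so the name does not depend
  // on the time zone of the machine running the export.
  const day = date.toISOString().slice(0, 10);
  // Assumption: keep only lowercase letters, digits, and hyphens.
  const safeSlug = slug
    .toLowerCase()
    .replace(/[^a-z0-9-]+/g, "-")
    .replace(/^-+|-+$/g, "");
  return `${day}-${safeSlug}.${ext}`;
}
```

Because the output depends only on the publish date and slug, calling it twice for the same post yields the same name, which is what makes the skip-existing check on resume reliable.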

Operational Thinking

Zero-infra product at real usage

Monthly users

200–300

Launched May 2026. No paid acquisition. Organic from ProductHunt, Reddit, and Substack/Beehiiv community threads. Growth is slow and word-of-mouth, which is appropriate for a privacy-first tool.

Infrastructure cost

~$0/month

Vercel free tier. No database, no storage, no caching layer. The only cost would be Vercel function execution, which at current volume stays within the free tier's 100GB-hours/month allocation.

Scalability ceiling

Vercel limits

Bulk export functions run for up to 300 seconds (Vercel max). A publication with 500+ posts at moderate speed will approach this limit. Chunked export or a queue mechanism would be needed above ~300 posts per session.

Rate limiting

3 retries / 800ms backoff

Per-request retry logic for Substack and Beehiiv API calls. Not a global rate limiter on the tool itself, which means a single user doing a large bulk export isn't blocked by another user's traffic.
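A sketch of per-request retry with those numbers. Two things are assumptions here: that "3 retries" means up to three additional attempts after the first, and that the 800ms wait is a fixed delay rather than a growing one. The sleep function is injectable so the logic can be exercised without real waiting.

```typescript
type Sleep = (ms: number) => Promise<void>;

const defaultSleep: Sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Runs task once, then retries up to `retries` more times on failure,
// pausing `delayMs` between attempts. Rethrows the last error if all fail.
async function withRetry<T>(
  task: () => Promise<T>,
  retries = 3,
  delayMs = 800,
  sleep: Sleep = defaultSleep,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      if (attempt < retries) {
        await sleep(delayMs);
      }
    }
  }
  throw lastError;
}
```

Since the retry state lives inside each call, one user's failing requests have no effect on another user's export, which matches the "not a global rate limiter" point above.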

Discrepancy threshold

72% word match

If the exported content is less than 72% of the platform's reported word count, a warning surfaces. The threshold was calibrated against known paywalled exports on both Substack and Beehiiv to minimize false positives while catching real truncation.

Session security

sessionStorage only

Cookie never leaves the browser except as a request header. No logging, no analytics on the value. Server logs are suppressed for cookie-containing requests. The privacy model is stronger than most tools in this category.

AI System Thinking

Why this product has no AI, and why that's right

This tool has no AI. That's not a limitation; it's a design choice that took some conscious resistance to make.

The problem here is extraction fidelity, not generation. The user wants their content transferred from Substack's or Beehiiv's servers to their own files, accurately, completely, without distortion. Adding an LLM to "summarize" or "enhance" the content would defeat the entire purpose: the user wants the original words, not an interpretation of them.

The "intelligence" in the system is in three places: the longest-content selection heuristic (picking the most complete version of the HTML from multiple sources across both platforms), the word count discrepancy math (flagging incomplete exports), and the credential normalization logic (handling the different formats a Substack cookie or Beehiiv JWT token might arrive in from browser DevTools). These are deterministic algorithms, not probabilistic models. They're right 100% of the time or they're clearly wrong: there's no "mostly right."
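To make the credential normalization concrete, here is a sketch for the Substack side only. The accepted input shapes (a full "Cookie:" header line, a single name=value pair, or a bare value), the fallback to the substack.sid name for bare values, and the function name are all assumptions for illustration; Beehiiv JWT handling is not shown.

```typescript
// Cookie names mentioned in the text, in the order they are looked up.
const KNOWN_NAMES = ["substack.sid", "connect.sid"];

// Returns a single "name=value" pair suitable for a Cookie header.
function normalizeSubstackCookie(pasted: string): string {
  const input = pasted
    .trim()
    .replace(/^cookie:\s*/i, "")   // drop a copied "Cookie:" header prefix
    .replace(/^["']|["']$/g, "");  // drop wrapping quotes
  const pairs = input
    .split(";")
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
  for (const name of KNOWN_NAMES) {
    const hit = pairs.find((p) => p.startsWith(`${name}=`));
    if (hit) return hit;
  }
  const only = pairs[0];
  if (pairs.length === 1 && only !== undefined && !only.includes("=")) {
    // Assumption: a bare value is the substack.sid cookie's value.
    return `substack.sid=${only}`;
  }
  throw new Error("No Substack session cookie found in pasted text");
}
```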

A possible AI addition that would actually add value: an LLM-powered "export health" check that reads the exported Markdown and identifies if it looks like article content or like a 403/paywall page rendered as text. Right now, a user whose session cookie expired mid-export gets 50 files that say "Subscribe to read" in them. Detecting that without manual inspection would be genuinely useful. Not on the roadmap yet, but worth the build.
