Building AI Apps That Actually Ship

I’ve built a lot of AI prototypes. Most of them are dead. A few of them are live, getting used, and making money. The difference isn’t the AI model — it’s every decision you make around it.

Here’s what I’ve learned.

The demo trap

AI apps are dangerously easy to demo. Prompt a model, get impressive output, screen-record it, post it on X. Everyone’s impressed. Then you try to turn it into a product and realize the demo only worked because you hand-crafted the input.

Real users don’t hand-craft their input. They type quickly, leave out context, ask the wrong question, and expect the system to figure it out. Building for that is the actual work.

Start with the failure mode, not the happy path

Before writing a single line of code, I ask: what happens when the AI is wrong? For Orion, wrong means a marketing campaign brief that misrepresents the brand. For the HBS Campus Guide, wrong means sending someone to the wrong building. These are very different failure modes with very different acceptable rates.

Once you know your failure mode, you can design around it — human review steps, confidence thresholds, fallback responses, clear UX that sets expectations.
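
A minimal sketch of a confidence gate, in Python. Everything named here is an assumption for illustration: the thresholds, the `route` function, and the idea that you already have a confidence score between 0 and 1 (from a classifier, a retrieval score, or a second model pass). The right cutoffs depend on how costly your failure mode is.

```python
from dataclasses import dataclass

# Hypothetical thresholds; tune them per product and per failure mode.
REVIEW_THRESHOLD = 0.9    # below this, a human reviews the answer first
FALLBACK_THRESHOLD = 0.5  # below this, the AI answer is not shown at all

FALLBACK_MESSAGE = "I'm not sure about this one. Could you rephrase or add detail?"


@dataclass
class Decision:
    action: str  # "show", "review", or "fallback"
    text: str


def route(answer: str, confidence: float) -> Decision:
    """Decide what to do with a model answer given a confidence in [0, 1]."""
    if confidence < FALLBACK_THRESHOLD:
        # Too risky to show: return a canned response instead.
        return Decision("fallback", FALLBACK_MESSAGE)
    if confidence < REVIEW_THRESHOLD:
        # Plausible but unverified: queue it for human review.
        return Decision("review", answer)
    return Decision("show", answer)
```

An app where a wrong answer is cheap might set both thresholds low; one where a wrong answer damages a brand might send nearly everything to review.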

The latency problem is real

GPT-4-quality responses often take 3–8 seconds. Users tolerate about 1–2 seconds in a web app. Anything slower needs streaming, loading states, or an async pattern where you kick off the job and deliver the result later.

I default to streaming for anything user-facing. It makes 4-second responses feel fast because the user sees progress immediately.
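
The shape of a streaming handler can be sketched in a few lines of Python. `fake_model_stream` is a stand-in for whatever streaming client you use, and `send` is a hypothetical callback that pushes a chunk to the browser (a server-sent event, a WebSocket message, and so on); neither is a real library API.

```python
import time
from typing import Callable, Iterable, Iterator


def fake_model_stream(text: str, delay: float = 0.0) -> Iterator[str]:
    """Stand-in for a streaming LLM client: yields the response word by word."""
    for word in text.split(" "):
        time.sleep(delay)  # simulates time between tokens
        yield word + " "


def stream_to_user(chunks: Iterable[str], send: Callable[[str], None]) -> str:
    """Forward each chunk as soon as it arrives; return the full text for logging."""
    parts = []
    for chunk in chunks:
        send(chunk)  # the user sees progress immediately
        parts.append(chunk)
    return "".join(parts)
```

The total time is unchanged. What changes is the time to the first visible output, which is what the user actually perceives.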

Architecture decisions that matter

Separate your prompts from your code. Prompt iteration is a different workflow from code iteration. If your prompt is hardcoded in a function, every tweak requires a deployment. Store prompts in config files or a simple CMS so you can improve them without touching code.
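
One simple way to do this, sketched in Python under the assumption that prompts live in a JSON file mapping names to templates. The file layout and the `load_prompts` and `render` helpers are illustrative, not a prescribed format.

```python
import json
from pathlib import Path
from string import Template


def load_prompts(path: str) -> dict[str, Template]:
    """Read a JSON file of {"name": "template text with $placeholders"}."""
    raw = json.loads(Path(path).read_text(encoding="utf-8"))
    return {name: Template(text) for name, text in raw.items()}


def render(prompts: dict[str, Template], name: str, **variables: str) -> str:
    """Fill in a named prompt; raises KeyError if a placeholder is missing."""
    return prompts[name].substitute(**variables)
```

With a file containing `{"brief": "Write a campaign brief for $brand."}`, calling `render(prompts, "brief", brand="Acme")` fills in the placeholder, and editing the wording means editing the file, not the function.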

Cache aggressively. LLM calls are expensive and slow. If there’s any chance a query will repeat — and there usually is — cache the result. Even a 1-hour cache on common queries can cut costs dramatically.
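
A minimal in-memory sketch of a time-limited cache, in Python. The `TTLCache` class and its key normalization (lowercasing and collapsing whitespace) are assumptions for illustration; a production setup would more likely use a shared store, and some apps cannot safely treat differently-cased queries as identical.

```python
import hashlib
import time
from typing import Callable


class TTLCache:
    """Caches model responses per normalized prompt for a fixed time window."""

    def __init__(self, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested
        self._store: dict[str, tuple[float, str]] = {}

    @staticmethod
    def key(prompt: str) -> str:
        # Assumption: case and extra whitespace don't change the answer.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get_or_call(self, prompt: str, call_model: Callable[[str], str]) -> str:
        k = self.key(prompt)
        now = self.clock()
        hit = self._store.get(k)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # fresh cache hit: no model call, no cost
        result = call_model(prompt)
        self._store[k] = (now, result)
        return result
```

Here `call_model` is whatever function actually hits the LLM; the cache only decides whether that call is needed.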

Log everything at the beginning. Input, output, latency, cost, user feedback. You cannot improve what you cannot measure. In early production I log every single LLM call. Once I understand the patterns, I get selective.
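
A sketch of a logging wrapper in Python that writes one JSON line per call. The `logged_call` helper, the record fields, and the per-character price are all hypothetical; real cost tracking would use the token counts and rates your provider reports.

```python
import json
import time
from typing import Callable, TextIO

# Hypothetical price for illustration only; use your provider's real rates.
COST_PER_1K_CHARS = 0.002


def logged_call(prompt: str, call_model: Callable[[str], str],
                log_file: TextIO) -> str:
    """Call the model and append one JSON record describing the call."""
    start = time.perf_counter()
    output = call_model(prompt)
    latency = time.perf_counter() - start
    chars = len(prompt) + len(output)
    record = {
        "ts": time.time(),
        "input": prompt,
        "output": output,
        "latency_s": round(latency, 4),
        "est_cost_usd": round(chars / 1000 * COST_PER_1K_CHARS, 6),
        "feedback": None,  # placeholder: filled in later if the user rates it
    }
    log_file.write(json.dumps(record) + "\n")
    return output
```

One record per line keeps the log easy to grep and easy to load into a notebook once there is enough data to look for patterns.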

The deployment question

Cloudflare Pages + a static Astro frontend + a lightweight backend (Workers or a Node server) is my current default stack for AI apps. It’s fast, cheap, and scales without me having to think about it. I use Supabase for anything that needs persistence.

The HBS Campus Guide runs entirely static — no backend at all, no AI calls at runtime. All the “intelligence” is baked into the data model at build time. That’s the ideal: push complexity to build time, not runtime.

Ship ugly, iterate fast

The apps I’m most proud of shipped ugly. The first version of Orion had no design — it was a text box, a submit button, and a wall of text output. Users told me what they needed. I added design after I understood the product.

The AI part of an AI app is usually the easiest part. The hard part is the same as any other software: what problem are you solving, for whom, and how do you know when you’ve solved it?