
What nobody tells you about the messy reality of scaling
The Day Everything Changed
I still remember the exact moment our system first buckled. It was 2 AM, my phone was buzzing with alerts, and our users were flooding support with “the app won’t load” messages. We had just crossed 10,000 active users, and I was staring at server metrics that looked like a heart attack in progress.
That night taught me something no computer science textbook ever mentioned: scaling hurts before it helps.
My Naive Beginning: The Beauty of Not Knowing Better
Looking back, I’m almost nostalgic for our original setup. We had built something beautifully stupid:
- A single Django app that did everything
- One PostgreSQL database (because we thought we were too cool for MySQL)
- A $30/month DigitalOcean droplet
- Pure optimism as our load balancer
For the first 500 users, this thing purred like a contented cat. I could deploy by literally typing `git push heroku main`. Debugging meant adding a few print statements. When something broke, I knew exactly where to look because there was only one place it could break.
Those were the golden days when I actually slept through the night.
The Great Server Shopping Spree (Spoiler: It Didn’t Work)
When performance started degrading, I did what every desperate developer does: I threw money at it.
“Let’s get bigger servers!” I declared, upgrading from our humble droplet to a beefy 8-core machine with 32GB RAM. When that wasn’t enough, I spun up three more identical servers behind a load balancer.
The irony hit me later: I had just built a very expensive way to overwhelm the same single database. All those shiny new servers were still fighting over one tiny PostgreSQL instance like customers at a Black Friday sale.
My first hard lesson: Horizontal scaling is only as strong as your weakest link.
The Redis Epiphany That Saved My Sanity
The breakthrough came during a particularly brutal weekend. My teammate suggested we try Redis, and honestly, I was skeptical. “Another technology to learn? Another thing that can break?”
But desperation makes you open-minded.
I started small—caching user profiles and the most frequently accessed posts. The implementation was embarrassingly simple:
```python
import json

# redis and database are pre-configured module-level clients
def get_user_profile(user_id):
    # Check the cache first, fall back to the database on a miss
    cached = redis.get(f"profile:{user_id}")
    if cached:
        return json.loads(cached)
    profile = database.fetch_profile(user_id)
    # Cache the profile for one hour (3600 seconds)
    redis.setex(f"profile:{user_id}", 3600, json.dumps(profile))
    return profile
```
The results were immediate. Page load times dropped from 3-4 seconds to under 500ms. Database CPU usage fell off a cliff. I felt like I had discovered fire.
My second lesson: Sometimes the simplest solutions have the biggest impact.
The Microservices Mistake I Had to Make
Success made me cocky. “If caching works this well,” I thought, “imagine what proper architecture could do!”
I spent three months breaking our monolith into microservices:
- Auth Service for user management
- Content Service for posts and media
- Social Service for likes, follows, and feeds
On paper, it looked professional. In practice, it was a coordination nightmare.
A simple “load user timeline” request now involved:
- Auth service validates the token
- Social service finds who they follow
- Content service fetches recent posts
- Social service adds engagement data
- Everything gets stitched together
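To make that concrete, here’s roughly what the timeline endpoint turned into. This is an illustrative sketch; `auth_client`, `social_client`, and `content_client` are hypothetical stand-ins for our internal HTTP clients, not real library APIs.
```python
# Hypothetical sketch of the fan-out behind one "load timeline" request.
# Each call below is a separate network hop to another service.
def load_user_timeline(token):
    user = auth_client.validate_token(token)                # hop 1: auth service
    followee_ids = social_client.get_following(user["id"])  # hop 2: social service
    posts = content_client.get_recent_posts(followee_ids)   # hop 3: content service
    engagement = social_client.get_engagement(              # hop 4: social service again
        [post["id"] for post in posts]
    )
    # Stitch the pieces together before anything reaches the user
    return [
        {**post, "likes": engagement.get(post["id"], 0)}
        for post in posts
    ]
```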
What used to be one database query became a spider web of network calls. Latency went up. Complexity exploded. I spent more time debugging service-to-service communication than actual features.
My third lesson: Microservices solve organizational problems, not performance problems. Don’t use them just because Netflix does.
When Kafka Taught Me the Art of Letting Go
The real breakthrough came when I stopped trying to make everything synchronous. I discovered that most of what felt “urgent” really wasn’t.
When a user posts something, do we really need to immediately:
- Update their follower count?
- Send push notifications?
- Update recommendation algorithms?
- Log analytics events?
The answer was no. We needed to acknowledge the post immediately, but everything else could happen eventually.
Enter Kafka. Now when someone posts, we:
- Save the post to the database
- Return success to the user
- Drop a “PostCreated” event into Kafka
- Let other services process it in their own time
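A minimal sketch of that flow, assuming the kafka-python client and an existing `save_post` helper; the topic name and payload fields are illustrative:
```python
import json

from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda value: json.dumps(value).encode("utf-8"),
)

def create_post(user_id, content):
    # The only synchronous work: persist the post
    post = save_post(user_id, content)

    # Everything else (notifications, feeds, analytics) consumes this event later
    producer.send("post-created", {"post_id": post.id, "user_id": user_id})

    # The user gets an instant acknowledgement
    return {"status": "ok", "post_id": post.id}
```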
Suddenly, posting felt instant. Users were happy. Servers were happy. I was happy.
My fourth lesson: The user only cares about their immediate action. Everything else can wait.
The Database Wars: When PostgreSQL Waved the White Flag
At around 100,000 users, our database started sending distress signals. Query times that used to be 50ms were hitting 2-3 seconds. The CPU was pegged at 100%. Connection pools were exhausted.
I had three choices:
- Scale up: Bigger, more expensive database servers
- Scale out: Multiple databases with careful planning
- Optimize: Better queries, better indexing, better schema
I tried all three, in that order.
Scaling up bought us time but not a solution. Eventually, I implemented read replicas—one master for writes, three replicas for reads. Then I started sharding users by geographic region.
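Since the app was Django, the read-replica split boiled down to a database router along these lines; the alias names are assumptions, not our exact configuration:
```python
import random

class ReadWriteRouter:
    """Send writes to the primary, spread reads across the replicas."""

    read_replicas = ["replica_1", "replica_2", "replica_3"]

    def db_for_read(self, model, **hints):
        # Round-robin would also work; random keeps the sketch short
        return random.choice(self.read_replicas)

    def db_for_write(self, model, **hints):
        return "default"  # the primary

    def allow_relation(self, obj1, obj2, **hints):
        return True  # every alias points at the same schema

# settings.py
# DATABASE_ROUTERS = ["myproject.routers.ReadWriteRouter"]
```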
The hardest part wasn’t the technical implementation—it was accepting that my beautiful, simple database architecture was gone forever.
My fifth lesson: Scaling is a series of beautiful architectures you have to abandon.
Building My Crystal Ball: The Monitoring Revolution
For the first year, debugging felt like being a detective with no evidence. Something would break, users would complain, and I’d frantically dig through logs trying to piece together what happened and when.
Then I built what I call my “system crystal ball”:
- Grafana dashboards showing every metric that mattered
- ELK stack for searchable, filterable logs
- Custom alerts that woke me up before users noticed problems
- Distributed tracing to follow requests through our service maze
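As one small example of what feeds those dashboards (a sketch, not our exact setup), a request-latency histogram exported with prometheus_client gives Grafana something to graph:
```python
import time

from prometheus_client import Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "request_latency_seconds",
    "Time spent handling a request",
    ["endpoint"],
)

def timed(endpoint):
    """Decorator that records how long a view takes."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                return func(*args, **kwargs)
            finally:
                REQUEST_LATENCY.labels(endpoint=endpoint).observe(time.time() - start)
        return wrapper
    return decorator

# Prometheus scrapes metrics from :8001/metrics; Grafana graphs them
start_http_server(8001)
```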
The transformation was incredible. Instead of “something is slow,” I could see exactly which service, which endpoint, which database query was the problem.
More importantly, I started seeing patterns. Traffic spikes at lunchtime. Memory leaks that took exactly 6 hours to manifest. Database queries that got slower as our user table grew.
My sixth lesson: You can’t fix what you can’t see, and you can’t predict what you don’t measure.
The CDN Revelation: Why I Was Sending Images to Mars
I discovered we were doing something embarrassingly stupid. A user in Tokyo would request our CSS file, and we’d send it from our server in Virginia. Every time. For every user. Even though the file never changed.
Implementing Cloudflare CDN was like upgrading from a bicycle to a rocket ship. Static assets—images, CSS, JavaScript—now served from edge locations worldwide.
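In Django terms the change was small; here’s a sketch of the settings, with a placeholder CDN hostname:
```python
# settings.py -- illustrative sketch; cdn.example.com is a placeholder

# Serve static assets from the CDN edge instead of the origin server in Virginia
STATIC_URL = "https://cdn.example.com/static/"

# Fingerprinted filenames (style.a1b2c3.css) let the CDN cache them forever:
# a changed file gets a new name, so there is nothing to invalidate
STATICFILES_STORAGE = "django.contrib.staticfiles.storage.ManifestStaticFilesStorage"
```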
The impact was immediate:
- Page load times improved globally
- Server bandwidth usage dropped 60%
- Users stopped complaining about slow image loading
My seventh lesson: The internet is fast, but physics is still physics. Put your content close to your users.
What I Wish Someone Had Told Me on Day One
If I could send a message back to my naive, optimistic past self, here’s what I’d say:
Start With These From Day One
- Redis caching – Not when you need it, but before you think you do
- Proper monitoring – Grafana and good logging from commit #1
- Database indexing strategy – Plan for queries you don’t have yet (there’s a sketch of what I mean right after this list)
- Async job processing – Not everything needs to happen right now
- CDN for static assets – It’s cheap insurance against geography
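To show what I mean by the indexing point above (the model and field names are illustrative), this is the kind of migration worth writing before the query exists in production:
```python
# Hypothetical Django migration: a composite index for "recent posts by author"
from django.db import migrations, models

class Migration(migrations.Migration):
    dependencies = [("content", "0007_previous_migration")]

    operations = [
        migrations.AddIndex(
            model_name="post",
            index=models.Index(
                fields=["author_id", "-created_at"],
                name="post_author_recent_idx",
            ),
        ),
    ]
```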
Avoid These Tempting Mistakes
- Don’t scale by adding identical servers – Find your actual bottleneck first
- Don’t rush to microservices – Monoliths can scale further than you think
- Don’t optimize prematurely – But do measure everything
- Don’t assume cloud auto-scaling solves everything – It just makes problems more expensive
The Uncomfortable Truth About Scaling
Here’s what nobody tells you: Every scaling solution creates new problems.
Redis solved our database bottleneck but gave us cache invalidation headaches. Microservices improved our team’s autonomy but complicated our deployment pipeline. Load balancers improved our reliability but made debugging distributed issues harder.
Scaling isn’t about finding the perfect architecture—it’s about trading problems you can’t handle for problems you can.
My Current Philosophy: Embrace the Inevitable Break
These days, I don’t try to build systems that never break. I build systems that break gracefully and recover quickly.
Every bottleneck is a lesson. Every outage is a teacher. Every user complaint is data about what really matters.
When you’re serving hundreds of thousands of users and something goes wrong at 3 AM, remember: this is the luxury problem of success.
That spinning loader isn’t just a frustrated user—it’s proof that you built something people actually want to use.
The Real Secret
After five years of battle scars, server crashes, and 3 AM emergency fixes, I’ve learned the real secret of scaling:
It’s not about the perfect architecture. It’s about building something users love enough to stick with while you figure out the architecture.
Your system will break. Users will complain. Servers will crash.
And then you’ll fix it, learn from it, and build something better.
That’s not a bug in the process—that’s the process.
The best systems aren’t built by people who knew everything from the start. They’re built by people who learned everything the hard way and kept going anyway.
Start where you are. Use what you have. Fix what breaks. Repeat until millions.