Why Sri Lanka’s Fuel QR System Crashed in 2026: Lessons on Why “Simple” Systems Fail at National Scale

I just finished watching Krish Dinesh’s excellent video “Fuel QR System Failure Explained | Why Simple Systems Fail at Scale”, and I couldn’t stop thinking about it. As someone who builds software for a living, the breakdown hit hard. What looked like a straightforward web app with QR codes for fuel rationing turned out to be a textbook example of how even well-intentioned systems collapse when they hit real-world scale, legacy dependencies, and human behaviour under crisis.

The Context Most People Missed

In 2023, Sri Lanka introduced a QR-based fuel quota system during the economic crisis. In March 2026, with fresh global fuel shortages, the government reintroduced it. On the surface, the idea was simple: link a vehicle to a mobile number, generate a QR code, and let fuel stations scan it to issue the daily/weekly quota.

But within hours (or even minutes) of the 2026 relaunch, things started breaking. Long queues at petrol stations, error messages everywhere, people unable to register or fetch their QR, and widespread frustration. Many outsiders (and even some in tech) asked the obvious question: “How can such a simple system fail so badly?”

Krish’s answer is spot on: this was never a simple system. The visible web app was just the tip of the iceberg.

The Real Complexity Beneath the Surface

The system had to handle:

- Millions of users trying to register or validate at the same time

- Real-time (or near real-time) validation against the Department of Motor Traffic database (the registry better known as the RMV, for Registrar of Motor Vehicles)

- Legacy QR data from 2023 that was never properly cleaned

- Users in panic mode — refreshing, retrying, borrowing family phones, and submitting duplicates

- Fuel stations scanning codes under pressure

The biggest single point of failure was the **synchronous dependency on the RMV API**. Every time someone tried to register or refresh their QR code, the app sent a direct request to the RMV system to verify vehicle ownership. That legacy API was never designed for high concurrency. When thousands of requests hit it simultaneously, it slowed down or timed out. Users saw errors and retried even more aggressively — creating a classic retry storm that brought everything down.
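To make the retry storm concrete, here is a rough back-of-the-envelope model in Python. The numbers (1,500 new requests per tick against a capacity of 1,000) are made up for illustration and do not come from the video, and the function name `simulate_retry_storm` is my own. It assumes every failed request is retried exactly once on the next tick, which is optimistic: real users retry several times, and an overloaded service usually loses capacity as it struggles.

```python
def simulate_retry_storm(new_users_per_tick, capacity_per_tick, ticks):
    """Return the offered load per tick when every failed request is retried next tick.

    All numbers are hypothetical; this is a toy model, not a measurement.
    """
    backlog = 0  # requests that failed last tick and will be retried now
    offered = []
    for _ in range(ticks):
        load = new_users_per_tick + backlog       # fresh traffic plus retries
        served = min(load, capacity_per_tick)     # the legacy API can only do so much
        backlog = load - served                   # everything else comes back as a retry
        offered.append(load)
    return offered


loads = simulate_retry_storm(new_users_per_tick=1500, capacity_per_tick=1000, ticks=5)
print(loads)  # [1500, 2000, 2500, 3000, 3500]
```

Even in this gentle model the load never recovers: as long as new traffic exceeds capacity, every tick is worse than the last.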

On top of that, the 2026 rollout reused old 2023 QR data without proper cleanup. Vehicles had been sold, mobile numbers had changed or been disconnected, and ownership had been transferred. Suddenly, the same QR code or mobile number was being claimed by multiple people, or the system rejected valid users because the records no longer matched.

The Architectural Lessons That Stood Out to Me

Krish walked through several practical fixes that should have been in place from day one. These are the parts I found most valuable:

1. Never hit a legacy system synchronously at scale

Instead of calling the RMV API directly from the web app or frontend, the system should have queued the validation requests and processed them in controlled batches (e.g., 20 requests per minute per instance) so the downstream system isn’t overwhelmed. Once validation succeeds or fails, the system notifies the user asynchronously via SMS or push notification, quoting a reference number.
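Here is a minimal sketch of that idea in Python. It is my own illustration, not the actual system: `ValidationQueue`, `fake_rmv_verify`, and `fake_notify` are hypothetical names, the queue lives in memory, and the RMV call and the SMS gateway are replaced by stand-in functions. The point is only that `submit` returns a reference immediately, while `drain_batch` touches the slow dependency at a pace we control.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class ValidationRequest:
    reference: str       # reference number handed to the user right away
    vehicle_number: str
    mobile: str


class ValidationQueue:
    """In-memory stand-in for a durable queue in front of the RMV API."""

    def __init__(self, batch_size=20):
        self.batch_size = batch_size
        self._pending = deque()
        self._counter = 0

    def submit(self, vehicle_number, mobile):
        """Accept a request without calling the RMV; return a reference at once."""
        self._counter += 1
        reference = f"REF-{self._counter:06d}"
        self._pending.append(ValidationRequest(reference, vehicle_number, mobile))
        return reference

    def drain_batch(self, verify, notify):
        """Process at most batch_size requests. Meant to be called once per interval."""
        processed = 0
        while self._pending and processed < self.batch_size:
            req = self._pending.popleft()
            ok = verify(req.vehicle_number, req.mobile)
            notify(req.mobile, req.reference, ok)
            processed += 1
        return processed

    def pending(self):
        return len(self._pending)


# Hypothetical stand-ins for the RMV lookup and the SMS gateway.
notifications = []

def fake_rmv_verify(vehicle_number, mobile):
    return vehicle_number.startswith("WP")

def fake_notify(mobile, reference, ok):
    notifications.append((mobile, reference, "approved" if ok else "rejected"))


validation_queue = ValidationQueue(batch_size=20)
refs = [validation_queue.submit(f"WP-{i:04d}", f"07700000{i:02d}") for i in range(50)]
handled = validation_queue.drain_batch(fake_rmv_verify, fake_notify)
print(handled, validation_queue.pending())  # 20 30
```

In a real deployment a scheduler would call `drain_batch` once a minute per instance, and the queue would need to be durable so that requests survive a restart.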

2. Event-driven architecture and decoupling

Use something like Kafka (or any reliable message queue) to separate concerns: registration → validation → QR generation → notification. This way, even if the RMV service is slow, the rest of the system stays responsive and can retry intelligently.
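To show the shape of that pipeline, here is a toy version in Python where a small in-memory bus stands in for Kafka. Everything here is an assumption for illustration: the class `InMemoryBus`, the topic names, the `KNOWN_VEHICLES` set that plays the role of the RMV, and the `outbox` list that plays the role of the SMS gateway. A real broker adds persistence, partitions, and consumer groups, none of which this sketch attempts.

```python
from collections import defaultdict, deque


class InMemoryBus:
    """Tiny publish/subscribe bus; a stand-in for a real message broker."""

    def __init__(self):
        self._handlers = defaultdict(list)
        self._events = deque()

    def subscribe(self, topic, handler):
        self._handlers[topic].append(handler)

    def publish(self, topic, payload):
        self._events.append((topic, payload))

    def run(self):
        """Deliver queued events in order until none are left."""
        while self._events:
            topic, payload = self._events.popleft()
            for handler in self._handlers[topic]:
                handler(payload)


KNOWN_VEHICLES = {"WP-1234"}   # hypothetical stand-in for the RMV records
bus = InMemoryBus()
outbox = []                    # hypothetical stand-in for SMS delivery


def validate(event):
    # Only this stage talks to the (simulated) RMV.
    topic = "vehicle.validated" if event["vehicle"] in KNOWN_VEHICLES else "vehicle.rejected"
    bus.publish(topic, event)

def generate_qr(event):
    bus.publish("qr.generated", {**event, "qr": f"QR:{event['vehicle']}"})

def notify_success(event):
    outbox.append((event["mobile"], event["qr"]))

def notify_rejection(event):
    outbox.append((event["mobile"], "REJECTED"))


bus.subscribe("registration.received", validate)
bus.subscribe("vehicle.validated", generate_qr)
bus.subscribe("qr.generated", notify_success)
bus.subscribe("vehicle.rejected", notify_rejection)

bus.publish("registration.received", {"vehicle": "WP-1234", "mobile": "0770000001"})
bus.publish("registration.received", {"vehicle": "XX-0000", "mobile": "0770000002"})
bus.run()
print(outbox)
```

Because each stage only knows about topics, a slow validation stage delays its own events without blocking registration from accepting new ones.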

3. Idempotency and proper user feedback

Every request should carry a unique idempotency key. If a user retries, the system should recognise it’s the same request and not create duplicates. Give users immediate acknowledgement (“Your request has been received, ref #ABC123 — we’ll notify you shortly”) instead of making them wait in uncertainty.
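A minimal sketch of the idempotency idea in Python, under my own assumptions: the client generates one UUID per logical request and reuses it on every retry, and the server keeps a key-to-reference map. `RegistrationService` and its methods are hypothetical names, and a real service would store the keys in a shared database with an expiry rather than in a dictionary.

```python
import uuid


class RegistrationService:
    """Remembers idempotency keys so that a retried request is not processed twice."""

    def __init__(self):
        self._reference_by_key = {}
        self._next_number = 1

    def register(self, idempotency_key, vehicle, mobile):
        # A retry carries the same key, so hand back the original reference.
        if idempotency_key in self._reference_by_key:
            return self._reference_by_key[idempotency_key]
        reference = f"REF-{self._next_number:06d}"
        self._next_number += 1
        self._reference_by_key[idempotency_key] = reference
        return reference

    def count(self):
        return len(self._reference_by_key)


service = RegistrationService()

key = str(uuid.uuid4())  # generated once by the client, reused on retry
first = service.register(key, "WP-1234", "0770000001")
retry = service.register(key, "WP-1234", "0770000001")
other = service.register(str(uuid.uuid4()), "WP-5678", "0770000002")

print(first, retry, other)  # REF-000001 REF-000001 REF-000002
```

The anxious user who taps "Submit" five times ends up with one registration and the same reference number each time.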

4. Data integrity and cleanup strategy

Before re-launch, the team should have taken a fresh data dump from RMV, purged outdated QR records, and handled ownership transfers properly. Relying on mobile numbers as the primary identifier was flawed — vehicles get sold, numbers get recycled. The chassis number or engine number would have been far more reliable.
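As a rough illustration of what that cleanup pass could look like, here is a Python sketch that reconciles old QR records against a fresh RMV snapshot keyed by chassis number. The field names (`chassis`, `owner_id`, `mobile`) and the sample records are invented for this example; I do not know the real schema.

```python
def reconcile(old_qr_records, rmv_snapshot):
    """Split old QR records into (keep, purge) using the chassis number as the key.

    A record is kept only if the vehicle still exists in the fresh snapshot
    and is still registered to the same owner.
    """
    keep, purge = [], []
    for record in old_qr_records:
        current = rmv_snapshot.get(record["chassis"])
        if current is not None and current["owner_id"] == record["owner_id"]:
            keep.append(record)
        else:
            purge.append(record)
    return keep, purge


# Hypothetical data: one unchanged vehicle, one sold, one no longer registered.
old_records = [
    {"chassis": "CH001", "owner_id": "A", "mobile": "0770000001"},
    {"chassis": "CH002", "owner_id": "B", "mobile": "0770000002"},
    {"chassis": "CH003", "owner_id": "C", "mobile": "0770000003"},
]
fresh_snapshot = {
    "CH001": {"owner_id": "A"},
    "CH002": {"owner_id": "Z"},  # ownership transferred since the old data was captured
}

keep, purge = reconcile(old_records, fresh_snapshot)
print([r["chassis"] for r in keep])   # ['CH001']
print([r["chassis"] for r in purge])  # ['CH002', 'CH003']
```

Note that the mobile number plays no part in the match, so a recycled or disconnected number cannot cause a false claim.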

5. Phased rollout instead of big bang

Launching nationwide overnight was asking for trouble. A phased approach (by province, by vehicle type, or even by district) would have allowed the team to spot and fix issues while the load was still manageable.
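A phased rollout can be as simple as a gate in front of registration. The Python sketch below is my own illustration: the order of the phases is hypothetical, and `registration_open` is an invented name. Only the province names are real.

```python
# Hypothetical rollout plan: each phase lists the provinces where registration is open.
ROLLOUT_PHASES = [
    {"Western"},
    {"Western", "Central", "Southern"},
    {"Western", "Central", "Southern", "Northern", "Eastern",
     "North Western", "North Central", "Uva", "Sabaragamuwa"},
]


def registration_open(province, current_phase):
    """Return True if users from this province may register in the current phase."""
    if not 0 <= current_phase < len(ROLLOUT_PHASES):
        raise ValueError(f"unknown rollout phase: {current_phase}")
    return province in ROLLOUT_PHASES[current_phase]


print(registration_open("Western", 0))  # True
print(registration_open("Uva", 0))      # False
print(registration_open("Uva", 2))      # True
```

The same gate could key on vehicle type or district instead; what matters is that the team decides when to move to the next phase, based on how the previous one held up.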

6. Resilience patterns

Rate limiting, circuit breakers, caching of frequent validations, read replicas, and graceful degradation (e.g., allowing stations to work offline with periodic sync) were all missing or insufficient.
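Of these patterns, the circuit breaker is the one I find easiest to show in a few lines. The Python sketch below is a simplified version under my own assumptions: the thresholds are arbitrary, `failing_rmv_lookup` is a fake that always times out, and the clock is injected so the example runs without waiting. Production-grade breakers also track half-open trial limits and error rates, which this one skips.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after      # cool-down in seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None              # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("downstream skipped while circuit is open")
            # Cool-down elapsed: let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()   # open (or re-open) the circuit
            raise
        self._failures = 0
        self._opened_at = None                    # success closes the circuit
        return result


now = [0.0]    # fake clock so the demo does not need to sleep
calls = []     # records every call that actually reached the fake RMV

def failing_rmv_lookup(vehicle):
    calls.append(vehicle)
    raise TimeoutError("RMV did not respond")

breaker = CircuitBreaker(failure_threshold=3, reset_after=30.0, clock=lambda: now[0])
outcomes = []
for _ in range(5):
    try:
        breaker.call(failing_rmv_lookup, "WP-1234")
    except CircuitOpenError:
        outcomes.append("skipped")
    except TimeoutError:
        outcomes.append("timeout")

print(outcomes)    # ['timeout', 'timeout', 'timeout', 'skipped', 'skipped']
print(len(calls))  # 3
```

After three timeouts the breaker stops sending traffic to the struggling service, which gives it room to recover and lets the app fail fast with a clear message instead of hanging.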

Human Behaviour in a Crisis

One point Krish made that really resonated: during shortages, people don’t behave like calm, rational users in a normal app. They panic. They refresh constantly. They borrow phones. They try every workaround. Any system that doesn’t account for that “anxious user multiplier” is going to fail.

My Personal Takeaways as a Developer

Watching this video reminded me why architecture and operational thinking matter more than ever. Today, with AI tools generating code so quickly, the easy part really is writing the functionality. The hard part is:

- Designing for failure

- Planning for scale before you need it

- Understanding legacy integrations

- Thinking about rollout strategy and user psychology

It also made me reflect on how often we (as engineers) underestimate non-functional requirements. “It works on my machine” or “it works with 100 users” is very different from “it works when the entire country is trying to fill fuel at the same time.”

The video ends with a strong message: **Learn to think like an architect, not just a coder.** I couldn’t agree more.

If you’re a software engineer, tech lead, or architect, especially one working on public or high-scale systems, I highly recommend watching Krish Dinesh’s full video. It’s one of the best real-world case studies I’ve seen in a while.

https://www.youtube.com/watch?v=iwygBhXy-wU