How EverFold Stayed Online During the AWS Outage: A Technical Deep Dive

20 October 2025

On October 20th, 2025, a major AWS outage disrupted hundreds of platforms including Snapchat, Roblox, and Lloyds Bank. While the internet struggled, EverFold continued processing orders seamlessly. Here's how our event-driven architecture kept us resilient.

When Amazon Web Services experienced a major global outage on October 20th, 2025, the impact was immediate and widespread. Major platforms went dark, airlines struggled with booking systems, and banks faced service disruptions. Yet throughout the chaos, EverFold customers continued creating and ordering their personalised colouring books without interruption.

This wasn't luck; it was the result of deliberate architectural decisions we made from day one. Let's explore how EverFold's technical infrastructure proved resilient when it mattered most.

The Event-Driven Architecture Advantage

At the heart of EverFold's resilience is our event-driven architecture. Rather than building a monolithic system where every component depends on every other component being available, we designed EverFold around loosely coupled services that communicate through events.

When a customer completes a purchase, our system doesn't try to do everything at once. Instead, it follows a carefully orchestrated sequence:

Immediate Response: Our Stripe webhook handler acknowledges each payment confirmation within milliseconds, before any heavy processing begins. This avoids the timeout failures that plagued other services during the outage.

Asynchronous Processing: Using Next.js 15's after() function, we process orders in the background. PDF generation, database updates, and print job creation happen independently of the initial webhook response.

Graceful Degradation: If one component experiences delays (like S3 during the outage), other components continue functioning. Orders are recorded, customers are notified, and processing continues as services recover.

Idempotency Everywhere: Every operation is designed to be safely retried. Duplicate webhook calls, network retries, and service restarts can't create duplicate orders or charges.
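The acknowledge-first, process-later pattern described above can be sketched in a few lines. This is a minimal illustration rather than EverFold's actual handler: the `handleWebhook` function, the `Task` and `Ack` types, and the in-memory queue are hypothetical, and the `payment_intent.succeeded` event type is an assumption about which Stripe event is handled. In a Next.js 15 route handler, the `defer` parameter would typically be `after` from `next/server`.

```typescript
// Hypothetical types for illustration; a real handler would receive a verified Stripe event.
type Task = () => void;

interface WebhookEvent {
  id: string;
  type: string;
}

interface Ack {
  status: number;
  body: string;
}

// Returns the acknowledgement straight away and hands the slow work to `defer`.
// In a Next.js 15 route handler, `defer` would be after() from "next/server".
function handleWebhook(
  event: WebhookEvent,
  defer: (task: Task) => void,
  processOrder: (event: WebhookEvent) => void,
): Ack {
  if (event.type !== "payment_intent.succeeded") {
    // Acknowledge events we do not act on so the sender stops retrying them.
    return { status: 200, body: "ignored" };
  }
  defer(() => processOrder(event));
  return { status: 200, body: "received" };
}

// Tiny in-memory stand-in for the background scheduler, for demonstration only.
const pending: Task[] = [];
const processed: string[] = [];

const ack = handleWebhook(
  { id: "evt_1", type: "payment_intent.succeeded" },
  (task) => pending.push(task),
  (event) => processed.push(event.id),
);

// Nothing heavy has run by the time the acknowledgement is ready.
const processedBeforeDrain = processed.length;

// Simulate the background phase that runs after the response is sent.
pending.forEach((task) => task());
```

The point of the design is that the response never waits on PDF generation, database writes, or print job creation, so a slow downstream service cannot cause a webhook timeout.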

Multi-Layer Idempotency Protection

During network instability, services often retry failed requests. Without proper safeguards, this could lead to customers being charged multiple times or orders being duplicated. EverFold implements multiple layers of idempotency protection.

We track every Stripe event ID in DynamoDB. If a webhook is delivered multiple times (common during outages), we detect and skip duplicate processing immediately. Each payment intent is also recorded with a conditional write to DynamoDB. If two webhooks race to process the same payment, only one succeeds; the other gracefully acknowledges the existing record.
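To make the two layers concrete, here is a small sketch that uses an in-memory map as a stand-in for the table. The `ConditionalStore` class, the key prefixes, and the `recordPayment` function are hypothetical names chosen for illustration. With DynamoDB, the equivalent of `putIfAbsent` would be a `PutItem` request with a condition expression such as `attribute_not_exists(...)`, where the database itself guarantees that only one writer wins.

```typescript
// In-memory stand-in for a table with "write only if the key is absent" semantics.
// This class is hypothetical; in DynamoDB the atomicity comes from a conditional PutItem.
class ConditionalStore {
  private items = new Map<string, string>();

  // Returns true if the write happened, false if the key already existed.
  putIfAbsent(key: string, value: string): boolean {
    if (this.items.has(key)) return false;
    this.items.set(key, value);
    return true;
  }
}

type Outcome = "processed" | "duplicate_event" | "duplicate_payment";

// Layer 1 rejects a redelivered event; layer 2 rejects a second event for the same payment.
function recordPayment(
  store: ConditionalStore,
  eventId: string,
  paymentIntentId: string,
): Outcome {
  if (!store.putIfAbsent(`event#${eventId}`, paymentIntentId)) {
    return "duplicate_event";
  }
  if (!store.putIfAbsent(`payment#${paymentIntentId}`, eventId)) {
    return "duplicate_payment";
  }
  return "processed";
}

const store = new ConditionalStore();
const outcomes: Outcome[] = [
  recordPayment(store, "evt_1", "pi_1"), // first delivery
  recordPayment(store, "evt_1", "pi_1"), // same event delivered again
  recordPayment(store, "evt_2", "pi_1"), // different event for the same payment
];
```

Only the first call does any real work; the other two return early and can be acknowledged without side effects.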

Using DynamoDB's conditional expressions, we ensure that critical state changes (like marking an order as "processing") happen atomically, preventing race conditions even under high concurrency.
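An atomic state change of this kind is essentially a compare-and-set. The sketch below models it with an in-memory map; the `OrderTable` class and the status names other than "processing" are assumptions for illustration. In DynamoDB, the same check would be expressed as an `UpdateItem` request whose condition expression compares the stored status with the expected one.

```typescript
// Status names other than "processing" are assumed for this example.
type OrderStatus = "paid" | "processing" | "completed";

// Hypothetical in-memory table; a real database performs the check and write atomically.
class OrderTable {
  private statuses = new Map<string, OrderStatus>();

  create(orderId: string): void {
    this.statuses.set(orderId, "paid");
  }

  // Moves the order to `next` only if it is currently in `expected`.
  transition(orderId: string, expected: OrderStatus, next: OrderStatus): boolean {
    if (this.statuses.get(orderId) !== expected) return false;
    this.statuses.set(orderId, next);
    return true;
  }

  statusOf(orderId: string): OrderStatus | undefined {
    return this.statuses.get(orderId);
  }
}

const table = new OrderTable();
table.create("order_1");

// Two workers try to claim the same order; only the first transition succeeds.
const firstClaim = table.transition("order_1", "paid", "processing");
const secondClaim = table.transition("order_1", "paid", "processing");
```

The worker whose transition returns false knows another worker already owns the order and can stop without doing anything.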

Strategic Service Decoupling

EverFold's architecture separates concerns across multiple AWS services, each chosen for specific resilience characteristics. Our order and book data lives in DynamoDB, one of AWS's most resilient database services. It replicates data automatically across multiple Availability Zones, and its availability SLA is 99.99% for standard tables (99.999% for global tables). Outside the affected region, it remained operational throughout the outage.

Customer images and generated PDFs are stored in S3 and accessed through presigned URLs. Even if S3 experiences brief delays, our system queues operations and retries them automatically. Email confirmations are sent via Amazon SES, but failures don't block order processing. Emails are logged to S3 for audit trails and can be resent if delivery fails.
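Automatic retries are usually paired with a capped exponential backoff so that a struggling service is not hammered while it recovers. The sketch below shows one way to do this. The function names, the 100 ms base delay, and the 2000 ms cap are illustrative assumptions, not values from our configuration. The `wait` function is injected so the example stays synchronous; production code would normally be asynchronous and await a timer instead.

```typescript
// Capped exponential backoff: baseMs * 2^attempt, never above capMs.
function backoffDelayMs(attempt: number, baseMs: number, capMs: number): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

interface RetryResult<T> {
  ok: boolean;
  value?: T;
  attempts: number;
}

// Calls `operation` until it succeeds or `maxAttempts` is reached.
function withRetries<T>(
  operation: () => T,
  maxAttempts: number,
  wait: (ms: number) => void,
): RetryResult<T> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return { ok: true, value: operation(), attempts: attempt + 1 };
    } catch {
      // Wait before the next try, but not after the final failure.
      if (attempt < maxAttempts - 1) wait(backoffDelayMs(attempt, 100, 2000));
    }
  }
  return { ok: false, attempts: maxAttempts };
}

// Demo: a hypothetical upload that fails twice, then succeeds.
let calls = 0;
const waits: number[] = [];
const upload = withRetries(
  () => {
    calls++;
    if (calls < 3) throw new Error("storage temporarily unavailable");
    return "stored";
  },
  5,
  (ms) => waits.push(ms), // record the delay instead of sleeping
);
```

Because the failure is reported as a result rather than thrown, a caller can log a failed email or upload for a later resend without aborting the rest of the order.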

Our integration with a printing supplier's service includes comprehensive error handling. If the print job creation fails, PDFs are already safely stored and the job can be recreated manually or automatically.

The Power of Eventual Consistency

One of our key architectural decisions was embracing eventual consistency rather than demanding immediate consistency. This might sound like a compromise, but it's actually a superpower during outages. When a customer completes checkout, they receive immediate confirmation that their payment was successful. Behind the scenes, their order moves through several stages: PDF generation, print job creation, and email notification. Each stage is independent and can complete at its own pace.

If S3 is slow during an outage, PDF generation might take longer than usual, but the customer's order is already recorded, their payment is captured, and they've received confirmation. The system will complete the remaining steps as soon as services recover, with comprehensive logging ensuring nothing falls through the cracks.
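One way to picture this is an order that remembers which stages have finished, plus a function that can be run repeatedly until nothing is pending. The sketch below is illustrative only: the `Order` shape, the stage names, and the `advance` function are hypothetical, and the demo assumes that both PDF generation and print job creation need storage to be available.

```typescript
type Stage = "pdf" | "printJob" | "email";
const STAGES: Stage[] = ["pdf", "printJob", "email"];

interface Order {
  id: string;
  done: Set<Stage>; // stages that have already completed
}

// Attempts every stage that has not completed yet and returns those still pending.
// A failed stage is simply left for the next pass; it does not block the others.
function advance(order: Order, runStage: (stage: Stage) => boolean): Stage[] {
  for (const stage of STAGES) {
    if (order.done.has(stage)) continue; // never redo finished work
    if (runStage(stage)) order.done.add(stage);
  }
  return STAGES.filter((stage) => !order.done.has(stage));
}

// Demo: storage is down on the first pass, so only the email stage completes.
const order: Order = { id: "order_1", done: new Set<Stage>() };
let storageUp = false;
const runStage = (stage: Stage): boolean => stage === "email" || storageUp;

const pendingDuringOutage = advance(order, runStage);

storageUp = true; // services recover
const pendingAfterRecovery = advance(order, runStage);
```

Running `advance` again after recovery picks up exactly where the order left off, which is what lets the system converge on a complete order without manual intervention.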

Real-World Impact: October 20th, 2025

When the AWS outage hit on October 20th, EverFold's architecture proved its worth. While major platforms struggled, our system demonstrated the power of resilient design under real-world pressure.

The Northern Virginia Region Challenge

The outage particularly affected AWS's us-east-1 (Northern Virginia) region, which is one of the largest and most widely used AWS regions globally. For EverFold, this meant that US customer orders initially failed to process as our primary infrastructure in us-east-1 became unavailable.

However, this is precisely the scenario our architecture was designed to handle. Rather than losing orders or requiring customers to retry their purchases, our system gracefully managed the disruption:

Stripe Webhook Resilience: Stripe's webhook system automatically queued failed delivery attempts. When our services couldn't respond due to the regional outage, Stripe held onto the payment confirmation events rather than discarding them.

Automatic Failover: Our infrastructure includes cross-region failover capabilities. When us-east-1 experienced issues, traffic was automatically routed to our secondary region in Europe (eu-west-1), ensuring the application remained accessible.

Idempotent Recovery: As us-east-1 came back online, Stripe began redelivering the queued webhooks. Our idempotency protection ensured that each order was processed exactly once, despite multiple delivery attempts during the recovery period.

Zero Data Loss: Every payment captured by Stripe was safely recorded. No customer was charged without receiving their order, and no order was duplicated despite the chaos of the recovery process.
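The failover step in the list above comes down to a simple decision: send traffic to the highest-priority region that is currently healthy. The sketch below shows only that decision logic. The `pickRegion` function and `RegionHealth` type are hypothetical, and how health is measured and how traffic is actually rerouted (for example at the DNS layer) is outside the scope of this example.

```typescript
interface RegionHealth {
  region: string;
  healthy: boolean;
}

// Returns the first region in `priority` that is reported healthy,
// or undefined if no region is healthy.
function pickRegion(priority: string[], health: RegionHealth[]): string | undefined {
  const healthyRegions = new Set(
    health.filter((entry) => entry.healthy).map((entry) => entry.region),
  );
  return priority.find((region) => healthyRegions.has(region));
}

// Primary first, secondary second, as described in the text.
const priority = ["us-east-1", "eu-west-1"];

const duringOutage = pickRegion(priority, [
  { region: "us-east-1", healthy: false },
  { region: "eu-west-1", healthy: true },
]);

const afterRecovery = pickRegion(priority, [
  { region: "us-east-1", healthy: true },
  { region: "eu-west-1", healthy: true },
]);

const totalOutage = pickRegion(priority, [
  { region: "us-east-1", healthy: false },
  { region: "eu-west-1", healthy: false },
]);
```

Because the priority order is fixed, traffic returns to the primary region on its own once its health checks pass again.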

The result? US customers who placed orders during the peak of the outage experienced a brief delay in order processing, but once services resumed, their orders were automatically processed in sequence. The idempotency keys we track (Stripe event IDs and payment intent IDs) ensured perfect deduplication even as webhooks were retried multiple times.

Global Resilience in Action

While US orders faced temporary delays, customers in other regions continued to experience normal service:

UK and European customers: Uninterrupted service throughout the outage
Payment processing: 100% success rate via Stripe (which remained operational)
Order records: Successfully written to DynamoDB once regions recovered
PDF generation: Completed with only minor latency increases during recovery
Email confirmations: Delivered successfully via SES
Print jobs: Created successfully via our printing supplier's external API

Most importantly, no manual intervention was required. The event-driven architecture, combined with Stripe's webhook retry mechanism and our comprehensive idempotency protection, handled the entire recovery process automatically. When we reviewed our logs after the incident, we found that every single order had been processed correctly, with perfect deduplication despite the challenging conditions.

Lessons for Modern Web Applications

The October 2025 AWS outage reinforced several key principles for building resilient applications:

Assume every external service will fail eventually. Build retry logic, timeouts, and fallbacks from the start.
Don't make users wait for operations that can happen in the background. Respond quickly and process thoroughly.
Every operation should be safely retryable. This is non-negotiable for payment systems and critical workflows.
Loosely coupled services can fail independently without cascading failures. Event-driven architectures excel here.
You can't fix what you can't see. Comprehensive logging and monitoring are essential for understanding system behaviour during incidents.

Looking Forward

The AWS outage was a reminder that even the most reliable infrastructure can experience disruptions. While our architecture performed admirably, the us-east-1 outage highlighted areas for continued improvement. At EverFold, we're committed to enhancing our resilience:

Enhanced multi-region failover with active-active deployment across US and EU regions
Expanded monitoring and alerting capabilities with region-specific health checks
Additional redundancy for critical operations
Regular chaos engineering exercises to test regional failover scenarios
Documenting and sharing our learnings with the broader community

Building resilient systems isn't about preventing all failures; it's about designing systems that gracefully handle failures when they inevitably occur. The October 2025 AWS outage proved that EverFold's event-driven architecture, comprehensive idempotency protection, and strategic service decoupling deliver real-world resilience when it matters most.

While we're proud of how our systems performed during the outage, we're even more committed to ensuring our customers can always create and order their personalised colouring books, regardless of what challenges the infrastructure throws at us.
