AWS Connectivity Issues Causing API Downtime

Incident Report for RevenueCat

Resolved

Marking resolved. AWS has told us that they've resolved the underlying issue, and we don't expect further downtime.

What to expect:
1. New purchases could not be processed by RevenueCat during the downtime. If your app uses one of RevenueCat’s SDKs, the SDK will retry sending the purchase to RevenueCat the next time the user opens the app.
2. Existing purchases that were already in RevenueCat were not affected aside from delayed event generation and dispatch. Those users’ purchases continue to be tracked.
3. If you call our API directly (for example, you use Stripe with RevenueCat or call the API from your app without our SDKs), you should retry any failed requests.
4. Any requests that are retried will cause events to be generated and dispatched to any integrations you have configured in RevenueCat.
5. ETLs, charts, and customer lists are all operational and up to date.
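
For item 3, one common way to retry failed requests is exponential backoff with jitter, so that retries from many clients don't all land at once. The sketch below is illustrative only: `retry_with_backoff`, `RetryableError`, and the `send` callable are hypothetical names, not part of RevenueCat's API or SDKs. It assumes you wrap your own request in a function that raises `RetryableError` on failures worth retrying (for example, timeouts or 5xx responses).

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical error: raised by your request function for failures worth retrying."""


def retry_with_backoff(send, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       sleep=time.sleep):
    """Call send() until it succeeds or max_attempts is reached.

    send         -- zero-argument callable that performs one request (your code)
    max_attempts -- total number of tries, including the first
    base_delay   -- upper bound, in seconds, on the wait after the first failure
    max_delay    -- cap on the upper bound of any single wait
    sleep        -- injectable so the function can be tested without waiting
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except RetryableError:
            if attempt == max_attempts:
                # Out of attempts: surface the last failure to the caller.
                raise
            # The upper bound doubles after each failure, up to max_delay.
            upper = min(max_delay, base_delay * 2 ** (attempt - 1))
            # "Full jitter": wait a random time between 0 and the upper bound.
            sleep(random.uniform(0, upper))
```

You would call it as `retry_with_backoff(lambda: post_receipt(payload))`, where `post_receipt` is your own (hypothetical) function that sends one request. As item 4 notes, a retried request that succeeds will generate events and dispatch them to your configured integrations.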

Posted Feb 10, 2023 - 22:31 UTC

Monitoring

Event dispatching has been restored. There are a few things left to stand up, but mostly we're back up and running. Happy Friday.

Posted Feb 10, 2023 - 21:45 UTC

Update

The API and Dashboard are back to 100% and fully functioning. We still have our event dispatching system paused and will begin restoring it soon.

Posted Feb 10, 2023 - 21:28 UTC

Update

Bringing the traffic up to 75% now, so far so good. AWS is saying the outage is "intermittent", so we're monitoring for any regressions.

Posted Feb 10, 2023 - 21:23 UTC

Update

Ramping traffic to 50% ... officially a partial outage... yay?

Posted Feb 10, 2023 - 21:14 UTC

Update

Raising traffic to 20% ... stay on target.

Posted Feb 10, 2023 - 21:10 UTC

Identified

We're slowly ramping up traffic to the production databases. If all goes well, the issue will be resolved soon.

Posted Feb 10, 2023 - 21:09 UTC

Update

We were just able to reconnect to the database. Going to see if we can begin restoring service.

Posted Feb 10, 2023 - 21:02 UTC

Update

AWS confirmed the issue an hour and a half later:
"Feb 10 12:53 PM PST We are investigating connectivity issues affecting some instances in a the US-EAST-1 Region."

We're restoring a 15-hour-old snapshot that will allow us to restore service should AWS not resolve the issue soon. Should we go down this path, there will be some partial data loss from today, but most of that will be recoverable. Will continue with updates as we progress.

Posted Feb 10, 2023 - 20:59 UTC

Update

The most recent message from AWS:
"Unfortunately many DBs have been impacted on a large scale, so it is not specific to this DB"

This is unexplored territory for us: our backup database is exhibiting the same failure. We're running down some possible mitigations in case AWS doesn't resolve the issue soon.

Posted Feb 10, 2023 - 20:38 UTC

Update

Message from AWS support: "There appears to be some issues for us-east-1 and our internal team is aware and working on mitigating the issue."

Posted Feb 10, 2023 - 20:16 UTC

Update

We're in contact with AWS, and they believe it's an incident on their end. We've failed over the database to a separate availability zone, but that didn't resolve the issue. We're beginning work on additional mitigations but still do not have an ETA.

Posted Feb 10, 2023 - 20:11 UTC

Update

We are continuing to investigate this issue.

Posted Feb 10, 2023 - 19:50 UTC

Update

We are continuing to investigate this issue.

Posted Feb 10, 2023 - 19:32 UTC

Investigating

The dashboard is currently inaccessible. We are investigating the issue.

Posted Feb 10, 2023 - 19:29 UTC

This incident affected: API Uptime, Event Dispatching and Dashboard (Overview, Charts, Customer Lists).