Degraded load times and live updates

Incident Report for ProductOS

Postmortem

Some API servers began consuming large amounts of CPU, which caused them to fail health checks and be restarted. The restarts led to intermittent failed or slow network requests and degraded performance for real-time updates. We scaled up the number of API instances to compensate, but at the same time we hit a load balancer bug, caused by a misconfiguration, that prevented network traffic from reaching our backend instances. The load balancer issue was fixed quickly and service returned to normal.

Posted Jul 27, 2020 - 15:29 UTC

Resolved

This incident has been resolved.

Posted Jul 27, 2020 - 15:26 UTC

Update

We are continuing to monitor for any further issues.

Posted Jul 27, 2020 - 15:16 UTC

Monitoring

The number of available API servers has been increased and we are monitoring the situation.

Posted Jul 27, 2020 - 14:12 UTC

Update

We are continuing to work on a fix for this issue.

Posted Jul 27, 2020 - 13:43 UTC

Identified

The Jellyfish API servers are experiencing elevated CPU usage, resulting in repeated instance restarts. This is affecting API query times and real-time updates, including messaging. We are increasing the number of available servers to mitigate the issue.

Posted Jul 27, 2020 - 13:43 UTC

This incident affected: Jellyfish (API).