- At 4:17pm WIB, we detected an unusual spike in activity on our messaging queue system. Our team began investigating the issue and worked with our service provider on a resolution.
- The investigation found that our messaging queue system had hit its capacity threshold for the number of connections to a single host.
- At 6:20pm, we triggered redeployments to close the existing connections. Four minutes after the deployment, memory usage spiked temporarily because of the way the vendor designed the messaging system to handle the closing of connections.
- This spike degraded one of our instances, resulting in bad gateway errors and delayed callbacks.
- At 6:40pm, we triggered another round of redeployments to recover our systems.
- At 7:23pm, our systems fully recovered and all delayed callbacks were sent.
What measures have we taken to prevent this issue in the future?
We are taking the following action items to prevent this issue from happening again:
- Improving our monitoring systems: We will lower the thresholds for identifying anomalous messaging connections so that our on-call teams are alerted earlier and can rectify spikes sooner.
- Improving our development processes: We are adding reconnection logic to our queue clients so they re-establish connections to healthy instance nodes, mitigating errors when any single instance goes down. A rough sketch of this approach is shown below.
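For illustration only, here is a minimal sketch of what such reconnection logic can look like. The host names and the `open_connection` helper are hypothetical placeholders, not our actual client library or infrastructure; the point is that a client rotates to another queue node with backoff instead of staying pinned to a degraded instance.

```python
import random
import time

# Hypothetical broker hosts for illustration; the real deployment uses its own host list.
BROKER_HOSTS = ["queue-node-1.internal", "queue-node-2.internal", "queue-node-3.internal"]


def open_connection(host):
    """Placeholder for the vendor client's connect call (an assumption, not the real API)."""
    raise ConnectionError(f"cannot reach {host}")  # simulate a degraded or unreachable node


def connect_with_failover(hosts, max_attempts=5, base_delay=0.5):
    """Try hosts in rotation with exponential backoff until one accepts the connection."""
    for attempt in range(max_attempts):
        host = hosts[attempt % len(hosts)]           # rotate to the next candidate node
        try:
            return open_connection(host)
        except ConnectionError:
            delay = base_delay * (2 ** attempt)      # back off so we don't hammer the broker
            delay += random.uniform(0, base_delay)   # add jitter to avoid reconnection storms
            time.sleep(delay)
    raise ConnectionError("no healthy queue node reachable")
```

A shuffled or health-checked host list would serve the same purpose; the key property is that clients fail over to healthy nodes rather than repeatedly retrying a single degraded instance.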
We understand that this was an upsetting situation for you and your customers, and we know that you are counting on our reliability for the smooth operation of your business. We sincerely apologize for the negative impact of this incident on your customers and business. We are committed to learning from this event and to improving our services further so we can serve you better.
If you require any assistance or have further questions, please contact us at email@example.com or through live chat at https://www.xendit.co/.
We strive to improve our services every day and will do our best to prevent a repeat of this incident. Thank you for your trust in Xendit to power your business.