"We saw an improvement as healthy instances entered service. We still had some less-critical production issues which were mitigated or being worked on, and we still had increased packet loss in our network," Slack said.

Its web tier, however, had a sufficient number of functioning hosts to serve traffic, but its load balancing tier was still showing an extremely high rate of health check failures against its web application instances due to the network problems. The load balancers' "panic mode" feature kicked in, and traffic was balanced across instances even when they were failing health checks. "This - plus retries and circuit breaking - got us back to serving," it said. By around 9.15am PST, Slack was "degraded, not down".

By the time Slack had recovered, engineers at AWS had found the trigger for its problems. "Part of our AWS networking infrastructure had indeed become saturated and was dropping packets," it said. "On January 4th, one of our Transit Gateways became overloaded." The Transit Gateways (TGWs) are managed by AWS and are intended to scale transparently.

However, Slack's annual traffic pattern is a little unusual: traffic is lower over the holidays, as everyone disconnects from work (good job on the work-life balance, Slack users!). "On the first Monday back, client caches are cold and clients pull down more data than usual on their first connection to Slack. We go from our quietest time of the whole year to one of our biggest days quite literally overnight."

While Slack said its own serving systems scaled quickly to meet such peaks in demand, its TGWs did not scale fast enough.
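The "panic mode" behaviour described above is a common load-balancer safeguard: when too large a fraction of backends fails health checks, the balancer assumes the checks themselves have become unreliable and routes to every host rather than funnelling all traffic onto the few that still look healthy. The sketch below is a minimal illustration of that logic in Python; the 50% threshold and the host names are assumptions for the example, not details from Slack's postmortem, and the article does not say which load balancer or settings Slack uses.

```python
import random

# Hypothetical panic-mode selection logic (threshold is an assumed value).
PANIC_THRESHOLD = 0.5  # fall back to all hosts when < 50% pass health checks

def pick_backend(backends):
    """backends: list of (host, passing_health_checks) pairs."""
    healthy = [host for host, ok in backends if ok]
    if len(healthy) / len(backends) >= PANIC_THRESHOLD:
        # Normal mode: route only to hosts that pass health checks.
        return random.choice(healthy)
    # Panic mode: so many hosts look unhealthy that the health-check signal
    # is treated as unreliable (e.g. packet loss failing probes against hosts
    # that can still serve), so traffic is spread across every host instead.
    return random.choice([host for host, _ in backends])

# Example: with 8 of 10 hosts failing probes, panic mode keeps all 10 in rotation.
backends = [(f"web-{i}", i < 2) for i in range(10)]
print(pick_backend(backends))
```

In an incident like this one, where packet loss makes probes fail against instances that can still serve, spreading traffic across all of them preserves capacity, which is consistent with Slack describing itself as "degraded, not down".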