Microsoft had three staff at Australian data centre campus when Azure went out

Cascading failures and root causes revealed.

Microsoft had “insufficient” staffing at its Australian data centre campus last week when a power sag knocked the chiller plant serving two data halls offline, cooking portions of its storage hardware.

The company has released a preliminary post-incident report (PIR) for the large-scale failure, which saw large enterprise customers including Bank of Queensland and Jetstar completely lose service.

The PIR sheds light on why some enterprises lost service altogether: so many storage nodes were gracefully shut down - or had components fried - during the incident that some data, and every replica of it, was offline at the same time.
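To illustrate the replica problem in general terms - this is a simplified model with assumed replica and node counts, not Azure Storage's actual placement logic - the following sketch shows how taking enough nodes offline leaves some data with no surviving copy:

```python
import random

# Simplified illustration: each data item is replicated on 3 of 20 nodes.
# The replica count, node count and outage size are assumptions chosen
# only to show the effect; they are not figures from the PIR.
random.seed(1)
nodes = list(range(20))
items = {i: random.sample(nodes, 3) for i in range(1_000)}

# Suppose the incident shuts down or damages 15 of the 20 nodes.
offline = set(random.sample(nodes, 15))

# An item is unavailable only when *every* one of its replicas is offline.
unavailable = [i for i, replicas in items.items()
               if all(node in offline for node in replicas)]
print(f"{len(unavailable)} of {len(items)} items have no online replica")
```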

In addition, after storage nodes were finally recovered, a "tenant ring" hosting over 250,000 databases failed - albeit with uneven impact on customers.

Chillers offline

Microsoft said the cooling capacity for the two affected data halls “consisted of seven chillers, with five chillers in operation and two chillers in standby (N+2)”.

A power sag - a voltage dip - caused all five operating chillers to fault, and only one of the two standby units worked.
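As a back-of-envelope check on those numbers - only the five-operating, two-standby split comes from the report; the rest is illustrative - the arithmetic of the failure looks like this:

```python
# Back-of-envelope arithmetic for the N+2 chiller plant described in the PIR.
required = 5           # N: chillers needed to carry the cooling load
standby = 2            # +2: spare units
total = required + standby

faulted_by_sag = 5     # all five operating chillers faulted on the voltage sag
standby_started = 1    # only one of the two standby units worked

running = total - faulted_by_sag - (standby - standby_started)
print(f"{running} of {total} chillers running; {required} needed; "
      f"shortfall of {required - running}")
```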

Microsoft said the onsite staff “performed our documented emergency operational procedures (EOP) to attempt to bring the chillers back online, but were not successful.”

The company appeared to be caught out by the scale of the incident, with not enough staff onsite, and its emergency procedures not catering for the size of the issue.

“Due to the size of the data centre campus, the staffing of the team at night was insufficient to restart the chillers in a timely manner,” the company said.

“We have temporarily increased the team size from three to seven, until the underlying issues are better understood and appropriate mitigations can be put in place.”

On its EOP, Microsoft said: “The EOP for restarting chillers is slow to execute for an event with such a significant blast radius.”

“We are exploring ways to improve existing automation to be more resilient to various voltage sag event types.”

While there weren’t enough staff onsite to execute the documented procedures, having more staff may only have reached the same result faster, as the chillers themselves had issues.

Preliminary investigations showed the chiller plant did not automatically restart “because the corresponding pumps did not get the run signal from the chillers.”

“This is important as it is integral to the successful restarting of the chiller units,” Microsoft said.

“We are partnering with our OEM vendor to investigate why the chillers did not command their respective pump to start.”

Microsoft said the faulted chillers could not be manually restarted “as the chilled water loop temperature had exceeded the threshold.”

With temperatures rising and infrastructure raising thermal warnings, Microsoft had no choice but to shut down servers.

“This successfully allowed the chilled water loop temperature to drop below the required threshold and enabled the restoration of the cooling capacity,” it said.

Storage, SQL database recovery

Still, not everything recovered smoothly.

The incident impacted seven storage tenants - five “standard”, two “premium”.

Some storage hardware was “damaged by the data hall temperatures”, Microsoft said. 

Diagnostics weren’t available for troubleshooting because the storage nodes were offline.

“As a result, our onsite data centre team needed to remove components manually, and re-seat them one by one to identify which particular component(s) were preventing each node from booting,” Microsoft said.

“Several components needed to be replaced for successful data recovery and to restore impacted nodes. 

“In order to completely recover data, some of the original/faulty components were required to be temporarily re-installed in individual servers.”

An infrastructure-as-code automation also failed, “incorrectly approving stale requests, and marking some healthy nodes as unhealthy, which slowed storage recovery efforts.”
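The PIR does not describe the automation in detail, but the failure mode - acting on requests that no longer reflect reality - is a familiar one. A minimal sketch of a generic staleness guard, assuming a hypothetical request format and threshold, might look like this:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Illustrative only: a generic staleness guard of the kind the PIR implies
# was missing, not Microsoft's actual infrastructure-as-code tooling.
MAX_REQUEST_AGE = timedelta(minutes=30)  # assumed threshold

def should_apply(created_at: datetime, now: Optional[datetime] = None) -> bool:
    """Reject automation requests that have sat queued for too long,
    since the node health they were based on may no longer hold."""
    now = now or datetime.now(timezone.utc)
    return (now - created_at) <= MAX_REQUEST_AGE

# A request raised hours earlier, mid-outage, should be dropped, not approved.
stale_request = datetime.now(timezone.utc) - timedelta(hours=6)
print(should_apply(stale_request))  # False
```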

The failure of a tenant ring hosting over 250,000 SQL databases further slowed recovery, Microsoft said.

“As we attempted to migrate databases out of the degraded ring, SQL did not have well tested tools on hand that were built to move databases when the source ring was in [a] degraded health scenario,” the company said.

“Soon this became our largest impediment to mitigating impact.”

A final PIR is expected to be completed in a few weeks.
