Affected services:

  • Crane
  • Anvil
  • Attic
  • /common
  • Authentication
  • Networking
  • Rhino
  • Crane Open OnDemand
  • Rhino Open OnDemand

Issues with PKI datacenter (CRANE, ATTIC, ANVIL, COMMON, RHINO)

Opened on Tuesday 27th October 2020

Resolved

The power infrastructure issues at our PKI datacenter have been corrected at this time, and CRANE, ANVIL, ATTIC, COMMON, and the associated resources are back online and available for access. As with any complex system there may be a few lingering issues, so if you encounter any problems or have questions, please contact hcc-support@unl.edu. Please see the important notes below for actions you should take. A summary of this unexpected outage, which was ultimately out of our control, is at the bottom.

*** IMPORTANT ***

CRANE: As this was a full power loss for the entire PKI datacenter, all running workflows on Crane are guaranteed to have been disrupted. You will almost certainly need to check your workflows for consistency and restart any jobs that were running when the power outage occurred and were not already re-queued by the scheduler.
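
If you have many jobs, the sketch below shows one way to list jobs that ended in a failure-like state around the time of the outage. It is a minimal example assuming Crane's Slurm scheduler and the sacct command on a login node; the start date and state filter are illustrative only, so adjust them to your own jobs.

    #!/usr/bin/env python3
    """List jobs that may have been disrupted by the 2020-10-27 power outage.

    A minimal sketch assuming Slurm's sacct is available on the login node;
    the reporting window and state filter are examples, not HCC guidance.
    """
    import getpass
    import subprocess

    # Ask sacct for this user's jobs that ended in a failure-like state
    # on or after the morning of the outage.
    result = subprocess.run(
        [
            "sacct",
            "-u", getpass.getuser(),
            "-S", "2020-10-27",                      # start of the reporting window
            "--state", "FAILED,NODE_FAIL,CANCELLED,TIMEOUT",
            "--format", "JobID,JobName,Partition,State,End",
            "--parsable2",                           # '|'-delimited output, easier to parse
            "--noheader",
        ],
        capture_output=True, text=True, check=True,
    )

    # Print a short report; resubmit anything listed here that you still need.
    for line in result.stdout.splitlines():
        jobid, name, partition, state, end = line.split("|")
        print(f"{jobid:>12}  {state:<10}  {name}  (ended {end})")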

ANVIL: All running instances on the Anvil OpenStack cloud were stopped due to the power loss and will need to be started via the anvil.unl.edu web interface. It is possible an instance may show as "up" and "running" but have no network access. If you experience this, please do a "Soft Reboot" on your instance; if that doesn't resolve the issue after a short time, please contact hcc-support@unl.edu for assistance.
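
The web interface is the recommended route, but if you manage many instances you could script the same steps against the OpenStack API. The sketch below is an illustrative example using the openstacksdk Python package; the cloud name "anvil" and a configured clouds.yaml entry are assumptions on our part, not something provided in this announcement.

    #!/usr/bin/env python3
    """Start any stopped Anvil instances (illustrative alternative to the web UI).

    A minimal sketch assuming the openstacksdk package and a clouds.yaml
    entry named "anvil"; credentials and cloud name are assumptions.
    """
    import openstack

    conn = openstack.connect(cloud="anvil")   # reads clouds.yaml / environment

    for server in conn.compute.servers():
        if server.status == "SHUTOFF":
            print(f"Starting {server.name} ({server.id}) ...")
            conn.compute.start_server(server)
        elif server.status == "ACTIVE":
            # Instance reports as running; if it has no network access, a
            # soft reboot is the first thing to try, per the note above:
            # conn.compute.reboot_server(server, "SOFT")
            print(f"{server.name} is ACTIVE; verify network access manually.")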

ATTIC: No data loss is expected for Attic or the offsite replication; however, any file transfers that were in progress when the power outage occurred should be double-checked for consistency.

COMMON: No data loss is expected for /common. If you had jobs on the RHINO cluster in Lincoln that were using /common when the power outage occurred, they most likely failed, as /common is served from our PKI datacenter. Jobs on RHINO that were not accessing /common are likely to have survived without issue. As always, please double-check your jobs and restart them as necessary.

Outage summary: This outage was one of the lengthier and more widespread in HCC's history, as it covered our entire PKI datacenter and was fully unexpected. At this time, it appears work on fire alarm systems outside of our datacenter inadvertently triggered the emergency cutoff to our whole-room UPS. This appears to have been completely accidental and brought to light a dependency between the building fire alarms and our power infrastructure that was not widely known. In fact, we were unaware such maintenance was taking place, as until yesterday there was no reason to think our datacenter would be affected. We will work with the building facilities personnel to better understand and document this dependency and prevent it from happening in the future.

Posted by Garhan Attebury

Identified

The UPS power infrastructure at our PKI datacenter is back online at this time, and we are moving through the final steps of recovering services. We currently expect to make all services fully available again by 5pm today (October 28th).

Once services are fully available we will send a final announcement with important details about the affected services and any actions you need to take.

Posted by Garhan Attebury

Identified

Shortly after 9am this morning the emergency cutoff for the whole-room UPS at our PKI datacenter was triggered. It isn't fully clear why this happened, but nearby electrical work on fire alarms seems to be involved. Steps will of course be taken to understand the root cause and prevent it from happening in the future.

At this time the PKI datacenter has power, but in bypass mode, as attempts to restore UPS power to the room failed. We are scheduled to meet with the company supporting this UPS tomorrow around noon to diagnose the problem. Compounding the remaining power problems, some of the supporting compute and network infrastructure for the datacenter has also failed. We are of course working to remedy this as quickly as possible, but at this point in the day it is likely this outage will extend into tomorrow. We also cannot predict whether the UPS maintenance tomorrow will require the room to be powered down again, and it makes little sense to bring everything online only to power it all down hours later.

While we understand this is inconvenient and certainly unplanned, we will as always try to restore services as quickly and reasonably as possible. In total this outage affects the Crane cluster, Anvil cluster, and Attic service. The Rhino cluster located in Lincoln is mostly unaffected, although the /common filesystem became unavailable earlier today, likely causing jobs to halt or fail. If you have workflows running there, it would be advisable to double-check them.

To summarize, the PKI datacenter lost all power (despite full room UPS and backup generators) and may need to lose power again tomorrow. We will provide further details once technicians have diagnosed the issue with the UPS and we have services fully restored.

Posted by Garhan Attebury

Identified

The root cause of this appears to be power related and rather substantial: the whole-room battery backup system for the datacenter is off, implying the building backup generators failed in some way as well.

As this is a significant outage of major cooling and power infrastructure, recovery will certainly take an equally significant amount of time. We will attempt to provide an ETA whenever possible, but at this time it is safe to assume all of the HCC resources hosted at PKI will be inaccessible for the remainder of the day.

Further updates will be provided if they are significant and relevant, and certainly when the systems are all recovered.

Posted by Garhan Attebury

Investigating

The root cause is unknown at this time (there are mentions of power issues with OPPD in Omaha), but HCC's PKI datacenter is currently unreachable. This is a major outage affecting Crane, Attic, Common, and, to an extent, Rhino (with the loss of /common connectivity).

An ETA is unknown at this time, but we will certainly prioritize diagnosing and correcting the issue.

Posted by Garhan Attebury