Affected services:

  • Crane

Crane: /work filesystem unplanned downtime

Opened on Monday 5th August 2019

Resolved

The final filesystem checks on Crane's /work have completed and all data appears to be intact. Crane and JupyterHub are now available for use. While this outage was unexpected, it was necessary to ensure the integrity of data on the /work filesystem.

Jobs that were running or queued at the start of the outage should have been restarted or remain in the queue. We nevertheless encourage all users to double-check the status of their jobs.

If you encounter any issues with the resources, please contact us at hcc-support@unl.edu or visit one of our office locations (https://hcc.unl.edu/location).

Posted by Garhan Attebury

Identified

Initial checks on the servers backing Crane's /work filesystem have completed and look good. Due to the nature of the issues, and to ensure data consistency, we are performing a final Lustre filesystem check before bringing Crane back online. This check is expected to run overnight.

We do not believe any data loss has occurred as a result of this outage, but please remember that /work is NOT BACKED UP and is essentially a scratch filesystem intended for running jobs.

Barring unforeseen circumstances, a final announcement will be made tomorrow (August 7th) when Crane is fully back online.

Posted by Garhan Attebury

Identified

Due to the issue with the Crane /work filesystem, JupyterHub and Sandstone services are not accessible. Access to these services will be restored once the Crane issue has been resolved.

Posted

Identified

Crane's /work filesystem requires more exhaustive repairs than are possible while the filesystem is online. All Crane servers, including login and worker nodes, will require a reboot as part of this process, and all currently running jobs will be killed. No new logins or jobs will be allowed until this issue is resolved, which is expected to be tomorrow (August 6th) at the earliest.

The Crane cluster is now fully down while the filesystem issues are worked on. Additional updates will be made once the cluster is back online.

Posted by Garhan Attebury

Identified

Crane's /work Lustre filesystem is experiencing problems similar to those seen this past weekend. Accessibility and performance are impacted. The problem is being worked on, and updates will be posted as we know more.

Posted by Garhan Attebury