Affected services:

  • Crane
  • Crane Open OnDemand

CRANE downtime postponed; cluster remains operational for the time being

Scheduled for Monday 25th January 2021 at 08:00 (Central Time (US & Canada))

Schedule/description of work

The downtime of CRANE scheduled for January 25th through February 1st will not take place as planned. During final testing of software combinations used by NU researchers, an issue was discovered that prevents the use of MPI codes on the original 2013-era Crane hardware. This hardware uses an Intel TrueScale InfiniBand fabric which, while excellent at the time of purchase, has been unsupported by Intel since 2017 and is falling out of relevance in modern operating systems as the industry moves forward. The original Crane hardware still makes up a significant portion of the tightly coupled resources available to those using MPI codes, making this an issue we could not ignore.

Fortunately, we believe we have a workaround which will allow us to continue our migration to the EL8 operating system while avoiding any performance/stability issues with this particular combination of MPI code and hardware. Unfortunately, it took a significant amount of time and effort to diagnose the issue and discover this workaround, and we will need additional time to ensure NU researchers will not be impacted by this workaround.

We will announce a new downtime schedule as soon as we are able, and we will make every effort to limit the impact and duration of the downtime.

Scheduled start time
January 25, 2021 08:00
Duration
7 days
Status
Finished

Updates

The upcoming downtime of the CRANE cluster previously scheduled for the week of January 4th through January 11th has been rescheduled to the week of January 25th through February 1st. This rescheduling is necessary to allow additional time for preparation and testing as we work through some unexpected compatibility issues with the newer, better-supported software stacks. While we understand that this rescheduling, and the downtime in general, may be inconvenient, it will help us minimize the amount of time CRANE is unavailable for use and hopefully result in fewer unexpected issues once the system is back online.

As a reminder, this is an extended downtime involving a major upgrade of the operating system used on CRANE and its associated components. It is necessary because the existing system has reached the end of its supported lifecycle. The original message with corrected dates can be viewed at https://status.hcc.unl.edu/maintenance/11ce5b0f-3fca-450c-93bf-b88cc0d2ccb4

Posted by Garhan Attebury