Uppsala Multidisciplinary Center for Advanced Computational Science

Rackham's storage system -- MONDAY: Queues released

2018-02-12

Due to an issue with the storage system Crex, the Slurm queue on Rackham is currently on hold. This is a summary of the problem.

Friday 2018-02-09

As planned, Crex was powered off and physically moved to make room for the Rackham extension. Once Crex was started again, multiple disks appeared as faulty. Upon examination, we discovered that all of the faulty disks belong to the same enclosure (one of five large boxes of disks).

This is exactly the same problem we saw a few months ago, the last time we powered off Crex. It was then believed to have been solved by a software update from the manufacturer. However, as the same problem has arisen again, it was clearly not solved. It is very improbable that multiple disks break at the same time, and even less probable that they would all belong to the same enclosure.
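As a rough illustration of why independent disk failures are an unlikely explanation, the sketch below computes the chance that every failed disk lands in a single enclosure. It assumes that each failure is independent and equally likely to occur in any of the five enclosures. The function name and the failure count of four are hypothetical, since this summary only says "multiple" disks.

```python
from fractions import Fraction


def same_enclosure_probability(failed_disks: int, enclosures: int = 5) -> Fraction:
    """Chance that all failed disks share one enclosure.

    Assumption (not from the vendor): each failure independently lands in a
    uniformly random enclosure.
    """
    if failed_disks < 1 or enclosures < 1:
        raise ValueError("failed_disks and enclosures must be positive")
    # The first failed disk may be anywhere; every further one must match it.
    return Fraction(1, enclosures) ** (failed_disks - 1)


# Hypothetical example: four failed disks spread over five enclosures.
print(same_enclosure_probability(4))  # 1/125
```

Under these assumptions the probability shrinks by a factor of five for every additional failed disk, which is why a fault shared by the whole enclosure is the more plausible explanation.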

As the manufacturer will likely require us to power down Crex (or parts of it), we have placed a service reservation that prevents new jobs from starting. You are still able to submit jobs and read your files.

We have reopened and escalated our support case with the manufacturer.

Monday 2018-02-12

We are working closely with the manufacturer to fix this problem.

Tuesday 2018-02-13

According to the manufacturer, the issue appears to be software-related. We are preparing for a firmware update, which involves rebuilding the storage pools containing the broken disks. This is a time-consuming operation, however, and Rackham will most likely not be online today.

Wednesday 2018-02-14

Storage pools still rebuilding.

Friday 2018-02-16

We have now received a plan from the vendor for how to solve the issue. It involves upgrades and a shutdown of Crex.

Running jobs will be cancelled. Jobs in the queue will be kept. The primary plan is to keep the login nodes open, but without Crex you will not be able to access files stored on Crex (such as project directories), use the module system, or run several commands. Home directories will remain available.

Firmware upgrades and restarts are planned to be completed during the day. If it works well, we will run tests over the weekend. If everything is OK on Monday, we hope to get Rackham up and running.

21:30 update

The upgrades were successful. Crex was shut off and brought back up. However, after only a few hours, disks again reported failures. We continue to work on the problem together with the manufacturer.

Monday 2018-02-19

The firmware of the enclosure was updated and the disks rebuilt successfully. No more disks are reported as broken. We are staying in contact with the manufacturer, but have decided to release the queues on Rackham.
