Uppsala Multidisciplinary Center for Advanced Computational Science

Maintenance window Wednesday 2016-09-07 -- NOW FINISHED

2016-08-31

As usual, the monthly maintenance window begins at 0900 hours on the first Wednesday of the month.
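
If you want to plan your jobs around future maintenance windows, the "first Wednesday of the month" rule can be computed as in the small sketch below. It is only an illustration: the function name first_wednesday is our own invention, and the sketch assumes that the rule holds without exceptions, which may not be true for every month.

```python
from datetime import date, timedelta

WEDNESDAY = 2  # date.weekday() counts Monday as 0, so Wednesday is 2


def first_wednesday(year: int, month: int) -> date:
    """Return the date of the first Wednesday in the given month."""
    first_day = date(year, month, 1)
    # Number of days to step forward from the 1st to reach a Wednesday (0 to 6).
    offset = (WEDNESDAY - first_day.weekday()) % 7
    return first_day + timedelta(days=offset)


# The window announced here: the first Wednesday of September 2016.
print(first_wednesday(2016, 9))  # 2016-09-07
```

The window itself begins at 0900 hours on the date returned.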

This time we will:

  • Upgrade kernel and other system software on all nodes of clusters Fysast1, Irma, Milou, and Tintin.
  • Do a major upgrade of Slurm on clusters Fysast1, Irma, Milou, and Tintin.
  • Run benchmarks on clusters Irma, Milou, and Tintin.
  • Replace most cables of storage system Lupus, belonging to cluster Irma, for security reasons.

There is a risk of perhaps 10% that running jobs on clusters Fysast1, Milou, and Tintin will be lost during the Slurm upgrade. If you want to eliminate that risk completely, please plan so that you have no jobs running on these clusters during the maintenance.

Cluster Irma will be totally unavailable during the cable work on Lupus. The cables are being replaced so that we can fit lockable doors on the storage system.

Slurm will be unavailable on clusters Fysast1, Milou, and Tintin during the Slurm upgrade, but files will be available (and jobs will probably keep running, as mentioned above). Login nodes will be rebooted once.

During the maintenance day, we will keep you informed about our progress by updating this text.

Update at 1100 hours

We are restarting the login servers on Tintin and Milou. We have also started replacing the cables of Lupus.

Update at 1300 hours

Login servers on Milou and Tintin have restarted successfully. The cables on Lupus have been replaced and Irma is back in production.

In the afternoon we will upgrade Slurm, which will affect the functionality of Slurm for approximately 1.5 hours. As before, we believe currently running jobs will not be affected.

Update at 1545 hours

Benchmarking has finished with good results on Irma. Slurm has been upgraded on Irma without affecting running jobs. We need to restart the login nodes once more, today or tomorrow; apart from that, the service work on Irma is finished. For Milou and Tintin we will now upgrade Slurm and restart their login nodes.

Update at 1730 hours

Slurm was upgraded smoothly on all clusters. All login nodes have been restarted and tested successfully. Our maintenance window has now finished!

Update Thursday at 0830 hours

Some problems remained. We are sorry about that and are now fixing them.

The Slurm programs on the login nodes of Fysast1, Milou, and Tintin were not updated, which sometimes gave strange output, e.g. in the first message shown to you when you log in. That is now fixed.

A lot of compute nodes on Fysast1, Milou, and Tintin were restarted late yesterday afternoon and did not automatically go into production, as we had expected them to. We are now putting them into production. Most of them are already in production; the remaining nodes should be in production within an hour from now.

Update Thursday at 0850 hours

The new Slurm version gave us a new problem, making it more difficult to put the compute nodes into production.

We are currently considering what to do. We may temporarily go back to the old Slurm version on the compute nodes.

Before 1000 hours we plan to get back to you with a new update.

Update Thursday at 1000 hours

We think that we have solved the problem now. More and more compute nodes will gradually go into production over the next two hours.

Please report any other problems that you notice.
