Maintenance window Wednesday 2017-05-03 -- finished
Monthly maintenance window begins at 0900 hours on the first Wednesday of the month. (That is today.)
This time we will:
- Upgrade Slurm, Linux kernel and other system software on Bianca, Fysast1, Irma, Milou, and Rackham.
- Upgrade Linux kernel and other system software on Castor and Grus.
- Physically move one of the OpenStack server machines of Bianca from one chassis to another.
Bianca and Grus will be unavailable while we service them.
We will restart all login nodes of Fysast1, Irma, Milou and Rackham, probably only once.
Slurm jobs on Fysast1, Irma, Milou and Rackham will continue to run, but Slurm commands will be unavailable at times during the day.
Slurm queues on Bianca will be stopped and, most of the day, logins to Bianca will not be possible.
We plan to keep you informed about our progress with the maintenance through updates here.
Update at 1210 hours
Parts of Bianca and Castor have been updated.
We have some unexpected problems with the new Slurm version. The first machine we are testing it on is Irma, so Slurm is unavailable on Irma. We are sorry about that.
Update at 1605 hours
We are now giving up on the new Slurm version and going back to the old one.
Update at 1730 hours
We have changed back to the Slurm version of yesterday.
Some login nodes have not yet been restarted, but will be soon.
Service of Bianca continues tomorrow. Restart of Milou-f will be done tomorrow, or this evening.
Update Thursday at 0845 hours
We are soon restarting the login node of Fysast1.
Maintenance of Bianca continues today. We are trying to improve the compute nodes of the project clusters.
Irma, Rackham, and the UPPNEX part of Milou are back in production. Compute nodes will upgrade themselves automatically, so the waiting time in Slurm queues will be longer than normal today.
Update Thursday at 1545 hours
We have lost part of the connection to compute nodes of Fysast1, and are busy trying to get it back.
Maintenance on Bianca has finished and we will soon allow new logins.
Update Thursday at 1600 hours
Bianca is back in production.
Update Friday at 0920 hours
Now most compute nodes of Fysast1 are available. We will probably soon close the maintenance window.
Update Friday at 1135 hours
The connection to compute nodes of Fysast1 is fully recovered. We have now finished maintenance.
Next maintenance day is June 7th.
Maintenance window Wednesday 2017-09-06 -- FINISHED
milou2 rebooted August 28
milou2 rebooted Monday 2017-08-28 at 19:51.
Replacing (nearly) all disks on Irma's compute nodes
We're restarting irma-q for technical reasons. The slurm queue system may be unavailable for submitting/verifying job status for a few minutes.
milou2 rebooted August 19
milou2 rebooted on Saturday 2017-08-19.
Bianca's storage system Castor had a hiccup yesterday Thursday -- FIXED
Maintenance window Wednesday 2017-08-02 -- FINISHED
Unexpected reboot of Pica on Monday morning.
Restart of two Milou login servers today Thursday
Lower service level during UPPMAX holidays
Part of storage system Pica is still very slow
Pica was partly restarted just now, please look for problems in your job output
UPPMAX had to restart part of storage system Pica, because it worked too slowly with nearly no read/write traffic.
The restart was done a little after 1300 hours.
For Rackham users, this meant that you might have had problems with reading and writing to your home directory.
For Milou users, this meant that you also might have had problems with reading and writing to your home directory. In addition, reading from /sw (where the modules live) and reading and writing to some project directories were affected on Milou.
Please check one extra time for problems in the output of jobs that were running at this time.
We are sorry for the inconvenience.
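One way to do that check is sketched below in Python. It is only an illustration under assumptions: it assumes that your jobs write to Slurm's default output files (slurm-<jobid>.out) in the current directory, and the list of error strings is our own guess at typical file system error messages, not a complete or official list. The function name find_suspect_lines is hypothetical.

```python
from pathlib import Path

# Assumed error strings: common Linux file system messages, not an official list.
SUSPECT_PATTERNS = ("Input/output error", "Stale file handle")


def find_suspect_lines(directory, glob="slurm-*.out"):
    """Return (file name, line number, text) for each line containing a suspect pattern."""
    hits = []
    for path in sorted(Path(directory).glob(glob)):
        # errors="replace" keeps the scan going if a file contains odd bytes.
        with path.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if any(pattern in line for pattern in SUSPECT_PATTERNS):
                    hits.append((path.name, number, line.rstrip()))
    return hits


if __name__ == "__main__":
    for name, number, text in find_suspect_lines("."):
        print(f"{name}:{number}: {text}")
```

An empty result does not prove that a job was unaffected; it only means that none of the assumed strings appeared in the output files.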
On Milou and Rackham, very difficult to log in or otherwise use /home directories -- FIXED
UPPMAX has a problem with extremely slow access to /sw (where e.g. modules live) and home directories on Milou, and to home directories on Rackham.
Because of that, it is very difficult to log in to Milou and Rackham.
We will investigate the source of this problem, and will report any success as updates here.
Update at 1310 hours
We restarted part of Pica, and that solved the problem.
Hopefully your jobs will continue without problems, but please be careful and check one extra time for errors in your job output.
SUPR and C3SE website down
The SUPR and C3SE websites are down at the moment, which prevents you from using SUPR. Please try again later.
No maintenance planned for today's maintenance window
First (non-holiday) Wednesday of each month is UPPMAX's normal, planned maintenance window.
But today we will do no maintenance.
Next maintenance window is 2nd of August.
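For planning around these windows, the date of the first Wednesday of a month can be computed with a few lines of Python. This is a minimal sketch: the function name first_wednesday is our own, and it ignores the "non-holiday" condition, so the window can fall on another day when that Wednesday is a holiday.

```python
from datetime import date, timedelta

WEDNESDAY = 2  # date.weekday() counts Monday as 0


def first_wednesday(year, month):
    """Return the date of the first Wednesday of the given month (holidays ignored)."""
    first = date(year, month, 1)
    # Days to step forward from the 1st to reach a Wednesday (0 to 6).
    offset = (WEDNESDAY - first.weekday()) % 7
    return first + timedelta(days=offset)


print(first_wednesday(2017, 8))  # 2017-08-02
```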
Restart of login server milou-f Tuesday morning -- FINISHED
File system mounts of Pica volumes were not working correctly.
This was fixed by a restart of the server. Now it works much better.
We are sorry about any inconvenience for you due to this.
Lost contact with Milou nodes m[1-48] for an hour this morning -- FIXED
From approximately 0800 hours to 0910 hours this morning, an ethernet switch in Milou lost power, making 48 nodes unavailable.
Two jobs got NODE_FAIL when trying to start, and interactive work on these nodes was denied. Otherwise, we seem to have had no problems with the temporary network loss.
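If you want to check whether one of your jobs ran on an affected node, the bracket notation m[1-48] can be expanded into individual node names. The Python sketch below handles only this simple "prefix[start-end]" form; the function name expand_nodes is hypothetical, and full Slurm host lists (comma-separated ranges, zero padding) are not covered. On a cluster, `scontrol show hostnames` performs the full expansion.

```python
import re


def expand_nodes(expr):
    """Expand a simple 'prefix[start-end]' expression into a list of node names."""
    match = re.fullmatch(r"([A-Za-z_-]+)\[(\d+)-(\d+)\]", expr)
    if match is None:
        raise ValueError(f"unsupported node expression: {expr!r}")
    prefix = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    return [f"{prefix}{number}" for number in range(start, end + 1)]


nodes = expand_nodes("m[1-48]")
print(len(nodes), nodes[0], nodes[-1])  # 48 m1 m48
```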
Singularity is available
Urgent kernel upgrade -- FINISHED
Today we are performing an urgent kernel upgrade on Milou, Fysast1, Rackham, Irma, and Bianca. Login nodes will be restarted during the day. No running jobs or queues will be stopped. We will report on the progress here in System News during the day.
UPDATE 16:00 - Update completed.
Intelmpi performance issues
Bianca graphical login now working
It uses ThinLinc Web Access, not X forwarding.
Bianca's storage system Castor has problems -- FIXED
Maintenance window Wednesday 2017-06-07 -- FINISHED