Maintenance window Wednesday 2017-04-05 -- FINISHED
In addition to software upgrades, today we'll also switch to a new, faster internet connection. Here is our plan in a bit more detail:
- Upgrade kernel, upgrade Slurm to version 17.02.1, and upgrade other system software on all nodes of Fysast1, Irma, Milou, and Rackham.
- Switch to a new redundant 100 Gbit/s connection to the internet and close down the old connection. Because of this, UPPMAX will change all its public IP addresses. All systems will be unreachable from the internet during the change.
- Decommission Smog.
Slurm queues on Milou, Fysast1, and Rackham will be stopped. Queues on Irma and Bianca will not be affected.
Note that these changes might take longer than one day. We will keep you informed about the progress here in our system news.
Update at 0900 hours
Started maintenance window.
Update at 1130 hours
We have switched to our new internet connection, and have upgraded all kernels and other system software.
Now we are changing IP addresses, and making everything work again, which is a chore with many details to consider.
Update at 1450 hours
We are still busy trying to rebuild all IP address relationships between our servers. Nothing seems to have broken yet, though.
Our guess is that UPPMAX's systems will not be available again today, but sometime tomorrow.
Update at 1615 hours
The maintenance continues tomorrow.
Update Thursday at 1020 hours
Bianca, Fysast1, Irma, and Milou are available again. Please tell us if you notice problems.
Maintenance on Rackham will probably finish during the afternoon.
Update Thursday at 1740 hours
The major change of IP addresses caused a lot of problems, which we are busy fixing.
Sorry, but we were not able to fix Slurm on Rackham today. We will continue tomorrow, Friday.
Update Friday at 1200 hours
Rackham is available again. Please tell us if you notice any problems.
Update Friday at 1520 hours
We have finished our extra long maintenance for April. Next maintenance is Wednesday, May 3rd.
Unexpected reboot of Pica on Monday morning
Restart of two Milou login servers today, Thursday
Lower service level during UPPMAX holidays
Part of storage system Pica is still very slow
Pica was partly restarted just now; please look for problems in your job output
UPPMAX had to restart part of storage system Pica, because it was working too slowly, with almost no read/write traffic.
The restart was done a little after 1300 hours.
For Rackham users, this meant that you might have had problems with reading from and writing to your home directory.
Milou users might have had the same problems with their home directories. In addition, on Milou, reading from /sw (where the modules live) and reading from and writing to some project directories were affected.
Please take an extra look for problems in the output of jobs that were running at that time.
We are sorry for the inconvenience.
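One way to take that extra look is to scan job output files for typical file-system error messages. The following is only a minimal sketch under stated assumptions: it assumes your jobs write to Slurm's default `slurm-<jobid>.out` files in the current directory, and the list of error strings is an illustrative guess at what storage trouble might leave behind, not a complete list. The function names are hypothetical.

```python
from pathlib import Path

# Assumed error strings: standard Linux error messages that often follow
# storage trouble. Extend the list to suit your own software.
SUSPECT_PATTERNS = (
    "Input/output error",
    "Stale file handle",
    "No such file or directory",
    "Permission denied",
)


def find_suspect_lines(text, patterns=SUSPECT_PATTERNS):
    """Return (line_number, line) pairs for lines containing any pattern."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if any(pattern in line for pattern in patterns):
            hits.append((number, line.strip()))
    return hits


def scan_directory(directory=".", glob="slurm-*.out"):
    """Scan job output files in a directory; return {filename: hits}."""
    report = {}
    for path in sorted(Path(directory).glob(glob)):
        hits = find_suspect_lines(path.read_text(errors="replace"))
        if hits:
            report[path.name] = hits
    return report


if __name__ == "__main__":
    for name, hits in scan_directory().items():
        for number, line in hits:
            print(f"{name}:{number}: {line}")
```

A clean scan does not prove that a job was unaffected, so results from jobs that ran during the restart still deserve a manual check.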
On Milou and Rackham, very difficult to log in or otherwise use /home directories -- FIXED
UPPMAX has a problem with extremely slow access to /sw (where e.g. modules live) and home directories on Milou, and to home directories on Rackham.
Because of that, it is very difficult to log in to Milou and Rackham.
We will investigate the source of this problem, and will report any success as updates here.
Update at 1310 hours
We restarted part of Pica, and that solved the problem.
Hopefully your jobs will continue without problems, but please be careful and take an extra look for errors in your job output.
SUPR and C3SE website down
The SUPR and C3SE websites are down at the moment, which prevents you from using SUPR. Please try again later.
No maintenance planned for today's maintenance window
The first (non-holiday) Wednesday of each month is UPPMAX's normal, planned maintenance window.
But today we will do no maintenance.
The next maintenance window is on August 2nd.
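For planning jobs around these windows, the first-Wednesday rule can be computed directly. This is a minimal sketch: the function name `first_wednesday` is hypothetical, and it does not handle the holiday exception, which has to be checked by hand.

```python
import calendar
from datetime import date


def first_wednesday(year, month):
    """Return the date of the first Wednesday in the given month."""
    first_day = date(year, month, 1)
    # date.weekday() counts Monday as 0, so Wednesday is 2.
    offset = (calendar.WEDNESDAY - first_day.weekday()) % 7
    return first_day.replace(day=1 + offset)


if __name__ == "__main__":
    # Months of 2017 whose maintenance windows are mentioned in these news items.
    for month in (4, 5, 6, 8):
        print(first_wednesday(2017, month))
```

For 2017 this gives April 5, May 3, June 7, and August 2, which matches the dates announced in these news items.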
Restart of login server milou-f Tuesday morning -- FINISHED
File system mounts of Pica volumes were not working correctly.
This was fixed by a restart of the server. Now it works much better.
We are sorry for any inconvenience this caused you.
Lost contact with Milou nodes m[1-48] for an hour this morning -- FIXED
From approximately 0800 hours to 0910 hours this morning, an Ethernet switch in Milou lost power, making 48 nodes unavailable.
Two jobs got NODE_FAIL when trying to start, and interactive work on these nodes was denied. Otherwise, we seem to have had no problems with the temporary network loss.
Singularity is available
Urgent kernel upgrade -- FINISHED
Today we are performing an urgent kernel upgrade on Milou, Fysast1, Rackham, Irma, and Bianca. Login nodes will be restarted during the day. No running jobs or queues will be stopped. We will report on the progress here in System News during the day.
UPDATE 16:00 - Update completed.
Intelmpi performance issues
Bianca graphical login now working
It uses ThinLinc Web Access, not X forwarding.
Bianca's storage system Castor has problems -- FIXED
Maintenance window Wednesday 2017-06-07 -- FINISHED