Maintenance window Wednesday 2017-05-03 -- finished
Monthly maintenance window begins at 0900 hours on the first Wednesday of the month. (That is today.)
This time we will:
- Upgrade Slurm, Linux kernel and other system software on Bianca, Fysast1, Irma, Milou, and Rackham.
- Upgrade Linux kernel and other system software on Castor and Grus.
- Physically move one of the OpenStack server machines of Bianca from one chassis to another.
Bianca and Grus will be unavailable while we service them.
We will restart all login nodes of Fysast1, Irma, Milou and Rackham, probably only once.
Slurm jobs on Fysast1, Irma, Milou and Rackham will continue to run, but Slurm commands will be unavailable at times during the day.
Slurm queues on Bianca will be stopped and, most of the day, logins to Bianca will not be possible.
We plan to keep you informed about our progress with the maintenance through updates here.
Update at 1210 hours
Parts of Bianca and Castor have been updated.
We have run into some unexpected problems with the new Slurm version. The first machine we are testing it on is Irma, so Slurm is unavailable on Irma. We are sorry about that.
Update at 1605 hours
We are now giving up on the new Slurm version and going back to the old one.
Update at 1730 hours
We have changed back to the Slurm version of yesterday.
Some login nodes have not yet been restarted, but will be soon.
Service of Bianca continues tomorrow. Milou-f will be restarted this evening or tomorrow.
Update Thursday at 0845 hours
We are soon restarting the login node of Fysast1.
Maintenance of Bianca continues today. We are trying to improve the compute nodes of the project clusters.
Irma, Rackham, and the UPPNEX part of Milou are back in production. Compute nodes will upgrade themselves automatically, so the waiting time in Slurm queues will be longer than normal today.
Update Thursday at 1545 hours
We have lost part of the connection to compute nodes of Fysast1, and are busy trying to get it back.
Maintenance on Bianca has finished and we will soon allow new logins.
Update Thursday at 1600 hours
Bianca is back in production.
Update Friday at 0920 hours
Now most compute nodes of Fysast1 are available. We will probably soon close the maintenance window.
Update Friday at 1135 hours
The connection to compute nodes of Fysast1 is fully recovered. We have now finished maintenance.
Next maintenance day is June 7th.
Cooling stop at 17.00 hours the 23rd of May -- CANCELLED
Issues with certain project volumes for milou/pica 20170515 and onwards.
Some project volumes on pica are very heavily loaded and slow, next to unusable for interactive use. We are doing what we can to resolve this but cannot promise any set time for when things will behave as normal again.
UPDATE: We have had some continuing issues with this because some nodes do not notice when resources start behaving better. We are working on these issues, but they may have caused disturbances such as failed jobs or missing output.
Support may be slow May 11th and 12th due to conference
The UPPMAX system group hosts the spring 'SONC' conference, where administrators from all SNIC centers meet and discuss how to improve our centers. With many UPPMAX administrators out of office during the conference (Thursday 11th and Friday 12th), support will likely be less responsive.
Slurm disturbance on Milou 2017-05-10
Due to a misconfiguration that was active on a number of nodes around 12 AM today, some jobs that were launched on Milou could not start.
If you have jobs that were victims of this, they will likely show up as completed although with a very short run time (a few seconds).
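If you want to check your own jobs for this, the following is a minimal sketch, not an official UPPMAX tool. It assumes that Python 3.7 or later and the standard Slurm `sacct` command are available on the login node, and it treats "very short" as at most 10 seconds, which is our assumption for illustration. The function names and the date range are likewise only illustrative.

```python
import subprocess


def elapsed_to_seconds(elapsed):
    """Convert a Slurm elapsed string ([DD-]HH:MM:SS) to seconds."""
    days = 0
    if "-" in elapsed:
        day_part, elapsed = elapsed.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in elapsed.split(":")]
    while len(parts) < 3:          # pad MM:SS to HH:MM:SS
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def suspicious_jobs(sacct_text, max_seconds=10):
    """Return (job_id, name, elapsed) for COMPLETED jobs that ran
    for at most max_seconds (the threshold is an assumption)."""
    hits = []
    for line in sacct_text.splitlines():
        fields = line.strip().split("|")
        if len(fields) != 4:       # skip blank or unexpected lines
            continue
        job_id, name, state, elapsed = fields
        if state == "COMPLETED" and elapsed_to_seconds(elapsed) <= max_seconds:
            hits.append((job_id, name, elapsed))
    return hits


def query_sacct(start="2017-05-10", end="2017-05-11"):
    """Ask sacct for your own job allocations in the given period."""
    cmd = ["sacct", "-X", "--noheader", "--parsable2",
           "--starttime", start, "--endtime", end,
           "--format=JobID,JobName,State,Elapsed"]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


# On a login node you could run:
#     for job in suspicious_jobs(query_sacct()):
#         print(*job)
```

A job listed here is not necessarily a victim of the misconfiguration, since some jobs really do finish in a few seconds, so check the job output before resubmitting.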
Disturbances in Slurm today Tuesday -- finished
Maintenance window Wednesday 2017-05-03 -- finished
Slurm problems on Rackham -- fixed
Intel license server not responding -- fixed
We have gotten reports that the Intel license server is not responding. We are investigating it. This might manifest itself with hangs or freezes during compilations.
Problem "Invalid account or account/partition..." -- solved
We have identified a problem with the Slurm account database. If you were just added to a project, or just created a new one, you might get the following message when scheduling jobs: "Invalid account or account/partition...". It primarily affects Rackham and Milou.
Problem with Slurm on Milou -- fixed
Interrupts in Slurm service on Rackham -- fixed
Bianca's storage system Castor has problems -- fixed
Resetting your password from the homepage is not working -- fixed
Resetting your password from this page is currently not working. If you need to reset your password please contact firstname.lastname@example.org
Update 2017-04-18: This issue should now be fixed.
Funk-accounts and new certificates
Some of the shared funk-accounts used on Irma and Milou might stop working due to the IP-address change.
Maintenance window Wednesday 2017-04-05 -- finished
Smog will be decommissioned on Wednesday 5th of April
Smog will be decommissioned on Wednesday 5th of April. As previously mentioned, the SNIC Cloud Team is currently working on bringing up a new cloud to replace Smog and join the other two regions in the SNIC Science Cloud project.
For questions, please contact email@example.com (and not the UPPMAX support queues).
Rackham2, one of Rackham's login nodes, ran into problems -- now fixed
Maintenance window for Bianca Wednesday 2017-03-22 -- finished