Maintenance window Wednesday 2017-03-01 -- finished
Maintenance starts at 0900 hours and will probably last all day long. This time, we will:
Install new main ethernet switches, which will affect access to all login nodes in the morning, because UPPMAX will be disconnected from the internet. Installing the new switches will also affect the connection to Pica throughout the day, meaning we will stop the queues on Fysast1, Milou, Rackham, and Tintin.
Upgrade kernel and other system software on all nodes of Bianca, Fysast1, Irma, Milou, Mosler, and Rackham.
Decommission Tintin. UPPMAX will start migrating all active Tintin projects to Rackham. The migration is expected to take up to three days. When it has finished, all Tintin users will be able to log in to Rackham, and their projects can continue their activities there. Home directories will remain on Pica, as they are now, but project directories will move to Crex, a new storage system on Rackham. Note that the job queue on Tintin will be dropped, so any jobs still queued when the maintenance window starts will need to be resubmitted on Rackham.
Update at 1130 hours:
The new ethernet switches have been installed and are now configured. We have also corrected a few errors in our electrical UPS (uninterruptible power supply) system.
Please do not log in yet. Not everything is working as intended yet.
Update at 1540 hours:
We are still configuring our networks and upgrading our clusters.
Update at 1600 hours:
We have finished today's maintenance for Mosler.
Update at 1810 hours:
Bianca and Irma are also back in production.
Update at 1830 hours:
We have finished maintenance on Fysast1 and Milou.
Rackham is still in maintenance. We also have a problem with internet access from compute nodes, which we hope to fix tomorrow, Thursday.
Update at 2030 hours:
Now the maintenance of all compute resources has finished.
We still cannot reach the internet from compute nodes. We will try to fix that problem tomorrow. Please tell us if you notice other problems.
Update Thursday at 1710 hours:
We have not yet been able to give our compute nodes internet access, but will continue tomorrow.
Update Friday at 1100 hours:
Compute nodes now have internet access again, as before the maintenance, so we are closing the maintenance window.
Next maintenance window is planned for Wednesday, April 5th.
Cooling stop at 17.00 hours the 23rd of May
Issues with certain project volumes for Milou/Pica from 2017-05-15 onwards
Some project volumes on Pica are very heavily loaded and slow, or next to unusable for interactive use. We are doing what we can to resolve this but cannot promise a set time for when things will behave normally again.
UPDATE: We have had continuing issues with this because some nodes do not recognize when the storage resources have recovered. We are working on these issues, but they may have caused disturbances such as failed jobs or missing output.
Support may be slow May 11th and 12th due to conference
The UPPMAX system group is hosting the spring 'SONC' conference, where administrators from all SNIC centers meet and discuss how to improve our centers. With many UPPMAX administrators out of office during the conference (Thursday 11th and Friday 12th), support will likely be less responsive.
Slurm disturbance on Milou 2017-05-10
Due to a misconfiguration that was active on a number of nodes around 12AM today, some jobs that were launched on Milou could not start.
If you have jobs that were victims of this, they will likely show up as completed although with a very short run time (a few seconds).
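One way to look for affected jobs is to scan Slurm accounting output for jobs marked as completed after only a few seconds. The following is a minimal sketch, not an official UPPMAX tool. It assumes you have saved the output of something like `sacct --starttime 2017-05-10 --endtime 2017-05-11 --parsable2 --format=JobID,State,Elapsed` (pipe-separated, with a header line). The function names, the 10-second threshold, and the sample data are hypothetical and chosen only for illustration.

```python
from datetime import timedelta


def parse_elapsed(text):
    """Convert a Slurm elapsed string such as 00:00:03 or 1-02:03:04 to a timedelta."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in text.split(":")]
    # Pad on the left so that MM:SS is treated as 00:MM:SS.
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def suspiciously_short_jobs(sacct_output, max_seconds=10):
    """Return IDs of jobs reported as COMPLETED with an elapsed time of at most max_seconds.

    max_seconds is an assumed threshold; adjust it to what "a few seconds" means for your jobs.
    """
    lines = sacct_output.strip().splitlines()
    header = lines[0].split("|")
    suspects = []
    for line in lines[1:]:
        row = dict(zip(header, line.split("|")))
        # Skip job steps such as "100.batch"; only whole jobs are reported.
        if "." in row["JobID"]:
            continue
        elapsed = parse_elapsed(row["Elapsed"]).total_seconds()
        if row["State"] == "COMPLETED" and elapsed <= max_seconds:
            suspects.append(row["JobID"])
    return suspects


if __name__ == "__main__":
    # Hypothetical sample data in the assumed sacct --parsable2 layout.
    sample = (
        "JobID|State|Elapsed\n"
        "100|COMPLETED|00:00:03\n"
        "100.batch|COMPLETED|00:00:03\n"
        "101|COMPLETED|01:12:45\n"
        "102|FAILED|00:00:02\n"
    )
    print(suspiciously_short_jobs(sample))
```

A job flagged this way is only a candidate: some jobs legitimately finish within seconds, so check the job's output files before resubmitting.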
Disturbances in Slurm today Tuesday -- finished
Maintenance window Wednesday 2017-05-03 -- finished
Slurm problems on Rackham -- fixed
Intel license server not responding -- fixed
We have gotten reports that the Intel license server is not responding. We are investigating it. This might manifest itself with hangs or freezes during compilations.
Problem "Invalid account or account/partition..." -- solved
We have identified a problem with the Slurm account database. If you were just added to a project or just created a new one, you might get the following message when scheduling jobs: "Invalid account or account/partition...". The problem primarily affects Rackham and Milou.
Problem with Slurm on Milou -- fixed
Interrupts in Slurm service on Rackham -- fixed
Bianca's storage system Castor has problems -- fixed
Resetting your password from the homepage is not working -- fixed
Resetting your password from this page is currently not working. If you need to reset your password please contact firstname.lastname@example.org
Update 2017-04-18: This issue should now be fixed.
Funk-accounts and new certificates
Some of the shared funk-accounts used on Irma and Milou might stop working due to the IP-address change.
Maintenance window Wednesday 2017-04-05 -- finished
Smog will be decommissioned on Wednesday 5th of April
Smog will be decommissioned on Wednesday 5th of April. As previously mentioned, the SNIC Cloud Team is currently working on bringing up a new cloud to replace Smog and join the other two regions in the SNIC Science Cloud project.
For questions, please contact email@example.com (and not the UPPMAX support queues).
Rackham2, one of Rackham's login nodes, ran into problems -- now fixed
Maintenance window for Bianca Wednesday 2017-03-22 -- finished