UPPMAX shutdown due to cooling failure -- FIXED
The external cooling failed for as yet unknown reasons. All clusters and storage systems were shut down in order to prevent permanent hardware damage.
Please refrain from contacting support to ask for updates. We will update this article when new information becomes available.
Around 19:40 today the alarms about high temperatures in the computer room started to reach UPPMAX staff.
At 20:03 the temperature in the computer room reached critical levels and we were forced to shut down several systems including Irma, Milou and Rackham.
We still have no idea what caused the supply of cooling to the computer room to fail but we will of course investigate this.
We are sorry for the problems this might have caused you and your research but it was necessary to shut down the systems in order to prevent permanent damage to the hardware.
UPDATE WEDNESDAY AT 0815 HOURS
For some reason the main cooling circuit at Ångströmlab had stopped and the two main pumps were not running. They had performed an emergency shutdown due to low pressure in the system.
Bravida and Akademiska hus were on site at approximately 19:30 and they finally got the pumps running again around 23:15.
This morning at 07:50 we began to restart our systems. This will most likely take the whole day and maybe more. We will continue to update this post about our progress.
UPDATE WEDNESDAY AT 1250 HOURS
Please note that any jobs that were still running yesterday evening, when we had to stop all systems, will need to be resubmitted. When you run "finishedjobinfo", they will probably be marked with jobstate=NODE_FAIL. Jobs that started after that might run into strange problems because of bad connections to storage systems. We are sorry about these problems.
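A minimal sketch of how the affected jobs could be picked out of saved "finishedjobinfo" output. It assumes that each job is printed on one line as whitespace-separated key=value fields; only the jobstate field is confirmed above, while the jobid field name and the sample lines are assumptions made for illustration.

```python
def parse_fields(line):
    """Collect key=value tokens from one line of text into a dict."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep and key:
            fields[key] = value
    return fields


def failed_job_ids(lines, state="NODE_FAIL"):
    """Return the job IDs of all lines whose jobstate equals `state`."""
    ids = []
    for line in lines:
        fields = parse_fields(line)
        if fields.get("jobstate") == state and "jobid" in fields:
            ids.append(fields["jobid"])
    return ids


# Hypothetical sample lines; real finishedjobinfo output may look different.
sample = [
    "jobid=1001 jobstate=NODE_FAIL username=alice",
    "jobid=1002 jobstate=COMPLETED username=alice",
    "jobid=1003 jobstate=NODE_FAIL username=alice",
]
print(failed_job_ids(sample))
```

The resulting list is only a starting point for deciding which job scripts to resubmit.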
Jobs that are still waiting in a Slurm queue will probably run without problems when we put the systems back in production.
The cooling medium (water) in the house complex (Ångströmlab), where UPPMAX's computer room is located, is leaking somewhere, but no one knows yet where.
UPPMAX can probably not put the systems in production until that problem is solved, because future repair work might again leave our computer room without cooling. (And any jobs that we allow to run at that time would crash.)
We have decided to use the waiting time to do now what we had planned for the maintenance window on Wednesday next week.
So we are going to upgrade Bianca, Fysast1, Grus, Milou, and Rackham. In return, we no longer plan any maintenance for next week.
We also plan to upgrade Irma, today or tomorrow. That will be a little more difficult, due to current problems with the storage system Lupus.
UPDATE WEDNESDAY AT 1640 HOURS
We have upgraded Fysast1, Milou and Rackham, and you can now log in if you have a project there.
Upgrade of Bianca and Irma will continue tomorrow.
The cooling problem is not solved. Someone will continue to add new water to the system, day and night, but the leak is not found.
UPPMAX anticipates that the future repair will create too much heat in the computer room. We do not want running jobs to crash if we (again) need to stop the compute nodes, so we will not unlock the Slurm queues yet.
Hopefully this will be solved tomorrow, Thursday.
UPDATE THURSDAY AT 1150 HOURS
Akademiska hus (AU), which is responsible for the cooling system of the Ångström laboratory (where our server hall is located), reports that the leak has still not been found. AU is refilling the coolant regularly, and will continue to do so until the leak is found.
We have started the queues on Milou, Rackham and Fysast1 again, but we may be forced to stop the queues and shut down the hall once again, depending on the outcome of the ongoing investigation by AU.
UPDATE MONDAY AT 1615 HOURS
The leak is not yet found, but the system does not leak any more. It looks like it has self-repaired.
Today we have also started Irma and Bianca, and thus everything is back in production.
Issue with 'interactive' and creating slurm.out files
Problem with Slurm on Rackham and Milou
There is currently a problem with the Slurm master node which affects users on Rackham and Milou. We are investigating.
March maintenance day -- UPDATED Thursday 07:00
On Wednesday 7 March, UPPMAX began its monthly service window. Systems and services may become unreachable during the day.
Files and directories may be hidden on Bianca -- SOLVED Wednesday
We have received reports of missing files and directories inside the /proj and /proj/nobackup directories on Bianca. On inspection, the files are actually there but are not shown by the "ls" command. If you are working on Bianca, you should be aware of this: jobs of the type “process all files in directory X and compile the result” might finish without errors but silently skip input files, leading to incorrect results and conclusions.
A workaround was implemented on Wednesday 2018-03-08 that mitigates this issue.
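One way to guard a "process all files in directory X" job against this kind of problem is to compare the directory listing with a list of the input files you expect before processing anything. The sketch below is an illustration only: the function name, the file names and the temporary directory are hypothetical, and it assumes that a hidden file can still be reached when addressed by its full path.

```python
import os
import tempfile


def check_listing(directory, expected_names):
    """Compare a directory listing with the file names we expect to find.

    Returns (hidden, missing): `hidden` names are reachable by path but
    absent from the listing, `missing` names cannot be reached at all.
    """
    listed = set(os.listdir(directory))
    hidden, missing = [], []
    for name in expected_names:
        if name in listed:
            continue
        if os.path.exists(os.path.join(directory, name)):
            hidden.append(name)
        else:
            missing.append(name)
    return hidden, missing


# Demonstration in a temporary directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sample_a.txt", "sample_b.txt"):
        with open(os.path.join(tmp, name), "w") as handle:
            handle.write("data\n")
    expected = ["sample_a.txt", "sample_b.txt", "sample_c.txt"]
    hidden, missing = check_listing(tmp, expected)
    print("hidden:", hidden, "missing:", missing)
```

A job could refuse to start when either list is non-empty, instead of producing a result from incomplete input.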
Configuration problems on Milou and Irma - SOLVED
Slow home directories
Home directories have occasionally been extremely slow today. Nothing seems broken but the system is under a lot of pressure from time to time.
Rackham login issues -- SOLVED
We are currently seeing and receiving reports of login issues on Rackham.
The fat (256GB) Rackham nodes are currently unavailable -- SOLVED
The fat (256GB) Rackham nodes are currently unavailable due to an issue with Slurm. We are investigating this issue.
Rackham's storage system -- MONDAY: Queues released
Due to an issue with the storage system Crex the Slurm queue on Rackham is currently on hold. This is a summary of the problem.
No new jobs on Rackham 2018-02-09 11:15
We are experiencing problems with crex, the file system on Rackham. In order to not put more strain on the filesystems we will not allow new jobs to start at the moment. If you submit jobs they will be held in the queue.
The fysast1 cluster is back online
The Milou cluster is back online
The Rackham cluster is back online
Bianca online again
The Bianca cluster is back online following our service window.
Maintenance window Wednesday 2018-02-07 -- CLOSED
For the February service we will install our new UPS, update Slurm on all clusters, extend the capacity of the storage system Lupus (for Irma), and of course perform the standard kernel and security updates.
The UPPMAX Cloud region will be unavailable Thursday 17:00-20:00 CET
A central switch will be restarted tomorrow Thursday 2018-01-31. The cloud will be temporarily unreachable from the outside, i.e. from the Internet.
Problems with the 'interactive' and Slurm commands on Rackham
The Slurm master on Rackham is currently overloaded and you may experience sluggish Slurm behavior or timeout issues when running commands such as interactive, jobinfo and squeue. We are investigating this issue.
Some project volumes on pica are slow
Some project volumes on pica are slow. This may also affect home directories.
Login issue for new Bianca projects -- FIXED
A network problem has been detected on Bianca, causing logins to fail for a few of the most recent Bianca projects. We are working on fixing the problem, and expect to have Bianca fully working again very soon.
Maintenance window -- COMPLETED
Monthly maintenance window begins at 0900 hours on the first Wednesday of the month. (That is today.)
This time we will:
- Upgrade Slurm, Linux kernel and other system software on Bianca, Dis, Fysast1, Irma, Milou, and Rackham.
- Upgrade Linux kernel and other system software on Castor and Grus.
Bianca and Grus will be unavailable while we service them.
We will restart all login nodes of Fysast1, Irma, Milou and Rackham, probably only once.
Slurm queues on Fysast1, Irma, Milou and Rackham will be stopped, but access to Slurm commands will mostly work during the day.
Slurm queues on Bianca will be stopped and, most of the day, logins to Bianca will not be possible.
We plan to keep you informed about our progress with the maintenance through updates here.
UPDATE 2018-01-10, 16:00
Irma is up and running. Bianca, Milou, Rackham and Fysast1 are still down. We will continue security upgrades tomorrow (Thursday) morning.
UPDATE 2018-01-11, 15:15
Irma, Milou, Rackham and Fysast1 are up and running. Bianca is still being tested. Hopefully Bianca will be back today.
UPDATE 2018-01-11, 16:00
Bianca is now up; however, graphical login is not working right now. Text login works fine (http://www.uppmax.uu.se/support/user-guides/bianca-user-guide/).
We're still working on Dis and expect it to be up by tomorrow.
Extension of lupus
The vendor visited us last week and did the physical installation of the lupus extension. Unfortunately, some parts were not correct and we're currently waiting for replacements, which are expected to arrive this week.
UPPMAX staff back after the holidays
We hope 2018 has been good to you so far! UPPMAX staff is back after the holidays and we're focusing on support tickets that have built up over the holidays.
Reduced staff availability over the coming holidays combined with lots of tickets
Most of our staff is on vacation over the coming holidays. You can contact us using regular channels, but response times for support questions might be longer than normal. We are sorry for the inconvenience.
Most of us will be back in the first week of January.
We also have a lot of tickets about transfers from Milou to Rackham/Bianca, and we think there might be hundreds of last-minute requests in January. Be prepared that the process of getting a transfer project takes some time.
If you want to continue your Milou project, make sure you have applied for a storage project and compute project on Rackham (for non-sensitive data), or a project on Bianca (for sensitive data). http://uppmax.uu.se/support/getting-started/moving-your-research-from-milou-to-rackham/
Creation of new Bianca projects currently on hold -- FIXED
The creation of new Bianca projects is currently on hold. If your project is scheduled to start today you will be unable to log in.
milou2 rebooted on Friday 2017-12-08 at 03:52
milou2 rebooted on Wednesday 2017-12-06 at 03:58
Updates from SUPR are temporarily disabled
We are performing a change in our infrastructure today starting at 13:00. This change will temporarily stop updates from SUPR reaching UPPMAX. If you have for example recently joined or added a member to a project, you will have to wait before the change becomes visible at UPPMAX.
Fix for broken SSH-connections to the UPPMAX Cloud
If you regularly end up with broken SSH-connections ("broken pipe") to your virtual machine in the UPPMAX region, please use the SSH option ServerAliveInterval. See below for an example.
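As an illustration, the sketch below builds an ssh command line that sets ServerAliveInterval through the standard "-o" option of OpenSSH. The interval of 60 seconds and the user and address of the virtual machine are placeholder assumptions, not recommended values.

```python
import shlex


def ssh_command(target, interval_seconds=60):
    """Build an ssh argument list that sends a keep-alive probe
    every `interval_seconds` seconds while the connection is idle."""
    return [
        "ssh",
        "-o", f"ServerAliveInterval={interval_seconds}",
        target,
    ]


# Hypothetical user and address of a virtual machine.
command = ssh_command("ubuntu@203.0.113.10")
print(shlex.join(command))

# To actually connect, the list could be passed to subprocess.run(command).
```

The same option can also be set once per host in ~/.ssh/config, so that it does not need to be repeated on every command line.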
Issue with the Intel License server
At this moment there is an issue with the Intel license server. You will be unable to use the icc compiler and Intel tools until this issue is resolved.
UPPMAX support low on staff Monday 20/11
UPPMAX support will be low on staff on Monday 2017-11-20 due to a conference.
How to get a high job priority on Bianca
Support ticket system temporarily down -- FIXED
Our support email address email@example.com was down for a couple of hours, but is back in service again.
Logging in to Bianca without Rackham
Bianca users outside of SUNET will be unable to log in using rackham.uppmax.uu.se. We have created a temporary workaround.
Rackham unavailable -- SOLVED: Rackham available
2017-11-17 09:35 Rackham is now back in regular service.
Login nodes are now open on Rackham, and jobs are expected to run as usual on Friday morning.
It was decided to temporarily close down the Rackham cluster last Thursday when several disks on Crex reported themselves broken. The problems now seem to be solved, and we're awaiting results from the last tests before Rackham is fully back in service.
UPPMAX power outage -- FIXED
UPPMAX experienced a power outage in the server hall on Tuesday.
Problems with /sw on Bianca (now fixed)
The /sw part of Bianca was lost around 07:30 this morning due to an issue with the storage system. This may have caused failed jobs. The system was fixed at 08:40.
Quick upgrade of Slurm 2017-11-02 -- COMPLETED
Maintenance window Wednesday 2017-11-01 -- COMPLETED
Monthly maintenance window begins at 09:00 hours on the first Wednesday of the month. (That is today.)
Issues with /sw/data during the weekend
/sw/data from pica may have been unavailable for some jobs during the weekend, and some jobs may have failed because of this.
UPPMAX support system is down -- SOLVED
RT, the support system used by UPPMAX and the rest of SNIC, is down.
It is located at NSC at Linköping University, and the whole university has network problems.
This will delay all email to and from firstname.lastname@example.org until the network problem is fixed, so answers to your support tickets will be delayed.
We now have contact with our support system and emails to email@example.com are reaching us again.
Slow home directories
Someone seems to be running something very I/O-heavy from the home directories. We are looking for these jobs and will terminate them if found, but it's less than certain that we'll find them.
We found the guilty jobs and are terminating them, and we have notified the user not to do that again.
Accident on Irma caused jobs to fail with status NODE_FAIL
We are sorry to inform you that today at 17:02:37 a human error caused the compute nodes on Irma to reboot. The running jobs were cancelled and will show up with status NODE_FAIL. The accident occurred while we were investigating an issue with the storage network. We are very sorry about this.
UPPMAX shutdown due to cooling failure -- FIXED
lupus failover issue -- FIXED
Maintenance indication in output from command jobinfo
UPPMAX made a small change in "jobinfo" output.
In the REASON column for waiting jobs, "(Maintenance)" is shown for jobs that can not start before the next maintenance reservation.
Please note that maintenance reservations are often moved forward to the next month before the actual maintenance window.
Many Irma compute nodes lost electric power -- FIXED
Three racks of Irma's compute nodes lost power because an automatic fuse tripped.
Some jobs were lost due to this. We are very sorry about that. Please rerun those jobs that were affected.
It looks like nodes i[167-250] were affected.
So what was the reason? It looks like an ethernet switch died, possibly short-circuited, so the automatic fuse tripped, taking more switches and the compute nodes down with it.
We have reported the error to our support vendor. Until the bad ethernet switch has been repaired or replaced, Irma runs with fewer compute nodes.
Update at 0950 hours
Now only nodes i[179-226] are down.
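The node ranges above use the bracket notation familiar from Slurm. A minimal sketch of how such a range could be expanded to check whether a particular node was affected is shown below; it handles only the simple "prefix[first-last]" form used here, not the full Slurm hostlist syntax (on a cluster, "scontrol show hostnames" is the usual tool for that).

```python
import re


def expand_range(expr):
    """Expand a simple expression such as 'i[167-250]' into node names."""
    match = re.fullmatch(r"([A-Za-z]+)\[(\d+)-(\d+)\]", expr)
    if match is None:
        raise ValueError(f"unsupported node expression: {expr!r}")
    prefix = match.group(1)
    first, last = int(match.group(2)), int(match.group(3))
    return [f"{prefix}{number}" for number in range(first, last + 1)]


initially_down = expand_range("i[167-250]")
still_down = set(expand_range("i[179-226]"))
recovered = [node for node in initially_down if node not in still_down]

print(len(initially_down), "nodes initially affected")
print(len(still_down), "nodes still down")
print("i170 recovered:", "i170" in recovered)
```

Comparing the node list of a failed job with the first range shows whether the job was likely hit by the power loss.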
Maintenance window Wednesday 2017-09-06 -- FINISHED
milou2 rebooted August 28
milou2 rebooted Monday 2017-08-28 at 19:51.
Replacing (nearly) all disks on Irma's compute nodes -- DONE