Slow home directories
Someone seems to be running something very I/O-heavy from the home directories. We are looking for these jobs and will terminate them if found, but it's less than certain that we'll find them.
We found the guilty jobs, are terminating them, and have notified the user not to do that again.
Accident on Irma caused jobs to fail with status NODE_FAIL
We regret to inform you that today at 17:02:37 a human error caused the compute nodes on Irma to reboot. The running jobs were cancelled and will show up with status NODE_FAIL. The accident occurred while we were investigating an issue with the storage network. We are very sorry about this.
UPPMAX shutdown due to cooling failure -- FIXED
lupus failover issue -- FIXED
Maintenance indication in output from command jobinfo
UPPMAX made a small change in "jobinfo" output.
In the REASON column for waiting jobs, "(Maintenance)" is shown for jobs that cannot start before the next maintenance reservation.
Please note that maintenance reservations are often moved forward to the next month before the actual maintenance window.
Many Irma compute nodes lost electric power -- FIXED
Three racks of Irma's compute nodes lost power because an automatic fuse tripped.
Some jobs were lost due to this. We are very sorry about that. Please rerun those jobs that were affected.
It looks like nodes i[167-250] were affected.
So what was the reason? It looks like an ethernet switch died, possibly short-circuited, so the automatic fuse tripped, taking more switches and the compute nodes down with it.
We have reported the error to our support vendor. Until the faulty ethernet switch has been repaired or replaced, Irma runs with fewer compute nodes.
Update at 0950 hours
Now only nodes i[179-226] are down.
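To check whether one of your jobs ran on an affected node, the bracketed ranges above can be expanded into individual node names. The following is a minimal sketch, assuming that Irma node names are simply the prefix "i" followed by the number (for example "i200"). The helper names expand_hostlist and was_affected are hypothetical and only handle a single plain range.

```python
import re


def expand_hostlist(expr):
    """Expand a simple range such as 'i[167-250]' into a list of node names.

    Assumption: one alphabetic prefix and one numeric range, no zero padding.
    """
    match = re.fullmatch(r"([A-Za-z]+)\[(\d+)-(\d+)\]", expr)
    if match is None:
        raise ValueError(f"unsupported node range: {expr!r}")
    prefix = match.group(1)
    first, last = int(match.group(2)), int(match.group(3))
    # Both ends of the range are inclusive.
    return [f"{prefix}{number}" for number in range(first, last + 1)]


# Ranges quoted in the news item above.
initially_affected = expand_hostlist("i[167-250]")
still_down = expand_hostlist("i[179-226]")


def was_affected(node):
    """Return True if the node was in the range that initially lost power."""
    return node in initially_affected


print(len(initially_affected), "nodes initially affected")
print(len(still_down), "nodes still down at the 0950 update")
```

On a system where Slurm is installed, `scontrol show hostnames` performs the same kind of expansion and handles more complex expressions.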
Maintenance window Wednesday 2017-09-06 -- FINISHED
milou2 rebooted August 28
milou2 rebooted Monday 2017-08-28 at 19:51.
Replacing (nearly) all disks on Irma's compute nodes -- DONE
We're restarting irma-q for technical reasons. The Slurm queue system may be unavailable for submitting jobs or checking job status for a few minutes.
milou2 rebooted August 19
milou2 rebooted on Saturday 2017-08-19.
Bianca's storage system Castor had a hiccup yesterday Thursday -- FIXED
Maintenance window Wednesday 2017-08-02 -- FINISHED
Unexpected reboot of Pica on Monday morning.
Restart of two Milou login servers today Thursday
Lower service level during UPPMAX holidays
Part of storage system Pica is still very slow
Pica was partly restarted just now, please look for problems in your job output
UPPMAX had to restart part of storage system Pica, because it worked too slowly with nearly no read/write traffic.
The restart was done a little after 1300 hours.
For Rackham users, this meant that you might have had problems with reading and writing to your home directory.
For Milou users, this likewise meant that you might have had problems with reading and writing to your home directory. In addition, reading from /sw (where the modules live) and reading and writing to some project directories were affected on Milou.
Please check your job output once more for problems, for jobs that were running at this time.
We are sorry for the inconvenience.
On Milou and Rackham, very difficult to login or otherwise use /home directories -- FIXED
UPPMAX has a problem with extremely slow access to /sw (where e.g. modules live) and home directories on Milou, and to home directories on Rackham.
Because of that, it is very difficult to log in to Milou and Rackham.
We will investigate the source of this problem, and will report any success as updates here.
Update at 1310 hours
We restarted part of Pica, and that solved the problem.
Hopefully your jobs will continue without problems, but please be careful and check your job output once more for errors.
SUPR and C3SE website down
The SUPR and C3SE websites are down at the moment, which prevents you from using SUPR. Please try again later.
No maintenance planned for today's maintenance window
The first (non-holiday) Wednesday of each month is UPPMAX's normal, planned maintenance window.
But today we will do no maintenance.
The next maintenance window is on the 2nd of August.
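The rule above can be turned into a small date calculation. The following is a minimal sketch using only the Python standard library. The function name first_wednesday is hypothetical, and the sketch does not know about holidays, so it only gives the default candidate date.

```python
import datetime


def first_wednesday(year, month):
    """Return the date of the first Wednesday in the given month.

    Holidays are NOT considered here; this is an assumption-free calendar
    calculation only, not UPPMAX's actual scheduling logic.
    """
    first_day = datetime.date(year, month, 1)
    # date.weekday() counts Monday as 0, so Wednesday is 2.
    days_ahead = (2 - first_day.weekday()) % 7
    return first_day + datetime.timedelta(days=days_ahead)


print(first_wednesday(2017, 8))
```

The result agrees with the dates given in the maintenance window headlines in this feed.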
Restart of login server milou-f Tuesday morning -- FINISHED
File system mounts of Pica volumes were not working correctly.
This was fixed by a restart of the server. Now it works much better.
We are sorry about any inconvenience for you due to this.
Lost contact with Milou nodes m[1-48] for an hour this morning -- FIXED
From approximately 0800 hours to 0910 hours this morning, an ethernet switch in Milou lost power, making 48 nodes unavailable.
Two jobs got NODE_FAIL when trying to start, and interactive work on these nodes was denied. Otherwise, we seem to have had no problems with the temporary network loss.
Singularity is available
Urgent kernel upgrade -- FINISHED
Today we are performing an urgent kernel upgrade on Milou, Fysast1, Rackham, Irma, and Bianca. Login nodes will be restarted during the day. No running jobs or queues are stopped. We will report on the progress here in System News during the day.
UPDATE 16:00 - Update completed.
Intelmpi performance issues
Bianca graphical login now working
Uses ThinLinc Web Access, not X forwarding.
Bianca's storage system Castor has problems -- FIXED
Maintenance window Wednesday 2017-06-07 -- FINISHED
UPPMAX shutdown due to cooling failure -- FIXED
The external cooling failed for (as yet) unknown reasons. All clusters and storage systems were shut down in order to prevent permanent hardware damage.
Please refrain from contacting support to ask for updates. We will update this article when new information becomes available.
Around 19:40 today the alarms about high temperatures in the computer room started to reach UPPMAX staff.
At 20:03 the temperature in the computer room reached critical levels and we were forced to shut down several systems, including Irma, Milou and Rackham.
We still have no idea what caused the supply of cooling to the computer room to fail but we will of course investigate this.
We are sorry for the problems this might have caused you and your research but it was necessary to shut down the systems in order to prevent permanent damage to the hardware.
UPDATE WEDNESDAY AT 0815 HOURS
For some reason the main cooling circuit at Ångströmlab had stopped and the two main pumps were not running. They had gone into emergency shutdown due to low pressure in the system.
Bravida and Akademiska hus were on site at approximately 19:30 and finally got the pumps running again around 23:15.
This morning at 07:50 we began to restart our systems. This will most likely take the whole day and maybe more. We will continue to update this post about our progress.
UPDATE WEDNESDAY AT 1250 HOURS
Please note that any jobs that were still running yesterday evening, when we had to stop all systems, will need to be resubmitted. When you run "finishedjobinfo", they will probably be marked with jobstate=NODE_FAIL. Jobs that started after that might run into strange problems because of bad connections to storage systems. We are sorry about these problems.
Jobs that are still waiting in a Slurm queue will probably run without problems when we put the systems back in production.
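If you have many jobs, picking out the ones to resubmit can be scripted. The following is a minimal sketch that scans saved "finishedjobinfo" output for the jobstate=NODE_FAIL marker mentioned above. It assumes that each line consists of whitespace-separated key=value fields and that the job identifier is in a field named jobid. Both points are assumptions, so check them against your own output first. The function names and the sample lines are made up for illustration.

```python
def parse_fields(line):
    """Collect whitespace-separated key=value tokens from one line into a dict."""
    fields = {}
    for token in line.split():
        key, separator, value = token.partition("=")
        if separator:  # skip tokens that contain no '='
            fields[key] = value
    return fields


def jobs_to_resubmit(lines):
    """Return the jobid of every line whose jobstate is NODE_FAIL.

    Assumption: the job identifier field is called 'jobid'.
    """
    failed = []
    for line in lines:
        fields = parse_fields(line)
        if fields.get("jobstate") == "NODE_FAIL" and "jobid" in fields:
            failed.append(fields["jobid"])
    return failed


# Hypothetical sample lines, invented for this example only.
sample = [
    "jobid=1001 jobstate=COMPLETED jobname=align",
    "jobid=1002 jobstate=NODE_FAIL jobname=assemble",
    "jobid=1003 jobstate=NODE_FAIL jobname=call_variants",
]

print(jobs_to_resubmit(sample))
```

With real data, the list of lines would come from a file where you have saved the command output.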
The cooling medium (water) in the house complex (Ångströmlab), where UPPMAX's computer room is located, is leaking somewhere, but no one knows yet where.
UPPMAX can probably not put the systems in production until that problem is solved, because future repair work might leave our computer room without cooling again. (And any jobs that we allow to run at that time would crash.)
We have decided to use the waiting time to do now what we had planned for the maintenance on Wednesday next week.
So we are going to upgrade Bianca, Fysast1, Grus, Milou, and Rackham. In return, we no longer plan any maintenance for next week.
We also plan to upgrade Irma, today or tomorrow. That will be a little more difficult, due to current problems with storage system Lupus.
UPDATE WEDNESDAY AT 1640 HOURS
We have upgraded Fysast1, Milou and Rackham, and now allow you to log in, if you have a project there.
Upgrade of Bianca and Irma will continue tomorrow.
The cooling problem is not solved. Someone will continue to add new water to the system, day and night, but the leak is not found.
UPPMAX anticipates that the future repair will create too much heat in the computer room. We do not want to crash running jobs if we need to stop the compute nodes again, so we will not unlock the Slurm queues yet.
Hopefully this will be solved tomorrow, Thursday.
UPDATE THURSDAY AT 1150 HOURS
Akademiska hus (AU), which is responsible for the cooling system of the Ångström laboratory (where our server hall is located), reports that the leak has still not been found. AU is refilling the coolant regularly, and will continue to do so until the leak is found.
We have started the queues on Milou, Rackham and Fysast1 again, but we may be forced to stop the queues and shut down the hall once again, depending on the outcome of the ongoing investigation by AU.
Update Monday at 1615 hours
The leak has not yet been found, but the system is no longer leaking. It appears to have repaired itself.
Today we have also started Irma and Bianca, and thus everything is back in production.