UPPMAX support system is down -- SOLVED
RT, the support system used by UPPMAX and the rest of SNIC, is down.
It is located at NSC at Linköping University and the whole university has network problems.
All email to and from email@example.com will be delayed until the network problem is fixed, so answers to your support tickets will be delayed.
We now have contact with our support system and emails to firstname.lastname@example.org are reaching us again.
Slow home directories
Someone seems to be running something very I/O-heavy from the home directories. We are looking for these jobs and will terminate them if found, but it's less than certain that we'll find them.
We found the guilty jobs and are terminating them, and we have notified the user not to do that again.
Accident on Irma caused jobs to fail with status NODE_FAIL
We sadly inform you that today at 17:02:37 a human error caused the compute nodes on Irma to reboot. The running jobs were canceled and will show up with status NODE_FAIL. The accident occurred while we were investigating an issue with the storage network. We are very sorry about this.
UPPMAX shutdown due to cooling failure -- FIXED
lupus failover issue -- FIXED
Maintenance indication in output from command jobinfo
UPPMAX made a small change in "jobinfo" output.
In the REASON column for waiting jobs, "(Maintenance)" is shown for jobs that can not start before the next maintenance reservation.
Please note that maintenance reservations are often moved forward to the next month before the actual maintenance window.
Many Irma compute nodes lost electric power -- FIXED
Three racks of Irma's compute nodes lost power, because an automatic fuse shut down.
Some jobs were lost due to this. We are very sorry about that. Please rerun those jobs that were affected.
It looks like nodes i[167-250] were affected.
So what was the reason? It looks like an ethernet switch died, possibly from a short circuit, so the automatic fuse shut down, taking more switches and the compute nodes down with it.
We have reported the error to our support vendor. Until the bad ethernet switch has been repaired or replaced, Irma runs with fewer compute nodes.
Update at 0950 hours
Now only nodes i[179-226] are down.
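To check whether a particular job ran on one of the affected nodes, the bracketed range notation can be expanded into individual node names. The following is a minimal Python sketch, not an UPPMAX tool: the helper names `expand_nodes` and `was_affected` are hypothetical, and it assumes only the simple `prefix[first-last]` form used above, with unpadded numbers. On a system with Slurm installed, `scontrol show hostnames` performs a similar expansion.

```python
import re


def expand_nodes(spec):
    """Expand a simple range such as 'i[167-250]' into ['i167', ..., 'i250'].

    Assumption: only the single-range form 'prefix[first-last]' is handled.
    """
    match = re.fullmatch(r"([A-Za-z]+)\[(\d+)-(\d+)\]", spec)
    if match is None:
        raise ValueError(f"unsupported node specification: {spec!r}")
    prefix = match.group(1)
    first, last = int(match.group(2)), int(match.group(3))
    return [f"{prefix}{number}" for number in range(first, last + 1)]


def was_affected(node, spec):
    """Return True if the node name falls inside the given range."""
    return node in expand_nodes(spec)


# Ranges taken from the announcement above.
initially_down = expand_nodes("i[167-250]")
still_down = expand_nodes("i[179-226]")
print(len(initially_down), "nodes were affected at first")
print(len(still_down), "nodes were still down at the 0950 update")
```

For example, `was_affected("i180", "i[179-226]")` is true, while `was_affected("i170", "i[179-226]")` is false, since i170 was back by the time of the update.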
Maintenance window Wednesday 2017-09-06 -- FINISHED
milou2 rebooted August 28
milou2 rebooted Monday 2017-08-28 at 19:51.
Replacing (nearly) all disks on Irma's compute nodes -- DONE
We're restarting irma-q for technical reasons. The slurm queue system may be unavailable for submitting/verifying job status for a few minutes.
milou2 rebooted August 19
milou2 rebooted on Saturday 2017-08-19.
Bianca's storage system Castor had a hiccup yesterday Thursday -- FIXED
Maintenance window Wednesday 2017-08-02 -- FINISHED
Unexpected reboot of Pica on Monday morning
Restart of two Milou login servers today Thursday
Lower service level during UPPMAX holidays
Part of storage system Pica is still very slow
Pica was partly restarted just now, please look for problems in your job output
UPPMAX had to restart part of storage system Pica, because it worked too slowly with nearly no read/write traffic.
The restart was done a little after 1300 hours.
For Rackham users, this meant that you might have had problems with reading and writing to your home directory.
For Milou users, this meant that you also might have had problems with reading and writing to your home directory. But for Milou users, also reading from /sw (where the modules live) and reading and writing to some project directories were affected.
Please check the output of jobs that were running at this time an extra time for problems.
We are sorry for the inconvenience.
On Milou and Rackham, very difficult to login or otherwise use /home directories -- FIXED
UPPMAX has a problem with extremely slow access to /sw (where e.g. modules live) and home directories on Milou, and to home directories on Rackham.
Because of that, it is very difficult to log in to Milou and Rackham.
We will investigate the source of this problem, and will report any success as updates here.
Update at 1310 hours
We restarted part of Pica, and that solved the problem.
Hopefully your jobs will continue without problems, but please be careful and check your job output an extra time for errors.
SUPR and C3SE website down
The SUPR and C3SE websites are down at the moment, which prevents you from using SUPR. Please try again later.
No maintenance planned for today's maintenance window
The first (non-holiday) Wednesday of each month is UPPMAX's normal, planned maintenance window.
But today we will do no maintenance.
The next maintenance window is the 2nd of August.
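The first-Wednesday rule can be computed directly, which may be handy when planning long jobs around the maintenance window. This is a minimal Python sketch under stated assumptions: the function name `first_wednesday` is hypothetical, and the sketch ignores the "non-holiday" exception and any rescheduling UPPMAX may announce.

```python
import datetime


def first_wednesday(year, month):
    """Return the date of the first Wednesday in the given month.

    Assumption: holidays are not taken into account.
    """
    first_day = datetime.date(year, month, 1)
    # date.weekday() counts Monday as 0, so Wednesday is 2.
    days_until_wednesday = (2 - first_day.weekday()) % 7
    return first_day + datetime.timedelta(days=days_until_wednesday)


print(first_wednesday(2017, 8))
```

For August 2017 this gives 2017-08-02, matching the date of the next maintenance window announced above.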
Restart of login server milou-f Tuesday morning -- FINISHED
File system mounts of Pica volumes were not working correctly.
This was fixed by a restart of the server. Now it works much better.
We are sorry about any inconvenience for you due to this.
Lost contact with Milou nodes m[1-48] for an hour this morning -- FIXED
From approximately 0800 hours to 0910 hours this morning, an ethernet switch in Milou lost power, making 48 nodes unavailable.
Two jobs got NODE_FAIL when trying to start, and interactive work on these nodes was denied. Otherwise, we seem to have had no problems with the temporary network loss.
Singularity is available
Urgent kernel upgrade -- FINISHED
Today we are performing an urgent kernel upgrade on Milou, Fysast1, Rackham, Irma, and Bianca. Login nodes will be restarted during the day. No running jobs or queues will be stopped. We will post updates on the progress here in System News during the day.
UPDATE 16:00 - Update completed.
Intelmpi performance issues
Bianca graphical login now working
It uses ThinLinc Web Access, not X forwarding.
Bianca's storage system Castor has problems -- FIXED
Maintenance window Wednesday 2017-06-07 -- FINISHED
Rackham is now open for all users
All active Tintin projects (except Tintin-Fysast1, please see below) have been migrated to Rackham. All UPPMAX users should now have access to Rackham.
Dear former Tintin user,
UPPMAX has now migrated all active Tintin projects to Rackham, except
for combined Tintin-Fysast1 projects (you should know, if you belong
to one of those three projects.)
Fysast1 projects have not moved to Rackham, because there is no
common storage system for Fysast1 and Rackham. (If needed, please
apply for a project on Rackham.)
If you belong to an active Tintin project, it is now changed into a
Rackham project, and as a project member you should be able to login
If you have any questions about Rackham, you are very welcome to
get in touch at the same, old e-mail address "email@example.com".
For a start: There are some differences between Tintin and Rackham,
as described below.
Each compute node on Rackham contains 20 compute cores, instead of the
16 compute cores on Tintin. This means that you have to rethink how
many nodes and cores you want to allocate in your jobs.
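As a rough illustration of that rethinking, the number of whole nodes needed for a given core count can be compared between the two clusters. This is a minimal Python sketch: the core counts per node come from the text above, while the function name `nodes_needed` and the example job sizes are made up for illustration.

```python
import math

# Cores per compute node, as stated in the text above.
CORES_PER_NODE_TINTIN = 16
CORES_PER_NODE_RACKHAM = 20


def nodes_needed(cores, cores_per_node):
    """Return the smallest number of whole nodes that provides `cores` cores."""
    if cores <= 0 or cores_per_node <= 0:
        raise ValueError("core counts must be positive")
    return math.ceil(cores / cores_per_node)


# Hypothetical job sizes, chosen only to show the difference.
for cores in (16, 64, 100):
    on_tintin = nodes_needed(cores, CORES_PER_NODE_TINTIN)
    on_rackham = nodes_needed(cores, CORES_PER_NODE_RACKHAM)
    print(f"{cores} cores: {on_tintin} Tintin node(s), {on_rackham} Rackham node(s)")
```

Note that a job sized to fill whole Tintin nodes, such as 64 cores, still needs 4 nodes on Rackham but leaves 16 of the 80 allocated cores unused, so asking for a multiple of 20 cores may be the better fit.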
UPPMAX has moved the project directories from Tintin (storage system Pica)
to Rackham (storage system Crex).
The project directory on Tintin was divided into two parts, one backed-up
part, and a no-backup part. Each of these parts had a file space limit
of 512 GB.
On Rackham these two parts have been joined into one part, that must not
exceed 128 GB. It is backed up, except for the subdirectory that is named
For those project directories that exceeded 128 GB in size, we have
given you one month from now to shrink your usage to below 128 GB.
The uquota command will give you information about your usage and
If your project needs more space than those 128 GB, you may apply for
a storage project that gives an additional storage directory, which is
not backed up. You apply in SUPR, in round UPPMAX Storage 2017:
As with Tintin, you log in with ssh to one of the (four) login nodes,
that have the common name "rackham.uppmax.uu.se". So, please open a
terminal and run
If you want to run graphical applications you must specify -X or -Y,
ssh -X firstname.lastname@example.org
YOUR PROJECTS ON RACKHAM
As usual, you get a list of your projects with the projinfo command.
Please note that most software that you may have compiled on Tintin,
should be recompiled on Rackham, because of the more modern computer
For a complete list of currently installed software please run after
You can search for modules with the "module spider"
module spider name-of-software
The list of available software will be updated in the coming weeks. At
this time we have most of the compilers (icc, mpicc, gcc, gfortran and
javac) and interpreters (Python, Perl, R) and software (MATLAB,
GAUSSIAN, COMSOL, RStudio, OpenFOAM, GROMACS) installed.
If you are missing software and are unable to install it yourself, you
may ask for support at email@example.com.
SOME MORE INFORMATION
You are also welcome to read web page