Rackham is now available
We are happy to announce that UPPMAX's cluster Rackham is now available!
Rackham is now available for all local projects with names starting with "snic2017", and for all new course projects.
On the first of March we will decommission Tintin and begin moving all Tintin projects to Rackham. The migration will probably be finished within a few days.
Rackham consists of four login nodes and 304 compute nodes. Each compute node contains two 10-core Intel Xeon CPUs together with 128GB ("thin") or 256GB ("fat") of memory. Your project data will be stored on Crex, Rackham's storage system, currently capable of storing 1PB of data. Crex is a high-performance file storage system from DDN that uses the Lustre filesystem.
If you are used to Tintin, we ask you to pay attention to the following:
* More nodes!
Rackham has 304 nodes (with more on the way!) while Tintin in the end had only 150. Do not, however, assume that Rackham's nodes are identical to Tintin's; they're not. You will find that fewer nodes are needed on Rackham to perform the same work, and you will need to adjust your job scripts accordingly.
* More cores!
Rackham has 20 cores per node, a 25% increase from Tintin's 16 cores per node. Remember that when scheduling your node jobs! For the technically interested: each Rackham node has two Intel Xeon E5-2630 v4 CPUs running at 2.2 GHz (maximum turbo frequency 3.1 GHz) with 25MB of shared cache per CPU. If the previous sentence means nothing to you, don't worry: the only thing you need to know is that your core jobs will finish much faster thanks to the newer generation of CPUs.
Note that if you've built your own applications tailored for Tintin's AMD Bulldozer CPUs, you will need to recompile on Rackham to take advantage of the Intel CPUs. Tip: try compiling with the Intel compilers and tools from "module load intel" and you will likely see a jump in performance. Remember, faster code means less compute time and less billing of your project's core hours.
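As a rough sketch, recompiling with the Intel toolchain could look something like the following. The source file name and optimization flags are illustrative only; check the available module versions on the cluster before relying on them.

```
# Illustrative recompilation with the Intel toolchain (file names
# and flags are examples, not the exact setup on Rackham).
module load intel                  # make icc and the Intel tools available
icc -O3 -xHost -o myapp myapp.c    # -xHost optimizes for the host's Intel CPU
```

The -xHost flag tells icc to generate instructions for the highest instruction set available on the machine doing the compiling, which is why code built this way should be compiled on Rackham itself.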
* More memory!
Each node comes with 128GB of memory (or 6.4GB per core) vs. Tintin's 64GB (or 4GB per core). For the most memory intensive applications you may also request up to 32 fat nodes each containing 256GB of memory.
The biggest differences you will find having your project directory on Crex instead of Pica are:
* No .snapshot directory. If you lose a file, you need to contact email@example.com and we will retrieve it from the backups. The .snapshot directory previously found inside any directory of your project is no longer supported (for your home directory, .snapshot is still available).
* Smaller initial storage for your project data. The default size of the project and nobackup areas will be 128GB in total. It will be possible to apply for more storage if needed.
* We no longer support Webexport.
For Fysast1, Milou, and Tintin, UPPMAX provides a webexport service based on storage space on Pica. Pica will not be available on Rackham, and Rackham has no space set aside for the webexport service, so it will not be provided.
Lastly, how do you get access to Rackham? If you already have a project on Tintin, UPPMAX will migrate it to Rackham at the beginning of March.
If you don't have a Tintin project and are interested in working on Rackham and Crex, you may apply for a SNIC-project on https://supr.snic.se/round/2017smalluppmax/.
* A note on Software
For a complete list of currently installed software, run the following after logging in:
module avail
As on Milou, you can search for modules with the "module spider" command:
module spider name-of-software
The list of available software will be updated in the coming weeks. At this time we have most of the compilers (icc, mpicc, gcc, gfortran and javac), interpreters (Python, Perl, R) and applications (MATLAB, GAUSSIAN, COMSOL, RStudio) installed. OpenFOAM, VASP and GROMACS are scheduled for installation and will soon be available. If you are missing software and are unable to install it yourself, you may ask for support at firstname.lastname@example.org.
We look forward to hearing your thoughts and feedback on Rackham!
Issue with 'interactive' and creating slurm.out files
Problem with Slurm on Rackham and Milou
There is currently a problem with the Slurm master node which affects users on Rackham and Milou. We are investigating.
March maintenance day -- UPDATED Thursday 07:00
On Wednesday the 7th of March we began our monthly service window. Systems and services may become unreachable during the day.
Files and directories may be hidden on Bianca -- SOLVED Wednesday
We have received reports of missing files and directories inside the /proj and /proj/nobackup directories on Bianca. Upon inspection the files are actually there, but are not shown by the "ls" command. If you are working on Bianca, you should be aware of this: a job of the type "process all files in directory X and compile the result" might finish without errors but silently work on incomplete input, risking incorrect results and conclusions.
A workaround was implemented on Wednesday 2018-03-08 that mitigates this issue.
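While such an issue is open, a defensive pre-check in job scripts can guard against silently missing input. The sketch below is illustrative: it uses a temporary directory and a hard-coded file count as stand-ins for a real project directory and its expected contents.

```shell
# Illustrative pre-flight check for a batch job: abort early if the
# input directory shows fewer files than expected.
# (directory and count are examples, not real project paths)
dir=$(mktemp -d)                      # stand-in for e.g. /proj/xyz/data
touch "$dir/a.txt" "$dir/b.txt" "$dir/c.txt"
expected=3
actual=$(find "$dir" -maxdepth 1 -type f | wc -l)
if [ "$actual" -eq "$expected" ]; then
    echo "input complete: $actual files"
else
    echo "expected $expected input files, found $actual - aborting" >&2
    exit 1
fi
```

Failing fast like this turns a hard-to-spot scientific error into an obvious job failure.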
Configuration problems on Milou and Irma - SOLVED
Slow home directories
Home directories have occasionally been extremely slow today. Nothing seems broken but the system is under a lot of pressure from time to time.
Rackham login issues -- SOLVED
We are currently experiencing, and receiving reports of, login issues on Rackham.
The fat (256GB) Rackham nodes are currently unavailable -- SOLVED
The fat (256GB) Rackham nodes are currently unavailable due to an issue with Slurm. We are investigating this issue.
Rackham's storage system -- MONDAY: Queues released
Due to an issue with the storage system Crex the Slurm queue on Rackham is currently on hold. This is a summary of the problem.
No new jobs on Rackham 2018-02-09 11:15
We are experiencing problems with Crex, the file system on Rackham. In order not to put more strain on the file system, we will not allow new jobs to start at the moment. If you submit jobs, they will be held in the queue.
The fysast1 cluster is back online
The Milou cluster is back online
The Rackham cluster is back online
Bianca online again
The Bianca cluster is back online following our service window.
Maintenance window Wednesday 2018-02-07 -- CLOSED
For the February service we will install our new UPS, update Slurm on all clusters, extend the capacity of the storage system Lupus (for Irma), and of course perform the standard kernel and security updates.
The UPPMAX Cloud region will be unavailable Thursday 17:00-20:00 CET
A central switch will be restarted tomorrow, Thursday 2018-01-31. The cloud will become temporarily unreachable from the outside, i.e. the Internet.
Problems with the 'interactive' and Slurm commands on Rackham
The Slurm master on Rackham is currently overloaded and you may experience sluggish Slurm behavior or timeout issues when running commands such as interactive, jobinfo and squeue. We are investigating this issue.
Some project volumes on Pica are slow
Some project volumes on Pica are slow; this may also affect home directories.
Login issue for new Bianca projects -- FIXED
A network problem has been detected on Bianca, causing logins to fail for a few of the most recent Bianca projects. We are working on fixing the problem and expect Bianca to be fully working again very soon.
Maintenance window -- COMPLETED
Monthly maintenance window begins at 0900 hours on the first Wednesday of the month. (That is today.)
This time we will:
- Upgrade Slurm, Linux kernel and other system software on Bianca, Dis, Fysast1, Irma, Milou, and Rackham.
- Upgrade Linux kernel and other system software on Castor and Grus.
Bianca and Grus will be unavailable while we service them.
We will restart all login nodes of Fysast1, Irma, Milou and Rackham, probably only once.
Slurm queues on Fysast1, Irma, Milou and Rackham will be stopped, but access to Slurm commands will mostly work during the day.
Slurm queues on Bianca will be stopped and, most of the day, logins to Bianca will not be possible.
We plan to keep you informed about our progress with the maintenance through updates here.
UPDATE 2018-01-10, 16:00
Irma is up and running. Bianca, Milou, Rackham and Fysast1 are still down. We will continue security upgrades tomorrow (Thursday) morning.
UPDATE 2018-01-11, 15:15
Irma, Milou, Rackham and Fysast1 are up and running. Bianca is still being tested. Hopefully Bianca will be back today.
UPDATE 2018-01-11, 16:00
Bianca is now up, however, graphical login is not working right now. Text login works fine (http://www.uppmax.uu.se/support/user-guides/bianca-user-guide/).
We're still working on Dis and expect it to be up by tomorrow.
Extension of lupus
The vendor visited us last week and performed the physical installation of the lupus extension. Unfortunately, some parts were not correct, and we are currently waiting for replacements that are expected to arrive this week.
UPPMAX staff back after the holidays
We hope 2018 has been good to you so far! UPPMAX staff is back after the holidays and we're focusing on support tickets that have built up over the holidays.
Reduced staff availability over the coming holidays combined with lots of tickets
Most of our staff is on vacation over the coming holidays. You can contact us using regular channels, but response times for support questions might be longer than normal. We are sorry for the inconvenience.
In the first week of January, most of us will be back again.
We also have a lot of tickets about transfers from Milou to Rackham/Bianca, and we think there might be hundreds of last-minute requests in January. Be prepared that the process of getting a transfer project takes some time.
If you want to continue your Milou project, make sure you have applied for a storage project and compute project on Rackham (for non-sensitive data), or a project on Bianca (for sensitive data). http://uppmax.uu.se/support/getting-started/moving-your-research-from-milou-to-rackham/
Creation of new Bianca projects currently on hold -- FIXED
The creation of new Bianca projects is currently on hold. If your project is scheduled to start today, you will be unable to log in.
milou2 rebooted on Friday 2017-12-08 at 03:52
milou2 rebooted on Wednesday 2017-12-06 at 03:58
Updates from SUPR are temporarily disabled
We are performing a change in our infrastructure today starting at 13:00. This change will temporarily stop updates from SUPR from reaching UPPMAX. If you have, for example, recently joined or added a member to a project, you will have to wait before the change becomes visible at UPPMAX.
Fix for broken SSH-connections to the UPPMAX Cloud
If you regularly end up with broken SSH-connections ("broken pipe") to your virtual machine in the UPPMAX region, please use the SSH option ServerAliveInterval. See below for an example.
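For example, adding something like the following to your local ~/.ssh/config keeps the connection alive; the host alias and address are placeholders to replace with your own:

```
# Keep SSH connections to the cloud VM alive (alias/address are examples).
Host my-uppmax-vm
    HostName <your-vm-address>
    ServerAliveInterval 60      # send a keep-alive probe every 60 seconds
    ServerAliveCountMax 3       # disconnect after 3 unanswered probes
```

The same option can be given directly on the command line: ssh -o ServerAliveInterval=60 user@<your-vm-address>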
Issue with the Intel License server
At this moment there is an issue with the Intel license server. You will be unable to use the icc compiler and Intel tools until this issue is resolved.
UPPMAX support low on staff Monday 20/11
UPPMAX support will be low on staff on Monday 2017-11-20 due to a conference.
How to get a high job priority on Bianca
Support ticket system temporarily down --FIXED
Our support email address email@example.com was down for a couple of hours, but is back in service again.
Logging in to Bianca without Rackham
Bianca users outside of SUNET will be unable to login using rackham.uppmax.uu.se. We have created a temporary workaround.
Rackham unavailable -- SOLVED: Rackham available
2017-11-17 09:35 Rackham is now back in regular service.
Login nodes are now open on Rackham, and jobs are expected to run as usual on Friday morning.
We decided to temporarily close down the Rackham cluster last Thursday when several disks on Crex reported themselves as broken. The problems now seem solved, and we are awaiting the results of the last tests before Rackham is fully back in service.
UPPMAX power outage -- FIXED
UPPMAX experienced a power outage in the server hall on Tuesday.
Problems with /sw on Bianca (now fixed)
The /sw part of Bianca was lost around 07:30 this morning due to an issue with the storage system. This may have caused jobs to fail. The system was fixed at 08:40.
Quick upgrade of Slurm 2017-11-02 -- COMPLETED
Maintenance window Wednesday 2017-11-01 -- COMPLETED
Monthly maintenance window begins at 09:00 hours on the first Wednesday of the month. (That is today.)
Issues with /sw/data during the weekend
/sw/data from Pica may have been unavailable for some jobs during the weekend, and some jobs may have failed because of this.
UPPMAX support system is down -- SOLVED
RT, the support system that UPPMAX and the rest of SNIC use, is down.
It is located at NSC at Linköping University, and the whole university is having network problems.
All email to and from firstname.lastname@example.org will be delayed until the network problem is fixed, so answers to your support tickets will be delayed.
We now have contact with our support system and emails to email@example.com are reaching us again.
Slow home directories
Someone seems to be running something very I/O-heavy from the home directories. We are looking for these jobs and will terminate them if found, but it is not certain that we will find them.
We found the offending jobs and are terminating them, and we have notified the user not to do that again.
Accident on Irma caused jobs to fail with status NODE_FAIL
We are sad to inform you that today at 17:02:37 a human error caused the compute nodes on Irma to reboot. Jobs that were running were cancelled and will show up with status NODE_FAIL. The accident occurred while investigating an issue with the storage network. We are very sorry about this.
UPPMAX shutdown due to cooling failure -- FIXED
lupus failover issue -- FIXED
Maintenance indication in output from command jobinfo
UPPMAX made a small change in "jobinfo" output.
In the REASON column for waiting jobs, "(Maintenance)" is shown for jobs that cannot start before the next maintenance reservation.
Please note that maintenance reservations are often moved forward to the next month before the actual maintenance window.
Many Irma compute nodes lost electric power -- FIXED
Three racks of Irma's compute nodes lost power because an automatic fuse tripped.
Some jobs were lost due to this. We are very sorry about that. Please rerun those jobs that were affected.
It looks like nodes i[167-250] were affected.
So what was the reason? It looks like an Ethernet switch died, possibly due to a short circuit, which tripped the automatic fuse and took more switches and the compute nodes down.
We have reported the error to our support vendor. Until the bad Ethernet switch has been repaired or replaced, Irma will run with fewer compute nodes.
Update at 0950 hours
Now only nodes i[179-226] are down.
Maintenance window Wednesday 2017-09-06 -- FINISHED
milou2 rebooted August 28
milou2 rebooted Monday 2017-08-28 at 19:51.
Replacing (nearly) all disks on Irma's compute nodes -- DONE