Uppsala Multidisciplinary Center for Advanced Computational Science

Most of the Milou nodes went down on Thursday evening

2016-07-15

Milou contains 248 compute nodes. 40 of those are set up as a separate cluster, named Fysast1.

125 compute nodes stopped yesterday evening. We are now starting them again and are also investigating why they stopped working.

Many jobs crashed because of this, and you will need to resubmit them. If you investigate those jobs with, for example, the tool "finishedjobinfo", the job state will most probably be given as "NODE_FAIL".

We are sorry for the inconvenience.

Update at 1145 hours

We suspect that there was a dip in the power supply, which made 125 nodes think that they needed to shut down in an orderly way, as if we (or a gremlin) had lightly pushed the power button.
