In close collaboration with Seoul National University's Structural Complexity Laboratory

 



Booting SC Lab Machines

The cluster machines have very old power supplies that are no longer obtainable. The power surge when everything in the machine room comes up at once often kills one or two of these power supplies. So if there is a scheduled powerdown, all machines need to be turned off and physically disconnected from the power supply ahead of time. If there is an unscheduled powerdown, please physically disconnect the machines anyway, if you can, before power is restored.

The machines must be rebooted in a specific order. Follow the steps below for each machine:

SC

sc is the top machine in the rack

  1. Check that the external backup disk is turned on (it might not restart after a crash) - sc can boot without the backup disk, but it won't be able to back up properly if the disk is turned on after it boots
  2. Boot the machine (this can take 15 minutes, and for the first 5 minutes there may not be anything showing on the screen while it does its memory check)
  3. Restart scteach (preferably by logging in to sc on an nx console and running scteach there) - normally, Bob will look after this

SC1

sc1 is the second machine from the bottom of the rack

  1. Turn on the external RAID array (the bottom machine in the rack) - sc1 _cannot boot without this_.
  2. Check that the external backup disk is turned on (it might not restart after a crash) - sc1 can boot without the backup disk, but it won't be able to back up properly if the disk is turned on after it boots
  3. Turn on at least one blade (it won't boot properly, but if none are booted, sc1's cluster control software won't come up properly)
  4. Boot the machine (this can take a long time, though not as long as sc)
    1. If, after about ten minutes, the system is still syncing the RAID disks (all the RAID lights are flashing), then the boot has probably failed, and you should reboot again
      1. This problem might have been due to a flaky disk that eventually failed completely and had to be replaced, but we aren't completely sure
  5. Boot or reboot the blades once sc1 is running fully. You should do this even if pbsmon shows them as running.
  6. Run pbsmon (in the education menu on sc1) to check that all the blades boot properly
  7. Restart sc1a (preferably by logging in to sc on an nx console and running scteach there) - normally, Bob will look after this
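
Because the order of the steps matters, it can help to keep the two sequences as a checklist that refuses to skip ahead. The following is only a minimal sketch of that idea in Python. The step strings are plain-text labels paraphrased from the lists above, not real commands, and the function name next_step is hypothetical. It assumes the steps must be done strictly in the listed order.

```python
# Hypothetical checklist for the boot order described above.
# Step names are descriptive labels only; nothing here touches the machines.

SC_STEPS = [
    "check the external backup disk is on",
    "boot sc",
    "restart scteach",
]

SC1_STEPS = [
    "turn on the external RAID array",
    "check the external backup disk is on",
    "turn on at least one blade",
    "boot sc1",
    "boot or reboot the blades",
    "run pbsmon to check the blades",
    "restart sc1a",
]


def next_step(done, steps):
    """Return the next step to perform, or None when every step is complete.

    `done` is the list of steps already performed, in the order they were
    performed. Raises ValueError if `done` does not follow the order of
    `steps` exactly (a step was skipped, repeated, or done early).
    """
    done = list(done)
    if done != list(steps[:len(done)]):
        raise ValueError("steps were performed out of order")
    if len(done) == len(steps):
        return None
    return steps[len(done)]


if __name__ == "__main__":
    # Example: the RAID array, backup disk, and one blade have been handled.
    completed = SC1_STEPS[:3]
    print("Next:", next_step(completed, SC1_STEPS))
```

Passing a list such as ["boot sc1"] for sc1 raises ValueError, since the RAID array has to be turned on first.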