In close collaboration with Seoul National University's Structural Complexity Laboratory

 

Booting SC Lab Machines

The cluster machines have very old power supplies that are no longer obtainable. The power surge when everything in the machine room comes up at once often kills one or two of these power supplies. So before a scheduled powerdown, all machines need to be turned off and physically disconnected from the power supply. After an unscheduled powerdown, please physically disconnect them anyway, if you can, before powerup occurs.

The machines must be rebooted in a specific order, as follows:

SC1

sc1 needs to be rebooted first. It is the second machine from the bottom of the rack.

  1. Turn on the external RAID array (the bottom machine in the rack) - sc1 cannot boot without this.
  2. Check that the external backup disk is turned on (it might not restart after a crash) - sc1 can boot without the backup disk, but it won't be able to back up properly if the disk is turned on after it boots
  3. Turn on at least one blade (the blade itself won't boot properly at this point, but if no blade is on, sc1's cluster control software won't come up properly)
  4. Boot the machine (this can take a long time, though not as long as sc)
    1. If, after about ten minutes, the system is still syncing the RAID disks (all the RAID lights are flashing), then the boot has probably failed, and you should reboot
      1. This problem might have been due to a flaky disk that eventually failed completely and had to be replaced, but we aren't completely sure
  5. Boot or reboot the blades once sc1 is running fully. You should do this even if pbsmon shows them as running. In particular, you need to reboot the blade you turned on in step 3, because it won't be running properly.
  6. Run pbsmon (in the education menu on sc1) to check that all the blades boot properly
  7. Restart sc1a (preferably by logging into sc on an nx console and running scteach in that) - normally, Bob will look after this
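
The strict ordering of the SC1 steps above can be sketched as a small checklist helper. This is only an illustration, not a tool that exists on the lab machines: the step wording is paraphrased from the list, and the names SC1_STEPS and first_unconfirmed are made up for this example. It assumes the operator (or some other yes/no source) can confirm each step in turn.

```python
from typing import Callable, List, Optional

# Steps paraphrased from the SC1 list, in the order they must happen.
SC1_STEPS: List[str] = [
    "External RAID array is turned on",
    "External backup disk is turned on",
    "At least one blade is turned on",
    "sc1 is booted and running fully",
    "All blades have been booted or rebooted",
    "pbsmon shows that all blades booted properly",
    "sc1a has been restarted",
]


def first_unconfirmed(steps: List[str],
                      confirm: Callable[[str], bool]) -> Optional[int]:
    """Check the steps in order and stop at the first one not confirmed.

    Returns the 1-based number of that step, or None if every step
    is confirmed. Later steps are never checked once one fails.
    """
    for number, step in enumerate(steps, start=1):
        if not confirm(step):
            return number
    return None


# Example: only the RAID array and the backup disk have been dealt with so far.
done_so_far = {SC1_STEPS[0], SC1_STEPS[1]}
stopped_at = first_unconfirmed(SC1_STEPS, lambda step: step in done_so_far)
if stopped_at is None:
    print("SC1 checklist complete")
else:
    print(f"Next thing to do is step {stopped_at}: {SC1_STEPS[stopped_at - 1]}")
```

Passing the confirmation in as a function keeps the ordering logic separate from how the answer is obtained, so the same helper would work with a keyboard prompt or a scripted check.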

SC

sc is the top machine in the rack. Its HTTP server depends on some resources from sc1, so it won't boot properly until sc1 has completely rebooted.

  1. Check that the external backup disk is turned on (it might not restart after a crash) - sc can boot without the backup disk, but it won't be able to back up properly if the disk is turned on after it boots
  2. Boot the machine (this can take 15 minutes, and for the first 5 minutes there may not be anything showing on the screen while it does its memory check)
  3. Restart scteach (preferably by logging into sc on an nx console and running scteach in that) - normally, Bob will look after this
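
The dependencies described on this page (RAID array before sc1, sc1 before sc, and so on) can also be written down as a graph and sorted, which gives a quick way to double-check a proposed boot order. The sketch below uses Python's standard graphlib module. The node names are labels invented for this example, and a few modelling choices are assumptions rather than facts from this page: the backup disks are treated as if they were hard prerequisites (the machines can boot without them, but backups then fail), the backup disk is listed once per machine although the page does not say whether it is one shared disk, and sc1a is left out because the page does not make clear what it depends on.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key maps to the set of things that must be done before it.
BOOT_DEPENDENCIES = {
    "sc1": {"RAID array", "sc1 backup disk", "one blade powered on"},
    "blades rebooted": {"sc1"},
    "pbsmon check": {"blades rebooted"},
    "sc": {"sc1", "sc backup disk"},
    "scteach restarted": {"sc"},
}


def boot_order(dependencies):
    """Return one valid order in which every item comes after its prerequisites."""
    return list(TopologicalSorter(dependencies).static_order())


order = boot_order(BOOT_DEPENDENCIES)
for position, item in enumerate(order, start=1):
    print(f"{position}. {item}")
```

More than one order can satisfy the same dependencies, so the printed list is one acceptable sequence rather than the only one. TopologicalSorter raises graphlib.CycleError if the dependencies contradict each other, which makes it a cheap consistency check when the procedure is edited.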