Incident log

This page is the incident log for the halibut systems. It exists to keep our user community informed of incidents that might affect them.



Date: 2003.12.21
Posted by: Josh

HalNet was down for about 2 hours due to a power outage (which was in turn due to an earthquake centered about 50 miles north of Mark's house). The generator was pulled out, but not connected before the power came back up.



Date: 2003.12.15
Posted by: Josh

The power outage killed the SDSL router (it locked up every half hour or so), so HalNet had intermittent connectivity for the latter portion of the 14th and most of the 15th (service came back shortly after someone was around to kick the router). We swapped out the router on the evening of the 15th, which seems to have cleared things up. We used the downtime as an opportunity to install more memory in chiba and do some other hardware work.

Date: 2003.12.14
Posted by: Josh

Power outage for 2 hours, from roughly 7:55am PST to 10:00am PST. The UPS only held up for 15 minutes.



Date: 2003.11.22
Posted by: Josh

We installed a new RAID system (3ware based), with 4 drives total (3 used in a RAID 5 array, 1 hot spare). We've gone from 98-100% usage on /home to 6% usage. Happy day! Now to see if the new hardware works out well...
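
For the curious, the capacity math is just (array drives - 1) x drive size, since RAID 5 spends one drive's worth of space on parity and the hot spare sits idle. A quick sketch, assuming four equal-size drives (the 250GB figure below is made up for illustration, not the actual drive size):

    # Usable capacity of a RAID 5 array with a hot spare (illustrative numbers).
    drive_size_gb = 250               # hypothetical per-drive capacity
    total_drives = 4                  # drives installed
    hot_spares = 1                    # kept idle until a member fails
    array_drives = total_drives - hot_spares

    # RAID 5 gives up one drive's worth of space to parity.
    usable_gb = (array_drives - 1) * drive_size_gb

    print("raw: %d GB, usable: %d GB" % (total_drives * drive_size_gb, usable_gb))
    # With these numbers: raw 1000 GB, usable 500 GB.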

Date: 2003.07.02
Posted by: Josh

After 620 days of uptime, chiba was rebooted for some scheduled maintenance on July 2, 2003. Chiba is now running a newer kernel, and supports IPSec. Chiba has been set up to support our next line of hardware upgrades (a 3ware RAID controller and new, larger IDE disks...), and now has the necessary drivers to support Mark's software repeater project.

This clears the way for chiba's new incarnation, Mecha-Chiba, which will likely be based on OpenWall's Secure Owl Linux distribution.



Date: 2001.04.18
Posted by: Josh

We brought down the system to add another 256MB of memory, now with 30% more silicon. Sweet Jesus!

Date: 2000.08.17
Posted by: Josh

Yesterday, around 10:40, drive 0's SCSI interface stopped responding to the outside world. About 3 minutes after that, drive 1 was disabled due to a faulty RAID header. The machine's drives were at this point dead to the world. We then started playing the 'how does the Linux VM page cache work' game. Programs that were fully loaded in memory continued to work. Any program that had to access the disk to load its binary or data died. Programs that don't check return values from system calls started acting odd.

I determined that the drives and/or RAID controller had failed. I could not access the firmware RAID configuration software. Attempting to do so locked the machine.

I set up a new IDE disk as a replacement. All files were transferred from the old drive to the new drive and verified.
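
For what it's worth, the verification was conceptually along these lines (a minimal sketch, not the actual commands used; the /mnt/old and /mnt/new mount points are hypothetical): walk the old tree and confirm every file exists on the new disk with an identical checksum.

    import hashlib
    import os

    def sha1_of(path):
        # Hash a file in chunks so large files don't blow out memory.
        h = hashlib.sha1()
        f = open(path, 'rb')
        chunk = f.read(1 << 20)
        while chunk:
            h.update(chunk)
            chunk = f.read(1 << 20)
        f.close()
        return h.hexdigest()

    def verify(old_root, new_root):
        # Compare every file under old_root against its copy under new_root.
        bad = []
        for dirpath, _, files in os.walk(old_root):
            for name in files:
                old_path = os.path.join(dirpath, name)
                rel = os.path.relpath(old_path, old_root)
                new_path = os.path.join(new_root, rel)
                if not os.path.exists(new_path) or sha1_of(old_path) != sha1_of(new_path):
                    bad.append(rel)
        return bad

    # Hypothetical mount points for the old (failed) disk and the new IDE disk.
    print(verify('/mnt/old', '/mnt/new'))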

Comments:

The (nearly) simultaneous failure of the drives could indicate that some external event triggered it. It could also be easily explained by a manufacturing problem that affected the entire lot of drives (both drives came from the same manufacturing lot). This effect is commonly seen; I have heard several stories about 24-hour periods that resulted in the loss of 5 of the 7 drives in a RAID system. Lesson: get drives from separate lots.

What was apparently a RAID controller failure in early August could instead have been a transitory hard drive failure.

The RAID card's BIOS and the system BIOS do not appear to work and play well with each other.