Saturday, June 18, 2011

G5 iMac Blue Screen on logout and startup - and a review of G5 debugging

[See the update. I don't think it's safe to use a 10.5.8 machine with AirPort Wifi and Epson HL-2140 printer drivers as a print server.]

Back in the days of Quarterdeck Expanded Memory Manager and DOS Carousel/DesqView/Sidekick, debugging was simple. Our one machine crashed many times an hour, and there were only a few things to relentlessly tweak in the hopes of getting a bit of work done. Hardware problems were extremely rare; my original Panasonic 8086 was vastly overengineered [1].

Despite vastly more security threats, life for a modern Mac household is much improved. We currently run 3 OS X machines (two 10.6, one 10.5) and five iPhones for five users. My life as network admin is mostly focused on porn control (which is a lot like fighting with QEMM really).

Software has come a very long way. Hardware is less robust than it once was, but Apple gear isn't as bad as it could be. Overall we're ahead.

When problems hit, though, they can be really tough. This is the story of one of those problems. As with my "Troubleshooting MacBook wake from sleep problems" post of 12/2010, I think I've got it under control, but I'll probably never know the precise cause. I think that's because, barring obvious hardware failure [2] and malware infection [3], most modern tech problems are multifactorial. Multiple software bugs interact with occasionally flaky hardware to produce bad outcomes; curing the problem often means fixing several contributors and thus reducing, but not eliminating, crashes.

In other words, modern computer reliability has quite a bit in common with murder rates in Manhattan and diseases in humans [4]. Chaotic and multifactorial causes and fixes.

Today's example involves our old G5 iMac (introduced 2004; I got mine around 2005). These were among the last G5 machines Apple made, and like all the G5 line they were plagued by heat problems. And more heat problems. Heavens, but the G5 chip sucked. [5]

This machine has been pretty stable, though noisy, but yesterday it hung on the login screen. I did the usual first OS X troubleshooting step -- I did a power-cycle restart holding the shift key. This forces cache cleanup and a directory structure integrity check. That seemed to work, but soon I was seeing blue screen on logout, particularly with my son's parental control managed account -- and blue screens on login -- just before I'd expect to see the user select screen on this multi-user machine.

There was no obvious software explanation, so in today's world the next step is to test hardware. I booted the (Mac Classic OS!) hardware test disk for the G5 and ran it in loop mode. Surprisingly the machine passed. This would have been a good time to inspect the G5's notoriously short-lived capacitors, but in fact I only did that the next day. The capacitors were fine, and I vacuumed the machine [6].

At that point I could have started rooting around with Verbose login and Console looking for blue screen errors but this is an old machine with a lot of software cruft. Since the hardware looked good I did a 10.5 archive and install [7] and let it run overnight.

This morning it booted up nicely -- right into a foul 10.5.1 bug with Admin account passwords [8]. I was amazed; I never thought I'd see that one again. After the tortuous unix command-line password reset [9], I updated to 10.5.8 and rebooted into a clean Admin account that logged out to a blue screen. About then I realized the problem hadn't gone away and I had to do real troubleshooting.

I'll skip the tedious details. Instead I'll enumerate what I think the contributing factors were, then I'll describe what worked.

My best guess is that it was not malware related, but involved several of these items:

  • An AirPort network problem. I could see the iMac was showing up on the network with different network machine names, such as BigMac, BigMac-2 and so on. It eventually got up to BigMac-4. I suspect a hardware problem with our Time Capsule/AirPort Extreme router. [10]
  • A software issue with OS X 10.5 Location - there were some odd behaviors when I tried changing Location settings. Maybe a corrupted preference file.
  • An Epson printer driver issue interacting with WiFi shared printer access. The drivers on this machine are a few years old and they didn't come from Apple (in 10.6 they do). I found some curious print jobs dangling in the printer queues on different accounts as I cleaned up accounts.
  • Something funny with Parental Controls. Several controlled accounts had password failures, even after the Archive/Install. However, once I reset each account to its original password, it still matched the login keychain password.
  • Spotlight and turning off Spotlight indexing of a network mounted Time Capsule backup drive.
  • Network shares connecting on startup
  • Time Machine backup to a Time Capsule external drive's disk images.
  • Flash - which is the source of all evil on a Mac
  • MacAlley external keyboard driver
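
The dangling print jobs mentioned in the list above can be tallied per account from the command line. This is only a sketch: `lpstat -o` is the standard CUPS command for listing queued jobs, but `jobs_per_user` is a hypothetical helper name, and it assumes each output line starts with the job id followed by the owning user.

```shell
# jobs_per_user is a hypothetical helper, not part of CUPS.
# It reads `lpstat -o` style lines on stdin (job id, owner, size, date)
# and prints one "owner count" line per account, sorted by owner name.
jobs_per_user() {
    awk 'NF >= 2 { count[$2]++ } END { for (u in count) print u, count[u] }' | sort
}
```

Typical use would be `lpstat -o | jobs_per_user`; an account that shows jobs nobody remembers sending is worth a closer look.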

This is what I did to get things working, beyond the archive and install (if I could have I would have switched to wired connections to our router):

  • I removed all Locations except Automatic (I could have also simply deleted the appropriate preference file)
  • I created a new Location and set it to fixed IP address rather than request via DHCP (so no hang if there's an IP request issue)
  • Turned off Time Machine
  • Turned off Spotlight (drag and drop the internal drive icon to the Privacy list)
  • Installed the Click to Flash Safari extension to reduce Flash activity when my son is using the machine (ads mostly)
  • Reinstalled printer drivers but didn't reinstall the MacAlley external kb drivers. It works without them.
  • Created a new Admin account and trashed the one that hit the 10.5.1 bug. Incidentally, the best practice for admin accounts is to use a 6 letter Sentence case string. Then the short name will be a lower case version of the long name and you can use either when typing.
  • Reviewed and refreshed all the parental controls.
  • Removed all login items, esp. network drive mounts, from all accounts.
  • Restored the original Apple kb instead of the MacAlley I was using (eliminates 1 driver, I did this after things were working but it's a good stabilizing move)

It's now working. I wouldn't be surprised if there is a hardware problem in the G5 and the problem returns, but it is clearly much improved. I think any hardware problem is more likely in the Time Capsule.

With Click To Flash, Spotlight disabled, and no Time Capsule backup this machine is quieter than it's ever been. I'll wait a while before I reenable backup [11] and I think Spotlight is a goner (love it on 10.6 though).

If I had to put my money on a single fix, I would wonder about the fixed IP and perhaps redoing Location settings (removing would have been better).

Update: I can't say it's cured, but it's clearly far better. No more blue screen today. The only device driver I've installed is for the printer; there's no native support in 10.5. It's also astoundingly quiet. It hasn't been this quiet in memory; the fans barely run. Coincidentally I just now came across mention of Tiger (10.4) kernel panics with newer AirPort base stations. I wonder if there's a relationship ...

Update 6/18: It came back. I was stunned. I can't imagine hardware errors behaving this way. My son had used the computer, but we monitor him closely. All he'd done was print ...

and print ...

and ....

He likes to print. We hardly ever print. Sometimes he was printing from a 10.6 laptop.

The only driver I'd installed yesterday was the Epson 10.5 printer drivers. They're older of course, and the OS has been patched since they were current.

I tried several things, but this, dramatically, worked when multiple other things (PRAM reset, SMU reset, hold shift to start in safe mode, etc) didn't. I swear this was an occult geek ninja move.

  1. Startup in single user mode (cmd-S on startup)
  2. Run the command that allows me to modify the file system (you see it on startup prompt)
  3. Navigate to where OS X stores the CUPS printer spool cache (I knew the driver was CUPS): /var/spool/cups/cache (to use the unix cd command to navigate here while in single user mode, you have to do sudo -s first to get a root shell).
  4. Delete all the files - rm *
  5. exit
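
The steps above can be sketched as a small shell helper. Treat it as an illustration, not a tested Leopard recipe: it assumes the cache lives at /var/spool/cups/cache (forward slashes, no tilde), that the command shown at the single-user prompt is `/sbin/mount -uw /`, and `clear_cups_cache` is a hypothetical name, not an Apple or CUPS tool.

```shell
# Steps 1-2 happen at the single-user prompt, before this helper is useful:
#   /sbin/fsck -fy      # optional disk check suggested by the startup prompt
#   /sbin/mount -uw /   # remount the root file system read-write

# clear_cups_cache is a hypothetical helper name.
# It deletes the regular files directly inside the given directory and
# leaves subdirectories alone, roughly what `rm *` does from inside it
# (unlike `rm *`, it also removes hidden dot-files).
clear_cups_cache() {
    dir="${1:-/var/spool/cups/cache}"
    if [ ! -d "$dir" ]; then
        echo "no such directory: $dir" >&2
        return 1
    fi
    # -maxdepth 1 keeps the deletion inside this one directory.
    find "$dir" -maxdepth 1 -type f -exec rm -f {} \;
}
```

Calling `clear_cups_cache` with no argument targets the assumed default path; passing a directory first (for example a copy of the cache) is a safer way to try it out.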

The machine started up normally. I then turned off print sharing and deleted the Epson printer driver.

Wow.

Update 6/27: I reinstalled print drivers but made a 10.6 machine my print server. No further problems. I posted on this to Apple Discussions: "Leopard login blue screen fixed by..." (Apple Support Communities).

- fn -

[1] After years of use, during a marathon debugging session managing DOS interrupt conflicts, I pulled a board with the machine running. It shorted out only half the bus, the machine continued to work for years. Panasonic must have lost money on each of these monsters. They left the market when the US Congress, rightly fearing the disemboweling of the US computer industry by superior competitors, used plausible threats of protectionist intervention to save the US hardware and software industry from Japanese competitors. Nobody remembers this.
[2] Hardware failure can be very hard to spot. I had a series of issues with my 27" iMac that were vexing, but not bad enough to really investigate. Finally my Kensington USB hub got flaky enough to be replaced. With the new hub the seemingly unrelated system issues vanished. Wireless routers are famous for failing in flaky and hard to diagnose ways.
[3] Not yet a real issue for our Macs, but it's clearly coming. I suspect that's why Apple is delaying their hot new MacBook Air so it ships with Lion. They want to move us to signed code platforms within the year, so they need to kill 10.6.
[4] How many physicians think humans get only one viral infection at a time? I've believed for years that a lot of "viral syndromes" are the result of multiple simultaneous viral infections. I bet that meme makes it to the med schools in my lifetime.
[5] Remember the liquid cooled G5 towers? Apple was desperate. Miracle they survived the Intel transition. That's when we learned how good Apple was.
[6] I used to blow dust out with compressed air. That was fun! Then I killed a Mac Mini optical drive in a dust cloud. Now I carefully vacuum using a mini-shop vac. The machine looks archaic now -- huge capacitors, everything enormous. Especially compared to working on the Mini, much less an old iPhone.
[7] No longer available in 10.6, which I think does the equivalent on every install.
[8] In the old days you could trade up from the equivalent of 10.5.1 to 10.5.6 at Apple stores. No more. Disappointing.
[9] Yes, password reset is easy to do on any account. You can also just mount drives in firewire target disk mode. Passwords exist for other reasons, not to secure the drive. For physical security you need encrypted images.
[10] The generation I have are notorious for power supply failures with extended flakiness prior to total collapse.
[11] There's very little data stored on this machine. It may not be worth backing up.

See also:

A series of mostly 2004 and 2005 posts about the heat and fan problems that plagued iMac G5 buyers, though the rev 2 machine I bought had fewer of the capacitor and firewire and fan issues. These posts predate Twitter and Google Reader Share, back then my posts were more tweetish.
