Tuesday, June 05, 2007

A stalled restore from screen saver in 10.3.9

Debugging an ill-behaved XP or OS X machine can strain anyone. My latest OS X adventure is illustrative.

The ancient G3 iBook running 10.3.9 kept turning up unresponsive. If I touched the trackpad a cursor would move about, but the screen remained uniformly gray. I couldn't force quit or log out, so I had to power cycle the machine. This went on for several days. The only recent change was installing a pre-release version of Camino 1.5 (excellent browser, btw).

I spent a day or two trying various experiments before I came upon a fix. At first I treated this as a "wake from sleep" problem; OS X 10.3.9 had quite a few of these. I even reset the PMU:

"The Power Manager is an integrated circuit (computer chip) that is usually on the logic board of the PowerBook and iBook. As the name implies, it is responsible for power management of the computer. It controls backlighting, hard disk spin down, sleep and wake, some charging aspects, trackpad control, and some input/output as it relates to the computer sleeping.

"Over time, the settings in the Power Manager may become unusable, which can result in operational anomalies with the computer. Examples include not turning on, not waking from sleep, not charging the battery, or not seeing the AC Adapter, among others."

But that was a waste. The machine was not asleep. It had the look it has when it's waiting to start the screen saver -- a JPG slideshow that takes minutes to start on this old, slow machine. I wondered then about a corrupted JPG messing up the screen saver, so I refreshed my images and tested my ability to copy them. I switched to a different screen saver temporarily, but the problem returned.

Next I wondered about a network issue. 10.3.9 has a lot of issues with losing connectivity (esp. SMB), so I made sure none of my 4 users (mom, dad, child, admin) had any direct or indirect automatic network connections. Along the way I solved an unrelated Airport bug: in 10.3.9, if you set a client to automatically log in to a network and then renamed the network, the client would simply not log in (rather than default to the last used network). I also figured out a very annoying behavior involving security updates, keychains, multiple users and WLANs. If one user makes a WLAN (WEP 2) connection, every user inherits that connection. If they inherit it, however, they never do the OS X keychain update magic that happens the first time you make a connection after a security update.

Eventually I fixed all the network issues and all my users stored the WEP password correctly in their keychains, but the problem persisted.

The clue was noticing that sometimes the machine was responsive. When it did respond I'd find one of two things: either the last user had been logged out, or Camino 1.5 pre-release had canceled the logout because I hadn't responded to a dialog. I dug down into the user prefs and found two relevant settings and one probably irrelevant one:
  • users were being logged out after 6 minutes of inactivity (security)
  • the screen was going blank in about 30 minutes (power)
  • (probably irrelevant) the screen saver was set to run around the inactivity time, but it took a very long time to start up because the image folder had thousands of JPGs and the old iBook is very slow ...
This is what I think was happening:
  • the machine tried to log a user out
  • Camino blocked the logout with a dialog
  • the power setting tried to blank the screen
  • around the same time the screen saver kicked in and blocked interactions ...
I think some combination of the above set up the locked state.

I made these changes:
  • turned off auto-logout
  • set dim screen to 3 hours (because I wanted to display the family slideshow)
That seems to have done the trick. There's no more auto-logout, so Camino doesn't block anything.
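
The suspected interaction can be sketched as a toy timeline. This is only a model of the hypothesis, not anything OS X actually exposes: the function names are made up for illustration, the screen saver delay of 6 minutes is an assumption (the post only says it ran "around the inactivity time"), and the Camino dialog is reduced to a single flag.

```python
def idle_timeline(logout_min, saver_min, blank_min, app_blocks_logout):
    """Return idle-time events as (minute, description), ordered by minute.

    Hypothetical model: logout_min is None when auto-logout is turned off.
    """
    events = []
    if logout_min is not None:
        # An app with an unanswered dialog cancels the logout attempt.
        outcome = "blocked by a dialog" if app_blocks_logout else "completes"
        events.append((logout_min, "auto-logout " + outcome))
    events.append((saver_min, "screen saver starts"))
    events.append((blank_min, "display blanks"))
    return sorted(events, key=lambda event: event[0])


def logout_left_pending(timeline):
    """True if a logout attempt was blocked and left waiting on a dialog."""
    return any("blocked" in what for _, what in timeline)


# Before: auto-logout at 6 min, saver assumed at 6 min, display blank at 30 min.
before = idle_timeline(6, 6, 30, app_blocks_logout=True)
# After: auto-logout off, display blank pushed out to 3 hours (180 min).
after = idle_timeline(None, 6, 180, app_blocks_logout=True)

for label, timeline in (("before", before), ("after", after)):
    print(label, "- logout left pending:", logout_left_pending(timeline))
    for minute, what in timeline:
        print(f"  {minute:>3} min: {what}")
```

In the "before" settings a blocked logout sits unanswered while the screen saver and display blanking pile on top of it; in the "after" settings there is no logout attempt for anything to block. The model says nothing about which of the overlapping events actually caused the gray screen.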

Now that was a hard one to figure out!

I think these machines have too much of the emergent behavior of evolved systems without the built-in homeostatic mechanisms ...

Update 6/5/07: No more occurrences over the past week, so this problem has been fixed.
