Saturday, February 27, 2010

My 10.6 iMac is crashing - a debugging exercise

I hate when this happens. It's been a while fortunately.

My now flicker-free fairly new 27" quad core iMac is crashing.

There are two common failure modes. One is that it locks up when displaying photos in the screensaver. The clue is that the same images display repeatedly. The second mode is that iTunes becomes unresponsive, and then the Finder as well. I get the SBOD/SPOD/spinning beachball of death.

If I kill both I can restart iTunes, but not the Finder. I've tried several fixes. First I ran Safe Boot (shift restart). Then I installed the latest version of Onyx and ran the usual clean, check, maintenance, etc. Everything passed but the problems have persisted.

So I've begun uninstalling anything invasive. First I removed CrashPlan, and today I uninstalled 1Password (a quite involved uninstall, see OS X defect: The missing uninstaller).

I've been suspicious of 1Password for a while. I'm also monitoring Console.app, which is full of "missing bundle identifier" Office 2008 related messages. I've found mention of this problem in association with kernel panics last November, with more recent responses. For example:
... 2/27/10 8:10:35 PM [0x0-0x28028].com.vmware.fusionDaemon[296] 2010-02-27 20:10:35.581 pkgutil[299:60f] PackageKit: *** Missing bundle identifier: /Library/Receipts/Office2008_en_proofing_brazilian.pkg...
So why the heck does VMware Fusion put this message out? (com.apple.backupd also complains, but that makes sense.) VMware Fusion is doing quite a bit on startup, even though it's "not running".
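
To see which receipts are actually behind the noise, the Console output can be saved to a text file and filtered. Here is a minimal sketch, assuming each message has the same shape as the line quoted above; the function name is made up for illustration, and a path containing spaces would not be matched.

```python
import re

# Matches the path that follows "Missing bundle identifier: " up to ".pkg".
# Assumption: paths contain no whitespace, as in the quoted Console line.
MISSING_ID = re.compile(r"Missing bundle identifier: (\S+?\.pkg)")


def packages_missing_identifier(log_lines):
    """Return the distinct .pkg paths named in 'Missing bundle identifier'
    messages, in the order they first appear."""
    seen = []
    for line in log_lines:
        match = MISSING_ID.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen
```

Feeding it the lines of a saved log gives a short list of receipts to inspect or move, rather than scrolling through repeated messages.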

Tonight I'll run the Apple hardware test that came with my machine in loop mode. (I think you need to attach a physical kb to get this to work -- holding 'D' on my bluetooth kb didn't seem to do anything.) If the problem persists, VMware will be the next to go. I want it to run my XP VM, but if need be I can move that VM to a MacBook running 10.5 and Fusion 2. If I figure this one out, I'll update this post.

PS. Bundle Identifier via Apple Dev:
The bundle identifier string identifies your application to the system. This string must be a uniform type identifier (UTI) that contains only alphanumeric (A-Z,a-z,0-9), hyphen (-), and period (.) characters. The string should also be in reverse-DNS format. For example, if your company’s domain is Ajax.com and you create an application named Hello, you could assign the string com.Ajax.Hello as your application’s bundle identifier.
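
The character rule in that quote is easy to check mechanically. This is a rough sketch, not Apple's validation: the function name is hypothetical, and the "at least two non-empty dot-separated parts" test is only my stand-in for "reverse-DNS format".

```python
import re

# Characters allowed per the Apple text quoted above:
# A-Z, a-z, 0-9, hyphen and period.
_ALLOWED = re.compile(r"[A-Za-z0-9.-]+")


def looks_like_bundle_identifier(candidate):
    """Return True if candidate uses only the allowed characters and has
    at least two non-empty dot-separated parts (an assumed approximation
    of reverse-DNS shape)."""
    if not _ALLOWED.fullmatch(candidate):
        return False
    parts = candidate.split(".")
    return len(parts) >= 2 and all(parts)
```

With this check, `com.Ajax.Hello` passes, while an empty string or a name with spaces does not.
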
Update:
  • I moved the suspicious Office 2008 .pkg files to another drive. I'll see if I can reduce the Console messages.
  • I realized that when I drag things to the trash they're being deleted immediately, rather than put in the trash. I found an old article on this that's being cited by 10.6 users recently. Naturally it's a permissions problem. I wonder if it's related to the odd way my account was created when I used migration assistant (long story). I'll try some of those fixes.
Update b: Getting somewhere.

Ok, this is interesting. As described by the X Lab document, I was getting the "will be deleted immediately" trash message because I had lost read/write privileges to my home directory. On inspection my User directory folders all refer to a user "_unknown". I think this was a side-effect of how I set up my account, which was:
  1. I set up an account called "jfaughnan" on my new machine.
  2. After a while I deleted it and migrated the "jfaughnan" account from my old machine.
  3. The new account was given the home folder name "jfaughnan 1". On inspection I found that deleting the user "jfaughnan" failed to remove the original home directory.
I'll keep working on this one. I also wonder about VMware messing up my privileges and about migration assistant inheriting known permissions problems from my old machine.
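
A quick way to see how far an ownership problem like this spreads is to walk the home folder and list everything not owned by the expected account. The sketch below is one way to do that with the Python standard library; the function name is hypothetical, and it only reports, it changes nothing.

```python
import os


def paths_not_owned_by(root, expected_uid):
    """Return (path, owner_uid) pairs for root and everything beneath it
    whose owner differs from expected_uid."""
    candidates = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        candidates.extend(
            os.path.join(dirpath, name) for name in dirnames + filenames
        )

    mismatches = []
    for path in candidates:
        try:
            # lstat so a symlink is reported by its own owner, not its target's
            uid = os.lstat(path).st_uid
        except OSError:
            continue  # entry vanished or cannot be examined; skip it
        if uid != expected_uid:
            mismatches.append((path, uid))
    return mismatches
```

Run as the affected user, something like `paths_not_owned_by(os.path.expanduser("~"), os.getuid())` would list the folders that still belong to some other (or unknown) user id.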


Update 2/28/2010: Went through 18 loops without a problem, so hardware seems good. I'll keep hitting on the software issues - esp. VMWare.


Update 3/2/2010: I thought I was getting somewhere, but today it was unresponsive with a faint glowing blue screen. I tried putting it to sleep by pressing the power button, but nothing happened. Then, on a hunch, I turned off an attached firewire 400 drive. The system responded, then went to sleep. It woke up normally.

On resumption there was an iTunes complaint about loss of Apple Store connectivity. Meanwhile my console showed many instances of this message:
3/2/10 3:45:14 PM com.apple.launchd.peruser.502[155] (com.apple.AddressBook.abd[13839]) Exited with exit code: 1
Now I'm suspicious that 10.6.2 has a problem with sleeping firewire 400 drives. Spanning Sync and MobileMe get frownies too because of the Address Book message, and Time Machine because my last TM backup concluded just before the lockup.
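
When a message like this repeats many times, a tally per job makes it easier to tell one noisy process from a general failure. A small sketch, assuming the launchd lines look like the one quoted above (job label, pid in brackets, then "Exited with exit code: N"); the function name is made up.

```python
import re
from collections import Counter

# Matches "(job.label[pid]) Exited with exit code: N" as in the quoted line.
EXITED = re.compile(
    r"\((?P<job>[^\[\)]+)\[\d+\]\) Exited with exit code: (?P<code>\d+)"
)


def count_job_exits(log_lines):
    """Count launchd exit messages, keyed by (job label, exit code)."""
    counts = Counter()
    for line in log_lines:
        match = EXITED.search(line)
        if match:
            counts[(match.group("job"), int(match.group("code")))] += 1
    return counts
```

`count_job_exits(lines).most_common()` then lists the worst offenders first.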


Update 3/11/10: It's still locking up, sometimes on waking from sleep and sometimes on switching users. It might be less frequent - maybe once a week. I'm suspicious of VMware 3.02 on 10.6.2. I don't see any useful Console messages. I will try restarting the machine every few days as a prophylactic measure. Maybe 10.6.3 will help; it should be out soon.


Update 3/12/10: Got the faint blue screen, and again turning off the external drive did the trick. The screensaver slideshow uses that Firewire 400 drive. I ran Disk Utility and cleaned up permissions on the drive. It's set to ignore ownership. Looks like yet another OS/firewire problem.


Update 3/13/10: Found iTunes was not responding. Time Machine backup not working. When I tried to add a share I got the spinning beachball (noted, however, a share reference to a user account that was deleted - bug there). Console said the slide show found a corrupted jpeg. Unable to shut down -- until I turned off the external firewire drive. I think 10.6 and new iMacs have serious problems with Firewire 400 enclosures. Errors seem to cause the OS to blow up - perhaps some kind of memory overflow issue.


Update 3/13/10b: Huge discussion thread on firewire issues in 10.6. I think I found my problem.


Update 3/13/10c: Apple just truncated the massive firewire complaint thread. Really, it didn't have anything to do with me. I guess they took care of that problem. I've switched to USB for now, I'll try firewire again with 10.6.3.


Update 3/29/2010: Same crash - locked up screen saver - but this time I was using a USB drive. Turning the drive off then on again cleared the problem. So this isn't a firewire problem after all! It's a screensaver and external drive bug. The console is showing "corrupt JPEG data", but that might just be from power cycling the drive.


Update 5/12/2010: I think this was fixed by 10.6.3. I haven't seen it since. I only have a USB drive attached however, so I can't rule out a Firewire bug. The "corrupt JPEG" console message was a red herring caused by loss of the drive when trying to display in the screensaver.

Update 10/27/2010: Things have been pretty stable since 10.6.3.
