Monday, October 21, 2013

Mac OS X Kernel Panic -- some debugging tips

Four months ago my 2009 iMac was, if not rock solid, at least compacted dirt solid.

Then my 1TB internal Seagate drive (a replacement of a recalled drive) failed its SMART test. Since then it's been one odd thing after another; I haven't had a long enough trouble-free period to say anything sensible about the 1TB Samsung SSD I had FirstTech install in place of the failing HDD.

Today it's been a bit worse than odd -- I'm getting kernel panics. The multilingual white screen of death, to be precise.

Usually this means a hardware error, and since I recently had a 1 TB SSD installed that's an obvious suspect. On the other hand, I've seen some odd firewire behaviors over the past few months. Maybe something in my 3 device firewire chain is malfunctioning (shades of SCSI chains of old, but I've also seen failing USB hubs cause weird crashes).

I saved one of my kernel panic logs and went into basic diagnostic mode, removing all peripherals. Apple Hardware Test's extended test seemed to run, but it got stuck when my hard drive fan "Failed" (presumably, with my hard drive gone, this result is misleading?).

Next I looked at Activity Monitor for anything running I could uninstall. So far I've removed:
  • Tech Tools Pro
  • Citrix Receiver
  • Carbon Copy Cloner
Of these I'm most suspicious of Citrix Receiver, then Tech Tools (especially with the SSD installed), and least of all CCC.
Then I went looking in LaunchDaemons and I found some odd ones:
  • com.google.keystone.daemon.plist: Google's updater I think
  • com.barebones.textwrangler.plist: probably benign
  • com.mac.adg.SquidMan.plist: Ok, this was amazing. Looks like I installed it in 2001. It had a running process and a file in privileged helper tools
  • com.bombich.ccc.plist
I cleared those out. Next I researched how to interpret kernel panic logs (/Library/Logs/DiagnosticReports), these articles seem to cover the domain ...
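Spotting the odd ones is easier with a quick filter. A minimal sketch, using the standard LaunchDaemons/LaunchAgents locations -- everything Apple ships begins with "com.apple.", so whatever is left deserves a Google search:

```shell
# List third-party launch daemons and agents. Apple's own all begin
# with "com.apple.", so anything else is worth researching.
for d in /Library/LaunchDaemons /Library/LaunchAgents; do
  ls "$d" 2>/dev/null | grep -v '^com\.apple\.'
done || true   # keep a clean exit when a directory is empty or absent
```

To disable a suspect, `sudo launchctl unload` its plist and then move the plist aside (to the Desktop, say) rather than deleting it, so it's easy to restore if it turns out to be innocent.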
Alas, the easiest thing to interpret is the kext list, and mine looked benign even before I removed the above items.
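Pulling the kext list out of a report can be scripted. A sketch, assuming the report ends with a "loaded kexts:" section as panic reports of this era do:

```shell
# Print the non-Apple entries from the newest panic report's kext
# list. Assumes a trailing "loaded kexts:" section in the report.
PANIC_LOG=$(ls -t /Library/Logs/DiagnosticReports/*.panic 2>/dev/null | head -1)
if [ -n "$PANIC_LOG" ]; then
  sed -n '/loaded kexts:/,$p' "$PANIC_LOG" | tail -n +2 | grep -v 'com\.apple\.' || true
fi
```

An empty result means the same thing my manual read did: every loaded kext was Apple's, so the list is benign.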

Since my memory is old, and since it passed extended memory tests using Apple Hardware and Tech Tools, I'm most suspicious of a hardware failure in my Firewire chain, secondarily a failing USB hub. I don't know any way to debug those other than slowly adding things in until I get a panic, then removing them and seeing if the panic resolves.
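One thing that helps with the add-one-device-at-a-time approach is keeping an eye on the system log as you go; IO errors from a flaky cable or hub often show up there before a full panic does. A sketch, using the pre-Yosemite system log location:

```shell
# Show recent system log lines filtered to bus chatter; change
# "tail -n 200" to "tail -f" to watch live while re-adding devices.
tail -n 200 /var/log/system.log 2>/dev/null \
  | grep -iE 'firewire|usb|i/o error' || true
```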

Update: My firewire 400 to 800 cable is ripped apart near its insertion point. Subtle, but I may have a suspect.

Update 10/21/2013: Good app.net thread on this. My overall recommendation for a kernel panic problem is now this sequence:
  1. Look at the kernel panic crash log's list of loaded kexts. Do any of them look odd or unfamiliar? Google any that are not Apple native. [@gaelicwizard says this is low yield in modern versions of OS X]
  2. From admin account look at Activity Monitor and scan processes for anything odd.  [@gaelicwizard says this is very low yield in modern versions of OS X -- OS is pretty robust to this, see also @clarkgoble's tips below]
  3. Remove all peripherals. Run Apple Hardware Test (used to come on DVD -- seems to run MacOS Classic!) to check memory, or Apple Diagnostics for machines made after June 2013 (the issue list documents what it can test). If you don't have either, use a 3rd party app. Apple has its own "Apple Service Diagnostics", which is not legally available to us.
  4. Inspect all peripheral cables for defects (esp. firewire).
  5. Add USB devices. I don't know of any way to check for faults other than running and observing. Try boosting traffic (copy to drive) to stress.
  6. Add Firewire devices, one cable/device at a time.
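The kext audit in step 1 can also be run on the live system rather than waiting for a crash log. kextstat lists every loaded kext, and since Apple's all live under com.apple, filtering those out leaves a short list to explain:

```shell
# List loaded third-party kexts on the running system; Apple's own
# are all under com.apple, so the filtered list should be short.
# (The kextstat column header line survives the filter -- ignore it.)
kextstat 2>/dev/null | grep -v com.apple || true
```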
Some panics need to be diagnosed over a remote network connection, and a networking panic may need to be diagnosed remotely over a FireWire connection.

Note that if the fault is outside the Mac itself, a trip to the Apple Store is unlikely to help. We can hardly bring all of our peripherals to the store.

There's a way to read a crash log's kext list and identify the kexts associated with FireWire, USB, or Bluetooth, and thus know where to look. From @gaelicwizard:
  • The backtrace of the crash tells you where the actual panic happened. It is often, but certainly not always, within or close to where the error happened (at least as far as hardware faults go, IME). It should say a bunch of IOBluetoothHIDDrive ... AHCI is SATA. OHCI is FireWire (and some old USB). UHCI is USB. Often, the extension will just say FireWire or USB or whatever, but sometimes you see those acronyms too....
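That acronym mapping can be turned into a one-line triage grep over a panic report. A sketch -- the report path is a placeholder, substitute your own:

```shell
# Triage a panic report by bus, using @gaelicwizard's acronym map:
# AHCI = SATA, OHCI = FireWire (and some old USB), UHCI = USB.
grep -iE 'AHCI|OHCI|UHCI|FireWire|USB|Bluetooth' \
  /Library/Logs/DiagnosticReports/*.panic 2>/dev/null || true
```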
@clarkgoble summarized his approach as well (See also: Clark's Regular Maintenance Script) including general maintenance ...
  • I've regularly audited kexts and then launchd agents. With kexts it was relatively easy since 32 bit ones were disabled on newer machines. That meant it was harder for old ones to persist. With launchd I try and know what's there, do a Google search if I don't recognize it, and frequently disable things. (I had problems with a Google updater a few years back for example) 
  • The other debugging I do is to check Secrets (a pref pane for standard defaults modifications) and see if I set something that is a no-no with the new UI. Then I check pref panes, menulets, and running programs... I also run my maintenance script since the caches often cause some problems in my experience. Especially with older versions of Safari...
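The cache part of that maintenance routine can be previewed before deleting anything. A sketch that just ranks the user cache folders by size, so you can decide what to clear by hand:

```shell
# Rank user cache folders by size (KB, largest first) before
# clearing anything; big stale caches are the usual suspects.
du -sk ~/Library/Caches/* 2>/dev/null | sort -rn | head -10
```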
