Thursday, July 14, 2005

OS X 10.4 hotfile clustering bug lockup - using hfsdebug

I've a hunch this bug may be fixed in drives formatted with 10.4.2, but I wonder if there's any other real fix.
Increasing Free Disk Space Improves Performance

Robert Mohns

Jason Mark writes: "Crashes appeared to happen at random [...] One clue I did find was within minutes of restarting a process called "update" would take up to 97% of my processor, and the hard drive would spin like crazy. [...] I did some housecleaning, and now have 11gigs of free drive space, and Tiger is just humming along. Crashes went away. Speed is great. I've been up for 6 days now (for a while my record was 3 hours)."

I strongly suspect that free disk space is a red herring. What Mr. Mark is seeing appears to be the same issue I and others have experienced: the update daemon fails to relocate a small file as part of hotfile clustering and retries intensively, which spikes both update and syslog, drives disk I/O to its peak, and eventually leaves the system unresponsive. The only way out is to reboot by key sequence or reset button. For more, see my notes in the May 16th MacInTouch Tiger Report.

Low disk space may exacerbate this issue, but it is wildly unlikely to be the cause. Something is causing update to fail (perhaps bad sectors on the disk?), and it handles that failure badly, eventually taking down the system. You can use the (non-Apple) hfsdebug utility and your system log to find the offending file, then delete it manually. I have found that the same file triggers the failure time and time again; removing it keeps the problem at bay ... until some other file causes update to lose its mind. The only solution that really works may be to erase the drive and reinstall...
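As a rough aid for the system log half of that hunt, here is a minimal Python sketch that counts which messages repeat most often in a syslog-style file. It assumes that the retry loop floods the log with near-identical lines and that the log uses the classic "Mon DD HH:MM:SS host process[pid]: message" layout. The function names and the idea of ranking by repetition are my own illustration; this is not anything Apple or hfsdebug provides, and I am not asserting what the actual error text looks like.

```python
import re
from collections import Counter

# Assumed layout of a classic BSD syslog line:
#   "Jul 14 10:22:01 host process[pid]: message"
SYSLOG_LINE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+\S+\s+"
    r"(?P<proc>[^\s\[:]+)(?:\[\d+\])?:\s*(?P<msg>.*)$"
)


def most_repeated_messages(lines, top=5):
    """Return the `top` most frequent (process, message) pairs with counts."""
    counts = Counter()
    for line in lines:
        match = SYSLOG_LINE.match(line.rstrip("\n"))
        if match:  # silently skip lines that do not fit the assumed layout
            counts[(match.group("proc"), match.group("msg"))] += 1
    return counts.most_common(top)


def scan_log(path, top=5):
    """Print the most repeated messages in the log file at `path`."""
    with open(path, errors="replace") as log:
        for (proc, msg), n in most_repeated_messages(log, top):
            print(f"{n:6d}  {proc}: {msg}")
```

Calling scan_log("/var/log/system.log") (the path is an assumption about where your system log lives) prints the heaviest repeaters first. If one message dominates and names a file or file ID, that is the candidate to look up with hfsdebug.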

In "housecleaning" his drive, Mr. Mark may have accidentally deleted the offending file, or he may have cleared space needed for hotfile clustering .... or something else entirely, since so far, nobody really knows what's going on. Apple may, but they aren't talking. He may not encounter the problem again ... or, like me, he may have a month or so of grace, then start experiencing the same old problems.

Unfortunately, this does not appear to be a simple file system damage problem; DiskWarrior and Apple's own Disk Repair don't seem to have any effect on it. I haven't yet experienced it under the just-released 10.4.2; when I do, I intend to run fsck -y in single-user mode to see if it makes a difference. I expect to run into it eventually, as I've deliberately left update's current "problem child" file intact.

