Sunday, November 23, 2014

Time Capsule fails again -- and Mavericks has the old OS X sparsebundle delete bug

I’d say that I’m paranoid about backups, but, honestly, I’m just realistic. Our entire computing architecture is designed to destroy data — because it’s designed for performant writes and reads. We aren’t writing on stone tablets [1], we’re writing on sand.

Realism means that for my most important machines I always have two backup systems with as much independence as I can manage. At work I use Retrospect/Win (historic reasons) and HP/Autonomy’s “Connected Backup”. At home I use Carbon Copy Cloner 3.x [2] and Apple’s infamous Time Capsule [3].

But maybe I’m not realistic enough. As of this moment three of my four backup systems have failed.

At work Retrospect/Win stopped launching — probably the victim of a corporate system/antivirus update/bug. “Connected Backup”, when I did a routine restore test, turned out to have made a mess of my backups — old data, missing data, etc.

At home meanwhile Time Capsule failed a routine restore test [4]. I could navigate my backups, but from my primary user account I couldn’t interact with the Time Machine UI. My clicks were ignored. I did some experimentation with different accounts and, as best I can tell, Time Machine wasn’t completely dead. It was just responding extreeeeeeeeeeemlllyyyyyyyy sloowwwwwllyyyyy - and missing my mouse clicks (maybe in the timeline it inhabited my clicks were too fast to be perceived; I used gigabit Ethernet of course).

I wasted a few hours trying to delete my main machine’s 1.4TB sparse bundle disk image; I tried all the tricks in the book to work around OS X’s inane file deletion limitations [5]. Nothing worked - perhaps because the drive had deeper hardware or permissions or file system problems (Time Capsule is an opaque system, probably running NetBSD.)

I gave up on the fix and wiped the Time Capsule drive, then created new backups from each computer one at a time (via ethernet for the initial backup).

I’m going to take a look at Synology NAS options. In the meantime I’ll up my Time Capsule ‘single file restore’ checks; I wonder if I’ll have to wipe the Time Capsule every 6 months or so to get more reliable results.

- fn -

[1] At one time in the ancient world writing was done on clay tablets - designed for rapid erasure. It’s a lousy way to preserve data, except that in Nineveh a great fire baked the tablets.

[2] Mostly a clone, but it moves updated files into archival storage. It isn’t a great way to recover data that was corrupted months ago, but since I rotate 4 hard drives to 3 locations (work, home, my van) I do have older versions of my data.

[3] Infamous for the world’s least practical user interface (LSD-fueled trip through a star field), awful performance, and cryptic failure modes.

[4] Every month I try restoring a small file; I’ve a Toodledo task for this that I just keep bumping ahead.

[5] Finder and most terminal deletes need a reference to EVERY file to be deleted, and getconf ARG_MAX shows Mavericks has the same limit as Mountain Lion: 262,144 bytes of arguments. Since a sparse bundle is a collection of 8MB files (“bands”) big bundles hit this limit.
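
To put rough numbers on footnote [5], here is a minimal Python sketch. It is not something I ran against the Time Capsule, and it makes assumptions: that band files sit in a `bands` folder inside the bundle, that an 8MB band means 8 x 1024 x 1024 bytes, and that the image is fully allocated. The function names are made up for illustration. The idea is to walk the folder and unlink band files one by one, so no single command ever has to carry the whole file list.

```python
import os
import shutil

# Assumption: "8MB" bands are 8 * 1024 * 1024 bytes each.
BAND_SIZE_BYTES = 8 * 1024 ** 2


def estimated_band_count(image_bytes, band_bytes=BAND_SIZE_BYTES):
    """Upper bound on band files for a fully allocated image (ceiling division)."""
    return -(-image_bytes // band_bytes)


def delete_bands_incrementally(bundle_path):
    """Unlink band files one at a time, then remove whatever is left of the bundle.

    Assumes a layout of <bundle>/bands/<band files>. Returns the number of
    band files unlinked in the first pass.
    """
    bands_dir = os.path.join(bundle_path, "bands")
    removed = 0
    if os.path.isdir(bands_dir):
        # scandir yields entries lazily, so no giant argument list is built.
        with os.scandir(bands_dir) as entries:
            for entry in entries:
                if entry.is_file(follow_symlinks=False):
                    os.unlink(entry.path)
                    removed += 1
    # Clean up anything the first pass skipped, plus the bundle folder itself.
    shutil.rmtree(bundle_path)
    return removed


if __name__ == "__main__":
    # A 1.4TB image works out to well over 160,000 bands.
    print(estimated_band_count(1_400_000_000_000))
```

If the underlying drive or file system is damaged, as I suspected mine was, this would presumably fail the same way every other delete did.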

Update 11/24/2014 - Twist

After wiping my Time Capsule drive and recreating my backup, I tried a restore.

It worked well with 2 of the 3 accounts I tried — but, again, not on my home account. The Time Machine window didn’t respond to mouse clicks. As shown in the attached screen photo there’s a pencil icon in bottom left with a slash through it (can’t write).

[Photo: IMG 5214]

So I tried repairing permissions. It’s never worked before, and it didn’t work this time.

Next I verified my (SSD) drive — that hung about midway through. No error, it just didn’t complete.

So I booted from the recovery partition and ran Repair Disk — which did complete. Without any kind of error report.

Then I removed all peripherals and tried Time Machine. This time if I switched to a recent backup and waited a couple of minutes it DID respond and I COULD restore a test file for my problem user.

It seems that there’s something about that user’s account (by far largest and most complex) that causes Time Machine to take a VERY long time to be ready to interact with the user.

Update 11/25/2014 - I think it might have been the SSD.

At the same time I was wrestling with this problem I was also preparing to apply Samsung’s fix for my increasingly slow Samsung 840 EVO SSD. As noted above I was seeing some odd behavior with Disk Utility: no error was reported, but Disk Utility seemed to hang waiting for a directory structure operation to complete. After I repaired the disk, Time Machine seemed to work.

After I had a working backup I ran the Samsung repair. Following that my startup times went from 20 seconds (faster than hard drive) to 2-3 seconds (SSD speed). Time Machine was also quite a bit faster.

My guess is that read failures on the directory structure for my problematic user account were causing Time Machine to hang. If so the problem wasn’t with the Time Machine backup, it was with the drive I wanted to restore to…
