Monday, January 10, 2005

Desktop search: why it has to be much smarter

Faughnan's Tech: Copernic/AOL: current leader in the sponsored (freebie) Windows desktop search race

I keep this blog primarily for my own uses as a place to keep notes on topics. Occasionally I do make editorial comments, but I was surprised recently to learn that at least two people read my posts on Copernic. I even replied to a comment explaining why I haven't yet bothered with MSN search.

So although I don't expect much readership, I'll expound a bit here on my thoughts on desktop search. If nothing else I'll link back to this in future.

I think desktop search has to be smarter than most people think -- at least for the 0.01% of the world that resembles me. (Caveat: I'm so far off the spectrum of users that no product manager with experience would use me as a representative user. On the other hand, I may resemble a "department" of typical users.)

I have thousands, maybe tens of thousands, of documents distributed across local and networked drives and, now, Blogger repositories. They go back about 15 years. I have maybe 10,000 images and they're growing fast. I have over 4GB of email in Outlook repositories and 2GB in Eudora. I have spreadsheets, databases, etc.

"Dumb" full-text search of my repository just returns noise: thousands of hits.

On the other hand, Lookout works great.


It works because I use Lookout to search only my Outlook repository, and that repository has LOTS of rich metadata. There are date entries, subject entries, people entries, item types (contact, task), etc. Lookout provides ways to constrain searches by metadata. It seems to use the metadata in its ranking (Subject >> text). It lets me skip indexing attachments, which add more noise than value. I've learned to edit email subjects/titles in Outlook (secret tip: this is very easy to do) before I throw my email in the Save folder (I don't use any other email folders). I can add an optional layer of on-the-fly metadata by adding categories. (Categories are very poorly supported in Outlook; otherwise this would work better.)
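To make "constrain searches by metadata" concrete, here is a rough Python sketch. It is not how Lookout works internally (I don't know that); the `Item` fields and the `constrained_search` function are names I made up for illustration.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Item:
    # Hypothetical stand-in for an Outlook item; field names are assumptions.
    subject: str
    kind: str                      # e.g. "mail", "contact", "task"
    sent: date
    people: list = field(default_factory=list)


def constrained_search(items, text, kind=None, person=None, since=None):
    """Match `text` against the subject, then narrow by optional metadata."""
    text = text.lower()
    hits = []
    for item in items:
        if text not in item.subject.lower():
            continue
        if kind is not None and item.kind != kind:
            continue
        if person is not None and person not in item.people:
            continue
        if since is not None and item.sent < since:
            continue
        hits.append(item)
    return hits
```

Each constraint is optional, so the same call covers a plain subject search and a search narrowed to, say, mail from one person since a given date.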

Desktop search has very little metadata to go on. Yeah, NTFS has LOTS of rich metadata support -- but it's ignored by almost all applications. Microsoft synchronizes (awkwardly) Office document metadata with NTFS metadata, but even in Office, support is weak. The workflow for adding metadata, even document titles, is very poor.

The biggest source of meaning-rich filesystem metadata on a PC is the folder/path name -- even more than the file name. I do better using a self-built, kludged implementation of Norton Change Directory than I do with Copernic or any other filesystem indexing method. That works because I try to make my folder names descriptive.

Smart desktop search for someone as atypical as me needs to be smart about metadata. It needs to value strings in path names more than strings buried on page 50 of a 200-page document. It needs to value "Title" strings more than deep text strings. It needs to value file name strings. It needs to rank recent above old. Heck, I could make a longer list (anyone want to pay me :-?).
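The wish list above can be sketched as a toy ranking function in Python. Everything specific here is an assumption for illustration: the `Doc` fields, the weight values, and the one-year half-life for recency are guesses, not numbers from any real product.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical weights: path > title > file name > body text.
# The values are illustrative guesses, not tuned numbers.
WEIGHTS = {"path": 5.0, "title": 4.0, "filename": 3.0, "body": 1.0}


@dataclass
class Doc:
    path: str        # folder part, e.g. "projects/desktop-search"
    filename: str
    title: str
    body: str
    modified: date


def score(doc, term, today, half_life_days=365.0):
    """Score one document for a single term (case-insensitive substring)."""
    term = term.lower()
    fields = {"path": doc.path, "title": doc.title,
              "filename": doc.filename, "body": doc.body}
    base = sum(WEIGHTS[name] for name, text in fields.items()
               if term in text.lower())
    if base == 0:
        return 0.0
    age_days = max((today - doc.modified).days, 0)
    # 1.0 for a file modified today, 0.5 after one half-life.
    recency = 0.5 ** (age_days / half_life_days)
    return base * (1.0 + recency)


def rank(docs, term, today):
    """Return matching documents, best score first; non-matches are dropped."""
    scored = [(score(d, term, today), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for s, d in scored if s > 0]
```

With these weights, a hit in a folder name outranks a hit buried in body text, and of two otherwise equal hits the more recently modified file comes first.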

Search results need to be very quick to sort (and, for me only, subsort) and to allow additional subqueries. (OK, I'm very data oriented.)
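Here is what I mean by sort, subsort, and subquery, as a small Python sketch. The result rows and the function names are made up for illustration; a real tool would fill the rows from its index.

```python
from operator import itemgetter

# Hypothetical result rows from an earlier search.
results = [
    {"name": "budget-2004.xls", "folder": "finance", "year": 2004},
    {"name": "search-notes.txt", "folder": "blog", "year": 2005},
    {"name": "budget-2003.xls", "folder": "finance", "year": 2003},
    {"name": "search-ideas.txt", "folder": "blog", "year": 2004},
]


def sort_results(rows, primary, secondary):
    """Sort by `primary`, breaking ties with `secondary` (the subsort)."""
    return sorted(rows, key=itemgetter(primary, secondary))


def subquery(rows, text):
    """Narrow an existing result list instead of running a new search."""
    text = text.lower()
    return [r for r in rows if text in r["name"].lower()]
```

For example, `sort_results(results, "folder", "year")` groups the rows by folder and orders each group by year, and `subquery(results, "budget")` keeps only the two budget spreadsheets.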

I think OS X Tiger search is going to knock the socks off the PC products I've seen so far -- including Google's disappointing offering. Reading their developer notes, they clearly understand the problem -- and, more importantly, they plan to deliver on their understanding.

Yeah, I know Microsoft has great stuff in the labs (Longhorn, etc) -- but so did Xerox 20 years ago.

Of course, as noted above, I'm really extreme. On the other hand, a departmental group of 10-20 people may have similar needs to mine. So someone building a desktop search solution for someone like me is building something that may work in an organization.
