Lucene - or how I stopped worrying, and learned to love unstructured data

Before moving on to how I’m implementing Lucene.net in Subtext, I wanted to bring to my Lucene.net tutorial the experience of a good friend of mine, Nic Wise, who has been using Lucene, both Java and .NET, since 2003.
So, without further ado, let’s read the experience directly from Nic’s writings.

Lucene.net tutorial

How to get started with Lucene.net
Lucene.net: the main concepts
Lucene.net: your first application
Dissecting Lucene.net storage: Documents and Fields
Lucene - or how I stopped worrying, and learned to love unstructured data
How Subtext’s Lucene.net index is structured

Lucene is a project I've been using for a long time, and one I often find that people don't know about. I think Simo has covered off what Lucene is, and how to use it, so I'm here to tell you a bit about how I've used it over the years.

My first Lucene

My first use of Lucene was back in about 2003. I was writing an educational website for IDG in New Zealand, using Java, and we needed to search the database. Lucene was really the only option, aside from using various RDBMS tricks, so it was chosen. This was a pretty typical usage tho - throw the content of a database record into the index with a primary key and then query it, pulling back the records in relevance order.

With the collapse of the first internet bubble (yes, it even hit little ol' New Zealand) that site died, and I stopped using Java and moved to .NET. To be honest, I don't even remember the name of the site!

AfterMail / Archive Manager

My next encounter with Lucene - this time Lucene.NET - was when I was at AfterMail (which was later bought by Quest Software, and is now Archive Manager). AfterMail was an email archiving product, which extracts email from Exchange, and puts it into a SQL Server database. Exchange didn't handle huge data sets well (it still doesn't do it well, but it does do it better), but SQL Server can and does handle massive data sets without flinching.

The existing AfterMail product used a very simple index system: break up a document into its component words, either by tokenizing the content of an email, or using an iFilter to extract the content of an attachment, and then do a mapping between words and email or attachment primary keys. It was pretty simple, and it worked quite well with small data sets, but the size of the index database compared to the size of the source database was a problem - it was often more than 75%! This was really not good when you have a lot of data in the database. On top of that, it had no relevance ranking, or any of the other nice features a "real" index provides.

We decided to give Lucene a try for the second major release of AfterMail. On the same data set, Lucene created an index which was about 20% of the size of the source data, performed a lot quicker, and scaled up to massive data sets without any problem.

The general architecture we had went like this:

The data loader would take an email and insert it into the database. It would also add the email's ID and the ID of any attachments into the "to be indexed" table.
The indexing service would look at that "to be indexed" table every minute, and index anything which was in there.
When the website needed to query the index, it would make a remoting call (what is now WCF) to the index searching service, which would query the index, and put the results into a database temporary table. This was a legacy from the original index system, so we could then join onto the email and attachment tables.
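
To make the flow above concrete, here is a minimal sketch in Python. It is not the AfterMail code: it assumes SQLite in place of SQL Server, a plain dictionary in place of the Lucene index, and a single polling pass in place of the once-a-minute service. The table and function names (email, to_be_indexed, search_hits, load_email, index_pending, search) are made up for illustration.

```python
import sqlite3


def make_db():
    """Create an in-memory database with the two tables the sketch needs."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE email (id INTEGER PRIMARY KEY, body TEXT)")
    db.execute("CREATE TABLE to_be_indexed (email_id INTEGER PRIMARY KEY)")
    return db


def load_email(db, body):
    """Data loader: store the email and queue its ID for indexing."""
    cur = db.execute("INSERT INTO email (body) VALUES (?)", (body,))
    email_id = cur.lastrowid
    db.execute(
        "INSERT OR IGNORE INTO to_be_indexed (email_id) VALUES (?)", (email_id,)
    )
    db.commit()
    return email_id


def index_pending(db, index):
    """Indexing service: one polling pass over the queue.

    `index` is a dict mapping word -> set of email IDs, a stand-in for
    the real Lucene index. Returns how many emails were indexed.
    """
    rows = db.execute(
        "SELECT e.id, e.body FROM to_be_indexed q "
        "JOIN email e ON e.id = q.email_id"
    ).fetchall()
    for email_id, body in rows:
        for word in body.lower().split():
            index.setdefault(word, set()).add(email_id)
        db.execute("DELETE FROM to_be_indexed WHERE email_id = ?", (email_id,))
    db.commit()
    return len(rows)


def search(db, index, word):
    """Search service: write hits to a temp table, then join onto email."""
    db.execute("CREATE TEMP TABLE IF NOT EXISTS search_hits (email_id INTEGER)")
    db.execute("DELETE FROM search_hits")
    db.executemany(
        "INSERT INTO search_hits (email_id) VALUES (?)",
        [(i,) for i in sorted(index.get(word.lower(), ()))],
    )
    return db.execute(
        "SELECT e.id, e.body FROM search_hits h "
        "JOIN email e ON e.id = h.email_id ORDER BY e.id"
    ).fetchall()
```

In this sketch an email only becomes searchable after the next call to index_pending, which mirrors the up-to-a-minute delay between loading and indexing described above.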

We indexed a fair bit of data, including:

The content of the email or attachment, in a form which could be searched but not retrieved.
Which users could see it, so we didn't have to check if the user could see an item in the results.
The email address of the user, broken down - so foo@bar.com was added in as foo, oof, bar.com, moc.rab etc. This allowed us to search for f*, *oo, *@bar.com, and *@bar*, which Lucene doesn't normally allow (you can do wildcards at the end, but not the beginning).
Other metadata, like which mailboxes the email was in, which folders, if we knew, and a load of other data.
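
The reversed-term trick for email addresses can be sketched roughly as follows. This is an illustration in Python rather than the original AfterMail code: the field names (local, local_rev, domain, domain_rev) and the rule that a pattern without a leading *@ targets the part before the @ are assumptions, and only the four pattern shapes mentioned above are handled.

```python
def email_terms(address):
    """Break an address into forward and reversed terms.

    foo@bar.com -> foo, oof, bar.com, moc.rab
    """
    local, _, domain = address.lower().partition("@")
    return {
        "local": local,
        "local_rev": local[::-1],
        "domain": domain,
        "domain_rev": domain[::-1],
    }


def rewrite(pattern):
    """Map a user pattern to (field, text, is_prefix).

    A leading wildcard becomes a trailing wildcard on the reversed
    field, which is the only kind of wildcard the index is assumed to
    support.
    """
    p = pattern.lower()
    if p.startswith("*@"):
        # "*@bar.com" or "*@bar*": any local part, match on the domain.
        part, side = p[2:], "domain"
    else:
        # "f*" or "*oo": assumed to target the local part.
        part, side = p, "local"

    if part.startswith("*"):
        field, text, is_prefix = side + "_rev", part[1:][::-1], True
    elif part.endswith("*"):
        field, text, is_prefix = side, part[:-1], True
    else:
        field, text, is_prefix = side, part, False

    if "*" in text:
        raise ValueError("pattern shape not handled by this sketch: " + pattern)
    return field, text, is_prefix


def matches(address, pattern):
    """Check one address against one pattern using only prefix/exact tests."""
    field, text, is_prefix = rewrite(pattern)
    term = email_terms(address)[field]
    return term.startswith(text) if is_prefix else term == text
```

For example, *oo is rewritten to a prefix search for oo on the reversed local part, where foo is stored as oof. The cost is a larger index, since every address part is stored twice.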

All of this meant we could provide the user with a strong search function. From time to time, an email would be indexed more than once, updating the document in the index (eg if another user could see the email), but in general, it was a quick and fairly stable process.

It wasn't perfect tho: we ran into an issue where we set the merge sizes too high - way, way too high - which caused a merge of two huge files. This would have worked just fine, if a bit slow, except we had a watchdog timer in place: when a service took too long doing anything, the service would be killed and restarted. This led to a lot of temporary index files being left around (around 250GB in one case, for a 20GB index), and a generally broken index. Setting the merge size to a more sane value (around 100,000 - we had it at Int32.MaxValue) fixed this, but it took us a while to work it out, and about a week to reindex the customer's database, which contained around 100GB of email and attachments.

Another gotcha we ran into - and why we had a search service which was accessed via remoting - is that Lucene does NOT like talking to an index which is on a file share. If the network connection goes down, even for a second, you will end up with a trashed index. (This was in 1.x, and may be fixed in 2.x.) AfterMail was designed to be distributed over multiple machines, so being able to communicate with a remote index was a requirement.

Just before I left, we did a lot of work around the indexing, moving from Lucene.NET 1.x to 2.x, along with a move to .NET 2.0. We added multi-threading for indexing of email (attachments had a bottleneck on the iFilters, which were always single threaded, but the indexing part was multi-threaded), which sped up the indexing by a large factor - I think we were indexing about 20-40 emails per second on a fairly basic dual-core machine, up from 2-3 per second, and it would scale quite linearly as you added more CPUs.

Lucene performed amazingly well, and allowed us to provide a close-to-Google style search for our customers.

Top Gear

The next project I used it on was a rewrite of the Top Gear website. This is where some of the less conventional uses came up. For those who don't know, Top Gear is a UK television program about cars, cars and cars, presented in both a technical (BHP, MPG, torque) and non-technical, humorous way (it's rubbish/it's amazing/OMFG!). We were redeveloping the website from scratch, for the magazine, and it ties into the show well.

The first aspect of the index was the usual: index the various items in the database (articles, blog posts, car reviews, video metadata), and allow the user to search them. The search results were slightly massaged, as we wanted to bubble newer content to the top, but otherwise we were using Lucene’s built-in relevance ordering. The user can also select what they want to search - articles, blog posts, video etc - or just search the whole site.

Quick tips for people new to Lucene

Your documents don't have to have the same fields! For example, the fields for a Video will be different to the fields for an Article, but you can put them in the same index! Just make a few common ones (I usually go with body and tags, as well as a discriminator (Video/Article/News etc) and a database primary key), but really, you can add any number of different fields to a document.

Think about what you need the field for. For example, you may only need the title and the first 100 characters of a blog post, to show on the screen, but storing the whole post will blow out the size of your database. Only store what you need - you can still index and search on text which is not stored.
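
Both tips can be illustrated with a toy index. The sketch below is plain Python, not the Lucene API: the TinyIndex class and the field names (kind as the discriminator, pk as the primary key, summary, duration_seconds, presenter) are hypothetical. It only shows that documents of different shapes can live side by side, and that text can be searchable without being stored.

```python
class TinyIndex:
    """A toy inverted index that keeps indexed and stored fields apart."""

    def __init__(self):
        self.postings = {}  # (field, word) -> set of document ids
        self.stored = []    # document id -> dict of stored field values

    def add(self, indexed, stored):
        """Add a document.

        `indexed` fields are searchable but their text is not kept;
        `stored` fields are what a search hands back.
        """
        doc_id = len(self.stored)
        self.stored.append(dict(stored))
        for field, text in indexed.items():
            for word in text.lower().split():
                self.postings.setdefault((field, word), set()).add(doc_id)
        return doc_id

    def search(self, field, word, kind=None):
        """Return stored fields of matching documents, optionally by kind."""
        hits = sorted(self.postings.get((field, word.lower()), ()))
        docs = [self.stored[i] for i in hits]
        if kind is not None:
            docs = [d for d in docs if d.get("kind") == kind]
        return docs


index = TinyIndex()
post = "Lucene handles unstructured data surprisingly well. " * 20

# An article: the whole body is searchable, only a 100-character summary is stored.
index.add(
    indexed={"body": post, "tags": "lucene search"},
    stored={"kind": "Article", "pk": 17, "title": "Loving Lucene",
            "summary": post[:100]},
)

# A video: different fields, same index.
index.add(
    indexed={"body": "A short clip about indexing", "tags": "lucene video",
             "presenter": "nic"},
    stored={"kind": "Video", "pk": 4, "duration_seconds": 95},
)
```

A search on body for unstructured finds the article even though only its title and summary come back, and a search on tags for lucene returns both kinds unless a kind is given.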

The second aspect was much less common. Each document in the database had various keywords, or tags, which were added by the editor when they were entered. We then looked for other items in the database which matched those tags, either in their body, tags or other fields, and used that as a list of "related" items. We also weighted the results, so that a match in an item's tags counted for more than something in the title or body. For example, on this page you can see the list of related items at the bottom, generated on the fly from Lucene, by looking for other documents which match the tags of this article.

If we were able, we would have extended the tag set using keyword extraction (eg using the Yahoo! Term Extraction API) from the body contents, but this was deemed to be overkill.
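
The related-items lookup described above can be sketched like this. It is a simplified stand-in written in Python, not the Top Gear code and not Lucene's scoring: the weights in FIELD_WEIGHTS, the sample items and the related function are all assumptions, chosen only to show a tag match counting for more than a title or body match.

```python
# Assumed weights: a hit in the tags counts most, then title, then body.
FIELD_WEIGHTS = {"tags": 3.0, "title": 1.5, "body": 1.0}

# Hypothetical sample data keyed by database primary key.
ITEMS = {
    1: {"title": "Driving the new hatchback", "tags": ["hatchback", "diesel"],
        "body": "A week with a small diesel car"},
    2: {"title": "Diesel economy tested", "tags": ["diesel", "economy"],
        "body": "How far does a tank go"},
    3: {"title": "Hatchback group test", "tags": ["review"],
        "body": "Five cars one winner"},
    4: {"title": "Track day video", "tags": ["video"],
        "body": "The diesel surprised everyone"},
    5: {"title": "Convertible season", "tags": ["convertible"],
        "body": "Roof down weather"},
}


def related(items, source_id, limit=5):
    """Rank other items by how well they match the source item's tags."""
    wanted = {t.lower() for t in items[source_id]["tags"]}
    scored = []
    for item_id, item in items.items():
        if item_id == source_id:
            continue
        score = 0.0
        for field, weight in FIELD_WEIGHTS.items():
            value = item.get(field, "")
            # Tags are already a list; title and body are split on spaces.
            words = value if isinstance(value, (list, set, tuple)) else value.split()
            score += weight * len(wanted & {w.lower() for w in words})
        if score > 0:
            scored.append((score, item_id))
    # Highest score first; ties broken by primary key for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item_id for _, item_id in scored[:limit]]
```

With the sample data, item 2 shares a tag with item 1 and also mentions it in the title, so it ranks ahead of items 3 and 4, which only match in the title or body.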

Top Gear's publishing system works by pulling out new articles from the CMS database, and creating entries in the main database. At the same time, it adds the item to the index. In addition to this, there is a scheduled process which recreates the index from scratch 4x a day. When the index is rebuilt, it's then distributed to the other web server in the cluster, so that both machines are up to date. The indexes are small, and the document count on them is low (<10,000), so reindexing only takes a couple of minutes.

My final personal recommendations

All up, Lucene has been a consistent top performer whenever I've needed a search-based database. It can handle very large data sets (hundreds of GB of source data) without any problems, returning results in real time (<500ms). The mailing list is active, and despite not having a binary distribution, it is maintained, developed, and supported.

If you think of it as only an index, then you are going to only use one aspect of this very versatile tool. It does add another level of complexity to the system, but once you master it - and it's not hard to master - it's a very solid performer, even more so if you stop worrying about relational, and learn to love unstructured.

I recommend the book Lucene in Action, as it has a lot of background on how the searching and indexing work, as well as a how-to guide - the Java and .NET versions are very close to API compatible, certainly enough to make the book worthwhile.

About the author

Nic Wise is a grey-haired software developer from New Zealand, living in London, UK. He is a freelance contractor, previously working for BBC Worldwide on a redevelopment of the Top Gear website and the bbc.com/bbc.co.uk homepage. He has worked on many projects over his 13+ years in the industry. Read more about him.

You can read his ramblings on his blog, The Chicken Coop, and follow him on Twitter @fastchicken.

Tags: lucene.net,top gear,guest post