How Subtext’s Lucene.net index is structured

Lucene.net tutorial

How to get started with Lucene.net
Lucene.net: the main concepts
Lucene.net: your first application
Dissecting Lucene.net storage: Documents and Fields
Lucene - or how I stopped worrying, and learned to love unstructured data
How Subtext’s Lucene.net index is structured

In the last part of the Lucene.net tutorial we talked about how to organize a Lucene index, and how important it is to have a well-planned strategy for it. In this post I’m going to show you how I applied those concepts and Nic’s tips during the design of the index for Subtext.

Requirements

Here are the requirements we are designing the index for:

Free-text searches using the search box
When someone arrives from a search engine, show more results related to the search they did
Show more posts related to a post

The first two requirements are the usual ones: being able to search for some terms in the index. The last one requires more than just the list of terms: it’s a MoreLikeThis search, and it also needs the Term Vector to be stored.

Then there are other “hidden requirements”: a post can just be a draft (and I don’t want it to appear in searches), or it can be scheduled for future publishing (and again, I don’t want it to appear in search results). Then we also have the “aggregated blog”, which is a collection of all the blogs of the site. To make things even more complex, it’s not just one “wall”: blogs can be grouped in different “walls” (for example all blogs talking about Silverlight and all the ones talking about ASP.NET MVC). And lastly, users can decide not to push their posts to their group.

Structure of the Index

With all these requirements in mind, here is how Subtext’s index is structured:

Name	Index	Store	Term Vector	Boost	Description
Title	TOKENIZED	YES	YES	2	The title of the post
Body	TOKENIZED	NO	YES	-	Body of the post
Tags	TOKENIZED	NO	YES	4	List of tags
PubDate	UN_TOKENIZED	YES	NO	-	The publishing date
BlogID	UN_TOKENIZED	NO	NO	-	The id of the blog
Published	UN_TOKENIZED	NO	NO	-	Whether the post is published (false for a draft)
GroupID	UN_TOKENIZED	NO	NO	-	The group id (0 if not pushed to aggregator)
PostURL	NO	YES	NO	-	The URL of the post
BlogName	NO	YES	NO	-	The name of the blog
PostID	UN_TOKENIZED	YES	NO	-	The id of the post
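To make the table easier to reason about, here is a minimal sketch that restates it as plain data and derives the three groups of fields discussed in this post. It is not Lucene.net code: the names FieldSpec, SCHEMA and names are made up for illustration, and the values are copied from the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldSpec:
    """One row of the index table (hypothetical helper, not a Lucene.net type)."""
    name: str
    index: str            # "TOKENIZED", "UN_TOKENIZED" or "NO"
    store: bool           # is the original value kept in the index?
    term_vector: bool     # is the term vector kept (needed for MoreLikeThis)?
    boost: Optional[float] = None


# Values copied from the table above.
SCHEMA = [
    FieldSpec("Title", "TOKENIZED", True, True, 2.0),
    FieldSpec("Body", "TOKENIZED", False, True),
    FieldSpec("Tags", "TOKENIZED", False, True, 4.0),
    FieldSpec("PubDate", "UN_TOKENIZED", True, False),
    FieldSpec("BlogID", "UN_TOKENIZED", False, False),
    FieldSpec("Published", "UN_TOKENIZED", False, False),
    FieldSpec("GroupID", "UN_TOKENIZED", False, False),
    FieldSpec("PostURL", "NO", True, False),
    FieldSpec("BlogName", "NO", True, False),
    FieldSpec("PostID", "UN_TOKENIZED", True, False),
]


def names(predicate):
    """Return the names of the fields that satisfy the predicate, in table order."""
    return [f.name for f in SCHEMA if predicate(f)]


full_text_fields = names(lambda f: f.index == "TOKENIZED")       # analyzed content
exact_match_fields = names(lambda f: f.index == "UN_TOKENIZED")  # single-term criteria
stored_fields = names(lambda f: f.store)                         # retrievable values
term_vector_fields = names(lambda f: f.term_vector)              # MoreLikeThis input
```

Reading the derived lists side by side shows the pattern behind the table: only the content fields are tokenized and carry a term vector, while the fields used as search criteria are indexed as single terms.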

Explaining why

Let’s explain it a bit more. The only fields that need to be full-text searched are the ones that contain some kind of real content: so Title, Body and Tags are the only ones that need to be analyzed and tokenized.

But to comply with all the other requirements, when we do a search we also have to apply other criteria:

PubDate must be less than Now
Published must be true
BlogID must be the one of the blog I’m searching from (when searching inside a single blog)
GroupID must be the one of the aggregated site I’m searching from (when searching inside an aggregated site)

So I also need to index the fields above, but since each of them holds a single term I don’t need to tokenize them.
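The criteria listed above can be read as a single predicate over the un-tokenized fields. The sketch below models that logic in plain Python; it is only an illustration of the rules, not how Subtext or Lucene.net evaluates them, and the function name is_visible and the dictionary layout of a post are assumptions.

```python
from datetime import datetime
from typing import Optional


def is_visible(post: dict, now: datetime,
               blog_id: Optional[int] = None,
               group_id: Optional[int] = None) -> bool:
    """Apply the non-text criteria to one post (hypothetical helper)."""
    if not post["Published"]:          # drafts never show up
        return False
    if not post["PubDate"] < now:      # scheduled for the future: hide it
        return False
    if blog_id is not None and post["BlogID"] != blog_id:
        return False                   # searching inside a single blog
    if group_id is not None and post["GroupID"] != group_id:
        return False                   # searching inside an aggregated site
    return True


now = datetime(2009, 9, 10)
published = {"Published": True, "PubDate": datetime(2009, 9, 4), "BlogID": 1, "GroupID": 2}
draft = {"Published": False, "PubDate": datetime(2009, 9, 4), "BlogID": 1, "GroupID": 2}
scheduled = {"Published": True, "PubDate": datetime(2009, 12, 1), "BlogID": 1, "GroupID": 2}
not_pushed = {"Published": True, "PubDate": datetime(2009, 9, 4), "BlogID": 1, "GroupID": 0}
```

With these sample posts, a search inside blog 1 would keep published and not_pushed, while a search inside group 2 would keep only published, because a GroupID of 0 means the post was not pushed to the aggregator.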

I also need the PostID because, when using the MoreLikeThis query, I have to give Lucene the id of the document I want to find similar documents for.
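To give an idea of why a "more like this" search needs both a document id and per-document term information, here is a toy model. It is explicitly not Lucene's MoreLikeThis implementation, which works differently internally; it just compares term-frequency vectors with cosine similarity. The helper names (term_vector, cosine, more_like) and the sample posts are invented for the example.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Map each lower-cased term in the text to how often it occurs."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def more_like(post_id, vectors, top=3):
    """Return ids of the posts most similar to post_id, best match first."""
    source = vectors[post_id]
    scored = [(cosine(source, v), pid)
              for pid, v in vectors.items() if pid != post_id]
    scored.sort(reverse=True)
    return [pid for score, pid in scored[:top] if score > 0]


# Hypothetical posts keyed by PostID.
posts = {
    1: "lucene index fields",
    2: "lucene index design",
    3: "silverlight animation",
}
vectors = {pid: term_vector(text) for pid, text in posts.items()}
related = more_like(1, vectors)
```

The starting point is always an id (here 1), and the comparison runs on term vectors rather than on the stored text, which mirrors why the table keeps PostID and enables the term vector on the content fields.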

And finally, a row in the results will be like:

Dissecting Lucene.net storage: Documents and Fields – Sept 4th, 2009 (CodeClimber)

So the only fields I need to retrieve, and thus store, are Title, PubDate, BlogName (shown in case I’m doing a search from the aggregated site) and obviously the URL to link to the complete post.

What do you think? Am I missing something? Would you have done something differently? Please answer with your comments.

The next step

Now that the index has been designed, in the next post we’ll cover some infrastructural code, and show how the search engine service works inside Subtext.

Disclaimer: This is all work in progress and might be (and probably will be) different from the final version of the search engine service that will be included in the next version of Subtext.

Tags: Lucene.net,Document,Field,Subtext