In the last part of the tutorial we talked about how a Lucene index is organized, and how important it is to have a well-planned strategy for it. In this post I’m going to show you how I applied those concepts and Nic’s tips during the design of the index for Subtext.


Here are the requirements we are designing the index for:

  • Free-text searches using the search box
  • When someone comes from a search engine, show more results related to the search he did
  • Show more posts related to a post

The first two requirements are the usual ones: being able to search for some terms in the index. But the last one requires more than just the list of terms: it’s a MoreLikeThis search, and it also needs the term vector to be stored.

Then there are other “hidden requirements”: a post can be just a draft (and I don’t want it to appear in searches), or it can be scheduled for future publishing (and again, I don’t want it to appear in search results). We also have the “aggregated blog”, which is a collection of all the blogs on the site. To make things even more complex, there isn’t just one “wall”: blogs can be grouped into different “walls” (for example, all the blogs talking about Silverlight and all the ones talking about ASP.NET MVC). And last, users can decide not to push their posts to their group.

Structure of the Index

With all these requirements in mind, here is how Subtext’s index is structured:

Name       Index         Store  TV   Boost  Description
Title      TOKENIZED     YES    YES  2      The title of the post
Body       TOKENIZED     NO     YES  -      The body of the post
Tags       TOKENIZED     NO     YES  4      The list of tags
PubDate    UN_TOKENIZED  YES    NO   -      The publishing date
BlogID     UN_TOKENIZED  NO     NO   -      The id of the blog
Published  UN_TOKENIZED  NO     NO   -      Whether the post is published or still a draft
GroupID    UN_TOKENIZED  NO     NO   -      The group id (0 if not pushed to the aggregator)
PostURL    NO            YES    NO   -      The URL of the post
BlogName   NO            YES    NO   -      The name of the blog
PostID     UN_TOKENIZED  YES    NO   -      The id of the post

Explaining why

Let’s explain it a bit more. The only fields that need to be full-text searched are the ones that contain some kind of real content: so Title, Body and Tags are the only ones that need to be analyzed and tokenized.
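To make the distinction concrete, here is a toy sketch (plain Python, not the Lucene analyzer API) of the difference between a TOKENIZED field, which gets broken into individual searchable terms, and an UN_TOKENIZED one, which is indexed as a single exact term:

```python
import re

def analyze(text):
    """Toy analyzer: lowercase and split on non-alphanumeric characters,
    roughly what happens to a TOKENIZED field at indexing time."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

# A TOKENIZED field is broken into terms, each searchable on its own...
title_terms = analyze("Dissecting storage: Documents and Fields")
print(title_terms)  # ['dissecting', 'storage', 'documents', 'and', 'fields']

# ...while an UN_TOKENIZED field like BlogID stays one exact term,
# usable only for exact matching, never for free-text search.
blog_id_term = "42"  # hypothetical blog id
```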

But to comply with all the other requirements, when we do a search we also have to filter on other criteria:

  • PubDate must be less than Now
  • Published must be true
  • BlogID must be the one of the blog I’m searching from (when searching inside a single blog)
  • GroupID must be the one of the aggregated site I’m searching from (when searching inside an aggregated site)

So I also needed to index the fields above, but since they are single terms I didn’t need to tokenize them.
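The filtering criteria above can be sketched in plain Python over an in-memory list of posts (a toy model of the extra clauses added to the Lucene query, with hypothetical field values):

```python
from datetime import datetime

def matches(post, now, blog_id=None, group_id=None):
    """Toy version of the filter applied on top of the full-text query:
    only published, already-public posts from the right blog (or from
    the right aggregated group) may appear in the results."""
    if post["PubDate"] > now:           # scheduled for future publishing
        return False
    if not post["Published"]:           # still a draft
        return False
    if blog_id is not None and post["BlogID"] != blog_id:
        return False
    if group_id is not None and post["GroupID"] != group_id:
        return False
    return True

posts = [  # hypothetical posts
    {"PubDate": datetime(2009, 9, 4), "Published": True,  "BlogID": 1, "GroupID": 2},
    {"PubDate": datetime(2020, 1, 1), "Published": True,  "BlogID": 1, "GroupID": 2},  # scheduled
    {"PubDate": datetime(2009, 9, 1), "Published": False, "BlogID": 1, "GroupID": 0},  # draft
]
now = datetime(2009, 10, 1)
visible = [p for p in posts if matches(p, now, blog_id=1)]
print(len(visible))  # 1: only the first post passes all the criteria
```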

And I also need the PostID because, when using the MoreLikeThis query, I have to supply to Lucene the id of the document for which I want to find similar documents.
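MoreLikeThis works by pulling the most significant terms out of the source document’s term vector and building a query from them. A heavily simplified sketch of the idea (plain term-overlap counting, nothing like the real tf-idf-weighted implementation) could look like:

```python
def analyze(text):
    """Toy analyzer: lowercase whitespace split."""
    return text.lower().split()

def more_like_this(post_id, docs):
    """Score every other document by how many terms it shares with the
    source document, and return their ids, best match first. This is a
    toy model: real MoreLikeThis weights terms by frequency and rarity."""
    source_terms = set(analyze(docs[post_id]))
    scores = {
        other_id: len(source_terms & set(analyze(body)))
        for other_id, body in docs.items()
        if other_id != post_id
    }
    return sorted(scores, key=scores.get, reverse=True)

docs = {  # hypothetical PostID -> Body
    1: "lucene index documents and fields",
    2: "storing fields in a lucene index",
    3: "cooking pasta at home",
}
print(more_like_this(1, docs))  # [2, 3]: post 2 shares three terms, post 3 none
```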

And finally, a row in the results will look like:

Dissecting storage: Documents and Fields – Sept 4th, 2009 (CodeClimber)

So the only fields I need to retrieve, and thus store, are Title, PubDate, BlogName (shown in case I’m doing a search from the aggregated site) and obviously the URL to link to the complete post.
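Since only stored fields travel back with a search hit, a toy model of what a result row can be built from (hypothetical values; a field that was indexed but not stored, like Body, simply isn’t there to retrieve) might be:

```python
hit = {  # hypothetical stored fields of one matching document
    "Title": "Dissecting storage: Documents and Fields",
    "PubDate": "2009-09-04",
    "BlogName": "CodeClimber",
    "PostURL": "http://example.com/post/123",
}

# The result row is assembled purely from stored fields...
row = "{Title} - {PubDate} ({BlogName})".format(**hit)
print(row)  # Dissecting storage: Documents and Fields - 2009-09-04 (CodeClimber)

# ...while Body, indexed but not stored, cannot be read back from the hit.
print(hit.get("Body"))  # None
```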

What do you think? Am I missing something? Would you have done something differently? Please answer with your comments.

The next step

Now that the index has been designed, in the next post we’ll cover some infrastructural code, and show how the search engine service works inside Subtext.

Disclaimer: This is all work in progress and might be (and probably will be) different from the final version of the search engine service that will be included into the next version of Subtext.