Lucene.net tutorial
In the last part of the tutorial about Lucene.net we talked about how to organize a Lucene index, and how important it is to have a well-planned strategy for it. In this post I’m going to show you how I applied those concepts and Nic’s tips during the design of the index for Subtext.
Requirements
Here are the requirements we are designing the index for:
- Free-text searches using the search box
- When someone comes from a search engine, show more results related to the search they did
- Show more posts related to a post
The first two requirements are the usual ones: being able to search for some terms in the index. The last one requires more than just the list of terms: it’s a MoreLikeThis search, which also needs the term vectors of the searched fields to be stored.
Then there are other “hidden requirements”: a post can just be a draft (and I don’t want it to appear in searches), or it can be scheduled for future publishing (and again, I don’t want it to appear in search results). Then we also have the “aggregated blog”, which is a collection of all the blogs of the site. To make things even more complex, it’s not just one “wall”: blogs can be grouped in different “walls” (for example all blogs talking about Silverlight and all the ones talking about ASP.NET MVC). And last, users can decide not to push their posts to their group.
Structure of the Index
With all these requirements in mind, here is how Subtext’s index is structured:
Name | Index | Store | Term Vector | Boost | Description |
Title | TOKENIZED | YES | YES | 2 | The title of the post |
Body | TOKENIZED | NO | YES | - | Body of the post |
Tags | TOKENIZED | NO | YES | 4 | List of tags |
PubDate | UN_TOKENIZED | YES | NO | - | The publishing date |
BlogID | UN_TOKENIZED | NO | NO | - | The id of the blog |
Published | UN_TOKENIZED | NO | NO | - | Whether the post is published (false for a draft) |
GroupID | UN_TOKENIZED | NO | NO | - | The group id (0 if not pushed to aggregator) |
PostURL | NO | YES | NO | - | The URL of the post |
BlogName | NO | YES | NO | - | The name of the blog |
PostID | UN_TOKENIZED | YES | NO | - | The id of the post |
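As a quick sanity check, the table above can be written down as plain data and queried. This is only a sketch in Python, not Lucene.net code: `FieldSpec`, `SCHEMA` and the `names` helper are hypothetical names made up for the illustration, and the values are copied straight from the table.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class FieldSpec:
    """One row of the index design table (hypothetical helper type)."""
    name: str
    index: str                      # "TOKENIZED", "UN_TOKENIZED" or "NO"
    store: bool                     # can the value be read back from a hit?
    term_vector: bool               # are term vectors kept for this field?
    boost: Optional[float] = None   # None means "no explicit boost"


# Values copied from the table; nothing here talks to Lucene.
SCHEMA = [
    FieldSpec("Title", "TOKENIZED", True, True, 2.0),
    FieldSpec("Body", "TOKENIZED", False, True),
    FieldSpec("Tags", "TOKENIZED", False, True, 4.0),
    FieldSpec("PubDate", "UN_TOKENIZED", True, False),
    FieldSpec("BlogID", "UN_TOKENIZED", False, False),
    FieldSpec("Published", "UN_TOKENIZED", False, False),
    FieldSpec("GroupID", "UN_TOKENIZED", False, False),
    FieldSpec("PostURL", "NO", True, False),
    FieldSpec("BlogName", "NO", True, False),
    FieldSpec("PostID", "UN_TOKENIZED", True, False),
]


def names(predicate: Callable[[FieldSpec], bool]) -> List[str]:
    """Return the names of the fields that satisfy the predicate."""
    return [f.name for f in SCHEMA if predicate(f)]


full_text_fields = names(lambda f: f.index == "TOKENIZED")
exact_match_fields = names(lambda f: f.index == "UN_TOKENIZED")
stored_fields = names(lambda f: f.store)
term_vector_fields = names(lambda f: f.term_vector)
```

Read this way, the design splits into three groups: fields analyzed for free-text search (which are also the ones with term vectors), fields indexed as a single term for exact matching, and fields that are only stored for display.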
Explaining why
Let’s explain it a bit more. The only fields that need to be full-text searched are the ones that contain some kind of real content: so Title, Body and Tags are the only ones that need to be analyzed and tokenized.
But to comply with all the other requirements, every search also has to apply other criteria:
- PubDate must be less than Now
- Published must be true
- BlogID must be the one of the blog I’m searching from (when searching inside a single blog)
- GroupID must be the one of the aggregated site I’m searching from (when searching inside an aggregated site)
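The four criteria above amount to a filter applied on top of the free-text query. Here is a minimal sketch of that logic in Python (not Lucene.net code). It rests on assumptions that are not part of the design above: that PubDate is indexed as a sortable `yyyyMMddHHmmss` string (un-tokenized terms are compared as strings, so the format has to sort chronologically), that Published is indexed as `"true"` or `"false"`, and that ids are indexed as strings. `sortable`, `is_visible`, `POSTS` and `visible_ids` are hypothetical names.

```python
from datetime import datetime


def sortable(dt):
    """Format a date so that string order equals chronological order (assumed format)."""
    return dt.strftime("%Y%m%d%H%M%S")


def is_visible(doc, now, blog_id=None, group_id=None):
    """Apply the extra search criteria to one document (a dict of field values)."""
    if doc["Published"] != "true":          # drafts never show up
        return False
    if not doc["PubDate"] < sortable(now):  # scheduled posts are hidden until due
        return False
    if blog_id is not None and doc["BlogID"] != blog_id:      # single-blog search
        return False
    if group_id is not None and doc["GroupID"] != group_id:   # aggregated-site search
        return False
    return True


# Made-up sample documents, only the filter fields are shown.
POSTS = [
    {"PostID": "1", "BlogID": "10", "GroupID": "2", "Published": "true", "PubDate": "20090904120000"},
    {"PostID": "2", "BlogID": "10", "GroupID": "2", "Published": "false", "PubDate": "20090901090000"},  # draft
    {"PostID": "3", "BlogID": "10", "GroupID": "2", "Published": "true", "PubDate": "20091001000000"},   # scheduled
    {"PostID": "4", "BlogID": "10", "GroupID": "0", "Published": "true", "PubDate": "20090902080000"},   # not pushed
    {"PostID": "5", "BlogID": "11", "GroupID": "2", "Published": "true", "PubDate": "20090903080000"},
]
NOW = datetime(2009, 9, 10)


def visible_ids(**scope):
    """PostIDs that pass the filter for the given scope (blog_id=... or group_id=...)."""
    return [p["PostID"] for p in POSTS if is_visible(p, NOW, **scope)]
```

With this sample data, `visible_ids(blog_id="10")` keeps posts 1 and 4, while `visible_ids(group_id="2")` keeps posts 1 and 5: post 4 drops out of the aggregated search because its GroupID is 0.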
So I also need to index the fields above, but since each of them holds a single term I don’t need to tokenize them.
I also need the PostID: when we use the MoreLikeThis query, I have to tell Lucene which document I want to find similar documents for.
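To give an idea of why a "more like this" search needs term vectors and a way to locate the source document, here is a deliberately simplified sketch in Python. It is not Lucene's MoreLikeThis implementation: it only mimics the general idea of picking the terms that are frequent in the source post and rare in the rest of the index, then looking for other posts that contain them. `INDEX`, `term_vector`, `interesting_terms` and `more_like_this` are hypothetical names, and the scoring formula is an assumption chosen for the example.

```python
import math
from collections import Counter


def term_vector(text):
    """Term -> frequency for one field of one post (naive whitespace tokenizer)."""
    return Counter(text.lower().split())


# Hypothetical mini-index: PostID -> term vector of the post content.
INDEX = {
    "1": term_vector("lucene index index fields"),
    "2": term_vector("lucene search query"),
    "3": term_vector("silverlight controls"),
}


def interesting_terms(post_id, index, max_terms=3):
    """Rank the source post's terms by frequency in the post times rarity in the index."""
    source = index[post_id]            # PostID is how the source document is found
    num_docs = len(index)
    scored = []
    for term, tf in source.items():
        doc_freq = sum(1 for tv in index.values() if term in tv)
        idf = math.log(num_docs / (doc_freq + 1)) + 1.0
        scored.append((tf * idf, term))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))   # best score first, ties by name
    return [term for _, term in scored[:max_terms]]


def more_like_this(post_id, index, max_terms=3):
    """Return the ids of other posts sharing the source post's interesting terms."""
    terms = interesting_terms(post_id, index, max_terms)
    hits = []
    for other_id, tv in index.items():
        if other_id == post_id:
            continue
        overlap = sum(tv[t] for t in terms)   # Counter returns 0 for missing terms
        if overlap > 0:
            hits.append((overlap, other_id))
    hits.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other_id for _, other_id in hits]
```

Without the per-document term frequencies there would be nothing to rank in `interesting_terms`, which is the reason Title, Body and Tags keep their term vectors in the design above.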
And finally, a row in the results will be like:
Dissecting Lucene.net storage: Documents and Fields – Sept 4th, 2009 (CodeClimber)
So the only fields I need to retrieve, and thus store, are Title, PubDate, BlogName (shown in case I’m doing a search from the aggregated site) and obviously the URL to link to the complete post.
What do you think? Am I missing something? Would you have done something differently? Please answer with your comments.
The next step
Now that the index has been designed, in the next post we’ll cover some infrastructural code, and show how the search engine service works inside Subtext.
Disclaimer: This is all work in progress and might be (and probably will be) different from the final version of the search engine service that will be included in the next version of Subtext.