Dissecting Lucene.net storage: Documents and Fields

Lucene.net tutorial

How to get started with Lucene.net
Lucene.net: the main concepts
Lucene.net: your first application
Dissecting Lucene.net storage: Documents and Fields
Lucene - or how I stopped worrying, and learned to love unstructured data
How Subtext’s Lucene.net index is structured

In the previous posts we discussed how to get started with Lucene.net and its main concepts, and we developed a sample application that put into practice all the concepts behind Lucene.net development. But before moving on, I think it’s worth analyzing in detail how content is stored in a Lucene.net index.

The Document

As you already saw, a Document is the unit of the indexing and searching process. You add documents to the index and, after you perform a search, you get back a list of results, each of which is a document.

A document is just an unstructured collection of Fields.

Fields

Fields are the actual content holders of Lucene.net: each one is basically a name/value pair, like an entry in a hashtable.

If we had infinite disk space and infinite processing power, that’s all we would need to know. Unfortunately disk space and processing power are constrained, so you can’t just analyze everything and store it all in the index. For this reason Lucene.net provides different ways of adding a field to the index.

Everything is controlled through the field constructor:

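// Term vectors can only be stored for a field that is indexed,
// so Field.Index.NO cannot be combined with Field.TermVector.YES.
Field field = new Field("fieldName", "value",
    Field.Store.YES,
    Field.Index.TOKENIZED,
    Field.TermVector.YES);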

Store the content or not

You can decide whether to store the content of the field into the index or not:

Field.Store.YES – Stores the content in the index as supplied to the Field’s constructor
Field.Store.NO – Doesn’t store the value at all (you won’t be able to retrieve it)
Field.Store.COMPRESS – Compresses the original value and stores it into the index

When you have to decide whether or not to store the original content, think about which data you really need when you display the results of a search: if you are never going to show the content of the document, there is no need to store it inside the index. But maybe you need to store the date, or the users that have the right to access a document. Or maybe you want to show only the first 100 characters of the post: in this case you store just those, and not the full post. The final goal is to keep the index size to a minimum but, at the same time, make sure you will not need to hit the database to display the results of a search.

This, of course, applies if you are using Lucene just as the full-text index of another “main” storage. If you are using it as a document store, à la CouchDB, you obviously need to store everything. In this last scenario, you might want to compress long texts or binary data to keep the size down.

Just one quick point to make sure there are no misunderstandings: even if you don’t store the original value of a field, you can still index it.
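To make the store/index split concrete, here is a minimal sketch of a helper that builds the document for a blog post. It assumes the Lucene.net 2.x API used in this series; the PostDocumentFactory class and the field names (url, title, excerpt, body) are made up for the example and are not part of Lucene.net.

```csharp
using Lucene.Net.Documents;

// Hypothetical helper, not part of Lucene.net.
public static class PostDocumentFactory
{
    public static Document Build(string url, string title, string body)
    {
        Document doc = new Document();

        // Stored but not indexed: only needed to link to the post.
        doc.Add(new Field("url", url, Field.Store.YES, Field.Index.NO));

        // Stored and indexed: searched and shown in the result list.
        doc.Add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));

        // Stored but not indexed: the first 100 characters, for display only.
        string excerpt = body.Length > 100 ? body.Substring(0, 100) : body;
        doc.Add(new Field("excerpt", excerpt, Field.Store.YES, Field.Index.NO));

        // Indexed but not stored: searchable, yet the full text
        // cannot be read back from a document returned by a search.
        doc.Add(new Field("body", body, Field.Store.NO, Field.Index.TOKENIZED));

        return doc;
    }
}
```

With this layout the result page can be rendered from url, title and excerpt alone, while the full body only costs the space of its index entries.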

To Index or not to Index?

You can then decide which kind of indexing to apply to the value added:

Field.Index.NO – The value is not indexed (it cannot be searched but only retrieved, provided it was stored)
Field.Index.TOKENIZED – The value is fully indexed, using the Analyzer (so the text is first tokenized, then normalized, and so on, as you read in the post about the main concepts of Lucene.net)
Field.Index.UN_TOKENIZED – The value is indexed, but without an analyzer, so it can be searched, but only as a single term.
Field.Index.NO_NORMS – Indexes without an analyzer and without storing the norms. This is an advanced option that allows you to reduce memory usage (norms take one byte per indexed field per document) but at the cost of disabling index-time boosting and length normalization.

So, when to use which? You have to think about how you are going to search for your documents. For example, if you don’t need to search using the post URL, but you only need it to link to the actual content, then you can safely use the NO option. On the other hand, you might need to search using the exact value of a short field, for example a category. In this case you don’t need to analyze it and break it into terms, so you can index it using the UN_TOKENIZED option and have the value indexed as a single term. And obviously, you need to use the TOKENIZED option for the content that needs to be full-text indexed.
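The difference between the options is easier to see with a small experiment. The sketch below assumes the Lucene.net 2.x API (IndexWriter, IndexSearcher, Hits) and uses an in-memory index; the IndexOptionsDemo class, the field names and the sample values are made up for the example.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

// Hypothetical demo class, not part of Lucene.net.
public static class IndexOptionsDemo
{
    // Returns the number of hits for four single-term queries.
    public static int[] Run()
    {
        RAMDirectory directory = new RAMDirectory();
        IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), true);

        Document doc = new Document();
        doc.Add(new Field("url", "http://example.com/posts/1",
            Field.Store.YES, Field.Index.NO));
        doc.Add(new Field("category", "ASP.NET MVC",
            Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.Add(new Field("body", "The MVC pattern is the best thing since sliced bread",
            Field.Store.NO, Field.Index.TOKENIZED));
        writer.AddDocument(doc);
        writer.Close();

        IndexSearcher searcher = new IndexSearcher(directory);
        int[] counts = new int[]
        {
            // UN_TOKENIZED: the whole value is one term, so the exact value is expected to match...
            Count(searcher, "category", "ASP.NET MVC"),
            // ...while a single word of it is not.
            Count(searcher, "category", "mvc"),
            // TOKENIZED: the analyzer splits and lowercases, so "mvc" is a term of the body.
            Count(searcher, "body", "mvc"),
            // NO: the url was never indexed, so it cannot be searched at all.
            Count(searcher, "url", "http://example.com/posts/1")
        };
        searcher.Close();
        return counts;
    }

    private static int Count(IndexSearcher searcher, string field, string text)
    {
        Hits hits = searcher.Search(new TermQuery(new Term(field, text)));
        return hits.Length();
    }
}
```

TermQuery is used here on purpose: it bypasses the analyzer and looks up the term exactly as given, which makes it visible what actually ended up in the index for each option.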

And what about Term Vectors?

The third option is TermVectors, but first we have to understand what a Term Vector is. A term vector represents all terms inside a field with the number of occurrences in the document. Usually this is not stored, but it’s useful for some more advanced types of queries like “MoreLikeThis” queries, Span queries and to highlight the matches inside the document. You have the following options:

Field.TermVector.NO – Term vectors are not stored. This is the default, used when you call one of the constructors that don’t take this option
Field.TermVector.YES – Term vectors are stored
Field.TermVector.WITH_POSITIONS – Stores term vector together with the position of each token
Field.TermVector.WITH_OFFSETS – Stores term vector together with the offset of tokens
Field.TermVector.WITH_POSITIONS_OFFSETS – Stores term vector with both position and offset

Term vectors are a more advanced option so, again, make sure you know which types of searches you are going to run against your index before turning them on.
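If the idea of a term vector is still a bit abstract, this toy model shows the kind of information that ends up in one. It is plain C# and does not use Lucene.net at all: the ToyTermVector and TermInfo names are made up, and the "analyzer" here is just a whitespace split plus lowercasing, which is much simpler than what a real analyzer does.

```csharp
using System;
using System.Collections.Generic;

// Toy model of one term vector entry (not a Lucene.net type).
public sealed class TermInfo
{
    public int Frequency;                               // what TermVector.YES keeps
    public List<int> Positions = new List<int>();       // added by WITH_POSITIONS
    public List<int> StartOffsets = new List<int>();    // added by WITH_OFFSETS
    public List<int> EndOffsets = new List<int>();      // added by WITH_OFFSETS
}

public static class ToyTermVector
{
    // Splits on whitespace, lowercases, and records every occurrence of each term.
    public static SortedDictionary<string, TermInfo> Build(string text)
    {
        var vector = new SortedDictionary<string, TermInfo>(StringComparer.Ordinal);
        int position = 0;
        int i = 0;
        while (i < text.Length)
        {
            if (char.IsWhiteSpace(text[i])) { i++; continue; }

            int start = i;
            while (i < text.Length && !char.IsWhiteSpace(text[i])) i++;
            string term = text.Substring(start, i - start).ToLowerInvariant();

            TermInfo info;
            if (!vector.TryGetValue(term, out info))
            {
                info = new TermInfo();
                vector[term] = info;
            }
            info.Frequency++;
            info.Positions.Add(position++);  // index of the token in the field
            info.StartOffsets.Add(start);    // character where the token starts
            info.EndOffsets.Add(i);          // character just after the token
        }
        return vector;
    }
}
```

For the text “The MVC pattern is better than the MVP pattern”, the entry for “pattern” has a frequency of 2 and positions 2 and 8. The character offsets are what a highlighter needs to mark the matches in the original text.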

Boosting

Another topic that is a bit more advanced but very powerful is boosting. In Lucene, boosting means the ability to make something (a document, a field, a search term) more important than the others.

For example, you might want matches on the title of a post to be more important than the ones on the content. To do so, you set a field boost on the title.

Field title = new Field("title", "This is my title",
    Field.Store.YES, Field.Index.TOKENIZED);
title.SetBoost(2.0f);

Or you might want to push one document more than the others, in which case you set a boost on the whole document. This means that when you perform a search, this document will be pushed up in the ranking.

Document doc = new Document();
//Add Fields
doc.SetBoost(2.0f);
writer.AddDocument(doc);

But beware that index-time boosting is disabled if you created the field using the Field.Index.NO_NORMS option. If you set the NO_NORMS option, the only boosting you can do is the one at search time. I’ll come to the search syntax in a future post, but here is a quick sample of how you can boost a query term.
If you want to search for documents that contain the terms “MVC” or “MVP”, but you want the documents that contain the term “MVC” to be ranked first, the query should be:

MVC^2 MVP

You use the caret “^” as if you were raising the term to the power of the boost you want to apply.
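The same search-time boost can be expressed in code. This is a minimal sketch that assumes the Lucene.net 2.x API; the BoostedQueries class and the “body” field name are made up for the example. The first method builds the query by hand, the second lets the QueryParser read the caret syntax.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

// Hypothetical helper, not part of Lucene.net.
public static class BoostedQueries
{
    // Builds "MVC^2 MVP" by hand on the (assumed) "body" field.
    public static Query BuildByHand()
    {
        // Terms are lowercase because the analyzer lowercased them at index time.
        TermQuery mvc = new TermQuery(new Term("body", "mvc"));
        mvc.SetBoost(2.0f);
        TermQuery mvp = new TermQuery(new Term("body", "mvp"));

        // SHOULD on both clauses means "either MVC or MVP".
        BooleanQuery query = new BooleanQuery();
        query.Add(mvc, BooleanClause.Occur.SHOULD);
        query.Add(mvp, BooleanClause.Occur.SHOULD);
        return query;
    }

    // Lets the QueryParser interpret the caret syntax.
    public static Query BuildWithParser()
    {
        QueryParser parser = new QueryParser("body", new StandardAnalyzer());
        return parser.Parse("MVC^2 MVP");
    }
}
```

Either query can be passed to the searcher. The parser version is what you would typically use when the query text comes from a search box.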

A practical example of boosting

Let’s see with an example what the effect of boosting is.

Imagine you have 3 documents in your index:

Doc A – “The MVC pattern is better than the MVP pattern”
Doc B – “The MVC pattern is the best thing since sliced bread”
Doc C – “The MVP pattern is too complex”

So, if you search for “MVC^2 MVP” it will return all 3 documents (this search means either MVC or MVP) in the following order:

Doc A – 3 points (2 points for MVC and 1 for MVP)
Doc B – 2 points (2 points for MVC)
Doc C – 1 point (1 point for MVP)

This is an oversimplification, and the actual ranking algorithm is much more complex, but it shows well the result of boosting.
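The point system above can be written down in a few lines of plain C#. To be clear, this is only the toy model from the example, not Lucene’s scoring formula: the ToyBoostScorer class is made up, and it simply adds up the boost of every query term that appears in the text.

```csharp
using System;
using System.Collections.Generic;

// Toy scorer for the example above (not how Lucene.net really ranks).
public static class ToyBoostScorer
{
    // Adds the boost of every query term that occurs at least once in the text.
    public static float Score(string text, IDictionary<string, float> boostedTerms)
    {
        // Split on spaces and compare ignoring case, a stand-in for a real analyzer.
        var words = new HashSet<string>(
            text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries),
            StringComparer.OrdinalIgnoreCase);

        float score = 0f;
        foreach (KeyValuePair<string, float> term in boostedTerms)
        {
            if (words.Contains(term.Key))
            {
                score += term.Value;
            }
        }
        return score;
    }
}
```

Calling Score with MVC boosted to 2 and MVP left at 1 gives 3 for Doc A, 2 for Doc B and 1 for Doc C, which is the order listed above. Without the boost, Doc B and Doc C would tie at 1 point each.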

What’s next?

Now that I have gone over all the main concepts of Lucene.net, in the next posts we are going to see how I’m planning to organize the index in Subtext.

Tags: lucene.net,document,field