Lucene.net: the main concepts

Lucene.net tutorial

In the previous post you learnt how to get a copy of Lucene.net and where to look for more information. As you noticed, the documentation is far from complete and not easy to read. So in this post I’ll write about the main concepts behind Lucene.net and the main steps in developing a solution based on it.

Some of the main concepts

Before looking at the development phases, it’s important to have a look at the main actors of Lucene.net.

Directory

The directory is where Lucene indexes are stored: it can be a physical folder on the filesystem (FSDirectory) or an area in memory where files are stored (RAMDirectory). The index structure is compatible with all ports of Lucene, so you could also build the index with .NET and search it with Java, or the other way around (obviously, using the filesystem directory).

IndexWriter

This component is responsible for managing the indexes: it creates indexes, adds documents to them and optimizes them.

Analyzer

This is where the complexity of the indexing resides. In a few words, the analyzer contains the policy for extracting index terms from the text. There are several analyzers available, both in the core library and in the contrib project, and the Java version has even more analyzers that have not been ported to .NET yet.

Probably the analyzer you’ll use the most is the StandardAnalyzer, which tokenizes the text based on European-language grammars, sets everything to lowercase and removes English stopwords.
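To make the three steps (tokenize, lowercase, remove stopwords) more concrete, here is a minimal sketch of the idea in plain Python. It is a toy model, not the StandardAnalyzer and not the Lucene.net API: the `analyze` function, the regular expression and the very short stop-word list are all assumptions made for illustration, and the real tokenizer follows far more elaborate rules.

```python
import re

# Tiny illustrative stop-word list (an assumption); real analyzers ship a longer one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is", "in"}

def analyze(text):
    """Lowercase the text, split it into word tokens and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]

print(analyze("The Main Concepts of Lucene.net"))
```

The list returned by `analyze` is what would end up in the index as terms: the casing is gone and words like “the” and “of” never reach the index at all.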

Another interesting analyzer is the SnowballAnalyzer, which works exactly like the standard one, but adds one more step at the end: the stemming phase, using the Snowball stemming language. Stemming is the process of reducing inflected words to their root. For example, if you are looking for “developing”, you are probably also interested in the words “developed”, “develop” or “developer”. During the indexing phase, the stemming process normalizes all these inflected words to their root “develop”. It does the same when querying the index (if you search for “development” it will search for “develop”). Obviously this is tied to the language of the text, so the snowball analyzer comes with many different “grammars” for that.
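The effect of stemming can be sketched with a deliberately naive suffix stripper. This is not the Snowball algorithm, which applies many more rules and conditions: the `toy_stem` function and its four-suffix list are hypothetical and only cover the example words used in this post.

```python
# Naive suffix list, chosen only to cover the example words (an assumption).
SUFFIXES = ("ment", "ing", "ed", "er")

def toy_stem(word):
    """Strip the first matching suffix, as long as a root of 3+ letters remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ("developing", "developed", "developer", "development", "develop"):
    print(word, "->", toy_stem(word))
```

Because the same function is applied both when indexing and when querying, all five words meet on the same term, which is why a search for one form finds documents containing the others.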

Document and Fields

A document is a single entity that is put into the index. It contains many fields which are, like in a database, the individual pieces of information that make up a document. Different fields can be indexed using different algorithms and analyzers. For example, you might just want to store the document id, without being able to search on it. But you want to be able to search by tags as single keywords and, finally, you want to index the body of a blog post for full text search (thus using the Analyzer and the tokenizers).
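The three treatments in that example (store only, index as a single keyword, index as analyzed text) can be modelled with a small data structure. This is a sketch in plain Python and not the Lucene.net Field class: the `ToyField` type, its `stored` and `index_mode` attributes and the `index_terms` helper are all names invented here, and the choice of which fields are stored is an assumption.

```python
from dataclasses import dataclass

@dataclass
class ToyField:
    name: str
    value: str
    stored: bool      # can the original value be read back from a search result?
    index_mode: str   # "no", "keyword" (one single term) or "analyzed"

def index_terms(field):
    """Return the terms this field would contribute to the index."""
    if field.index_mode == "no":
        return []                          # stored only, not searchable
    if field.index_mode == "keyword":
        return [field.value]               # the whole value is one term
    return field.value.lower().split()     # stand-in for a real analyzer

post = [
    ToyField("id", "42", stored=True, index_mode="no"),
    ToyField("tag", "Lucene.net", stored=True, index_mode="keyword"),
    ToyField("body", "Full text search with Lucene", stored=False, index_mode="analyzed"),
]

for field in post:
    print(field.name, index_terms(field))
```

The id produces no terms, so it can only be read back from a result; the tag matches only as a whole; the body is broken into individual searchable words.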

Since this is an important topic, I’ll write a more in-depth post about it in the future.

Searcher and IndexReader

The searcher is the component that, with the help of the IndexReader, scans the index files and returns results based on the query supplied.

QueryParser

The query parser is responsible for parsing a string of text to create a query object. It evaluates the Lucene query syntax and uses an analyzer (which should be the same one you used to index the text) to tokenize the individual terms.
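As a rough illustration of what “parsing a string into a query” means, the sketch below handles only the smallest piece of the query syntax: plain terms and `field:term` pairs. It is a toy in plain Python and not the Lucene.net QueryParser. The `parse_query` function, the default field name `body` and the use of lowercasing as a stand-in for the analyzer are assumptions.

```python
def parse_query(text, default_field="body"):
    """Turn 'tag:lucene Search' into a list of (field, term) clauses."""
    clauses = []
    for part in text.split():
        field, separator, term = part.partition(":")
        if not separator:
            # No explicit field: the term goes to the default field.
            field, term = default_field, part
        # Lowercasing stands in for running the term through the analyzer.
        clauses.append((field, term.lower()))
    return clauses

print(parse_query("tag:lucene Search"))
```

The lowercasing step shows why the analyzer must match the one used at indexing time: if the index holds “search” but the query keeps “Search”, the two terms never meet.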

The main development steps

And now let’s have a brief overview of the logical steps involved in integrating Lucene.net into your applications:

1 – Initialize Directory and IndexWriter

The first step is initializing the Directory and the IndexWriter. In a web application, like Subtext, this is most likely done at application startup, with the instance then stored in a global variable somewhere (or accessed through a Singleton), since only one IndexWriter can write to the Directory at a time.

When you create the IndexWriter you can also supply the analyzer that will be used by default to index all the text.

2 – Add Documents to the Index

Each document is made up of various Fields. You have to create a Document with all the Fields that must be indexed and also the ones you need in order to link the result to the real document that is being indexed (for example the id of the post).

Once you have created the Document, you have to add it to the index with the IndexWriter.

At this point, you can either add more documents or close the IndexWriter. The index will be saved to the Directory and can be re-opened later to add more Documents or to perform queries on it.
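What “adding a document to the index” does conceptually is update an inverted index, a map from each term to the documents that contain it. The sketch below is a toy in plain Python and not the Lucene.net IndexWriter: the `ToyIndexWriter` class, its `postings` dictionary and the whitespace tokenization are assumptions made for illustration, and nothing is written to disk.

```python
from collections import defaultdict

class ToyIndexWriter:
    def __init__(self):
        self.postings = defaultdict(set)   # term -> ids of documents containing it
        self.doc_count = 0

    def add_document(self, text):
        """Assign the next document id and record every term of the text."""
        doc_id = self.doc_count
        self.doc_count += 1
        for term in text.lower().split():  # stand-in for a real analyzer
            self.postings[term].add(doc_id)
        return doc_id

writer = ToyIndexWriter()
writer.add_document("Lucene.net main concepts")
writer.add_document("Searching with Lucene.net")
print(sorted(writer.postings["lucene.net"]))
print(sorted(writer.postings["concepts"]))
```

Looking up a term is then a single dictionary access, which is the reason a full text index answers queries much faster than scanning every document.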

3 – Create the Query

Once you have all your documents in the index, it’s time to do some queries.

You can create the query either via the QueryParser or by building a Query object directly via the API.

4 – Pass the Query to the IndexSearcher

Once you have the Query object you have to pass it to the Search method of an IndexSearcher.

One caveat is that the IndexSearcher sees the index only as it was at the time the searcher was opened. So in order to search over the most recent set of documents you have to re-open the IndexSearcher. But re-opening takes time and resources, so in a web application you might want to cache it somehow and re-open it periodically.
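This point-in-time behaviour can be shown with a small analogy in plain Python. It is not how Lucene.net implements it (the toy simply copies the data, which the real library does not need to do), and the `ToyIndex` and `ToySearcher` classes are invented for this sketch.

```python
import copy

class ToyIndex:
    def __init__(self):
        self.postings = {}                 # term -> set of document ids

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings.setdefault(term, set()).add(doc_id)

class ToySearcher:
    def __init__(self, index):
        # Freeze a private copy of the index as it is at opening time.
        self._snapshot = copy.deepcopy(index.postings)

    def search(self, term):
        return sorted(self._snapshot.get(term.lower(), set()))

index = ToyIndex()
index.add(1, "first post")
searcher = ToySearcher(index)              # opened before the second document
index.add(2, "second post")

stale = searcher.search("post")            # the old searcher misses document 2
fresh = ToySearcher(index).search("post")  # a re-opened searcher sees both
print(stale, fresh)
```

Caching the searcher and replacing it on a timer, or after a batch of writes, is a trade-off between how fresh the results are and how often you pay the cost of re-opening.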

5 – Iterate over the results

The Search method returns the results inside a Hits object, which contains all the documents that match the query, ordered by score. The score is computed by a fairly complex formula that tells you how relevant each document is to your query. For more information refer to the Lucene website: Scoring.
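To give a feeling for what a relevance score is, here is a heavily simplified ranking function based on the general tf-idf idea: a term counts more the more often it appears in a document, and less the more documents contain it. This is only a sketch in plain Python and not the formula Lucene.net uses, which has several more factors. The `rank` function and the sample documents are made up for illustration.

```python
import math

def rank(docs, query):
    """Return (doc_id, score) pairs for matching documents, best first."""
    if not docs:
        return []
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    scores = {}
    for term in query.lower().split():
        doc_freq = sum(1 for tokens in tokenized.values() if term in tokens)
        # Rarer terms get a higher weight (a simplified inverse document frequency).
        idf = 1.0 + math.log(len(docs) / (doc_freq + 1))
        for doc_id, tokens in tokenized.items():
            count = tokens.count(term)
            if count:
                # Repeated terms help, but with diminishing returns.
                scores[doc_id] = scores.get(doc_id, 0.0) + math.sqrt(count) * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

docs = {1: "lucene search", 2: "lucene lucene index", 3: "cooking recipes"}
for doc_id, score in rank(docs, "lucene"):
    print(doc_id, round(score, 3))
```

Document 2 comes first because it mentions the term twice, document 1 follows, and document 3 is not returned at all since it does not match.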

6 – Close everything

And once you are done with everything, close the IndexWriter, IndexSearcher and the Directory object. In a web application this is typically performed in the application shutdown event.

You just read about the main concepts behind Lucene.net. In a future post I’ll show how to integrate Lucene.net into a sample console application that puts together all the concepts discussed here.

Tags: Lucene.net

CodeClimber