Lucene.net: your first application

Lucene.net tutorial

How to get started with Lucene.net
Lucene.net: the main concepts
Lucene.net: your first application
Dissecting Lucene.net storage: Documents and Fields
Lucene - or how I stopped worrying, and learned to love unstructured data
How Subtext’s Lucene.net index is structured

In the first two posts of this tutorial you learned how to get the latest version of Lucene.net, where to find the (little) documentation available, what the main concepts of Lucene.net are, and what its main development steps look like.

In this third post I’m going to put into practice all the concepts explained in the previous post by writing a simple console application that indexes the text entered in the console.

I’ll refer to the steps I outlined in my previous post, so if you haven’t read it already, I recommend you go back and do so.

Step 1 – Initialize the Directory and the IndexWriter

As I said in my previous post, there are two Directory implementations you can use: one based on the file system (FSDirectory) and one based on RAM (RAMDirectory). You’d usually want the file-system-based one: it’s pretty fast anyway, and you don’t have to keep dumping an in-memory index to disk yourself. The RAM-based one is probably more of a test fake than something to use for real in production.

Once you have instantiated the Directory, you have to open an IndexWriter on it.

Directory directory = FSDirectory.GetDirectory("LuceneIndex");
Analyzer analyzer = new StandardAnalyzer();
IndexWriter writer = new IndexWriter(directory, analyzer);

If you don’t need a reference to the Directory (you don’t want to call additional methods on it) and all you want is an FSDirectory, you can use the short version and create the IndexWriter with a single line of code.

IndexWriter writer = new IndexWriter("LuceneIndex", analyzer);

Step 2 – Add Documents to the index

I’ll cover this topic in more depth in a subsequent post, but the basic code for adding a document to the index is pretty straightforward: create a document, add some fields to it, and then add the document to the index.

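Document doc = new Document();
// Stored but not indexed: returned with the results, but not searchable
doc.Add(new Field("id", i.ToString(), Field.Store.YES, Field.Index.NO));
// Stored and tokenized: full-text searchable through the analyzer
doc.Add(new Field("postBody", text, Field.Store.YES, Field.Index.TOKENIZED));
writer.AddDocument(doc);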

When you are done adding all the documents you need, you can call the Optimize method (“priming the index for the fastest available search”). After that, either call Flush to commit all the updates to the Directory or, if you don’t need to add anything else to the index, call Close to flush and then close all the files in the Directory.

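// Optional: prime the index for the fastest available search
writer.Optimize();
// Commit pending updates; only needed if the writer stays open
writer.Flush();
// Close flushes too, then closes all the files in the Directory
writer.Close();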

Step 3 – Create the Query

The Query can be created either via the API or by parsing Lucene query syntax with the QueryParser.

// Option 1: parse Lucene query syntax
QueryParser parser = new QueryParser("postBody", analyzer);
Query query = parser.Parse("text");

// Option 2: build the same query through the API
Query query = new TermQuery(new Term("postBody", "text"));

For a simple lowercase term like this one, the two snippets produce the same query, so when is it better to use the API and when the QueryParser? Personally, I would use the QueryParser when the search string is supplied by the user, and the API directly when the query is generated by your code.
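The difference between the two routes shows up as soon as the input is not already in the form the analyzer would produce. The sketch below is only an illustration and assumes the same Lucene.net 2.x API used in the rest of this post; the class name QueryComparison is made up for the example. The QueryParser passes the text through the StandardAnalyzer, which lowercases it, while the TermQuery keeps the term exactly as written, so the two printed queries should differ only in the case of the term.

```csharp
using System;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

// Hypothetical demo class, not part of the sample application.
public static class QueryComparison
{
    public static void Main()
    {
        Analyzer analyzer = new StandardAnalyzer();

        // The parser analyzes the input before building the query,
        // and StandardAnalyzer lowercases every token it emits.
        QueryParser parser = new QueryParser("postBody", analyzer);
        Query parsed = parser.Parse("Text");

        // The API route does no analysis: the term is used verbatim.
        Query exact = new TermQuery(new Term("postBody", "Text"));

        // Print both queries to compare the terms they search for.
        Console.WriteLine(parsed.ToString());
        Console.WriteLine(exact.ToString());
    }
}
```

Since the postBody field was indexed with the same analyzer, a hand-built TermQuery has to use the already-analyzed (here, lowercased) form of the term to match anything.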

Step 4 – Pass the Query to the IndexSearcher

Once you have your Query, all you need to do is pass it to the Search method of the IndexSearcher.

//Setup searcher
IndexSearcher searcher = new IndexSearcher(directory);
//Do the search
Hits hits = searcher.Search(query);

The Searcher must be instantiated before use and, for performance reasons, it’s recommended that you keep only one Searcher open. So open one and use it for all your searches. This might pose some issues in multi-threaded environments (like web applications), but we’ll come to this topic in a future post.

Step 5 – Iterate over the Results

The Search method returns a Hits object, which contains all the documents returned by the query. To list them, just loop over the Hits object by index.

int results = hits.Length();
Console.WriteLine("Found {0} results", results);
for (int i = 0; i < results; i++)
{
    Document doc = hits.Doc(i);
    float score = hits.Score(i);
    Console.WriteLine("Result num {0}, score {1}", i+1,score);
    Console.WriteLine("ID: {0}", doc.Get("id"));
    Console.WriteLine("Text found: {0}" + Environment.NewLine, doc.Get("postBody"));
}

You get the current Document using the Doc(num) method, and its score (which is a float) using the Score(num) method. You might notice that this is a pretty strange API compared to what we are used to in .NET: I would have expected to be able to do a foreach over the returned Hits object. This is probably because the API is a class-by-class port of the Java version, so it follows the API design conventions typical of the Java world. We could debate this purist-port approach versus a more idiomatic one for ages, but that’s the way it is.
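If you miss the foreach, a thin wrapper is enough to get it back. The following is a minimal sketch and not part of Lucene.net: HitsHelper and its Documents method are names I made up for the example, and it relies only on the Length() and Doc(num) methods shown above.

```csharp
using System.Collections.Generic;
using Lucene.Net.Documents;
using Lucene.Net.Search;

// Hypothetical helper, not part of Lucene.net.
public static class HitsHelper
{
    // Lazily yields each matching Document in the order Hits returns them.
    public static IEnumerable<Document> Documents(Hits hits)
    {
        int count = hits.Length();
        for (int i = 0; i < count; i++)
        {
            yield return hits.Doc(i);
        }
    }
}
```

With that in place the loop becomes foreach (Document doc in HitsHelper.Documents(hits)) { ... }. The trade-off is that you lose the index, so if you also need the score you still have to fall back to the plain for loop or extend the helper to return both.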

Step 6 – Close everything

Once you are done with everything, you need to close all the resources: Directory and IndexSearcher.

searcher.Close();
directory.Close();

Get the code

You can download a short sample application that stitches all that code together into a console application that lets you index any text you enter, and later search for it.

Download the sample code

What’s next

This was a very simple application: it was single-threaded and had both the indexing and the searching phases in the same piece of code. But before going into the details of the implementation I’m doing for Subtext, in the next post I’ll cover the concepts of documents and fields in more depth.
