Back in August and September I started a series on the main concepts of Lucene.net and began explaining the design behind the forthcoming Lucene.net-powered search engine for Subtext. In the last few months I finally found the time to sit down and implement it.

In this post I want to show you how I implemented it and, since I kept the points of contact with the Subtext domain model to a minimum, how you can adapt it for use in your own application.

Another reason behind this post is… to ask long-time Lucene users for comments on the implementation. So comments will be very welcome.

Let’s review how it is implemented, starting from the main class, SearchEngineService, which contains all the Lucene.net logic.

The Search Engine Service

Starting from the constructor:

public SearchEngineService(Directory directory,
        Analyzer analyzer,
        FullTextSearchEngineSettings settings)
{
    _directory = directory;
    _analyzer = analyzer;
    _settings = settings;
}

As you can see, it is not a singleton, even though it should be: there can be only one writer writing to the index at any given time. To achieve single-instance status we use our IoC container, Ninject. The registration of the service in the container is as follows:

Bind<Directory>()
    .ToMethod(c => FSDirectory.Open(new DirectoryInfo(
        c.Kernel.Get<HttpContext>().Server.MapPath("~/App_Data"))))
    .InSingletonScope();
Bind<Analyzer>().To<SnowballAnalyzer>().InSingletonScope()
    .WithConstructorArgument("name",
        c => c.Kernel.Get<FullTextSearchEngineSettings>().Language)
    .WithConstructorArgument("stopWords",
        c => c.Kernel.Get<FullTextSearchEngineSettings>().StopWordsArray);

Bind<ISearchEngineService>().To<SearchEngineService>().InSingletonScope();
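
With these bindings in place, anyone asking the kernel for an ISearchEngineService receives the same instance. A minimal sketch of the resolution, assuming the bindings above live in a Ninject module (SearchEngineModule here is an illustrative name, not the actual Subtext class):

// Hypothetical resolution: SearchEngineModule is an illustrative name.
IKernel kernel = new StandardKernel(new SearchEngineModule());
var first = kernel.Get<ISearchEngineService>();
var second = kernel.Get<ISearchEngineService>();
// InSingletonScope guarantees first and second reference the same object.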

Back to the constructor: it simply stores all the dependencies.

The creation of the Writer

We defer the creation of the index writer until the first time it is actually needed, and this happens inside the EnsureIndexWriter method. This method is always called from inside a lock because we want to avoid threading issues: otherwise two different requests might try to create an index writer at the same time, which is bad since there can be only one index writer per index. The locking is needed because of a small issue with Ninject that, either due to a bug or due to how we use it, created the service twice even though it was registered with InSingletonScope.

EnsureIndexWriter is called before any operation that needs the writer to exist; all such operations must go through the DoWriterAction<T> method.

private T DoWriterAction<T>(Func<IndexWriter,T> action)
{
    lock (WriterLock)
    {
        // Lazy creation must be serialized: only one writer per index.
        EnsureIndexWriter();
    }
    // IndexWriter is itself thread-safe, so the action runs outside the lock.
    return action(_writer);
}

// Method should only be called from within a lock.
void EnsureIndexWriter()
{
    if(_writer == null)
    {
        if(IndexWriter.IsLocked(_directory))
        {
            Log.Error("Something left a lock in the index folder: deleting it");
            IndexWriter.Unlock(_directory);
            Log.Info("Lock Deleted... can proceed");
        }
        _writer = new IndexWriter(_directory, _analyzer,
                        IndexWriter.MaxFieldLength.UNLIMITED);
        _writer.SetMergePolicy(new LogDocMergePolicy(_writer));
        _writer.SetMergeFactor(5);
    }
}

During creation I also check for a leftover lock on the index (if the application ends abruptly, the lock file is sometimes not deleted) and set a custom merge policy based on the number of documents instead of the size in bytes.

Adding documents

Adding a document to the index is a pretty simple operation:

  • I delete any previous document with the same document id (line 6)
  • I add the document to the index (line 10)
  • When I’m done with all the posts I commit the writes (line 19)
  • Finally, if this is part of a mass indexing run, I optimize the index (line 22)
   1:  public IEnumerable<IndexingError> AddPosts(IEnumerable<SearchEngineEntry> posts, bool optimize)
   2:  {
   3:      IList<IndexingError> errors = new List<IndexingError>();
   4:      foreach (var post in posts)
   5:      {
   6:          ExecuteRemovePost(post.EntryId);
   7:          try
   8:          {
   9:              var currentPost = post;
  10:              DoWriterAction(writer => writer.AddDocument(CreateDocument(currentPost)));
  11:          }
  12:          catch(Exception ex)
  13:          {
  14:              errors.Add(new IndexingError(post, ex));
  15:          }
  16:      }
  17:      DoWriterAction(writer =>
  18:      {
  19:          writer.Commit();
  20:          if(optimize)
  21:          {
  22:              writer.Optimize();
  23:          }
  24:      });
  25:      return errors;
  26:  }

CreateDocument is just a utility method that creates the fields of the Lucene document.
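
The method itself isn't shown in this post; here is a minimal sketch of what it might look like. The field names reuse the constants seen elsewhere in the service, but the entry properties and the Store/Index options are my assumptions, not necessarily the real Subtext mapping:

// Hypothetical sketch: maps a SearchEngineEntry to a Lucene Document.
// Property names (Title, Body, Tags, IsPublished, BlogId) and the
// Store/Index options are assumptions.
private Document CreateDocument(SearchEngineEntry post)
{
    var doc = new Document();
    // Ids and flags: not analyzed, so they can be filtered and deleted by exact term.
    doc.Add(new Field(EntryId, post.EntryId.ToString(),
                      Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.Add(new Field(BlogId, post.BlogId.ToString(),
                      Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.Add(new Field(Published, post.IsPublished.ToString(),
                      Field.Store.YES, Field.Index.NOT_ANALYZED));
    // Full-text fields: analyzed, so the queries in the next section can hit them.
    doc.Add(new Field(Title, post.Title, Field.Store.YES, Field.Index.ANALYZED));
    doc.Add(new Field(Body, post.Body, Field.Store.NO, Field.Index.ANALYZED));
    doc.Add(new Field(Tags, post.Tags, Field.Store.NO, Field.Index.ANALYZED));
    return doc;
}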

As you may have noticed, all the operations that require a writer go through the DoWriterAction method.
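
The same is true of ExecuteRemovePost, called at line 6 of AddPosts. A minimal sketch of it, assuming an Action<IndexWriter> overload of DoWriterAction (the commit block in AddPosts already uses one) and the EntryId field name constant:

// Hypothetical sketch: deletes any document indexed for the given entry id.
private void ExecuteRemovePost(int entryId)
{
    DoWriterAction(writer =>
        writer.DeleteDocuments(new Term(EntryId, entryId.ToString())));
}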

Performing Queries

We have two different types of queries in Subtext: the normal full-text query and the similarity query. Both rely on the same PerformQuery method.

private IEnumerable<SearchEngineResult> PerformQuery(
        ICollection<SearchEngineResult> list,
        Query queryOrig, int max, int blogId, int idToFilter)
{
    Query isPublishedQuery = new TermQuery(new Term(Published, true.ToString()));
    Query isCorrectBlogQuery = GetBlogIdSearchQuery(blogId);
    
    var query = new BooleanQuery();
    query.Add(isPublishedQuery, BooleanClause.Occur.MUST);
    query.Add(queryOrig, BooleanClause.Occur.MUST);
    query.Add(isCorrectBlogQuery, BooleanClause.Occur.MUST);
    IndexSearcher searcher = Searcher;
    TopDocs hits = searcher.Search(query, max);
    int length = hits.scoreDocs.Length;
    int resultsAdded = 0;
    float minScore = _settings.MinimumScore;
    float scoreNorm = 1.0f / hits.GetMaxScore(); 
    for (int i = 0; i < length && resultsAdded < max; i++)
    {
        float score = hits.scoreDocs[i].score * scoreNorm;
        SearchEngineResult result = CreateSearchResult(searcher.Doc(hits.scoreDocs[i].doc), score);
        if (idToFilter != result.EntryId
             && result.Score > minScore
             && result.PublishDate < DateTime.Now)
        {
            list.Add(result);
            resultsAdded++;
        }
    }
    return list;
}

This method receives the main query as a parameter and enriches it with more clauses: the publish status and the blog id.

The method then runs the search against the index and, for each result returned, computes a normalized score, filtering out posts below the minimum score and those with a publish date in the future.
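
Neither GetBlogIdSearchQuery nor the GetIdSearchQuery used later in the similarity search is shown here; plausibly they are simple term queries on the not-analyzed id fields, along these lines (a sketch; the BlogId and EntryId field name constants are assumptions):

// Hypothetical sketches: exact-match term queries on the id fields.
private Query GetBlogIdSearchQuery(int blogId)
{
    return new TermQuery(new Term(BlogId, blogId.ToString()));
}

private Query GetIdSearchQuery(int entryId)
{
    return new TermQuery(new Term(EntryId, entryId.ToString()));
}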

Normal full-text search

To perform the simple search, the term entered by the user is duplicated so that it is searched across all the textual fields: title, body, and tags.

public IEnumerable<SearchEngineResult> Search(string queryString, int max, int blogId, int entryId)
{
    var list = new List<SearchEngineResult>();
    if (String.IsNullOrEmpty(queryString)) return list;
    QueryParser parser = BuildQueryParser();
    Query bodyQuery = parser.Parse(queryString);

    string queryStringMerged = String.Format("({0}) OR ({1}) OR ({2})",
                               bodyQuery,
                               bodyQuery.ToString().Replace("Body", "Title"),
                               bodyQuery.ToString().Replace("Body", "Tags"));

    Query query = parser.Parse(queryStringMerged);

    return PerformQuery(list, query, max, blogId, entryId);
}
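
BuildQueryParser isn't shown above either; a plausible minimal version, assuming the version-aware QueryParser constructor from Lucene.Net 2.9 and Body as the default field (the Replace("Body", …) calls above rely on that):

// Hypothetical sketch: Body is the default field, so the parsed query's
// string form can be cloned for Title and Tags via Replace.
private QueryParser BuildQueryParser()
{
    return new QueryParser(Version.LUCENE_29, Body, _analyzer);
}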

Similarity (more like this)

This other kind of search is a bit more complicated, and makes use of the MoreLikeThis similarity search available in the contrib package.

public IEnumerable<SearchEngineResult> RelatedContents(int entryId, int max, int blogId)
{
    var list = new List<SearchEngineResult>();

    //First look for the original doc
    Query query = GetIdSearchQuery(entryId);
    TopDocs hits = Searcher.Search(query, max);

    if(hits.scoreDocs.Length <= 0) 
    {
        return list;
    }

    int docNum = hits.scoreDocs[0].doc;

    //Setup MoreLikeThis searcher
    var reader = DoWriterAction(w => w.GetReader());
    var mlt = new MoreLikeThis(reader);
    mlt.SetAnalyzer(_analyzer);
    mlt.SetFieldNames(new[] { Title, Body, Tags });
    mlt.SetMinDocFreq(_settings.Parameters.MinimumDocumentFrequency);
    mlt.SetMinTermFreq(_settings.Parameters.MinimumTermFrequency);
    mlt.SetBoost(_settings.Parameters.MoreLikeThisBoost);

    var moreResultsQuery = mlt.Like(docNum);
    return PerformQuery(list, moreResultsQuery, max+1, blogId, entryId);
}
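
To see both query paths side by side, here is an illustrative usage example (the values are made up, not taken from Subtext):

// Illustrative usage only: both query paths go through the same service.
var searchEngine = kernel.Get<ISearchEngineService>();

// Full-text search: query string, max results, blog id, entry id to exclude.
var results = searchEngine.Search("merge policy", 20, 1, 0);

// Similarity search: posts related to entry 42, capped at 5, for blog 1.
var related = searchEngine.RelatedContents(42, 5, 1);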

Disposing the service

Since Lucene makes use of external resources on the filesystem, it's super-important to close all of them before disposing the service.

~SearchEngineService()
{
    Dispose();
}

public void Dispose()
{
    lock(WriterLock)
    {
        if(!_disposed)
        {
            //Never checking for disposing = true because there are
            //no managed resources to dispose

            var writer = _writer;

            if(writer != null)
            {
                try
                {
                    writer.Close();
                }
                catch(ObjectDisposedException e)
                {
                    Log.Error("Exception while disposing SearchEngineService", e);
                }
                _writer = null;
            }

            var directory = _directory;
            if(directory != null)
            {
                try
                {
                    directory.Close();
                }
                catch(ObjectDisposedException e)
                {
                    Log.Error("Exception while disposing SearchEngineService", e);
                }
            }

            _disposed = true;
        }
    }
    GC.SuppressFinalize(this);
}

The Indexing Service and the UI components

This post has grown beyond the size I was planning, and I still haven't talked about the other components of Subtext's search feature. I'll cover them in a future post.

If you want to start having a look at the code, you can either get the source or just browse the code online.

Please, review this code

One of the reasons behind this post was to ask people who have used Lucene.net for comments on our implementation. So, please, comments are welcome. Thank you!