
February 2010 Blog Posts

Lucene.net is powering Subtext 2.5 search

Back in August and September I started a series covering the main concepts of Lucene.net, and I began to explain the design behind the forthcoming Lucene.net-powered search engine for Subtext. In the last few months I finally had the time to sit down and implement it.

In this post I want to show you how I implemented it and, since I kept the points of contact with the Subtext domain model to a minimum, how you can adapt it for use in your own application.

Another reason behind this post is… asking long-time Lucene users for comments on the implementation. So comments are more than welcome.

Let’s review how it is implemented, starting from the main class, SearchEngineService, which contains all the Lucene.net logic.

The Search Engine Service

Starting from the constructor:

public SearchEngineService(Directory directory,
        Analyzer analyzer,
        FullTextSearchEngineSettings settings)
{
    _directory = directory;
    _analyzer = analyzer;
    _settings = settings;
}

As you can notice it is not a singleton, even though it should be, since there can be only one writer writing to the index. To achieve single-instance status we rely on our IoC container, Ninject. The registration of the service in the container is as follows:

Bind<Directory>()
    .ToMethod(c => FSDirectory.Open(new DirectoryInfo(
        c.Kernel.Get<HttpContext>().Server.MapPath("~/App_Data"))))
    .InSingletonScope();
Bind<Analyzer>().To<SnowballAnalyzer>().InSingletonScope()
    .WithConstructorArgument("name",
        c => c.Kernel.Get<FullTextSearchEngineSettings>().Language)
    .WithConstructorArgument("stopWords",
        c => c.Kernel.Get<FullTextSearchEngineSettings>().StopWordsArray);

Bind<ISearchEngineService>().To<SearchEngineService>().InSingletonScope();
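
With this registration in place, every resolution of ISearchEngineService returns the same instance. As a purely hypothetical illustration (in Subtext the service is constructor-injected rather than resolved by hand):

//Hypothetical: resolving the singleton service directly from the kernel
var searchEngine = kernel.Get<ISearchEngineService>();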

All the constructor does is set the dependencies.

The creation of the Writer

We defer the creation of the index writer until the first time it actually needs to be used; this happens inside the EnsureIndexWriter method. That method is always called from inside a lock, because we want to avoid threading issues: otherwise two different requests might try to create an index writer at the same time, which is bad since there can be only one index writer per index. We need this safeguard because of a small issue with Ninject that, either because of a bug or because of how we use it, sometimes created the service twice even though it was registered with InSingletonScope.

EnsureIndexWriter is called before any operation that needs a writer to exist; all such operations must go through the DoWriterAction<T> method.

private T DoWriterAction<T>(Func<IndexWriter,T> action)
{
    //The lock only guards the lazy creation of the writer:
    //the IndexWriter itself is safe for concurrent use
    lock (WriterLock)
    {
        EnsureIndexWriter();
    }
    return action(_writer);
}

// Method should only be called from within a lock.
void EnsureIndexWriter()
{
    if(_writer == null)
    {
        if(IndexWriter.IsLocked(_directory))
        {
            Log.Error("Something left a lock in the index folder: deleting it");
            IndexWriter.Unlock(_directory);
            Log.Info("Lock Deleted... can proceed");
        }
        _writer = new IndexWriter(_directory, _analyzer,
                        IndexWriter.MaxFieldLength.UNLIMITED);
        _writer.SetMergePolicy(new LogDocMergePolicy(_writer));
        _writer.SetMergeFactor(5);
    }
}

During creation I also check for a leftover lock on the index (if the application ends abruptly, the lock file is sometimes not deleted) and set a custom merge policy based on the number of documents instead of the size in bytes.

Adding documents

Adding a document to the index is a pretty simple operation:

  • I delete any previous document with the same document id (line 6)
  • I add the document to the index (line 10)
  • When I’m done with all the posts I commit the writes (line 19)
  • Finally, if this is part of a mass indexing, I optimize the index (line 22)
   1:  public IEnumerable<IndexingError> AddPosts(IEnumerable<SearchEngineEntry> posts, bool optimize)
   2:  {
   3:      IList<IndexingError> errors = new List<IndexingError>();
   4:      foreach (var post in posts)
   5:      {
   6:          ExecuteRemovePost(post.EntryId);
   7:          try
   8:          {
   9:              var currentPost = post;
  10:              DoWriterAction(writer => writer.AddDocument(CreateDocument(currentPost)));
  11:          }
  12:          catch(Exception ex)
  13:          {
  14:              errors.Add(new IndexingError(post, ex));
  15:          }
  16:      }
  17:      DoWriterAction(writer =>
  18:      {
  19:          writer.Commit();
  20:          if(optimize)
  21:          {
  22:              writer.Optimize();
  23:          }
  24:      });
  25:      return errors;
  26:  }

CreateDocument is just a utility method that creates the fields of the Lucene document.
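
To give you an idea of what it might look like, here is a minimal sketch of such a method: the field names mirror the constants used elsewhere in this post (EntryId is my guess), but the Store/Index options and the SearchEngineEntry properties are assumptions, not the actual Subtext code.

private Document CreateDocument(SearchEngineEntry post)
{
    //Sketch only: Store/Index options and entry properties are assumptions
    var doc = new Document();
    doc.Add(new Field(EntryId, post.EntryId.ToString(),
                      Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.Add(new Field(Title, post.Title,
                      Field.Store.YES, Field.Index.ANALYZED));
    doc.Add(new Field(Body, post.Body,
                      Field.Store.NO, Field.Index.ANALYZED));
    doc.Add(new Field(Tags, post.Tags ?? String.Empty,
                      Field.Store.YES, Field.Index.ANALYZED));
    doc.Add(new Field(Published, post.IsPublished.ToString(),
                      Field.Store.NO, Field.Index.NOT_ANALYZED));
    return doc;
}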

As you can notice, all the operations that require a writer are invoked through the DoWriterAction method.

Performing Queries

We have two different types of queries in Subtext: the normal full-text query and the similarity query. They both rely on the same PerformQuery method.

private IEnumerable<SearchEngineResult> PerformQuery(
        ICollection<SearchEngineResult> list,
        Query queryOrig, int max, int blogId, int idToFilter)
{
    Query isPublishedQuery = new TermQuery(new Term(Published, true.ToString()));
    Query isCorrectBlogQuery = GetBlogIdSearchQuery(blogId);
    
    var query = new BooleanQuery();
    query.Add(isPublishedQuery, BooleanClause.Occur.MUST);
    query.Add(queryOrig, BooleanClause.Occur.MUST);
    query.Add(isCorrectBlogQuery, BooleanClause.Occur.MUST);
    IndexSearcher searcher = Searcher;
    TopDocs hits = searcher.Search(query, max);
    int length = hits.scoreDocs.Length;
    int resultsAdded = 0;
    float minScore = _settings.MinimumScore;
    float scoreNorm = 1.0f / hits.GetMaxScore(); 
    for (int i = 0; i < length && resultsAdded < max; i++)
    {
        float score = hits.scoreDocs[i].score * scoreNorm;
        SearchEngineResult result = CreateSearchResult(searcher.Doc(hits.scoreDocs[i].doc), score);
        if (idToFilter != result.EntryId
             && result.Score > minScore
             && result.PublishDate < DateTime.Now)
        {
            list.Add(result);
            resultsAdded++;
        }
    }
    return list;
}

This method receives the main query as a parameter and enriches it with additional clauses: the publish status and the blog id.

It then runs the search against the index and, for each result returned, computes the normalized score, filtering out posts below a minimum score and posts scheduled for publication in the future.
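
Two members used above are not shown in the post: GetBlogIdSearchQuery and the Searcher property. Here is a minimal sketch of what they might look like; the "BlogId" field name and the use of a reader obtained from the writer are assumptions, not the actual Subtext code.

private Query GetBlogIdSearchQuery(int blogId)
{
    //Assumes a not-analyzed "BlogId" field was added at indexing time
    return new TermQuery(new Term("BlogId", blogId.ToString()));
}

private IndexSearcher Searcher
{
    get
    {
        //A near-real-time reader obtained from the writer makes
        //uncommitted documents searchable; real code would also
        //have to manage the lifetime of these readers
        return new IndexSearcher(DoWriterAction(w => w.GetReader()));
    }
}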

Normal full-text search

To perform the simple search, the term entered by the user must be duplicated to be searched in all the textual fields: title, body and tags.

public IEnumerable<SearchEngineResult> Search(string queryString, int max, int blogId, int entryId)
{
    var list = new List<SearchEngineResult>();
    if (String.IsNullOrEmpty(queryString)) return list;
    QueryParser parser = BuildQueryParser();
    Query bodyQuery = parser.Parse(queryString);

    string queryStringMerged = String.Format("({0}) OR ({1}) OR ({2})",
                               bodyQuery,
                               bodyQuery.ToString().Replace("Body", "Title"),
                               bodyQuery.ToString().Replace("Body", "Tags"));

    Query query = parser.Parse(queryStringMerged);

    return PerformQuery(list, query, max, blogId, entryId);
}
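
BuildQueryParser is not shown in the post; given how Search rewrites the parsed query, it presumably builds a parser whose default field is the body. A minimal sketch, assuming the two-argument Lucene.net 2.9 constructor:

private QueryParser BuildQueryParser()
{
    //With Body as the default field, a bare term like "lucene" is
    //parsed as Body:lucene, which Search then duplicates onto the
    //Title and Tags fields
    return new QueryParser(Body, _analyzer);
}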

Similarity (more like this)

This other kind of search is a bit more complicated, and makes use of the MoreLikeThis similarity search that is available in the Lucene.net contrib package.

public IEnumerable<SearchEngineResult> RelatedContents(int entryId, int max, int blogId)
{
    var list = new List<SearchEngineResult>();

    //First look for the original doc
    Query query = GetIdSearchQuery(entryId);
    TopDocs hits = Searcher.Search(query, max);

    if(hits.scoreDocs.Length <= 0) 
    {
        return list;
    }

    int docNum = hits.scoreDocs[0].doc;

    //Setup MoreLikeThis searcher
    var reader = DoWriterAction(w => w.GetReader());
    var mlt = new MoreLikeThis(reader);
    mlt.SetAnalyzer(_analyzer);
    mlt.SetFieldNames(new[] { Title, Body, Tags });
    mlt.SetMinDocFreq(_settings.Parameters.MinimumDocumentFrequency);
    mlt.SetMinTermFreq(_settings.Parameters.MinimumTermFrequency);
    mlt.SetBoost(_settings.Parameters.MoreLikeThisBoost);

    var moreResultsQuery = mlt.Like(docNum);
    return PerformQuery(list, moreResultsQuery, max+1, blogId, entryId);
}
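
To close the loop, this is roughly how a consumer of the service could call the two query methods (a hypothetical call site with made-up values, not actual Subtext UI code):

//Hypothetical call site: the ids and limits are made up
var results = searchEngine.Search("lucene index", 10, blogId, 0);
var related = searchEngine.RelatedContents(entryId, 5, blogId);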

Disposing the service

Since Lucene makes use of external resources on the filesystem, it’s super-important to close all of them before disposing of the service.

~SearchEngineService(){
    Dispose();
}

public void Dispose()
{
    lock(WriterLock)
    {
        if(!_disposed)
        {
            //Never checking for disposing = true because there are
            //no managed resources to dispose

            var writer = _writer;

            if(writer != null)
            {
                try
                {
                    writer.Close();
                }
                catch(ObjectDisposedException e)
                {
                   Log.Error("Exception while disposing SearchEngineService", e); 
                }
                _writer = null;
            }

            var directory = _directory;
            if(directory != null)
            {
                try
                {
                    directory.Close();
                }
                catch(ObjectDisposedException e)
                {
                    Log.Error("Exception while disposing SearchEngineService", e);
                }
            }

            _disposed = true;
        }
    }
    GC.SuppressFinalize(this);
}

The Indexing Service and the UI components

This post grew beyond the size I was planning, and I still haven’t talked about the other components of the search engine feature of Subtext. I’ll cover them in a future post.

If you want to start having a look at the code, you can either get the source or just browse the code online.

Please, review this code

One of the reasons behind this post was to ask people who have used Lucene.net for comments on our implementation. So, please, comments are welcome. Thank you!

Where I'll be during the MVP Summit

And finally the Summit has arrived: on Monday I'm leaving from Milano (at 6:50am, which means waking up at 3:00am) and arriving in Bellevue on Monday afternoon. I'll be staying at the Hyatt Regency Bellevue.

It will be a pretty packed week, with sessions all day and parties every night. This is my "tentative" schedule for the parties.

I'll then spend Saturday sightseeing, and fly back to Milano on Sunday morning (arriving in Milano on Monday morning).

I'm pretty excited about all this, mainly because I'll have the chance to meet in person most of the guys (and gals) I interact with on twitter and via the various OSS project mailing lists.

See you there.

So Long Avanade, and Thanks for All the Fish

If you follow me on twitter you might already have figured something out: last week I gave my two months' notice to my employer, Avanade Italy. I've been working with them since the end of 2007, and during those almost two and a half years I learned a lot. In particular, I added to my skill set some competencies I never had the chance to practice while working for a web agency or for a product company: the so-called consulting skills. I also had the great opportunity to coach on the job some junior developers fresh out of university (or with little experience): some were more receptive than others, but in the end I hope they learned some of the principles of good coding and software design.

I also worked with many great colleagues, some of whom I’m really going to miss.

So, why am I leaving?

Avanade is a great company, so why am I leaving?

  • Web Development - One thing was really lacking in Avanade: some real web development. I'm a web developer at heart, and in Italy all the "real" web development (by "real" I mean big B2C websites or big online magazines/newspapers) is done by web agencies like the one I worked for before going to NZ.
  • Product expert vs software developer - It seems like all the big consulting companies are trying to sell solutions based on products that require very little to almost no development work: SharePoint, CRM, BizTalk, Commerce Server and so on. I’m not interested in becoming an expert in customizing a specific product: I’m more interested in “custom development” and in the way people work together.
  • Italy is getting worse and worse every day - Italy is a mess: it might have great natural and historical places, but if you are not a tourist, Italy is not a good place to live.
  • No “my office” - The last reason is that the working conditions in the consulting industry are really bad: 90% of your time is spent at the customer's site, working on a small 15” laptop, sitting in temporary spots with just enough room to fit your laptop and move the mouse. And I really miss having a place I can call “my office”.

And where am I going?

I think you might have heard of my new employer: it’s called the European Union; specifically, I’ll be working at the Council of the European Union.


I applied to the open competition for becoming an EU official in the field of IT when I was still in New Zealand, in the summer of 2007. It took a year and a half to pass all the stages of the competition, and then another year for someone to pick me from the pool of possible employees. Starting from the 1st of April, I’ll join the IT department as Team Lead and Architect of the team that is building all the public-facing web sites of the Council… and as you have probably seen for yourself, there is a lot of work to do.

So, after more than 3 years, I’ll work on public websites again. And my first task will be moving the team away from VSS Hell and steering it toward a more agile way of developing software.

Moving away from Italy

But I’m not only changing jobs, I’m also changing countries: I’m going to live in Brussels, Belgium.

Unlike 3 years ago, I’m not going abroad because I want to live in a specific country, but because I want to get away from Italy: I already know I’ll miss the Alps and the lakes but, as I said before, living here is becoming very difficult. So I’ll consider myself in self-exile, and I’ll come back to Italy if/when things get better.

And, if you were wondering, I’m moving with my wife Daniela, who resigned as well and will look for a job in UX in Belgium after she learns some French.

What changes in my development community involvement?

A nice thing about Belgium is that there is a vibrant development community, both in the .NET space and in the open-source space. I already know some developers from Belgium, like Ivan, whom I met in New Zealand, and some other MVPs I know through blogs and twitter. And I’ll try to get to know more of them at the upcoming MVP Summit in two weeks.

Also, being in the middle of Europe means it will be easier to attend all the conferences held in London, Amsterdam and Scandinavia.

But I’ll also keep on working together with Emanuele and Claudio to organize the future editions of the Italian ALT.NET Conference.

PS: Since many people are asking: “So Long, and Thanks for All the Fish” is the title of a book from the Hitchhiker's Guide to the Galaxy series written by Douglas Adams. It’s also a geeky way to say goodbye with a hidden meaning.

PPS: If you know Italian, there is also a similar announcement on my Italian blog: Self-Exile.