September 2009 Blog Posts

Sergejus is running Oxite on Azure and thinking in .NET

My friend Sergejus Barinovas, developer, community lead, speaker and much more (I think he is also the only MS MVP in Lithuania), has just started a new blogging experience: after running his Lithuanian blog for two years, he decided to start sharing his knowledge, experience and passion for development in English, on his new blog called Thinking in .NET.

He will mainly talk about web development, Ajax, data access technologies and best practices.

One of the cool things about his blog is that it runs on Oxite and is hosted on Windows Azure and SQL Azure. In his first post Sergejus writes about how to set up the Oxite database on SQL Azure, and in the next post he will talk more about setting up Oxite on Windows Azure.

I really recommend subscribing to his feed. He is also @sergejusb on twitter.

Speaking in Rome at the Gladiators Fest about ASP.NET MVC (Oct 21)

I'm glad to announce that I'll be speaking in Rome at the Gladiators Fest (see the original announcement in Italian): this is the first event organized by the .NET user group in Rome, and I've been invited by Emanuele Mattei to talk about my IT passion: ASP.NET MVC.

I'll hold two sessions:

  • the first, just before the lunch break, will be about some of the best practices for developing web applications based on ASP.NET MVC (level 300)
  • immediately after lunch, I'll talk about the new features that are going to be introduced in ASP.NET MVC v2. Since I don't think that will fill a whole hour, there will also be a Q&A session (level 100)

The event is planned for Wednesday October 21st, and the location is a bit outside Rome, at the SAP Auditorium (map).

For more information, the agenda, the list of speakers and directions for reaching the venue by public transportation, please read the official event page: DotNetRomaCeStà – Evento Gladiatori Fest (in English).

I'm very excited to participate in this event, since it's the community's kick-off event and it has been in the works for a long time.

I want to thank Emanuele for giving me this opportunity, and Wrox for providing a few copies of my Beginning ASP.NET MVC book to give away at the event.

See you there.

And the winners of the Mindscape Professional Pack are...

... Martin Harris, from Wellington, NZ (@BlackMael) and Andrea Balducci, from Castelfidardo, Italy (@andreabalducci). They both won a yearly subscription to the Mindscape Mega Pack Professional, which includes a license of LightSpeed Professional, all the WPF controls developed by Mindscape, future updates, and any other software and controls this prolific software factory from Wellington will release in the next 12 months.

Martin is the "official" winner, selected by picking a random tweet from all the ones tweeted. He won by tweeting the following message:

RT @simonech isolated in the void my only salvation to retreat into self with hope for #mindscape #giveaway http://bit.ly/MindscapeGiveaway

His entry complies with all the rules: he commented on my post, he follows me, and his message is creative.

Andrea, instead, was not chosen by the random draw: he won because he re-tweeted every day since last Friday, so JD and I decided to give him a license as well for his effort. Furthermore, he was the first to comment on the blog and his tweets were the most creative, especially the first ones:

RT routes.MapRoute("@simonech","{c}/{a}/{i}",new {c="prize",a="for",i="me"}); #mindscape #giveaway http://bit.ly/MindscapeGiveaway auguri ;D

RT if(bdays["@simonech"]==today.AddDays(-1)) contest["#mindscape #giveaway"].Join();Response.Redirect("http://bit.ly/MindscapeGiveaway");

RT @simonech 342064617973206c65667420746f206a6f696e2074686520636f6e7465737421 #mindscape #giveaway http://bit.ly/MindscapeGiveaway
[the hex decodes to: "4 days left to join the contest!"]

Congratulations to both Andrea and Martin.

Simone Chiaretta turns 35 and gives away a Mindscape Mega Pack professional subscription

UPDATE: The winner has been announced: And the winners of the Mindscape Professional Pack are...

Today it's my 35th birthday, but instead of boring you with the usual "what happened in the last year" kind of post, I'm going to give one of my readers and Twitter followers a gift.

But before the prize, I just want to do a very quick roundup of what my 34th year brought: I became an MVP, I became an ASP Insider, I published my first book and I took part in my first triathlon. So quite a good year after all.

But back to the gift.

The prize

Thanks to JD Trask and Mindscape (the cool Wellington-based company I was lucky enough to share an office with when I lived in New Zealand), today you have the chance to win one license of the Mindscape Mega Pack Professional, which is worth $599!

It contains all of Mindscape's products (LightSpeed, all the WPF controls and the SimpleDB tools) and a 12-month subscription: this means that the winner will receive all the updates and also any new products Mindscape will release in the next year, like the new WPF controls, Silverlight controls and more.

The Giveaway Rules

The giveaway starts today, Friday September 11th, 2009 and will last until Thursday September 17th, 2009 at 11:59PM CEST (date and time in other timezones). I'll then announce the winner on Friday September 18th, 2009.

All you have to do to sign up for the competition is to re-tweet this post (but with your own text) and include the #mindscape hashtag, the #giveaway hashtag and the URL of this post, http://bit.ly/MindscapeGiveaway. The rest can be whatever you like: be creative, you have 75 more characters to fill. If you are out of creative sentences, here is a little help:

RT @simonech Sign me up for the mindscape pack giveaway #mindscape #giveaway http://bit.ly/MindscapeGiveaway

To make sure I don't miss any tweets, also comment on this post with your twitter name and follow me on twitter (this way I can contact you in case you are picked in the draw).

To increase your chances of winning the Mega Pack you can tweet more than once, but only once per day, and the "creative part" of the tweet must be different each time. You only need to comment on this blog post the first time.

So, in short:

  1. Follow @simonech
  2. Re-tweet
  3. Comment leaving your twitter name
  4. and then re-tweet once a day

The first 3 steps are all required to take part in the giveaway. If you don't comment or don't follow, you will not be entered in the competition, as I won't be able to contact you in case you win.

How the random draw will work

  1. First I’ll look at the comments to this post to gather all the people that signed up for the giveaway.
  2. I’ll then collect all the tweets of all the comment authors.
  3. I'll remove duplicate entries: tweets sent by the same person on the same day, or with the same message, count only once.
  4. And finally I'll pick one of the tweets from this list at random.

Once the winning tweet has been picked, I'll notify its author via DM.

Mindscape Mega Pack

Mindscape is a leading-edge Wellington based software solutions provider widely regarded as a think-tank within the Australasian technical community.

A company founded on the principles of technical excellence, Mindscape employs recognised industry thought-leaders; people who are passionate about producing great, high-quality software.

You can read more about the company and their products on their website, and I really recommend subscribing to their blog to get the latest news. You can also download the free version of LightSpeed and try it out.

How Subtext’s Lucene.net index is structured

In the last part of the tutorial about Lucene.net we talked about how to organize a Lucene index, and how important it is to have a well-planned strategy for it. In this post I'm going to show you how I applied those concepts and Nic's tips while designing the index for Subtext.

Requirements

Here are the requirements we are designing the index for:

  • Free-text searches using the search box
  • When someone comes from a search engine, show more results related to the search he did
  • Show more posts related to a post

The first two requirements are the usual ones: being able to search for some terms in the index. The last one, though, requires more than just the list of terms: it's a MoreLikeThis search, and it also needs the Term Vector to be stored.

Then there are other "hidden requirements": a post can be just a draft (and I don't want it to appear in searches), or it can be scheduled for future publishing (and again, I don't want it to appear in search results). We also have the "aggregated blog", which is a collection of all the blogs on the site. To make things even more complex, it's not just one "wall": blogs can be grouped into different "walls" (for example all the blogs talking about Silverlight and all the ones talking about ASP.NET MVC). And last, users can decide not to push their posts to their group.

Structure of the Index

With all these requirements in mind, here is how Subtext's index is structured:

Name       Index         Store  TV   Boost  Description
Title      TOKENIZED     YES    YES  2      The title of the post
Body       TOKENIZED     NO     YES  -      The body of the post
Tags       TOKENIZED     NO     YES  4      The list of tags
PubDate    UN_TOKENIZED  YES    NO   -      The publishing date
BlogID     UN_TOKENIZED  NO     NO   -      The id of the blog
Published  UN_TOKENIZED  NO     NO   -      Whether the post is published or still a draft
GroupID    UN_TOKENIZED  NO     NO   -      The group id (0 if not pushed to the aggregator)
PostURL    NO            YES    NO   -      The URL of the post
BlogName   NO            YES    NO   -      The name of the blog
PostID     UN_TOKENIZED  YES    NO   -      The id of the post

Explaining why

Let's explain it a bit more. The only fields that need to be full-text searched are the ones that contain some kind of real content: so Title, Body and Tags are the only ones that need to be analyzed and tokenized.
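
To make the table concrete, here is a minimal sketch of how a post could be turned into a Lucene.net Document following that layout (the field names and boosts come from the table above; the post object, its properties and the writer are hypothetical stand-ins, not the actual Subtext code):

Document doc = new Document();
// full-text fields: analyzed, with term vectors stored for the MoreLikeThis searches
Field title = new Field("Title", post.Title, Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.YES);
title.SetBoost(2.0f);
doc.Add(title);
doc.Add(new Field("Body", post.Body, Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.YES));
Field tags = new Field("Tags", post.Tags, Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.YES);
tags.SetBoost(4.0f);
doc.Add(tags);
// filter fields: indexed as single terms, not analyzed
doc.Add(new Field("PubDate", DateTools.DateToString(post.PubDate, DateTools.Resolution.MINUTE), Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.Add(new Field("BlogID", post.BlogId.ToString(), Field.Store.NO, Field.Index.UN_TOKENIZED));
doc.Add(new Field("Published", post.IsPublished ? "true" : "false", Field.Store.NO, Field.Index.UN_TOKENIZED));
doc.Add(new Field("GroupID", post.GroupId.ToString(), Field.Store.NO, Field.Index.UN_TOKENIZED));
// stored-only fields, needed just to render the result row
doc.Add(new Field("PostURL", post.Url, Field.Store.YES, Field.Index.NO));
doc.Add(new Field("BlogName", post.BlogName, Field.Store.YES, Field.Index.NO));
// the key used by MoreLikeThis to find the document to compare against
doc.Add(new Field("PostID", post.Id.ToString(), Field.Store.YES, Field.Index.UN_TOKENIZED));
writer.AddDocument(doc);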

But to comply with all the other requirements, when we do a search we also have to filter on other criteria (there's a sketch of the resulting query right after this list):

  • PubDate must be less than Now
  • Published must be true
  • BlogID must be the one of the blog I’m searching from (when searching inside a single blog)
  • GroupID must be the one of the aggregated site I’m searching from (when searching inside an aggregated site)
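
And here is a hedged sketch of how those criteria could be combined with the full-text part of the search (assuming PubDate was written with DateTools as in the sketch above; userQuery, blogId and the "true"/"false" convention for Published are illustrative assumptions, not necessarily what the final Subtext code will do):

BooleanQuery query = new BooleanQuery();
// the full-text part, built over Title, Body and Tags
query.Add(userQuery, BooleanClause.Occur.MUST);
// only published posts, and only posts belonging to the blog being searched
query.Add(new TermQuery(new Term("Published", "true")), BooleanClause.Occur.MUST);
query.Add(new TermQuery(new Term("BlogID", blogId.ToString())), BooleanClause.Occur.MUST);
// only posts whose publishing date is already in the past (open-ended lower bound)
string now = DateTools.DateToString(DateTime.Now, DateTools.Resolution.MINUTE);
query.Add(new RangeQuery(null, new Term("PubDate", now), true), BooleanClause.Occur.MUST);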

So I also needed to index the fields above, but since they are single terms I don't need to tokenize them.

And I also need the PostID, since when we use the MoreLikeThis query I have to supply to Lucene the id of the document I want to find similar documents for.

And finally, a row in the results will be like:

Dissecting Lucene.net storage: Documents and Fields – Sept 4th, 2009 (CodeClimber)

So the only fields I need to retrieve, and thus store, are Title, PubDate, BlogName (shown in case I’m doing a search from the aggregated site) and obviously the URL to link to the complete post.

What do you think? Am I missing something? Would you have done something differently? Please answer with your comments.

The next step

Now that the index has been designed, in the next post we’ll cover some infrastructural code, and show how the search engine service works inside Subtext.

Disclaimer: This is all work in progress and might be (and probably will be) different from the final version of the search engine service that will be included into the next version of Subtext.

Lucene - or how I stopped worrying, and learned to love unstructured data

Before moving on to how I'm implementing Lucene.net in Subtext, I wanted to bring to my Lucene.net tutorial the experience of a good friend of mine, Nic Wise, who has been using Lucene, both the Java and the .NET versions, since 2003.
So, without further ado, let's read about that experience directly in Nic's own words.


Lucene is a project I've been using for a long time, and one I often find that people don't know about. I think Simo has covered what Lucene is and how to use it, so I'm here to tell you a bit about how I've used it over the years.

My first Lucene

My first use of Lucene was back in about 2003. I was writing an educational website for IDG in New Zealand, using Java, and we needed to search the database. Lucene was really the only option, aside from using various RDBMS tricks, so it was chosen. This was a pretty typical usage tho - throw the content of a database record into the index with a primary key and then query it, pulling back the records in relevance order.

With the collapse of the first internet bubble (yes, it even hit little ol' New Zealand) that site died, and I stopped using Java and moved to .NET. To be honest, I don't even remember the name of the site!

AfterMail / Archive Manager

My next encounter with Lucene - this time Lucene.NET - was when I was at AfterMail (which was later bought by Quest Software, and is now Archive Manager). AfterMail was an email archiving product which extracted email from Exchange and put it into a SQL Server database. Exchange didn't handle huge data sets well (it still doesn't do it well, but it does do it better), but SQL Server can and does handle massive data sets without flinching.

The existing AfterMail product used a very simple index system: break up a document into its component words, either by tokenizing the content of an email or by using an iFilter to extract the content of an attachment, and then map between words and email or attachment primary keys. It was pretty simple, and it worked quite well with small data sets, but the size of the index database compared to the size of the source database was a problem - it was often more than 75%! This was really not good when you had a lot of data in the database. This was combined with not having any relevance ranking, or any of the other nice features a "real" index provides.

We decided to give Lucene a try for the second major release of AfterMail. On the same data set, Lucene created an index which was about 20% of the size of the source data, performed a lot quicker, and scaled up to massive data sets without any problem.

The general architecture we had went like this:

  1. The data loader would take an email and insert it into the database. It would also add the email's ID and the ID
    of any attachments into the "to be indexed" table.
  2. The indexing service would look at that "to be indexed" table every minute, and index anything which was in
    there.
  3. When the website needed to query the index, it would make a remoting call (what is now WCF) to the index
    searching service, which would query the index, and put the results into a database temporary table. This was a
    legacy from the original index system, so we could then join onto the email and attachment tables.

We indexed a fair bit of data, including:

  • The content of the email or attachment, in a form which could be searched but not retrieved.
  • Which users could see it, so we didn't have to check if the user could see an item in the results.
  • The email address of the user, broken down - so foo@bar.com was added in as foo, oof, bar.com, moc.rab etc. This
    allowed us to search for f*, *oo, *@bar.com, and *@bar*, which Lucene doesn't normally allow (you can do wildcards
    at the end, but not at the beginning); see the sketch right after this list.
  • Other metadata, like which mailboxes the email was in, which folders (if we knew), and a load of other data.
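
The leading-wildcard trick in that list is easy to reproduce: index each fragment together with its reversed form, then turn a leading wildcard into a trailing wildcard on the reversed field. A rough sketch of the idea (the field names and the helper are illustrative, not AfterMail's actual schema):

// a small helper that indexes an address fragment both ways
static void AddAddressFragment(Document doc, string fragment)
{
    doc.Add(new Field("address", fragment, Field.Store.NO, Field.Index.UN_TOKENIZED));
    char[] reversed = fragment.ToCharArray();
    Array.Reverse(reversed);
    doc.Add(new Field("addressReversed", new string(reversed), Field.Store.NO, Field.Index.UN_TOKENIZED));
}

// index time: foo@bar.com ends up in the index as foo, bar.com, oof and moc.rab
AddAddressFragment(doc, "foo");
AddAddressFragment(doc, "bar.com");

// search time: "*@bar.com" ("ends with bar.com") becomes a trailing wildcard on the reversed field
Query endsWith = new WildcardQuery(new Term("addressReversed", "moc.rab*"));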

All of this meant we could provide the user with a strong search function. From time to time, an email would be indexed more than once, updating the document in the index (eg if another user could see the email), but in general, it was a quick and fairly stable process. It wasn't perfect tho: we ran into an issue where we set the merge sizes too high - way, way too high - which caused a merge of two huge files. This would have worked just fine, if a bit slowly, except we had a watchdog timer in place: when a service took too long doing anything, the service would be killed and restarted. This led to a lot of temporary index files being left around (around 250GB in one case, for a 20GB index), and a generally broken index. Setting the merge size to a more sane value (around 100,000 - we had it at Int32.MaxValue) fixed this, but it took us a while to work it out, and about a week to reindex the customer's database, which contained around 100GB of email and attachments.
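
For reference, the knobs involved are exposed directly on the IndexWriter; a minimal sketch with more conservative values than the Int32.MaxValue they started from (the exact numbers here are illustrative, not a recommendation from the AfterMail team):

// cap on the number of documents merged into a single segment (the default is Int32.MaxValue)
writer.SetMaxMergeDocs(100000);
// how many segments get merged at once; the default of 10 is usually fine
writer.SetMergeFactor(10);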

Another gotcha we ran into - and why we had a search service which was accessed via remoting - is that Lucene does NOT like talking to an index which is on a file share. If the network connection goes down, even for a second, you will end up with a trashed index. (this was in 1.x, and may be fixed in 2.x). AfterMail was designed to be distributed over multiple machines, so being able to communicate with a remote index was a requirement.

Just before I left, we did a lot of work around the indexing, moving from Lucene.NET 1.x to 2.x, along with a move to .NET 2.0. We added multi-threading for the indexing of email (attachments had a bottleneck on the iFilters, which were always single threaded, but the indexing part was multi-threaded), which sped up the indexing by a large factor - I think we were indexing about 20-40 emails per second on a fairly basic dual-core machine, up from 2-3 per second, and it would scale quite linearly as you added more CPUs.

Lucene performed amazingly well, and allowed us to provide a close-to-google style search for our customers.

Top Gear

The next project I used it on was a rewrite of the Top Gear website. This is where some of the less conventional uses came up. For those who don't know, Top Gear is a UK television program about cars, cars and cars, presented in both a technical (BHP, MPG, torque) and a non-technical, humorous way (it's rubbish/it's amazing/OMFG!). We were redeveloping the website from scratch, for the magazine, and it ties into the show well.

The first aspect of the index was the usual one: index the various items in the database (articles, blog posts, car reviews, video metadata), and allow the user to search them. The search results were slightly massaged, as we wanted to bubble newer content to the top, but otherwise we were using Lucene's built-in relevance ordering. The user can also select what they want to search - articles, blog posts, video etc - or just search the whole site.

Quick tips for people new to Lucene

Your documents don't have to have the same fields! For example, the fields for a Video will be different to the fields for an Article, but you can put them in the same index! Just make a few common ones (I usually go with body and tags, as well as a discriminator (Video/Article/News etc) and a database primary key), but really, you can add any number of different fields to a document.
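
A quick sketch of that tip: two documents with different fields living in the same index, tied together only by the common discriminator and primary key fields (the field names and values here are just for illustration):

Document video = new Document();
video.Add(new Field("type", "video", Field.Store.YES, Field.Index.UN_TOKENIZED));
video.Add(new Field("id", "42", Field.Store.YES, Field.Index.UN_TOKENIZED));
video.Add(new Field("body", "Clarkson power-slides the new...", Field.Store.NO, Field.Index.TOKENIZED));
video.Add(new Field("duration", "00:03:12", Field.Store.YES, Field.Index.NO)); // video-only field

Document article = new Document();
article.Add(new Field("type", "article", Field.Store.YES, Field.Index.UN_TOKENIZED));
article.Add(new Field("id", "17", Field.Store.YES, Field.Index.UN_TOKENIZED));
article.Add(new Field("body", "The full road test of the...", Field.Store.NO, Field.Index.TOKENIZED));
article.Add(new Field("author", "Nic", Field.Store.YES, Field.Index.NO)); // article-only field

writer.AddDocument(video);
writer.AddDocument(article);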

Think about what you need the field for. For example, you may only need the title and the first 100 characters of a blog post, to show on the screen, but storing the whole post will blow out the size of your index. Only store what you need - you can still index and search on text which is not stored.

The second aspect was much less common. Each document in the database had various keywords, or tags, which were added by the editor when they were entered. We then looked for other items in the database which matched those tags, either in their body, tags or other fields, and used that as a list of "related" items. We also weighted the results, so that a match in an item's tags counted for more than something in the title or body. For example, on this page you can see the list of related items at the bottom, generated on the fly from Lucene by looking for other documents which match the tags of this article.
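
A sketch of how such a weighted "related items" lookup could be put together with boosted clauses (the weights, field names and tags are illustrative, not the actual Top Gear code):

// find other items sharing this article's tags; a hit in "tags" is worth more than one in "title" or "body"
BooleanQuery related = new BooleanQuery();
foreach (string tag in new string[] { "stig", "v8" })
{
    TermQuery inTags = new TermQuery(new Term("tags", tag));
    inTags.SetBoost(4.0f); // tag matches dominate the ranking
    related.Add(inTags, BooleanClause.Occur.SHOULD);
    related.Add(new TermQuery(new Term("title", tag)), BooleanClause.Occur.SHOULD);
    related.Add(new TermQuery(new Term("body", tag)), BooleanClause.Occur.SHOULD);
}
Hits relatedItems = searcher.Search(related);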

If we were able, we would have extended the tag set using keyword extraction (eg using the Yahoo! Term Extraction API) from the body contents, but this was deemed to be overkill.

Top Gear's publishing system works by pulling new articles out of the CMS database and creating entries in the main database. At the same time, it adds the item to the index. In addition to this, there is a scheduled process which recreates the index from scratch 4x a day. When the index is rebuilt, it's then distributed to the other web server in the cluster, so that both machines are up to date. The indexes are small, and the document count on these is low (<10,000), so reindexing only takes a couple of minutes.

My final personal recommendations

All up, Lucene has been a consistent top performer whenever I've needed a search-based database. It can handle very large data sets (hundreds of GB of source data) without any problems, returning results in real time (<500ms). The mailing list is active, and despite not having a binary distribution, it is maintained, developed, and supported.

If you think of it as only an index, then you are going to only use one aspect of this very versatile tool. It does add another level of complexity to the system, but once you master it - and it's not hard to master - it's a very solid performer, even more so if you stop worrying about relational, and learn to love unstructured.

I recommend the book Lucene In Action, as it has a lot of background on how the searching and indexing work, as well as a how-to guide - the Java and .NET versions are close to API compatible, certainly enough to make the book worthwhile.


About the author

Nic Wise is a grey-haired software developer from New Zealand, living in London, UK. He is a freelance contractor, previously working for BBC Worldwide on a redevelopment of the Top Gear website and the bbc.com/bbc.co.uk homepage. He has worked on many projects over his 13+ years in the industry. Read more about him.

You can read his ramblings on his blog, The Chicken Coop, and follow him on twitter at @fastchicken.

ASP.NET MVC Refcard available

After completing our book, Keyvan and I decided to combine our efforts again, and write a quick reference about ASP.NET MVC.

And today this quick reference is available from DZone, in the RefCardz section of the site.

The refcard doesn't try to explain what the framework is (that would not have been possible in just 6 pages), but instead focuses on giving a quick reference to the conventions used by the framework, the APIs available and all the aspects of developing with ASP.NET MVC. More details are available in Keyvan's post.

Get it here: Getting Started with ASP.NET MVC 1.0

Dissecting Lucene.net storage: Documents and Fields

In the previous posts we discussed how to get started with Lucene.net and its main concepts, and we developed a sample application that put into practice all the concepts behind Lucene.net development. But before moving on, I think it's worth analyzing in detail how content is stored in the Lucene.net index.

The Document

As you already saw previously, a Document is the unit of the indexing and searching process. You add a document to the index and, after you perform a search, you get a list of results: and they are documents.

A document is just an unstructured collection of Fields.

Fields

Fields are the actual content holders of Lucene.net: each field is basically a name/value pair, like an entry in a hashtable.

If we had infinite disk space and infinite processing power, that would be all we needed to know. But unfortunately disk space and processing power are constrained, so you can't just analyze everything and store it all in the index. That's why Lucene.net provides different ways of adding a field to the index.

Everything is controlled through the field constructor:

// note: the store, index and term-vector options are chosen independently,
// but a term vector can only be kept for a field that is indexed
new Field("fieldName", "value",
    Field.Store.YES,
    Field.Index.TOKENIZED,
    Field.TermVector.YES);

Store the content or not

You can decide whether to store the content of the field into the index or not:

  • Field.Store.YES – Stores the content in the index as supplied to the Field’s constructor
  • Field.Store.NO – Doesn’t store the value at all (you won’t be able to retrieve it)
  • Field.Store.COMPRESS – Compresses the original value and stores it into the index

When you have to decide whether or not to store the original content, you have to think about which data you really need when you display the results of a search: if you are never going to show the content of the document, there is no need to store it inside the index. But maybe you need to store the date, or the users that have the right to access a document. Or maybe you want to show only the first 100 characters of the post: in this case you will store just those, and not the full post. The final goal is to keep the index size to a minimum but, at the same time, make sure you will not need to hit the database to display the results of a search.

This, of course, applies if you are using Lucene just as the full-text index of another "main" storage. If you are using it as a KV store, à la CouchDB, you obviously need to store everything. In that scenario, you might want to compress long texts or binary data to keep the size down.

Just one quick point to make sure there are no misunderstandings: even if you don’t store the original value of a field, you can still index it.

To Index or not to Index?

You can then decide which kind of indexing to apply to the value added:

  • Field.Index.NO – The value is not indexed (it cannot be searched but only retrieved, provided it was stored)
  • Field.Index.TOKENIZED – The value is fully indexed, using the Analyzer (so the text is first tokenized, then normalized, and so on, as you read in the post about the main concepts of Lucene.net)
  • Field.Index.UN_TOKENIZED – The value is indexed, but without an analyzer, so it can be searched, but only as a single term.
  • Field.Index.NO_NORM – Indexes without an analyzer and without storing the norms. This is an advanced option that lets you reduce memory usage (one byte per field) at the cost of disabling index-time boosting and length normalization.

So, when to use which? You have to think about how you are going to search for your documents. For example, if you don't need to search using the post URL, but only need it to link to the actual content, then you can safely use the NO option. On the other hand, you might need to search using the exact value of a short field, for example a category. In this case you don't need to analyze it and break it into its terms, so you can index it using the UN_TOKENIZED option and have the value indexed as a single term. And obviously, you need to use the TOKENIZED option for the content that has to be full-text indexed.
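
Putting the storage and indexing choices together, here is a short hedged example with three typical fields (the field names and values are just for illustration):

// only needed to build the link in the results: store it, don't index it
doc.Add(new Field("url", "http://example.com/posts/42", Field.Store.YES, Field.Index.NO));
// searched only by its exact value: index it as a single term, no analyzer
doc.Add(new Field("category", "lucene", Field.Store.YES, Field.Index.UN_TOKENIZED));
// full-text searched but never displayed from the index: index it, don't store it
doc.Add(new Field("body", "the full text of the post...", Field.Store.NO, Field.Index.TOKENIZED));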

And what about Term Vectors?

The third option is the TermVector, but first we have to understand what a Term Vector is. A term vector represents all the terms inside a field, with the number of occurrences in the document. Usually this is not stored, but it's useful for some more advanced types of queries, like "MoreLikeThis" queries and Span queries, and for highlighting the matches inside the document. You have the following options:

  • Field.TermVector.NO – This is the default one, the one that is used when this option is not even specified (using the other constructors)
  • Field.TermVector.YES – Term vectors are stored
  • Field.TermVector.WITH_POSITIONS – Stores term vector together with the position of each token
  • Field.TermVector.WITH_OFFSETS – Stores term vector together with the offset of tokens
  • Field.TermVector.WITH_POSITIONS_OFFSETS – Stores term vector with both position and offset

This is a more advanced option; again, make sure you know what you are going to do with your index and which types of searches you are going to run.

Boosting

Another topic that is a bit more advanced but very powerful is boosting. By boosting, Lucene means the ability to make something (a document, a field, a search term) more important than the others.

For example, you might want matches on the title of a post to be more important than the ones on the content. In that case you set a field boost on the title.

Field title = new Field("title","This is my title",
      Field.Store.YES,Field.Index.TOKENIZED);
title.SetBoost(2.0f);

Or you might want to push a document more than others, so you set a boost on the whole document. This means that when you perform a search, this document will be pushed up in the ranking.

Document doc = new Document();
//Add Fields
doc.SetBoost(2.0f);
writer.AddDocument(doc);

But beware that boosting is disabled if you created the field using the Field.Index.NO_NORM option. If you set the NO_NORM option, the only boosting you can do is at search time. I'll come to the search syntax in a future post, but here is a quick sample of how you can boost a query term.
If you want to search for documents that contain the terms "MVC" or "MVP", but you want the documents that contain the term "MVC" displayed first, the query would be:

MVC^2 MVP

You use the caret "^" as if you were raising the term to the power of the boost you want to apply.
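
For example, that boosted query can be either parsed from the syntax above or built through the API; a minimal sketch (the "content" field name is just for illustration, and note that the StandardAnalyzer lower-cases terms while parsing, so the API version uses "mvc" and "mvp"):

// parsed from the query syntax
QueryParser parser = new QueryParser("content", new StandardAnalyzer());
Query parsed = parser.Parse("MVC^2 MVP");

// the equivalent query built through the API
TermQuery mvc = new TermQuery(new Term("content", "mvc"));
mvc.SetBoost(2.0f);
BooleanQuery either = new BooleanQuery();
either.Add(mvc, BooleanClause.Occur.SHOULD);
either.Add(new TermQuery(new Term("content", "mvp")), BooleanClause.Occur.SHOULD);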

A practical example of boosting

Let’s see with an example what the effect of boosting is.

Imagine you have 3 documents in your index:

  • Doc A – “The MVC pattern is better than the MVP pattern”
  • Doc B - “The MVC pattern is the best thing since sliced bread”
  • Doc C - “The MVP pattern is too complex”

So, if you search for “MVC^2 MVP” it will return all 3 documents (this search means either MVC or MVP) in the following order:

  1. Doc A – 3 points (2 points for MVC and 1 for MVP)
  2. Doc B – 2 points (2 points for MVC)
  3. Doc C – 1 point (1 point for MVP)

This is an oversimplification, and the actual ranking algorithm is much more complex, but it illustrates the effect of boosting well.

What’s next?

Now that I've gone over all the main concepts of Lucene.net, in the next posts we are going to see how I'm planning to organize the index in Subtext.

Lucene.net: your first application

In the first two posts of the tutorial you learnt how to get the latest version of Lucene.net, where to find the (little) documentation available, what the main concepts of Lucene.net are, and the main Lucene.net development steps.

In this third post I'm going to put into practice all the concepts explained in the previous post, writing a simple console application that indexes the text entered in the console.

I’ll refer to the steps I outlined in my previous post. So if you haven’t already I recommend you go back and read it.

Step 1 – Initialize the Directory and the IndexWriter

As I said in my previous post, there are two possible Directory implementations you can use: one based on the file system and one based on RAM. You'd usually want to use the FS-based one: it's pretty fast anyway, and you don't need to constantly dump it to the filesystem. The RAM-based one is probably more of a test fake than something to use for real in production.

And once you have instantiated the Directory you have to open an IndexWriter on it.

Directory directory = FSDirectory.GetDirectory("LuceneIndex");
Analyzer analyzer = new StandardAnalyzer();
IndexWriter writer = new IndexWriter(directory, analyzer);

If you are not interested in keeping a reference to the Directory (you don't need to call additional methods on it, and you just want an FSDirectory), you can use the short version and create the IndexWriter with just one line of code.

IndexWriter writer = new IndexWriter("LuceneIndex", analyzer);

Step 2 – Add Documents to the index

I’ll cover this topic more in depth in a subsequent post, but the basic code for adding a document to the index is pretty straightforward. Create a document, add some fields to it, and then add the document to the Index.

Document doc = new Document();
doc.Add(new Field("id", i.ToString(), Field.Store.YES, Field.Index.NO));
doc.Add(new Field("postBody", text, Field.Store.YES, Field.Index.TOKENIZED));
writer.AddDocument(doc);

And when you are done adding all the documents you need, you might call the Optimize method, "priming the index for the fastest available search", and then either call Flush to commit all the updates to the Directory or, if you don't need to add anything more to the index, call the Close method to flush and then close all the files in the Directory.

writer.Optimize();
//Commit the pending changes to the Directory
writer.Flush();
//Close the writer (closing also flushes)
writer.Close();

Step 3 – Create the Query

The Query can be created either via the API or by parsing the Lucene query syntax with the QueryParser.

QueryParser parser = new QueryParser("postBody", analyzer);
Query query = parser.Parse("text");

or

Query query = new TermQuery(new Term("postBody", "text"));

The two snippets are functionally the same, so when is it better to use the API and when the QueryParser? I personally would use the QueryParser when the search string is supplied by the user, and the API directly when the query is generated by code.

Step 4 – Pass the Query to the IndexSearcher

Once you have your Query, all you need to do is pass it to the Search method of the IndexSearcher.

//Setup searcher
IndexSearcher searcher = new IndexSearcher(directory);
//Do the search
Hits hits = searcher.Search(query);

The Searcher must be instantiated before use and, for performance reasons, it's recommended that only one Searcher is kept open. So open one and use it for all your searches. This might pose some issues in a multi-threaded environment (like a web application), but we'll come back to this topic in a future post.
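
A minimal sketch of what "open one and reuse it" could look like (a hypothetical helper, not the code that will end up in Subtext):

public static class SharedSearcher
{
    private static readonly object padlock = new object();
    private static IndexSearcher searcher;

    // lazily open a single IndexSearcher (over a Lucene.Net.Store.Directory) and hand the same instance to every caller
    public static IndexSearcher Get(Directory directory)
    {
        lock (padlock)
        {
            if (searcher == null)
                searcher = new IndexSearcher(directory);
            return searcher;
        }
    }
}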

Step 5 – Iterate over the Results

The Search method returns a Hits object, which contains all the documents returned by the query. To list the results, just loop through them.

int results = hits.Length();
Console.WriteLine("Found {0} results", results);
for (int i = 0; i < results; i++)
{
    Document doc = hits.Doc(i);
    float score = hits.Score(i);
    Console.WriteLine("Result num {0}, score {1}", i+1,score);
    Console.WriteLine("ID: {0}", doc.Get("id"));
    Console.WriteLine("Text found: {0}" + Environment.NewLine, doc.Get("postBody"));
}

You get the current Document using the Doc(num) method, and the Score (which is an unbounded float) using the Score(num) method. You might notice that this is a pretty strange API compared to what we are used to in .NET: I would have expected to be able to do a foreach over the returned Hits object. Probably this is due to the API being a class-per-class port of the Java version, so it follows the API design conventions that are typical of the Java world. We can debate the purist-port approach vs a more idiomatic one for ages, but that's the way it is.

Step 6 – Close everything

Once you are done with everything, you need to close all the resources: Directory and IndexSearcher.

searcher.Close();
directory.Close();

Get the code

You can download a short sample application that stitches all that code together into a console application that lets you index any text you enter and later search for it.


What’s next

This was a very simple application: it was single-threaded and had both the indexing and searching phases in the same piece of code. But before going into the details of the implementation I'm doing for Subtext, in the next post I'll cover the concepts of documents and fields in more depth.