

Torchbox and the Lucene Search API

Helen Warren, 18 April 2006

Fast, targeted searching of your website
We have recently integrated Lucene, the excellent Java search API, into one of our core products, the RationalMedia content management system. Our RationalMedia indexing and searching application is internally code-named rmLucene. rmLucene indexes all the content on a RationalMedia implementation, including all public and protected web pages and all PDF, MS Word, Excel, PowerPoint and HTML documents in a RationalMedia library. It gives our clients a fast, accurate and powerful search of their RationalMedia-driven web site. Search criteria include the obvious keyword search, but searches may also be restricted by document modification date, content categorization, document type (local or remote resources), and inclusion or exclusion of particular areas of the site. rmLucene can also search for resources similar to a chosen search result. Protected resources are handled automatically: they are included in the search results only if the searcher is authorized to view them.

Spidering, indexing and searching the wider web
Another of Torchbox's core offerings is a Java-based product that takes a list of starting URLs and spiders, indexes and searches the sites they lead to. This product is known as FathomFive and is an open source competitor to products like the Google appliance. FathomFive has OAI support and meta-data extraction for LGCL, GCL and IPSV vocabularies, features which are of particular interest to UK public sector organisations, but which can easily be ignored for unrelated implementations.

The FathomFive spider is based on JoBo, while the indexer again uses the Lucene API. A Lucene-based search servlet is packaged with FathomFive to search the resulting portal of web sites.

Combining Lucene indices
Some of our clients who use RationalMedia to maintain their website also run a FathomFive implementation. For them, being able to optionally include results from their FathomFive portal in their main site search would make for a seriously powerful, targeted search tool. But these are two separate Lucene indices. Fortunately, searching across these indices jointly is easy using the Lucene API.

Here's the way to combine multiple indices to create a Lucene searcher object across all of them:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searcher;

try {
    // define arrays sized to the number of indices you want to combine
    IndexReader[] readers = new IndexReader[2];
    IndexSearcher[] searchers = new IndexSearcher[2];

    // open a reader for each index
    readers[0] = IndexReader.open("/usr/local/lucene/rmLucene/fullIndex");
    readers[1] = IndexReader.open("/usr/local/lucene/f5/fullIndex");

    // create a searcher on top of each reader
    searchers[0] = new IndexSearcher(readers[0]);
    searchers[1] = new IndexSearcher(readers[1]);

    // create the multi versions of the IndexReader and Searcher objects
    IndexReader myCombinedReader = new MultiReader(readers);
    Searcher myCombinedSearcher = new MultiSearcher(searchers);

    // run your searches here, while the combined objects are in scope

} catch (IOException ioe) {
    System.err.println("Problem creating readers/searchers: " + ioe.getMessage());
}

Lucene takes care of re-assigning the internal ids of the documents in the two indices so they are treated as a single index with continuous, unique Lucene ids. Searches using the new myCombinedSearcher object are executed in the standard way, with results from the two original indices being seamlessly amalgamated and ranked.
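The renumbering described above can be pictured as laying the indices end to end and giving each one a starting offset. The following is only a toy model of that bookkeeping, not Lucene's source; the class CombinedDocIds and its method names are hypothetical and exist purely for illustration.

```java
/** Toy model (hypothetical, not Lucene code): doc ids of several indices laid end to end. */
final class CombinedDocIds {
    // starts[i] is the combined id of the first document of index i;
    // the last entry holds the total number of documents.
    private final int[] starts;

    CombinedDocIds(int[] docCounts) {
        starts = new int[docCounts.length + 1];
        for (int i = 0; i < docCounts.length; i++) {
            starts[i + 1] = starts[i] + docCounts[i];
        }
    }

    int totalDocs() {
        return starts[starts.length - 1];
    }

    /** Combined id of document localId in index indexNo. */
    int toCombined(int indexNo, int localId) {
        return starts[indexNo] + localId;
    }

    /** The index that a combined id falls into. */
    int indexOf(int combinedId) {
        if (combinedId < 0 || combinedId >= totalDocs()) {
            throw new IllegalArgumentException("no such document: " + combinedId);
        }
        int i = 0;
        // walk forward until the next index starts beyond this id
        while (combinedId >= starts[i + 1]) {
            i++;
        }
        return i;
    }

    /** The id the document has inside its own index. */
    int toLocal(int combinedId) {
        return combinedId - starts[indexOf(combinedId)];
    }
}
```

With three documents in the first index and two in the second, the first document of the second index gets combined id 3, and combined id 4 maps back to local id 1 of the second index.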

If you're serious about combining your indices....
....you may be interested in the one glitch we've noticed in the implementation of the MultiSearcher object. Here is the scenario.

Generally, when you want to search a Lucene index using several criteria, you build up an overall query object by bolting individual query objects together. For example, start with an overall BooleanQuery. To this BooleanQuery, add a query built from the keyword you're searching for:

overallQuery.add(keyWordQuery,BooleanClause.Occur.MUST);

Then, say we also want to constrain the results to all lie in the 'news' section of our site, or subsections thereof:


Query SectionQuery = new WildcardQuery(new Term("section","/news/*"));


We add this onto our overall query too:

overallQuery.add(SectionQuery,BooleanClause.Occur.MUST);


and so on, building up all the pieces. Once we have our final query, we can search the Lucene index:


ourSearcherObject.search(overallQuery);


This all works fine, even with the MultiSearcher object, except in the following scenario:

We want to search across two indices, using a WildcardQuery to exclude documents from the result set, e.g. show me all documents which contain 'avocado', but which do not lie in the 'news' section or any of its subsections,

i.e. as above, but with

overallQuery.add(SectionQuery,BooleanClause.Occur.MUST_NOT);
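To pin down what the search ought to return in this scenario, here is a minimal plain-Java sketch of the intended semantics. It does not use Lucene at all: Doc and ExpectedResults are hypothetical names, the keyword match is a simple substring test, and the wildcard "/news/*" is modelled as a prefix match on "/news/".

```java
import java.util.ArrayList;
import java.util.List;

/** Toy document (hypothetical) with just the two fields used in the example. */
final class Doc {
    final String text;
    final String section;

    Doc(String text, String section) {
        this.text = text;
        this.section = section;
    }
}

/** Plain-Java statement of the expected result set; not Lucene code. */
final class ExpectedResults {
    /** Stand-in for a wildcard on the section field with a trailing '*'. */
    static boolean inSection(Doc doc, String sectionPrefix) {
        return doc.section.startsWith(sectionPrefix);
    }

    /** Keyword MUST match; section prefix MUST_NOT match. */
    static List<Doc> keywordOutsideSection(List<Doc> docs, String keyword, String sectionPrefix) {
        List<Doc> hits = new ArrayList<Doc>();
        for (Doc doc : docs) {
            if (doc.text.contains(keyword) && !inSection(doc, sectionPrefix)) {
                hits.add(doc);
            }
        }
        return hits;
    }
}
```

Given an 'avocado' page under "/recipes/" and another under "/news/food/", only the recipes page should come back for the keyword "avocado" and the prefix "/news/".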


For some reason this fails when using a MultiSearcher. The results return as though we added the SectionQuery in with a MUST rather than a MUST_NOT. It is only the odd combination of (a) multiple indices with (b) a WildcardQuery added into the overall query with (c) a Boolean MUST_NOT clause which causes problems. If we relax any of our constraints, the search returns the expected results, i.e. if we


  • Don't use WildcardQuery, but pass in the news section and its child sections to exclude explicitly

  • Exclude results from just one section, not its children too, i.e. don't use WildcardQuery

  • Do use WildcardQuery, and exclude a section and its children, but just use one index thereby using the simple IndexReader and IndexSearcher objects

  • Use the boolean MUST clause rather than MUST_NOT when adding the WildcardQuery


I have logged this problem, together with illustrative JUnit tests, on the Lucene JIRA issue tracker.

Paul Elschot has observed that if one explicitly rewrites the query before sending it into the search method, the world is put to rights, but of course we shouldn't need to do that as the query is rewritten anyway within the search method. Curiouser and curiouser but hardly a major bug in this fantastic search API.
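For readers unfamiliar with the term, rewriting a WildcardQuery means replacing the pattern with the concrete terms in the index that match it; in the Lucene API this is what Query.rewrite(IndexReader) is for. The sketch below is a toy model of that expansion across two indices, not Lucene's implementation and not a diagnosis of the bug: WildcardExpansion is a hypothetical name, the term lists are made up, and only a trailing '*' is handled.

```java
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy model (hypothetical, not Lucene code) of expanding a wildcard into concrete terms. */
final class WildcardExpansion {
    /** Collects every term, from every index, that matches a trailing-'*' pattern. */
    static SortedSet<String> expand(String pattern, List<List<String>> termsPerIndex) {
        if (!pattern.endsWith("*")) {
            throw new IllegalArgumentException("this sketch only handles a trailing *");
        }
        String prefix = pattern.substring(0, pattern.length() - 1);
        SortedSet<String> matches = new TreeSet<String>();
        for (List<String> terms : termsPerIndex) {
            for (String term : terms) {
                if (term.startsWith(prefix)) {
                    matches.add(term);
                }
            }
        }
        return matches;
    }
}
```

Expanding "/news/*" up front against the terms of both indices yields one explicit list of sections to exclude, which is much the same as the first option in the list above: naming the news section and its child sections explicitly.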

Copyright © Torchbox Ltd, 2008