Dan Braghis

Principal Engineer and Wagtail Consultant

Apache Solr Search with Solr > 4.7

Solr 4.x brings a plethora of improvements over 3.x and 1.x. All our new projects use 4.x and we try to upgrade existing client implementations where and when possible. Last week we upgraded another client. The transition was smooth, except for odd entries in the indexing log and, as it turned out, nodes missing from the index.

```
java.lang.Thread.run(Thread.java:745)
Caused by:
java.lang.IllegalArgumentException: Document contains at least one immense term in field="sm_field_body" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[32, 67, 97, 116, 104, 101, 114, 105, 110, 101, 32, 66, 101, 97, 114, 100, 115, 104, 97, 119, 32, 67, 97, 116, 104, 101, 114, 105, 110, 101]...', original message: bytes can be at most 32766 in length; got 108809
[...]
Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got 108809
```

It so happens, the Solr [notes on upgrading from prior versions](https://cwiki.apache.org/confluence/display/solr/Upgrading%20Solr) contain the following:

"Prior to Solr 4.8, terms that exceeded Lucene's MAX_TERM_LENGTH were silently ignored when indexing documents. Beginning with Solr 4.8, an error will be generated when attempting to index a document with a term that is too large. If you wish to continue to have large terms ignored, use solr.LengthFilterFactory in all of your Analyzers. See [LUCENE-5472](https://issues.apache.org/jira/browse/LUCENE-5472) for more details."
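
For analyzed field types, the filter the upgrade notes mention slots into the field type's analyzer chain in `schema.xml`. A minimal sketch (the `fieldType` name and tokenizer are illustrative):

```xml
<fieldType name="text_general" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Drop tokens longer than the limit instead of raising an error -->
    <filter class="solr.LengthFilterFactory" min="1" max="32766"/>
  </analyzer>
</fieldType>
```

This only helps fields that go through an analyzer, which is exactly what our problem fields do not do.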

Drupal Apache Solr fields are prefixed with a set of characters that denote the dynamic field type, following the Solr convention: e.g. `ss_` means "single-value string field", `sm_` — "multi-value string field".

In our case `sm_field_body`, like all `sm_*` fields, is declared as a `solr.StrField` field, which is not analyzed, just stored as is. Previously, terms larger than the allowed 32766-byte limit were simply ignored, but not any more.

In usual Solr configurations, a `StrField` could be truncated using [TruncateFieldUpdateProcessorFactory](http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/update/processor/TruncateFieldUpdateProcessorFactory.html):

```xml
<processor class="solr.TruncateFieldUpdateProcessorFactory">
  <str name="typeClass">solr.StrField</str>
  <int name="maxLength">100</int>
</processor>
```
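
For reference, the processor only takes effect once it is registered in an update request processor chain in `solrconfig.xml`. A minimal sketch (the chain name is illustrative):

```xml
<updateRequestProcessorChain name="truncate-strings" default="true">
  <processor class="solr.TruncateFieldUpdateProcessorFactory">
    <str name="typeClass">solr.StrField</str>
    <int name="maxLength">100</int>
  </processor>
  <!-- The standard logging and run processors close out the chain -->
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```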

However, since `sm_*` fields are not processed, we need a different solution, one that does not involve modifying the core Solr configuration. It comes as a simple hook implementation in a custom module.

```php
<?php

/**
 * Implements hook_apachesolr_index_documents_alter().
 *
 * Fix for https://issues.apache.org/jira/browse/LUCENE-5472
 */
function apachesolr_tweaks_apachesolr_index_documents_alter(array &$documents, $entity, $entity_type, $env_id) {
  foreach ($documents as $id => $document) {
    if (empty($documents[$id]->sm_field_body)) {
      continue;
    }

    foreach ($documents[$id]->sm_field_body as $index => $value) {
      $documents[$id]->sm_field_body[$index] = truncate_utf8($value, 31000);
    }
  }
}
```

The above `hook_apachesolr_index_documents_alter()` implementation looks specifically at `sm_field_body`, as that was the culprit in our case. You can follow the same approach to alter any indexed field.
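
One caveat worth noting: Drupal's `truncate_utf8()` measures characters, while Lucene's 32766 limit is in bytes, so a multibyte-heavy value could still exceed the limit after truncation. A hypothetical stand-alone helper (the function name is ours) could cut on bytes instead, using PHP's `mb_strcut()`:

```php
<?php

/**
 * Truncates a UTF-8 string to a byte limit without splitting a
 * multibyte character. Requires the mbstring extension.
 */
function apachesolr_tweaks_truncate_bytes($value, $max_bytes = 31000) {
  // mb_strcut() cuts on byte positions but backs up to the nearest
  // character boundary, so the result is always valid UTF-8.
  return mb_strcut($value, 0, $max_bytes, 'UTF-8');
}
```

Swapping this in for `truncate_utf8()` in the hook above guarantees the indexed value stays under the byte limit regardless of the content's character set.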

Once deployed, indexing continued without issues and all of the content is now searchable.