Apachesolr Issues with German and other Germanic languages

Introducing stemming already improved the search results a lot. But one peculiarity of Germanic languages such as German keeps the results from being as good as with English: the formation of compound words.

One example: the English phrase "Danubian steam ship" becomes Donaudampfschiff in German.

What's the problem?

Apachesolr uses a request handler called DisMaxRequestHandler. This handler matches only whole words. For example, if your index contains a document about Danubian steam ships and you search for steam ship, the search succeeds because the words steam and ship are both matched (with the help of stemming).

Now imagine the same case in German: the document in your index is about Donaudampfschifffahrt and you search for Dampfschiff. The search fails because the words do not match. The DisMaxRequestHandler does not match parts of words.
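The difference between the two cases can be illustrated with a toy model of whole-word matching. This is only a sketch of the idea and not Solr's actual code: the function name matches and the token lists are made up for illustration, and stemming and scoring are ignored.

```python
def matches(query, document_tokens):
    """Return True only if every query term equals some indexed token."""
    indexed = {token.lower() for token in document_tokens}
    return all(term.lower() in indexed for term in query.split())


# Hypothetical index contents: one token per whitespace-separated word.
english_doc = ["Danubian", "steam", "ship"]
german_doc = ["Donaudampfschifffahrt"]

print(matches("steam ship", english_doc))   # both terms are indexed tokens
print(matches("Dampfschiff", german_doc))   # only a part of the single token
```

The English query finds its terms as separate tokens, while the German query term is buried inside one long token and never compared against it as a substring.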

Those combined words such as "Donaudampfschiff" are called compound words. In some Germanic languages it is common to build up very long compound words, such as Krankenhausbettenverwaltung (English: hospital bed administration).

When I realized the problem, I thought I would have to give up using Apachesolr because I have a lot of German content. But then I had an idea: if we could break up compound words in Apachesolr's text analyzer filter chain, everything should be fine! I needed a filter to explode the word Donaudampfschifffahrt into the four words Donau dampf schiff fahrt.

I did some research and discovered these articles, which describe algorithms for breaking up compound words:

BananaSplit - A Dictionary-based Compound Splitter for German by Niels Ott
A Splitter for German compound words - a presentation by Pasquale Imbemba

My goal was to create a filter for Apachesolr based on the described algorithms... Pasquale was kind enough to send me his implementation in Java...

And then there was much rejoicing ...

But it got even better: Apachesolr is based on another Apache project called Lucene. In fact, Lucene provides the search engine, whereas Apachesolr is rather the REST interface to Lucene. To be able to write a compound word filter, I got myself a book about Lucene's internals: "Lucene in Action" from Manning. I studied it closely and was very surprised: the solution was already there! Lucene provides a filter called DictionaryCompoundWordTokenFilter that already does what I wanted. It is not well known and I did not find much information about it, but it is exactly the solution to the problem at hand.

All I had to do now was to configure the compound word splitter in schema.xml. Here is an excerpt from my file:

<analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory"
        mapping="mapping-ISOLatin1Accent.txt" />
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
        words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1" catenateWords="1"
        catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
        dictionary="my_dictionary.txt" />
    <filter class="solr.SnowballPorterFilterFactory" language="German"
        protected="protwords.txt" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
<analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory"
        mapping="mapping-ISOLatin1Accent.txt" />
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
        ignoreCase="true" expand="true" />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
        words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1" catenateWords="0"
        catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.SnowballPorterFilterFactory" language="German"
        protected="protwords.txt" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>

The solr.DictionaryCompoundWordTokenFilterFactory entry is the addition: it puts the compound word splitter into the index text analyzer.

You may have noticed the dictionary="my_dictionary.txt". That's because the compound word splitter needs a dictionary of basic German words. It breaks the compound words into the basic words that are listed in the dictionary.
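To make the role of the dictionary concrete, here is a minimal sketch of dictionary-based splitting. It is not Lucene's implementation: the function decompose, its parameter names and their default values are assumptions chosen for illustration. The idea is simply to keep the original token and add every substring that appears in the dictionary.

```python
def decompose(token, dictionary, min_word_size=5,
              min_subword_size=2, max_subword_size=15):
    """Return the token followed by every dictionary word found inside it."""
    token = token.lower()
    parts = [token]  # the original token is always kept
    if len(token) < min_word_size:
        return parts  # too short to be worth splitting
    for start in range(len(token) - min_subword_size + 1):
        longest = min(max_subword_size, len(token) - start)
        for length in range(min_subword_size, longest + 1):
            candidate = token[start:start + length]
            if candidate in dictionary:
                parts.append(candidate)
    return parts


# Hypothetical four-word dictionary, already lowercased.
dictionary = {"donau", "dampf", "schiff", "fahrt"}
print(decompose("Donaudampfschifffahrt", dictionary))
# ['donaudampfschifffahrt', 'donau', 'dampf', 'schiff', 'fahrt']
```

Words missing from the dictionary are never produced as parts, which is why the quality of the word list determines the quality of the split.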

As a first step, I analyzed the words that people were searching for within Drupal. I then created the list by breaking up the most used words manually. And I had to transform the German umlauts like this:

ü -> u

ä -> a

ö -> o

The reason is that the tokens passed to the compound word splitter have already gone through the character transliteration (provided by MappingCharFilterFactory).
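If the word list is maintained with real umlauts, the same transformation could be applied by a small script before the file is deployed. This is a sketch under assumptions: the helper name normalize_entry is made up, and it covers only the three mappings listed above, not everything a mapping file may contain.

```python
# Only the three replacements listed above; extend as needed.
UMLAUT_MAP = str.maketrans({"ü": "u", "ä": "a", "ö": "o"})


def normalize_entry(word):
    """Lowercase a dictionary word and replace its umlauts."""
    return word.lower().translate(UMLAUT_MAP)


print(normalize_entry("Abkürzung"))  # abkurzung
```

Lowercasing first means that capital umlauts are handled by the same three mappings.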

My list starts like this:

abbruch
abgabe
abgeltung
abkurzung
abkurzungen
ablehnung
abrechnung
abschluss
absender
absenz
abwicklung
...

Comments

hi, thanks for posting this;

Submitted by Fredrik (not verified) on 22 October 2009 - 1:35pm

hi,
thanks for posting this; is there a publicly available/freely licensed dictionary of these words that you used for your "my_dictionary.txt"?
best,
fredrik

I thought I have to give up

Submitted by Neys (not verified) on 17 February 2011 - 1:25am

I thought I have to give up using Apachesolr because I have a lot German content. But then I had an idea. I then created the list by breaking up the most used words manually. Thank you.