Tuesday, September 27, 2016

N-Grams Analyzer Using Lucene 3.0.3

This post is inspired by this article. It may have used a different version of Lucene, so it was difficult to get working at first. This is a working version for Lucene 3.0.3.

If you ever need the token n-grams of a given string, you can simply use the helpers provided by Lucene's analysis classes.

First, you need to build your own Analyzer. This can be done using a ShingleMatrixFilter with the required parameters, e.g.:


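import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleMatrixFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class NGramAnalyzer extends Analyzer {

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // Chain: tokenize -> build shingles of 2 to 3 tokens joined by ' ' -> lowercase -> stop filter
    TokenStream shingles = new ShingleMatrixFilter(
        new StandardTokenizer(Version.LUCENE_20, reader), 2, 3, ' ');
    return new StopFilter(true, new LowerCaseFilter(shingles),
        StopAnalyzer.ENGLISH_STOP_WORDS_SET);
  }
}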


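The numeric parameters of the ShingleMatrixFilter state the minimum and maximum shingle size (2 and 3 here), and the last parameter is the spacer character placed between the tokens of a shingle. "Shingle" is another name for a token n-gram.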
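
To make the idea of a shingle concrete, here is a minimal plain-Java sketch that does not use Lucene at all. The class name ShingleSketch and the method shingles are made up for this illustration, and the whitespace splitting is a simplification; the order and exact set of tokens that ShingleMatrixFilter emits may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of word shingles; this is NOT Lucene code.
public class ShingleSketch {

  // Returns every run of minSize..maxSize consecutive words, joined by a single space.
  public static List<String> shingles(String text, int minSize, int maxSize) {
    // Simplified tokenization: lowercase and split on whitespace.
    String[] words = text.trim().toLowerCase().split("\\s+");
    List<String> result = new ArrayList<String>();
    for (int start = 0; start < words.length; start++) {
      for (int size = minSize; size <= maxSize && start + size <= words.length; size++) {
        StringBuilder sb = new StringBuilder(words[start]);
        for (int i = start + 1; i < start + size; i++) {
          sb.append(' ').append(words[i]);
        }
        result.add(sb.toString());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Prints: [an easy, an easy way, easy way, easy way to, way to]
    System.out.println(shingles("An easy way to", 2, 3));
  }
}
```

With a minimum size of 2 and a maximum of 3, single words never appear on their own; only two-word and three-word runs are produced.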

Then you can use the Analyzer as follows:


import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Place this method inside any class of your choice.
public static void main(String[] args) {
  try {
    String str = "An easy way to write an analyzer for tokens bi-gram (or even tokens n-grams) with lucene";
    Analyzer analyzer = new NGramAnalyzer();

    TokenStream stream = analyzer.tokenStream("content", new StringReader(str));
    // In Lucene 3.0.x the token text is read through TermAttribute.term()
    TermAttribute termAttribute = stream.addAttribute(TermAttribute.class);

    stream.reset();
    while (stream.incrementToken()) {
      System.out.println(termAttribute.term());
    }
    stream.close();

  } catch (IOException ie) {
    System.out.println("IO Error " + ie.getMessage());
  }
}