Class DictionaryLookup

java.lang.Object
morfologik.stemming.DictionaryLookup
All Implemented Interfaces:
Iterable<WordData>, IStemmer

public final class DictionaryLookup extends Object implements IStemmer, Iterable<WordData>
This class implements a dictionary lookup of an inflected word over a dictionary previously compiled using the dict_compile tool.
  • Field Details

    • matcher

      private final FSATraversal matcher
      An FSA used for lookups.
    • finalStatesIterator

      private final ByteSequenceIterator finalStatesIterator
      An iterator for walking along the final states of fsa.
    • rootNode

      private final int rootNode
      FSA's root node.
    • EXPAND_SIZE

      private static final int EXPAND_SIZE
      Expand buffers and arrays by this constant.
    • forms

      private WordData[] forms
      Private internal array of reusable word data objects.
    • formsList

      private final ArrayViewList<WordData> formsList
      A "view" over an array implementing
    • dictionaryMetadata

      private final DictionaryMetadata dictionaryMetadata
      Features of the compiled dictionary.
    • encoder

      private final CharsetEncoder encoder
      Charset encoder for the FSA.
    • decoder

      private final CharsetDecoder decoder
      Charset decoder for the FSA.
    • fsa

      private final FSA fsa
      The FSA we are using.
    • separatorChar

      private final char separatorChar
    • byteBuffer

      private ByteBuffer byteBuffer
      Internal reusable buffer for encoding words into byte arrays using encoder.
    • charBuffer

      private CharBuffer charBuffer
      Internal reusable character buffer holding a word's characters before they are encoded into bytes using encoder.
    • matchResult

      private final MatchResult matchResult
      Reusable match result.
    • dictionary

      private final Dictionary dictionary
      The Dictionary this lookup is using.
    • sequenceEncoder

      private final ISequenceEncoder sequenceEncoder
  • Constructor Details

    • DictionaryLookup

      public DictionaryLookup(Dictionary dictionary) throws IllegalArgumentException
      Creates a new lookup over the given dictionary, using its FSA for word lookups and its encoding for converting characters to bytes.
      Parameters:
      dictionary - The dictionary to use for lookups.
      Throws:
      IllegalArgumentException - if FSA's root node cannot be acquired (dictionary is empty).
  • Method Details

    • lookup

      public List<WordData> lookup(CharSequence word)
      Searches the automaton for a symbol sequence equal to word, followed by a separator. Each result is a stem (decompressed according to the dictionary's specification) and optional tag data.
      Specified by:
      lookup in interface IStemmer
      Parameters:
      word - The word (typically inflected) to look up base forms for.
      Returns:
      A list of WordData entries (possibly empty).
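      The "word followed by a separator" rule can be illustrated with a toy model. The sketch below is not the FSA-based implementation: it assumes entries are stored as plain strings of the form inflected+stem+tag (real dictionaries store the stem in an encoded form), and the class name SeparatorLookupSketch and its lookup helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only; the real class walks a compiled automaton.
final class SeparatorLookupSketch {

  // Returns {stem, tag} pairs for every entry that starts with word + separator.
  // The tag element is null when the entry carries no tag.
  static List<String[]> lookup(List<String> entries, String word, char separator) {
    String prefix = word + separator;
    List<String[]> results = new ArrayList<>();
    for (String entry : entries) {
      // Requiring the separator right after the word keeps "cat" from matching "cats+...".
      if (entry.startsWith(prefix)) {
        String rest = entry.substring(prefix.length());
        int tagStart = rest.indexOf(separator);
        String stem = tagStart < 0 ? rest : rest.substring(0, tagStart);
        String tag = tagStart < 0 ? null : rest.substring(tagStart + 1);
        results.add(new String[] {stem, tag});
      }
    }
    return results;
  }

  public static void main(String[] args) {
    List<String> entries = List.of("cat+cat+NN", "cats+cat+NNS", "running+run");
    for (String[] match : lookup(entries, "cats", '+')) {
      System.out.println(match[0] + " / " + match[1]);
    }
  }
}
```

      As in the documented method, a word with no matching entry yields an empty list rather than null.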
    • applyReplacements

      public static String applyReplacements(CharSequence word, LinkedHashMap<String,String> replacements)
      Applies partial string replacements from a given map. Useful if the word needs to be normalized in some way (e.g., ligatures, apostrophes and such).
      Parameters:
      word - The word to apply replacements to.
      replacements - A map of replacements (from->to).
      Returns:
      A new string with all replacements applied.
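      A minimal sketch of this kind of normalization follows. It is a stand-alone reimplementation under assumptions, not the library's code: the class ReplacementSketch and its applyAll helper are hypothetical, and it assumes that replacements are applied one after another in the map's insertion order, so a later rule sees the output of earlier ones.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for illustration; not the library's implementation.
final class ReplacementSketch {

  // Applies each from->to pair in insertion order to every occurrence in the word.
  static String applyAll(CharSequence word, LinkedHashMap<String, String> replacements) {
    String result = word.toString();
    for (Map.Entry<String, String> e : replacements.entrySet()) {
      result = result.replace(e.getKey(), e.getValue());
    }
    return result;
  }

  public static void main(String[] args) {
    LinkedHashMap<String, String> replacements = new LinkedHashMap<>();
    replacements.put("\uFB01", "fi"); // the "fi" ligature
    replacements.put("\u2019", "'");  // typographic apostrophe
    System.out.println(applyAll("\uFB01sh\u2019s", replacements));
  }
}
```

      A LinkedHashMap is a natural parameter type here because its iteration order is the insertion order, which makes the outcome deterministic when rules overlap.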
    • iterator

      public Iterator<WordData> iterator()
      Return an iterator over all WordData entries available in the embedded Dictionary.
      Specified by:
      iterator in interface Iterable<WordData>
    • getDictionary

      public Dictionary getDictionary()
      Returns:
      The Dictionary used by this object.
    • getSeparatorChar

      public char getSeparatorChar()
      Returns:
      The logical separator character splitting the inflected form, the lemma correction token and the tag. Note that this character is a best-effort conversion from a byte in DictionaryMetadata.separator and may not be valid in the target encoding (although this is highly unlikely).
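      One way to picture a best-effort byte-to-character conversion is sketched below. This is an assumption about the general technique, not a description of how this class performs the conversion: the class SeparatorCharSketch and its toChar helper are hypothetical, and the sketch simply decodes the single byte with the given charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the library.
final class SeparatorCharSketch {

  // Decodes one byte with the given charset. Charset.decode substitutes a
  // replacement character for input it cannot map, hence "best effort".
  static char toChar(byte separator, Charset charset) {
    String decoded = charset.decode(ByteBuffer.wrap(new byte[] {separator})).toString();
    // Fall back to the raw unsigned byte value if nothing was decoded.
    return decoded.isEmpty() ? (char) (separator & 0xFF) : decoded.charAt(0);
  }

  public static void main(String[] args) {
    System.out.println(toChar((byte) '+', StandardCharsets.UTF_8));
  }
}
```

      For ASCII separators such as '+' the byte and the character coincide in most common encodings, which is consistent with the note that an invalid result is highly unlikely.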