Class NgramExtractor

java.lang.Object
com.optimaize.langdetect.ngram.NgramExtractor

public class NgramExtractor extends Object
Class for extracting n-grams out of a text.
  • Field Details

    • gramLengths

      @NotNull private final List<Integer> gramLengths
    • filter

      @Nullable private final NgramFilter filter
    • textPadding

      @Nullable private final Character textPadding
  • Constructor Details

    • NgramExtractor

      private NgramExtractor(@NotNull List<Integer> gramLengths, @Nullable NgramFilter filter, @Nullable Character textPadding)
  • Method Details

    • gramLength

      public static NgramExtractor gramLength(int gramLength)
    • gramLengths

      public static NgramExtractor gramLengths(Integer... gramLength)
    • filter

      public NgramExtractor filter(NgramFilter filter)
    • textPadding

      public NgramExtractor textPadding(char textPadding)
      Sets the character that is added to the left and right of the text so that border grams are produced.

      Example: when textPadding is a space ' ', the input "foo" becomes " foo ", ensuring that n-grams like " f" are created.

      If the text already starts or ends with that character, it is not added again on that side.

      Parameters:
      textPadding - for example a space ' '.
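
      The padding rule described above can be sketched as follows. This is a minimal illustration, not the library's private applyPadding: the class PaddingSketch and the method pad are hypothetical names, and returning empty input unchanged is an assumption the documentation does not confirm.

```java
final class PaddingSketch {

    // Hypothetical helper illustrating the documented rule; not part of the library.
    static String pad(CharSequence text, char padding) {
        if (text.length() == 0) {
            // Assumption: empty input is returned unchanged.
            return text.toString();
        }
        StringBuilder sb = new StringBuilder(text.length() + 2);
        // Add the padding on the left only if the text does not already start with it.
        if (text.charAt(0) != padding) {
            sb.append(padding);
        }
        sb.append(text);
        // Add the padding on the right only if the text does not already end with it.
        if (text.charAt(text.length() - 1) != padding) {
            sb.append(padding);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + pad("foo", ' ') + "]");  // prints "[ foo ]"
        System.out.println("[" + pad(" foo", ' ') + "]"); // prints "[ foo ]"
    }
}
```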
    • getGramLengths

      public List<Integer> getGramLengths()
    • extractGrams

      @NotNull public List<String> extractGrams(@NotNull CharSequence text)
      Creates the n-grams for a given text in the order they occur.

      Example: with a gram length of 2 and no filter or padding, extractGrams("Foo bar") => [Fo, oo, "o ", " b", ba, ar]

      Parameters:
      text - the text to extract n-grams from.
      Returns:
      The grams; empty if the input was empty or too short for any of the configured gram lengths.
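
      To illustrate extraction in order of occurrence, here is a minimal sliding-window sketch for a single gram length. It is not the library's implementation: the class ExtractGramsSketch and the method grams are hypothetical, and filtering, padding, and multiple gram lengths are deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

final class ExtractGramsSketch {

    // Hypothetical helper: collects every substring of length gramLength, left to right.
    static List<String> grams(CharSequence text, int gramLength) {
        if (gramLength < 1) {
            throw new IllegalArgumentException("gramLength must be >= 1");
        }
        List<String> result = new ArrayList<>();
        // The loop body never runs when the text is shorter than gramLength,
        // so the result is empty in that case.
        for (int i = 0; i + gramLength <= text.length(); i++) {
            result.add(text.subSequence(i, i + gramLength).toString());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(grams("Foo bar", 2)); // prints [Fo, oo, o ,  b, ba, ar]
    }
}
```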
    • extractCountedGrams

      @NotNull public Map<String,Integer> extractCountedGrams(@NotNull CharSequence text)
      Returns:
      Key = n-gram, value = count. Entries are ordered by each n-gram's first appearance in the text.
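
      One way to obtain counts in first-appearance order is an insertion-ordered map. The sketch below assumes a java.util.LinkedHashMap and a single gram length; whether the library uses the same map type internally is not stated here, and the names CountedGramsSketch and countedGrams are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CountedGramsSketch {

    // Hypothetical helper: counts n-grams of one length, keeping first-appearance order.
    static Map<String, Integer> countedGrams(CharSequence text, int gramLength) {
        if (gramLength < 1) {
            throw new IllegalArgumentException("gramLength must be >= 1");
        }
        // LinkedHashMap iterates in insertion order, i.e. the order in which
        // each n-gram was first seen.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + gramLength <= text.length(); i++) {
            String gram = text.subSequence(i, i + gramLength).toString();
            counts.merge(gram, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countedGrams("banana", 2)); // prints {ba=1, an=2, na=2}
    }
}
```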
    • _extractCounted

      private void _extractCounted(CharSequence text, int gramLength, int len, Map<String,Integer> grams)
    • guessNumDistinctiveGrams

      private static int guessNumDistinctiveGrams(int textLength, int gramLength)
      This is a heuristic that tries to be smart. The result also depends on the script (alphabetic scripts produce fewer distinct n-grams than ideographic ones), so I'm not sure how good it really is. It only tries to prevent array copies, and for Latin text it seems to work fine.
    • applyPadding

      private CharSequence applyPadding(CharSequence text)