Interface WordIndexer<W>

Type Parameters:
W - A type representing words in the language. Can be a String, or something more complex if needed
All Superinterfaces:
Serializable
All Known Implementing Classes:
StringWordIndexer

public interface WordIndexer<W> extends Serializable
Enumerates words in the vocabulary of a language model. Stores a two-way mapping between integers and words.
Author:
adampauls
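
The two-way mapping described above can be sketched with a hash map (word to index) paired with a list (index to word). This is a minimal illustration only: `SimpleIndexer` is a hypothetical stand-in, not the library's `StringWordIndexer`, and assigning indices in insertion order starting from 0 is an assumption of the sketch, not something this page specifies.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TwoWayMappingSketch {

    /** Hypothetical stand-in for an indexer; not the library's implementation. */
    static final class SimpleIndexer<W> {
        private final Map<W, Integer> wordToIndex = new HashMap<>();
        private final List<W> indexToWord = new ArrayList<>();

        /** Returns the existing index for the word, or assigns the next free one. */
        int getOrAddIndex(W word) {
            Integer existing = wordToIndex.get(word);
            if (existing != null) {
                return existing;
            }
            int index = indexToWord.size(); // assumption: indices start at 0
            wordToIndex.put(word, index);
            indexToWord.add(word);
            return index;
        }

        /** Reverse lookup: index back to word. */
        W getWord(int index) {
            return indexToWord.get(index);
        }

        /** Number of distinct words added so far. */
        int numWords() {
            return indexToWord.size();
        }
    }

    public static void main(String[] args) {
        SimpleIndexer<String> indexer = new SimpleIndexer<>();
        int the = indexer.getOrAddIndex("the");
        int cat = indexer.getOrAddIndex("cat");
        int theAgain = indexer.getOrAddIndex("the"); // already present, same index
        System.out.println(the + " " + cat + " " + theAgain); // prints "0 1 0"
        System.out.println(indexer.getWord(cat));             // prints "cat"
        System.out.println(indexer.numWords());               // prints "2"
    }
}
```

Keeping both directions in memory costs extra space but makes word-to-index and index-to-word lookups equally cheap, which is presumably why the interface exposes both.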
  • Nested Class Summary

    Nested Classes
    Modifier and Type
    Interface
    Description
    static class 
     
  • Method Summary

    Modifier and Type   Method                                  Description
    W                   getEndSymbol()                          Returns the end symbol (usually something like </s>).
    int                 getIndexPossiblyUnk(W word)             Should never add to the vocabulary, and should return the index of getUnkSymbol() if the word is not in the vocabulary.
    int                 getOrAddIndex(W word)                   Gets the index for a word, adding it if necessary.
    int                 getOrAddIndexFromString(String word)
    W                   getStartSymbol()                        Returns the start symbol (usually something like <s>).
    W                   getUnkSymbol()                          Returns the unk symbol (usually something like <unk>).
    W                   getWord(int index)                      Gets the word object for an index.
    int                 numWords()                              Number of words that have been added so far.
    void                setEndSymbol(W sym)
    void                setStartSymbol(W sym)
    void                setUnkSymbol(W sym)
    void                trimAndLock()                           Informs the implementation that no more words can be added to the vocabulary.
  • Method Details

    • getOrAddIndex

      int getOrAddIndex(W word)
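      Gets the index for a word, adding the word to the vocabulary if necessary.
      Parameters:
      word - the word to look up, which is added if it is not already present
      Returns:
      the index of the word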
    • getOrAddIndexFromString

      int getOrAddIndexFromString(String word)
    • getIndexPossiblyUnk

      int getIndexPossiblyUnk(W word)
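      Looks up the index for a word without adding it. Implementations should never add to the vocabulary here, and should return the index of the unk symbol (see getUnkSymbol()) if the word is not in the vocabulary.
      Parameters:
      word - the word to look up
      Returns:
      the index of the word, or the index of the unk symbol if the word is not in the vocabulary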
    • getWord

      W getWord(int index)
      Gets the word object for an index.
      Parameters:
      index - the index of the word to retrieve
      Returns:
      the word object mapped to the index
    • numWords

      int numWords()
      Number of words that have been added so far.
      Returns:
      the number of words in the vocabulary
    • getStartSymbol

      W getStartSymbol()
      Returns the start symbol (usually something like <s>).
      Returns:
      the start symbol
    • setStartSymbol

      void setStartSymbol(W sym)
    • getEndSymbol

      W getEndSymbol()
      Returns the end symbol (usually something like </s>).
      Returns:
      the end symbol
    • setEndSymbol

      void setEndSymbol(W sym)
    • getUnkSymbol

      W getUnkSymbol()
      Returns the unk (unknown-word) symbol (usually something like <unk>).
      Returns:
      the unk symbol
    • setUnkSymbol

      void setUnkSymbol(W sym)
    • trimAndLock

      void trimAndLock()
      Informs the implementation that no more words can be added to the vocabulary. Implementations may perform some space optimization, and should trigger an error if an attempt is made to add a word after this point.
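
The locking contract above, together with the non-adding lookup of getIndexPossiblyUnk, can be illustrated with a small sketch. `LockableIndexer` is a hypothetical class written for this example, not the library's implementation. Several choices here are assumptions: the error is an `IllegalStateException`, looking up an already-known word through `getOrAddIndex` still succeeds after locking, unknown words map to the index of the unk symbol (or -1 if it was never added), and the "space optimization" is just trimming the backing list.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

public class LockableIndexerSketch {

    /** Hypothetical indexer showing one way to honor the trimAndLock contract. */
    static final class LockableIndexer {
        private final Map<String, Integer> wordToIndex = new HashMap<>();
        private final ArrayList<String> indexToWord = new ArrayList<>();
        private final String unkSymbol = "<unk>";
        private boolean locked = false;

        int getOrAddIndex(String word) {
            Integer existing = wordToIndex.get(word);
            if (existing != null) {
                return existing; // assumption: known words are still retrievable after locking
            }
            if (locked) {
                // Assumption: the "error" is an unchecked exception.
                throw new IllegalStateException("Vocabulary is locked; cannot add: " + word);
            }
            int index = indexToWord.size();
            wordToIndex.put(word, index);
            indexToWord.add(word);
            return index;
        }

        /** Never adds; falls back to the unk symbol's index (or -1 if unk was never added). */
        int getIndexPossiblyUnk(String word) {
            Integer existing = wordToIndex.get(word);
            if (existing != null) {
                return existing;
            }
            Integer unk = wordToIndex.get(unkSymbol);
            return unk != null ? unk : -1;
        }

        void trimAndLock() {
            indexToWord.trimToSize(); // simple space optimization: drop spare list capacity
            locked = true;
        }
    }

    public static void main(String[] args) {
        LockableIndexer indexer = new LockableIndexer();
        indexer.getOrAddIndex("<unk>"); // index 0
        indexer.getOrAddIndex("cat");   // index 1
        indexer.trimAndLock();

        System.out.println(indexer.getIndexPossiblyUnk("cat")); // prints "1"
        System.out.println(indexer.getIndexPossiblyUnk("dog")); // prints "0" (index of <unk>)
        try {
            indexer.getOrAddIndex("dog");
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

A typical pattern under these assumptions is to call getOrAddIndex while building the vocabulary, call trimAndLock once, and use getIndexPossiblyUnk for all later lookups.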