presage 0.9.1
Tokenizer Class Reference (abstract)

#include <tokenizer.h>

[Figure: inheritance diagram for Tokenizer]
[Figure: collaboration diagram for Tokenizer]

Classes

class  StreamGuard
 

Public Member Functions

 Tokenizer (std::istream &stream, const std::string blankspaces, const std::string separators)
 
virtual ~Tokenizer ()
 
virtual int countTokens ()=0
 
virtual bool hasMoreTokens () const =0
 
virtual std::string nextToken ()=0
 
virtual double progress () const =0
 
void blankspaceChars (const std::string)
 
std::string blankspaceChars () const
 
void separatorChars (const std::string)
 
std::string separatorChars () const
 
void lowercaseMode (const bool)
 
bool lowercaseMode () const
 
std::string streamToString () const
 

Protected Member Functions

bool isBlankspace (const int character) const
 
bool isSeparator (const int character) const
 

Protected Attributes

std::istream & stream
 
std::ios::iostate sstate
 
std::streamoff offbeg
 
std::streamoff offend
 
std::streamoff offset
 

Private Attributes

std::string blankspaces
 
std::string separators
 
bool lowercase
 

Detailed Description

The Tokenizer class takes an input stream and parses it into "tokens", allowing the tokens to be read one at a time.

The parsing process is controlled by three character classes:

  • blankspace characters: characters that mark a token boundary and are not part of the token.
  • separator characters: characters that mark a token boundary and may themselves be treated as tokens, depending on a flag that is not yet implemented.
  • valid characters: any character that is neither a blankspace nor a separator.

Each byte read from the input stream is regarded as a character in the range '\u0000' through '\u00FF'.
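
The three character classes can be illustrated with a minimal sketch. The CharClass enum and the classify function are hypothetical names introduced here, not part of presage, and the membership test with std::string::find is an assumption about how the sets might be consulted. A character listed in both sets is reported as a blankspace in this sketch; the library may resolve that case differently.

```cpp
#include <string>

// Hypothetical names for illustration; not part of the presage API.
enum class CharClass { Blankspace, Separator, Valid };

// Classify one byte (passed as an int, as isBlankspace/isSeparator take it)
// against the two configured character sets.
CharClass classify(int character,
                   const std::string& blankspaces,
                   const std::string& separators)
{
    const char c = static_cast<char>(character);
    if (blankspaces.find(c) != std::string::npos) {
        return CharClass::Blankspace;
    }
    if (separators.find(c) != std::string::npos) {
        return CharClass::Separator;
    }
    return CharClass::Valid;  // neither blankspace nor separator
}
```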

In addition, an instance has flags that control:

  • whether the characters of tokens are converted to lowercase.
  • whether separator characters constitute tokens. (TBD)
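
The effect of the lowercase flag on a finished token might look like the following sketch. Both toLowercase and finishToken are hypothetical helpers, and byte-wise conversion with std::tolower is an assumption that matches the one-byte-per-character model described above; presage's own conversion may differ.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper; presage's own conversion may differ.
std::string toLowercase(std::string token)
{
    for (char& c : token) {
        // std::tolower requires a value representable as unsigned char.
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return token;
}

// Apply the conversion only when the lowercase flag is set.
std::string finishToken(const std::string& token, bool lowercase)
{
    return lowercase ? toLowercase(token) : token;
}
```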

Because Tokenizer is abstract, a typical application constructs an instance of a concrete subclass (ForwardTokenizer or ReverseTokenizer), supplying the input stream to be tokenized, the set of blankspaces, and the set of separators. It then calls nextToken in a loop for as long as hasMoreTokens returns true.
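
The usage pattern can be sketched without the presage headers by using a small stand-in. SketchTokenizer and readAllTokens are hypothetical and exist only for this example; the stand-in exposes just hasMoreTokens and nextToken, and it treats separators as plain token boundaries because the separator-token flag is not yet implemented. With presage itself, the same loop would presumably be written against one of the concrete subclasses.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in exposing only the two methods used in the loop.
// Separators are treated like blankspaces (pure token boundaries).
class SketchTokenizer {
public:
    SketchTokenizer(std::istream& stream,
                    const std::string& blankspaces,
                    const std::string& separators)
        : stream_(stream), delimiters_(blankspaces + separators)
    {
        advance();
    }

    bool hasMoreTokens() const { return !next_.empty(); }

    std::string nextToken()
    {
        std::string token = next_;
        advance();
        return token;
    }

private:
    // Read ahead to the next non-empty run of valid characters.
    void advance()
    {
        next_.clear();
        char c;
        while (stream_.get(c)) {
            const bool boundary = delimiters_.find(c) != std::string::npos;
            if (!boundary) {
                next_ += c;
            } else if (!next_.empty()) {
                break;  // boundary reached after at least one valid character
            }
        }
    }

    std::istream& stream_;
    std::string delimiters_;
    std::string next_;
};

// Construct the tokenizer, then drain it with hasMoreTokens/nextToken.
std::vector<std::string> readAllTokens(std::istream& input)
{
    SketchTokenizer tokenizer(input, " \t\n", ".,;");
    std::vector<std::string> tokens;
    while (tokenizer.hasMoreTokens()) {
        tokens.push_back(tokenizer.nextToken());
    }
    return tokens;
}
```

Passing a std::istringstream holding "Hello, world. Bye" to readAllTokens yields the three tokens Hello, world, and Bye in this sketch.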

Definition at line 64 of file tokenizer.h.

Constructor & Destructor Documentation

◆ Tokenizer()

Tokenizer::Tokenizer ( std::istream & stream,
const std::string blankspaces,
const std::string separators )

Definition at line 27 of file tokenizer.cpp.

References blankspaceChars(), blankspaces, lowercase, offbeg, offend, offset, separatorChars(), separators, sstate, and stream.

Referenced by ForwardTokenizer::ForwardTokenizer(), and ReverseTokenizer::ReverseTokenizer().


◆ ~Tokenizer()

Tokenizer::~Tokenizer ( )
virtual

Definition at line 53 of file tokenizer.cpp.

References sstate, and stream.

Member Function Documentation

◆ blankspaceChars() [1/2]

std::string Tokenizer::blankspaceChars ( ) const

Gets blankspace characters.

Definition at line 66 of file tokenizer.cpp.

References blankspaces.

Referenced by Tokenizer().


◆ blankspaceChars() [2/2]

void Tokenizer::blankspaceChars ( const std::string chars)

Sets blankspace characters.

Definition at line 61 of file tokenizer.cpp.

References blankspaces.

◆ countTokens()

virtual int Tokenizer::countTokens ( )
pure virtual

Returns the number of tokens left.

Implemented in ForwardTokenizer, and ReverseTokenizer.

◆ hasMoreTokens()

virtual bool Tokenizer::hasMoreTokens ( ) const
pure virtual

Tests if there are more tokens.

Implemented in ForwardTokenizer, and ReverseTokenizer.

◆ isBlankspace()

bool Tokenizer::isBlankspace ( const int character) const
protected

Definition at line 91 of file tokenizer.cpp.

References blankspaces.

Referenced by ForwardTokenizer::nextToken(), and ReverseTokenizer::nextToken().


◆ isSeparator()

bool Tokenizer::isSeparator ( const int character) const
protected

Definition at line 101 of file tokenizer.cpp.

References separators.

Referenced by ForwardTokenizer::nextToken(), and ReverseTokenizer::nextToken().


◆ lowercaseMode() [1/2]

bool Tokenizer::lowercaseMode ( ) const

Gets lowercase mode.

Definition at line 86 of file tokenizer.cpp.

References lowercase.

Referenced by ForwardTokenizer::nextToken(), and ReverseTokenizer::nextToken().


◆ lowercaseMode() [2/2]

void Tokenizer::lowercaseMode ( const bool value)

Sets lowercase mode.

Definition at line 81 of file tokenizer.cpp.

References lowercase.

Referenced by ContextChangeDetector::change(), ContextTracker::getToken(), ContextTracker::learn(), and main().


◆ nextToken()

virtual std::string Tokenizer::nextToken ( )
pure virtual

Returns the next token.

Implemented in ForwardTokenizer, and ReverseTokenizer.

◆ progress()

virtual double Tokenizer::progress ( ) const
pure virtual

Returns progress percentage.
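
One way such a value could be derived from the protected offsets offbeg, offend, and offset is sketched below. This is an assumption, not the documented computation: progressFraction is a hypothetical free function, the formula is a guess, and it returns a fraction between 0 and 1, so it would need to be multiplied by 100 if a true percentage is expected.

```cpp
#include <ios>

// Hypothetical free function; the real computation lives in the subclasses
// and may use a different formula or scale.
double progressFraction(std::streamoff offbeg,
                        std::streamoff offend,
                        std::streamoff offset)
{
    const std::streamoff total = offend - offbeg;
    if (total <= 0) {
        return 1.0;  // empty stream: nothing left to read
    }
    return static_cast<double>(offset - offbeg) / static_cast<double>(total);
}
```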

Implemented in ForwardTokenizer, and ReverseTokenizer.

◆ separatorChars() [1/2]

std::string Tokenizer::separatorChars ( ) const

Gets separator characters.

Definition at line 76 of file tokenizer.cpp.

References separators.

Referenced by Tokenizer().


◆ separatorChars() [2/2]

void Tokenizer::separatorChars ( const std::string chars)

Sets separator characters.

Definition at line 71 of file tokenizer.cpp.

References separators.

◆ streamToString()

std::string Tokenizer::streamToString ( ) const
inline

Definition at line 109 of file tokenizer.h.

References offbeg, offend, and stream.

Member Data Documentation

◆ blankspaces

◆ lowercase

bool Tokenizer::lowercase
private

Definition at line 157 of file tokenizer.h.

Referenced by lowercaseMode() [1/2], lowercaseMode() [2/2], and Tokenizer().

◆ offbeg

◆ offend

◆ offset

◆ separators

std::string Tokenizer::separators
private

◆ sstate

std::ios::iostate Tokenizer::sstate
protected

Definition at line 145 of file tokenizer.h.

Referenced by Tokenizer(), and ~Tokenizer().

◆ stream


The documentation for this class was generated from the following files: