API Documentation

High-level API

The high-level API exposes functions that work on plain Unicode strings.

If you need to process other kinds of input or have implemented your own tokenizer, use the lower-level parser classes described below.

text_to_num.text2num(text: str, lang: str, relaxed: bool = False) → int

Convert the text string, which contains an integer number written out in words in the language given by lang (French in the examples here), into an integer value.

Set relaxed to True if you want to accept “quatre vingt(s)” as “quatre-vingt”.

Raises an AssertionError if text does not describe a valid number. Returns an int.
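A minimal sketch of calling text2num defensively, assuming the package is installed and that "fr" is an accepted value for lang (neither is stated above). The helper name parse_int_or_none and its converter parameter are hypothetical; the parameter only exists so the helper can be exercised without the library.

```python
from typing import Callable, Optional


def parse_int_or_none(
    text: str,
    lang: str = "fr",
    converter: Optional[Callable[[str, str], int]] = None,
) -> Optional[int]:
    """Return the integer spelled out in `text`, or None if it is not a valid number."""
    if converter is None:
        # Imported lazily so this helper can be defined without the package installed.
        from text_to_num import text2num
        converter = text2num
    try:
        return converter(text, lang)
    except AssertionError:
        # Documented failure mode: text does not describe a valid number.
        return None
```

With the library installed, parse_int_or_none("quatre-vingt-quinze") simply forwards to text2num with lang set to "fr".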

text_to_num.alpha2digit(text: str, lang: str, relaxed: bool = False, signed: bool = True) → str

Return text with all the spelled-out numbers converted to digits, in the language given by lang (French in the examples here). Punctuation is handled. Set relaxed to True if you want to accept some disjoint numbers as compounds. Set signed to False if you don’t want to produce signed numbers, for example if you prefer to get « moins 2 » instead of « -2 ».

Parsers

The high-level API is built upon these parsers, which are implemented as classes.

These classes passively consume word tokens and can therefore be easily integrated into your own tokenizer or framework.

Convert spelled numbers into numeric values or digit strings.

class text_to_num.parsers.WordStreamValueParser(lang: text_to_num.lang.base.Language, relaxed: bool = False)

The actual value builder engine.

The engine incrementally recognizes a stream of words as a valid number and builds the corresponding numeric (integer) value.

The algorithm is based on the observation that humans group digits in threes to speak them more easily, and indeed the language uses powers of 1000 to structure large numbers.
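The grouping idea can be illustrated independently of the parser. The sketch below is not the library's implementation; the function name combine_groups and its input format are assumptions made for illustration. It multiplies the value of each three-digit group by the power of 1000 that follows it and sums the products.

```python
from typing import Iterable, Optional, Tuple


def combine_groups(groups: Iterable[Tuple[int, int]]) -> int:
    """Combine (group value, multiplier) pairs into a single integer.

    Each group value must be in 0..999. Each multiplier is assumed to be a
    power of 1000 (not checked) and multipliers must strictly decrease, as
    they do when a number is spoken aloud.
    """
    total = 0
    previous_multiplier: Optional[int] = None
    for group_value, multiplier in groups:
        if not 0 <= group_value <= 999:
            raise ValueError(f"group value out of range: {group_value}")
        if previous_multiplier is not None and multiplier >= previous_multiplier:
            raise ValueError("multipliers must strictly decrease")
        total += group_value * multiplier
        previous_multiplier = multiplier
    return total


# "two million, three hundred four thousand, fifty-six"
print(combine_groups([(2, 1_000_000), (304, 1_000), (56, 1)]))  # 2304056
```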

Public API:

  • self.push(word)
  • self.value: int

group_expects(word: str, update: bool = True) → bool

Does the current group expect word to complete it as a valid number? word should not be a multiplier; multipliers should be handled first.

is_coef_appliable(coef: int) → bool

Is this multiplier expected?

push(word: str, look_ahead: Optional[str] = None) → bool

Push next word from the stream.

Don’t push punctuation marks or symbols, only words. It is the responsibility of the caller to handle punctuation or any marker of pause in the word stream. The best practice is to call self.close() on such markers and start again afterwards.

Return True if word contributes to the current value, else False.

The first time this function returns True after self is instantiated marks the beginning of a number.

If this function returns False, and the last call returned True, that means you reached the end of a number. You can get its value from self.value.

Then, to parse a new number, you need to instantiate a new engine and start again from the last word you tried (the one that has just been rejected).
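The push contract described above can be sketched as a driver loop. This is an illustration only: extract_values, NumberParser and ToyParser are hypothetical names, and ToyParser is a tiny stand-in that mimics the contract so the loop can run without the library. With the real library, make_parser would be a callable that builds a WordStreamValueParser.

```python
from typing import Callable, Iterable, List, Protocol


class NumberParser(Protocol):
    """The part of the parser interface that the driver relies on."""
    value: int

    def push(self, word: str) -> bool: ...


def extract_values(words: Iterable[str], make_parser: Callable[[], NumberParser]) -> List[int]:
    """Collect the value of every number found in a punctuation-free word stream."""
    values: List[int] = []
    parser = make_parser()
    in_number = False
    for word in words:
        if parser.push(word):
            in_number = True
            continue
        if in_number:
            # The rejected word ends the current number: record the value, then
            # retry that same word with a fresh parser, as it may start a new number.
            values.append(parser.value)
            parser = make_parser()
            in_number = parser.push(word)
    if in_number:
        values.append(parser.value)
    return values


class ToyParser:
    """Hypothetical stand-in: accepts one tens word optionally followed by one unit word."""
    TENS = {"twenty": 20, "thirty": 30}
    UNITS = {"one": 1, "two": 2, "three": 3}

    def __init__(self) -> None:
        self.value = 0
        self._expects = "any"  # "any", "unit" or "nothing"

    def push(self, word: str) -> bool:
        if self._expects == "any" and word in self.TENS:
            self.value = self.TENS[word]
            self._expects = "unit"
            return True
        if self._expects in ("any", "unit") and word in self.UNITS:
            self.value += self.UNITS[word]
            self._expects = "nothing"
            return True
        return False


print(extract_values("twenty one two".split(), ToyParser))  # [21, 2]
```

In the usage line, "two" is rejected by the parser holding 21, so a new parser is created and "two" is pushed again, which yields the second value.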

value

At any moment, get the value of the currently recognized number.

class text_to_num.parsers.WordToDigitParser(lang: text_to_num.lang.base.Language, relaxed: bool = False, signed: bool = True)

Words to digit transcriber.

The engine incrementally recognizes a stream of words as a valid cardinal, ordinal, decimal or formal number (including leading zeros) and builds the corresponding digit string.

The submitted stream must be logically bounded: it is a phrase, it has a beginning and an end and does not contain sub-phrases. Formally, it contains neither punctuation nor voice pauses.

For example, this text:

« You don’t understand. I want two cups of coffee, three cups of tea and an apple pie. »

contains three phrases:

  • « you don’t understand »
  • « I want two cups of coffee »
  • « three cups of tea and an apple pie »

In other words, a stream must not cross (or include) punctuation marks or voice pauses. Otherwise you may get unexpected, illogical results. If you need to parse complete texts with punctuation, consider using the alpha2digit transformer.
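If you feed the parser yourself, the text has to be cut into phrases first. The following is a minimal sketch of such a splitter, not the logic used by alpha2digit; the name split_phrases and the chosen set of punctuation marks are assumptions.

```python
import re
from typing import List

# Assumption: any run of these punctuation marks closes a phrase.
PHRASE_BOUNDARY = re.compile(r"[.,;:!?«»()]+")


def split_phrases(text: str) -> List[str]:
    """Split text into punctuation-free phrases, dropping empty fragments."""
    fragments = (fragment.strip() for fragment in PHRASE_BOUNDARY.split(text))
    return [fragment for fragment in fragments if fragment]


text = "You don’t understand. I want two cups of coffee, three cups of tea and an apple pie."
for phrase in split_phrases(text):
    # Each phrase can now be tokenized into words and pushed to its own parser.
    print(phrase.split())
```

Applied to the example text above, the splitter returns the three phrases listed, each of which would get its own parser instance.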

Zeros are not treated in isolation: each zero is considered the start of a new formal number and is concatenated with the following digit.

Public API:

  • self.push(word, look_ahead)
  • self.close()
  • self.value: str

at_start() → bool

Return True if nothing valid has been parsed yet.

at_start_of_seq() → bool

Return True if we are waiting for the start of the integer part or the start of the fraction part.

close() → None

Signal the end of input when the input stream ends while still inside a number.

It’s safe to call it multiple times.

is_alone(word: str, next_word: Optional[str]) → bool

push(word: str, look_ahead: Optional[str] = None) → bool

Push next word from the stream.

Return True if word contributes to the current value, else False.

The first time this function returns True after self is instantiated marks the beginning of a number.

If this function returns False, and the last call returned True, that means you reached the end of a number. You can get its value from self.value.

Then, to parse a new number, you need to instantiate a new engine and start again from the last word you tried (the one that has just been rejected).

value

Misc.

text_to_num.transforms.look_ahead(sequence: Sequence[Any]) → Iterator[Tuple[Any, Any]]

Look-ahead iterator.

Iterate over a sequence by returning pairs (current element, next element). The last pair returned before StopIteration is raised is (last element, None).

Example:

>>> for elt, nxt_elt in look_ahead(sequence):
...     pass  # do something with elt and nxt_elt
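To make the behaviour concrete, here is a small equivalent written from the description above. It is a sketch, not the library's source, and the name look_ahead_sketch is hypothetical.

```python
from typing import Any, Iterator, Optional, Sequence, Tuple


def look_ahead_sketch(sequence: Sequence[Any]) -> Iterator[Tuple[Any, Optional[Any]]]:
    """Yield (current, next) pairs; the last pair is (last element, None)."""
    for index, current in enumerate(sequence):
        following = sequence[index + 1] if index + 1 < len(sequence) else None
        yield current, following


print(list(look_ahead_sketch(["vingt", "et", "un"])))
# [('vingt', 'et'), ('et', 'un'), ('un', None)]
```

An empty sequence yields no pairs at all, so a loop over it never runs.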