API Documentation
High-level API
The high-level API exposes functions that work on plain Unicode strings.
If you need to process another kind of source, or have implemented your own tokenizer, use the lower-level parser classes below.
- text_to_num.text2num(text: str, lang: str | Language, relaxed: bool = False) -> int
Convert the `text` string containing an integer number written as letters into an integer value.
Set `relaxed` to True if you want to accept “quatre vingt(s)” as “quatre-vingt” (fr) or “ein und zwanzig” as “einundzwanzig” (de), etc.
Raises a ValueError if `text` does not describe a valid number. Returns an int.
- text_to_num.alpha2digit(text: str, lang: str, relaxed: bool = False, signed: bool = True, ordinal_threshold: int = 3) -> str
Return `text` with all the numbers spelled out in `lang` converted to digits. Takes care of punctuation.
Set `relaxed` to True if you want to accept some disjoint numbers as compounds.
Set `signed` to False if you don’t want to produce signed numbers, that is, if you prefer to get « minus 2 » instead of « -2 ».
Ordinals up to `ordinal_threshold` are not converted.
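As an illustration of the two functions above, here is a minimal usage sketch. It assumes the `text_to_num` package is installed and that the `"fr"` language code is available; `parse_or_none` is a hypothetical helper written for this example, not part of the library.

```python
from text_to_num import alpha2digit, text2num


def parse_or_none(text, lang):
    """Return the integer spelled out in `text`, or None if it is not a valid number."""
    try:
        return text2num(text, lang)
    except ValueError:
        # text2num raises ValueError when `text` does not describe a valid number.
        return None


print(parse_or_none("quatre-vingt-quinze", "fr"))    # 95
print(parse_or_none("bonjour tout le monde", "fr"))  # None

# alpha2digit rewrites a whole sentence and handles the punctuation itself.
print(alpha2digit("Vingt-cinq vaches, douze poulets et cent vingt-cinq kg de pommes de terre.", "fr"))
```

Catching `ValueError` at the call site is only one possible design; letting the exception propagate may be preferable when an invalid number is a genuine error in your application.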
Parsers
The high-level API is built upon these parsers, which are implemented as classes.
These classes passively consume word tokens and can therefore be easily integrated into your own tokenizer or framework.
Convert spelled numbers into numeric values or digit strings.
- class text_to_num.parsers.WordStreamValueParser(lang: Language, relaxed: bool = False)
The actual value builder engine.
The engine incrementally recognizes a stream of words as a valid number and builds the corresponding numeric (integer) value.
The algorithm is based on the observation that humans gather digits in groups of three to speak them out more easily. And indeed, the language uses powers of 1000 to structure big numbers.
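The grouping idea described above can be illustrated with a toy accumulator. This is a simplified sketch, not the library's implementation: the word tables, the function name `toy_words_to_int` and the English-only vocabulary are all assumptions made for the example, and it performs no validation of word order.

```python
# Words that add to the current group of three digits (value 0..999).
SMALL = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11,
    "twelve": 12, "thirteen": 13, "fourteen": 14, "fifteen": 15,
    "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19,
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
# Powers of 1000 close the current group and scale it.
MULTIPLIERS = {"thousand": 10**3, "million": 10**6, "billion": 10**9}


def toy_words_to_int(words):
    """Sum groups of three digits, each scaled by the power of 1000 that follows it."""
    total = 0  # value of the groups already closed by a multiplier
    group = 0  # value of the group currently being built
    for word in words:
        if word in SMALL:
            group += SMALL[word]
        elif word == "hundred":
            group = max(group, 1) * 100  # a bare "hundred" counts as 100
        elif word in MULTIPLIERS:
            total += max(group, 1) * MULTIPLIERS[word]
            group = 0
        else:
            raise ValueError(f"not a number word: {word!r}")
    return total + group


words = "fifty one million five hundred seventy eight thousand three hundred two".split()
print(toy_words_to_int(words))  # 51578302
```

Unlike this sketch, the real engine works incrementally and rejects words that cannot extend the current number.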
Public API: `self.push(word)`, `self.value: int`
- group_expects(word: str, update: bool = True) -> bool
Does the current group expect `word` to complete it as a valid number? `word` should not be a multiplier; multipliers should be handled first.
- is_coef_appliable(coef: int) -> bool
- push(word: str, look_ahead: str | None = None) -> bool
Push next word from the stream.
Don’t push punctuation marks or symbols, only words. It is the responsibility of the caller to handle punctuation or any marker of pause in the word stream. The best practice is to call `self.close()` on such markers and start again afterwards.
Return `True` if `word` contributes to the current value, else `False`.
The first time (after instantiating `self`) this function returns True marks the beginning of a number.
If this function returns False, and the last call returned True, that means you reached the end of a number. You can get its value from `self.value`.
Then, to parse a new number, you need to instantiate a new engine and start again from the last word you tried (the one that has just been rejected).
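The push protocol described above can be sketched as a small driver loop. This is an illustrative sketch under a few assumptions: the `text_to_num` package is installed, a `Language` instance can be obtained from the `LANG` mapping in `text_to_num.lang`, and the input is a lowercase, punctuation-free list of words. The function `extract_values` is a hypothetical helper, not part of the library.

```python
from text_to_num.lang import LANG
from text_to_num.parsers import WordStreamValueParser


def extract_values(words, lang_code="en"):
    """Collect the integer value of each number found in a punctuation-free word list."""
    language = LANG[lang_code]
    parser = WordStreamValueParser(language)
    values = []
    in_number = False
    i = 0
    while i < len(words):
        ahead = words[i + 1] if i + 1 < len(words) else None
        if parser.push(words[i], ahead):
            in_number = True
            i += 1
        elif in_number:
            # End of a number: keep its value, then retry the rejected
            # word with a fresh engine (so `i` is not advanced here).
            values.append(parser.value)
            parser = WordStreamValueParser(language)
            in_number = False
        else:
            i += 1
    if in_number:
        values.append(parser.value)
    return values


print(extract_values("i bought five hundred seventy eight apples then two pears".split()))
```

The rejected word is deliberately pushed again on the new engine, because it may itself be the first word of the next number.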
- property value: int
At any moment, get the value of the currently recognized number.
- class text_to_num.parsers.WordStreamValueParserGerman(lang: Language, relaxed: bool = False)
The actual value builder engine for the German language.
The engine processes numbers blockwise and sums them up at the end.
The algorithm is based on the observation that humans gather digits in groups of three to speak them out more easily. And indeed, the language uses powers of 1000 to structure big numbers.
Public API: `self.parse(text)`, `self.value: int`
- parse(text: str) -> bool
Check text for number words, split complex number words (hundertfünfzig) if necessary and parse all at once.
- property value: int
At any moment, get the value of the currently recognized number.
- class text_to_num.parsers.WordStreamValueParserInterface(lang: Language, relaxed: bool = False)
Interface for language-dependent ‘WordStreamValueParser’.
- parse(text: str) -> bool
Parse whole text (or fail).
- push(word: str, look_ahead: str | None = None) -> bool
Push next word from the stream.
- property value: int
At any moment, get the value of the currently recognized number.
- class text_to_num.parsers.WordToDigitParser(lang: Language, relaxed: bool = False, signed: bool = True, ordinal_threshold: int = 3, preceding_word: str | None = None)
Words to digit transcriber.
The engine incrementally recognizes a stream of words as a valid cardinal, ordinal, decimal or formal number (including leading zeros) and builds the corresponding digit string.
The submitted stream must be logically bounded: it is a phrase, it has a beginning and an end, and it does not contain sub-phrases. Formally, it contains neither punctuation nor voice pauses.
For example, this text:
« You don’t understand. I want two cups of coffee, three cups of tea and an apple pie. »
contains three phrases:
« you don’t understand »
« I want two cups of coffee »
« three cups of tea and an apple pie »
In other words, a stream must not cross (or include) punctuation marks or voice pauses. Otherwise you may get unexpected, illogical results. If you need to parse complete texts with punctuation, consider using the alpha2digit transformer.
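To make the notion of a bounded stream concrete, the sketch below splits the example text into phrases at punctuation marks before any word is fed to a parser. It is only an illustration: `split_phrases` and the chosen set of boundary characters are assumptions made for this example, and alpha2digit performs its own punctuation handling.

```python
import re

# Characters treated as phrase boundaries in this sketch (an assumption).
_BOUNDARY = re.compile(r"[.,;:!?…]+")


def split_phrases(text):
    """Split `text` into punctuation-free phrases, dropping empty fragments."""
    return [part.strip() for part in _BOUNDARY.split(text) if part.strip()]


text = "You don’t understand. I want two cups of coffee, three cups of tea and an apple pie."
for phrase in split_phrases(text):
    print(phrase.split())  # each word list is one bounded stream
```

Note that such a naive split would also cut a decimal written with digits, such as « 3.5 », so it is only suitable for text where numbers are spelled out.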
Zeros are not treated as isolates but are considered as starting a new formal number and are concatenated to the following digit.
Public API: `self.push(word, look_ahead)`, `self.close()`, `self.value: str`
- at_start() -> bool
Return True if nothing valid has been parsed yet.
- at_start_of_seq() -> bool
Return True if we are waiting for the start of the integer part or the start of the fraction part.
- close() -> None
Signal end of input if input stream ends while still in a number.
It’s safe to call it multiple times.
- is_alone(word: str, next_word: str | None) -> bool
Check if the word is ‘alone’, meaning it is part of the ‘Language.NEVER_IF_ALONE’ exceptions and has no other numbers around it.
- push(word: str, look_ahead: str | None = None) -> bool
Push next word from the stream.
Return `True` if `word` contributes to the current value, else `False`.
The first time (after instantiating `self`) this function returns True marks the beginning of a number.
If this function returns False, and the last call returned True, that means you reached the end of a number. You can get its value from `self.value`.
Then, to parse a new number, you need to instantiate a new engine and start again from the last word you tried (the one that has just been rejected).
- property value: str
Return the current value.
Misc.
- text_to_num.transforms.look_ahead(sequence: Sequence[Any]) -> Iterator[Tuple[Any, Any]]
Look-ahead iterator.
Iterate over a sequence by returning pairs (current element, next element). The last pair returned before StopIteration is raised is (last element, None).
Example:
>>> for elt, nxt_elt in look_ahead(sequence):
...     pass  # do something
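For readers who want to see the documented behavior spelled out, here is a minimal re-implementation sketch. It is written from the description above, not copied from the library's source; the name `look_ahead_sketch` is hypothetical.

```python
from typing import Any, Iterator, Optional, Sequence, Tuple


def look_ahead_sketch(sequence: Sequence[Any]) -> Iterator[Tuple[Any, Optional[Any]]]:
    """Yield (current element, next element); the last pair is (last element, None)."""
    last_index = len(sequence) - 1
    for index, element in enumerate(sequence):
        following = sequence[index + 1] if index < last_index else None
        yield element, following


print(list(look_ahead_sketch(["two", "hundred", "cups"])))
# [('two', 'hundred'), ('hundred', 'cups'), ('cups', None)]
```

An empty sequence simply yields nothing.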