smoothtext package

Submodules

smoothtext.backend module

Backend module for SmoothText text processing library.

This module provides functionality to manage and validate different NLP backends that can be used with SmoothText. It supports multiple NLP frameworks through a unified interface, allowing users to switch between different implementations based on their needs.

Examples

>>> from smoothtext.backend import Backend
>>> Backend.is_supported('nltk')
True
>>> supported_backends = Backend.list_supported()
class smoothtext.backend.Backend(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

Enum representing NLP backends supported by SmoothText.

This enum defines the available NLP processing backends and provides utility methods for backend validation and management. Each backend represents a different NLP framework that can be used for text processing tasks.

Available backends:
  • NLTK: Natural Language Toolkit, suitable for basic NLP tasks

  • Stanza: Stanford NLP’s Stanza, offering state-of-the-art accuracy

Examples

>>> backend = Backend.parse('nltk')
>>> if backend and Backend.is_supported(backend):
...     print(f"{backend.value} is available")
static is_supported(backend)

Verifies if a backend is installed and available for use.

This method checks both if the backend is valid and if its required dependencies are installed in the current environment.

Parameters:

backend (Union[Backend, str]) – The backend to check, either as a Backend enum value or a string identifier.

Returns:

True if the backend is valid and its dependencies are installed, False otherwise.

Return type:

bool

Examples

>>> Backend.is_supported('nltk')
True  # If NLTK is installed
>>> Backend.is_supported('invalid')
False
static list_supported()

Retrieves all backends that are currently available for use.

This method checks all defined backends and returns only those that have their dependencies properly installed in the current environment.

Returns:

A list of Backend enum values representing the backends that are ready to use.

Return type:

list[Backend]

Examples

>>> supported = Backend.list_supported()
>>> print([b.value for b in supported])
['NLTK']  # If only NLTK is installed
static parse(backend)

Converts a backend identifier to its corresponding Backend enum value.

Parameters:

backend (Union[Backend, str]) – The backend identifier to parse. Can be either a Backend enum value or a string matching a backend name (case-insensitive).

Returns:

The corresponding Backend enum value if valid, None if the input cannot be mapped to a valid backend.

Return type:

Optional[Backend]

Examples

>>> Backend.parse('nltk')
<Backend.NLTK>
>>> Backend.parse('invalid')
None
static values()

Returns a list of all available backend options.

This method provides access to all defined backends, regardless of whether they are currently installed and supported in the environment.

Returns:

A list containing all defined Backend enum values.

Return type:

list[Backend]

Examples

>>> backends = Backend.values()
>>> print([b.value for b in backends])
['NLTK', 'Stanza']

smoothtext.language module

Language support module for SmoothText.

This module provides language identification and parsing capabilities through the Language enum. It supports ISO 639-1 (two-letter) and ISO 639-2 (three-letter) language codes, with optional country variants using either hyphen or underscore separators (e.g., ‘en-US’ or ‘en_US’).

Examples

>>> lang = Language.parse("en-US")
>>> print(lang)
English (United States)
>>> print(lang.family())
English
class smoothtext.language.Language(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

Enum representing languages supported by SmoothText.

This enum provides language identification capabilities with support for both ISO 639-1 (two-letter) and ISO 639-2 (three-letter) language codes, with optional country variants. Languages are organized in families, where regional variants (e.g., English_US) belong to a parent language (e.g., English).

# Base Languages
English

Generic English language support

German

Generic German language support

Russian

Generic Russian language support

Turkish

Generic Turkish language support

# English Variants
English_GB

British English variant

English_US

American English variant (default for ‘en’)

# German Variants
German_DE

German (Germany) variant (default for ‘de’)

# Russian Variants
Russian_RU

Russian (Russia) variant (default for ‘ru’)

# Turkish Variants
Turkish_TR

Turkish (Türkiye) variant (default for ‘tr’)

Examples

>>> lang = Language.English_US
>>> print(lang.alpha2())  # Returns 'en'
>>> print(lang.family())  # Returns Language.English
alpha2()

Get the ISO 639-1 two-letter code of the language.

Returns:

Two-letter language code (e.g., ‘en’ for English, ‘tr’ for Turkish)

Return type:

str
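
Examples

The outputs below follow the codes documented above:

>>> Language.English.alpha2()
'en'
>>> Language.Turkish.alpha2()
'tr'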

alpha3()

Get the ISO 639-2 three-letter code of the language.

Returns:

Three-letter language code (e.g., ‘eng’ for English, ‘tur’ for Turkish)

Return type:

str
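
Examples

The outputs below follow the codes documented above:

>>> Language.English.alpha3()
'eng'
>>> Language.Turkish.alpha3()
'tur'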

static families()

Get a list of all base language families.

Returns:

List containing all base Language enum values

Return type:

list[Language]
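
Examples

A sketch assuming the base languages listed above; the ordering shown assumes the enum definition order:

>>> Language.families()
[Language.English, Language.German, Language.Russian, Language.Turkish]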

family()

Get the family (base) language of the current language variant.

The family language represents the base language without region/country specifics. Regional variants return their base language, while base languages return themselves.

Returns:

Base language enum value

Return type:

Language

Examples

>>> Language.English_US.family()  # Returns Language.English
>>> Language.English.family()     # Returns Language.English
static parse(language)

Parse a language identifier into a Language enum value.

Parameters:

language (Language | str) – Language identifier to parse. Can be:

  • Language enum value

  • Full name (e.g., ‘English’)

  • ISO 639-1 code (e.g., ‘en’)

  • ISO 639-2 code (e.g., ‘eng’)

Returns:

The corresponding Language enum value if valid, or None if the input cannot be parsed into a supported language.

Return type:

Optional[Language]

Examples

>>> Language.parse('en')
Language.English
>>> Language.parse('invalid')
None
static parse_multiple(languages)

Parse multiple language identifiers into a list of Language enum values. Note: The order of returned languages is not guaranteed.

Parameters:

languages (Language | str | list[Language | str]) – One or more language identifiers. Can be:

  • Single Language enum value

  • Single language string

  • List of Language enum values and/or strings

  • Comma-separated string of language identifiers

Returns:

List of unique, valid Language enum values.

The order of languages in the list is not guaranteed.

Return type:

list[Language]

Examples

>>> # Order may vary in the results
>>> Language.parse_multiple('en,tr')
[Language.English_GB, Language.Turkish_TR]  # or [Language.Turkish_TR, Language.English_GB]
>>> set(Language.parse_multiple(['en', 'invalid', 'tr']))  # Use set for order-independent comparison
{Language.English_GB, Language.Turkish_TR}
readability_formulas()

Get a list of supported readability formulas for the current language.

Different languages have different readability formulas that are specifically designed and validated for their linguistic characteristics.

Returns:

List of readability formulas supported for this language

Return type:

list[ReadabilityFormula]

Examples

>>> Language.English.readability_formulas()
[ReadabilityFormula.Automated_Readability_Index,
 ReadabilityFormula.Flesch_Reading_Ease,
 ReadabilityFormula.Flesch_Kincaid_Grade,
 ReadabilityFormula.Flesch_Kincaid_Grade_Simplified,
 ReadabilityFormula.Gunning_Fog_Index]
>>> Language.Turkish.readability_formulas()
[ReadabilityFormula.Atesman, ReadabilityFormula.Bezirci_Yilmaz]
static values()

Get a list of all supported languages.

Returns:

List containing all supported Language enum values

Return type:

list[Language]
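
Examples

A membership check that avoids assuming the order of the returned list:

>>> langs = Language.values()
>>> Language.English in langs and Language.English_US in langs
True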

variants()

Get a list of all language variants for the current language.

Returns:

List of all language variants, including the current language

Return type:

list[Language]

Examples

>>> Language.English_GB.variants()
[Language.English_GB, Language.English_US]
>>> Language.English.variants()
[Language.English_GB, Language.English_US]

smoothtext.readability module

Readability formulas module for SmoothText.

This module provides an enumeration of various readability formulas that can be used to assess text complexity in different languages. Each formula is designed for specific languages and provides different metrics for text readability.

Examples

>>> from smoothtext import ReadabilityFormula
>>> formula = ReadabilityFormula.Flesch_Reading_Ease
>>> print(formula.value)
Flesch Reading Ease
>>> print(formula.supports(Language.English))
True
>>> print(formula.supports(Language.Turkish))
False
class smoothtext.readability.ReadabilityFormula(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

Enumeration of readability formulas supported by SmoothText.

This enum defines various readability formulas that can be used to assess text complexity in different languages. Each formula is designed for specific languages and provides different metrics for text readability.

# English Formulas
Automated_Readability_Index

English readability formula developed by the US Army.

Flesch_Reading_Ease

Classic English readability formula. Scores range from 0 (hardest) to 100 (easiest).

Flesch_Kincaid_Grade

English grade-level assessment formula. Indicates US grade level required to understand the text.

Flesch_Kincaid_Grade_Simplified

Simplified version of Flesch-Kincaid Grade. Uses reduced parameters for grade-level assessment.

Gunning_Fog_Index

English readability formula developed by Robert Gunning.

# German Formulas
Wiener_Sachtextformel

Alias for Wiener_Sachtextformel_3 (general purpose formula).

Wiener_Sachtextformel_1

First variant of Wiener Sachtextformel. Optimized for narrative texts.

Wiener_Sachtextformel_2

Second variant of Wiener Sachtextformel. Optimized for scientific texts.

Wiener_Sachtextformel_3

Third variant of Wiener Sachtextformel. General purpose formula.

Wiener_Sachtextformel_4

Fourth variant of Wiener Sachtextformel. Alternative general purpose formula.

# Russian Formulas
Matskovskiy

Russian readability formula developed by Matskovskiy. Provides grade-level assessment for Russian texts.

# Turkish Formulas
Atesman

Turkish readability formula developed by Ateşman. Scores range from 0 (hardest) to 100 (easiest).

Bezirci_Yilmaz

Turkish readability formula by Bezirci and Yılmaz. Provides grade-level assessment for Turkish texts.

supports(language)

Determines if the formula supports the specified language.

Parameters:

language (Union[Language, str, None]) – The language to check support for. Can be either a Language enum value or a string identifier.

Returns:

  • True if the formula supports the specified language.

  • False if the formula does not support the language.

Return type:

bool
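
Examples

Based on the language support documented above (Ateşman is a Turkish formula):

>>> ReadabilityFormula.Atesman.supports(Language.Turkish)
True
>>> ReadabilityFormula.Atesman.supports('en')
False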

smoothtext.smoothtext module

SmoothText - A Python library for natural language text analysis and readability scoring.

This module provides functionality for:

  • Text tokenization (sentences and words)

  • Syllable counting and syllabification

  • Multiple readability formula calculations (Flesch, Ateşman, etc.)

  • Reading time estimation

  • Support for multiple languages and backend engines

All functionality is exposed through the SmoothText class which handles the preparation of required resources and provides a consistent API across different backends.
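
Examples

A minimal end-to-end sketch combining the calls documented below; exact scores depend on the backend and the installed models:

>>> from smoothtext.smoothtext import SmoothText
>>> SmoothText.prepare(backend="nltk", languages=["en"])
>>> st = SmoothText(language="en", backend="nltk")
>>> st.count_words("Hello, world!")
2
>>> score = st.flesch_reading_ease("This is a simple sentence.")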

class smoothtext.smoothtext.SmoothText(language=None, backend=None)

Bases: object

Main class for text analysis and readability scoring.

The SmoothText class provides methods for:

  • Text tokenization and counting (sentences, words, syllables)

  • Readability scoring using various formulas

  • Reading time estimation

  • Language-specific text processing

  • Emoji handling

Supported backends:

  • NLTK

  • Stanza

Supported languages:

  • English

  • German

  • Russian

  • Turkish

Examples

>>> st = SmoothText(language="en", backend="nltk")
>>> score = st.flesch_reading_ease("This is a test sentence.")
>>> time = st.reading_time("Some text to analyze")
atesman(text, demojize=False)

Calculate the Ateşman readability score for Turkish text. The score typically ranges from 0 to 100, though scores outside this range are possible. Higher scores indicate easier readability.

Score ranges:

  • 90-100: Very easy

  • 70-89: Easy

  • 50-69: Medium difficulty

  • 30-49: Difficult

  • 1-29: Very difficult

Parameters:
  • text (str) – Input Turkish text to analyze

  • demojize (bool) – If True, convert emojis to text before scoring

Returns:

Ateşman readability score (higher = easier to read)

Return type:

float

Examples

>>> score = st.atesman("Basit bir Türkçe metin.")
property backend: Backend

Get the backend of the SmoothText instance.

Returns:

Backend of the SmoothText instance.
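
Examples

Assuming an instance created with the NLTK backend, as in the class example:

>>> st = SmoothText(language="en", backend="nltk")
>>> st.backend.value
'NLTK'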

bezirci_yilmaz(text, demojize=False)

Calculate Bezirci-Yılmaz readability score for Turkish text. The score takes into account sentence length and frequency of words with 3+ syllables. Higher scores indicate more difficult readability.

Parameters:
  • text (str) – Input Turkish text to analyze

  • demojize (bool) – If True, convert emojis to text before scoring

Returns:

Bezirci-Yılmaz readability score (higher = more difficult)

Return type:

float

Examples

>>> score = st.bezirci_yilmaz("Türkçe metin örneği.")
compute_readability(text, formula, demojize=False)

Calculate readability score using the specified formula.

Parameters:
  • text (str) – Input text to analyze

  • formula (ReadabilityFormula) – ReadabilityFormula to use for scoring

  • demojize (bool) – If True, convert emojis to text descriptions before scoring

Returns:

Readability score. Interpretation depends on the formula: for Flesch Reading Ease and Ateşman, higher scores indicate easier text, while for grade-level formulas such as Bezirci-Yılmaz and Wiener Sachtextformel, higher scores indicate more difficult text.

Return type:

float

Examples

>>> score = st.compute_readability(text, ReadabilityFormula.Flesch_Reading_Ease)
static count_consonants(text)

Count the number of consonants in the text after converting to ASCII.

Parameters:

text (str) – Input text to analyze

Returns:

Number of consonant characters found

Return type:

int

Examples

>>> count = st.count_consonants("hello")  # Returns: 3
count_sentences(text)

Count the number of sentences in the input text.

Parameters:

text (str) – Input text to analyze

Returns:

Number of sentences detected

Return type:

int

Examples

>>> count = st.count_sentences("This is one. This is two.")
>>> # Returns: 2
count_syllables(word, tokenize=True)

Count the number of syllables in a word or text.

Parameters:
  • word (str) – Input word or text to analyze

  • tokenize (bool) – If True, tokenize input text and count syllables for each word

Returns:

Total number of syllables found

Return type:

int

Examples

>>> count = st.count_syllables("hello")  # Returns: 2
>>> count = st.count_syllables("hello world", tokenize=True)  # Returns: 3
static count_vowels(text)

Count the number of vowels in the text after converting to ASCII.

Parameters:

text (str) – Input text to analyze

Returns:

Number of vowel characters found

Return type:

int

Examples

>>> count = st.count_vowels("hello")  # Returns: 2
count_words(text)

Count the number of words in the text. This function counts the number of alphanumeric tokens retrieved from the tokenize method.

Parameters:

text (str) – Input text to count words from

Returns:

Number of alphanumeric words found

Return type:

int

Examples

>>> count = st.count_words("Hello, world!")  # Returns: 2
demojize(text, delimiters=('(', ')'))

Convert emoji characters to their text descriptions.

Parameters:
  • text (str) – Input text containing emojis

  • delimiters (tuple[str, str]) – Tuple of (open, close) delimiters to wrap emoji descriptions

Returns:

Text with emojis replaced by their descriptions

Return type:

str

Examples

>>> text = st.demojize("I love 🐈")
>>> # Returns: "I love (cat)"
flesch_reading_ease(text, demojize=False)

Calculate the Flesch Reading Ease score for the text. The score typically ranges from 0 to 100, though scores outside this range are possible. Higher scores indicate easier readability.

Score ranges:

  • 90-100: Very easy

  • 80-89: Easy

  • 70-79: Fairly easy

  • 60-69: Standard

  • 50-59: Fairly difficult

  • 30-49: Difficult

  • 0-29: Very difficult

Parameters:
  • text (str) – Input text to analyze

  • demojize (bool) – If True, convert emojis to text before scoring

Returns:

Flesch Reading Ease score (higher = easier to read)

Return type:

float

Examples

>>> score = st.flesch_reading_ease("Simple text is easy to read.")
static is_ready(backend, language)

Check if the backend is ready for the specified language.

Parameters:
  • backend (Backend | str) – Backend to check.

  • language (Language | str) – Language to check.

Returns:

True if the backend is ready for the language, False otherwise.

Return type:

bool
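
Examples

The printed result depends on what is installed in the current environment:

>>> SmoothText.is_ready("nltk", "en")
True  # If NLTK and its English resources are installed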

property language: Language

Get the language of the SmoothText instance.

Returns:

Language of the SmoothText instance.
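
Examples

Assuming an English instance; alpha2() of any English variant is 'en':

>>> st = SmoothText(language="en", backend="nltk")
>>> st.language.alpha2()
'en'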

static prepare(backend=None, languages=None, skip_downloads=False, silence_downloaders=True, **backend_kwargs)

Prepare the required resources for text analysis.

This method downloads and initializes the necessary language models and data for the specified backend and languages. It must be called before using any text analysis functionality.

Parameters:
  • backend (Backend | str | None) – The backend engine to use (NLTK or Stanza)

  • languages (Language | list[Language] | str | list[str] | None) – Language(s) to prepare resources for

  • skip_downloads (bool) – If True, skip downloading models even if not present

  • silence_downloaders (bool) – If True, suppress download progress output

  • **backend_kwargs – Additional arguments passed to backend downloaders

Raises:

RuntimeError – If preparation fails or no valid backends are found

Return type:

None

Examples

>>> SmoothText.prepare(backend="nltk", languages=["en"])

reading_aloud_time(text, words_per_minute=183.0, round_up=True)

Calculate estimated reading aloud time using default speaking speed. Default speed is 183 WPM based on research averages.

Parameters:
  • text (str) – Input text to analyze

  • words_per_minute (float) – Optional custom speaking speed

  • round_up (bool) – If True, round result up to nearest second

Returns:

Estimated speaking time in seconds

Return type:

float
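
Examples

Assuming st is a prepared SmoothText instance, as in the other examples:

>>> time = st.reading_aloud_time("Some text to read aloud")
>>> time = st.reading_aloud_time("Some text", words_per_minute=150)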

reading_time(text, words_per_minute, round_up=True)

Calculate estimated reading time for the text.

Parameters:
  • text (str) – Input text to analyze

  • words_per_minute (float) – Reading speed in words per minute

  • round_up (bool) – If True, round result up to nearest second

Returns:

Estimated reading time in seconds

Return type:

float

Examples

>>> time = st.reading_time("Some text to read", words_per_minute=200)
remove_emojis(text)

Remove emoji characters from the text.

Parameters:

text (str) – Input text containing emojis

Returns:

Text with emojis removed

Return type:

str

Examples

>>> text = st.remove_emojis("I love 🐈")
>>> # Returns: "I love "
sentencize(text)

Split text into sentences using the configured backend tokenizer.

Parameters:

text (str) – Input text to split into sentences

Returns:

List of sentences found in the text

Return type:

list[str]

Examples

>>> sentences = st.sentencize("This is a test. Another sentence.")
>>> # Returns: ["This is a test.", "Another sentence."]
silent_reading_time(text, words_per_minute=238.0, round_up=True)

Calculate estimated silent reading time using default reading speed. Default speed is 238 WPM based on research averages.

Parameters:
  • text (str) – Input text to analyze

  • words_per_minute (float) – Optional custom reading speed

  • round_up (bool) – If True, round result up to nearest second

Returns:

Estimated silent reading time in seconds

Return type:

float
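
Examples

Assuming st is a prepared SmoothText instance, as in the other examples:

>>> time = st.silent_reading_time("Some text to read silently")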

syllabify(word, tokenize=False, sentencize=False)

Split words into syllables using language-specific rules.

This method can operate on single words, lists of words, or lists of sentences containing words. However, for simple counting, it is recommended to use the count_syllables method as it is more efficient and accurate. This method will keep punctuation marks as separate tokens.

Parameters:
  • word (str) – Input word or text to syllabify

  • tokenize (bool) – If True, split input into words first

  • sentencize (bool) – If True, split input into sentences first

Returns:

list[str]: List of syllables for a single word
list[list[str]]: List of words with their syllables if tokenize=True
list[list[list[str]]]: List of sentences containing words with syllables if sentencize=True

Return type:

list[str] | list[list[str]] | list[list[list[str]]]

Examples

>>> syllables = st.syllabify("hello")
>>> # Returns: ["hel", "lo"]
>>> word_syllables = st.syllabify("hello world", tokenize=True)
>>> # Returns: [["hel", "lo"], ["world"]]
tokenize(text, split_sentences=False)

Tokenize text into words using the configured backend tokenizer.

Parameters:
  • text (str) – Input text to tokenize

  • split_sentences (bool) – If True, return tokens grouped by sentences

Returns:

list[str]: List of tokens if split_sentences=False
list[list[str]]: List of sentences containing lists of tokens if split_sentences=True

Return type:

list[str] | list[list[str]]

Examples

>>> tokens = st.tokenize("Hello world!")
>>> # Returns: ["Hello", "world", "!"]
>>>
>>> sent_tokens = st.tokenize("Hi there. Bye now.", split_sentences=True)
>>> # Returns: [["Hi", "there", "."], ["Bye", "now", "."]]
wiener_sachtextformel(text, demojize=False, version=3)

Calculate Wiener Sachtextformel readability score for German text. The score takes into account sentence length and frequency of words with different lengths. Higher scores indicate more difficult text.

Score ranges:

  • 4-5: Very easy

  • 6-8: Easy

  • 9-11: Average

  • 12-14: Difficult

  • 15+: Very difficult

Parameters:
  • text (str) – Input German text to analyze

  • demojize (bool) – If True, convert emojis to text before scoring

  • version (int) – Wiener Sachtextformel version to use (1-4)

Returns:

Wiener Sachtextformel readability score (higher = more difficult)

Return type:

float

Examples

>>> score = st.wiener_sachtextformel("Deutsches Textbeispiel.")
>>> score = st.wiener_sachtextformel("Deutsches Textbeispiel.", version=3)
wiener_sachtextformel_1(text, demojize=False)

Calculate Wiener Sachtextformel readability score for German text. The score takes into account sentence length and frequency of words with different lengths. Higher scores indicate more difficult text.

Parameters:
  • text (str) – Input German text to analyze

  • demojize (bool) – If True, convert emojis to text before scoring

Returns:

Wiener Sachtextformel readability score (higher = more difficult)

Return type:

float

Examples

>>> score = st.wiener_sachtextformel_1("Deutsches Textbeispiel.")
wiener_sachtextformel_2(text, demojize=False)

Calculate Wiener Sachtextformel readability score for German text. The score takes into account sentence length and frequency of words with different lengths. Higher scores indicate more difficult text.

Parameters:
  • text (str) – Input German text to analyze

  • demojize (bool) – If True, convert emojis to text before scoring

Returns:

Wiener Sachtextformel readability score (higher = more difficult)

Return type:

float

Examples

>>> score = st.wiener_sachtextformel_2("Deutsches Textbeispiel.")
wiener_sachtextformel_3(text, demojize=False)

Calculate Wiener Sachtextformel readability score for German text. The score takes into account sentence length and frequency of words with different lengths. Higher scores indicate more difficult text.

Parameters:
  • text (str) – Input German text to analyze

  • demojize (bool) – If True, convert emojis to text before scoring

Returns:

Wiener Sachtextformel readability score (higher = more difficult)

Return type:

float

Examples

>>> score = st.wiener_sachtextformel_3("Deutsches Textbeispiel.")
wiener_sachtextformel_4(text, demojize=False)

Calculate Wiener Sachtextformel readability score for German text. The score takes into account sentence length and frequency of words with different lengths. Higher scores indicate more difficult text.

Parameters:
  • text (str) – Input German text to analyze

  • demojize (bool) – If True, convert emojis to text before scoring

Returns:

Wiener Sachtextformel readability score (higher = more difficult)

Return type:

float

Examples

>>> score = st.wiener_sachtextformel_4("Deutsches Textbeispiel.")
word_frequencies(text, lemmatize=True)

Count the frequency of words in the text.

Parameters:
  • text (str) – Input text to analyze

  • lemmatize (bool) – If True, lemmatize words before counting

Returns:

Dictionary of word frequencies

Return type:

dict[str, int]

Examples

>>> freqs = st.word_frequencies("Hello world! Hello again.")
>>> # Returns: {"hello": 2, "world": 1, "again": 1}

Module contents