smoothtext package

Submodules

smoothtext.backend module

Backend module for SmoothText text processing library.

This module provides functionality to manage and validate different NLP backends that can be used with SmoothText. It supports multiple NLP frameworks through a unified interface, allowing users to switch between different implementations based on their needs.

Examples

>>> from smoothtext.backend import Backend
>>> Backend.is_supported('nltk')
True
>>> supported_backends = Backend.list_supported()
class smoothtext.backend.Backend(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

Enum representing NLP backends supported by SmoothText.

This enum defines the available NLP processing backends and provides utility methods for backend validation and management. Each backend represents a different NLP framework that can be used for text processing tasks.

Available backends:
  • NLTK: Natural Language Toolkit, suitable for basic NLP tasks

  • Stanza: Stanford NLP’s Stanza, offering state-of-the-art accuracy

Examples

>>> backend = Backend.parse('nltk')
>>> if backend and Backend.is_supported(backend):
...     print(f"{backend.value} is available")
static is_supported(backend)

Verifies if a backend is installed and available for use.

This method checks both if the backend is valid and if its required dependencies are installed in the current environment.

Parameters:

backend (Union[Backend, str]) – The backend to check, either as a Backend enum value or a string identifier.

Returns:

True if the backend is valid and its dependencies are installed, False otherwise.

Return type:

bool

Examples

>>> Backend.is_supported('nltk')
True  # If NLTK is installed
>>> Backend.is_supported('invalid')
False
static list_supported()

Retrieves all backends that are currently available for use.

This method checks all defined backends and returns only those that have their dependencies properly installed in the current environment.

Returns:

A list of Backend enum values representing the backends that are ready to use.

Return type:

list[Backend]

Examples

>>> supported = Backend.list_supported()
>>> print([b.value for b in supported])
['NLTK']  # If only NLTK is installed
static parse(backend)

Converts a backend identifier to its corresponding Backend enum value.

Parameters:

backend (Union[Backend, str]) – The backend identifier to parse. Can be either a Backend enum value or a string matching a backend name (case-insensitive).

Returns:

The corresponding Backend enum value if valid, None if the input cannot be mapped to a valid backend.

Return type:

Optional[Backend]

Examples

>>> Backend.parse('nltk')
<Backend.NLTK>
>>> Backend.parse('invalid')
None
static values()

Returns a list of all available backend options.

This method provides access to all defined backends, regardless of whether they are currently installed and supported in the environment.

Returns:

A list containing all defined Backend enum values.

Return type:

list[Backend]

Examples

>>> backends = Backend.values()
>>> print([b.value for b in backends])
['NLTK', 'Stanza']

smoothtext.language module

Language support module for SmoothText.

This module provides language identification and parsing capabilities through the Language enum. It supports ISO 639-1 (two-letter) and ISO 639-2 (three-letter) language codes, with optional country variants using either hyphen or underscore separators (e.g., ‘en-US’ or ‘en_US’).

Examples

>>> lang = Language.parse("en-US")
>>> print(lang)
English (United States)
>>> print(lang.family())
English
class smoothtext.language.Language(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

Enum representing languages supported by SmoothText.

This enum provides language identification capabilities with support for both ISO 639-1 (two-letter) and ISO 639-2 (three-letter) language codes, with optional country variants. Languages are organized in families, where regional variants (e.g., English_US) belong to a parent language (e.g., English).

# Base Languages
English

Generic English language support

German

Generic German language support

Russian

Generic Russian language support

Turkish

Generic Turkish language support

# English Variants
English_GB

British English variant

English_US

American English variant (default for ‘en’)

# German Variants
German_DE

German (Germany) variant (default for ‘de’)

# Russian Variants
Russian_RU

Russian (Russia) variant (default for ‘ru’)

# Turkish Variants
Turkish_TR

Turkish (Türkiye) variant (default for ‘tr’)

Examples

>>> lang = Language.English_US
>>> print(lang.alpha2())  # Returns 'en'
>>> print(lang.family())  # Returns Language.English
alpha2()

Get the ISO 639-1 two-letter code of the language.

Returns:

Two-letter language code (e.g., ‘en’ for English, ‘tr’ for Turkish)

Return type:

str
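
Examples

The outputs below follow the codes documented above:

>>> Language.English.alpha2()
'en'
>>> Language.Turkish.alpha2()
'tr'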

alpha3()

Get the ISO 639-2 three-letter code of the language.

Returns:

Three-letter language code (e.g., ‘eng’ for English, ‘tur’ for Turkish)

Return type:

str
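
Examples

The outputs below follow the codes documented above:

>>> Language.English.alpha3()
'eng'
>>> Language.Turkish.alpha3()
'tur'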

static families()

Get a list of all base language families.

Returns:

List containing all base Language enum values

Return type:

list[Language]
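
Examples

A sketch assuming the base languages listed above; the ordering shown assumes the enum definition order:

>>> Language.families()
[Language.English, Language.German, Language.Russian, Language.Turkish]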

family()

Get the family (base) language of the current language variant.

The family language represents the base language without region/country specifics. Regional variants return their base language, while base languages return themselves.

Returns:

Base language enum value

Return type:

Language

Examples

>>> Language.English_US.family()  # Returns Language.English
>>> Language.English.family()     # Returns Language.English
static parse(language)

Parse a language identifier into a Language enum value.

Parameters:

language (Language | str) – Language identifier to parse. Can be:

  • Language enum value

  • Full name (e.g., ‘English’)

  • ISO 639-1 code (e.g., ‘en’)

  • ISO 639-2 code (e.g., ‘eng’)

Returns:

The corresponding Language enum value if valid, or None if the input cannot be parsed into a supported language.

Return type:

Optional[Language]

Examples

>>> Language.parse('en')
Language.English
>>> Language.parse('invalid')
None
static parse_multiple(languages)

Parse multiple language identifiers into a list of Language enum values. Note: The order of returned languages is not guaranteed.

Parameters:

languages (Language | str | list[Language | str]) – One or more language identifiers. Can be:

  • Single Language enum value

  • Single language string

  • List of Language enum values and/or strings

  • Comma-separated string of language identifiers

Returns:

List of unique, valid Language enum values.

The order of languages in the list is not guaranteed.

Return type:

list[Language]

Examples

>>> # Order may vary in the results
>>> Language.parse_multiple('en,tr')
[Language.English_GB, Language.Turkish_TR]  # or [Language.Turkish_TR, Language.English_GB]
>>> set(Language.parse_multiple(['en', 'invalid', 'tr']))  # Use set for order-independent comparison
{Language.English_GB, Language.Turkish_TR}
readability_formulas()

Get a list of supported readability formulas for the current language.

Different languages have different readability formulas that are specifically designed and validated for their linguistic characteristics.

Returns:

List of readability formulas supported for this language

Return type:

list[ReadabilityFormula]

Examples

>>> Language.English.readability_formulas()
[ReadabilityFormula.Automated_Readability_Index,
 ReadabilityFormula.Flesch_Reading_Ease,
 ReadabilityFormula.Flesch_Kincaid_Grade,
 ReadabilityFormula.Flesch_Kincaid_Grade_Simplified,
 ReadabilityFormula.Gunning_Fog_Index]
>>> Language.Turkish.readability_formulas()
[ReadabilityFormula.Atesman, ReadabilityFormula.Bezirci_Yilmaz]
static values()

Get a list of all supported languages.

Returns:

List containing all supported Language enum values

Return type:

list[Language]
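
Examples

A membership check that avoids assuming the order of the returned list:

>>> langs = Language.values()
>>> Language.English in langs and Language.English_US in langs
True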

variants()

Get a list of all language variants for the current language.

Returns:

List of all language variants, including the current language

Return type:

list[Language]

Examples

>>> Language.English_GB.variants()
[Language.English_GB, Language.English_US]
>>> Language.English.variants()
[Language.English_GB, Language.English_US]

smoothtext.readability module

Readability formulas module for SmoothText.

This module provides an enumeration of various readability formulas that can be used to assess text complexity in different languages. Each formula is designed for specific languages and provides different metrics for text readability.

Examples

>>> from smoothtext import ReadabilityFormula
>>> formula = ReadabilityFormula.Flesch_Reading_Ease
>>> print(formula.value)
Flesch Reading Ease
>>> print(formula.supports(Language.English))
True
>>> print(formula.supports(Language.Turkish))
False
class smoothtext.readability.ReadabilityFormula(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

Enumeration of readability formulas supported by SmoothText.

This enum defines various readability formulas that can be used to assess text complexity in different languages. Each formula is designed for specific languages and provides different metrics for text readability.

# English Formulas
Automated_Readability_Index

English readability formula developed by the US Army.

Flesch_Reading_Ease

Classic English readability formula. Scores range from 0 (hardest) to 100 (easiest).

Flesch_Kincaid_Grade

English grade-level assessment formula. Indicates US grade level required to understand the text.

Flesch_Kincaid_Grade_Simplified

Simplified version of Flesch-Kincaid Grade. Uses reduced parameters for grade-level assessment.

Gunning_Fog_Index

English readability formula developed by Robert Gunning.

# German Formulas
Wiener_Sachtextformel

Alias for Wiener_Sachtextformel_3 (general purpose formula).

Wiener_Sachtextformel_1

First variant of Wiener Sachtextformel. Optimized for narrative texts.

Wiener_Sachtextformel_2

Second variant of Wiener Sachtextformel. Optimized for scientific texts.

Wiener_Sachtextformel_3

Third variant of Wiener Sachtextformel. General purpose formula.

Wiener_Sachtextformel_4

Fourth variant of Wiener Sachtextformel. Alternative general purpose formula.

# Russian Formulas
Matskovskiy

Russian readability formula developed by Matskovskiy. Provides grade-level assessment for Russian texts.

# Turkish Formulas
Atesman

Turkish readability formula developed by Ateşman. Scores range from 0 (hardest) to 100 (easiest).

Bezirci_Yilmaz

Turkish readability formula by Bezirci and Yılmaz. Provides grade-level assessment for Turkish texts.

supports(language)

Determines if the formula supports the specified language.

Parameters:

language (Union[Language, str, None]) – The language to check support for. Can be either a Language enum value or a string identifier.

Returns:

  • True if the formula supports the specified language.

  • False if the formula does not support the language.

Return type:

bool
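
Examples

Based on the language support documented above (Ateşman is a Turkish formula):

>>> ReadabilityFormula.Atesman.supports(Language.Turkish)
True
>>> ReadabilityFormula.Atesman.supports('en')
False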

smoothtext.smoothtext module

SmoothText - A Python library for natural language text analysis and readability scoring.

This module provides functionality for:

  • Text tokenization (sentences and words)

  • Syllable counting and syllabification

  • Multiple readability formula calculations (Flesch, Ateşman, etc.)

  • Reading time estimation

  • Support for multiple languages and backend engines

All functionality is exposed through the SmoothText class which handles the preparation of required resources and provides a consistent API across different backends.
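
Examples

A minimal end-to-end sketch combining the calls documented below; exact scores depend on the backend and the installed models:

>>> from smoothtext.smoothtext import SmoothText
>>> SmoothText.prepare(backend="nltk", languages=["en"])
>>> st = SmoothText(language="en", backend="nltk")
>>> st.count_words("Hello, world!")
2
>>> score = st.flesch_reading_ease("This is a simple sentence.")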

class smoothtext.smoothtext.SmoothText(language=None, backend=None)

Bases: object

Main class for text analysis and readability scoring.

The SmoothText class provides methods for:

  • Text tokenization and counting (sentences, words, syllables)

  • Readability scoring using various formulas

  • Reading time estimation

  • Language-specific text processing

  • Emoji handling

Supported backends:

  • NLTK

  • Stanza

Supported languages:

  • English

  • German

  • Russian

  • Turkish

Examples

>>> st = SmoothText(language="en", backend="nltk")
>>> score = st.flesch_reading_ease("This is a test sentence.")
>>> time = st.reading_time("Some text to analyze")
atesman(text, demojize=False)

Calculate the Ateşman readability score for Turkish text. The score typically ranges from 0 to 100, though scores outside this range are possible. Higher scores indicate easier readability.

Score ranges:

  • 90-100: Very easy

  • 70-89: Easy

  • 50-69: Medium difficulty

  • 30-49: Difficult

  • 1-29: Very difficult

Parameters:
  • text (str) – Input Turkish text to analyze

  • demojize (bool) – If True, convert emojis to text before scoring

Returns:

Ateşman readability score (higher = easier to read)

Return type:

float

Examples

>>> score = st.atesman("Basit bir Türkçe metin.")
property backend: Backend

Get the backend of the SmoothText instance.

Returns:

Backend of the SmoothText instance.
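
Examples

Assuming an instance created with the NLTK backend, as in the class example:

>>> st = SmoothText(language="en", backend="nltk")
>>> st.backend.value
'NLTK'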

bezirci_yilmaz(text, demojize=False)

Calculate Bezirci-Yılmaz readability score for Turkish text. The score takes into account sentence length and frequency of words with 3+ syllables. Higher scores indicate more difficult readability.

Parameters:
  • text (str) – Input Turkish text to analyze

  • demojize (bool) – If True, convert emojis to text before scoring

Returns:

Bezirci-Yılmaz readability score (higher = more difficult)

Return type:

float

Examples

>>> score = st.bezirci_yilmaz("Türkçe metin örneği.")
compute_readability(text, formula, demojize=False)

Calculate readability score using the specified formula.

Parameters:
  • text (str) – Input text to analyze

  • formula (ReadabilityFormula) – ReadabilityFormula to use for scoring

  • demojize (bool) – If True, convert emojis to text descriptions before scoring

Returns:

Readability score. Interpretation depends on the formula: for Flesch Reading Ease and Ateşman, higher scores indicate easier text, while for grade-level formulas such as Bezirci-Yılmaz and Wiener Sachtextformel, higher scores indicate more difficult text.

Return type:

float

Examples

>>> score = st.compute_readability(text, ReadabilityFormula.Flesch_Reading_Ease)
static count_consonants(text)

Count the number of consonants in the text after converting to ASCII.

Parameters:

text (str) – Input text to analyze

Returns:

Number of consonant characters found

Return type:

int

Examples

>>> count = st.count_consonants("hello")  # Returns: 3
count_sentences(text)

Count the number of sentences in the input text.

Parameters:

text (str) – Input text to analyze

Returns:

Number of sentences detected

Return type:

int

Examples

>>> count = st.count_sentences("This is one. This is two.")
>>> # Returns: 2
count_syllables(word, tokenize=True)

Count the number of syllables in a word or text.

Parameters:
  • word (str) – Input word or text to analyze

  • tokenize (bool) – If True, tokenize input text and count syllables for each word

Returns:

Total number of syllables found

Return type:

int

Examples

>>> count = st.count_syllables("hello")  # Returns: 2
>>> count = st.count_syllables("hello world", tokenize=True)  # Returns: 3
static count_vowels(text)

Count the number of vowels in the text after converting to ASCII.

Parameters:

text (str) – Input text to analyze

Returns:

Number of vowel characters found

Return type:

int

Examples

>>> count = st.count_vowels("hello")  # Returns: 2
count_words(text)

Count the number of words in the text. This function counts the number of alphanumeric tokens retrieved from the tokenize method.

Parameters:

text (str) – Input text to count words from

Returns:

Number of alphanumeric words found

Return type:

int

Examples

>>> count = st.count_words("Hello, world!")  # Returns: 2
demojize(text, delimiters=('(', ')'))

Convert emoji characters to their text descriptions.

Parameters:
  • text (str) – Input text containing emojis

  • delimiters (tuple[str, str]) – Tuple of (open, close) delimiters to wrap emoji descriptions

Returns:

Text with emojis replaced by their descriptions

Return type:

str

Examples

>>> text = st.demojize("I love 🐈")
>>> # Returns: "I love (cat)"
flesch_reading_ease(text, demojize=False)

Calculate the Flesch Reading Ease score for the text. The score typically ranges from 0 to 100, though scores outside this range are possible. Higher scores indicate easier readability.

Score ranges:

  • 90-100: Very easy

  • 80-89: Easy

  • 70-79: Fairly easy

  • 60-69: Standard

  • 50-59: Fairly difficult

  • 30-49: Difficult

  • 0-29: Very difficult

Parameters:
  • text (str) – Input text to analyze

  • demojize (bool) – If True, convert emojis to text before scoring

Returns:

Flesch Reading Ease score (higher = easier to read)

Return type:

float

Examples

>>> score = st.flesch_reading_ease("Simple text is easy to read.")
static is_ready(backend, language)

Check if the backend is ready for the specified language.

Parameters:
  • backend (Backend | str) – Backend to check.

  • language (Language | str) – Language to check.

Returns:

True if the backend is ready for the language, False otherwise.

Return type:

bool
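
Examples

The printed result depends on what is installed in the current environment:

>>> SmoothText.is_ready("nltk", "en")
True  # If NLTK and its English resources are installed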

property language: Language

Get the language of the SmoothText instance.

Returns:

Language of the SmoothText instance.
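
Examples

Assuming an English instance; alpha2() of any English variant is 'en':

>>> st = SmoothText(language="en", backend="nltk")
>>> st.language.alpha2()
'en'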

static prepare(backend=None, languages=None, skip_downloads=False, silence_downloaders=True, **backend_kwargs)

Prepare the required resources for text analysis.

This method downloads and initializes the necessary language models and data for the specified backend and languages. It must be called before using any text analysis functionality.

Parameters:
  • backend (Backend | str | None) – The backend engine to use (NLTK or Stanza)

  • languages (Language | list[Language] | str | list[str] | None) – Language(s) to prepare resources for

  • skip_downloads (bool) – If True, skip downloading models even if not present

  • silence_downloaders (bool) – If True, suppress download progress output

  • **backend_kwargs – Additional arguments passed to backend downloaders

Raises:

RuntimeError – If preparation fails or no valid backends are found

Return type:

None

Examples

>>> SmoothText.prepare(backend="nltk", languages=["en"])

reading_aloud_time(text, words_per_minute=183.0, round_up=True)

Calculate estimated reading aloud time using default speaking speed. Default speed is 183 WPM based on research averages.

Parameters:
  • text (str) – Input text to analyze

  • words_per_minute (float) – Optional custom speaking speed

  • round_up (bool) – If True, round result up to nearest second

Returns:

Estimated speaking time in seconds

Return type:

float
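
Examples

Assuming st is a prepared SmoothText instance, as in the other examples:

>>> time = st.reading_aloud_time("Some text to read aloud")
>>> time = st.reading_aloud_time("Some text", words_per_minute=150)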

reading_time(text, words_per_minute, round_up=True)

Calculate estimated reading time for the text.

Parameters:
  • text (str) – Input text to analyze

  • words_per_minute (float) – Reading speed in words per minute

  • round_up (bool) – If True, round result up to nearest second

Returns:

Estimated reading time in seconds

Return type:

float

Examples

>>> time = st.reading_time("Some text to read", words_per_minute=200)
remove_emojis(text)

Remove emoji characters from the text.

Parameters:

text (str) – Input text containing emojis

Returns:

Text with emojis removed

Return type:

str

Examples

>>> text = st.remove_emojis("I love 🐈")
>>> # Returns: "I love "
sentencize(text)

Split text into sentences using the configured backend tokenizer.

Parameters:

text (str) – Input text to split into sentences

Returns:

List of sentences found in the text

Return type:

list[str]

Examples

>>> sentences = st.sentencize("This is a test. Another sentence.")
>>> # Returns: ["This is a test.", "Another sentence."]
silent_reading_time(text, words_per_minute=238.0, round_up=True)

Calculate estimated silent reading time using default reading speed. Default speed is 238 WPM based on research averages.

Parameters:
  • text (str) – Input text to analyze

  • words_per_minute (float) – Optional custom reading speed

  • round_up (bool) – If True, round result up to nearest second

Returns:

Estimated silent reading time in seconds

Return type:

float
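
Examples

Assuming st is a prepared SmoothText instance, as in the other examples:

>>> time = st.silent_reading_time("Some text to read silently")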

syllabify(word, tokenize=False, sentencize=False)

Split words into syllables using language-specific rules.

This method can operate on single words, lists of words, or lists of sentences containing words. However, for simple counting, it is recommended to use the count_syllables method as it is more efficient and accurate. This method will keep punctuation marks as separate tokens.

Parameters:
  • word (str) – Input word or text to syllabify

  • tokenize (bool) – If True, split input into words first

  • sentencize (bool) – If True, split input into sentences first

Returns:

list[str]: List of syllables for a single word
list[list[str]]: List of words with their syllables if tokenize=True
list[list[list[str]]]: List of sentences containing words with syllables if sentencize=True

Return type:

list[str] | list[list[str]] | list[list[list[str]]]

Examples

>>> syllables = st.syllabify("hello")
>>> # Returns: ["hel", "lo"]
>>> word_syllables = st.syllabify("hello world", tokenize=True)
>>> # Returns: [["hel", "lo"], ["world"]]
tokenize(text, split_sentences=False)

Tokenize text into words using the configured backend tokenizer.

Parameters:
  • text (str) – Input text to tokenize

  • split_sentences (bool) – If True, return tokens grouped by sentences

Returns:

list[str]: List of tokens if split_sentences=False
list[list[str]]: List of sentences containing lists of tokens if split_sentences=True

Return type:

list[str] | list[list[str]]

Examples

>>> tokens = st.tokenize("Hello world!")
>>> # Returns: ["Hello", "world", "!"]
>>>
>>> sent_tokens = st.tokenize("Hi there. Bye now.", split_sentences=True)
>>> # Returns: [["Hi", "there", "."], ["Bye", "now", "."]]
wiener_sachtextformel(text, demojize=False, version=3)

Calculate Wiener Sachtextformel readability score for German text. The score takes into account sentence length and frequency of words with different lengths. Higher scores indicate more difficult text.

Score ranges:

  • 4-5: Very easy

  • 6-8: Easy

  • 9-11: Average

  • 12-14: Difficult

  • 15+: Very difficult

Parameters:
  • text (str) – Input German text to analyze

  • demojize (bool) – If True, convert emojis to text before scoring

  • version (int) – Wiener Sachtextformel version to use (1-4)

Returns:

Wiener Sachtextformel readability score (higher = more difficult)

Return type:

float

Examples

>>> score = st.wiener_sachtextformel("Deutsches Textbeispiel.")
>>> score = st.wiener_sachtextformel("Deutsches Textbeispiel.", version=3)
wiener_sachtextformel_1(text, demojize=False)

Calculate Wiener Sachtextformel readability score for German text. The score takes into account sentence length and frequency of words with different lengths. Higher scores indicate more difficult text.

Parameters:
  • text (str) – Input German text to analyze

  • demojize (bool) – If True, convert emojis to text before scoring

Returns:

Wiener Sachtextformel readability score (higher = more difficult)

Return type:

float

Examples

>>> score = st.wiener_sachtextformel_1("Deutsches Textbeispiel.")
wiener_sachtextformel_2(text, demojize=False)

Calculate Wiener Sachtextformel readability score for German text. The score takes into account sentence length and frequency of words with different lengths. Higher scores indicate more difficult text.

Parameters:
  • text (str) – Input German text to analyze

  • demojize (bool) – If True, convert emojis to text before scoring

Returns:

Wiener Sachtextformel readability score (higher = more difficult)

Return type:

float

Examples

>>> score = st.wiener_sachtextformel_2("Deutsches Textbeispiel.")
wiener_sachtextformel_3(text, demojize=False)

Calculate Wiener Sachtextformel readability score for German text. The score takes into account sentence length and frequency of words with different lengths. Higher scores indicate more difficult text.

Parameters:
  • text (str) – Input German text to analyze

  • demojize (bool) – If True, convert emojis to text before scoring

Returns:

Wiener Sachtextformel readability score (higher = more difficult)

Return type:

float

Examples

>>> score = st.wiener_sachtextformel_3("Deutsches Textbeispiel.")
wiener_sachtextformel_4(text, demojize=False)

Calculate Wiener Sachtextformel readability score for German text. The score takes into account sentence length and frequency of words with different lengths. Higher scores indicate more difficult text.

Parameters:
  • text (str) – Input German text to analyze

  • demojize (bool) – If True, convert emojis to text before scoring

Returns:

Wiener Sachtextformel readability score (higher = more difficult)

Return type:

float

Examples

>>> score = st.wiener_sachtextformel_4("Deutsches Textbeispiel.")
word_frequencies(text, lemmatize=True)

Count the frequency of words in the text.

Parameters:
  • text (str) – Input text to analyze

  • lemmatize (bool) – If True, lemmatize words before counting

Returns:

Dictionary of word frequencies

Return type:

dict[str, int]

Examples

>>> freqs = st.word_frequencies("Hello world! Hello again.")
>>> # Returns: {"hello": 2, "world": 1, "again": 1}

Module contents