smoothtext package
Submodules
smoothtext.backend module
Backend module for SmoothText text processing library.
This module provides functionality to manage and validate different NLP backends that can be used with SmoothText. It supports multiple NLP frameworks through a unified interface, allowing users to switch between different implementations based on their needs.
Examples
>>> from smoothtext.backend import Backend
>>> Backend.is_supported('nltk')
True
>>> supported_backends = Backend.list_supported()
- class smoothtext.backend.Backend(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
Bases:
Enum
Enum representing NLP backends supported by SmoothText.
This enum defines the available NLP processing backends and provides utility methods for backend validation and management. Each backend represents a different NLP framework that can be used for text processing tasks.
- Available backends:
NLTK: Natural Language Toolkit, suitable for basic NLP tasks
Stanza: Stanford NLP’s Stanza, offering state-of-the-art accuracy
Examples
>>> backend = Backend.parse('nltk')
>>> if backend and Backend.is_supported(backend):
...     print(f"{backend.value} is available")
- static is_supported(backend)
Verifies if a backend is installed and available for use.
This method checks both if the backend is valid and if its required dependencies are installed in the current environment.
- Parameters:
backend (Union[Backend, str]) – The backend to check, either as a Backend enum value or a string identifier.
- Returns:
True if the backend is valid and its dependencies are installed, False otherwise.
- Return type:
bool
Examples
>>> Backend.is_supported('nltk')
True  # If NLTK is installed
>>> Backend.is_supported('invalid')
False
- static list_supported()
Retrieves all backends that are currently available for use.
This method checks all defined backends and returns only those that have their dependencies properly installed in the current environment.
- Returns:
A list of Backend enum values representing the backends that are ready to use.
- Return type:
list[Backend]
Examples
>>> supported = Backend.list_supported()
>>> print([b.value for b in supported])
['NLTK']  # If only NLTK is installed
- static parse(backend)
Converts a backend identifier to its corresponding Backend enum value.
- Parameters:
backend (Union[Backend, str]) – The backend identifier to parse. Can be either a Backend enum value or a string matching a backend name (case-insensitive).
- Returns:
The corresponding Backend enum value if valid, None if the input cannot be mapped to a valid backend.
- Return type:
Optional[Backend]
Examples
>>> Backend.parse('nltk')
<Backend.NLTK>
>>> print(Backend.parse('invalid'))
None
- static values()
Returns a list of all available backend options.
This method provides access to all defined backends, regardless of whether they are currently installed and supported in the environment.
- Returns:
A list containing all defined Backend enum values.
- Return type:
list[Backend]
Examples
>>> backends = Backend.values()
>>> print([b.value for b in backends])
['NLTK', 'Stanza']
smoothtext.language module
Language support module for SmoothText.
This module provides language identification and parsing capabilities through the Language enum. It supports ISO 639-1 (two-letter) and ISO 639-2 (three-letter) language codes, with optional country variants using either hyphen or underscore separators (e.g., ‘en-US’ or ‘en_US’).
Examples
>>> lang = Language.parse("en-US")
>>> print(lang)
English (United States)
>>> print(lang.family())
English
- class smoothtext.language.Language(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
Bases:
Enum
Enum representing languages supported by SmoothText.
This enum provides language identification capabilities with support for both ISO 639-1 (two-letter) and ISO 639-2 (three-letter) language codes, with optional country variants. Languages are organized in families, where regional variants (e.g., English_US) belong to a parent language (e.g., English).
- # Base Languages
- English
Generic English language support
- German
Generic German language support
- Russian
Generic Russian language support
- Turkish
Generic Turkish language support
- # English Variants
- English_GB
British English variant
- English_US
American English variant (default for ‘en’)
- # German Variants
- German_DE
German (Germany) variant (default for ‘de’)
- # Russian Variants
- Russian_RU
Russian (Russia) variant (default for ‘ru’)
- # Turkish Variants
- Turkish_TR
Turkish (Türkiye) variant (default for ‘tr’)
Examples
>>> lang = Language.English_US
>>> print(lang.alpha2())  # Returns 'en'
>>> print(lang.family())  # Returns Language.English
- alpha2()
Get the ISO 639-1 two-letter code of the language.
- Returns:
Two-letter language code (e.g., ‘en’ for English, ‘tr’ for Turkish)
- Return type:
str
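Examples
Illustrative usage, matching the codes documented above:
>>> Language.English.alpha2()
'en'
>>> Language.Turkish.alpha2()
'tr'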
- alpha3()
Get the ISO 639-2 three-letter code of the language.
- Returns:
Three-letter language code (e.g., ‘eng’ for English, ‘tur’ for Turkish)
- Return type:
str
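Examples
Illustrative usage, matching the codes documented above:
>>> Language.English.alpha3()
'eng'
>>> Language.Turkish.alpha3()
'tur'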
- static families()
Get a list of all base language families.
- Returns:
List containing all base Language enum values
- Return type:
list[Language]
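Examples
A small sketch; the element order shown here is illustrative and not guaranteed:
>>> Language.families()
[Language.English, Language.German, Language.Russian, Language.Turkish]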
- family()
Get the family (base) language of the current language variant.
The family language represents the base language without region/country specifics. Regional variants return their base language, while base languages return themselves.
- Returns:
Base language enum value
- Return type:
Language
Examples
>>> Language.English_US.family()  # Returns Language.English
>>> Language.English.family()  # Returns Language.English
- static parse(language)
Parse a language identifier into a Language enum value.
- Parameters:
language (Language | str) – Language identifier to parse. Can be a Language enum value, a full name (e.g., 'English'), an ISO 639-1 code (e.g., 'en'), or an ISO 639-2 code (e.g., 'eng').
- Returns:
The corresponding Language enum value if valid, None if the input cannot be parsed into a supported language.
- Return type:
Optional[Language]
Examples
>>> Language.parse('en')
Language.English
>>> print(Language.parse('invalid'))
None
- static parse_multiple(languages)
Parse multiple language identifiers into a list of Language enum values. Note: The order of returned languages is not guaranteed.
- Parameters:
languages (Language | str | list[Language | str]) – One or more language identifiers. Can be a single Language enum value, a single language string, a list of Language enum values and/or strings, or a comma-separated string of language identifiers.
- Returns:
List of unique, valid Language enum values. The order of languages in the list is not guaranteed.
- Return type:
list[Language]
Examples
>>> # Order may vary in the results
>>> Language.parse_multiple('en,tr')
[Language.English_GB, Language.Turkish_TR]  # or [Language.Turkish_TR, Language.English_GB]
>>> set(Language.parse_multiple(['en', 'invalid', 'tr']))  # Use set for order-independent comparison
{Language.English_GB, Language.Turkish_TR}
- readability_formulas()
Get a list of supported readability formulas for the current language.
Different languages have different readability formulas that are specifically designed and validated for their linguistic characteristics.
- Returns:
List of readability formulas supported for this language
- Return type:
list[ReadabilityFormula]
Examples
>>> Language.English.readability_formulas()
[ReadabilityFormula.Automated_Readability_Index, ReadabilityFormula.Flesch_Reading_Ease, ReadabilityFormula.Flesch_Kincaid_Grade, ReadabilityFormula.Flesch_Kincaid_Grade_Simplified, ReadabilityFormula.Gunning_Fog_Index]
>>> Language.Turkish.readability_formulas()
[ReadabilityFormula.Atesman, ReadabilityFormula.Bezirci_Yilmaz]
- static values()
Get a list of all supported languages.
- Returns:
List containing all supported Language enum values
- Return type:
list[Language]
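Examples
A small sketch; membership checks avoid depending on element order:
>>> langs = Language.values()
>>> Language.English in langs and Language.Turkish_TR in langs
True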
- variants()
Get a list of all language variants for the current language.
- Returns:
List of all language variants, including the current language
- Return type:
list[Language]
Examples
>>> Language.English_GB.variants()
[Language.English_GB, Language.English_US]
>>> Language.English.variants()
[Language.English_GB, Language.English_US]
smoothtext.readability module
Readability formulas module for SmoothText.
This module provides an enumeration of various readability formulas that can be used to assess text complexity in different languages. Each formula is designed for specific languages and provides different metrics for text readability.
Examples
>>> from smoothtext import ReadabilityFormula
>>> formula = ReadabilityFormula.Flesch_Reading_Ease
>>> print(formula.value)
Flesch Reading Ease
>>> print(formula.supports(Language.English))
True
>>> print(formula.supports(Language.Turkish))
False
- class smoothtext.readability.ReadabilityFormula(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
Bases:
Enum
Enumeration of readability formulas supported by SmoothText.
This enum defines various readability formulas that can be used to assess text complexity in different languages. Each formula is designed for specific languages and provides different metrics for text readability.
- # English Formulas
- Automated_Readability_Index
English readability formula developed by the US Army.
- Flesch_Reading_Ease
Classic English readability formula. Scores range from 0 (hardest) to 100 (easiest).
- Flesch_Kincaid_Grade
English grade-level assessment formula. Indicates US grade level required to understand the text.
- Flesch_Kincaid_Grade_Simplified
Simplified version of Flesch-Kincaid Grade. Uses reduced parameters for grade-level assessment.
- Gunning_Fog_Index
English readability formula developed by Robert Gunning.
- # German Formulas
- Wiener_Sachtextformel
Alias for Wiener_Sachtextformel_3 (general purpose formula).
- Wiener_Sachtextformel_1
First variant of Wiener Sachtextformel. Optimized for narrative texts.
- Wiener_Sachtextformel_2
Second variant of Wiener Sachtextformel. Optimized for scientific texts.
- Wiener_Sachtextformel_3
Third variant of Wiener Sachtextformel. General purpose formula.
- Wiener_Sachtextformel_4
Fourth variant of Wiener Sachtextformel. Alternative general purpose formula.
- # Russian Formulas
- Matskovskiy
Russian readability formula developed by Matskovskiy. Provides grade-level assessment for Russian texts.
- # Turkish Formulas
- Atesman
Turkish readability formula developed by Ateşman. Scores range from 0 (hardest) to 100 (easiest).
- Bezirci_Yilmaz
Turkish readability formula by Bezirci and Yılmaz. Provides grade-level assessment for Turkish texts.
- supports(language)
Determines if the formula supports the specified language.
- Parameters:
language (Union[Language, str, None]) – The language to check support for. Can be either a Language enum value or a string identifier.
- Returns:
True if the formula supports the specified language, False otherwise.
- Return type:
bool
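Examples
A small sketch mirroring the module-level examples above:
>>> ReadabilityFormula.Atesman.supports(Language.Turkish)
True
>>> ReadabilityFormula.Atesman.supports(Language.English)
False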
smoothtext.smoothtext module
SmoothText - A Python library for natural language text analysis and readability scoring.
This module provides functionality for:
- Text tokenization (sentences and words)
- Syllable counting and syllabification
- Multiple readability formula calculations (Flesch, Ateşman, etc.)
- Reading time estimation
- Support for multiple languages and backend engines
All functionality is exposed through the SmoothText class which handles the preparation of required resources and provides a consistent API across different backends.
- class smoothtext.smoothtext.SmoothText(language=None, backend=None)
Bases:
object
Main class for text analysis and readability scoring.
The SmoothText class provides methods for:
- Text tokenization and counting (sentences, words, syllables)
- Readability scoring using various formulas
- Reading time estimation
- Language-specific text processing
- Emoji handling
Supported backends: NLTK, Stanza
Supported languages: English, German, Turkish
Examples
>>> st = SmoothText(language="en", backend="nltk")
>>> score = st.flesch_reading_ease("This is a test sentence.")
>>> time = st.reading_time("Some text to analyze", words_per_minute=200)
- atesman(text, demojize=False)
Calculate the Ateşman readability score for Turkish text. The score typically ranges from 0 to 100, though scores outside this range are possible. Higher scores indicate easier readability.
Score ranges:
90-100: Very easy
70-89: Easy
50-69: Medium difficulty
30-49: Difficult
1-29: Very difficult
- Parameters:
text (str) – Input Turkish text to analyze
demojize (bool) – If True, convert emojis to text before scoring
- Returns:
Ateşman readability score (higher = easier to read)
- Return type:
float
Examples
>>> score = st.atesman("Basit bir Türkçe metin.")
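For reference, Ateşman's published formula, which this method is expected to implement, is: score = 198.825 − 40.175 × (syllables / words) − 2.610 × (words / sentences).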
- property backend: Backend
Get the backend of the SmoothText instance.
- Returns:
Backend of the SmoothText instance.
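Examples
A small illustrative sketch, assuming the instance was created with backend="nltk":
>>> st = SmoothText(language="en", backend="nltk")
>>> st.backend.value
'NLTK'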
- bezirci_yilmaz(text, demojize=False)
Calculate the Bezirci-Yılmaz readability score for Turkish text. The score takes into account sentence length and the frequency of words with 3+ syllables. Higher scores indicate more difficult text.
- Parameters:
text (str) – Input Turkish text to analyze
demojize (bool) – If True, convert emojis to text before scoring
- Returns:
Bezirci-Yılmaz readability score (higher = more difficult)
- Return type:
float
Examples
>>> score = st.bezirci_yilmaz("Türkçe metin örneği.")
- compute_readability(text, formula, demojize=False)
Calculate readability score using the specified formula.
- Parameters:
text (str) – Input text to analyze
formula (ReadabilityFormula) – ReadabilityFormula to use for scoring
demojize (bool) – If True, convert emojis to text descriptions before scoring
- Returns:
Readability score (higher scores generally indicate easier readability)
- Return type:
float
Examples
>>> score = st.compute_readability(text, ReadabilityFormula.Flesch_Reading_Ease)
- static count_consonants(text)
Count the number of consonants in the text after converting to ASCII.
- Parameters:
text (str) – Input text to analyze
- Returns:
Number of consonant characters found
- Return type:
int
Examples
>>> count = st.count_consonants("hello") # Returns: 3
- count_sentences(text)
Count the number of sentences in the input text.
- Parameters:
text (str) – Input text to analyze
- Returns:
Number of sentences detected
- Return type:
int
Examples
>>> count = st.count_sentences("This is one. This is two.")
>>> # Returns: 2
- count_syllables(word, tokenize=True)
Count the number of syllables in a word or text.
- Parameters:
word (str) – Input word or text to analyze
tokenize (bool) – If True, tokenize input text and count syllables for each word
- Returns:
Total number of syllables found
- Return type:
int
Examples
>>> count = st.count_syllables("hello")  # Returns: 2
>>> count = st.count_syllables("hello world", tokenize=True)  # Returns: 3
- static count_vowels(text)
Count the number of vowels in the text after converting to ASCII.
- Parameters:
text (str) – Input text to analyze
- Returns:
Number of vowel characters found
- Return type:
int
Examples
>>> count = st.count_vowels("hello") # Returns: 2
- count_words(text)
Count the number of words in the text. This function counts the number of alphanumeric tokens retrieved from the tokenize method.
- Parameters:
text (str) – Input text to count words from
- Returns:
Number of alphanumeric words found
- Return type:
int
Examples
>>> count = st.count_words("Hello, world!") # Returns: 2
- demojize(text, delimiters=('(', ')'))
Convert emoji characters to their text descriptions.
- Parameters:
text (str) – Input text containing emojis
delimiters (tuple[str, str]) – Tuple of (open, close) delimiters to wrap emoji descriptions
- Returns:
Text with emojis replaced by their descriptions
- Return type:
str
Examples
>>> text = st.demojize("I love 🐈")
>>> # Returns: "I love (cat)"
- flesch_reading_ease(text, demojize=False)
Calculate the Flesch Reading Ease score for the text. The score typically ranges from 0 to 100, though scores outside this range are possible. Higher scores indicate easier readability.
Score ranges:
90-100: Very easy
80-89: Easy
70-79: Fairly easy
60-69: Standard
50-59: Fairly difficult
30-49: Difficult
0-29: Very difficult
- Parameters:
text (str) – Input text to analyze
demojize (bool) – If True, convert emojis to text before scoring
- Returns:
Flesch Reading Ease score (higher = easier to read)
- Return type:
float
Examples
>>> score = st.flesch_reading_ease("Simple text is easy to read.")
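For reference, the standard Flesch Reading Ease formula, which this method presumably implements, is: score = 206.835 − 1.015 × (words / sentences) − 84.6 × (syllables / words).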
- static is_ready(backend, language)
Check if the backend is ready for the specified language.
- Parameters:
backend (Backend | str) – Backend to check.
language (Language | str) – Language to check.
- Returns:
True if the backend is ready for the language, False otherwise.
- Return type:
bool
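Examples
A minimal sketch; the result depends on which resources are installed in the current environment:
>>> SmoothText.is_ready("nltk", "en")
True  # If NLTK resources for English have been prepared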
- property language: Language
Get the language of the SmoothText instance.
- Returns:
Language of the SmoothText instance.
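Examples
A small illustrative sketch, assuming the instance was created with language="en":
>>> st = SmoothText(language="en", backend="nltk")
>>> st.language
Language.English_US  # 'en' defaults to the US variant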
- static prepare(backend=None, languages=None, skip_downloads=False, silence_downloaders=True, **backend_kwargs)
Prepare the required resources for text analysis.
This method downloads and initializes the necessary language models and data for the specified backend and languages. It must be called before using any text analysis functionality.
- Parameters:
backend (Backend | str | None) – The backend engine to use (NLTK or Stanza)
languages (Language | list[Language] | str | list[str] | None) – Language(s) to prepare resources for
skip_downloads (bool) – If True, skip downloading models even if not present
silence_downloaders (bool) – If True, suppress download progress output
**backend_kwargs – Additional arguments passed to backend downloaders
- Raises:
RuntimeError – If preparation fails or no valid backends are found
- Return type:
None
Examples
>>> SmoothText.prepare(backend="nltk", languages=["en"])
- reading_aloud_time(text, words_per_minute=183.0, round_up=True)
Calculate estimated reading aloud time using default speaking speed. Default speed is 183 WPM based on research averages.
- Parameters:
text (str) – Input text to analyze
words_per_minute (float) – Optional custom speaking speed
round_up (bool) – If True, round result up to nearest second
- Returns:
Estimated speaking time in seconds
- Return type:
float
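Examples
A minimal sketch; the output is illustrative (6 words at the default 183 WPM, rounded up to whole seconds):
>>> st = SmoothText(language="en", backend="nltk")
>>> st.reading_aloud_time("A short sentence to read aloud.")
2.0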
- reading_time(text, words_per_minute, round_up=True)
Calculate estimated reading time for the text.
- Parameters:
text (str) – Input text to analyze
words_per_minute (float) – Reading speed in words per minute
round_up (bool) – If True, round result up to nearest second
- Returns:
Estimated reading time in seconds
- Return type:
float
Examples
>>> time = st.reading_time("Some text to read", words_per_minute=200)
- remove_emojis(text)
Remove emoji characters from the text.
- Parameters:
text (str) – Input text containing emojis
- Returns:
Text with emojis removed
- Return type:
str
Examples
>>> text = st.remove_emojis("I love 🐈")
>>> # Returns: "I love "
- sentencize(text)
Split text into sentences using the configured backend tokenizer.
- Parameters:
text (str) – Input text to split into sentences
- Returns:
List of sentences found in the text
- Return type:
list[str]
Examples
>>> sentences = st.sentencize("This is a test. Another sentence.")
>>> # Returns: ["This is a test.", "Another sentence."]
- silent_reading_time(text, words_per_minute=238.0, round_up=True)
Calculate estimated silent reading time using default reading speed. Default speed is 238 WPM based on research averages.
- Parameters:
text (str) – Input text to analyze
words_per_minute (float) – Optional custom reading speed
round_up (bool) – If True, round result up to nearest second
- Returns:
Estimated silent reading time in seconds
- Return type:
float
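Examples
A minimal sketch; the output is illustrative (6 words at the default 238 WPM, rounded up to whole seconds):
>>> st.silent_reading_time("A short sentence to read silently.")
2.0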
- syllabify(word, tokenize=False, sentencize=False)
Split words into syllables using language-specific rules.
This method can operate on single words, lists of words, or lists of sentences containing words. However, for simple counting, it is recommended to use the count_syllables method as it is more efficient and accurate. This method will keep punctuation marks as separate tokens.
- Parameters:
word (str) – Input word or text to syllabify
tokenize (bool) – If True, split input into words first
sentencize (bool) – If True, split input into sentences first
- Returns:
List of syllables for a single word; a list of per-word syllable lists if tokenize=True; a list of sentences containing per-word syllable lists if sentencize=True.
- Return type:
list[str] | list[list[str]] | list[list[list[str]]]
Examples
>>> syllables = st.syllabify("hello")
>>> # Returns: ["hel", "lo"]
>>> word_syllables = st.syllabify("hello world", tokenize=True)
>>> # Returns: [["hel", "lo"], ["world"]]
- tokenize(text, split_sentences=False)
Tokenize text into words using the configured backend tokenizer.
- Parameters:
text (str) – Input text to tokenize
split_sentences (bool) – If True, return tokens grouped by sentences
- Returns:
List of tokens if split_sentences=False; a list of per-sentence token lists if split_sentences=True.
- Return type:
list[str] | list[list[str]]
Examples
>>> tokens = st.tokenize("Hello world!")
>>> # Returns: ["Hello", "world", "!"]
>>> sent_tokens = st.tokenize("Hi there. Bye now.", split_sentences=True)
>>> # Returns: [["Hi", "there", "."], ["Bye", "now", "."]]
- wiener_sachtextformel(text, demojize=False, version=3)
Calculate the Wiener Sachtextformel readability score for German text. The score takes into account sentence length and the frequency of words of different lengths. Higher scores indicate more difficult text.
Score ranges:
4-5: Very easy
6-8: Easy
9-11: Average
12-14: Difficult
15+: Very difficult
- Parameters:
text (str) – Input German text to analyze
demojize (bool) – If True, convert emojis to text before scoring
version (int) – Wiener Sachtextformel version to use (1-4)
- Returns:
Wiener Sachtextformel readability score (higher = more difficult)
- Return type:
float
Examples
>>> score = st.wiener_sachtextformel("Ein deutsches Textbeispiel.")
>>> score = st.wiener_sachtextformel("Ein deutsches Textbeispiel.", version=3)
- wiener_sachtextformel_1(text, demojize=False)
Calculate the first variant of the Wiener Sachtextformel for German text, optimized for narrative texts. The score takes into account sentence length and the frequency of words of different lengths. Higher scores indicate more difficult text.
- Parameters:
text (str) – Input German text to analyze
demojize (bool) – If True, convert emojis to text before scoring
- Returns:
Wiener Sachtextformel readability score (higher = more difficult)
- Return type:
float
Examples
>>> score = st.wiener_sachtextformel_1("Ein deutsches Textbeispiel.")
- wiener_sachtextformel_2(text, demojize=False)
Calculate the second variant of the Wiener Sachtextformel for German text, optimized for scientific texts. The score takes into account sentence length and the frequency of words of different lengths. Higher scores indicate more difficult text.
- Parameters:
text (str) – Input German text to analyze
demojize (bool) – If True, convert emojis to text before scoring
- Returns:
Wiener Sachtextformel readability score (higher = more difficult)
- Return type:
float
Examples
>>> score = st.wiener_sachtextformel_2("Ein deutsches Textbeispiel.")
- wiener_sachtextformel_3(text, demojize=False)
Calculate the third variant of the Wiener Sachtextformel for German text, a general purpose formula. The score takes into account sentence length and the frequency of words of different lengths. Higher scores indicate more difficult text.
- Parameters:
text (str) – Input German text to analyze
demojize (bool) – If True, convert emojis to text before scoring
- Returns:
Wiener Sachtextformel readability score (higher = more difficult)
- Return type:
float
Examples
>>> score = st.wiener_sachtextformel_3("Ein deutsches Textbeispiel.")
- wiener_sachtextformel_4(text, demojize=False)
Calculate the fourth variant of the Wiener Sachtextformel for German text, an alternative general purpose formula. The score takes into account sentence length and the frequency of words of different lengths. Higher scores indicate more difficult text.
- Parameters:
text (str) – Input German text to analyze
demojize (bool) – If True, convert emojis to text before scoring
- Returns:
Wiener Sachtextformel readability score (higher = more difficult)
- Return type:
float
Examples
>>> score = st.wiener_sachtextformel_4("Ein deutsches Textbeispiel.")
- word_frequencies(text, lemmatize=True)
Count the frequency of words in the text.
- Parameters:
text (str) – Input text to analyze
lemmatize (bool) – If True, lemmatize words before counting
- Returns:
Dictionary of word frequencies
- Return type:
dict[str, int]
Examples
>>> freqs = st.word_frequencies("Hello world! Hello again.")
>>> # Returns: {"hello": 2, "world": 1, "again": 1}