Miscellaneous functions for manipulating text

Collection of text functions that don’t fit in another category.

kitchen.text.misc.byte_string_valid_encoding(byte_string, encoding='utf-8')

Detect if a byte str is valid in a specific encoding

Parameters:

byte_string – Byte `str` to test for bytes not valid in this encoding
encoding – Encoding to test against. Defaults to UTF-8.

Returns:	`True` if every byte sequence is valid in the given encoding. `False` if an invalid sequence is detected.

Note

This function checks whether the byte str is valid in the specified encoding. It does not detect whether the byte str actually was encoded in that encoding. If you want that sort of functionality, you probably want to use guess_encoding() instead.
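
The distinction in the note above can be illustrated with a minimal sketch. It is not kitchen's implementation: it is written for Python 3, where `bytes` plays the role of the byte str, and `is_valid_in_encoding` is a hypothetical helper that treats "valid" as "decodes without error".

```python
def is_valid_in_encoding(byte_string, encoding='utf-8'):
    """Return True if byte_string decodes cleanly in the given encoding.

    Hypothetical stand-in used for illustration only.
    """
    try:
        byte_string.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


utf8_bytes = 'caf\u00e9'.encode('utf-8')      # b'caf\xc3\xa9'
latin1_bytes = 'caf\u00e9'.encode('latin-1')  # b'caf\xe9'

# The latin-1 bytes are not valid UTF-8.
print(is_valid_in_encoding(latin1_bytes, 'utf-8'))   # False
# The UTF-8 bytes are valid latin-1, even though they were never
# encoded as latin-1. Validity says nothing about the real encoding.
print(is_valid_in_encoding(utf8_bytes, 'latin-1'))   # True
```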

kitchen.text.misc.byte_string_valid_xml(byte_string, encoding='utf-8')

Check that a byte str would be valid in xml

Parameters:

byte_string – Byte `str` to check
encoding – Encoding of the xml file. Default: UTF-8

Returns:	`True` if the string is valid. `False` if it would be invalid in the xml file

In some cases you’ll have a whole bunch of byte strings and rather than transforming them to unicode and back to byte str for output to xml, you will just want to make sure they work with the xml file you’re constructing. This function will help you do that. Example:

ARRAY_OF_MOSTLY_UTF8_STRINGS = [...]
processed_array = []
for string in ARRAY_OF_MOSTLY_UTF8_STRINGS:
    if byte_string_valid_xml(string, 'utf-8'):
        processed_array.append(string)
    else:
        processed_array.append(guess_bytes_to_xml(string, encoding='utf-8'))
output_xml(processed_array)

kitchen.text.misc.guess_encoding(byte_string, disable_chardet=False)

Try to guess the encoding of a byte str

Parameters:

byte_string – byte `str` to guess the encoding of
disable_chardet – If this is True, we never attempt to use `chardet` to guess the encoding. This is useful if you need to have reproducibility whether `chardet` is installed or not. Default: `False`.

Raises TypeError:	if `byte_string` is not a byte `str` type
Returns:	string containing a guess at the encoding of `byte_string`. This is appropriate to pass as the encoding argument when encoding and decoding unicode strings.

We start by attempting to decode the byte str as UTF-8. If this succeeds, we report the encoding as UTF-8. If it fails, and chardet is installed and disable_chardet is False, this function uses chardet to try to detect the encoding of byte_string. If chardet is not installed, or it cannot determine the encoding with high enough confidence, we rather arbitrarily claim that the encoding is latin-1. Because every byte value maps to a character in latin-1, decoding from latin-1 to unicode never raises a UnicodeError, although the output might be mangled.
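
The fallback order described above can be sketched roughly as follows. This is an illustration under stated assumptions, not kitchen's code: it targets Python 3 (`bytes` standing in for the byte str), `guess_encoding_sketch` is a hypothetical name, and the `threshold` value is an assumption because the text only says "high enough confidence".

```python
def guess_encoding_sketch(byte_string, disable_chardet=False, threshold=0.9):
    """Guess an encoding: UTF-8 first, then chardet, then latin-1.

    threshold is an assumed confidence cutoff, not a documented value.
    """
    if not isinstance(byte_string, bytes):
        raise TypeError('byte_string must be a byte string')

    # Step 1: if the bytes decode as UTF-8, report UTF-8.
    try:
        byte_string.decode('utf-8')
        return 'utf-8'
    except UnicodeDecodeError:
        pass

    # Step 2: optionally consult chardet, if it is installed.
    if not disable_chardet:
        try:
            import chardet
        except ImportError:
            chardet = None
        if chardet is not None:
            result = chardet.detect(byte_string)
            if result['encoding'] and result['confidence'] >= threshold:
                return result['encoding']

    # Step 3: fall back to latin-1, which can decode any byte sequence.
    return 'latin-1'


print(guess_encoding_sketch(b'caf\xc3\xa9', disable_chardet=True))  # utf-8
print(guess_encoding_sketch(b'caf\xe9', disable_chardet=True))      # latin-1
```

Passing `disable_chardet=True` makes the result depend only on the input bytes, which is the reproducibility property the parameter description mentions.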

kitchen.text.misc.html_entities_unescape(string)

Substitute unicode characters for HTML entities

Parameters:	string – `unicode` string to substitute out html entities
Raises TypeError:	if something other than a `unicode` string is given
Return type:	`unicode` string
Returns:	The plain text without html entities
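
For a feel of the transformation, here is a rough Python 3 analogue built on the standard library's `html.unescape` rather than on kitchen itself. The wrapper name `html_entities_unescape_sketch` is hypothetical, and the standard library function may not agree with kitchen on every edge case.

```python
import html


def html_entities_unescape_sketch(string):
    """Replace HTML entities with the characters they stand for.

    Swapped-in technique: delegates to the stdlib html.unescape().
    """
    if not isinstance(string, str):
        raise TypeError('expected a unicode string')
    return html.unescape(string)


# Named, decimal, and markup entities are all substituted.
text = html_entities_unescape_sketch('&lt;b&gt;caf&eacute; &amp; cr&#232;me&lt;/b&gt;')
print(text)
```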

kitchen.text.misc.process_control_chars(string, strategy='replace')

Look for and transform control characters in a string

Parameters:

string – string to search for and transform control characters within
strategy –
XML does not allow ASCII control characters. When we encounter those we need to know what to do. Valid options are:

replace: (default) Replace the control characters with "?"

ignore: Remove the characters altogether from the output

strict: Raise a ControlCharError when we encounter a control character

Raises:

TypeError – if string is not a unicode string.
ValueError – if the strategy is not one of replace, ignore, or strict.
kitchen.text.exceptions.ControlCharError – if the strategy is strict and a control character is present in the string

Returns:

unicode string with no control characters in it.
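
The three strategies can be sketched as below. This is an illustrative Python 3 version, not kitchen's implementation. Two things are assumptions: the exact set of characters treated as control characters (here, code points 0 to 31 except tab, line feed, and carriage return, which XML 1.0 permits), and the local `ControlCharError` class, which merely stands in for `kitchen.text.exceptions.ControlCharError`.

```python
class ControlCharError(Exception):
    """Local stand-in for kitchen.text.exceptions.ControlCharError."""


# Assumed set: ASCII control codes other than tab, LF, and CR.
CONTROL_CODES = frozenset(range(32)) - {9, 10, 13}


def process_control_chars_sketch(string, strategy='replace'):
    """Replace, drop, or reject control characters in a unicode string."""
    if not isinstance(string, str):
        raise TypeError('string must be a unicode string')

    if strategy == 'replace':
        # Map every control code point to a question mark.
        return string.translate(dict.fromkeys(CONTROL_CODES, '?'))
    if strategy == 'ignore':
        # Mapping a code point to None deletes it.
        return string.translate(dict.fromkeys(CONTROL_CODES, None))
    if strategy == 'strict':
        if any(ord(char) in CONTROL_CODES for char in string):
            raise ControlCharError('control character found in string')
        return string
    raise ValueError('strategy must be replace, ignore, or strict')


print(process_control_chars_sketch('a\x00b\tc'))            # a?b	c
print(process_control_chars_sketch('a\x00b\tc', 'ignore'))  # ab	c
```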

kitchen.text.misc.str_eq(str1, str2, encoding='utf-8', errors='replace')

Compare two strings, converting to byte str if one is unicode

Parameters:

str1 – First string to compare
str2 – Second string to compare
encoding – If we need to convert one string into a byte `str` to compare, the encoding to use. Default is utf-8.
errors – What to do if we encounter errors when encoding the string. See the `kitchen.text.converters.to_bytes()` documentation for possible values. The default is `replace`.

This function prevents UnicodeError (python-2.4 or less) and UnicodeWarning (python 2.5 and higher) when we compare a unicode string to a byte str. The errors normally arise because the conversion is done to ASCII. This function lets you convert to utf-8 or another encoding instead.

Note

When the two strings are of different types, we convert the unicode string into a byte str before comparing. This means the same pair of strings can compare equal under one encoding and unequal under another.

Note that str1 == str2 is faster than this function if you can accept the following limitations:

Limited to python-2.5+ (otherwise a UnicodeDecodeError may be thrown)
Will generate a UnicodeWarning if non-ASCII byte str is compared to unicode string.
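
The comparison rule and its dependence on the encoding can be sketched as follows. This is a simplified Python 3 illustration, not kitchen's implementation: `str` plays the role of unicode, `bytes` the role of the byte str, and `str_eq_sketch` is a hypothetical name.

```python
def str_eq_sketch(str1, str2, encoding='utf-8', errors='replace'):
    """Compare two strings, encoding the unicode one if the types differ."""
    if isinstance(str1, str) and isinstance(str2, bytes):
        str1 = str1.encode(encoding, errors)
    elif isinstance(str1, bytes) and isinstance(str2, str):
        str2 = str2.encode(encoding, errors)
    return str1 == str2


utf8_bytes = 'caf\u00e9'.encode('utf-8')  # b'caf\xc3\xa9'

# Equal when the unicode string is encoded the same way as the bytes.
print(str_eq_sketch('caf\u00e9', utf8_bytes))                      # True
# Unequal when a different encoding is used for the conversion.
print(str_eq_sketch('caf\u00e9', utf8_bytes, encoding='latin-1'))  # False
```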
