Preprocessor

class textClustPy.Preprocessor(language='english', stopword_removal=True, stemming=False, punctuation=True, hashtag=True, username=True, url=True, max_grams=1, exclude_tokens=None)

The Preprocessor object is one of the three core components of textClustPy. It determines how incoming text observations are preprocessed before clustering.

Parameters

language (string) – Language that should be used for stopword identification. We use nltk for stopword detection.
stopword_removal (bool) – Boolean variable, indicating whether stopwords should be removed.
stemming (bool) – Boolean variable, indicating whether stemming should be applied.
punctuation (bool) – Boolean variable, indicating whether punctuation should be removed.
hashtag (bool) – Boolean variable, indicating whether hashtags should be removed (especially interesting for Twitter input).
username (bool) – Boolean variable, indicating whether usernames (@username) should be removed (especially interesting for Twitter input).
url (bool) – Boolean variable, indicating whether URLs should be removed.
max_grams (int) – Largest n-gram size to generate; all sizes from 1 up to max_grams are created (max_grams=2 creates 1-grams as well as 2-grams). The higher max_grams the higher

preprocess(observation_text)

Preprocesses a string with the given settings.

Parameters

observation_text (string) – Input string.
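
To make the effect of the removal flags and max_grams concrete, here is a minimal standalone sketch in plain Python. It is not textClustPy's implementation: the helper name sketch_preprocess, the regular expressions, and the whitespace tokenization are all assumptions chosen for illustration, and stopword removal and stemming are left out because they depend on nltk.

```python
import re
import string

# Hypothetical patterns for illustration only; the library may match differently.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_RE = re.compile(r"#\w+")
USERNAME_RE = re.compile(r"@\w+")


def sketch_preprocess(text, punctuation=True, hashtag=True,
                      username=True, url=True, max_grams=1):
    """Illustrative stand-in for the cleaning steps the flags describe."""
    # Remove URLs first, before punctuation stripping breaks them apart.
    if url:
        text = URL_RE.sub(" ", text)
    if hashtag:
        text = HASHTAG_RE.sub(" ", text)
    if username:
        text = USERNAME_RE.sub(" ", text)
    if punctuation:
        text = text.translate(str.maketrans("", "", string.punctuation))

    # Assumption: simple whitespace tokenization.
    tokens = text.split()

    # Build every n-gram size from 1 up to max_grams.
    grams = []
    for n in range(1, max_grams + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


print(sketch_preprocess("a b c", max_grams=2))
# ['a', 'b', 'c', 'a b', 'b c']
```

With max_grams=2 the three tokens yield three 1-grams plus two 2-grams, so the number of generated terms grows with max_grams.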