Preprocessor

class textClustPy.Preprocessor(language='english', stopword_removal=True, stemming=False, punctuation=True, hashtag=True, username=True, url=True, max_grams=1, exclude_tokens=None)

The Preprocessor object is one of the three core components. It determines how incoming text observations are preprocessed before clustering.

Parameters
  • language (string) – Language used for stopword identification. Stopwords are detected with NLTK.

  • stopword_removal (bool) – Boolean variable, indicating whether stopwords should be removed.

  • stemming (bool) – Boolean variable, indicating whether stemming should be applied.

  • punctuation (bool) – Boolean variable, indicating whether punctuation should be removed.

  • hashtag (bool) – Boolean variable, indicating whether hashtags should be removed (especially interesting for Twitter input).

  • username (bool) – Boolean variable, indicating whether usernames (@username) should be removed (especially interesting for Twitter input).

  • url (bool) – Boolean variable, indicating whether URLs should be removed.

  • max_grams (int) – Which n-grams should be generated (max_grams=2 creates 1-grams as well as 2-grams). The higher max_grams the higher
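
To illustrate the max_grams setting, the following is a minimal sketch of n-gram generation. It is not textClustPy's actual implementation: the helper name generate_ngrams and the space-joined output format are assumptions made for illustration only.

```python
def generate_ngrams(tokens, max_grams=1):
    """Return all n-grams for n = 1..max_grams as space-joined strings.

    Hypothetical helper for illustration; not part of textClustPy.
    """
    ngrams = []
    for n in range(1, max_grams + 1):
        # Slide a window of size n over the token list.
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


tokens = ["stream", "clustering", "rocks"]
print(generate_ngrams(tokens, max_grams=2))
# ['stream', 'clustering', 'rocks', 'stream clustering', 'clustering rocks']
```

With max_grams=2, the three tokens yield three 1-grams plus two 2-grams, so the number of generated terms grows with max_grams.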

preprocess(observation_text)

Preprocesses a string with the given settings.

Parameters
  • observation_text (string) – Input string.
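
As a rough picture of what the removal flags do, the sketch below applies comparable cleaning steps using only the Python standard library. It is an approximation under stated assumptions, not the library's code: the function name sketch_preprocess, the regular expressions, the choice to drop whole hashtag and username tokens, and the whitespace tokenization are all hypothetical. Stopword removal and stemming are left out.

```python
import re
import string


def sketch_preprocess(text, punctuation=True, hashtag=True, username=True, url=True):
    """Hypothetical stand-in for the cleaning steps; not textClustPy's code."""
    if url:
        # Assumption: a URL is anything starting with http:// or https://.
        text = re.sub(r"https?://\S+", " ", text)
    if username:
        # Assumption: the whole @username token is dropped.
        text = re.sub(r"@\w+", " ", text)
    if hashtag:
        # Assumption: the whole #hashtag token is dropped.
        text = re.sub(r"#\w+", " ", text)
    if punctuation:
        # Strip all ASCII punctuation characters.
        text = text.translate(str.maketrans("", "", string.punctuation))
    # Assumption: tokens are separated by whitespace.
    return text.split()


print(sketch_preprocess("Loving #clustering with @alice! See https://example.com now."))
# ['Loving', 'with', 'See', 'now']
```

URLs are removed before punctuation in this sketch so that the URL pattern still sees the "://" separator.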