Inputs

class textClustPy.Input(textclust, preprocessor, timeformat='%Y-%m-%d %H:%M:%S', timeprecision='seconds', config=None, callback=None)

Abstract input class

Parameters
  • textclust (textClustPy.textclust) – A textclust instance of type textClustPy.textclust

  • preprocessor (textClustPy.Preprocessor) – Preprocessor instance of type textClustPy.Preprocessor

  • timeformat (string) – Specifies the time format, given as strftime directives (see https://strftime.org). Default is: %Y-%m-%d %H:%M:%S

  • timeprecision (string) – If realtimefading is enabled, timeprecision specifies the time unit on which the fading factor is applied (seconds/minutes/hours). Default is: “seconds”

  • config (string) – Relative path/name of config file

  • callback (function) – Callback function that is called for each incoming observation. The callback function expects four parameters: ID, time, text and an Observation object.
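
The callback contract described above can be sketched roughly as follows. This is a minimal illustration, not the library's own code: the function name, the parameter names and the `seen` list are hypothetical, and it is an assumption that the time may arrive as a string in the default timeformat. The Observation object is treated as opaque.

```python
from datetime import datetime

# Default time format from the Input signature.
TIMEFORMAT = "%Y-%m-%d %H:%M:%S"

seen = []  # hypothetical sink for incoming observations


def on_observation(obs_id, time, text, observation):
    """Callback taking the four documented parameters (names are illustrative)."""
    # Assumption: the time may arrive as a string in `timeformat`; parse it if so.
    if isinstance(time, str):
        time = datetime.strptime(time, TIMEFORMAT)
    seen.append((obs_id, time, text))


# Direct call with stand-in values, only to show the expected argument shape.
on_observation(1, "2020-01-31 12:00:05", "hello stream", None)
```

A function like this would be passed as callback=on_observation when constructing an input.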

class textClustPy.CSVInput(csvfile=None, delimiter='|', quotechar=';', newline='\n', col_id=1, col_time=1, col_text=2, col_label=3, **kwargs)

This class implements a CSV input

Parameters
  • csvfile (string) – Relative path and filename of the csv document

  • delimiter (char) – Delimiter that separates different columns

  • quotechar (char) – Character that is used for quotes

  • newline (char) – Character indicating a new line.

  • col_id (int) – Column index that contains the text id

  • col_time (int) – Column index that contains the time

  • col_text (int) – Column index that contains the text

  • col_label (int) – Column index that contains the true cluster label

run()

Update the textclust algorithm with the complete data in the data frame

update(n)

Update the textclust algorithm on new observations

Parameters
  • n (int) – Number of observations that should be used by textclust
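
To illustrate how the default delimiter ('|') and quotechar (';') of CSVInput shape a line of input, the sketch below parses one sample row with Python's standard csv module. The sample row and its column order (id, time, text, label) are made up for illustration. Nothing here reflects how textClustPy reads the file internally or whether its column indices are zero- or one-based.

```python
import csv
import io

# Made-up sample row: '|' separates columns, ';' quotes a field
# so that the text itself may contain the delimiter.
sample = "1|2020-01-31 12:00:05|;a text with a | inside;|sports\n"

reader = csv.reader(io.StringIO(sample), delimiter="|", quotechar=";")
rows = list(reader)

# The quoted third field keeps its embedded '|' and loses the quote characters.
text_field = rows[0][2]
```

With these settings, a text column that contains the delimiter needs to be wrapped in the quote character, otherwise it would be split into extra columns.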

class textClustPy.InMemInput(pdframe, col_id=1, col_time=1, col_text=2, col_label=None, **kwargs)

Parameters
  • pdframe (DataFrame) – Pandas data frame that serves as stream input

  • col_id (int) – Column index that contains the text id

  • col_time (int) – Column index that contains the time

  • col_text (int) – Column index that contains the text

  • col_label (int) – Column index that contains the true cluster label

run()

Update the textclust algorithm with the complete data in the data frame

update(n)

Update the textclust algorithm on new observations

Parameters
  • n (int) – Number of observations that should be used by textclust

class textClustPy.TwitterInput(api_key, api_secret, access_token, access_secret, terms, languages=['en'], conf=None, **kwargs)

A Twitter input accesses the Twitter stream and directly applies textclust to the incoming data.

Parameters
  • api_key (string) – Twitter API key

  • api_secret (string) – Twitter API secret

  • access_token (string) – Twitter access token

  • access_secret (string) – Twitter access secret

  • terms (List of strings) – List of search terms/hashtags that should be monitored on Twitter

  • languages (List of strings) – Filter tweets by languages

  • callback (Function) – Callback function that expects one parameter of Tweepy type Status (see http://docs.tweepy.org/en/latest/)
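
A callback for TwitterInput could look roughly like the sketch below. It takes a single Status-like object and reads its id and text attributes, which Tweepy's Status model provides. The function name, the `collected` list and the stand-in object are hypothetical and serve only to make the example runnable without Twitter credentials.

```python
from types import SimpleNamespace

collected = []  # hypothetical sink for (id, text) pairs


def on_status(status):
    """Callback taking one Tweepy Status-like object."""
    # `id` and `text` are attributes of Tweepy's Status model.
    collected.append((status.id, status.text))


# Stand-in object for demonstration; a live stream would supply real Status objects.
on_status(SimpleNamespace(id=42, text="example tweet"))
```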