API

The following lists the classes of textClustPy together with their parameters, variables, and methods.

class textClustPy.textclust(radius=0.3, _lambda=0.0005, tgap=100, verbose=None, termfading=True, realtimefading=True, micro_distance='tfidf_cosine_distance', macro_distance='tfidf_cosine_distance', model=None, idf=True, num_macro=3, minWeight=0, config=None, embedding_verification=False, callback=None, auto_r=False, auto_merge=True, sigma=1)

This class implements the textClust clustering algorithm.

Parameters
  • radius (float) – Distance threshold to merge two micro-clusters (default=0.3)

  • _lambda (float) – Fading factor of micro-clusters

  • tgap (int) – Time between outlier removal (default=100)

  • verbose – Verbose mode (default=None)

  • termfading (bool) – Whether individual terms should also be faded (default=True)

  • realtimefading (bool) – Whether natural time (True) or the number of observations (False) should be used for fading (default=True)

  • micro_distance (string) – Distance metric used for clustering micro clusters (default='tfidf_cosine_distance')

  • macro_distance (string) – Distance metric used for clustering macro clusters (default='tfidf_cosine_distance')

  • model (string) – Name of the Word Embedding Model that can be used for clustering (default=None)

  • num_macro (int) – Number of macro clusters that should be identified during the reclustering phase (default = 3)

  • minWeight (float) – Minimum weight of micro clusters to be used for reclustering (default = 0)

  • config (string) – Path and filename of external configuration file (default = None)

  • callback (function, optional) – Callback function that should be called after tgap steps (default = None)
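As a rough illustration of how the constructor parameters fit together, the sketch below collects the defaults from the class signature in a dictionary and only instantiates the class when the package is importable. `DEFAULTS`, `merged_params`, and `build_clusterer` are hypothetical names introduced for this example, and the import path is an assumption derived from the class name `textClustPy.textclust`.

```python
import importlib.util

# Defaults copied from the documented constructor signature.
DEFAULTS = {
    "radius": 0.3,
    "_lambda": 0.0005,
    "tgap": 100,
    "termfading": True,
    "realtimefading": True,
    "micro_distance": "tfidf_cosine_distance",
    "macro_distance": "tfidf_cosine_distance",
    "num_macro": 3,
    "minWeight": 0,
}


def merged_params(**overrides):
    """Hypothetical helper: return a copy of the defaults with overrides applied."""
    return {**DEFAULTS, **overrides}


def build_clusterer(**overrides):
    """Hypothetical helper: build a clusterer, or return None if textClustPy is missing."""
    if importlib.util.find_spec("textClustPy") is None:
        return None
    from textClustPy import textclust  # assumed import path
    return textclust(**merged_params(**overrides))
```

For example, `build_clusterer(radius=0.5, tgap=10)` would loosen the merge threshold and run outlier removal more often, while leaving all other parameters at their defaults.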

Variables
  • n – number of processed documents

  • omega – minimum weight of a cluster, defined as \(2^{-\lambda \cdot \text{tgap}}\)
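To get a feel for the size of omega, the formula can be evaluated directly. This is a minimal sketch that assumes the exponent uses the constructor parameters `_lambda` and `tgap`; the function name `omega` is chosen here for illustration only.

```python
def omega(_lambda: float = 0.0005, tgap: int = 100) -> float:
    """Minimum cluster weight, computed as 2 ** (-lambda * tgap)."""
    return 2 ** (-_lambda * tgap)


# With the default parameters the threshold stays close to 1.
print(round(omega(), 4))  # 0.9659

# A larger fading factor lowers the threshold considerably.
print(omega(_lambda=0.01, tgap=100))
```

With the defaults, the threshold is only slightly below 1; increasing `_lambda` or `tgap` lowers it.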

changedistance(type, metric_name)

Changes the distance metric used for micro/macro clustering

Parameters
  • type (string) – Either “micro” or “macro”

  • metric_name (string) – Name of the new distance metric

deleteModel()

Deletes the embedding model currently used

get_macroclusters()

Returns a list of all current macro clusters

Returns

List of macro cluster dictionaries.

get_microclusters()

Returns a list of all current micro clusters

Returns

List of textClustPy.microcluster objects

loadconfig(filename)

Loads the config file

Parameters

filename (string) – Relative name/path of the config file

showclusters(topn, num, type='micro')

Prints the top micro/macro clusters, sorted by weight.

Parameters
  • topn (int) – Number of top clusters to display

  • num (int) – Number of cluster representatives shown for each cluster

  • type (string) – Type of cluster, either “micro” or “macro” (default='micro')
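The ranking that showclusters describes can be sketched on plain dictionaries. This is not the library's code: `top_clusters` is a hypothetical helper, clusters are modelled as dictionaries, and "representatives" is assumed here to mean the most frequent terms of a cluster.

```python
def top_clusters(clusters, topn, num):
    """Return the `topn` heaviest clusters, each with its `num` most frequent terms."""
    ranked = sorted(clusters, key=lambda c: c["weight"], reverse=True)[:topn]
    return [
        (c["id"], c["weight"], sorted(c["tf"], key=c["tf"].get, reverse=True)[:num])
        for c in ranked
    ]


# Toy data standing in for micro clusters.
clusters = [
    {"id": 0, "weight": 1.5, "tf": {"rain": 3, "sun": 1}},
    {"id": 1, "weight": 4.0, "tf": {"goal": 5, "match": 2, "team": 4}},
    {"id": 2, "weight": 0.2, "tf": {"tax": 1}},
]

for cluster_id, weight, terms in top_clusters(clusters, topn=2, num=2):
    print(cluster_id, weight, terms)
```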

update(text, id, time, realtime=None)

Updates the micro-clustering by incorporating a new observation.

Parameters
  • text (string) – A new text document that should be clustered

  • id (int/double) – Unique document id

  • time (time, optional) – Timestamp of the new text document. If realtimefading is enabled, this parameter has to be provided.
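What "incorporating a new observation" might look like can be illustrated with a deliberately simplified toy. The sketch assumes that a document joins the closest cluster whose distance is below `radius` and otherwise opens a new cluster. It uses raw term frequencies instead of TF-IDF and applies no fading. `cosine_distance` and `toy_update` are hypothetical names and do not reproduce the library's implementation.

```python
import math
from collections import Counter


def cosine_distance(a: Counter, b: Counter) -> float:
    """Return 1 minus the cosine similarity of two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def toy_update(clusters: list, text: str, doc_id: int, time: int, radius: float = 0.3) -> int:
    """Assign a document to the closest cluster within `radius`, else open a new one.

    Returns the index of the cluster that received the document.
    """
    tf = Counter(text.lower().split())
    best, best_dist = None, radius
    for i, cluster in enumerate(clusters):
        dist = cosine_distance(tf, cluster["tf"])
        if dist < best_dist:
            best, best_dist = i, dist
    if best is None:
        # No cluster is close enough: the document starts its own cluster.
        clusters.append({"tf": tf, "weight": 1.0, "time": time, "textids": [doc_id]})
        return len(clusters) - 1
    # Merge the document into the closest cluster.
    cluster = clusters[best]
    cluster["tf"].update(tf)
    cluster["weight"] += 1.0
    cluster["time"] = time
    cluster["textids"].append(doc_id)
    return best


clusters = []
first = toy_update(clusters, "stream clustering of text", doc_id=1, time=1)
second = toy_update(clusters, "clustering of text stream", doc_id=2, time=2)
third = toy_update(clusters, "weather forecast sunny", doc_id=3, time=3)
```

The first two documents share all their terms and end up in the same cluster, while the third has no overlap and opens a second one.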

class textClustPy.microcluster(tf, time, weight, realtime, textid, clusterid)

Micro-clusters are statistical summaries of the data stream.

Variables
  • id – cluster id

  • tf – terms and frequencies of the micro cluster tokens

  • weight – the current weight of the microcluster

  • time – last time the microcluster was updated

  • textids – All ids of documents that were assigned to the micro cluster
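The variables above can be pictured as a small record type. The sketch below is an illustrative stand-in, not the library class: `ToyMicroCluster` and its `fade` method are hypothetical, and the decay rule \(2^{-\lambda \Delta t}\) is an assumption modelled on the definition of omega.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMicroCluster:
    """Illustrative stand-in mirroring the documented variables; not the library class."""

    id: int
    tf: dict
    weight: float
    time: float
    textids: list = field(default_factory=list)

    def fade(self, now: float, _lambda: float, termfading: bool = True) -> None:
        """Decay the weight (and optionally the term frequencies) by 2 ** (-lambda * elapsed)."""
        factor = 2 ** (-_lambda * (now - self.time))
        self.weight *= factor
        if termfading:
            self.tf = {term: freq * factor for term, freq in self.tf.items()}
        # Assumption: fading counts as an update, so the timestamp moves forward.
        self.time = now


mc = ToyMicroCluster(id=0, tf={"stream": 2.0}, weight=4.0, time=0.0, textids=[1])
mc.fade(now=100, _lambda=0.01)  # elapsed time of 100 with lambda 0.01 halves the weight
```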