API
The following lists the classes and methods of the textClustPy package.
- class textClustPy.textclust(radius=0.3, _lambda=0.0005, tgap=100, verbose=None, termfading=True, realtimefading=True, micro_distance='tfidf_cosine_distance', macro_distance='tfidf_cosine_distance', model=None, idf=True, num_macro=3, minWeight=0, config=None, embedding_verification=False, callback=None, auto_r=False, auto_merge=True, sigma=1)
This class implements the textClust clustering algorithm.
- Parameters
radius (float) – Distance threshold to merge two micro-clusters (default=0.3)
_lambda (float) – Fading factor of micro-clusters
tgap (int) – Time between outlier removal runs (default=100)
verbose – Verbose mode (default=false)
termfading (bool) – Whether individual terms should also be faded (default=true)
realtimefading (bool) – Whether natural time (true) or the number of observations (false) is used for fading (default=true)
micro_distance (string) – Distance metric used for clustering micro clusters (default="tfidf_cosine_distance")
macro_distance (string) – Distance metric used for clustering macro clusters (default="tfidf_cosine_distance")
model (string) – Name of the Word Embedding Model that can be used for clustering (default=None)
num_macro (int) – Number of macro clusters that should be identified during the reclustering phase (default = 3)
minWeight (float) – Minimum weight of micro clusters to be used for reclustering (default = 0)
config (string) – Path and filename of external configuration file (default = None)
callback (function, optional) – Callback function that should be called after tgap steps (default = None)
- Variables
n – number of processed documents
omega – minimum weight of a cluster, defined as \(2^{-\lambda \cdot gap}\)
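As a rough illustration of the omega formula, the sketch below evaluates it for the default constructor arguments. It assumes that λ corresponds to the `_lambda` parameter and that the gap term corresponds to `tgap`; neither mapping is stated explicitly above, and the helper name `min_cluster_weight` is hypothetical, not part of textClustPy.

```python
def min_cluster_weight(lam, gap):
    """Evaluate 2^(-lambda * gap), the documented form of omega.

    lam and gap are assumed to map to the _lambda and tgap
    constructor parameters; this mapping is an assumption.
    """
    return 2 ** (-lam * gap)


# Defaults taken from the class signature: _lambda=0.0005, tgap=100
omega = min_cluster_weight(0.0005, 100)
print(round(omega, 4))  # roughly 0.966
```

Under these assumptions, a larger fading factor or a longer gap lowers the threshold, so clusters may decay further before they count as outliers.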
- changedistance(type, metric_name)
Changes the distance metric used for micro/macro clustering
- Parameters
type (string) – Either “micro” or “macro”
metric_name (string) – Name of the new distance metric
- deleteModel()
Deletes the embedding model currently used
- get_macroclusters()
Returns a list of all current macro clusters
- Returns
List of macro cluster dictionaries.
- get_microclusters()
Returns a list of all current micro clusters
- Returns
List of textClustPy.microcluster objects
- loadconfig(filename)
Loads the config file
- Parameters
filename (string) – Relative name/path of the config file
- showclusters(topn, num, type='micro')
Prints the top micro/macro clusters, sorted by weight
- Parameters
topn (int) – Number of top clusters to display
num (int) – Number of cluster representatives shown for each cluster
type (string) – Type of cluster, either “micro” or “macro”
- update(text, id, time, realtime=None)
Updates the micro-clustering by incorporating a new observation
- Parameters
text (string) – A new text document that should be clustered
id (int/double) – Unique document id
time (time, optional) – Timestamp of the new text document. If realtimefading is enabled, this parameter has to be provided.
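The following sketch shows one way a stream of documents might be passed to update, one observation at a time. The helper `feed_documents` is hypothetical and only relies on the documented call shape `update(text, id, time)`; the use of `datetime` objects as timestamps and the commented-out library calls are assumptions, not confirmed behaviour.

```python
from datetime import datetime, timedelta


def feed_documents(clusterer, documents):
    """Pass (text, id, time) triples to clusterer.update in stream order.

    clusterer is any object with an update(text, id, time) method,
    as documented for textclust. Returns the number of documents fed.
    """
    count = 0
    for text, doc_id, timestamp in documents:
        clusterer.update(text, doc_id, timestamp)
        count += 1
    return count


# Illustrative documents; datetime timestamps are an assumption.
start = datetime(2020, 1, 1)
documents = [
    ("stream clustering of short texts", 1, start),
    ("text stream clustering", 2, start + timedelta(seconds=30)),
]

# With the library installed, usage would presumably look like this:
# from textClustPy import textclust
# clust = textclust(radius=0.3, _lambda=0.0005, tgap=100)
# feed_documents(clust, documents)
# clust.showclusters(5, 3, type='micro')
```

Keeping the loop separate from the clusterer makes it easy to try the same stream against differently configured instances.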
- class textClustPy.microcluster(tf, time, weight, realtime, textid, clusterid)
Micro-clusters are statistical summaries of the data stream.
- Variables
id – cluster id
tf – terms and frequencies of the micro cluster tokens
weight – the current weight of the microcluster
time – last time the microcluster was updated
textids – All ids of documents that were assigned to the micro cluster
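As a small illustration of working with these attributes, the sketch below ranks micro-cluster objects by their weight. It uses stand-in objects that carry only the documented `id`, `weight` and `textids` attributes, with made-up values; the helper `heaviest_clusters` is hypothetical, and it assumes that get_microclusters() returns a plain list as described above.

```python
from types import SimpleNamespace


def heaviest_clusters(microclusters, topn):
    """Return the topn clusters with the largest weight, heaviest first."""
    return sorted(microclusters, key=lambda mc: mc.weight, reverse=True)[:topn]


# Stand-ins for microcluster objects (hypothetical values).
clusters = [
    SimpleNamespace(id=0, weight=1.5, textids=[1, 4]),
    SimpleNamespace(id=1, weight=3.0, textids=[2, 3, 5]),
    SimpleNamespace(id=2, weight=0.4, textids=[6]),
]

top = heaviest_clusters(clusters, 2)
print([mc.id for mc in top])  # [1, 0]
```

With a real clusterer, the list passed in would presumably come from get_microclusters().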