CLUSTER_VARIANTS
Applies to: CELONIS 4.5, CELONIS 4.6, CELONIS 4.7
Description
The CLUSTER_VARIANTS operator groups similar process variants (or traces) into clusters. Variants which cannot be assigned to a cluster are marked as noise.
Warning
Operator Performance
This operator performs very expensive computations and requires substantial memory and CPU resources. To avoid running out of memory, it is currently limited to 100,000 distinct variants.
Computation Times
While this operator is executing, users may experience long computation times and unresponsive analyses, because it occupies a large share of the available computation capacity.
Syntax
CLUSTER_VARIANTS ( variant_column, MIN_PTS, EPSILON )
- variant_column: The column which stores the result of the VARIANT operator. 
- MIN_PTS: INT value giving the minimum density of similar variants that is required to create a cluster. Lower values tend to create more clusters, while higher values tend to classify more variants as noise. An estimate of a well-performing parameter value can be computed by the ESTIMATE_CLUSTER_PARAMS operator.
- EPSILON: INT value giving the search radius for measuring the variant density. It is quantified by the number of differing relations between two subsequent activities in the variants. The value must be an integer in the range [0, 5]. The higher the value, the more likely it is that all variants are assigned to the same cluster. A low value (e.g. 2) is recommended. Note that a value of 0 means that only identical variants are clustered together.
Result: An INT column in which each number is the ID of the cluster to which the case (and therefore its variant) has been assigned. A cluster ID of -1 indicates that the case has been classified as noise, and a cluster ID of -2 indicates that the case does not have any activities (i.e. the case is empty).
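As an illustration of the syntax above, a minimal call might look like the following sketch. The table name "Activities" and the column name "ACTIVITY" are hypothetical placeholders for the activity column of your own data model. The parameter values 2 and 2 are chosen only for illustration and are not a recommendation for any particular dataset.

```
-- "Activities"."ACTIVITY" is a hypothetical activity column; replace it with your own.
-- MIN_PTS = 2 and EPSILON = 2 are illustrative values only.
CLUSTER_VARIANTS ( VARIANT ( "Activities"."ACTIVITY" ), 2, 2 )
```

The inner VARIANT call supplies the variant_column argument, so the result contains one cluster ID per case.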
NULL handling
Since the computation depends on the VARIANT operator, this operator only needs to handle NULL values if VARIANT also returns NULL (i.e. a case has only NULL-value activities). In that case, the cluster ID is -2.
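To make the special cluster IDs easier to read in an analysis, they could be mapped to labels with a CASE WHEN expression. The following is only a sketch: the table and column names are hypothetical, the label texts are arbitrary, and the parameter values are illustrative.

```
-- "Activities"."ACTIVITY" is a hypothetical activity column; the labels are arbitrary.
CASE
  WHEN CLUSTER_VARIANTS ( VARIANT ( "Activities"."ACTIVITY" ), 2, 2 ) = -1 THEN 'Noise'
  WHEN CLUSTER_VARIANTS ( VARIANT ( "Activities"."ACTIVITY" ), 2, 2 ) = -2 THEN 'Empty case'
  ELSE 'Clustered'
END
```

Given the performance warning on this page, repeating the operator call inside one expression may be costly, so this pattern is best tried on a small data set first.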
Example
| [1] Variant 1 (trace  
 |