cblearn.cluster.ComparisonHC

class cblearn.cluster.ComparisonHC(n_clusters)

ComparisonHC.

ComparisonHC [1] is a hierarchical clustering algorithm that computes clusters from triplet data without computing an intermediate embedding. It does so with an adapted linkage algorithm that uses only the triplet information.

As this algorithm produces its clusterings via a dendrogram that is created on the whole dataset, we do not provide a fit method. Call fit_predict directly with the complete dataset you want to cluster.

Keep in mind that this algorithm was developed and optimized for hierarchical clustering, and was only adapted to produce a flat clustering with the desired number of clusters. For flat clustering, it may therefore perform worse than other approaches.

dendrogram_

numpy array, shape (n_clusters-1, 4) – An array corresponding to the learned dendrogram. After iteration i, dendrogram_[i, 0] and dendrogram_[i, 1] are the indices of the merged clusters, and dendrogram_[i, 2] is the size of the new cluster. The attribute is None until the fit method is called. The last column is always 0 (implemented like this by the original algorithm).
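To illustrate the column layout described above, the following minimal sketch replays a hand-written merge table until a requested number of groups remains. It does not call cblearn. The helper name flat_labels_from_merges and the SciPy-style numbering of newly created clusters (leaf count plus iteration index) are assumptions made for illustration, not documented behavior of ComparisonHC.

```python
import numpy as np


def flat_labels_from_merges(merges, n_leaves, n_clusters):
    """Replay merges until n_clusters groups remain; return one label per leaf."""
    # Assumption: leaves are numbered 0..n_leaves-1 and the cluster created
    # in iteration i receives index n_leaves + i (SciPy-style numbering).
    members = {i: [i] for i in range(n_leaves)}
    for i, (a, b, size, _) in enumerate(merges):
        if len(members) <= n_clusters:
            break
        merged = members.pop(int(a)) + members.pop(int(b))
        # Column 2 holds the size of the newly formed cluster.
        assert len(merged) == int(size)
        members[n_leaves + i] = merged
    labels = np.empty(n_leaves, dtype=int)
    for label, leaves in enumerate(members.values()):
        labels[leaves] = label
    return labels


# Hand-written merge table for four leaves; the last column is always 0.
merges = np.array([
    [0, 1, 2, 0],
    [2, 3, 2, 0],
    [4, 5, 4, 0],
])
labels = flat_labels_from_merges(merges, n_leaves=4, n_clusters=2)
print(labels)  # [0 0 1 1]
```

Stopping the replay early is one simple way to turn a merge history into a flat clustering; whether the library cuts the dendrogram in exactly this way is not stated here.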

cluster_

list of list – Initial cluster information used for fitting.

Examples:

>>> from sklearn.datasets import make_blobs
>>> from sklearn.metrics import normalized_mutual_info_score
>>> from cblearn.datasets import make_random_triplets
>>> from cblearn.cluster import ComparisonHC
>>> import numpy as np
>>> means = np.array([[1,0], [-1, 0]])
>>> stds = 0.2 * np.ones(means.shape)
>>> xs, ys = make_blobs(n_samples=[10, 10], centers=means, cluster_std=stds,
...                     n_features=2, random_state=2)
>>> estimator = ComparisonHC(2)
>>> t = make_random_triplets(xs, result_format="list-order", size=5000, random_state=2)
>>> labels = estimator.fit_predict(t)
>>> normalized_mutual_info_score(labels, ys)
1.0

References

__init__(n_clusters)

Initialize the estimator.

Parameters:

n_clusters (int) – Number of clusters desired in the final clustering.

Methods

__init__(n_clusters)

Initialize the estimator.

fit(X[, y, init_clusters])

Computes the dendrogram of a list of clusters.

fit_predict(X[, y])

Perform clustering on X and return cluster labels.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

set_fit_request(*[, init_clusters])

Request metadata passed to the fit method.

set_params(**params)

Set the parameters of this estimator.

fit(X, y=None, init_clusters=None)

Computes the dendrogram of a list of clusters.

Parameters:
  • X – Triplets; repeated responses are collapsed by majority vote.

  • y – optional responses

  • init_clusters (list of (list of examples), len(n_clusters)) – An optional list containing the initial clusters, each given as a list of examples.

Returns:

self

Return type:

object

Raises:

ValueError – If the initial partition contains fewer than n_examples examples.
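The majority vote mentioned for X can be pictured with the following standalone sketch. It is not the library's implementation. It assumes that a triplet (i, j, k) means "i is closer to j than to k", so that (i, j, k) and (i, k, j) are opposite answers to the same question, and it drops exact ties. The function name majority_vote is hypothetical.

```python
from collections import Counter


def majority_vote(triplets):
    """Collapse repeated (i, j, k) answers into one triplet per question."""
    # Assumption: (i, j, k) means "i is closer to j than to k", so
    # (i, j, k) and (i, k, j) are opposite answers to the same question.
    votes = Counter()
    for i, j, k in triplets:
        if j < k:
            votes[(i, j, k)] += 1
        else:
            votes[(i, k, j)] -= 1
    resolved = []
    for (i, j, k), score in sorted(votes.items()):
        if score > 0:
            resolved.append((i, j, k))
        elif score < 0:
            resolved.append((i, k, j))
        # A score of 0 is a tie; this sketch drops such questions.
    return resolved


raw = [(0, 1, 2), (0, 1, 2), (0, 2, 1), (3, 5, 4), (6, 7, 8), (6, 8, 7)]
clean = majority_vote(raw)
print(clean)  # [(0, 1, 2), (3, 5, 4)]
```

Here the question about item 0 is answered twice one way and once the other, so the majority answer survives, while the evenly split question about item 6 is discarded.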

fit_predict(X, y=None, **kwargs)

Perform clustering on X and return cluster labels.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input data.

  • y (Ignored) – Not used, present for API consistency by convention.

  • **kwargs (dict) –

    Arguments to be passed to fit.

    Added in version 1.4.

Returns:

labels – Cluster labels.

Return type:

ndarray of shape (n_samples,), dtype=np.int64

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

set_fit_request(*, init_clusters='$UNCHANGED$')

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

init_clusters (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for init_clusters parameter in fit.

Returns:

self – The updated object.

Return type:

object

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance
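The <component>__<parameter> form can be seen in a short sketch with standard scikit-learn classes. The step names "scale" and "cluster" are arbitrary choices for this example, and KMeans merely stands in for any estimator with an n_clusters parameter.

```python
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("cluster", KMeans(n_clusters=2)),
])

# Update a parameter of the nested "cluster" step through the pipeline.
pipe.set_params(cluster__n_clusters=3)
n_clusters = pipe.get_params()["cluster__n_clusters"]
print(n_clusters)  # 3
```

On a simple estimator the same call takes the plain parameter name, for example set_params(n_clusters=3).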