Getting Started#

Setup#

cblearn requires Python 3.9 or newer. The package is mainly tested on Linux, but Windows and macOS should work too.
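Before installing, it can help to confirm that the active interpreter meets the version requirement stated above. The following is a minimal sketch; the helper name python_is_supported is made up for illustration and is not part of cblearn.

```python
import sys

# Minimum version taken from the requirement stated above.
MIN_PYTHON = (3, 9)


def python_is_supported(version_info=sys.version_info):
    """Hypothetical helper: True if the major.minor version is at least 3.9."""
    return tuple(version_info[:2]) >= MIN_PYTHON


current = ".".join(str(part) for part in sys.version_info[:3])
if python_is_supported():
    print(f"Python {current} is new enough for cblearn")
else:
    print(f"Python {current} is too old; 3.9 or newer is required")
```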

Python environment#

The easiest way to install Python and its dependencies is to use an Anaconda environment or similar, because dependencies do not conflict with other Python packages you may have installed and the Python version can be specified.

conda create -n cblearn python=3.9
conda activate cblearn

Install cblearn#

cblearn can be installed using pip:

pip install cblearn

This will install the minimal set of required packages, sufficient for most uses and saving disk space. However, some features require more packages that can be installed by adding an option to the install command.
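To check that the installation succeeded, the installed version can be queried from Python. This sketch uses only the standard library (importlib.metadata); the helper name installed_version is an illustrative assumption, not a cblearn function.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Hypothetical helper: return the installed version string, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("cblearn")
if found:
    print(f"cblearn {found} is installed")
else:
    print("cblearn is not installed in this environment")
```

Run it inside the activated environment; a different result outside the environment is expected, because each environment has its own set of packages.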

Install Extra Requirements#

Extra requirements enable more advanced features and are installed by adding an option to the install command. Some of these extra dependencies need non-Python packages to be installed first.

pip install "cblearn[torch,wrapper]" h5py

torch

Most estimators provide an optional implementation using PyTorch to process large datasets on CPU or GPU. This requires the pytorch package (https://pytorch.org/get-started/locally/) to be installed, either manually or by adding the torch extras option to the install command. Note that pytorch might need about 1 GB of disk space.

wrapper

The estimators in Wrapper provide a Python interface to the original implementations in R. This requires the rpy2 package, which is installed by adding the wrapper option to the install command. Additionally, an R interpreter must be installed and available in the PATH environment variable. The R packages are installed automatically upon the first use of the estimators.

h5py

The function cblearn.datasets.fetch_imagenet_similarity() requires the h5py package to load the dataset. This package can be installed with pip. Note that some platforms additionally require the HDF5 libraries to be installed manually.
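To see which of the optional dependencies described above are present, a short check like the following may be useful. It is a sketch using only the standard library: it looks up the modules torch, rpy2, and h5py without importing them and searches the PATH for an executable named R. The function name and the dictionary layout are illustrative assumptions.

```python
import importlib.util
import shutil

# Extra name -> module that the extra is expected to provide (per the text above).
OPTIONAL_MODULES = {"torch": "torch", "wrapper": "rpy2", "h5py": "h5py"}


def check_optional_dependencies():
    """Hypothetical helper: report which optional components can be found."""
    status = {
        extra: importlib.util.find_spec(module) is not None
        for extra, module in OPTIONAL_MODULES.items()
    }
    # The wrapper estimators also need an R interpreter on the PATH.
    status["R interpreter"] = shutil.which("R") is not None
    return status


for name, available in check_optional_dependencies().items():
    print(f"{name:>14}: {'found' if available else 'missing'}")
```

A missing entry only matters if you plan to use the corresponding feature; the minimal installation works without any of them.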

Quick Start#

cblearn is designed to be easy to use. The following example generates triplets from a point cloud, each specifying whether point A is closer to point B or C, and fits an ordinal embedding model to the triplets. This ordinal embedding model is then used to predict the relative distances between the points.

import numpy as np
from cblearn.datasets import make_random_triplets
from cblearn.embedding import SOE
from cblearn.metrics import procrustes_distance

# Ground truth: 20 random points in the 2D unit square.
points = np.random.rand(20, 2)
estimator = SOE(n_components=2)

print(f"Triplets | Error (SSE)\n{22 * '-'}")
for n in (25, 100, 400, 1600):
    # Sample n triplets from the ground-truth points.
    triplets = make_random_triplets(points, size=n, result_format="list-order")
    # Fit a 2D embedding from the triplets alone.
    embedding = estimator.fit_transform(triplets)
    # Compare the embedding with the ground truth after alignment.
    error = procrustes_distance(points, embedding)
    print(f"    {len(triplets):4d} |       {error:.3f}")

The output should show a trend similar to the following:

Triplets | Error (SSE)
----------------------
      25 |       0.913
     100 |       0.278
     400 |       0.053
    1600 |       0.001

The Procrustes distance measures the sum of squared errors between points and embedding after aligning the embedding to the points (i.e., by optimizing rotation, translation, and scaling). The error approaches zero, demonstrating that the relative distances in the point cloud can be reconstructed from triplets alone, once enough of them are available.
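The alignment behind such an error measure can be sketched with NumPy alone. The function below is an illustration, not cblearn's procrustes_distance: it centers both point sets, scales them to unit norm, and finds the best orthogonal transform via a singular value decomposition. The name procrustes_sse, the unit-norm scaling, and the fact that reflections are allowed are assumptions of this sketch, so its values need not match the table above.

```python
import numpy as np


def procrustes_sse(reference, embedding):
    """Illustrative sum of squared errors after optimal alignment."""
    # Remove translation by centering both point sets.
    ref = reference - reference.mean(axis=0)
    emb = embedding - embedding.mean(axis=0)
    # Remove overall size by scaling both to unit Frobenius norm.
    ref = ref / np.linalg.norm(ref)
    emb = emb / np.linalg.norm(emb)
    # Best orthogonal transform (rotation or reflection) from the SVD.
    u, singular_values, vt = np.linalg.svd(emb.T @ ref)
    transform = u @ vt
    # Best scale for the unit-norm embedding is the sum of singular values.
    scale = singular_values.sum()
    aligned = scale * emb @ transform
    return float(np.sum((ref - aligned) ** 2))


rng = np.random.default_rng(0)
points = rng.random((20, 2))
angle = 0.7
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle), np.cos(angle)]])
# A rotated, scaled, and shifted copy of the points should align almost perfectly.
moved = 3.0 * points @ rotation + np.array([5.0, -2.0])
print(f"error for a transformed copy: {procrustes_sse(points, moved):.2e}")
print(f"error for unrelated points:   {procrustes_sse(points, rng.random((20, 2))):.3f}")
```

Because a rotated, shifted, or rescaled embedding preserves all relative distances, an alignment step of this kind is needed before comparing an ordinal embedding with the original points.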

The triplet generator’s result_format option specifies the data format of the generated triplets, as triplets can be represented in different ways. This example uses the list-order format: a list of triplets, each containing the indices of an anchor, a near, and a far point. Learn more about data formats and other aspects of the library in the User Guide. Alternatively, you can find more code in the Examples or get an overview in the API reference.
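To make the list-order format concrete, the following NumPy sketch builds a few triplets by hand. It is a stand-in for illustration, not the implementation of make_random_triplets; the helper name sample_list_order_triplets is hypothetical.

```python
import numpy as np


def sample_list_order_triplets(points, size, rng):
    """Hypothetical helper: sample (anchor, near, far) index triplets."""
    triplets = np.empty((size, 3), dtype=int)
    for row in range(size):
        # Draw three distinct point indices; the first one acts as the anchor.
        anchor, first, second = rng.choice(len(points), size=3, replace=False)
        d_first = np.linalg.norm(points[anchor] - points[first])
        d_second = np.linalg.norm(points[anchor] - points[second])
        # Order the candidates so that the nearer one follows the anchor.
        if d_first <= d_second:
            triplets[row] = (anchor, first, second)
        else:
            triplets[row] = (anchor, second, first)
    return triplets


rng = np.random.default_rng(0)
points = rng.random((20, 2))
triplets = sample_list_order_triplets(points, size=5, rng=rng)
print(triplets)  # each row holds the indices (anchor, near, far)
```

Each row stores only indices, not coordinates or distances, which is why an embedding fitted from such data can recover the point cloud only up to rotation, translation, and scaling.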