Getting Started
Setup
cblearn requires Python 3.9 or newer.
The package is mainly tested on Linux, but Windows and Mac OS should work, too.
Python environment
The easiest way to install Python and its dependencies is to use an Anaconda environment or similar. Dependencies then do not conflict with other Python packages you may have installed, and the Python version can be specified.
conda create -n cblearn python==3.9
conda activate cblearn
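Inside the activated environment, a quick check can confirm that the interpreter meets the Python 3.9 requirement stated above. This is only a minimal sketch: the helper name `meets_minimum` is made up for illustration and is not part of cblearn.

```python
import sys

MINIMUM = (3, 9)  # minimum Python version stated in this guide


def meets_minimum(version_info, minimum=MINIMUM):
    """Return True if the (major, minor, ...) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


# Report the version of the interpreter that runs this script.
current = ".".join(map(str, sys.version_info[:3]))
status = "OK" if meets_minimum(sys.version_info) else "too old for cblearn"
print(f"Python {current}: {status}")
```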
Install cblearn
cblearn can be installed using pip:
pip install cblearn
This will install the minimal set of required packages, sufficient for most uses and saving disk space.
However, some features require more packages that can be installed by adding an option to the install command.
Install Extra Requirements
Extra requirements enable more advanced features and are installed by adding an option to the install command.
Some of those extra dependencies need non-Python packages to be installed first.
pip install "cblearn[torch,wrapper]" h5py
- torch
  Most estimators provide an (optional) implementation using pytorch to run large datasets on CPU and GPU. This requires the pytorch package (https://pytorch.org/get-started/locally/) to be installed manually or by adding the torch extras option to the install command. Note that pytorch might need about 1 GB of disk space.
- wrapper
  The estimators in Wrapper provide a Python interface to the original implementation in R-lang. This requires the rpy2 package, which is installed by adding the wrapper option to the install command. Additionally, this requires an installed R interpreter, which must be available in the PATH environment variable. The R packages are installed automatically upon the first use of the estimators.
- h5py
  The function cblearn.datasets.fetch_imagenet_similarity() requires the h5py package to load the dataset. This package can be installed with pip. Note that some platforms additionally require the hdf5 libraries to be installed manually.
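To see which of these optional dependencies are already present in the current environment, a short check along the following lines can help. It is a sketch that only uses the Python standard library. The function name `check_optional_dependencies` is hypothetical, and the module names `torch`, `rpy2`, and `h5py` are assumed to be the import names of the packages listed above.

```python
import importlib.util
import shutil

# Import names of the optional packages listed above (assumed to match
# the usual module names: torch, rpy2, h5py).
OPTIONAL_MODULES = ("torch", "rpy2", "h5py")


def check_optional_dependencies():
    """Map each optional module (and the R interpreter) to True or False."""
    found = {
        name: importlib.util.find_spec(name) is not None
        for name in OPTIONAL_MODULES
    }
    # The wrapper estimators additionally need an R interpreter on PATH.
    found["R (on PATH)"] = shutil.which("R") is not None
    return found


for name, available in check_optional_dependencies().items():
    print(f"{name:12s} {'found' if available else 'missing'}")
```

The check only looks the modules up and does not import them, so it stays fast even when pytorch is installed.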
Quick Start
cblearn is designed to be easy to use. The following example generates triplets from a point cloud, each specifying if point A is closer to point B or C, and fits an ordinal embedding model to the triplets. This ordinal embedding model is then used to predict the relative distances between the points.
import numpy as np
from cblearn.datasets import make_random_triplets
from cblearn.embedding import SOE
from cblearn.metrics import procrustes_distance

# Ground-truth point cloud: 20 random points in 2D.
points = np.random.rand(20, 2)
estimator = SOE(n_components=2)

print(f"Triplets | Error (SSE)\n{22 * '-'}")
for n in (25, 100, 400, 1600):
    # Sample n triplets from the points and fit an embedding to them.
    triplets = make_random_triplets(points, size=n, result_format="list-order")
    embedding = estimator.fit_transform(triplets)
    # Compare the embedding with the original points.
    error = procrustes_distance(points, embedding)
    print(f" {len(triplets):4d} | {error:.3f}")
The output should show a trend similar to the following:
Triplets | Error (SSE)
----------------------
   25 | 0.913
  100 | 0.278
  400 | 0.053
 1600 | 0.001
The Procrustes distance measures the sum of squared errors between points and embedding after aligning the embedding to the points (i.e., by optimizing rotation, translation, and scaling). The error approaches zero as the number of triplets grows, demonstrating that the relative distances in the point cloud can be reconstructed from triplets alone, once enough of them are available.
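To make the alignment step concrete, here is a minimal NumPy sketch of such a Procrustes error. It is not cblearn's implementation: the function name `procrustes_sse` is hypothetical, and normalizing both point sets to unit Frobenius norm is an assumption of this sketch, so absolute values may differ from those reported by `procrustes_distance`. The orthogonal alignment used here also permits reflections.

```python
import numpy as np


def procrustes_sse(reference, candidate):
    """Sum of squared errors after aligning `candidate` to `reference`."""
    # Translation: move both centroids to the origin.
    ref = reference - reference.mean(axis=0)
    cand = candidate - candidate.mean(axis=0)
    # Scaling: normalize both to unit Frobenius norm (assumption of this sketch).
    ref = ref / np.linalg.norm(ref)
    cand = cand / np.linalg.norm(cand)
    # Rotation/reflection: optimal orthogonal matrix from the SVD,
    # with the optimal remaining scale given by the sum of singular values.
    u, singular_values, vt = np.linalg.svd(cand.T @ ref)
    aligned = singular_values.sum() * cand @ (u @ vt)
    return float(np.sum((ref - aligned) ** 2))


rng = np.random.default_rng(0)
points = rng.random((20, 2))

# A rotated, scaled, and shifted copy should align almost perfectly.
angle = 0.7
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle), np.cos(angle)]])
transformed = 3.0 * points @ rotation + np.array([5.0, -2.0])
print(f"transformed copy: {procrustes_sse(points, transformed):.6f}")

# Unrelated random points should leave a clearly larger error.
unrelated = rng.random((20, 2))
print(f"unrelated points: {procrustes_sse(points, unrelated):.6f}")
```

Because rotation, translation, and scaling are factored out, an embedding is only penalized for distorting the shape of the point cloud, which is all that triplets can constrain.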
The triplet generator’s result_format option specifies the expected data format of the triplets, as triplets can be represented in different ways. This example uses the list-order format, a list of triplets, containing the indices of an anchor, near, and far point. Learn more about data formats and other aspects of the library in the User Guide. Alternatively, you can find more code in the Examples or get an overview of the API reference.
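As an illustration of the list-order format described above, the following standard-library sketch builds a few triplets by hand. It does not use cblearn: `sample_triplets` is a hypothetical helper written for this example, and cblearn's own generator may sample and store triplets differently.

```python
import math
import random


def sample_triplets(points, size, seed=0):
    """Return `size` triplets (anchor, near, far) of point indices."""
    rng = random.Random(seed)
    triplets = []
    for _ in range(size):
        # Pick three distinct points; the first one serves as the anchor.
        anchor, a, b = rng.sample(range(len(points)), 3)
        dist_a = math.dist(points[anchor], points[a])
        dist_b = math.dist(points[anchor], points[b])
        # Order the other two so that the closer point comes first.
        near, far = (a, b) if dist_a <= dist_b else (b, a)
        triplets.append((anchor, near, far))
    return triplets


rng = random.Random(42)
points = [(rng.random(), rng.random()) for _ in range(20)]
triplets = sample_triplets(points, size=5)
for anchor, near, far in triplets:
    print(f"anchor={anchor:2d}  near={near:2d}  far={far:2d}")
```

Each row reads as "the anchor is closer to the near point than to the far point"; the order of the indices carries the answer, so no separate response column is needed.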