Module dbMAP

Sub-modules

Approximate Nearest Neighbors {#dbmap.ann}

Class NMSlibTransformer {#dbmap.ann.NMSlibTransformer}

class NMSlibTransformer(
    n_neighbors=30,
    metric='cosine',
    method='hnsw',
    n_jobs=10,
    p=None,
    M=30,
    efC=100,
    efS=100,
    dense=False,
    verbose=False
)

Wrapper for using nmslib as sklearn's KNeighborsTransformer. This implements a scalable approximate k-nearest-neighbors graph on spaces defined by nmslib. Read more about nmslib and its various available metrics at https://github.com/nmslib/nmslib. Calling 'nn = NMSlibTransformer()' initializes the class with its neighbor search parameters.

Parameters

n_neighbors : int (optional, default 30) : number of nearest neighbors to search for. In practice, this should be considered the average neighborhood size, and it thus varies with the number of features and samples and with the intrinsic dimensionality of the data. Reasonable values range from 5 to 100. Smaller values tend to increase graph structure resolution, but values that are too low may produce granulated, vaguely defined neighborhoods as a downsampling artifact. Larger values slightly increase computational time.

metric : str (optional, default 'cosine') : accepted NMSLIB metrics. Accepted metrics include:
- 'sqeuclidean'
- 'euclidean'
- 'l1'
- 'lp' (requires setting the parameter p)
- 'cosine'
- 'angular'
- 'negdotprod'
- 'levenshtein'
- 'hamming'
- 'jaccard'
- 'jansen-shan'

method : str (optional, default 'hnsw') : approximate nearest-neighbor search method. Available methods include:
- 'hnsw' : a Hierarchical Navigable Small World graph.
- 'sw-graph' : a Small World graph.
- 'vp-tree' : a Vantage-Point tree with a pruning rule adaptable to non-metric distances.
- 'napp' : a Neighborhood APProximation index.
- 'simple_invindx' : a vanilla, uncompressed, inverted index, which has no parameters.
- 'brute_force' : a brute-force search, which has no parameters.
'hnsw' is usually the fastest method, followed by 'sw-graph' and 'vp-tree'.

n_jobs : int (optional, default 10) : number of threads to be used in computation. The algorithm is highly scalable to multi-threading.

M : int (optional, default 30) : defines the maximum number of neighbors in the zero and above-zero layers during HNSW (Hierarchical Navigable Small World) graph construction. The actual default maximum number of neighbors for the zero layer is 2*M. A reasonable range for this parameter is 5-100. For more information on HNSW, please check https://arxiv.org/abs/1603.09320. HNSW is implemented in Python via NMSlib; read more about NMSlib at https://github.com/nmslib/nmslib.

efC : int (optional, default 100) : A 'hnsw' parameter. Increasing this value improves the quality of a constructed graph and leads to higher accuracy of search. However this also leads to longer indexing times. A reasonable range for this parameter is 50-2000.

efS : int (optional, default 100) : A 'hnsw' parameter. Similarly to efC, increasing this value improves recall at the expense of longer retrieval time. A reasonable range for this parameter is 100-2000.

dense : bool (optional, default False) : Whether to force the algorithm to use dense data, such as np.ndarrays and pandas DataFrames.

Returns

Class for fast approximate nearest-neighbors search.

Example

import numpy as np
from sklearn.datasets import load_digits
from scipy.sparse import csr_matrix
from dbmap.ann import NMSlibTransformer
#
### Load the MNIST digits data, convert to sparse for speed
digits = load_digits()
data = csr_matrix(digits.data)
#
### Start class with parameters
nn = NMSlibTransformer()
nn = nn.fit(data)
#
### Obtain kNN graph
knn = nn.transform(data)
#
### Obtain kNN indices, distances and distance gradient
ind, dist, grad = nn.ind_dist_grad(data)
#
### Test for recall efficiency during approximate nearest neighbors search
test = nn.test_efficiency(data)


Methods

Method fit {#dbmap.ann.NMSlibTransformer.fit}
def fit(
    self,
    data
)
Method ind_dist_grad {#dbmap.ann.NMSlibTransformer.ind_dist_grad}
def ind_dist_grad(
    self,
    data,
    return_grad=True,
    return_graph=True
)
Method test_efficiency {#dbmap.ann.NMSlibTransformer.test_efficiency}
def test_efficiency(
    self,
    data,
    data_use=0.1
)

Tests whether NMSlibTransformer and sklearn's KNeighborsTransformer give the same results.

Method transform {#dbmap.ann.NMSlibTransformer.transform}
def transform(
    self,
    data
)
Method update_search {#dbmap.ann.NMSlibTransformer.update_search}
def update_search(
    self,
    n_neighbors
)

Updates number of neighbors for kNN distance computation.

Parameters

n_neighbors: New number of neighbors to look for.
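A minimal sketch of reusing a fitted index with a different neighborhood size. The data matrix and parameter values here are hypothetical; the calls follow the signatures documented above.

import numpy as np
from scipy.sparse import csr_matrix
from dbmap.ann import NMSlibTransformer
#
### Hypothetical sparse data matrix (n_samples x n_features)
data = csr_matrix(np.random.rand(1000, 50))
#
### Fit once, query with 30 neighbors
nn = NMSlibTransformer(n_neighbors=30).fit(data)
knn_30 = nn.transform(data)
#
### Switch the search to 50 neighbors and reuse the same index
nn.update_search(50)
knn_50 = nn.transform(data)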

Diffusion harmonics {#dbmap.diffusion}

Class Diffusor {#dbmap.diffusion.Diffusor}

class Diffusor(
    n_components=50,
    n_neighbors=10,
    alpha=1,
    n_jobs=10,
    ann=True,
    ann_dist='cosine',
    p=None,
    M=30,
    efC=100,
    efS=100,
    knn_dist='cosine',
    kernel_use='decay',
    transitions=True,
    eigengap=True,
    norm=False,
    verbose=True
)

Sklearn estimator for using fast anisotropic diffusion with a multiscaling adaptive algorithm as proposed by Setty et al, 2018, and optimized by Sidarta-Oliveira, 2020.

Parameters

n_components : Number of diffusion components to compute. Defaults to 50. We suggest larger values when analyzing more than 10,000 cells.

n_neighbors : Number of k-nearest neighbors to compute. The adaptive kernel will normalize distances by each cell's distance to its median neighbor.

knn_dist : Distance metric for building the kNN graph. Defaults to 'cosine'. Users are encouraged to explore different metrics, such as 'euclidean' and 'jaccard'. The 'hamming' and 'jaccard' distances are also available for string vectors.

ann : Boolean. Whether to use approximate nearest neighbors for graph construction. Defaults to True.

alpha : Alpha in the diffusion maps literature. Controls how much the results are biased by data distribution. Defaults to 1, which is suitable for normalized data.

n_jobs : Number of threads to use in calculations. Defaults to 10.

verbose : controls verbosity.

Returns

Diffusion components ['EigenVectors'], associated eigenvalues ['EigenValues'] and suggested number of
resulting components to use during Multiscaling.

Example


import numpy as np
from sklearn.datasets import load_digits
from scipy.sparse import csr_matrix
import dbmap

### Load the MNIST digits data, convert to sparse for speed
digits = load_digits()
data = csr_matrix(digits.data)

### Fit the anisotropic diffusion process
diff = dbmap.diffusion.Diffusor()
res = diff.fit_transform(data)


Methods

Method fit {#dbmap.diffusion.Diffusor.fit}
def fit(
    self,
    data
)

Fits an adaptive anisotropic kernel to the data.

:param data: input data. Takes in numpy arrays and scipy csr sparse matrices. Use with sparse data for top performance. You can adjust a series of parameters to make the process faster and more informative depending on your dataset. Read more at https://github.com/davisidarta/dbmap

Method ind_dist_grad {#dbmap.diffusion.Diffusor.ind_dist_grad}
def ind_dist_grad(
    self,
    data,
    n_components=None,
    dense=False
)

Effectively computes the transform on data and also returns the normalized diffusion distances, indices and the gradient obtained by approximating the Laplace-Beltrami operator.

:param data: input data. Takes in numpy arrays and scipy csr sparse matrices. Please use with sparse data for top performance. You can adjust a series of parameters to make the process faster and more informative depending on your dataset. Read more at https://github.com/davisidarta/dbmap
:param plot_knee: whether to plot the scree plot of diffusion eigenvalues.

Method return_dict {#dbmap.diffusion.Diffusor.return_dict}
def return_dict(
    self
)

:return: Dictionary containing the normalized and multiscaled diffusion components ('StructureComponents'), their eigenvalues ('EigenValues'), the non-normalized components ('EigenVectors') and the kernel used to transform distances into affinities ('kernel').
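A minimal sketch of retrieving this dictionary after fitting and transforming the data. The keys are the ones documented above; that transform is called before return_dict is an assumption of this sketch.

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.datasets import load_digits
import dbmap

### Load the MNIST digits data, convert to sparse for speed
digits = load_digits()
data = csr_matrix(digits.data)

### Fit the diffusion process and compute the components
diff = dbmap.diffusion.Diffusor()
diff.fit(data)
diff.transform(data)

### Retrieve the results dictionary
res = diff.return_dict()
sc = res['StructureComponents']    # multiscaled diffusion components
ev = res['EigenValues']            # associated eigenvalues
kernel = res['kernel']             # affinity kernel used for the transform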

Method transform {#dbmap.diffusion.Diffusor.transform}
def transform(
    self,
    data
)

Graph utilities {#dbmap.graph_utils}

Function approximate_n_neighbors {#dbmap.graph_utils.approximate_n_neighbors}

def approximate_n_neighbors(
    data,
    n_neighbors=30,
    metric='cosine',
    method='hnsw',
    n_jobs=10,
    efC=100,
    efS=100,
    M=30,
    dense=False,
    verbose=False
)

Simple function using NMSlibTransformer from dbmap.ann. This implements a very fast and scalable approximate k-nearest-neighbors graph on spaces defined by nmslib. Read more about nmslib and its various available metrics at https://github.com/nmslib/nmslib. Read more about dbMAP at https://github.com/davisidarta/dbMAP.

Parameters

n_neighbors : number of nearest neighbors to search for. In practice, this should be considered the average neighborhood size, and it thus varies with the number of features and samples and with the intrinsic dimensionality of the data. Reasonable values range from 5 to 100. Smaller values tend to increase graph structure resolution, but values that are too low may produce granulated, vaguely defined neighborhoods as a downsampling artifact. Defaults to 30. Larger values slightly increase computational time.

metric : accepted NMSLIB metrics. Defaults to 'cosine'. Accepted metrics include:
- 'sqeuclidean'
- 'euclidean'
- 'l1'
- 'cosine'
- 'angular'
- 'negdotprod'
- 'levenshtein'
- 'hamming'
- 'jaccard'
- 'jansen-shan'

method : approximate nearest-neighbor search method. Defaults to 'hnsw' (usually the fastest).

n_jobs: number of threads to be used in computation. Defaults to 10 (~5 cores).

efC : increasing this value improves the quality of the constructed graph and leads to higher search accuracy. However, this also leads to longer indexing times. A reasonable range is 100-2000. Defaults to 100.

efS : similarly to efC, increasing this value improves recall at the expense of longer retrieval time. A reasonable range is 100-2000. Defaults to 100.

M : defines the maximum number of neighbors in the zero and above-zero layers during HNSW (Hierarchical Navigable Small World) graph construction. The actual default maximum number of neighbors for the zero layer is 2*M. For more information on HNSW, please check https://arxiv.org/abs/1603.09320. HNSW is implemented in Python via NMSLIB; read more about NMSLIB at https://github.com/nmslib/nmslib.

:returns: k-nearest-neighbors indices and distances. Can be customized to also return the k-nearest-neighbors graph and its gradient.

Example
knn_indices, knn_dists = approximate_n_neighbors(data)
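A slightly fuller sketch, using a hypothetical dense data matrix and the parameter names documented above:

import numpy as np
from dbmap.graph_utils import approximate_n_neighbors

### Hypothetical data matrix (n_samples x n_features)
data = np.random.rand(1000, 50)

knn_indices, knn_dists = approximate_n_neighbors(
    data, n_neighbors=30, metric='cosine', method='hnsw', n_jobs=4
)
### knn_indices and knn_dists have shape (n_samples, n_neighbors)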

Function compute_connectivities_adapmap {#dbmap.graph_utils.compute_connectivities_adapmap}

def compute_connectivities_adapmap(
    data,
    n_components=100,
    n_neighbors=30,
    alpha=0.0,
    n_jobs=10,
    ann=True,
    ann_dist='cosine',
    M=30,
    efC=100,
    efS=100,
    knn_dist='euclidean',
    kernel_use='sidarta',
    sensitivity=1,
    set_op_mix_ratio=1.0,
    local_connectivity=1.0
)

Sklearn estimator for using fast anisotropic diffusion with a multiscaling adaptive algorithm as proposed by Setty et al, 2018, and optimized by Sidarta-Oliveira, 2020. This procedure generates diffusion components that effectively carry the maximum amount of information regarding the data's geometric structure (structure components). These structure components then undergo a fuzzy union of simplicial sets. This step is from umap.fuzzy_simplicial_set [McInnes18]. Given a set of data X, a neighborhood size, and a measure of distance, compute the fuzzy simplicial set (here represented as a fuzzy graph in the form of a sparse matrix) associated to the data. This is done by locally approximating geodesic distance at each point, creating a fuzzy simplicial set for each such point, and then combining all the local fuzzy simplicial sets into a global one via a fuzzy union.

Parameters

n_components : Number of diffusion components to compute. Defaults to 100. We suggest larger values when analyzing more than 10,000 cells.

n_neighbors : Number of k-nearest neighbors to compute. The adaptive kernel will normalize distances by each cell's distance to its median neighbor.

knn_dist : Distance metric for building the kNN graph. Defaults to 'euclidean'.

ann : Boolean. Whether to use approximate nearest neighbors for graph construction. Defaults to True.

alpha : Alpha in the diffusion maps literature. Controls how much the results are biased by the data distribution. A value of 1 is suitable for normalized data; defaults to 0.0 here.

n_jobs : Number of threads to use in calculations. Defaults to 10.

sensitivity : Sensitivity to select eigenvectors if diff_normalization is set to 'knee'. Useful when dealing with …

:returns: Diffusion components ['EigenVectors'], associated eigenvalues ['EigenValues'] and suggested number of resulting components to use during Multiscaling.

Example
import numpy as np
from sklearn.datasets import load_digits
from scipy.sparse import csr_matrix
import dbmap

##### Load the MNIST digits data, convert to sparse for speed
digits = load_digits()
data = csr_matrix(digits.data)

##### Fit the anisotropic diffusion process
diff = dbmap.diffusion.Diffusor()
res = diff.fit_transform(data)
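The bundled example above runs the Diffusor directly; a minimal sketch of calling compute_connectivities_adapmap itself on the same sparse matrix is shown below. The exact return structure is assumed to follow the Returns section above, so the result is left unpacked.

from scipy.sparse import csr_matrix
from sklearn.datasets import load_digits
from dbmap.graph_utils import compute_connectivities_adapmap

digits = load_digits()
data = csr_matrix(digits.data)

##### Diffusion-based fuzzy connectivities (return structure assumed)
res = compute_connectivities_adapmap(data, n_components=100, n_neighbors=30)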

Function compute_membership_strengths {#dbmap.graph_utils.compute_membership_strengths}

def compute_membership_strengths(
    knn_indices,
    knn_dists,
    sigmas,
    rhos
)

Construct the membership strength data for the 1-skeleton of each local fuzzy simplicial set -- this is formed as a sparse matrix where each row is a local fuzzy simplicial set, with a membership strength for the 1-simplex to each other data point.

Parameters

knn_indices : array of shape (n_samples, n_neighbors) : The indices of the n_neighbors closest points in the dataset.

knn_dists : array of shape (n_samples, n_neighbors) : The distances to the n_neighbors closest points in the dataset.

sigmas : array of shape (n_samples) : The normalization factor derived from the metric tensor approximation.

rhos : array of shape (n_samples) : The local connectivity adjustment.

Returns

rows : array of shape (n_samples * n_neighbors) : Row data for the resulting sparse matrix (coo format)

cols : array of shape (n_samples * n_neighbors) : Column data for the resulting sparse matrix (coo format)

vals : array of shape (n_samples * n_neighbors) : Entries for the resulting sparse matrix (coo format)

Function fuzzy_simplicial_set_nmslib {#dbmap.graph_utils.fuzzy_simplicial_set_nmslib}

def fuzzy_simplicial_set_nmslib(
    X,
    n_neighbors,
    knn_indices=None,
    knn_dists=None,
    nmslib_metric='cosine',
    nmslib_n_jobs=None,
    nmslib_efC=100,
    nmslib_efS=100,
    nmslib_M=30,
    set_op_mix_ratio=1.0,
    local_connectivity=1.0,
    apply_set_operations=True,
    verbose=False
)

Given a set of data X, a neighborhood size, and a measure of distance compute the fuzzy simplicial set (here represented as a fuzzy graph in the form of a sparse matrix) associated to the data. This is done by locally approximating geodesic distance at each point, creating a fuzzy simplicial set for each such point, and then combining all the local fuzzy simplicial sets into a global one via a fuzzy union.

Parameters

X : array of shape (n_samples, n_features) : The data to be modelled as a fuzzy simplicial set.

n_neighbors : int : The number of neighbors to use to approximate geodesic distance. Larger numbers induce more global estimates of the manifold that can miss finer detail, while smaller values will focus on fine manifold structure to the detriment of the larger picture.

nmslib_metric : str (optional, default 'cosine') : accepted NMSLIB metrics. Accepted metrics include:
- 'sqeuclidean'
- 'euclidean'
- 'l1'
- 'l1_sparse'
- 'cosine'
- 'angular'
- 'negdotprod'
- 'levenshtein'
- 'hamming'
- 'jaccard'
- 'jansen-shan'

nmslib_n_jobs : int (optional, default None) : Number of threads to use for approximate-nearest neighbor search.

nmslib_efC : int (optional, default 100) : increasing this value improves the quality of a constructed graph and leads to higher accuracy of search. However this also leads to longer indexing times. A reasonable range is 100-2000.

nmslib_efS : int (optional, default 100) : similarly to efC, increasing this value improves recall at the expense of longer retrieval time. A reasonable range is 100-2000.

nmslib_M : int (optional, default 30) : defines the maximum number of neighbors in the zero and above-zero layers during HNSW (Hierarchical Navigable Small World) graph construction. The actual default maximum number of neighbors for the zero layer is 2*M. For more information on HNSW, please check https://arxiv.org/abs/1603.09320. HNSW is implemented in Python via NMSLIB; read more about NMSLIB at https://github.com/nmslib/nmslib.

n_epochs : int (optional, default None) : The number of training epochs to be used in optimizing the low dimensional embedding. Larger values result in more accurate embeddings. If None is specified, a value will be selected based on the size of the input dataset (200 for large datasets, 500 for small).

knn_indices : array of shape (n_samples, n_neighbors) (optional) : If the k-nearest neighbors of each point have already been calculated, you can pass them in here to save computation time. This should be an array with the indices of the k-nearest neighbors as a row for each data point.

knn_dists : array of shape (n_samples, n_neighbors) (optional) : If the k-nearest neighbors of each point has already been calculated you can pass them in here to save computation time. This should be an array with the distances of the k-nearest neighbors as a row for each data point.

set_op_mix_ratio : float (optional, default 1.0) : Interpolate between (fuzzy) union and intersection as the set operation used to combine local fuzzy simplicial sets to obtain a global fuzzy simplicial sets. Both fuzzy set operations use the product t-norm. The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.

local_connectivity : int (optional, default 1) : The local connectivity required -- i.e. the number of nearest neighbors that should be assumed to be connected at a local level. The higher this value the more connected the manifold becomes locally. In practice this should be not more than the local intrinsic dimension of the manifold.

verbose : bool (optional, default False) : Whether to report information on the current progress of the algorithm.

Returns

fuzzy_simplicial_set : coo_matrix : A fuzzy simplicial set represented as a sparse matrix. The (i, j) entry of the matrix represents the membership strength of the 1-simplex between the ith and jth sample points.
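A minimal sketch of building the fuzzy graph directly from a hypothetical data matrix, letting the function compute the approximate kNN internally. Whether the coo_matrix is returned alone or together with auxiliary arrays is not specified here, so the result is left unpacked.

import numpy as np
from dbmap.graph_utils import fuzzy_simplicial_set_nmslib

### Hypothetical data matrix
X = np.random.rand(1000, 50)

result = fuzzy_simplicial_set_nmslib(
    X, n_neighbors=30, nmslib_metric='cosine', nmslib_n_jobs=4
)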

Function get_igraph_from_adjacency {#dbmap.graph_utils.get_igraph_from_adjacency}

def get_igraph_from_adjacency(
    adjacency,
    directed=None
)

Get igraph graph from adjacency matrix.

Function get_sparse_matrix_from_indices_distances_dbmap {#dbmap.graph_utils.get_sparse_matrix_from_indices_distances_dbmap}

def get_sparse_matrix_from_indices_distances_dbmap(
    knn_indices,
    knn_dists,
    n_obs,
    n_neighbors
)

Function smooth_knn_dist {#dbmap.graph_utils.smooth_knn_dist}

def smooth_knn_dist(
    distances,
    k,
    n_iter=64,
    local_connectivity=1.0,
    bandwidth=1.0
)

Compute a continuous version of the distance to the kth nearest neighbor. That is, this is similar to knn-distance but allows continuous k values rather than requiring an integral k. In essence we are simply computing the distance such that the cardinality of the fuzzy set we generate is k.

Parameters

distances : array of shape (n_samples, n_neighbors) : Distances to nearest neighbors for each sample. Each row should be a sorted list of distances to a given sample's nearest neighbors.

k : float : The number of nearest neighbors to approximate for.

n_iter : int (optional, default 64) : We need to binary search for the correct distance value. This is the max number of iterations to use in such a search.

local_connectivity : int (optional, default 1) : The local connectivity required -- i.e. the number of nearest neighbors that should be assumed to be connected at a local level. The higher this value the more connected the manifold becomes locally. In practice this should be not more than the local intrinsic dimension of the manifold.

bandwidth : float (optional, default 1) : The target bandwidth of the kernel, larger values will produce larger return values.

Returns

knn_dist : array of shape (n_samples,) : The distance to the kth nearest neighbor, as suitably approximated.

nn_dist : array of shape (n_samples,) : The distance to the 1st nearest neighbor for each point.
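A sketch of how these graph_utils building blocks could be combined by hand, based only on the signatures documented in this section. This is an illustration under those assumptions, not necessarily the exact internal pipeline used by dbMAP.

import numpy as np
from scipy.sparse import coo_matrix
from dbmap.graph_utils import (approximate_n_neighbors,
                               smooth_knn_dist,
                               compute_membership_strengths)

### Hypothetical data matrix
data = np.random.rand(1000, 50)

### 1. Approximate kNN search
knn_indices, knn_dists = approximate_n_neighbors(data, n_neighbors=30)

### 2. Smooth kNN distances into per-sample normalization factors (sigmas)
###    and local connectivity adjustments (rhos)
sigmas, rhos = smooth_knn_dist(knn_dists, k=30.0)

### 3. Membership strengths of the 1-skeleton of each local fuzzy simplicial set
rows, cols, vals = compute_membership_strengths(knn_indices, knn_dists, sigmas, rhos)

### 4. Assemble the sparse fuzzy graph in COO format
n_obs = data.shape[0]
graph = coo_matrix((vals, (rows, cols)), shape=(n_obs, n_obs))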

Graph layout {#dbmap.layout}

Class force_directed_layout {#dbmap.layout.force_directed_layout}

class force_directed_layout(
    layout='fa',
    init_pos=None,
    use_paga=False,
    root=None,
    random_state=0,
    n_jobs=10,
    **kwds
)

Force-directed graph drawing [Islam11] [Jacomy14] [Chippada18]. An alternative to tSNE that often preserves the topology of the data better. This requires running scanpy.pp.neighbors first. The default layout ('fa', ForceAtlas2) [Jacomy14] uses the fa2 package [Chippada18] (https://github.com/bhargavchippada/forceatlas2), which can be installed via pip install fa2. Force-directed graph drawing (https://en.wikipedia.org/wiki/Force-directed_graph_drawing) describes a class of long-established algorithms for visualizing graphs. It has been suggested for visualizing single-cell data by [Islam11]. Many other layouts, as implemented in igraph [Csardi06], are available. Similar approaches have been used by [Zunder15] and [Weinreb17].

Parameters

data : Data matrix. Accepts numpy arrays and csr matrices.

layout : 'fa' (ForceAtlas2) or any valid igraph layout (http://igraph.org/c/doc/igraph-Layout.html). Of particular interest are 'fr' (Fruchterman Reingold), 'grid_fr' (Grid Fruchterman Reingold, faster than 'fr'), 'kk' (Kamada Kawai, slower than 'fr'), 'lgl' (Large Graph Layout, very fast), 'drl' (Distributed Recursive Layout, pretty fast) and 'rt' (Reingold Tilford tree layout).

root : Root for tree layouts.

random_state : For layouts with random initialization like 'fr', change this to use different initial states for the optimization. If None, no seed is set.

proceed : Continue computation, starting off with 'X_draw_graph_layout'.

init_pos : 'paga'/True, None/False, or any valid 2d-.obsm key. Use precomputed coordinates for initialization. If False/None (the default), initialize randomly.

**kwds : Parameters of the chosen igraph layout. See e.g. layout_fruchterman_reingold [Fruchterman91] (http://igraph.org/python/doc/igraph.Graph-class.html#layout_fruchterman_reingold). One of the most important ones is maxiter.

Returns

Depending on copy, returns or updates adata with the following field: X_draw_graph_layout (adata.obsm) : coordinates of the graph layout. E.g. for layout='fa' (the default), the field is called 'X_draw_graph_fa'.


Methods

Method fit {#dbmap.layout.force_directed_layout.fit}
def fit(
    self,
    data
)
Method plot_graph {#dbmap.layout.force_directed_layout.plot_graph}
def plot_graph(
    self,
    node_size=20,
    with_labels=False,
    node_color='blue',
    node_alpha=0.4,
    plot_edges=True,
    edge_color='green',
    edge_alpha=0.05
)
Method transform {#dbmap.layout.force_directed_layout.transform}
def transform(
    self,
    X,
    y=None,
    **fit_params
)
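A minimal usage sketch of this class and the methods above. The data and parameter values are hypothetical, the 'fa' layout assumes the fa2 package is installed, and transform is assumed to return the 2D layout coordinates.

import numpy as np
from dbmap.layout import force_directed_layout

### Hypothetical input, e.g. diffusion components (n_samples x n_components)
data = np.random.rand(500, 20)

fdl = force_directed_layout(layout='fa', n_jobs=4)
fdl.fit(data)
coords = fdl.transform(data)
fdl.plot_graph(node_size=10, edge_alpha=0.05)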

Mapping - UMAP/TriMaps {#dbmap.map}

Class Mapper {#dbmap.map.Mapper}

class Mapper(
    n_components=2,
    n_neighbors=15,
    metric='euclidean',
    output_metric='euclidean',
    n_epochs=None,
    learning_rate=1.5,
    init='spectral',
    min_dist=0.6,
    spread=1.5,
    low_memory=False,
    set_op_mix_ratio=1.0,
    local_connectivity=1.0,
    repulsion_strength=1.0,
    negative_sample_rate=5,
    transform_queue_size=4.0,
    a=None,
    b=None,
    random_state=None,
    angular_rp_forest=False,
    target_n_neighbors=-1,
    target_metric='categorical',
    target_weight=0.5,
    transform_seed=42,
    force_approximation_algorithm=False,
    verbose=False,
    unique=False
)

Layouts the diffusion structure with UMAP to achieve the dbMAP dimensional reduction. This class refers to the lower-dimensional representation of diffusion components obtained through an adaptive diffusion maps algorithm initially proposed by [Setty18]. Alternatively, other diffusion approaches can be used.

To do: adapt to other diffusion maps algorithms.

:param n_components: int (optional, default 2). The dimension of the space to embed into. This defaults to 2 to provide easy visualization, but can reasonably be set to any integer value in the range 2 to K, K being the number of samples or diffusion components to embed.
:param n_neighbors: The size of the local neighborhood (in terms of number of neighboring sample points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general, values should be in the range 2 to 100.
:param n_jobs: Number of threads to use in calculations. Defaults to all but one.
:param min_dist: The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result in a more even dispersal of points. The value should be set relative to the spread value, which determines the scale at which embedded points will be spread out.
:param spread: The effective scale of embedded points. In combination with min_dist this determines how clustered/clumped the embedded points are.
:param learning_rate: The initial learning rate for the embedding optimization.
:return: dbMAP embeddings.
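A minimal end-to-end sketch of dbMAP with this class, assuming Diffusor.fit_transform returns the diffusion components as an array that Mapper accepts as input:

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.datasets import load_digits
import dbmap

### Load the MNIST digits data, convert to sparse for speed
digits = load_digits()
data = csr_matrix(digits.data)

### Diffusion step: adaptive multiscale diffusion components
diff = dbmap.diffusion.Diffusor()
db = diff.fit_transform(data)

### Layout step: embed the diffusion structure with Mapper
emb = dbmap.map.Mapper(min_dist=0.6, spread=1.5).fit_transform(db)
### emb is assumed to have shape (n_samples, 2)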


Methods

Method fit {#dbmap.map.Mapper.fit}
def fit(
    self,
    data,
    y=0
)
Method fit_transform {#dbmap.map.Mapper.fit_transform}
def fit_transform(
    self,
    data,
    y=0
)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters

X : {array-like, sparse matrix, dataframe} of shape (n_samples, n_features) :  

y : ndarray of shape (n_samples,), default=None : Target values.

**fit_params : dict : Additional fit parameters.

Returns

X_new : ndarray array of shape (n_samples, n_features_new) : Transformed array.

Multiscale diffusion {#dbmap.multiscale}

Function multiscale {#dbmap.multiscale.multiscale}

def multiscale(
    res,
    n_eigs=None
)

Determine the multiscale space of the data.

:param n_eigs: Number of eigenvectors to use. If None is specified, the number of eigenvectors will be determined using eigengap identification.
:return: Multiscaled data matrix.
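A minimal sketch, assuming res is the dictionary produced by Diffusor.return_dict() (holding 'EigenVectors' and 'EigenValues'); the exact expected input type is an assumption of this sketch.

from scipy.sparse import csr_matrix
from sklearn.datasets import load_digits
import dbmap
from dbmap.multiscale import multiscale

digits = load_digits()
data = csr_matrix(digits.data)

diff = dbmap.diffusion.Diffusor()
diff.fit_transform(data)
res = diff.return_dict()            # assumed to hold 'EigenVectors' and 'EigenValues'

ms = multiscale(res, n_eigs=None)   # n_eigs=None -> eigengap-based selection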

Plotting utilities {#dbmap.plot}

Function scatter_plot {#dbmap.plot.scatter_plot}

def scatter_plot(
    res,
    title=None,
    fontsize=18,
    labels=None,
    pt_size=None,
    marker='o',
    opacity=1
)
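A minimal sketch with a hypothetical 2D embedding and hypothetical labels:

import numpy as np
from dbmap.plot import scatter_plot

emb = np.random.rand(500, 2)             # hypothetical 2D embedding (e.g. dbMAP output)
labels = np.random.randint(0, 5, 500)    # hypothetical cluster labels

scatter_plot(emb, title='dbMAP embedding', labels=labels, pt_size=5, opacity=0.8)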

Optimized UMAP (AMAP) {#dbmap.umapper}

Class AMAP {#dbmap.umapper.AMAP}

class AMAP(
    n_neighbors=15,
    n_components=2,
    metric='euclidean',
    metric_kwds=None,
    output_metric='euclidean',
    output_metric_kwds=None,
    use_nmslib=True,
    nmslib_metric='cosine',
    nmslib_n_jobs=10,
    nmslib_efC=100,
    nmslib_efS=100,
    nmslib_M=30,
    n_epochs=None,
    learning_rate=1.5,
    init='spectral',
    min_dist=0.6,
    spread=1.5,
    low_memory=False,
    set_op_mix_ratio=1.0,
    local_connectivity=1.0,
    repulsion_strength=1.0,
    negative_sample_rate=5,
    transform_queue_size=4.0,
    a=None,
    b=None,
    random_state=None,
    angular_rp_forest=False,
    target_n_neighbors=-1,
    target_metric='categorical',
    target_metric_kwds=None,
    target_weight=0.5,
    transform_seed=42,
    force_approximation_algorithm=False,
    verbose=False,
    unique=False
)

Adaptive Manifold Approximation and Projection. Finds a low dimensional embedding of the data that approximates the underlying manifold through a fuzzy-union layout. Accelerated when use_nmslib = True.

Parameters

n_neighbors : float (optional, default 15) : The size of local neighborhood (in terms of number of neighboring sample points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general values should be in the range 2 to 100.

n_components : int (optional, default 2) : The dimension of the space to embed into. This defaults to 2 to provide easy visualization, but can reasonably be set to any integer value in the range 2 to 100.

use_nmslib : bool (optional, default True) : Whether to use NMSlibTransformer to compute fast approximate nearest neighbors. This is a wrapper around NMSLIB that supports fast, parallelized computation with an array of handy features. If set to True, distances are measured in the space defined by the nmslib_metric parameter.

nmslib_metric : str (optional, default 'cosine') : accepted NMSLIB metrics. Accepted metrics include:
- 'sqeuclidean'
- 'euclidean'
- 'l1'
- 'cosine'
- 'angular'
- 'negdotprod'
- 'levenshtein'
- 'hamming'
- 'jaccard'
- 'jansen-shan'

nmslib_n_jobs : int (optional, default None) : Number of threads to use for approximate-nearest neighbor search.

nmslib_efC : int (optional, default 100) : increasing this value improves the quality of a constructed graph and leads to higher accuracy of search. However this also leads to longer indexing times. A reasonable range is 100-2000.

nmslib_efS : int (optional, default 100) : similarly to efC, increasing this value improves recall at the expense of longer retrieval time. A reasonable range is 100-2000.

nmslib_M : int (optional, default 30) : defines the maximum number of neighbors in the zero and above-zero layers during HNSW (Hierarchical Navigable Small World) graph construction. The actual default maximum number of neighbors for the zero layer is 2*M. For more information on HNSW, please check https://arxiv.org/abs/1603.09320. HNSW is implemented in Python via NMSLIB; read more about NMSLIB at https://github.com/nmslib/nmslib.

n_epochs : int (optional, default None) : The number of training epochs to be used in optimizing the low dimensional embedding. Larger values result in more accurate embeddings. If None is specified, a value will be selected based on the size of the input dataset (200 for large datasets, 500 for small).

learning_rate : float (optional, default 1.5) : The initial learning rate for the embedding optimization.

init : string (optional, default 'spectral') : How to initialize the low dimensional embedding. Options are: * 'spectral': use a spectral embedding of the fuzzy 1-skeleton * 'random': assign initial embedding positions at random. * A numpy array of initial embedding positions.

min_dist : float (optional, default 0.6) : The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result in a more even dispersal of points. The value should be set relative to the spread value, which determines the scale at which embedded points will be spread out.

spread : float (optional, default 1.5) : The effective scale of embedded points. In combination with min_dist this determines how clustered/clumped the embedded points are.

metric : string or function (optional, default 'euclidean') : Used if use_nmslib = False. The metric to use to compute distances in high dimensional space. If a string is passed it must match a valid predefined metric. If a general metric is required, a function that takes two 1d arrays and returns a float can be provided. For performance purposes it is required that this be a numba jit'd function. Valid string metrics that should be used within AMAP include:
- euclidean
- manhattan
- seuclidean
- cosine
- correlation
- haversine
- hamming
- jaccard

low_memory : bool (optional, default False) : If you find that AMAP is failing due to memory constraints consider setting use_nmslib to False and this option to True. This approach is more computationally expensive, but avoids excessive memory use.

set_op_mix_ratio : float (optional, default 1.0) : Interpolate between (fuzzy) union and intersection as the set operation used to combine local fuzzy simplicial sets to obtain a global fuzzy simplicial sets. Both fuzzy set operations use the product t-norm. The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.

local_connectivity : int (optional, default 1) : The local connectivity required -- i.e. the number of nearest neighbors that should be assumed to be connected at a local level. The higher this value the more connected the manifold becomes locally. In practice this should be not more than the local intrinsic dimension of the manifold.

repulsion_strength : float (optional, default 1.0) : Weighting applied to negative samples in low dimensional embedding optimization. Values higher than one will result in greater weight being given to negative samples.

negative_sample_rate : int (optional, default 5) : The number of negative samples to select per positive sample in the optimization process. Increasing this value will result in greater repulsive force being applied, greater optimization cost, but slightly more accuracy.

transform_queue_size : float (optional, default 4.0) : For transform operations (embedding new points using a trained model) this will control how aggressively to search for nearest neighbors. Larger values will result in slower performance but more accurate nearest neighbor evaluation.

a : float (optional, default None) : More specific parameters controlling the embedding. If None these values are set automatically as determined by min_dist and spread.

b : float (optional, default None) : More specific parameters controlling the embedding. If None these values are set automatically as determined by min_dist and spread.

random_state : int, RandomState instance or None, optional (default: None) : If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

metric_kwds : dict (optional, default None) : Arguments to pass on to the metric, such as the p value for Minkowski distance. If None then no arguments are passed on.

angular_rp_forest : bool (optional, default False) : Whether to use an angular random projection forest to initialise the approximate nearest neighbor search. This can be faster, but is mostly only useful for metrics that use an angular style distance, such as cosine, correlation etc. In the case of those metrics, angular forests will be chosen automatically.

target_n_neighbors : int (optional, default -1) : The number of nearest neighbors to use to construct the target simplicial set. If set to -1, use the n_neighbors value.

target_metric : string or callable (optional, default 'categorical') : The metric used to measure distance for a target array when using supervised dimension reduction. By default this is 'categorical', which will measure distance in terms of whether categories match or are different. Furthermore, if semi-supervised learning is required, target values of -1 will be treated as unlabelled under the 'categorical' metric. If the target array takes continuous values (e.g. for a regression problem), then a metric of 'l1' or 'l2' is probably more appropriate.

target_metric_kwds : dict (optional, default None) : Keyword argument to pass to the target metric when performing supervised dimension reduction. If None then no arguments are passed on.

target_weight : float (optional, default 0.5) : weighting factor between data topology and target topology. A value of 0.0 weights entirely on data, a value of 1.0 weights entirely on target. The default of 0.5 balances the weighting equally between data and target.

transform_seed : int (optional, default 42) : Random seed used for the stochastic aspects of the transform operation. This ensures consistency in transform operations.

verbose : bool (optional, default False) : Controls verbosity of logging.

unique : bool (optional, default False) : Controls whether the rows of your data should be made unique before being embedded. If you have more duplicates than n_neighbors, identical data points can end up lying in different regions of your space; duplicates also violate the definition of a metric.
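A minimal sketch of fitting AMAP on a hypothetical data matrix with the NMSLIB-accelerated neighbor search:

import numpy as np
from dbmap.umapper import AMAP

### Hypothetical data matrix
data = np.random.rand(2000, 50)

amap = AMAP(n_neighbors=15, use_nmslib=True, nmslib_metric='cosine')
emb = amap.fit_transform(data)    # embedding with shape (n_samples, n_components)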


Methods

Method fit {#dbmap.umapper.AMAP.fit}
def fit(
    self,
    X,
    y=None
)

Fit X into an embedded space. Optionally use y for supervised dimension reduction.

Parameters

X : array, shape (n_samples, n_features) or (n_samples, n_samples) : If the metric is 'precomputed' X must be a square distance matrix. Otherwise it contains a sample per row. If the method is 'exact', X may be a sparse matrix of type 'csr', 'csc' or 'coo'.

y : array, shape (n_samples) : A target array for supervised dimension reduction. How this is handled is determined by parameters UMAP was instantiated with. The relevant attributes are target_metric and target_metric_kwds.

Method fit_transform {#dbmap.umapper.AMAP.fit_transform}
def fit_transform(
    self,
    X,
    y=None
)

Fit X into an embedded space and return that transformed output.

Parameters

X : array, shape (n_samples, n_features) or (n_samples, n_samples) : If the metric is 'precomputed' X must be a square distance matrix. Otherwise it contains a sample per row.

y : array, shape (n_samples) : A target array for supervised dimension reduction. How this is handled is determined by parameters UMAP was instantiated with. The relevant attributes are target_metric and target_metric_kwds.

Returns

X_new : array, shape (n_samples, n_components) : Embedding of the training data in low-dimensional space.

Method inverse_transform {#dbmap.umapper.AMAP.inverse_transform}
def inverse_transform(
    self,
    X
)

Transform X in the existing embedded space back into the input data space and return that transformed output.

Parameters

X : array, shape (n_samples, n_components) : New points to be inverse transformed.

Returns

X_new : array, shape (n_samples, n_features) : Generated data points (new data) in the original data space.

Method transform {#dbmap.umapper.AMAP.transform}
def transform(
    self,
    X
)

Transform X into the existing embedded space and return that transformed output.

Parameters

X : array, shape (n_samples, n_features) : New data to be transformed.

Returns

X_new : array, shape (n_samples, n_components) : Embedding of the new data in low-dimensional space.
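A minimal sketch of embedding new samples with a fitted model and mapping them back to data space. The data is hypothetical, and inverse_transform is assumed to work with the default output metric.

import numpy as np
from dbmap.umapper import AMAP

train = np.random.rand(2000, 50)   # hypothetical training data
new = np.random.rand(100, 50)      # hypothetical new samples

amap = AMAP(n_neighbors=15).fit(train)
emb_new = amap.transform(new)            # embed the new points
rec = amap.inverse_transform(emb_new)    # map embeddings back to data space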

Class DataFrameUMAP {#dbmap.umapper.DataFrameUMAP}

class DataFrameUMAP(
    metrics,
    n_neighbors=15,
    n_components=2,
    output_metric='euclidean',
    output_metric_kwds=None,
    n_epochs=None,
    learning_rate=1.0,
    init='spectral',
    min_dist=0.1,
    spread=1.0,
    set_op_mix_ratio=1.0,
    local_connectivity=1.0,
    repulsion_strength=1.0,
    negative_sample_rate=5,
    transform_queue_size=4.0,
    a=None,
    b=None,
    random_state=None,
    angular_rp_forest=False,
    target_n_neighbors=-1,
    target_metric='categorical',
    target_metric_kwds=None,
    target_weight=0.5,
    transform_seed=42,
    verbose=False
)

Base class for all estimators in scikit-learn

Notes

All estimators should specify all the parameters that can be set at the class level in their __init__ as explicit keyword arguments (no *args or **kwargs).


Methods

Method fit {#dbmap.umapper.DataFrameUMAP.fit}
def fit(
    self,
    X,
    y=None
)

Generated by pdoc 0.9.1 (https://pdoc3.github.io).