Module dbMAP
Sub-modules
- dbmap.ann
- dbmap.diffusion
- dbmap.graph_utils
- dbmap.layout
- dbmap.map
- dbmap.multiscale
- dbmap.plot
- dbmap.spectral
- dbmap.umapper
- dbmap.utils
Approximate Nearest Neighbors {#dbmap.ann}
Class NMSlibTransformer
{#dbmap.ann.NMSlibTransformer}
class NMSlibTransformer( n_neighbors=30, metric='cosine', method='hnsw', n_jobs=10, p=None, M=30, efC=100, efS=100, dense=False, verbose=False )
Wrapper for using nmslib as sklearn's KNeighborsTransformer. This implements a scalable approximate k-nearest-neighbors graph on spaces defined by nmslib. Read more about nmslib and its various available metrics at https://github.com/nmslib/nmslib. Calling 'nn = NMSlibTransformer()' initializes the class with neighbour search parameters.
Parameters
n_neighbors
: int (optional, default 30)
: number of nearest-neighbors to look for. In practice,
this should be considered the average neighborhood size and thus vary depending
on your number of features, samples and data intrinsic dimensionality. Reasonable values
range from 5 to 100. Smaller values tend to lead to increased graph structure
resolution, but users should beware that too low a value may yield granular and vaguely
defined neighborhoods that arise as an artifact of downsampling. Larger
values can slightly increase computational time.
metric
: str (optional, default 'cosine')
: accepted NMSLIB metrics. Defaults to 'cosine'. Accepted metrics include:
-'sqeuclidean'
-'euclidean'
-'l1'
-'lp' - requires setting the parameter p
-'cosine'
-'angular'
-'negdotprod'
-'levenshtein'
-'hamming'
-'jaccard'
-'jansen-shan'
method
: str (optional, default 'hnsw')
: approximate-neighbor search method. Available methods include:
-'hnsw' : a Hierarchical Navigable Small World Graph.
-'sw-graph' : a Small World Graph.
-'vp-tree' : a Vantage-Point tree with a pruning rule adaptable to non-metric distances.
-'napp' : a Neighborhood APProximation index.
-'simple_invindx' : a vanilla, uncompressed, inverted index, which has no parameters.
-'brute_force' : a brute-force search, which has no parameters.
'hnsw' is usually the fastest method, followed by 'sw-graph' and 'vp-tree'.
n_jobs
: int (optional, default 10)
: number of threads to be used in computation. Defaults to 10. The algorithm is highly
scalable to multi-threading.
M
: int (optional, default 30)
: defines the maximum number of neighbors in the zero and above-zero layers during HNSW
(Hierarchical Navigable Small World Graph). However, the actual default maximum number
of neighbors for the zero layer is 2*M. A reasonable range for this parameter
is 5-100. For more information on HNSW, please check https://arxiv.org/abs/1603.09320.
HNSW is implemented in python via NMSlib. Please check more about NMSlib at https://github.com/nmslib/nmslib.
efC
: int (optional, default 100)
: A 'hnsw' parameter. Increasing this value improves the quality of a constructed graph
and leads to higher accuracy of search. However this also leads to longer indexing times.
A reasonable range for this parameter is 50-2000.
efS
: int (optional, default 100)
: A 'hnsw' parameter. Similarly to efC, increasing this value improves recall at the
expense of longer retrieval time. A reasonable range for this parameter is 100-2000.
dense
: bool (optional, default False)
: Whether to force the algorithm to use dense data, such as np.ndarrays and pandas DataFrames.
Returns
Class for really fast approximate-nearest-neighbors search.
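The approximate methods listed above trade a little accuracy for speed. The exact answer they approximate can be sketched in plain Python as a brute-force search. The names below (`DISTANCES`, `exact_knn`) are illustrative only and are not part of dbmap or nmslib; only three of the listed metrics are covered, and each point is excluded from its own neighbor list, which may differ from what the transformer returns.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def l1(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def cosine(u, v):
    # Cosine distance: 1 - cosine similarity (undefined for all-zero vectors).
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical lookup table, not the mapping used inside dbmap.
DISTANCES = {"euclidean": euclidean, "l1": l1, "cosine": cosine}

def exact_knn(data, n_neighbors, metric="euclidean"):
    """Return (indices, distances) of the n_neighbors closest other points per row."""
    dist = DISTANCES[metric]
    indices, distances = [], []
    for i, point in enumerate(data):
        # Score every other point, then keep the closest ones.
        scored = sorted((dist(point, other), j)
                        for j, other in enumerate(data) if j != i)
        nearest = scored[:n_neighbors]
        indices.append([j for _, j in nearest])
        distances.append([d for d, _ in nearest])
    return indices, distances

points = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]]
knn_idx, knn_dist = exact_knn(points, n_neighbors=2, metric="euclidean")
```

This costs a full pass over the data for every point, which is the quadratic cost that index-based methods such as 'hnsw' are designed to avoid.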
Example
import numpy as np
from sklearn.datasets import load_digits
from scipy.sparse import csr_matrix
from dbmap.ann import NMSlibTransformer
#
# Load the scikit-learn digits data, convert to sparse for speed
digits = load_digits()
data = csr_matrix(digits.data)
#
# Start class with parameters
nn = NMSlibTransformer()
nn = nn.fit(data)
#
# Obtain kNN graph
knn = nn.transform(data)
#
# Obtain kNN indices, distances and distance gradient
ind, dist, grad = nn.ind_dist_grad(data)
#
# Test for recall efficiency during approximate nearest neighbors search
test = nn.test_efficiency(data)
Ancestors (in MRO)
Methods
Method fit
{#dbmap.ann.NMSlibTransformer.fit}
def fit( self, data )
Method ind_dist_grad
{#dbmap.ann.NMSlibTransformer.ind_dist_grad}
def ind_dist_grad( self, data, return_grad=True, return_graph=True )
Method test_efficiency
{#dbmap.ann.NMSlibTransformer.test_efficiency}
def test_efficiency( self, data, data_use=0.1 )
Tests whether NMSlibTransformer and KNeighborsTransformer give the same results.
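One common way to compare an approximate search against an exact one is recall: the fraction of true neighbors that the approximate search recovers. The sketch below assumes that kind of comparison; `knn_recall` is a hypothetical helper, and whether `test_efficiency` computes exactly this quantity is not stated in the source.

```python
def knn_recall(approx_indices, exact_indices):
    """Mean fraction of exact neighbours recovered by the approximate search."""
    total = 0.0
    for approx_row, exact_row in zip(approx_indices, exact_indices):
        total += len(set(approx_row) & set(exact_row)) / len(exact_row)
    return total / len(exact_indices)

# Toy neighbour lists: the second sample misses one of its three true neighbours.
exact = [[1, 2, 3], [0, 2, 3], [0, 1, 3]]
approx = [[1, 2, 3], [0, 2, 4], [0, 1, 3]]
recall = knn_recall(approx, exact)
```

A recall of 1.0 means the two searches agree on every neighbor set, regardless of the order in which neighbors are listed.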
Method transform
{#dbmap.ann.NMSlibTransformer.transform}
def transform( self, data )
Method update_search
{#dbmap.ann.NMSlibTransformer.update_search}
def update_search( self, n_neighbors )
Updates number of neighbors for kNN distance computation.
Parameters
n_neighbors: New number of neighbors to look for.
Diffusion harmonics {#dbmap.diffusion}
Class Diffusor
{#dbmap.diffusion.Diffusor}
class Diffusor( n_components=50, n_neighbors=10, alpha=1, n_jobs=10, ann=True, ann_dist='cosine', p=None, M=30, efC=100, efS=100, knn_dist='cosine', kernel_use='decay', transitions=True, eigengap=True, norm=False, verbose=True )
Sklearn estimator for using fast anisotropic diffusion with a multiscaling adaptive algorithm as proposed by Setty et al, 2018, and optimized by Sidarta-Oliveira, 2020.
Parameters
n_components
: Number of diffusion components to compute. Defaults to 50. We suggest larger values if
analyzing more than 10,000 cells.
n_neighbors
: Number of k-nearest-neighbors to compute. The adaptive kernel will normalize distances by each cell's
distance to its median neighbor.
knn_dist
: Distance metric for building kNN graph. Defaults to 'cosine'. Users are encouraged to explore
different metrics, such as 'euclidean' and 'jaccard'. The 'hamming' and 'jaccard' distances are also available
for string vectors.
ann : Boolean. Whether to use approximate nearest neighbors for graph construction. Defaults to True.
alpha : Alpha in the diffusion maps literature. Controls how much the results are biased by data distribution. Defaults to 1, which is suitable for normalized data.
n_jobs : Number of threads to use in calculations. Defaults to 10.
verbose : controls verbosity.
Returns
Diffusion components ['EigenVectors'], associated eigenvalues ['EigenValues'] and suggested number of
resulting components to use during Multiscaling.
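To make the roles of `n_neighbors` and `alpha` concrete, here is a minimal sketch of how an adaptive, alpha-normalized transition matrix can be built. It is not the Diffusor implementation: the Gaussian kernel shape, the dense pairwise distances and the function name `transition_matrix` are all assumptions made for illustration, and the data must not contain duplicate points (the local bandwidth would be zero).

```python
import math

def transition_matrix(data, n_neighbors=3, alpha=1.0):
    n = len(data)
    # Dense pairwise Euclidean distances (a real estimator would use a kNN graph).
    dist = [[math.dist(p, q) for q in data] for p in data]
    # Adaptive bandwidth: each point's distance to its median nearest neighbour.
    sigma = []
    for i in range(n):
        nearest = sorted(dist[i][j] for j in range(n) if j != i)[:n_neighbors]
        sigma.append(nearest[len(nearest) // 2])
    # Affinities scaled by the local bandwidth (assumed Gaussian), then symmetrised.
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                kernel[i][j] = math.exp(-(dist[i][j] / sigma[i]) ** 2)
    kernel = [[kernel[i][j] + kernel[j][i] for j in range(n)] for i in range(n)]
    # Alpha normalisation: divide out the sampling density estimated by the degree.
    degree = [sum(row) for row in kernel]
    kernel = [[kernel[i][j] / ((degree[i] * degree[j]) ** alpha) for j in range(n)]
              for i in range(n)]
    # Row-normalise to obtain a Markov transition matrix.
    return [[value / sum(row) for value in row] for row in kernel]

points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 1.0]]
P = transition_matrix(points, n_neighbors=3, alpha=1.0)
```

Diffusion components would then be obtained as the leading eigenvectors of such a matrix, with `alpha=0` leaving the density bias in place and `alpha=1` removing most of it.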
Example
import numpy as np
from sklearn.datasets import load_digits
from scipy.sparse import csr_matrix
import dbmap
# Load the scikit-learn digits data, convert to sparse for speed
digits = load_digits()
data = csr_matrix(digits.data)
# Fit the anisotropic diffusion process
diff = dbmap.diffusion.Diffusor()
res = diff.fit_transform(data)
Ancestors (in MRO)
Methods
Method fit
{#dbmap.diffusion.Diffusor.fit}
def fit( self, data )
Fits an adaptive anisotropic kernel to the data.
Parameters
data
: Input data. Takes in numpy arrays and scipy csr sparse matrices. Use with sparse data for top performance. You can adjust a series of parameters that can make the process faster and more informational depending on your dataset. Read more at https://github.com/davisidarta/dbmap
Method ind_dist_grad
{#dbmap.diffusion.Diffusor.ind_dist_grad}
def ind_dist_grad( self, data, n_components=None, dense=False )
Computes the diffusion process on the data. Also returns the normalized diffusion distances, indexes and gradient obtained by approximating the Laplace-Beltrami operator.
Parameters
plot_knee
: Whether to plot the scree plot of diffusion eigenvalues.
data
: Input data. Takes in numpy arrays and scipy csr sparse matrices. Please use with sparse data for top performance. You can adjust a series of parameters that can make the process faster and more informational depending on your dataset. Read more at https://github.com/davisidarta/dbmap
Method return_dict
{#dbmap.diffusion.Diffusor.return_dict}
def return_dict( self )
Returns
Dictionary containing normalized and multiscaled Diffusion Components (['StructureComponents']), their eigenvalues (['EigenValues']), non-normalized components (['EigenVectors']) and the kernel used for transformation of distances into affinities (['kernel']).
Method transform
{#dbmap.diffusion.Diffusor.transform}
def transform( self, data )
Graph utilities {#dbmap.graph_utils}
Function approximate_n_neighbors
{#dbmap.graph_utils.approximate_n_neighbors}
def approximate_n_neighbors( data, n_neighbors=30, metric='cosine', method='hnsw', n_jobs=10, efC=100, efS=100, M=30, dense=False, verbose=False )
Simple function using NMSlibTransformer from dbmap.ann. This implements a very fast and scalable approximate k-nearest-neighbors graph on spaces defined by nmslib. Read more about nmslib and its various available metrics at https://github.com/nmslib/nmslib. Read more about dbMAP at https://github.com/davisidarta/dbMAP.
Parameters
n_neighbors
: number of nearest-neighbors to look for. In practice,
this should be considered the average neighborhood size and thus vary depending
on your number of features, samples and data intrinsic dimensionality. Reasonable values
range from 5 to 100. Smaller values tend to lead to increased graph structure
resolution, but users should beware that too low a value may yield granular and vaguely
defined neighborhoods that arise as an artifact of downsampling. Defaults to 30. Larger
values can slightly increase computational time.
metric
: accepted NMSLIB metrics. Defaults to 'cosine'. Accepted metrics include:
-'sqeuclidean'
-'euclidean'
-'l1'
-'cosine'
-'angular'
-'negdotprod'
-'levenshtein'
-'hamming'
-'jaccard'
-'jansen-shan'
method: approximate-neighbor search method. Defaults to 'hnsw' (usually the fastest).
n_jobs: number of threads to be used in computation. Defaults to 10 (~5 cores).
efC
: increasing this value improves the quality of a constructed graph and leads to higher
accuracy of search. However this also leads to longer indexing times. A reasonable
range is 100-2000. Defaults to 100.
efS
: similarly to efC, increasing this value improves recall at the expense of longer
retrieval time. A reasonable range is 100-2000.
M
: defines the maximum number of neighbors in the zero and above-zero layers during HNSW
(Hierarchical Navigable Small World Graph). However, the actual default maximum number
of neighbors for the zero layer is 2*M. For more information on HNSW, please check
https://arxiv.org/abs/1603.09320. HNSW is implemented in python via NMSLIB. Please check
more about NMSLIB at https://github.com/nmslib/nmslib .
Returns
k-nearest-neighbors indices and distances. Can be customized to also return the k-nearest-neighbors graph and its gradient.
Example
knn_indices, knn_dists = approximate_n_neighbors(data)
Function compute_connectivities_adapmap
{#dbmap.graph_utils.compute_connectivities_adapmap}
def compute_connectivities_adapmap( data, n_components=100, n_neighbors=30, alpha=0.0, n_jobs=10, ann=True, ann_dist='cosine', M=30, efC=100, efS=100, knn_dist='euclidean', kernel_use='sidarta', sensitivity=1, set_op_mix_ratio=1.0, local_connectivity=1.0 )
Sklearn estimator for using fast anisotropic diffusion with an anisotropic adaptive algorithm as proposed by Setty et al, 2018, and optimized by Sidarta-Oliveira, 2020. This procedure generates diffusion components that effectively carry the maximum amount of information regarding the data geometric structure (structure components). These structure components then undergo a fuzzy-union of simplicial sets. This step is from umap.fuzzy_simplicial_set [McInnes18]. Given a set of data X, a neighborhood size, and a measure of distance, compute the fuzzy simplicial set (here represented as a fuzzy graph in the form of a sparse matrix) associated to the data. This is done by locally approximating geodesic distance at each point, creating a fuzzy simplicial set for each such point, and then combining all the local fuzzy simplicial sets into a global one via a fuzzy union.
Parameters
n_components
: Number of diffusion components to compute. Defaults to 100. We suggest larger values if
analyzing more than 10,000 cells.
n_neighbors
: Number of k-nearest-neighbors to compute. The adaptive kernel will normalize distances by each cell's
distance to its median neighbor.
knn_dist
: Distance metric for building kNN graph. Defaults to 'euclidean'.
ann : Boolean. Whether to use approximate nearest neighbors for graph construction. Defaults to True.
alpha : Alpha in the diffusion maps literature. Controls how much the results are biased by data distribution. Defaults to 0.0; a value of 1 is suitable for normalized data.
n_jobs : Number of threads to use in calculations. Defaults to 10.
sensitivity
: Sensitivity to select eigenvectors if diff_normalization is set to 'knee'. Useful when dealing wit
Returns
Diffusion components ['EigenVectors'], associated eigenvalues ['EigenValues'] and suggested number of resulting components to use during Multiscaling.
Example
import numpy as np
from sklearn.datasets import load_digits
from scipy.sparse import csr_matrix
import dbmap
# Load the scikit-learn digits data, convert to sparse for speed
digits = load_digits()
data = csr_matrix(digits.data)
# Fit the anisotropic diffusion process
diff = dbmap.diffusion.Diffusor()
res = diff.fit_transform(data)
Function compute_membership_strengths
{#dbmap.graph_utils.compute_membership_strengths}
def compute_membership_strengths( knn_indices, knn_dists, sigmas, rhos )
Construct the membership strength data for the 1-skeleton of each local fuzzy simplicial set -- this is formed as a sparse matrix where each row is a local fuzzy simplicial set, with a membership strength for the 1-simplex to each other data point.
Parameters
knn_indices
: array of shape (n_samples, n_neighbors)
: The indices of the n_neighbors closest points in the dataset.
knn_dists
: array of shape (n_samples, n_neighbors)
: The distances to the n_neighbors closest points in the dataset.
sigmas
: array of shape (n_samples)
: The normalization factor derived from the metric tensor approximation.
rhos
: array of shape (n_samples)
: The local connectivity adjustment.
Returns
rows
: array of shape (n_samples * n_neighbors)
: Row data for the resulting sparse matrix (coo format)
cols
: array of shape (n_samples * n_neighbors)
: Column data for the resulting sparse matrix (coo format)
vals
: array of shape (n_samples * n_neighbors)
: Entries for the resulting sparse matrix (coo format)
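The computation described above can be sketched in plain Python. The decay rule used here, exp(-(d - rho) / sigma) with a strength of 1 inside the local-connectivity radius, follows the UMAP reference implementation that this function appears to derive from; that dbmap keeps the same rule is an assumption, and `membership_strengths` is an illustrative name.

```python
import math

def membership_strengths(knn_indices, knn_dists, sigmas, rhos):
    """Return COO triplets (rows, cols, vals) of local membership strengths."""
    rows, cols, vals = [], [], []
    for i, (neighbors, dists) in enumerate(zip(knn_indices, knn_dists)):
        for j, d in zip(neighbors, dists):
            if j == i:
                val = 0.0  # a point is not its own neighbour
            elif d - rhos[i] <= 0.0 or sigmas[i] == 0.0:
                val = 1.0  # inside the local-connectivity radius
            else:
                val = math.exp(-(d - rhos[i]) / sigmas[i])
            rows.append(i)
            cols.append(j)
            vals.append(val)
    return rows, cols, vals

knn_indices = [[0, 1, 2], [1, 0, 2]]
knn_dists = [[0.0, 1.0, 2.0], [0.0, 1.0, 3.0]]
rows, cols, vals = membership_strengths(knn_indices, knn_dists,
                                        sigmas=[1.0, 2.0], rhos=[1.0, 1.0])
```

Each row of the resulting sparse matrix is one local fuzzy simplicial set, so the matrix is generally not symmetric until the local sets are combined.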
Function fuzzy_simplicial_set_nmslib
{#dbmap.graph_utils.fuzzy_simplicial_set_nmslib}
def fuzzy_simplicial_set_nmslib( X, n_neighbors, knn_indices=None, knn_dists=None, nmslib_metric='cosine', nmslib_n_jobs=None, nmslib_efC=100, nmslib_efS=100, nmslib_M=30, set_op_mix_ratio=1.0, local_connectivity=1.0, apply_set_operations=True, verbose=False )
Given a set of data X, a neighborhood size, and a measure of distance compute the fuzzy simplicial set (here represented as a fuzzy graph in the form of a sparse matrix) associated to the data. This is done by locally approximating geodesic distance at each point, creating a fuzzy simplicial set for each such point, and then combining all the local fuzzy simplicial sets into a global one via a fuzzy union.
Parameters
X
: array of shape (n_samples, n_features)
: The data to be modelled as a fuzzy simplicial set.
n_neighbors
: int
: The number of neighbors to use to approximate geodesic distance.
Larger numbers induce more global estimates of the manifold that can
miss finer detail, while smaller values will focus on fine manifold
structure to the detriment of the larger picture.
nmslib_metric
: str (optional, default 'cosine')
: accepted NMSLIB metrics. Accepted metrics include:
-'sqeuclidean'
-'euclidean'
-'l1'
-'l1_sparse'
-'cosine'
-'angular'
-'negdotprod'
-'levenshtein'
-'hamming'
-'jaccard'
-'jansen-shan'
nmslib_n_jobs
: int (optional, default None)
: Number of threads to use for approximate-nearest neighbor search.
nmslib_efC
: int (optional, default 100)
: increasing this value improves the quality of a constructed graph and leads to higher
accuracy of search. However this also leads to longer indexing times. A reasonable
range is 100-2000.
nmslib_efS
: int (optional, default 100)
: similarly to efC, increasing this value improves recall at the expense of longer
retrieval time. A reasonable range is 100-2000.
nmslib_M
: int (optional, default 30)
: defines the maximum number of neighbors in the zero and above-zero layers during HNSW
(Hierarchical Navigable Small World Graph). However, the actual default maximum number
of neighbors for the zero layer is 2*M. For more information on HNSW, please check
https://arxiv.org/abs/1603.09320. HNSW is implemented in python via NMSLIB. Please check
more about NMSLIB at https://github.com/nmslib/nmslib .
n_epochs
: int (optional, default None)
: The number of training epochs to be used in optimizing the
low dimensional embedding. Larger values result in more accurate
embeddings. If None is specified a value will be selected based on
the size of the input dataset (200 for large datasets, 500 for small).
knn_indices
: array of shape (n_samples, n_neighbors) (optional)
: If the k-nearest neighbors of each point have already been calculated
you can pass them in here to save computation time. This should be
an array with the indices of the k-nearest neighbors as a row for
each data point.
knn_dists
: array of shape (n_samples, n_neighbors) (optional)
: If the k-nearest neighbors of each point have already been calculated
you can pass them in here to save computation time. This should be
an array with the distances of the k-nearest neighbors as a row for
each data point.
set_op_mix_ratio
: float (optional, default 1.0)
: Interpolate between (fuzzy) union and intersection as the set operation
used to combine local fuzzy simplicial sets to obtain a global fuzzy
simplicial set. Both fuzzy set operations use the product t-norm.
The value of this parameter should be between 0.0 and 1.0; a value of
1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy
intersection.
local_connectivity
: int (optional, default 1)
: The local connectivity required -- i.e. the number of nearest
neighbors that should be assumed to be connected at a local level.
The higher this value the more connected the manifold becomes
locally. In practice this should be not more than the local intrinsic
dimension of the manifold.
verbose
: bool (optional, default False)
: Whether to report information on the current progress of the algorithm.
Returns
fuzzy_simplicial_set
: coo_matrix
: A fuzzy simplicial set represented as a sparse matrix. The (i, j) entry
of the matrix represents the membership strength of the
1-simplex between the ith and jth sample points.
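The interpolation controlled by `set_op_mix_ratio` can be written out entry by entry. The sketch below uses a small dense matrix for readability, whereas the function works on sparse matrices; `combine_fuzzy_sets` is a hypothetical name. With the product t-norm, the intersection of memberships a and b is a*b and the union is a + b - a*b.

```python
def combine_fuzzy_sets(A, set_op_mix_ratio=1.0):
    """Blend fuzzy union and intersection of A with its transpose."""
    n = len(A)
    result = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            a, b = A[i][j], A[j][i]
            product = a * b          # fuzzy intersection (product t-norm)
            union = a + b - product  # fuzzy union (probabilistic sum)
            result[i][j] = (set_op_mix_ratio * union
                            + (1.0 - set_op_mix_ratio) * product)
    return result

# Directed memberships: 0 -> 1 has strength 0.5, 1 -> 0 has strength 0.2.
A = [[0.0, 0.5], [0.2, 0.0]]
pure_union = combine_fuzzy_sets(A, 1.0)
pure_intersection = combine_fuzzy_sets(A, 0.0)
half_mix = combine_fuzzy_sets(A, 0.5)
```

Whatever the mix ratio, the result is symmetric, which is what turns the collection of directed local sets into a single undirected fuzzy graph.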
Function get_igraph_from_adjacency
{#dbmap.graph_utils.get_igraph_from_adjacency}
def get_igraph_from_adjacency( adjacency, directed=None )
Get igraph graph from adjacency matrix.
Function get_sparse_matrix_from_indices_distances_dbmap
{#dbmap.graph_utils.get_sparse_matrix_from_indices_distances_dbmap}
def get_sparse_matrix_from_indices_distances_dbmap( knn_indices, knn_dists, n_obs, n_neighbors )
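The signature above carries no description, so the following is only a guess at what its name suggests: turning per-sample neighbor indices and distances into a sparse (n_obs, n_obs) distance matrix. The sketch builds the three CSR arrays by hand in plain Python; `knn_to_csr` is a hypothetical name, and skipping self-matches and `-1` placeholders is an assumption borrowed from similar helpers in other libraries.

```python
def knn_to_csr(knn_indices, knn_dists, n_obs, n_neighbors):
    """Return CSR arrays (data, indices, indptr) for a kNN distance graph."""
    data, indices, indptr = [], [], [0]
    for i in range(n_obs):
        for j in range(n_neighbors):
            neighbor = knn_indices[i][j]
            if neighbor == -1 or neighbor == i:
                continue  # skip missing neighbours and self-matches
            indices.append(neighbor)
            data.append(knn_dists[i][j])
        indptr.append(len(data))  # marks where row i ends
    return data, indices, indptr

knn_indices = [[0, 1], [1, 2], [2, -1]]
knn_dists = [[0.0, 0.5], [0.0, 0.7], [0.0, 0.0]]
data, indices, indptr = knn_to_csr(knn_indices, knn_dists, n_obs=3, n_neighbors=2)
```

The three lists have the layout expected by `scipy.sparse.csr_matrix((data, indices, indptr), shape=(n_obs, n_obs))`.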
Function smooth_knn_dist
{#dbmap.graph_utils.smooth_knn_dist}
def smooth_knn_dist( distances, k, n_iter=64, local_connectivity=1.0, bandwidth=1.0 )
Compute a continuous version of the distance to the kth nearest neighbor. That is, this is similar to knn-distance but allows continuous k values rather than requiring an integral k. In essence we are simply computing the distance such that the cardinality of fuzzy set we generate is k.
Parameters
distances
: array of shape (n_samples, n_neighbors)
: Distances to nearest neighbors for each sample. Each row should be a
sorted list of distances to a given sample's nearest neighbors.
k
: float
: The number of nearest neighbors to approximate for.
n_iter
: int (optional, default 64)
: We need to binary search for the correct distance value. This is the
max number of iterations to use in such a search.
local_connectivity
: int (optional, default 1)
: The local connectivity required -- i.e. the number of nearest
neighbors that should be assumed to be connected at a local level.
The higher this value the more connected the manifold becomes
locally. In practice this should be not more than the local intrinsic
dimension of the manifold.
bandwidth
: float (optional, default 1)
: The target bandwidth of the kernel, larger values will produce
larger return values.
Returns
knn_dist
: array of shape (n_samples,)
: The distance to kth nearest neighbor, as suitably approximated.
nn_dist
: array of shape (n_samples,)
: The distance to the 1st nearest neighbor for each point.
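The binary search mentioned under `n_iter` can be sketched for a single sample as follows. This is a simplified reading of the UMAP reference routine, in which the target cardinality is log2(k) times the bandwidth; that dbmap uses the same target is an assumption. The sketch also fixes local_connectivity at 1, omits the lower bounds UMAP applies to the result, and uses the illustrative name `smooth_knn_dist_row`.

```python
import math

def smooth_knn_dist_row(distances, k, n_iter=64, bandwidth=1.0, tol=1e-5):
    """distances: sorted distances from one sample to its neighbours (self excluded)."""
    target = math.log2(k) * bandwidth
    nonzero = [d for d in distances if d > 0.0]
    rho = nonzero[0] if nonzero else 0.0  # distance to the nearest distinct neighbour
    lo, hi, mid = 0.0, math.inf, 1.0
    for _ in range(n_iter):
        # Fuzzy-set cardinality obtained with the current scale `mid`.
        psum = sum(math.exp(-max(d - rho, 0.0) / mid) for d in distances)
        if abs(psum - target) < tol:
            break
        if psum > target:
            hi = mid
            mid = (lo + hi) / 2.0
        else:
            lo = mid
            # Grow geometrically until an upper bound is found, then bisect.
            mid = mid * 2.0 if hi == math.inf else (lo + hi) / 2.0
    return mid, rho

sigma, rho = smooth_knn_dist_row([1.0, 2.0, 3.0, 4.0], k=4)
```

The first return value plays the role of `knn_dist` (the per-sample normalization later used as `sigmas`) and the second the role of `nn_dist` (later used as `rhos`).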
Graph layout {#dbmap.layout}
Class force_directed_layout
{#dbmap.layout.force_directed_layout}
class force_directed_layout( layout='fa', init_pos=None, use_paga=False, root=None, random_state=0, n_jobs=10, **kwds )
Force-directed graph drawing [Islam11] [Jacomy14] [Chippada18].
An alternative to tSNE that often preserves the topology of the data
better. This requires running scanpy.pp.neighbors first.
The default layout ('fa', ForceAtlas2) [Jacomy14] uses the package fa2
(https://github.com/bhargavchippada/forceatlas2) [Chippada18], which can be installed via pip install fa2.
Force-directed graph drawing (https://en.wikipedia.org/wiki/Force-directed_graph_drawing)
describes a class of long-established algorithms for visualizing graphs.
It has been suggested for visualizing single-cell data by [Islam11].
Many other layouts as implemented in igraph [Csardi06] are available.
Similar approaches have been used by [Zunder15] or [Weinreb17].
Parameters
data
: Data matrix. Accepts numpy arrays and csr matrices.
layout
: 'fa' (ForceAtlas2) or any valid igraph layout
(http://igraph.org/c/doc/igraph-Layout.html). Of particular interest
are 'fr' (Fruchterman Reingold), 'grid_fr' (Grid Fruchterman Reingold,
faster than 'fr'), 'kk' (Kamada Kawai, slower than 'fr'), 'lgl' (Large
Graph, very fast), 'drl' (Distributed Recursive Layout, pretty fast) and
'rt' (Reingold Tilford tree layout).
root
: Root for tree layouts.
random_state
: For layouts with random initialization like 'fr', change this to use
different initial states for the optimization. If None, no seed is set.
proceed
: Continue computation, starting off with 'X_draw_graph_layout'.
init_pos
: 'paga'/True, None/False, or any valid 2d-.obsm key.
Use precomputed coordinates for initialization.
If False/None (the default), initialize randomly.
**kwds
: Parameters of chosen igraph layout. See e.g. fruchterman-reingold
(http://igraph.org/python/doc/igraph.Graph-class.html#layout_fruchterman_reingold)
[Fruchterman91]. One of the most important ones is maxiter.
Returns
Depending on copy, returns or updates adata with the following field.
X_draw_graph_layout : adata.obsm
Coordinates of graph layout. E.g. for layout='fa' (the default),
the field is called 'X_draw_graph_fa'.
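As an illustration of what a force-directed layout does, here is a compact pure-Python sketch in the spirit of the Fruchterman-Reingold ('fr') scheme: all node pairs repel, connected nodes attract, and the step size shrinks over time. It is not ForceAtlas2 and not the igraph implementation; the function name, the cooling schedule and the constants are assumptions chosen for brevity.

```python
import math
import random

def fruchterman_reingold(n_nodes, edges, n_iter=50, seed=0):
    rng = random.Random(seed)
    pos = [[rng.random(), rng.random()] for _ in range(n_nodes)]
    k = math.sqrt(1.0 / n_nodes)  # ideal edge length in the unit square
    temperature = 0.1             # maximum displacement per iteration
    for _ in range(n_iter):
        disp = [[0.0, 0.0] for _ in range(n_nodes)]
        # Repulsion between every pair of nodes, magnitude k^2 / d.
        for i in range(n_nodes):
            for j in range(i + 1, n_nodes):
                dx, dy = pos[i][0] - pos[j][0], pos[i][1] - pos[j][1]
                d = max(math.hypot(dx, dy), 1e-9)
                force = k * k / d
                disp[i][0] += dx / d * force
                disp[i][1] += dy / d * force
                disp[j][0] -= dx / d * force
                disp[j][1] -= dy / d * force
        # Attraction along edges, magnitude d^2 / k.
        for i, j in edges:
            dx, dy = pos[i][0] - pos[j][0], pos[i][1] - pos[j][1]
            d = max(math.hypot(dx, dy), 1e-9)
            force = d * d / k
            disp[i][0] -= dx / d * force
            disp[i][1] -= dy / d * force
            disp[j][0] += dx / d * force
            disp[j][1] += dy / d * force
        # Move each node along its net force, capped by the temperature.
        for i in range(n_nodes):
            length = max(math.hypot(disp[i][0], disp[i][1]), 1e-9)
            step = min(length, temperature)
            pos[i][0] += disp[i][0] / length * step
            pos[i][1] += disp[i][1] / length * step
        temperature *= 0.95  # cool down so the layout settles
    return pos

layout = fruchterman_reingold(4, [(0, 1), (1, 2), (2, 3)])
```

Fixing the seed makes the layout reproducible, which mirrors the role of `random_state` in the class above.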
Ancestors (in MRO)
Methods
Method fit
{#dbmap.layout.force_directed_layout.fit}
def fit( self, data )
Method plot_graph
{#dbmap.layout.force_directed_layout.plot_graph}
def plot_graph( self, node_size=20, with_labels=False, node_color='blue', node_alpha=0.4, plot_edges=True, edge_color='green', edge_alpha=0.05 )
Method transform
{#dbmap.layout.force_directed_layout.transform}
def transform( self, X, y=None, **fit_params )
Mapping - UMAP/TriMaps {#dbmap.map}
Class Mapper
{#dbmap.map.Mapper}
class Mapper( n_components=2, n_neighbors=15, metric='euclidean', output_metric='euclidean', n_epochs=None, learning_rate=1.5, init='spectral', min_dist=0.6, spread=1.5, low_memory=False, set_op_mix_ratio=1.0, local_connectivity=1.0, repulsion_strength=1.0, negative_sample_rate=5, transform_queue_size=4.0, a=None, b=None, random_state=None, angular_rp_forest=False, target_n_neighbors=-1, target_metric='categorical', target_weight=0.5, transform_seed=42, force_approximation_algorithm=False, verbose=False, unique=False )
Lays out the diffusion structure with UMAP to achieve dbMAP dimensional reduction. This class refers to the lower dimensional representation of diffusion components obtained through an adaptive diffusion maps algorithm initially proposed by [Setty18]. Alternatively, other diffusion approaches can be used, such as
To do: adapt this to other diffusion maps algorithms.
Parameters
n_components
: int (optional, default 2)
: The dimension of the space to embed into. This defaults to 2 to provide easy visualization, but can reasonably be set to any integer value in the range 2 to K, K being the number of samples or diffusion components to embed.
n_neighbors
: The size of local neighborhood (in terms of number of neighboring sample points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general values should be in the range 2 to 100.
n_jobs
: Number of threads to use in calculations. Defaults to all but one.
min_dist
: The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result in a more even dispersal of points. The value should be set relative to the spread value, which determines the scale at which embedded points will be spread out.
spread
: The effective scale of embedded points. In combination with min_dist this determines how clustered/clumped the embedded points are.
learning_rate
: The initial learning rate for the embedding optimization.
Returns
dbMAP embeddings.
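The interplay of `min_dist` and `spread` is easier to see as a curve. In UMAP, these two values define a target membership curve that is flat up to `min_dist` and then decays exponentially at a rate set by `spread`; the `a` and `b` parameters are obtained by fitting the smooth family 1 / (1 + a * x^(2b)) to it. The sketch below assumes Mapper inherits this behavior and replaces the proper least-squares fit with a coarse grid search, so the resulting numbers are only approximate. All function names are illustrative.

```python
import math

def target_curve(x, min_dist, spread):
    """Desired membership of two embedded points at distance x."""
    if x <= min_dist:
        return 1.0
    return math.exp(-(x - min_dist) / spread)

def smooth_curve(x, a, b):
    """Differentiable family that is fitted to the target curve."""
    return 1.0 / (1.0 + a * x ** (2.0 * b))

def fit_error(a, b, min_dist, spread, n_points=100):
    # Sum of squared differences on a grid covering (0, 3 * spread].
    xs = [3.0 * spread * (i + 1) / n_points for i in range(n_points)]
    return sum((smooth_curve(x, a, b) - target_curve(x, min_dist, spread)) ** 2
               for x in xs)

def fit_ab(min_dist, spread):
    """Coarse grid search standing in for a proper curve fit."""
    grid = [0.05 * i for i in range(1, 61)]  # candidate values 0.05 .. 3.0
    return min(((a, b) for a in grid for b in grid),
               key=lambda ab: fit_error(ab[0], ab[1], min_dist, spread))

a, b = fit_ab(min_dist=0.6, spread=1.5)
```

A larger `min_dist` widens the flat part of the curve, so nearby points are allowed to sit further apart before the optimizer starts pulling them together.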
Ancestors (in MRO)
Methods
Method fit
{#dbmap.map.Mapper.fit}
def fit( self, data, y=0 )
Method fit_transform
{#dbmap.map.Mapper.fit_transform}
def fit_transform( self, data, y=0 )
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters
X
: {array-like, sparse matrix, dataframe} of shape (n_samples, n_features)
y
: ndarray of shape (n_samples,), default=None
: Target values.
**fit_params
: dict
: Additional fit parameters.
Returns
X_new
: ndarray array of shape (n_samples, n_features_new)
: Transformed array.
Multiscale diffusion {#dbmap.multiscale}
Function multiscale
{#dbmap.multiscale.multiscale}
def multiscale( res, n_eigs=None )
Determine multi scale space of the data.
Parameters
n_eigs
: Number of eigenvectors to use. If None specified, the number of eigenvectors will be determined using eigen gap identification.
Returns
Multi scaled data matrix.
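Eigen gap identification and multiscaling can be sketched in plain Python. The rule below (pick the largest gap between consecutive eigenvalues, drop the first trivial eigenvector, scale each remaining one by lambda / (1 - lambda)) follows the multiscale approach of Setty et al. that this module cites; that dbmap implements it exactly this way is an assumption, and `multiscale_space` is a hypothetical name that takes the eigenvectors and eigenvalues directly rather than the `res` object.

```python
def multiscale_space(eigenvectors, eigenvalues, n_eigs=None):
    """eigenvectors: one row per sample; eigenvalues sorted in decreasing order."""
    if n_eigs is None:
        # Eigen gap: difference between consecutive eigenvalues.
        gaps = [eigenvalues[i] - eigenvalues[i + 1]
                for i in range(len(eigenvalues) - 1)]
        order = sorted(range(len(gaps)), key=lambda i: gaps[i])
        n_eigs = order[-1] + 1
        if n_eigs < 3:
            n_eigs = order[-2] + 1  # fall back to the second-largest gap
    use = range(1, n_eigs)  # skip the trivial first eigenvector
    scale = [eigenvalues[i] / (1.0 - eigenvalues[i]) for i in use]
    return [[row[i] * s for i, s in zip(use, scale)] for row in eigenvectors]

vals = [1.0, 0.9, 0.8, 0.3, 0.2]
vecs = [[1.0, 1.0, 1.0, 1.0, 1.0],
        [1.0, 2.0, 3.0, 4.0, 5.0]]
ms = multiscale_space(vecs, vals)
```

In this toy input the largest gap lies between 0.8 and 0.3, so two components are kept and weighted by roughly 9 and 4.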
Plotting utilities {#dbmap.plot}
Function scatter_plot
{#dbmap.plot.scatter_plot}
def scatter_plot( res, title=None, fontsize=18, labels=None, pt_size=None, marker='o', opacity=1 )
Optimized UMAP (AMAP) {#dbmap.umapper}
Class AMAP
{#dbmap.umapper.AMAP}
class AMAP( n_neighbors=15, n_components=2, metric='euclidean', metric_kwds=None, output_metric='euclidean', output_metric_kwds=None, use_nmslib=True, nmslib_metric='cosine', nmslib_n_jobs=10, nmslib_efC=100, nmslib_efS=100, nmslib_M=30, n_epochs=None, learning_rate=1.5, init='spectral', min_dist=0.6, spread=1.5, low_memory=False, set_op_mix_ratio=1.0, local_connectivity=1.0, repulsion_strength=1.0, negative_sample_rate=5, transform_queue_size=4.0, a=None, b=None, random_state=None, angular_rp_forest=False, target_n_neighbors=-1, target_metric='categorical', target_metric_kwds=None, target_weight=0.5, transform_seed=42, force_approximation_algorithm=False, verbose=False, unique=False )
Adaptive Manifold Approximation and Projection
Finds a low dimensional embedding of the data that approximates
the underlying manifold through fuzzy-union layout. Accelerated
when use_nmslib = True.
Parameters
n_neighbors
: float (optional, default 15)
: The size of local neighborhood (in terms of number of neighboring
sample points) used for manifold approximation. Larger values
result in more global views of the manifold, while smaller
values result in more local data being preserved. In general
values should be in the range 2 to 100.
n_components
: int (optional, default 2)
: The dimension of the space to embed into. This defaults to 2 to
provide easy visualization, but can reasonably be set to any
integer value in the range 2 to 100.
use_nmslib
: bool (optional, default True)
: Whether to use NMSlibTransformer to compute fast approximate nearest
neighbors. This is a wrapper around NMSLIB that supports fast and parallelized
computation with an array of handy features. If set to True, distances
are measured in the space defined by the nmslib_metric parameter.
nmslib_metric
: str (optional, default 'cosine')
: accepted NMSLIB metrics. Defaults to 'cosine'. Accepted metrics include:
-'sqeuclidean'
-'euclidean'
-'l1'
-'cosine'
-'angular'
-'negdotprod'
-'levenshtein'
-'hamming'
-'jaccard'
-'jansen-shan'
nmslib_n_jobs
: int (optional, default 10)
: Number of threads to use for approximate-nearest neighbor search.
nmslib_efC
: int (optional, default 100)
: increasing this value improves the quality of a constructed graph and leads to higher
accuracy of search. However this also leads to longer indexing times. A reasonable
range is 100-2000.
nmslib_efS
: int (optional, default 100)
: similarly to efC, increasing this value improves recall at the expense of longer
retrieval time. A reasonable range is 100-2000.
nmslib_M
: int (optional, default 30)
: defines the maximum number of neighbors in the zero and above-zero layers during HNSW
(Hierarchical Navigable Small World Graph). However, the actual default maximum number
of neighbors for the zero layer is 2*M. For more information on HNSW, please check
https://arxiv.org/abs/1603.09320. HNSW is implemented in python via NMSLIB. Please check
more about NMSLIB at https://github.com/nmslib/nmslib .
n_epochs
: int (optional, default None)
: The number of training epochs to be used in optimizing the
low dimensional embedding. Larger values result in more accurate
embeddings. If None is specified a value will be selected based on
the size of the input dataset (200 for large datasets, 500 for small).
learning_rate
: float (optional, default 1.5)
: The initial learning rate for the embedding optimization.
init
: string (optional, default 'spectral')
: How to initialize the low dimensional embedding. Options are:
* 'spectral': use a spectral embedding of the fuzzy 1-skeleton
* 'random': assign initial embedding positions at random.
* A numpy array of initial embedding positions.
min_dist
: float (optional, default 0.6)
: The effective minimum distance between embedded points. Smaller values
will result in a more clustered/clumped embedding where nearby points
on the manifold are drawn closer together, while larger values will
result in a more even dispersal of points. The value should be set
relative to the spread value, which determines the scale at which
embedded points will be spread out.
spread
: float (optional, default 1.5)
: The effective scale of embedded points. In combination with min_dist
this determines how clustered/clumped the embedded points are.
metric
: string or function (optional, default 'euclidean')
: Used if use_nmslib = False. The metric to use to compute distances
in high dimensional space. If a string is passed it must match a valid
predefined metric. If a general metric is required a function that takes
two 1d arrays and returns a float can be provided. For performance purposes
it is required that this be a numba jit'd function. Valid string metrics
that should be used within AMAP include:
* euclidean
* manhattan
* seuclidean
* cosine
* correlation
* haversine
* hamming
* jaccard
low_memory
: bool (optional
, default False)
: If you find that AMAP is failing due to memory constraints
consider setting use_nmslib to False
and this option to True
. This approach
is more computationally expensive, but avoids excessive memory use.
set_op_mix_ratio
: float (optional
, default 1.0)
: Interpolate between (fuzzy) union and intersection as the set operation
used to combine local fuzzy simplicial sets to obtain a global fuzzy
simplicial set. Both fuzzy set operations use the product t-norm.
The value of this parameter should be between 0.0 and 1.0; a value of
1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy
intersection.
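The interpolation can be sketched for a single pair of directed edge weights as follows. The function name `combine_memberships` and its arguments are hypothetical; the formulas are the standard product t-norm union and intersection, and it is an assumption that AMAP applies them edge by edge in this way.

```python
def combine_memberships(w_ij, w_ji, set_op_mix_ratio=1.0):
    """Blend fuzzy union and fuzzy intersection of two directed edge weights.

    Both operations use the product t-norm:
      intersection = w_ij * w_ji
      union        = w_ij + w_ji - w_ij * w_ji
    """
    if not 0.0 <= set_op_mix_ratio <= 1.0:
        raise ValueError("set_op_mix_ratio must be between 0.0 and 1.0")
    intersection = w_ij * w_ji
    union = w_ij + w_ji - intersection
    # 1.0 gives the pure union, 0.0 the pure intersection.
    return set_op_mix_ratio * union + (1.0 - set_op_mix_ratio) * intersection
```

With weights 0.8 and 0.5, a ratio of 1.0 yields about 0.9 (union), 0.0 yields 0.4 (intersection), and 0.5 lands halfway between the two.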
local_connectivity
: int (optional
, default 1)
: The local connectivity required -- i.e. the number of nearest
neighbors that should be assumed to be connected at a local level.
The higher this value the more connected the manifold becomes
locally. In practice this should not be more than the local intrinsic
dimension of the manifold.
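One way to picture this parameter is the simplified sketch below, modeled on how upstream UMAP treats local connectivity. The function `local_memberships` and the fixed bandwidth `sigma` are hypothetical; a real implementation would tune the bandwidth per point.

```python
import math

def local_memberships(sorted_dists, local_connectivity=1, sigma=1.0):
    """Convert one point's sorted neighbor distances into fuzzy memberships.

    The distance to the `local_connectivity`-th nearest neighbor (rho) is
    treated as fully connected; farther neighbors decay exponentially.
    """
    positive = [d for d in sorted_dists if d > 0.0]
    if not positive:
        # Only zero distances (duplicates): everything is fully connected.
        return [1.0 for _ in sorted_dists]
    k = min(local_connectivity, len(positive))
    rho = positive[k - 1] if k >= 1 else 0.0
    return [1.0 if d <= rho else math.exp(-(d - rho) / sigma)
            for d in sorted_dists]
```

Raising `local_connectivity` from 1 to 2 makes the second neighbor fully connected as well, which is what "more connected locally" means in this sketch.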
repulsion_strength
: float (optional
, default 1.0)
: Weighting applied to negative samples in low dimensional embedding
optimization. Values higher than one will result in greater weight
being given to negative samples.
negative_sample_rate
: int (optional
, default 5)
: The number of negative samples to select per positive sample
in the optimization process. Increasing this value will result
in greater repulsive force being applied, greater optimization
cost, but slightly more accuracy.
transform_queue_size
: float (optional
, default 4.0)
: For transform operations (embedding new points using a trained model),
this will control how aggressively to search for nearest neighbors.
Larger values will result in slower performance but more accurate
nearest neighbor evaluation.
a
: float (optional
, default None)
: More specific parameters controlling the embedding. If None these
values are set automatically as determined by min_dist
and
spread
.
b
: float (optional
, default None)
: More specific parameters controlling the embedding. If None these
values are set automatically as determined by min_dist
and
spread
.
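To make the link between min_dist, spread, a and b more concrete, the sketch below follows the approach used by upstream UMAP, where a and b are chosen so that a smooth curve approximates a target shaped by min_dist and spread. Whether AMAP does exactly this is an assumption, and all function names here are hypothetical. The block only measures how well a given (a, b) pair fits; it does not perform the fit.

```python
import math

def target_membership(x, min_dist=0.1, spread=1.0):
    """Desired membership at embedded distance x.

    Equal to 1 inside min_dist, then decays exponentially at scale `spread`.
    """
    return 1.0 if x < min_dist else math.exp(-(x - min_dist) / spread)

def low_dim_curve(x, a, b):
    """Smooth curve 1 / (1 + a * x^(2b)) that stands in for the target."""
    return 1.0 / (1.0 + a * x ** (2.0 * b))

def fit_error(a, b, min_dist=0.1, spread=1.0, n_points=300):
    """Sum of squared differences between both curves on [0, 3 * spread]."""
    step = 3.0 * spread / (n_points - 1)
    xs = [i * step for i in range(n_points)]
    return sum(
        (low_dim_curve(x, a, b) - target_membership(x, min_dist, spread)) ** 2
        for x in xs
    )
```

For min_dist=0.1 and spread=1.0, upstream UMAP arrives at roughly a = 1.577 and b = 0.895; `fit_error(1.577, 0.895)` is lower than `fit_error(1.0, 1.0)`, which shows why those values are preferred over a naive choice.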
random_state
: int, RandomState instance
or None
, optional (default: None)
: If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used
by np.random
.
metric_kwds
: dict (optional
, default None)
: Arguments to pass on to the metric, such as the p
value for
Minkowski distance. If None then no arguments are passed on.
angular_rp_forest
: bool (optional
, default False)
: Whether to use an angular random projection forest to initialise
the approximate nearest neighbor search. This can be faster, but is
mostly only useful for metrics that use an angular-style distance such
as cosine, correlation etc. In the case of those metrics angular forests
will be chosen automatically.
target_n_neighbors
: int (optional
, default -1)
: The number of nearest neighbors to use to construct the target simplicial
set. If set to -1 use the n_neighbors
value.
target_metric
: string
or callable (optional
, default 'categorical')
: The metric used to measure distance for a target array when using supervised
dimension reduction. By default this is 'categorical', which will measure
distance in terms of whether categories match or are different. Furthermore,
if semi-supervised learning is required, target values of -1 will be treated as
unlabelled under the 'categorical' metric. If the target array takes
continuous values (e.g. for a regression problem) then metric of 'l1'
or 'l2' is probably more appropriate.
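The handling of matching, differing, and unlabelled (-1) categories can be sketched as an edge reweighting, modeled on upstream UMAP. The function `reweight_edge` and the penalties `unknown_dist` and `far_dist` are illustrative assumptions, not documented dbMAP parameters.

```python
import math

def reweight_edge(weight, label_i, label_j, unknown_dist=1.0, far_dist=5.0):
    """Down-weight a graph edge according to the labels at its endpoints."""
    if label_i == -1 or label_j == -1:
        # At least one endpoint is unlabelled: apply a mild penalty.
        return weight * math.exp(-unknown_dist)
    if label_i != label_j:
        # Categories differ: apply a strong penalty.
        return weight * math.exp(-far_dist)
    # Categories match: keep the edge as it is.
    return weight
```

Edges between points of the same category survive unchanged, so supervised embeddings pull same-label points together while unlabelled points are only mildly affected.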
target_metric_kwds
: dict (optional
, default None)
: Keyword argument to pass to the target metric when performing
supervised dimension reduction. If None then no arguments are passed on.
target_weight
: float (optional
, default 0.5)
: weighting factor between data topology and target topology. A value of
0.0 weights entirely on data, a value of 1.0 weights entirely on target.
The default of 0.5 balances the weighting equally between data and target.
transform_seed
: int (optional
, default 42)
: Random seed used for the stochastic aspects of the transform operation.
This ensures consistency in transform operations.
verbose
: bool (optional
, default False)
: Controls verbosity of logging.
unique
: bool (optional
, default False)
: Controls if the rows of your data should be uniqued before being
embedded. If you have more duplicates than you have n_neighbors
you can have identical data points lying in different regions of
your space. It also violates the definition of a metric.
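A minimal sketch of what "uniquing" rows involves is shown below: embed only the distinct rows, then copy each result back to every original row. The helper names are hypothetical, and unlike `numpy.unique` this version keeps rows in first-occurrence order rather than sorting them.

```python
def unique_rows(rows):
    """Return (unique, inverse) so that unique[inverse[i]] equals rows[i]."""
    index_of = {}
    unique, inverse = [], []
    for row in rows:
        key = tuple(row)
        if key not in index_of:
            # First time this row is seen: register it as a new unique row.
            index_of[key] = len(unique)
            unique.append(list(row))
        inverse.append(index_of[key])
    return unique, inverse

def expand_embedding(unique_embedding, inverse):
    """Copy each unique row's embedding back to every original row."""
    return [unique_embedding[i] for i in inverse]
```

With this scheme, duplicated inputs always share one embedded position instead of being scattered across the space.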
Ancestors (in MRO)
Methods
Method fit
{#dbmap.umapper.AMAP.fit}
def fit( self, X, y=None )
Fit X into an embedded space. Optionally use y for supervised dimension reduction.
Parameters
X
: array, shape (n_samples, n_features)
or (n_samples, n_samples)
: If the metric is 'precomputed' X must be a square distance
matrix. Otherwise it contains a sample per row. If the method
is 'exact', X may be a sparse matrix of type 'csr', 'csc'
or 'coo'.
y
: array, shape (n_samples)
: A target array for supervised dimension reduction. How this is
handled is determined by the parameters AMAP was instantiated with.
The relevant attributes are target_metric
and
target_metric_kwds
.
Method fit_transform
{#dbmap.umapper.AMAP.fit_transform}
def fit_transform( self, X, y=None )
Fit X into an embedded space and return that transformed output.
Parameters
X
: array, shape (n_samples, n_features)
or (n_samples, n_samples)
: If the metric is 'precomputed' X must be a square distance
matrix. Otherwise it contains a sample per row.
y
: array, shape (n_samples)
: A target array for supervised dimension reduction. How this is
handled is determined by the parameters AMAP was instantiated with.
The relevant attributes are target_metric
and
target_metric_kwds
.
Returns
X_new
: array, shape (n_samples, n_components)
: Embedding of the training data in low-dimensional space.
Method inverse_transform
{#dbmap.umapper.AMAP.inverse_transform}
def inverse_transform( self, X )
Transform X in the existing embedded space back into the input data space and return that transformed output.
Parameters
X
: array, shape (n_samples, n_components)
: New points to be inverse transformed.
Returns
X_new
: array, shape (n_samples, n_features)
: Generated data points in data space.
Method transform
{#dbmap.umapper.AMAP.transform}
def transform( self, X )
Transform X into the existing embedded space and return that transformed output.
Parameters
X
: array, shape (n_samples, n_features)
: New data to be transformed.
Returns
X_new
: array, shape (n_samples, n_components)
: Embedding of the new data in low-dimensional space.
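A possible usage pattern for the methods above is sketched below. It assumes that `AMAP` is importable from `dbmap.umapper` (as the anchors in this page suggest) and that its constructor accepts the documented `min_dist` and `random_state` parameters as keyword arguments. The helper `embed_with_amap` is hypothetical and returns None when dbmap or numpy is not installed.

```python
def embed_with_amap(n_samples=60, n_features=8, seed=0):
    """Fit AMAP on random data and embed a few held-out rows.

    Returns (train_embedding, new_embedding), or None when dbmap or numpy
    cannot be imported in the current environment.
    """
    try:
        import numpy as np
        from dbmap.umapper import AMAP
    except ImportError:
        return None

    rng = np.random.RandomState(seed)
    X_train = rng.rand(n_samples, n_features)  # one sample per row
    X_new = rng.rand(5, n_features)            # unseen points

    model = AMAP(min_dist=0.1, random_state=seed)
    train_embedding = model.fit_transform(X_train)  # (n_samples, n_components)
    new_embedding = model.transform(X_new)          # (5, n_components)
    return train_embedding, new_embedding
```

Calling `fit` followed by `transform` on the same data is the two-step equivalent of `fit_transform` when the fitted model needs to be reused for new points.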
Class DataFrameUMAP
{#dbmap.umapper.DataFrameUMAP}
class DataFrameUMAP( metrics, n_neighbors=15, n_components=2, output_metric='euclidean', output_metric_kwds=None, n_epochs=None, learning_rate=1.0, init='spectral', min_dist=0.1, spread=1.0, set_op_mix_ratio=1.0, local_connectivity=1.0, repulsion_strength=1.0, negative_sample_rate=5, transform_queue_size=4.0, a=None, b=None, random_state=None, angular_rp_forest=False, target_n_neighbors=-1, target_metric='categorical', target_metric_kwds=None, target_weight=0.5, transform_seed=42, verbose=False )
Base class for all estimators in scikit-learn
Notes
All estimators should specify all the parameters that can be set
at the class level in their __init__
as explicit keyword
arguments (no *args
or **kwargs
).
Ancestors (in MRO)
Methods
Method fit
{#dbmap.umapper.DataFrameUMAP.fit}
def fit( self, X, y=None )
Generated by pdoc 0.9.1 (https://pdoc3.github.io).