Utilities

sumo.utils.adjusted_rand_index(cl: numpy.ndarray, org: numpy.ndarray)

Clustering accuracy measure calculated by considering all pairs of samples and counting pairs that are assigned to the same or different clusters in the predicted and true clusterings
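
sumo's internal implementation is not shown here; as an illustration of the pair-counting formula behind the adjusted Rand index, a minimal pure-Python sketch (using plain list inputs as a stand-in for Numpy arrays) could look like:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(cl, org):
    """Pair-counting ARI between a predicted labeling (cl) and a true one (org)."""
    n = len(cl)
    # contingency table: counts of (predicted cluster, true class) pairs
    contingency = Counter(zip(cl, org))
    sum_comb = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in Counter(cl).values())
    sum_cols = sum(comb(c, 2) for c in Counter(org).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_comb - expected) / (max_index - expected)
```

ARI is invariant to label permutation, so relabeled but identical partitions still score 1.0.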

sumo.utils.check_accuracy(cl: numpy.ndarray, org: numpy.ndarray, method='purity')

Check clustering accuracy

Args:
cl (Numpy.ndarray): one dimensional array containing computed cluster ids for every node
org (Numpy.ndarray): one dimensional array containing true class ids for every node
method (str): accuracy assessment function from [‘NMI’, ‘purity’, ‘ARI’]
sumo.utils.check_categories(a: numpy.ndarray)

Check categories in data

sumo.utils.check_matrix_symmetry(m: numpy.ndarray, tol=1e-08, equal_nan=True)

Check symmetry of numpy array, after removal of missing samples
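
A rough sketch of such a check, assuming plain nested lists with NaN marking missing values (sumo itself operates on numpy arrays, and its exact handling of missing samples may differ):

```python
import math

def check_matrix_symmetry(m, tol=1e-8, equal_nan=True):
    """Return True if square matrix m (list of lists) is symmetric within tol.
    Rows whose values are all NaN (missing samples) are excluded first."""
    n = len(m)
    if any(len(row) != n for row in m):
        return False
    # keep only samples that have at least one observed value
    keep = [i for i in range(n) if not all(math.isnan(v) for v in m[i])]
    for i in keep:
        for j in keep:
            a, b = m[i][j], m[j][i]
            if math.isnan(a) or math.isnan(b):
                if equal_nan and math.isnan(a) and math.isnan(b):
                    continue
                return False
            if abs(a - b) > tol:
                return False
    return True
```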

sumo.utils.close_logger(logger)

Remove all handlers of logger
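
With the standard logging module this amounts to closing and detaching each handler; a minimal sketch:

```python
import logging

def close_logger(logger):
    """Close and detach every handler registered on the logger."""
    for handler in list(logger.handlers):  # copy: we mutate the list while iterating
        handler.close()
        logger.removeHandler(handler)
```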

sumo.utils.docstring_formatter(*args, **kwargs)

Decorator allowing for printing variable values in docstrings
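
One common way to implement such a decorator is str.format applied to the wrapped function's docstring; a sketch (the placeholder name `methods` is illustrative, not necessarily sumo's actual usage, and the function must have a docstring):

```python
def docstring_formatter(*args, **kwargs):
    """Decorator that substitutes {placeholders} in a function's docstring."""
    def decorator(func):
        func.__doc__ = func.__doc__.format(*args, **kwargs)
        return func
    return decorator

@docstring_formatter(methods="['NMI', 'purity', 'ARI']")
def check_accuracy(cl, org, method='purity'):
    """Check clustering accuracy; supported methods: {methods}"""
```

This lets a module-level constant (e.g. the list of supported methods) appear in help() output without duplicating it by hand.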

sumo.utils.extract_max_value(h: numpy.ndarray)

Select clusters based on maximum value in feature matrix H for every sample/row

Args:
h (Numpy.ndarray): feature matrix from an optimization algorithm run, of shape (n, k), where ‘n’ is the number of nodes and ‘k’ is the number of clusters
Returns:
one dimensional array containing clusters ids for every node
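
Conceptually this is a row-wise argmax; a pure-Python sketch on a nested list stand-in for H (ties resolve to the lowest cluster id, as numpy's argmax does):

```python
def extract_max_value(h):
    """Assign each sample (row of h) to the cluster (column) with the largest value."""
    return [max(range(len(row)), key=row.__getitem__) for row in h]
```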
sumo.utils.extract_ncut(a: numpy.ndarray, k: int)

Select clusters using normalized cut based on graph similarity matrix

Args:
a (Numpy.ndarray): symmetric similarity matrix
k (int): number of clusters
Returns:
one dimensional array containing clusters ids for every node
sumo.utils.extract_spectral(h: numpy.ndarray, assign_labels: str = 'kmeans', n_neighbors: int = 10, n_clusters: int = None)

Select clusters using spectral clustering of feature matrix H

Args:
h (Numpy.ndarray): feature matrix from an optimization algorithm run, of shape (n, k), where ‘n’ is the number of nodes and ‘k’ is the number of clusters
assign_labels : {‘kmeans’, ‘discretize’}, strategy to use to assign labels in the embedding space
n_neighbors (int): number of neighbors to use when constructing the affinity matrix
n_clusters (int): number of clusters; if not set, use the number of columns of ‘h’
Returns:
one dimensional array containing clusters ids for every node
sumo.utils.filter_features_and_samples(data: pandas.DataFrame, drop_features: float = 0.1, drop_samples: float = 0.1)

Filter data frame features and samples

Args:
data (pandas.DataFrame): data frame (with samples in columns and features in rows)
drop_features (float): if the percentage of missing values for a feature exceeds this value, remove this feature
drop_samples (float): if the percentage of missing values for a sample (that remains after feature dropping) exceeds this value, remove this sample
Returns:
filtered data frame
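
The two-pass filtering described above (features first, then the samples that remain) can be sketched on plain nested lists with None for missing values; sumo's pandas-based version may differ in details:

```python
def filter_features_and_samples(data, drop_features=0.1, drop_samples=0.1):
    """data: list of rows (features); columns are samples; None marks a missing value."""
    # pass 1: drop features whose fraction of missing values exceeds the threshold
    rows = [r for r in data if r.count(None) / len(r) <= drop_features]
    if not rows:
        return []
    # pass 2: drop samples (columns) with too many missing values in the remaining rows
    keep = [j for j in range(len(rows[0]))
            if sum(r[j] is None for r in rows) / len(rows) <= drop_samples]
    return [[r[j] for j in keep] for r in rows]
```

Ordering matters: dropping noisy features first can rescue samples that would otherwise be removed.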
sumo.utils.is_standardized(a: numpy.ndarray, axis: int = 1, atol: float = 0.001)

Check if matrix values are standardized (have mean equal to 0 and standard deviation equal to 1)

Args:
a (Numpy.ndarray): feature matrix
axis (int): either 0 (column-wise standardization) or 1 (row-wise standardization)
atol (float): absolute tolerance
Returns:
is_standard (bool): True if data is standardized
mean (float): maximum and minimum mean of columns/rows
std (float): maximum and minimum standard deviation of columns/rows
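
A sketch of this check using population statistics on nested lists (sumo's choice of standard-deviation estimator, e.g. the ddof, is an assumption here):

```python
from statistics import fmean, pstdev

def is_standardized(a, axis=1, atol=1e-3):
    """Check that each row (axis=1) or column (axis=0) of a (list of lists)
    has mean ~0 and population standard deviation ~1, within atol."""
    vectors = a if axis == 1 else list(zip(*a))
    means = [fmean(v) for v in vectors]
    stds = [pstdev(v) for v in vectors]
    ok = all(abs(m) <= atol for m in means) and all(abs(s - 1) <= atol for s in stds)
    return ok, (max(means), min(means)), (max(stds), min(stds))
```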
sumo.utils.load_data_text(file_path: str, sample_names: int = None, feature_names: int = None, drop_features: float = 0.1, drop_samples: float = 0.1)

Loads data from text file (with samples in columns and features in rows) into pandas.DataFrame

Args:
file_path (str): path to the tab-delimited .txt file
sample_names (int): index of the row with sample names
feature_names (int): index of the column with feature names
drop_features (float): if the percentage of missing values for a feature exceeds this value, remove this feature
drop_samples (float): if the percentage of missing values for a sample (that remains after feature dropping) exceeds this value, remove this sample
Returns:
data (pandas.DataFrame): data frame loaded from file, with missing values removed
sumo.utils.load_npz(file_path: str)

Load data from .npz file

Args:
file_path (str): path to .npz file
Returns:
dictionary with arrays as values and their indices used during saving to .npz file as keys
sumo.utils.normalized_mutual_information(cl: numpy.ndarray, org: numpy.ndarray)

Clustering accuracy measure, which takes into account mutual information between two clusterings and entropy of each cluster
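
For illustration, NMI with the geometric-mean normalization can be computed directly from the joint label distribution; whether sumo uses this exact normalization is an assumption (the sketch also assumes both labelings contain at least two classes, so neither entropy is zero):

```python
from collections import Counter
from math import log

def normalized_mutual_information(cl, org):
    """NMI(cl, org) = I(cl; org) / sqrt(H(cl) * H(org))."""
    n = len(cl)
    p_cl, p_org = Counter(cl), Counter(org)
    joint = Counter(zip(cl, org))
    # mutual information from the joint and marginal counts
    mi = sum(c / n * log(n * c / (p_cl[u] * p_org[v]))
             for (u, v), c in joint.items())
    h_cl = -sum(c / n * log(c / n) for c in p_cl.values())
    h_org = -sum(c / n * log(c / n) for c in p_org.values())
    return mi / (h_cl * h_org) ** 0.5
```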

sumo.utils.plot_metric(x: list, y: list, xlabel='x', ylabel='y', title='', file_path: str = None, color='blue', allow_omit_xticks: bool = False)

Create plot of median metric values, with ribbon between min and max values for each x

sumo.utils.purity(cl: numpy.ndarray, org: numpy.ndarray)

Clustering accuracy measure representing the percentage of nodes classified correctly
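
Purity assigns each predicted cluster its majority true class and counts the matches; a minimal sketch:

```python
from collections import Counter

def purity(cl, org):
    """Fraction of nodes whose cluster's majority true class matches their own class."""
    overlap = Counter(zip(cl, org))
    best = {}
    for (cluster, _), count in overlap.items():
        best[cluster] = max(best.get(cluster, 0), count)
    return sum(best.values()) / len(cl)
```

Note that purity is not symmetric and is trivially 1.0 when every node is its own cluster, which is why it is usually reported alongside NMI or ARI.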

sumo.utils.save_arrays_to_npz(data: Union[dict, list], file_path: str)

Save numpy arrays to .npz file

Args:
data (dict/list): list of numpy arrays or dictionary with specified keywords for every array
file_path (str): path to the output file
sumo.utils.setup_logger(logger_name, level='INFO', log_file: str = None)

Create and configure logging object
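
A plausible minimal version using the standard logging module (the format string and handler choice are assumptions, not sumo's actual configuration):

```python
import logging

def setup_logger(logger_name, level='INFO', log_file=None):
    """Create a named logger writing to stderr, or to log_file when given."""
    logger = logging.getLogger(logger_name)
    logger.setLevel(getattr(logging, level))
    handler = logging.FileHandler(log_file) if log_file else logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

Pairing this with close_logger above avoids duplicated log lines when a logger of the same name is configured more than once.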