SUMO modes

class sumo.modes.mode.SumoMode(**kwargs)

Defines modes of sumo

run(): Run mode specific functionality

prepare

SumoPrepare Class

class sumo.modes.prepare.prepare.SumoPrepare(**kwargs)

Sumo mode for data pre-processing and creation of multiplex network files. Constructor args are set in ‘prepare’ subparser.

Args:: infiles (list): comma-delimited list of paths to input files, containing standardized feature matrices, with samples in columns and features in rows (supported types of files: [‘.txt’, ‘.txt.gz’, ‘.txt.bz2’, ‘.tsv’, ‘.tsv.gz’, ‘.tsv.bz2’])

outfile (str): path to output .npz file

method (list): comma-separated list of methods for every layer (available methods: [‘euclidean’, ‘cosine’, ‘pearson’, ‘spearman’])

k (float): fraction of nearest neighbours to use for sample similarity calculation using Euclidean distance similarity

alpha (float): hyperparameter of RBF similarity kernel, for Euclidean distance similarity

missing (list): acceptable fraction of available (not missing) values for assessment of distance/similarity between pairs of samples, either one value or different values for every layer

atol (float): if input files have continuous values, sumo checks if data is standardized feature-wise, meaning all features should have mean close to zero, with standard deviation around one; use this parameter to set tolerance of standardization checks

sn (int): index of row with sample names for .txt input files

fn (int): index of column with feature names for .txt input files

df (float): if percentage of missing values for feature exceeds this value, remove feature

ds (float): if percentage of missing values for sample (that remains after feature dropping) exceeds this value, remove sample

logfile (str): path to save log file, if set to None stdout is used

log (str): sets the logging level from [‘DEBUG’, ‘INFO’, ‘WARNING’]

plot (str): path to save adjacency matrix heatmap(s), if set to None plots are displayed on screen
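The standardization check controlled by `atol` can be pictured with a small NumPy sketch. This is only an illustration under assumptions: `is_standardized` is a hypothetical helper, not part of the sumo API, and the actual check performed by sumo may differ in detail.

```python
import numpy as np


def is_standardized(features: np.ndarray, atol: float = 0.01) -> bool:
    """Hypothetical helper: features in rows, samples in columns."""
    means = np.nanmean(features, axis=1)  # per-feature mean, ignoring NaN
    stds = np.nanstd(features, axis=1)    # per-feature standard deviation
    mean_ok = np.allclose(means, 0.0, rtol=0.0, atol=atol)
    std_ok = np.allclose(stds, 1.0, rtol=0.0, atol=atol)
    return bool(mean_ok and std_ok)


rng = np.random.default_rng(0)
raw = rng.normal(loc=5.0, scale=2.0, size=(4, 50))
# feature-wise standardization: zero mean and unit standard deviation per row
scaled = (raw - raw.mean(axis=1, keepdims=True)) / raw.std(axis=1, keepdims=True)

print(is_standardized(raw), is_standardized(scaled))
```

A larger `atol` makes the check more permissive, which may be useful for data that was standardized before missing values were introduced.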

load_all_data()

Load all input files.

Returns:: list of tuples, each containing a file name (str) and a filtered feature matrix (pandas.DataFrame)
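The filtering controlled by `df` and `ds` can be sketched as follows. This is a simplified illustration with plain NumPy arrays rather than a pandas.DataFrame; the helper `drop_sparse` is hypothetical, and the thresholds are treated as fractions in [0, 1], which is an assumption.

```python
import numpy as np


def drop_sparse(matrix: np.ndarray, df: float = 0.1, ds: float = 0.1):
    """Hypothetical helper: features in rows, samples in columns."""
    # remove features whose fraction of missing values exceeds df
    keep_features = np.isnan(matrix).mean(axis=1) <= df
    filtered = matrix[keep_features, :]
    # then remove samples whose remaining fraction of missing values exceeds ds
    keep_samples = np.isnan(filtered).mean(axis=0) <= ds
    return filtered[:, keep_samples], keep_features, keep_samples


nan = np.nan
matrix = np.array([
    [1.0, 2.0, 3.0, 4.0],   # no missing values
    [nan, nan, nan, 1.0],   # 75% missing, dropped when df=0.5
    [1.0, nan, 2.0, 3.0],   # 25% missing, kept
])
result, kept_features, kept_samples = drop_sparse(matrix, df=0.5, ds=0.4)
print(result.shape)
```

Note that samples are assessed only after the feature filter has been applied, matching the order described for `ds`.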

run(): Generate similarity matrices for samples based on biological data

Similarity Metrics

sumo.modes.prepare.similarity.correlation(a: numpy.ndarray, b: numpy.ndarray, missing: float, method='pearson'): Calculate correlation similarity between two vectors

sumo.modes.prepare.similarity.cosine_sim(a: numpy.ndarray, b: numpy.ndarray, missing: float): Calculate cosine similarity between two vectors

sumo.modes.prepare.similarity.euclidean_dist(a: numpy.ndarray, b: numpy.ndarray, missing: float): Calculate Euclidean distance between two vectors of continuous variables
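The role of the `missing` argument in the pairwise functions above can be illustrated with a minimal sketch. It assumes that `missing` is the minimum fraction of positions available in both vectors and that a pair below the threshold yields NaN; the name `euclidean_dist_sketch` and the normalization by the number of shared values are choices of this example, not confirmed details of sumo.

```python
import numpy as np


def euclidean_dist_sketch(a, b, missing: float = 0.1) -> float:
    """Hypothetical stand-in for a distance function that tolerates NaN."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    available = ~np.isnan(a) & ~np.isnan(b)  # positions present in both vectors
    if available.sum() == 0 or available.mean() < missing:
        return float("nan")  # too few shared values to compare the pair
    diff = a[available] - b[available]
    # root mean squared difference, so pairs with fewer shared values stay comparable
    return float(np.sqrt(np.sum(diff ** 2) / available.sum()))


a = np.array([0.0, 3.0, np.nan, 1.0])
b = np.array([4.0, 0.0, 2.0, np.nan])
print(euclidean_dist_sketch(a, b, missing=0.1))
print(euclidean_dist_sketch(a, b, missing=0.75))
```

Here only two of the four positions are shared, so the pair is accepted with `missing=0.1` and rejected with `missing=0.75`.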

sumo.modes.prepare.similarity.feature_rbf_similarity(f: numpy.ndarray, missing: float = 0.1, n: float = 0.1, alpha: float = 0.5, distance=<function euclidean_dist>)

Generate similarity matrix using RBF kernel and supplied distance function

Args:: f (numpy.ndarray): feature matrix (n x k, where ‘n’ - samples, ‘k’ - measurements)

n (float): fraction of nearest neighbours to use for sample similarity calculation

missing (float): acceptable fraction of available (not missing) values for assessment of distance/similarity between two samples

alpha (float): hyperparameter of RBF kernel

distance: distance function accepting two vectors and the missing parameter (defaults to Euclidean distance)

Returns:: w (numpy.ndarray): symmetric matrix describing similarity between samples (n x n)
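A rough sketch of how an RBF kernel can turn pairwise distances into a similarity matrix is given below. It assumes a locally scaled bandwidth derived from the mean distance to each sample's nearest neighbours; the exact kernel used by sumo is not stated here, so the formula, the name `rbf_similarity_sketch`, and the absence of missing-value handling are all assumptions of this example.

```python
import numpy as np


def rbf_similarity_sketch(f: np.ndarray, n: float = 0.1, alpha: float = 0.5) -> np.ndarray:
    """Hypothetical RBF similarity for a complete (no NaN) samples x features matrix."""
    n_samples = f.shape[0]
    if n_samples < 2:
        raise ValueError("at least two samples are required")
    # pairwise Euclidean distances between samples (rows)
    diff = f[:, None, :] - f[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # number of nearest neighbours, derived from the fraction n
    k = min(max(1, int(np.ceil(n * n_samples))), n_samples - 1)
    # mean distance of each sample to its k nearest neighbours (column 0 is the sample itself)
    local = np.sort(dist, axis=1)[:, 1:k + 1].mean(axis=1)
    # pair-specific bandwidth built from both neighbourhoods and the pair distance
    sigma = (local[:, None] + local[None, :] + dist) / 3.0
    eps = np.finfo(float).eps  # guards against division by zero for duplicate samples
    return np.exp(-(dist ** 2) / (alpha * sigma ** 2 + eps))


rng = np.random.default_rng(0)
f = rng.random((6, 4))
w = rbf_similarity_sketch(f, n=0.5, alpha=0.5)
print(w.shape)
```

In this sketch a smaller `alpha` makes similarity decay faster with distance, while a larger `n` smooths the bandwidth over a wider neighbourhood.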

sumo.modes.prepare.similarity.feature_to_adjacency(f: numpy.ndarray, missing: float = 0.1, method: str = 'euclidean', n: float = None, alpha: float = None)

Generate similarity matrix from genomic assay

Args:: f (numpy.ndarray): feature matrix (n x k, where ‘n’ - samples, ‘k’ - measurements)

missing (float): acceptable fraction of available (not missing) values for assessment of distance/similarity between two samples (the default of 0.1 means that up to 90% of missing values is acceptable)

method (str): similarity method selected from: [‘euclidean’, ‘cosine’, ‘pearson’, ‘spearman’]

n (float): parameter of euclidean similarity method, fraction of nearest neighbours of sample

alpha (float): parameter of euclidean similarity method, RBF kernel hyperparameter

Returns:: sim (numpy.ndarray): symmetric matrix describing similarity between samples (n x n)

run

SumoRun Class

class sumo.modes.run.run.SumoRun(**kwargs)

Sumo mode for factorization of multiplex network to identify molecular subtypes. Constructor args are set in ‘run’ subparser.

Args:: infile (str): input .npz file containing adjacency matrices for every network layer and sample names (file created by running program with mode “prepare”) - consecutive adjacency arrays in file are indexed in following way: “0”, “1” … and index of sample name vector is “samples”

k (int): number of clusters

outdir (str): path to save output files

sparsity (list): list of sparsity penalty values for H matrix (if multiple values are given, sumo will try all of them and select the best result)

n (int): number of repetitions

method (str): method of cluster extraction, selected from [‘max_value’, ‘spectral’]

max_iter (int): maximum number of iterations for factorization

tol (float): if objective cost function value fluctuation is smaller than this value, stop iterations before reaching max_iter

subsample (float): fraction of samples randomly removed from each run, cannot be greater than 0.5

calc_cost (int): number of steps between every calculation of objective cost function

logfile (str): path to save log file, if set to None stdout is used

log (str): sets the logging level from [‘DEBUG’, ‘INFO’, ‘WARNING’]

h_init (int): index of adjacency matrix to use for H matrix initialization, if set to None average adjacency matrix is used

t (int): number of threads

rep (int): number of times consensus matrix is created for the purpose of assessing clustering quality
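The layout of the input .npz file described for `infile` (adjacency arrays under the keys "0", "1", ... and sample names under "samples") can be reproduced with plain NumPy. The following is a small sketch with toy matrices and a temporary path; it only illustrates the key scheme and does not replace running the 'prepare' mode.

```python
import os
import tempfile

import numpy as np

# toy data: two network layers for three samples
samples = np.array(["s1", "s2", "s3"])
layer0 = np.eye(3)
layer1 = np.full((3, 3), 0.5)

path = os.path.join(tempfile.mkdtemp(), "multiplex.npz")
# keys follow the scheme described above: "0", "1", ... and "samples"
np.savez(path, **{"0": layer0, "1": layer1, "samples": samples})

with np.load(path) as data:
    n_layers = len([key for key in data.files if key.isdigit()])
    layers = [data[str(i)] for i in range(n_layers)]
    sample_names = data["samples"]

print(n_layers, sample_names.tolist())
```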

run(): Cluster multiplex network using non-negative matrix tri-factorization
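The `n`, `subsample` and `rep` arguments suggest that repeated factorizations on subsampled data are combined into a consensus matrix. One common way to build such a matrix is sketched below: each entry is the fraction of runs, among those containing both samples, in which the two samples share a cluster. This is a generic construction assumed for illustration; `consensus_matrix` is a hypothetical helper and sumo's own procedure may differ.

```python
import numpy as np


def consensus_matrix(label_runs, n_samples: int) -> np.ndarray:
    """Hypothetical helper: label_runs is a list of (sample_indices, labels) pairs."""
    together = np.zeros((n_samples, n_samples))  # times a pair shared a cluster
    sampled = np.zeros((n_samples, n_samples))   # times a pair was drawn together
    for indices, labels in label_runs:
        indices = np.asarray(indices)
        labels = np.asarray(labels)
        same = (labels[:, None] == labels[None, :]).astype(float)
        sampled[np.ix_(indices, indices)] += 1.0
        together[np.ix_(indices, indices)] += same
    # pairs that were never drawn together keep a consensus of zero
    return np.divide(together, sampled, out=np.zeros_like(together), where=sampled > 0)


runs = [
    ([0, 1, 2], [0, 0, 1]),  # run 1: sample 3 was removed by subsampling
    ([1, 2, 3], [0, 1, 1]),  # run 2: sample 0 was removed by subsampling
]
consensus = consensus_matrix(runs, n_samples=4)
print(consensus)
```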

NMF Solvers

class sumo.modes.run.solver.SumoNMFResults(graph: sumo.network.MultiplexNet, h: numpy.ndarray, s, objval, steps: int, logger: logging.Logger, sample_ids, **kwargs)

Wrapper class for SumoNMF factorization results

extract_clusters(method: str)

Extract cluster labels using selected method

Args:: method (str): either “max_value” for extraction based on maximum value in each row of h matrix or “spectral” for spectral clustering on h matrix values
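The "max_value" strategy described above amounts to taking the column index of the largest entry in each row of the H matrix. A minimal NumPy sketch with a toy H matrix (the values are made up for illustration):

```python
import numpy as np

# toy soft cluster indicator matrix: 3 samples (rows) x 2 clusters (columns)
h = np.array([
    [0.9, 0.1],
    [0.2, 0.7],
    [0.6, 0.3],
])

# each sample is assigned to the cluster with the maximum value in its row
labels = np.argmax(h, axis=1)
print(labels.tolist())
```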

class sumo.modes.run.solver.SumoSolver(graph: sumo.network.MultiplexNet, nbins: int, bin_size: int = None, rseed: int = None)

Defines solver of sumo

Args:: graph (MultiplexNet): network object, containing data about connections between nodes in each layer in form of adjacency matrices

nbins (int): number of bins, to distribute samples into

bin_size (int): size of bin, if None set to number of samples

calculate_avg_adjacency(): Creates average adjacency matrix

create_sample_bins() → list

Separate samples randomly into bins of set size, while making sure that each sample is allocated in at least one bin.

Returns: list of arrays containing indices of samples allocated to the bin
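One way to satisfy the guarantee that every sample lands in at least one bin is to deal a random permutation of the samples across the bins first and then top each bin up with random extra samples. The sketch below follows that idea; `create_bins_sketch` is a hypothetical function and not necessarily how sumo allocates bins.

```python
import numpy as np


def create_bins_sketch(n_samples: int, nbins: int, bin_size: int, rseed=None) -> list:
    """Hypothetical helper returning a list of index arrays, one per bin."""
    if bin_size > n_samples or nbins * bin_size < n_samples:
        raise ValueError("bins cannot cover all samples with these settings")
    rng = np.random.default_rng(rseed)
    order = rng.permutation(n_samples)
    bins = []
    for i in range(nbins):
        seed = order[i::nbins]  # round-robin share: covers every sample exactly once
        rest = np.setdiff1d(np.arange(n_samples), seed)
        # fill the bin up to bin_size with samples it does not contain yet
        extra = rng.choice(rest, size=bin_size - seed.size, replace=False)
        bins.append(np.sort(np.concatenate([seed, extra])))
    return bins


bins = create_bins_sketch(n_samples=10, nbins=3, bin_size=5, rseed=1)
print([b.tolist() for b in bins])
```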

factorize(sparsity_penalty: float, k: int, max_iter: int, tol: float, calc_cost: int, logger_name: str, bin_id: int) → sumo.modes.run.solver.SumoNMFResults: Run solver specific factorization

Unsupervised SUMO

class sumo.modes.run.solvers.unsupervised_sumo.UnsupervisedSumoNMF(graph: sumo.network.MultiplexNet, nbins: int, bin_size: int = None, rseed: int = None)

Unsupervised SUMO solver (A(i)=HS(i)H^T formulation)

factorize(sparsity_penalty: float, k: int, max_iter: int = 500, tol: float = 1e-05, calc_cost: int = 1, h_init: int = None, logger_name: str = None, bin_id: int = None) → sumo.modes.run.solver.SumoNMFResults

Run tri-factorization

Args:: sparsity_penalty (float): ‘η’ value, corresponding to sparsity penalty for H

k (int): expected number of clusters

max_iter (int): maximum number of iterations

tol (float): if objective cost function value fluctuation is smaller than this value, stop iterations before reaching max_iter

calc_cost (int): number of steps between every calculation of objective cost function

h_init (int): index of adjacency matrix to use for H matrix initialization or None for initialization using average adjacency

logger_name (str): name of existing logger object, if not supplied new main logger is used

bin_id (int): id of sample bin created in SumoNMF constructor (the default of None means clustering all samples instead of samples in given bin)

Returns:: SumoNMFResults object (with result feature matrix / soft cluster indicator matrix (H array), and the list of result S matrices for each graph layer)
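The A(i)=HS(i)H^T formulation can be made concrete with a short NumPy sketch: a single H matrix is shared across layers, while every layer has its own S matrix. The example builds toy layers from known factors and measures the squared Frobenius reconstruction error. It deliberately leaves out the sparsity penalty and any handling of missing values, so `reconstruction_error` is a hypothetical helper and not the full objective optimized by the solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, k = 6, 2

h = rng.random((n_samples, k))  # shared non-negative sample x cluster matrix
s_layers = []
for _ in range(2):
    s = rng.random((k, k))
    s_layers.append((s + s.T) / 2)  # symmetric cluster x cluster matrix per layer

# every layer is reconstructed from the shared H and its own S
a_layers = [h @ s @ h.T for s in s_layers]


def reconstruction_error(a_layers, h, s_layers) -> float:
    """Sum over layers of the squared Frobenius norm of A(i) - H S(i) H^T."""
    return float(sum(np.linalg.norm(a - h @ s @ h.T, "fro") ** 2
                     for a, s in zip(a_layers, s_layers)))


print(reconstruction_error(a_layers, h, s_layers))
```

A stopping rule such as `tol` would compare values of a cost like this one between iterations, evaluated every `calc_cost` steps.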

Supervised SUMO

class sumo.modes.run.solvers.supervised_sumo.SupervisedSumoNMF(graph: sumo.network.MultiplexNet, priors, nbins: int, bin_size: int = None, rseed: int = None)

Supervised SUMO solver (A(i)=HS(i)H^T formulation)

create_sample_bins() → list

Separate samples randomly into bins of set size, while making sure that each sample is allocated in at least one bin and each prior label is represented equally.

Returns: list of arrays containing indices of samples allocated to the bin

factorize(sparsity_penalty: float, k: int, max_iter: int = 500, tol: float = 1e-05, calc_cost: int = 1, logger_name: str = None, bin_id: int = None) → sumo.modes.run.solver.SumoNMFResults

Run tri-factorization

Args:: sparsity_penalty (float): ‘η’ value, corresponding to sparsity penalty for H

k (int): expected number of clusters

max_iter (int): maximum number of iterations

tol (float): if objective cost function value fluctuation is smaller than this value, stop iterations before reaching max_iter

calc_cost (int): number of steps between every calculation of objective cost function

logger_name (str): name of existing logger object, if not supplied new main logger is used

bin_id (int): id of sample bin created in SumoNMF constructor (the default of None means clustering all samples instead of samples in given bin)

Returns:: SumoNMFResults object (with result feature matrix / soft cluster indicator matrix (H array), and the list of result S matrices for each graph layer)

evaluate

SumoEvaluate Class

class sumo.modes.evaluate.evaluate.SumoEvaluate(**kwargs)

Sumo mode for evaluating accuracy of clustering. Constructor args are set in ‘evaluate’ subparser.

Args:: infile (str): input .tsv file containing sample names in ‘sample’ and clustering labels in ‘label’ column (clusters.tsv file created by running sumo with mode ‘run’)

labels (str): .tsv of the same structure as input file

metric (str): one of metrics ([‘NMI’, ‘purity’, ‘ARI’]) for accuracy evaluation, if set to None all metrics are calculated

logfile (str): path to save log file, if set to None stdout is used

log (str): sets the logging level from [‘DEBUG’, ‘INFO’, ‘WARNING’]

load_tsv(fname: str): Load .tsv file

run(): Evaluate clustering results, given set of labels
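Of the listed metrics, purity is the simplest to write out: every cluster is credited with its most frequent true label, and the credited counts are divided by the number of samples. The sketch below uses only the standard library; `purity` is a hypothetical helper shown for illustration and is not taken from sumo.

```python
from collections import Counter


def purity(predicted, truth) -> float:
    """Fraction of samples that carry the majority true label of their cluster."""
    clusters = {}
    for cluster, label in zip(predicted, truth):
        clusters.setdefault(cluster, []).append(label)
    # count of the most common true label within every predicted cluster
    majority = sum(Counter(members).most_common(1)[0][1]
                   for members in clusters.values())
    return majority / len(predicted)


predicted = [0, 0, 0, 1, 1, 1]
truth = ["a", "a", "b", "b", "b", "a"]
print(purity(predicted, truth))
```

Purity does not penalize splitting the data into many small clusters, which is one reason to report it alongside NMI and ARI.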

interpret

SumoInterpret Class

class sumo.modes.interpret.interpret.SumoInterpret(**kwargs)

Sumo mode for interpreting clustering results. Constructor args are set in ‘interpret’ subparser.

Args:: sumo_results (str): path to sumo_results.npz (created by running program with mode “run”)

infiles (list): comma-delimited list of paths to input files, containing standardized feature matrices, with samples in columns and features in rows (supported types of files: [‘.txt’, ‘.txt.gz’, ‘.txt.bz2’, ‘.tsv’, ‘.tsv.gz’, ‘.tsv.bz2’])

output_prefix (str): prefix of output files - sumo will create two output files (1) .tsv file containing matrix (features x clusters), where the value in each cell is the importance of the feature in that cluster; (2) .hits.tsv file containing features of most importance

hits (int): sets number of most important features for every cluster, that are logged in .hits.tsv file

max_iter (int): maximum number of iterations, while searching through hyperparameter space

n_folds (int): number of folds for model cross validation

t (int): number of threads

seed (int): random state

sn (int): index of row with sample names for .txt input files

fn (int): index of column with feature names for .txt input files

df (float): if percentage of missing values for feature exceeds this value, remove feature

ds (float): if percentage of missing values for sample (that remains after feature dropping) exceeds this value, remove sample

logfile (str): path to save log file, if set to None stdout is used

log (str): sets the logging level from [‘DEBUG’, ‘INFO’, ‘WARNING’]

create_classifier(x: numpy.ndarray, y: numpy.ndarray)

Create a gradient boosting method classifier

Args:: x (numpy.ndarray): input feature matrix

y (numpy.ndarray): one dimensional array of labels in classification

Returns:: LGBM classifier

run(): Find features that support cluster separation