SUMO modes

class sumo.modes.mode.SumoMode(**kwargs)

Defines modes of sumo

run()

Run mode specific functionality

prepare

SumoPrepare Class

class sumo.modes.prepare.prepare.SumoPrepare(**kwargs)

Sumo mode for data pre-processing and creation of multiplex network files. Constructor args are set in ‘prepare’ subparser.

Args:
infiles (list): comma-delimited list of paths to input files, containing standardized feature matrices, with samples in columns and features in rows (supported types of files: [‘.txt’, ‘.txt.gz’, ‘.txt.bz2’, ‘.tsv’, ‘.tsv.gz’, ‘.tsv.bz2’])
outfile (str): path to output .npz file
method (list): comma-separated list of methods for every layer (available methods: [‘euclidean’, ‘cosine’, ‘pearson’, ‘spearman’])
k (float): fraction of nearest neighbours to use for sample similarity calculation using Euclidean distance similarity
alpha (float): hyperparameter of RBF similarity kernel, for Euclidean distance similarity
missing (list): acceptable fraction of available (not missing) values for assessment of distance/similarity between pairs of samples, either one value or different values for every layer
atol (float): if input files have continuous values, sumo checks if data is standardized feature-wise, meaning all features should have mean close to zero, with standard deviation around one; use this parameter to set tolerance of standardization checks
sn (int): index of row with sample names for .txt input files
fn (int): index of column with feature names for .txt input files
df (float): if percentage of missing values for feature exceeds this value, remove feature
ds (float): if percentage of missing values for sample (that remains after feature dropping) exceeds this value, remove sample
logfile (str): path to save log file, if set to None stdout is used
log (str): sets the logging level from [‘DEBUG’, ‘INFO’, ‘WARNING’]
plot (str): path to save adjacency matrix heatmap(s), if set None plots are displayed on screen
load_all_data()

Load all input files

Returns:
list of tuples, each containing file name (str) and filtered feature matrix (pandas.DataFrame)
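The filtering controlled by df and ds can be pictured with the minimal sketch below. It is not the code used by load_all_data(): the helper name filter_missing is hypothetical, it works on a plain NumPy array rather than a pandas.DataFrame, and it assumes both thresholds are given as fractions between 0 and 1.

```python
import numpy as np


def filter_missing(matrix, df=0.1, ds=0.1):
    """Drop features (rows), then samples (columns), with too many NaNs.

    Hypothetical helper: df and ds are treated as fractions in [0, 1].
    """
    matrix = np.asarray(matrix, dtype=float)
    # Fraction of missing values in every feature (row).
    feature_missing = np.isnan(matrix).mean(axis=1)
    kept_features = feature_missing <= df
    matrix = matrix[kept_features]
    # Fraction of missing values in every sample (column),
    # computed on what remains after feature dropping.
    sample_missing = np.isnan(matrix).mean(axis=0)
    kept_samples = sample_missing <= ds
    return matrix[:, kept_samples], kept_features, kept_samples
```

Features are filtered first, so a sample is judged only on the features that survive, which matches the order described for the ds argument.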
run()

Generate similarity matrices for samples based on biological data

Similarity Metrics

sumo.modes.prepare.similarity.correlation(a, b, missing: float, method='pearson')

Calculate correlation similarity between two vectors

sumo.modes.prepare.similarity.cosine_sim(a, b, missing: float)

Calculate cosine similarity between two vectors

sumo.modes.prepare.similarity.euclidean_dist(a, b, missing: float)

Calculate euclidean distance between two vectors of continuous variables
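As an illustration of how the missing argument might interact with a pairwise metric, here is a minimal sketch of a Euclidean distance that only uses positions available in both vectors. The function name euclidean_dist_sketch is hypothetical, and returning NaN when too few values are shared is an assumption, not the confirmed behaviour of sumo.

```python
import numpy as np


def euclidean_dist_sketch(a, b, missing=0.1):
    """Euclidean distance over positions present in both vectors.

    Hypothetical helper: returns NaN when the fraction of shared
    (non-missing) positions is below `missing`.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Positions where both vectors hold a value.
    available = ~np.isnan(a) & ~np.isnan(b)
    n_shared = int(available.sum())
    if n_shared == 0 or n_shared / a.size < missing:
        return float("nan")
    diff = a[available] - b[available]
    # Plain Euclidean distance; no rescaling for the dropped positions.
    return float(np.sqrt(np.sum(diff ** 2)))
```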

sumo.modes.prepare.similarity.feature_rbf_similarity(f: numpy.ndarray, missing: float = 0.1, n: float = 0.1, alpha: float = 0.5, distance=<function euclidean_dist>)

Generate similarity matrix using RBF kernel and supplied distance function

Args:
f (Numpy.ndarray): Feature matrix (n x k, where ‘n’ - samples, ‘k’ - measurements)
n (float): fraction of nearest neighbours to use for samples similarity calculation
missing (float): acceptable fraction of values for assessment of distance/similarity between two samples
alpha (float): hyperparameter of RBF kernel
distance: distance function accepting two vectors and missing parameter (default of Euclidean distance)
Returns:
w (Numpy.ndarray): symmetric matrix describing similarity between samples (n x n)
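One common way to turn a distance matrix into an RBF similarity with a nearest-neighbour scale is sketched below. The exact kernel used by feature_rbf_similarity is not stated here, so the locally scaled form (the bandwidth averaged from each sample's mean nearest-neighbour distance and the pairwise distance) is an assumption, as is the helper name rbf_similarity_sketch. The sketch expects a complete, symmetric distance matrix with a zero diagonal and at least two samples.

```python
import numpy as np


def rbf_similarity_sketch(dist, n=0.1, alpha=0.5):
    """Locally scaled RBF similarity from a symmetric distance matrix.

    Hypothetical helper; the kernel form is an assumption.
    """
    dist = np.asarray(dist, dtype=float)
    n_samples = dist.shape[0]
    # Number of nearest neighbours: a fraction of all samples, at least one.
    k = min(max(1, int(round(n * n_samples))), n_samples - 1)
    # After sorting, column 0 is the sample itself (distance 0), so skip it.
    nearest = np.sort(dist, axis=1)[:, 1:k + 1]
    mean_nn = nearest.mean(axis=1)
    # Pairwise bandwidth from both samples' neighbourhoods and their distance.
    eps = (mean_nn[:, None] + mean_nn[None, :] + dist) / 3.0
    return np.exp(-dist ** 2 / (alpha * eps))
```

Because both the distance matrix and the bandwidth are symmetric, the result is a symmetric (n x n) matrix with ones on the diagonal.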
sumo.modes.prepare.similarity.feature_to_adjacency(f: numpy.ndarray, missing: float = 0.1, method: str = 'euclidean', n: float = None, alpha: float = None)

Generate similarity matrix from genomic assay

Args:
f (Numpy.ndarray): Feature matrix (n x k, where ‘n’ - samples, ‘k’ - measurements)
missing (float): acceptable fraction of values for assessment of distance/similarity between two samples (the default of 0.1 means that up to 90% of missing values are acceptable)
method (str): similarity method selected from: [‘euclidean’, ‘cosine’, ‘pearson’, ‘spearman’]
n (float): parameter of euclidean similarity method, fraction of nearest neighbours of sample
alpha (float): parameter of euclidean similarity method, RBF kernel hyperparameter
Returns:
sim (Numpy.ndarray): symmetric matrix describing similarity between samples (n x n)

run

SumoRun Class

class sumo.modes.run.run.SumoRun(**kwargs)

Sumo mode for factorization of multiplex network to identify molecular subtypes. Constructor args are set in ‘run’ subparser.

Args:
infile (str): input .npz file containing adjacency matrices for every network layer and sample names (file created by running program with mode “prepare”) - consecutive adjacency arrays in file are indexed in following way: “0”, “1” … and index of sample name vector is “samples”
k (int): number of clusters
outdir (str): path to save output files
sparsity (list): list of sparsity penalty values for H matrix (if multiple values are given, sumo will try all of them and select the best result)
n (int): number of repetitions
method (str): method of cluster extraction, selected from [‘max_value’, ‘spectral’]
max_iter (int): maximum number of iterations for factorization
tol (float): if objective cost function value fluctuation is smaller than this value, stop iterations before reaching max_iter
subsample (float): fraction of samples randomly removed from each run, cannot be greater than 0.5
calc_cost (int): number of steps between every calculation of objective cost function
logfile (str): path to save log file, if set to None stdout is used
log (str): sets the logging level from [‘DEBUG’, ‘INFO’, ‘WARNING’]
h_init (int): index of adjacency matrix to use for H matrix initialization, if set to None average adjacency matrix is used
t (int): number of threads
rep (int): number of times consensus matrix is created for the purpose of assessing clustering quality
run()

Cluster multiplex network using non-negative matrix tri-factorization
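The layout of the input .npz file described for infile (adjacency arrays under the keys "0", "1", … and sample names under "samples") can be read with plain NumPy. The sketch below is not sumo's own loader, and load_layers is a hypothetical name. It writes a tiny two-layer file to a temporary directory and reads it back.

```python
import os
import tempfile

import numpy as np


def load_layers(path):
    """Read adjacency matrices ("0", "1", ...) and sample names from .npz."""
    with np.load(path) as data:
        samples = data["samples"]
        layers = []
        index = 0
        # Layers are stored under consecutive string keys: "0", "1", ...
        while str(index) in data.files:
            layers.append(data[str(index)])
            index += 1
    return layers, samples


# Round trip with a toy file (two layers, two samples).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "network.npz")
    np.savez(path, **{"0": np.eye(2), "1": np.ones((2, 2)),
                      "samples": np.array(["s1", "s2"])})
    layers, samples = load_layers(path)
```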

NMF Solvers

class sumo.modes.run.solver.SumoNMFResults(graph: sumo.network.MultiplexNet, h, s, objval, steps: int, logger: logging.Logger, sample_ids, **kwargs)

Wrapper class for SumoNMF factorization results

extract_clusters(method: str)

Extract cluster labels using selected method

Args:
method (str): either “max_value” for extraction based on maximum value in each row of h matrix or “spectral” for spectral clustering on h matrix values
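The “max_value” option can be illustrated in a few lines: each sample (row of the h matrix) is assigned to the column holding its largest value. This is a sketch of that idea only. The helper name max_value_labels is hypothetical, and the “spectral” option is not covered.

```python
import numpy as np


def max_value_labels(h):
    """Assign every sample (row of h) to the column with the largest value."""
    h = np.asarray(h, dtype=float)
    return np.argmax(h, axis=1)


# Three samples, two clusters.
labels = max_value_labels([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
```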
class sumo.modes.run.solver.SumoSolver(graph: sumo.network.MultiplexNet, nbins: int, bin_size: int = None, rseed: int = None)

Defines solver of sumo

Args:
graph (MultiplexNet): network object, containing data about connections between nodes in each layer in form of adjacency matrices
nbins (int): number of bins, to distribute samples into
bin_size (int): size of bin, if None set to number of samples
calculate_avg_adjacency()

Creates average adjacency matrix

create_sample_bins() → list

Separate samples randomly into bins of set size, while making sure that each sample is allocated in at least one bin.

Returns: list of arrays containing indices of samples allocated to the bin
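One way to satisfy the "at least one bin" guarantee is sketched below: shuffle the samples once, deal them out across the bins, then top up every bin with random extra samples until it reaches the requested size. This is an illustrative strategy, not necessarily the one used by create_sample_bins(). The function name and its n_samples argument are hypothetical.

```python
import numpy as np


def create_sample_bins_sketch(n_samples, nbins, bin_size, rseed=None):
    """Random bins of equal size in which every sample appears at least once."""
    if bin_size > n_samples:
        raise ValueError("bin_size cannot exceed the number of samples")
    if nbins * bin_size < n_samples:
        raise ValueError("bins are too small to cover all samples")
    rng = np.random.default_rng(rseed)
    # Deal a shuffled list of all samples across the bins (coverage step).
    chunks = np.array_split(rng.permutation(n_samples), nbins)
    bins = []
    for chunk in chunks:
        needed = bin_size - chunk.size
        if needed > 0:
            # Top up with random samples that are not yet in this bin.
            rest = np.setdiff1d(np.arange(n_samples), chunk)
            extra = rng.choice(rest, size=needed, replace=False)
            chunk = np.concatenate([chunk, extra])
        bins.append(np.sort(chunk))
    return bins
```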

factorize(sparsity_penalty: float, k: int, max_iter: int, tol: float, calc_cost: int, logger_name: str, bin_id: int) → sumo.modes.run.solver.SumoNMFResults

Run solver specific factorization

Unsupervised SUMO

class sumo.modes.run.solvers.unsupervised_sumo.UnsupervisedSumoNMF(graph: sumo.network.MultiplexNet, nbins: int, bin_size: int = None, rseed: int = None)

Unsupervised SUMO solver (A(i)=HS(i)H^T formulation)

factorize(sparsity_penalty: float, k: int, max_iter: int = 500, tol: float = 1e-05, calc_cost: int = 1, h_init: int = None, logger_name: str = None, bin_id: int = None) → sumo.modes.run.solver.SumoNMFResults

Run tri-factorization

Args:
sparsity_penalty (float): ‘η’ value, corresponding to sparsity penalty for H
k (int): expected number of clusters
max_iter (int): maximum number of iterations
tol (float): if objective cost function value fluctuation is smaller than this value, stop iterations before reaching max_iter
calc_cost (int): number of steps between every calculation of objective cost function
h_init (int): index of adjacency matrix to use for H matrix initialization or None for initialization using average adjacency
logger_name (str): name of existing logger object, if not supplied new main logger is used
bin_id (int): id of sample bin created in SumoNMF constructor (the default of None means clustering all samples instead of samples in given bin)
Returns:
SumoNMFResults object (with result feature matrix / soft cluster indicator matrix (H array), and the list of result S matrices for each graph layer)
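To make the A(i)=HS(i)H^T formulation concrete, the sketch below evaluates a simple cost for a shared H and per-layer S matrices: the squared reconstruction error summed over layers, plus η times the squared norm of H. The exact objective minimized by the solver is not given here, so this form, including how the sparsity penalty enters, is an assumption. The function name is hypothetical.

```python
import numpy as np


def tri_factorization_cost(adjacencies, h, s_list, sparsity_penalty):
    """Sum of squared errors of A_i - H S_i H^T plus a penalty on H.

    Hypothetical helper; the penalty form is an assumption.
    """
    h = np.asarray(h, dtype=float)
    cost = 0.0
    for a, s in zip(adjacencies, s_list):
        # Reconstruction of one network layer from shared H and its own S.
        residual = np.asarray(a, dtype=float) - h @ np.asarray(s, dtype=float) @ h.T
        cost += float(np.sum(residual ** 2))
    return cost + sparsity_penalty * float(np.sum(h ** 2))
```

In an iterative solver, a value like this would be recomputed every calc_cost steps and compared with the previous one against tol.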

Supervised SUMO

class sumo.modes.run.solvers.supervised_sumo.SupervisedSumoNMF(graph: sumo.network.MultiplexNet, priors, nbins: int, bin_size: int = None, rseed: int = None)

Supervised SUMO solver (A(i)=HS(i)H^T formulation)

create_sample_bins() → list

Separate samples randomly into bins of set size, while making sure that each sample is allocated in at least one bin and each prior label is represented equally.

Returns: list of arrays containing indices of samples allocated to the bin

factorize(sparsity_penalty: float, k: int, max_iter: int = 500, tol: float = 1e-05, calc_cost: int = 1, logger_name: str = None, bin_id: int = None) → sumo.modes.run.solver.SumoNMFResults

Run tri-factorization

Args:
sparsity_penalty (float): ‘η’ value, corresponding to sparsity penalty for H
k (int): expected number of clusters
max_iter (int): maximum number of iterations
tol (float): if objective cost function value fluctuation is smaller than this value, stop iterations before reaching max_iter
calc_cost (int): number of steps between every calculation of objective cost function
logger_name (str): name of existing logger object, if not supplied new main logger is used
bin_id (int): id of sample bin created in SumoNMF constructor (the default of None means clustering all samples instead of samples in given bin)
Returns:
SumoNMFResults object (with result feature matrix / soft cluster indicator matrix (H array), and the list of result S matrices for each graph layer)

evaluate

SumoEvaluate Class

class sumo.modes.evaluate.evaluate.SumoEvaluate(**kwargs)

Sumo mode for evaluating accuracy of clustering. Constructor args are set in ‘evaluate’ subparser.

Args:
infile (str): input .tsv file containing sample names in ‘sample’ and clustering labels in ‘label’ column (clusters.tsv file created by running sumo with mode ‘run’)
labels (str): .tsv of the same structure as input file
metric (str): one of metrics ([‘NMI’, ‘purity’, ‘ARI’]) for accuracy evaluation, if set to None all metrics are calculated
logfile (str): path to save log file, if set to None stdout is used
log (str): sets the logging level from [‘DEBUG’, ‘INFO’, ‘WARNING’]
load_tsv(fname: str)

Load .tsv file

run()

Evaluate clustering results, given set of labels
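Of the three listed metrics, purity is the simplest to write out by hand: every cluster is credited with its most frequent true label, and the credited samples are divided by the total. The sketch below uses only the standard library and is not sumo's implementation. The function name purity is illustrative.

```python
from collections import Counter, defaultdict


def purity(predicted, truth):
    """Fraction of samples that carry the majority true label of their cluster."""
    if len(predicted) != len(truth) or len(predicted) == 0:
        raise ValueError("label lists must be non-empty and of equal length")
    members = defaultdict(list)
    for cluster, label in zip(predicted, truth):
        members[cluster].append(label)
    # Size of the largest true-label group inside every cluster.
    majority_total = sum(Counter(labels).most_common(1)[0][1]
                         for labels in members.values())
    return majority_total / len(predicted)
```

Purity does not depend on how cluster ids are named, so predicted labels 0/1 can be compared directly against string labels.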

interpret

SumoInterpret Class

class sumo.modes.interpret.interpret.SumoInterpret(**kwargs)

Sumo mode for interpreting clustering results. Constructor args are set in ‘interpret’ subparser.

Args:
sumo_results (str): path to sumo_results.npz (created by running program with mode “run”)
infiles (list): comma-delimited list of paths to input files, containing standardized feature matrices, with samples in columns and features in rows (supported types of files: [‘.txt’, ‘.txt.gz’, ‘.txt.bz2’, ‘.tsv’, ‘.tsv.gz’, ‘.tsv.bz2’])
output_prefix (str): prefix of output files - sumo will create two output files (1) .tsv file containing matrix (features x clusters), where the value in each cell is the importance of the feature in that cluster; (2) .hits.tsv file containing features of most importance
hits (int): sets number of most important features for every cluster, that are logged in .hits.tsv file
max_iter (int): maximum number of iterations, while searching through hyperparameter space
n_folds (int): number of folds for model cross validation
t (int): number of threads
seed (int): random state
sn (int): index of row with sample names for .txt input files
fn (int): index of column with feature names for .txt input files
df (float): if percentage of missing values for feature exceeds this value, remove feature
ds (float): if percentage of missing values for sample (that remains after feature dropping) exceeds this value, remove sample
logfile (str): path to save log file, if set to None stdout is used
log (str): sets the logging level from [‘DEBUG’, ‘INFO’, ‘WARNING’]
create_classifier(x: numpy.ndarray, y: numpy.ndarray)

Create a gradient boosting method classifier

Args:
x (Numpy.ndarray): input feature matrix
y (Numpy.ndarray): one dimensional array of labels in classification
Returns:
LGBM classifier
run()

Find features that support clusters separation