SUMO modes

class sumo.modes.mode.SumoMode(**kwargs)

Defines modes of sumo


Run mode specific functionality


SumoPrepare Class

class sumo.modes.prepare.prepare.SumoPrepare(**kwargs)

Sumo mode for data pre-processing and creation of multiplex network files. Constructor args are set in ‘prepare’ subparser.

infiles (list): comma-delimited list of paths to input files, containing standardized feature matrices, with samples in columns and features in rows (supported types of files: [‘.txt’, ‘.txt.gz’, ‘.txt.bz2’, ‘.tsv’, ‘.tsv.gz’, ‘.tsv.bz2’])
outfile (str): path to output .npz file
method (list): comma-separated list of methods for every layer (available methods: [‘euclidean’, ‘cosine’, ‘pearson’, ‘spearman’])
k (float): fraction of nearest neighbours to use for sample similarity calculation using Euclidean distance similarity
alpha (float): hyperparameter of RBF similarity kernel, for Euclidean distance similarity
missing (list): acceptable fraction of available (not missing) values for assessment of distance/similarity between pairs of samples, either one value or different values for every layer
atol (float): if input files have continuous values, sumo checks if data is standardized feature-wise, meaning all features should have mean close to zero, with standard deviation around one; use this parameter to set tolerance of standardization checks
sn (int): index of row with sample names for .txt input files
fn (int): index of column with feature names for .txt input files
df (float): if percentage of missing values for feature exceeds this value, remove feature
ds (float): if percentage of missing values for sample (that remains after feature dropping) exceeds this value, remove sample
logfile (str): path to save log file, if set to None stdout is used
log (str): sets the logging level from [‘DEBUG’, ‘INFO’, ‘WARNING’]
plot (str): path to save adjacency matrix heatmap(s), if set None plots are displayed on screen
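The standardization check controlled by atol can be illustrated with a short sketch. This is not sumo's implementation: the helper name is_standardized, the use of numpy.allclose, and the tolerance used in any call are assumptions made for illustration.

```python
import numpy as np


def is_standardized(f: np.ndarray, atol: float) -> bool:
    """Check feature-wise standardization of a (features x samples) matrix.

    Hypothetical helper: NaN entries are treated as missing and ignored.
    """
    means = np.nanmean(f, axis=1)  # one mean per feature (row)
    stds = np.nanstd(f, axis=1)    # one standard deviation per feature (row)
    return bool(
        np.allclose(means, 0.0, atol=atol) and np.allclose(stds, 1.0, atol=atol)
    )


# Illustrative data: 4 features measured in 50 samples.
rng = np.random.default_rng(0)
raw = rng.normal(loc=5.0, scale=2.0, size=(4, 50))
scaled = (raw - raw.mean(axis=1, keepdims=True)) / raw.std(axis=1, keepdims=True)
```

A larger atol makes the check more permissive, which may be useful for matrices that were standardized before some samples were removed.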

Load all input files

Returns: list of tuples, each containing a file name (str) and a filtered feature matrix (pandas.DataFrame)
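The filtering controlled by df and ds (features first, then the remaining samples) could look roughly like the sketch below. The helper filter_missing is hypothetical, it works on a plain NumPy array rather than a pandas.DataFrame, and it assumes the thresholds are given as fractions between 0 and 1.

```python
import numpy as np


def filter_missing(f: np.ndarray, df: float, ds: float):
    """Drop features (rows), then samples (columns), with too many NaNs.

    Hypothetical helper; returns the filtered matrix together with the
    indices of the rows and columns that were kept.
    """
    feature_missing = np.isnan(f).mean(axis=1)
    keep_rows = np.flatnonzero(feature_missing <= df)
    f = f[keep_rows, :]
    # Sample-wise fractions are computed after the feature drop.
    sample_missing = np.isnan(f).mean(axis=0)
    keep_cols = np.flatnonzero(sample_missing <= ds)
    return f[:, keep_cols], keep_rows, keep_cols


nan = np.nan
matrix = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [nan, nan, nan, 1.0],  # 3 of 4 values missing: removed when df=0.5
    [0.5, nan, 1.5, 2.5],
    [1.0, nan, 2.0, 3.0],
])
filtered, rows, cols = filter_missing(matrix, df=0.5, ds=0.5)
```

In this example the second sample is removed as well, because two of its three remaining values are missing once the sparse feature has been dropped.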

Generate similarity matrices for samples based on biological data

Similarity Metrics

Calculate correlation similarity between two vectors
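As an illustration of a correlation similarity that tolerates missing values, here is a minimal sketch. The function name, its signature, and the rule of returning NaN when too few shared values are available are assumptions, not sumo's API. Only the Pearson variant is shown; a Spearman variant would apply the same computation to ranks.

```python
import numpy as np


def correlation_similarity(a, b, missing=0.1):
    """Pearson correlation over positions observed in both vectors.

    Hypothetical helper: returns NaN when the fraction of positions
    available in both vectors is below `missing`.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    shared = ~np.isnan(a) & ~np.isnan(b)
    if shared.mean() < missing or shared.sum() < 2:
        return np.nan
    return float(np.corrcoef(a[shared], b[shared])[0, 1])


nan = np.nan
r = correlation_similarity([1.0, 2.0, nan, 4.0, 5.0], [2.0, 4.0, 6.0, 8.0, nan])
sparse = correlation_similarity([1.0, nan, nan, nan], [1.0, 2.0, 3.0, 4.0], missing=0.5)
```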

Calculate cosine similarity between two vectors
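A cosine similarity restricted to the values available in both vectors might be sketched as follows. The name cosine_similarity and the handling of the missing threshold are illustrative assumptions rather than the library's actual function.

```python
import numpy as np


def cosine_similarity(a, b, missing=0.1):
    """Cosine of the angle between two vectors, ignoring missing positions.

    Hypothetical helper: returns NaN when too few shared values are
    available or when either vector has zero length on the shared positions.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    shared = ~np.isnan(a) & ~np.isnan(b)
    if shared.mean() < missing:
        return np.nan
    a, b = a[shared], b[shared]
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    if norm == 0:
        return np.nan
    return float(np.dot(a, b) / norm)


same_direction = cosine_similarity([1.0, 0.0, np.nan], [2.0, 0.0, 5.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```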

Calculate euclidean distance between two vectors of continuous variables

Generate similarity matrix using RBF kernel and supplied distance function

f (Numpy.ndarray): feature matrix (n x k, where ‘n’ - samples, ‘k’ - measurements)
n (float): fraction of nearest neighbours to use for sample similarity calculation
missing (float): acceptable fraction of values for assessment of distance/similarity between two samples
alpha (float): hyperparameter of RBF kernel
distance: distance function accepting two vectors and missing parameter (defaults to Euclidean distance)
Returns: w (Numpy.ndarray): symmetric matrix describing similarity between samples (n x n)

Generate similarity matrix from genomic assay

f (Numpy.ndarray): feature matrix (n x k, where ‘n’ - samples, ‘k’ - measurements)
missing (float): acceptable fraction of values for assessment of distance/similarity between two samples (default of 0.1 means that up to 90 % of missing values is acceptable)
method (str): similarity method selected from: [‘euclidean’, ‘cosine’, ‘pearson’, ‘spearman’]
n (float): parameter of euclidean similarity method, fraction of nearest neighbours of sample
alpha (float): parameter of euclidean similarity method, RBF kernel hyperparameter
Returns: sim (Numpy.ndarray): symmetric matrix describing similarity between samples (n x n)
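To make the roles of n (fraction of nearest neighbours) and alpha (kernel hyperparameter) in the Euclidean method concrete, the sketch below builds an RBF similarity matrix with a locally scaled kernel in the style of similarity network fusion. The exact scaling, the root-mean-square form of the distance, and both function names are assumptions; sumo's own kernel may differ.

```python
import numpy as np


def euclidean_distance(a, b):
    """Root mean squared difference over positions observed in both vectors."""
    shared = ~np.isnan(a) & ~np.isnan(b)
    if not shared.any():
        return np.nan
    return float(np.sqrt(np.mean((a[shared] - b[shared]) ** 2)))


def rbf_similarity(f, n=0.5, alpha=0.5, distance=euclidean_distance):
    """Hypothetical sketch: f is a (samples x measurements) matrix.

    Returns a symmetric (samples x samples) similarity matrix.
    """
    n_samples = f.shape[0]
    dist = np.zeros((n_samples, n_samples))
    for i in range(n_samples):
        for j in range(i + 1, n_samples):
            dist[i, j] = dist[j, i] = distance(f[i], f[j])

    # Number of neighbours used to estimate the local scale of each sample.
    k = min(max(1, int(round(n * n_samples))), n_samples - 1)
    # Column 0 of the sorted distances is the sample itself (distance 0).
    local = np.sort(dist, axis=1)[:, 1:k + 1].mean(axis=1)
    scale = (local[:, None] + local[None, :] + dist) / 3.0
    return np.exp(-dist ** 2 / (alpha * scale + 1e-12))


# Two tight groups of two samples each.
features = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
w = rbf_similarity(features, n=0.5, alpha=0.5)
```

Samples within the same group receive a similarity close to one, while samples from different groups receive a similarity close to zero. Smaller alpha values sharpen that contrast.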


SumoRun Class


Sumo mode for factorization of multiplex network to identify molecular subtypes. Constructor args are set in ‘run’ subparser.

infile (str): input .npz file containing adjacency matrices for every network layer and sample names (file created by running program with mode “prepare”) - consecutive adjacency arrays in file are indexed in following way: “0”, “1” … and index of sample name vector is “samples”
k (int): number of clusters
outdir (str): path to save output files
sparsity (list): list of sparsity penalty values for H matrix (if multiple values are given, sumo will try all and select the best results)
n (int): number of repetitions
method (str): method of cluster extraction, selected from [‘max_value’, ‘spectral’]
max_iter (int): maximum number of iterations for factorization
tol (float): if objective cost function value fluctuation is smaller than this value, stop iterations before reaching max_iter
subsample (float): fraction of samples randomly removed from each run, cannot be greater than 0.5
calc_cost (int): number of steps between every calculation of objective cost function
logfile (str): path to save log file, if set to None stdout is used
log (str): sets the logging level from [‘DEBUG’, ‘INFO’, ‘WARNING’]
h_init (int): index of adjacency matrix to use for H matrix initialization, if set to None average adjacency matrix is used
t (int): number of threads
rep (int): number of times consensus matrix is created for the purpose of assessing clustering quality
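The layout of the input .npz file described for infile (adjacency arrays under the keys "0", "1", ... and sample names under "samples") can be reproduced with plain NumPy. The sketch below writes and reads a toy file with that layout; the file name and matrix values are invented, and any additional keys a real sumo file may contain are simply ignored here.

```python
import os
import tempfile

import numpy as np

# Toy multiplex network: two layers over three samples.
layers = [np.eye(3), np.full((3, 3), 0.5)]
samples = np.array(["s1", "s2", "s3"])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "multiplex.npz")  # hypothetical file name
    arrays = {str(i): layer for i, layer in enumerate(layers)}
    np.savez(path, samples=samples, **arrays)

    with np.load(path) as data:
        loaded_samples = data["samples"]
        # Layer keys are the consecutive integers stored as strings.
        layer_keys = sorted((key for key in data.files if key.isdigit()), key=int)
        loaded_layers = [data[key] for key in layer_keys]
```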

Cluster multiplex network using non-negative matrix tri-factorization

NMF Solvers

class

Wrapper class for SumoNMF factorization results

extract_clusters(method: str)

Extract cluster labels using selected method

method (str): either “max_value” for extraction based on maximum value in each row of h matrix or “spectral” for spectral clustering on h matrix values
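The “max_value” extraction amounts to taking the column index of the largest entry in each row of the H matrix. A minimal sketch, with a hypothetical function name and an invented H matrix (the “spectral” alternative is not shown):

```python
import numpy as np


def extract_max_value(h: np.ndarray) -> np.ndarray:
    """Assign each sample (row of h) to the cluster with the largest value."""
    return np.argmax(h, axis=1)


# Illustrative soft cluster indicator matrix: 3 samples, 2 clusters.
h = np.array([
    [0.9, 0.1],
    [0.2, 0.7],
    [0.6, 0.3],
])
labels = extract_max_value(h)
```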

class

Defines solver of sumo

graph (MultiplexNet): network object, containing data about connections between nodes in each layer in form of adjacency matrices
nbins (int): number of bins, to distribute samples into
bin_size (int): size of bin, if None set to number of samples

Creates average adjacency matrix

create_sample_bins() → list

Separate samples randomly into bins of set size, while making sure that each sample is allocated in at least one bin.

Returns: list of arrays containing indices of samples allocated to the bin
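One way to obtain random bins in which every sample appears at least once is to deal a shuffled list of samples across the bins and then top each bin up with random extra samples. The standalone function below is a sketch of that idea only; its signature and the top-up strategy are assumptions and do not reproduce the method's actual code.

```python
import numpy as np


def create_sample_bins(n_samples, nbins, bin_size, seed=None):
    """Hypothetical sketch: return `nbins` index arrays of length `bin_size`."""
    if bin_size > n_samples or nbins * bin_size < n_samples:
        raise ValueError("bins cannot cover every sample with these sizes")
    rng = np.random.default_rng(seed)

    # Deal a random permutation into bins so each sample lands in one bin.
    order = rng.permutation(n_samples)
    bins = []
    for members in np.array_split(order, nbins):
        # Fill the bin with random samples it does not contain yet.
        pool = np.setdiff1d(np.arange(n_samples), members)
        extra = rng.choice(pool, size=bin_size - len(members), replace=False)
        bins.append(np.sort(np.concatenate([members, extra])).astype(int))
    return bins


sample_bins = create_sample_bins(n_samples=10, nbins=3, bin_size=5, seed=0)
```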

Run solver specific factorization

Unsupervised SUMO

class

Unsupervised SUMO solver (A(i)=HS(i)H^T formulation)

Run tri-factorization

sparsity_penalty (float): ‘η’ value, corresponding to sparsity penalty for H
k (int): expected number of clusters
max_iter (int): maximum number of iterations
tol (float): if objective cost function value fluctuation is smaller than this value, stop iterations before reaching max_iter
calc_cost (int): number of steps between every calculation of objective cost function
h_init (int): index of adjacency matrix to use for H matrix initialization or None for initialization using average adjacency
logger_name (str): name of existing logger object, if not supplied new main logger is used
bin_id (int): id of sample bin created in SumoNMF constructor (default of None means clustering all samples instead of samples in given bin)
Returns: SumoNMFResults object (with result feature matrix / soft cluster indicator matrix (H array), and the list of result S matrices for each graph layer)
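For readers unfamiliar with the A(i)=HS(i)H^T formulation, the sketch below factorizes a small two-layer network with generic damped multiplicative updates for symmetric non-negative tri-factorization. It is deliberately simplified and is not sumo's solver: the update rules are a textbook choice, and the sparsity penalty, the tol-based stopping rule, h_init, and sample bins are all omitted. Every name in it is hypothetical.

```python
import numpy as np


def total_cost(layers, h, s_list):
    """Sum of squared reconstruction errors over all layers."""
    return sum(np.linalg.norm(a - h @ s @ h.T) ** 2 for a, s in zip(layers, s_list))


def tri_factorize(layers, k, max_iter=300, seed=None):
    """Hypothetical sketch: shared H (n x k), one symmetric S (k x k) per layer."""
    rng = np.random.default_rng(seed)
    n = layers[0].shape[0]
    eps = 1e-12  # guards against division by zero
    h = rng.uniform(0.1, 1.0, size=(n, k))
    s_list = []
    for _ in layers:
        s = rng.uniform(0.1, 1.0, size=(k, k))
        s_list.append((s + s.T) / 2.0)

    costs = [total_cost(layers, h, s_list)]
    for _ in range(max_iter):
        # Update every S(i) with H fixed.
        hth = h.T @ h
        s_list = [
            s * (h.T @ a @ h) / (hth @ s @ hth + eps)
            for a, s in zip(layers, s_list)
        ]
        # Update the shared H, accumulating contributions of all layers.
        num = np.zeros_like(h)
        den = np.zeros_like(h)
        for a, s in zip(layers, s_list):
            hs = h @ s
            num += a @ hs
            den += hs @ (h.T @ hs)
        h = h * (num / (den + eps)) ** 0.25  # damped multiplicative step
        costs.append(total_cost(layers, h, s_list))
    return h, s_list, costs


# Two layers sharing the same two-block structure over six samples.
block = np.kron(np.eye(2), np.ones((3, 3)))
layers = [block, 0.5 * block + 0.5 * np.eye(6)]
h, s_list, costs = tri_factorize(layers, k=2, max_iter=300, seed=0)
```

The rows of the resulting H matrix act as soft cluster indicators, and cluster labels can then be read from them, for example by taking the largest value in each row.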

Supervised SUMO

class

Supervised SUMO solver (A(i)=HS(i)H^T formulation)

create_sample_bins() → list

Separate samples randomly into bins of set size, while making sure that each sample is allocated in at least one bin and each prior label is represented equally.

Returns: list of arrays containing indices of samples allocated to the bin

Run tri-factorization

sparsity_penalty (float): ‘η’ value, corresponding to sparsity penalty for H
k (int): expected number of clusters
max_iter (int): maximum number of iterations
tol (float): if objective cost function value fluctuation is smaller than this value, stop iterations before reaching max_iter
calc_cost (int): number of steps between every calculation of objective cost function
logger_name (str): name of existing logger object, if not supplied new main logger is used
bin_id (int): id of sample bin created in SumoNMF constructor (default of None means clustering all samples instead of samples in given bin)
Returns: SumoNMFResults object (with result feature matrix / soft cluster indicator matrix (H array), and the list of result S matrices for each graph layer)


SumoEvaluate Class

class sumo.modes.evaluate.evaluate.SumoEvaluate(**kwargs)

Sumo mode for evaluating accuracy of clustering. Constructor args are set in ‘evaluate’ subparser.

infile (str): input .tsv file containing sample names in ‘sample’ and clustering labels in ‘label’ column (clusters.tsv file created by running sumo with mode ‘run’)
labels (str): .tsv of the same structure as input file
metric (str): one of metrics ([‘NMI’, ‘purity’, ‘ARI’]) for accuracy evaluation, if set to None all metrics are calculated
logfile (str): path to save log file, if set to None stdout is used
log (str): sets the logging level from [‘DEBUG’, ‘INFO’, ‘WARNING’]

load_tsv(fname: str)

Load .tsv file


Evaluate clustering results, given set of labels
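Of the metrics listed for this mode, purity is the simplest to state: each cluster is credited with its most frequent true label, and the credited samples are divided by the total number of samples. The sketch below computes it with the standard library only. The function name and signature are illustrative and are not taken from sumo.

```python
from collections import Counter


def purity(true_labels, predicted_labels):
    """Fraction of samples that carry the majority true label of their cluster."""
    clusters = {}
    for truth, cluster in zip(true_labels, predicted_labels):
        clusters.setdefault(cluster, Counter())[truth] += 1
    matched = sum(max(counts.values()) for counts in clusters.values())
    return matched / len(true_labels)


# Cluster 0 holds two "a" samples and one "b"; cluster 1 holds one "b".
score = purity(["a", "a", "b", "b"], [0, 0, 1, 0])
perfect = purity(["a", "a", "b", "b"], [1, 1, 0, 0])
```

Purity does not depend on how clusters are numbered, which is why the second call scores 1.0 even though the cluster ids are swapped relative to the label order.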


SumoInterpret Class

class sumo.modes.interpret.interpret.SumoInterpret(**kwargs)

Sumo mode for interpreting clustering results. Constructor args are set in ‘interpret’ subparser.

sumo_results (str): path to sumo_results.npz (created by running program with mode “run”)
infiles (list): comma-delimited list of paths to input files, containing standardized feature matrices, with samples in columns and features in rows (supported types of files: [‘.txt’, ‘.txt.gz’, ‘.txt.bz2’, ‘.tsv’, ‘.tsv.gz’, ‘.tsv.bz2’])
output_prefix (str): prefix of output files - sumo will create two output files (1) .tsv file containing matrix (features x clusters), where the value in each cell is the importance of the feature in that cluster; (2) .hits.tsv file containing features of most importance
hits (int): sets number of most important features for every cluster, that are logged in .hits.tsv file
max_iter (int): maximum number of iterations, while searching through hyperparameter space
n_folds (int): number of folds for model cross validation
t (int): number of threads
seed (int): random state
sn (int): index of row with sample names for .txt input files
fn (int): index of column with feature names for .txt input files
df (float): if percentage of missing values for feature exceeds this value, remove feature
ds (float): if percentage of missing values for sample (that remains after feature dropping) exceeds this value, remove sample
logfile (str): path to save log file, if set to None stdout is used
log (str): sets the logging level from [‘DEBUG’, ‘INFO’, ‘WARNING’]

Create a gradient boosting method classifier

x (Numpy.ndarray): input feature matrix
y (Numpy.ndarray): one dimensional array of labels in classification
Returns: LGBM classifier

Find features that support clusters separation
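Given a (features x clusters) importance matrix like the one written to the .tsv output, selecting the hits most important features per cluster is a matter of sorting each column. The sketch below shows only that selection step. The function name, feature names, and importance values are invented, and how sumo derives the importance values from the classifier is not covered.

```python
import numpy as np


def top_hits(importance, feature_names, hits):
    """Hypothetical helper: names of the `hits` top features for each cluster."""
    selected = {}
    for cluster in range(importance.shape[1]):
        # Indices of features sorted by decreasing importance in this cluster.
        order = np.argsort(importance[:, cluster])[::-1][:hits]
        selected[cluster] = [feature_names[i] for i in order]
    return selected


importance = np.array([
    [0.1, 0.9],
    [0.7, 0.2],
    [0.4, 0.5],
])
names = ["geneA", "geneB", "geneC"]
selected = top_hits(importance, names, hits=2)
```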