SUMO modes
- class sumo.modes.mode.SumoMode(**kwargs)
Defines modes of sumo
- run()
Run mode specific functionality
prepare
SumoPrepare Class
- class sumo.modes.prepare.prepare.SumoPrepare(**kwargs)
Sumo mode for data pre-processing and creation of multiplex network files. Constructor args are set in the ‘prepare’ subparser.
- Args:
- infiles (list): comma-delimited list of paths to input files containing standardized feature matrices, with samples in columns and features in rows (supported file types: [‘.txt’, ‘.txt.gz’, ‘.txt.bz2’, ‘.tsv’, ‘.tsv.gz’, ‘.tsv.bz2’])
- outfile (str): path to output .npz file
- method (list): comma-separated list of methods for every layer (available methods: [‘euclidean’, ‘cosine’, ‘pearson’, ‘spearman’])
- k (float): fraction of nearest neighbours to use for sample similarity calculation with the Euclidean distance similarity
- alpha (float): hyperparameter of the RBF similarity kernel, for Euclidean distance similarity
- missing (list): acceptable fraction of available (not missing) values for assessment of distance/similarity between pairs of samples, either one value or a different value for every layer
- atol (float): if input files have continuous values, sumo checks whether the data is standardized feature-wise, meaning every feature should have a mean close to zero and a standard deviation around one; use this parameter to set the tolerance of the standardization checks
- sn (int): index of row with sample names for .txt input files
- fn (int): index of column with feature names for .txt input files
- df (float): if the percentage of missing values for a feature exceeds this value, remove the feature
- ds (float): if the percentage of missing values for a sample (that remains after feature dropping) exceeds this value, remove the sample
- logfile (str): path to save log file; if set to None, stdout is used
- log (str): sets the logging level from [‘DEBUG’, ‘INFO’, ‘WARNING’]
- plot (str): path to save adjacency matrix heatmap(s); if set to None, plots are displayed on screen
- load_all_data()
Load all input files
- Returns:
- list of tuples, each containing a file name (str) and a filtered feature matrix (pandas.DataFrame)
- run()
Generate similarity matrices for samples based on biological data
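The feature-wise standardization check controlled by atol can be pictured with a minimal sketch. The helper name is_standardized and its default tolerance are assumptions made for illustration and are not part of sumo's API; the sketch only relies on the layout described above (features in rows, samples in columns).

```python
import numpy as np


def is_standardized(matrix: np.ndarray, atol: float = 1e-2) -> bool:
    """Hypothetical helper (not part of sumo): feature-wise standardization check.

    Rows are features and columns are samples; NaN marks a missing value.
    """
    means = np.nanmean(matrix, axis=1)  # per-feature mean, ignoring missing values
    stds = np.nanstd(matrix, axis=1)    # per-feature standard deviation
    mean_ok = np.allclose(means, 0.0, rtol=0.0, atol=atol)
    std_ok = np.allclose(stds, 1.0, rtol=0.0, atol=atol)
    return bool(mean_ok and std_ok)


rng = np.random.default_rng(0)
raw = rng.normal(loc=5.0, scale=3.0, size=(4, 50))  # 4 features, 50 samples
scaled = (raw - raw.mean(axis=1, keepdims=True)) / raw.std(axis=1, keepdims=True)

print(is_standardized(raw))     # raw rows are far from zero mean
print(is_standardized(scaled))  # z-scored rows pass the check
```

A looser atol accepts data that is only approximately z-scored; a tighter one rejects it.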
Similarity Metrics
- sumo.modes.prepare.similarity.correlation(a, b, missing: float, method='pearson')
Calculate correlation similarity between two vectors
- sumo.modes.prepare.similarity.cosine_sim(a, b, missing: float)
Calculate cosine similarity between two vectors
- sumo.modes.prepare.similarity.euclidean_dist(a, b, missing: float)
Calculate Euclidean distance between two vectors of continuous variables
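As an illustration of how a missing threshold could gate a pairwise comparison, the sketch below computes a Euclidean distance over the positions observed in both vectors and returns NaN when too few values are shared. The function masked_euclidean is a hypothetical stand-in, not sumo's euclidean_dist; whether sumo rescales the distance by the number of shared values is not stated here, so the sketch applies no rescaling.

```python
import numpy as np


def masked_euclidean(a: np.ndarray, b: np.ndarray, missing: float = 0.1) -> float:
    """Hypothetical stand-in for a missing-aware distance (not sumo's implementation)."""
    shared = ~np.isnan(a) & ~np.isnan(b)   # positions observed in both vectors
    if shared.sum() / a.size < missing:    # too few shared values to compare
        return float("nan")
    diff = a[shared] - b[shared]
    return float(np.sqrt(np.sum(diff ** 2)))


a = np.array([0.0, 3.0, np.nan, 1.0])
b = np.array([4.0, 0.0, 2.0, np.nan])

print(masked_euclidean(a, b, missing=0.5))   # two of four positions are shared
print(masked_euclidean(a, b, missing=0.75))  # threshold not met, so NaN
```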
- sumo.modes.prepare.similarity.feature_rbf_similarity(f: numpy.ndarray, missing: float = 0.1, n: float = 0.1, alpha: float = 0.5, distance=<function euclidean_dist>)
Generate similarity matrix using RBF kernel and supplied distance function
- Args:
- f (numpy.ndarray): feature matrix (n x k, where ‘n’ is the number of samples and ‘k’ the number of measurements)
- n (float): fraction of nearest neighbours to use for sample similarity calculation
- missing (float): acceptable fraction of available (not missing) values for assessment of distance/similarity between two samples
- alpha (float): hyperparameter of RBF kernel
- distance: distance function accepting two vectors and the missing parameter (default: Euclidean distance)
- Returns:
- w (numpy.ndarray): symmetric matrix describing similarity between samples (n x n)
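The roles of n and alpha can be sketched with a locally scaled RBF kernel. This is an assumption made for illustration: the scaling below follows the style used in similarity network fusion, and the exact kernel sumo applies is not given in this reference. The name rbf_similarity_sketch is hypothetical, and missing values are not handled.

```python
import numpy as np


def rbf_similarity_sketch(f: np.ndarray, n: float = 0.5, alpha: float = 0.5) -> np.ndarray:
    """Hypothetical sketch (not sumo's implementation); rows of f are samples."""
    num_samples = f.shape[0]
    # pairwise Euclidean distances between samples
    diff = f[:, None, :] - f[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # number of neighbours derived from the fraction n
    k = max(1, int(np.ceil(n * num_samples)))
    # mean distance to the k nearest neighbours (column 0 after sorting is the sample itself)
    knn_mean = np.sort(dist, axis=1)[:, 1:k + 1].mean(axis=1)
    # local scale for each pair of samples
    eps = (knn_mean[:, None] + knn_mean[None, :] + dist) / 3.0
    eps[eps == 0] = np.finfo(float).eps  # guard against division by zero
    return np.exp(-(dist ** 2) / (alpha * eps))


rng = np.random.default_rng(0)
features = rng.random((5, 3))  # 5 samples, 3 measurements
w = rbf_similarity_sketch(features)
print(w.shape)
```

Under this assumed kernel, a larger alpha widens the kernel so that distant samples keep more similarity, and a larger n averages the local scale over more neighbours.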
- sumo.modes.prepare.similarity.feature_to_adjacency(f: numpy.ndarray, missing: float = 0.1, method: str = 'euclidean', n: float = None, alpha: float = None)
Generate similarity matrix from genomic assay
- Args:
- f (numpy.ndarray): feature matrix (n x k, where ‘n’ is the number of samples and ‘k’ the number of measurements)
- missing (float): acceptable fraction of available (not missing) values for assessment of distance/similarity between two samples (the default of 0.1 means that up to 90% of values may be missing)
- method (str): similarity method selected from [‘euclidean’, ‘cosine’, ‘pearson’, ‘spearman’]
- n (float): parameter of the euclidean similarity method, fraction of nearest neighbours of a sample
- alpha (float): parameter of the euclidean similarity method, RBF kernel hyperparameter
- Returns:
- sim (numpy.ndarray): symmetric matrix describing similarity between samples (n x n)
run
SumoRun Class
- class sumo.modes.run.run.SumoRun(**kwargs)
Sumo mode for factorization of multiplex network to identify molecular subtypes. Constructor args are set in the ‘run’ subparser.
- Args:
- infile (str): input .npz file containing adjacency matrices for every network layer and sample names (file created by running the program with mode “prepare”); consecutive adjacency arrays in the file are indexed “0”, “1”, … and the sample name vector is indexed “samples”
- k (int): number of clusters
- outdir (str): path to save output files
- sparsity (list): list of sparsity penalty values for H matrix (if multiple values are given, sumo will try all of them and select the best result)
- n (int): number of repetitions
- method (str): method of cluster extraction, selected from [‘max_value’, ‘spectral’]
- max_iter (int): maximum number of iterations for factorization
- tol (float): if the fluctuation of the objective cost function value is smaller than this value, stop iterations before reaching max_iter
- subsample (float): fraction of samples randomly removed from each run, cannot be greater than 0.5
- calc_cost (int): number of steps between every calculation of objective cost function
- logfile (str): path to save log file; if set to None, stdout is used
- log (str): sets the logging level from [‘DEBUG’, ‘INFO’, ‘WARNING’]
- h_init (int): index of adjacency matrix to use for H matrix initialization; if set to None, the average adjacency matrix is used
- t (int): number of threads
- rep (int): number of times consensus matrix is created for the purpose of assessing clustering quality
- run()
Cluster multiplex network using non-negative matrix tri-factorization
NMF Solvers
- class sumo.modes.run.solver.SumoNMFResults(graph: sumo.network.MultiplexNet, h, s, objval, steps: int, logger: logging.Logger, sample_ids, **kwargs)
Wrapper class for SumoNMF factorization results
- extract_clusters(method: str)
Extract cluster labels using selected method
- Args:
- method (str): either “max_value” for extraction based on the maximum value in each row of the h matrix, or “spectral” for spectral clustering on h matrix values
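The “max_value” option, as described, amounts to a row-wise argmax over the h matrix. A minimal sketch with a made-up h matrix (the values are illustrative, not sumo output):

```python
import numpy as np

# Made-up soft cluster indicator matrix: 4 samples (rows), 2 clusters (columns)
h = np.array([
    [0.90, 0.10],
    [0.20, 0.70],
    [0.60, 0.30],
    [0.05, 0.40],
])

# Each sample is assigned to the cluster holding the largest value in its row
labels = np.argmax(h, axis=1)
print(labels.tolist())  # [0, 1, 0, 1]
```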
- class sumo.modes.run.solver.SumoSolver(graph: sumo.network.MultiplexNet, nbins: int, bin_size: int = None, rseed: int = None)
Defines solver of sumo
- Args:
- graph (MultiplexNet): network object containing data about connections between nodes in each layer, in the form of adjacency matrices
- nbins (int): number of bins to distribute samples into
- bin_size (int): size of bin; if None, set to the number of samples
- calculate_avg_adjacency()
Creates average adjacency matrix
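One plausible reading of an average adjacency matrix is the element-wise mean over layers. The sketch below additionally assumes that entries missing from a layer are encoded as NaN and skipped; that handling is an assumption for illustration, not documented behaviour of calculate_avg_adjacency.

```python
import numpy as np

# Three made-up 2 x 2 adjacency layers; NaN marks a pair missing from a layer
layers = [
    np.array([[1.0, 0.5], [0.5, 1.0]]),
    np.array([[1.0, np.nan], [np.nan, 1.0]]),
    np.array([[1.0, 0.3], [0.3, 1.0]]),
]

# Element-wise mean across layers, ignoring NaN entries (assumed handling)
avg_adjacency = np.nanmean(np.stack(layers), axis=0)
print(avg_adjacency)
```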
- create_sample_bins() → list
Separate samples randomly into bins of set size, while making sure that each sample is allocated to at least one bin.
- Returns:
- list of arrays containing indices of samples allocated to each bin
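One way to satisfy both constraints (fixed bin size, every sample in at least one bin) is to split a random permutation across the bins and then top each bin up with randomly drawn extra samples. The function create_bins_sketch and its arguments are hypothetical; this is a sketch of the idea and not necessarily the strategy sumo uses.

```python
import numpy as np


def create_bins_sketch(n_samples: int, nbins: int, bin_size: int, seed=None) -> list:
    """Hypothetical sketch (not sumo's implementation) of random, covering sample bins."""
    if bin_size > n_samples or nbins * bin_size < n_samples:
        raise ValueError("bins cannot cover every sample with these sizes")
    rng = np.random.default_rng(seed)
    # Splitting one permutation guarantees each sample lands in some bin
    chunks = np.array_split(rng.permutation(n_samples), nbins)
    bins = []
    for chunk in chunks:
        # Fill the bin up to bin_size with samples not already in it
        rest = np.setdiff1d(np.arange(n_samples), chunk)
        extra = rng.choice(rest, size=bin_size - chunk.size, replace=False)
        bins.append(np.sort(np.concatenate([chunk, extra])))
    return bins


bins = create_bins_sketch(n_samples=10, nbins=3, bin_size=5, seed=0)
for b in bins:
    print(b)
```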
- factorize(sparsity_penalty: float, k: int, max_iter: int, tol: float, calc_cost: int, logger_name: str, bin_id: int) → sumo.modes.run.solver.SumoNMFResults
Run solver specific factorization
Unsupervised SUMO
- class sumo.modes.run.solvers.unsupervised_sumo.UnsupervisedSumoNMF(graph: sumo.network.MultiplexNet, nbins: int, bin_size: int = None, rseed: int = None)
Unsupervised SUMO solver (A(i)=HS(i)H^T formulation)
- factorize(sparsity_penalty: float, k: int, max_iter: int = 500, tol: float = 1e-05, calc_cost: int = 1, h_init: int = None, logger_name: str = None, bin_id: int = None) → sumo.modes.run.solver.SumoNMFResults
Run tri-factorization
- Args:
- sparsity_penalty (float): ‘η’ value, corresponding to sparsity penalty for H
- k (int): expected number of clusters
- max_iter (int): maximum number of iterations
- tol (float): if the fluctuation of the objective cost function value is smaller than this value, stop iterations before reaching max_iter
- calc_cost (int): number of steps between every calculation of objective cost function
- h_init (int): index of adjacency matrix to use for H matrix initialization, or None for initialization using average adjacency
- logger_name (str): name of existing logger object; if not supplied, a new main logger is used
- bin_id (int): id of sample bin created in SumoNMF constructor (the default of None means clustering all samples instead of the samples in a given bin)
- Returns:
- SumoNMFResults object (with result feature matrix / soft cluster indicator matrix (H array), and the list of result S matrices for each graph layer)
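The shapes involved in the A(i)=HS(i)H^T formulation can be checked with a small sketch: one H matrix (samples x clusters) is shared by all layers, and each layer has its own k x k matrix S(i). The matrices below are random placeholders, and making each S(i) symmetric is an assumption for illustration; the sketch does not reproduce the solver's update rules or the sparsity penalty.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, k, n_layers = 6, 2, 3

# Shared soft cluster indicator matrix H (samples x clusters)
h = rng.random((n_samples, k))

# One k x k matrix S(i) per layer, symmetrized here for illustration
s_layers = []
for _ in range(n_layers):
    s = rng.random((k, k))
    s_layers.append((s + s.T) / 2)

# Each layer is approximated as A(i) ~ H S(i) H^T
reconstructed = [h @ s @ h.T for s in s_layers]
for a in reconstructed:
    print(a.shape)
```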
Supervised SUMO
- class sumo.modes.run.solvers.supervised_sumo.SupervisedSumoNMF(graph: sumo.network.MultiplexNet, priors, nbins: int, bin_size: int = None, rseed: int = None)
Supervised SUMO solver (A(i)=HS(i)H^T formulation)
- create_sample_bins() → list
Separate samples randomly into bins of set size, while making sure that each sample is allocated to at least one bin and each prior label is represented equally.
- Returns:
- list of arrays containing indices of samples allocated to each bin
- factorize(sparsity_penalty: float, k: int, max_iter: int = 500, tol: float = 1e-05, calc_cost: int = 1, logger_name: str = None, bin_id: int = None) → sumo.modes.run.solver.SumoNMFResults
Run tri-factorization
- Args:
- sparsity_penalty (float): ‘η’ value, corresponding to sparsity penalty for H
- k (int): expected number of clusters
- max_iter (int): maximum number of iterations
- tol (float): if the fluctuation of the objective cost function value is smaller than this value, stop iterations before reaching max_iter
- calc_cost (int): number of steps between every calculation of objective cost function
- logger_name (str): name of existing logger object; if not supplied, a new main logger is used
- bin_id (int): id of sample bin created in SumoNMF constructor (the default of None means clustering all samples instead of the samples in a given bin)
- Returns:
- SumoNMFResults object (with result feature matrix / soft cluster indicator matrix (H array), and the list of result S matrices for each graph layer)
evaluate
SumoEvaluate Class
- class sumo.modes.evaluate.evaluate.SumoEvaluate(**kwargs)
Sumo mode for evaluating accuracy of clustering. Constructor args are set in the ‘evaluate’ subparser.
- Args:
- infile (str): input .tsv file containing sample names in the ‘sample’ column and clustering labels in the ‘label’ column (clusters.tsv file created by running sumo with mode ‘run’)
- labels (str): .tsv file of the same structure as the input file
- metric (str): one of the metrics ([‘NMI’, ‘purity’, ‘ARI’]) for accuracy evaluation; if set to None, all metrics are calculated
- logfile (str): path to save log file; if set to None, stdout is used
- log (str): sets the logging level from [‘DEBUG’, ‘INFO’, ‘WARNING’]
- load_tsv(fname: str)
Load .tsv file
- run()
Evaluate clustering results, given set of labels
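Of the listed metrics, purity is the simplest to state: every cluster is credited with its most frequent true label, and the credited counts are summed and divided by the number of samples. The sketch below follows that standard definition; the function name purity_sketch and the label arrays are made up for illustration, and sumo's own implementation may differ in detail.

```python
import numpy as np


def purity_sketch(true_labels, cluster_labels) -> float:
    """Standard purity definition; hypothetical helper, not sumo's implementation."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    credited = 0
    for cluster in np.unique(cluster_labels):
        members = true_labels[cluster_labels == cluster]
        # Count of the most frequent true label inside this cluster
        _, counts = np.unique(members, return_counts=True)
        credited += counts.max()
    return credited / true_labels.size


truth = [0, 0, 1, 1, 1, 2]
found = [0, 0, 0, 1, 1, 1]
print(purity_sketch(truth, found))  # 4 of 6 samples match their cluster's majority label
```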
interpret
SumoInterpret Class
- class sumo.modes.interpret.interpret.SumoInterpret(**kwargs)
Sumo mode for interpreting clustering results. Constructor args are set in the ‘interpret’ subparser.
- Args:
- sumo_results (str): path to sumo_results.npz (created by running the program with mode “run”)
- infiles (list): comma-delimited list of paths to input files containing standardized feature matrices, with samples in columns and features in rows (supported file types: [‘.txt’, ‘.txt.gz’, ‘.txt.bz2’, ‘.tsv’, ‘.tsv.gz’, ‘.tsv.bz2’])
- output_prefix (str): prefix of output files; sumo will create two output files: (1) a .tsv file containing a matrix (features x clusters), where the value in each cell is the importance of the feature in that cluster; (2) a .hits.tsv file containing the most important features
- hits (int): number of most important features for every cluster that are logged in the .hits.tsv file
- max_iter (int): maximum number of iterations while searching through hyperparameter space
- n_folds (int): number of folds for model cross validation
- t (int): number of threads
- seed (int): random state
- sn (int): index of row with sample names for .txt input files
- fn (int): index of column with feature names for .txt input files
- df (float): if the percentage of missing values for a feature exceeds this value, remove the feature
- ds (float): if the percentage of missing values for a sample (that remains after feature dropping) exceeds this value, remove the sample
- logfile (str): path to save log file; if set to None, stdout is used
- log (str): sets the logging level from [‘DEBUG’, ‘INFO’, ‘WARNING’]
- create_classifier(x: numpy.ndarray, y: numpy.ndarray)
Create a gradient boosting method classifier
- Args:
- x (numpy.ndarray): input feature matrix
- y (numpy.ndarray): one-dimensional array of classification labels
- Returns:
- LGBM classifier
- run()
Find features that support cluster separation