Intermediate Tutorial - Prepare Data & Multi-Algorithm Runs

This tutorial builds on the introduction to SPRAS from the previous tutorial.

It guides participants through how to convert data into a format usable by pathway reconstruction algorithms, run multiple algorithms within a single workflow, and apply new tools to interpret and compare the resulting pathways.

You will learn how to:

Prepare and format data for use with SPRAS
Configure and run additional pathway reconstruction algorithms on a dataset
Enable post-analysis steps to generate post analysis information

Step 1: Transforming high throughput experimental data into SPRAS compatible input data

1.1 What is the SPRAS-standardized input data?

A pathway reconstruction algorithm requires a set of input nodes and an interactome; however, each algorithm expects its inputs to follow a unique format.

To simplify this process, SPRAS requires all input data in a dataset to be formatted once into a standardized SPRAS format. SPRAS then automatically generates algorithm-specific input files when an algorithm is enabled in the configuration file.

Note

Each algorithm uses the input nodes to guide or constrain the optimization process used to construct reconstruct subnetworks.

An algorithm maps these input nodes onto the interactome and uses the network to identify connecting paths between the input nodes to form subnetworks.

Pathway reconstruction algorithms differ in the inputs nodes they require and how they interpret those nodes to identify subnetworks.

Some use source and target nodes to defined start and end points.
Some use prizes, which assign numerical scores assigned to nodes of interest.
Some rely on active nodes, representing nodes that are significantly “on” under specific conditions.

An example of a node file required by SPRAS follows a tab-separated format:

NODEID  prize sources targets active
A       1.0     True            True
B       3.3             True    True
C       2.5     True            True
D       1.9             True    True

Note

If a user provides only one type of input node but wants to run algorithms that require a different type, SPRAS can automatically convert the inputs into the compatible format:

Source-target nodes can be used with all algorithms by making a prize column set to 1 and an active column set to True.
Prize data can be adapted for active based algorithms by automatically making an active column set to True.
Active data can be adapted for prize based algorithms by making a prize column set to 1.

Along with differences in their inputs nodes, pathway reconstruction algorithms also interpret the input interactome differently.

Some algorithms can handle only fully directed interactomes. These interactomes include edges with a specific direction (A -> B).
Others work only with fully undirected interactomes. These interactomes have edges without direction (A - B).
And some support mixed-directionaltiy interactomes. These interactomes contain both directed and undirected edges.

SPRAS automatically converts the user-provided edge file into the format expected by each algorithm, ensuring that the directionality of the interactome matches the algorithm’s requirements.

An example of an edge file required by SPRAS follows a tab-separated format. where U indicates an undirected edge and D indicates a directed edge:

A   B   0.98   U
B   C   0.77   D

Note

SPRAS supports multiple standardized input formats. More information about input data formats can be found in the inputs/README.md file within the SPRAS repository.

1.2 Example high throughput data

An example dataset is using EGF response mass spectrometry data [4]. The experiment for this data was repeated three times, known as biological replicates, to ensure the results are consistent. Each replicate measures the abundance of peptides at different time points (0-128 minutes) to capture how protein activity changes over time.

Note

Mass spectrometry is a technique used to measure and identify proteins in a sample. It works by breaking proteins into smaller pieces called peptides and measuring their mass-to-charge ratio, which enables identifying which peptide is being measured. The data show how much of each peptide is present, which can show how protein phosphorylation abundances change under different conditions.

Since proteins interact with each other in biological pathways, changes in their phosphorylation abundances can reveal which parts of a pathway are active or affected. By mapping these changing proteins onto known interaction networks, pathway reconstruction algorithms can identify which signaling pathways are likely involved in the biological response to a specific condition.

Example of one of the biological replicate A with one peptide:

peptide	protein	gene.name	modified.sites	0 min	2 min	4 min	8 min	16 min	32 min	64 min	128 mn
K.n[305.21]AFWMAIGGDRDEIEGLS[167.00]S[167.00]DEEH.-	Q6PD74,B4DG44,Q5JPJ4,Q6AWA0	AAGAB	S310,S311	14.97	14.81	13.99	13.98	12.87	13.88	13.91	15.60

The goal is to turn this experimental data into the format that SPRAS expects.

1.3 Filtering and normalizing the replicates

Before analysis, we filter out peptides not present in all three replicates to ensure consistency. Then, we normalize each replicate so intensity values are comparable and not biased by replicate-specific effects.

peptide	protein	gene.name	modified.sites	0 min	2 min	4 min	8 min	16 min	32 min	64 min	128 mn	replicate
K.n[305.21]AFWMAIGGDRDEIEGLS[167.00]S[167.00]DEEH.-	Q6PD74,B4DG44,Q5JPJ4,Q6AWA0	AAGAB	S310,S311	2.17	2.09	1.98	1.78	1.99	2.12	2.25	1.46	C
K.n[305.21]AFWMAIGGDRDEIEGLS[167.00]S[167.00]DEEH.-	Q6PD74,B4DG44,Q5JPJ4,Q6AWA0	AAGAB	S310,S311	4.03	3.73	3.32	3.36	3.35	3.37	3.35	3.86	B
K.n[305.21]AFWMAIGGDRDEIEGLS[167.00]S[167.00]DEEH.-	Q6PD74,B4DG44,Q5JPJ4,Q6AWA0	AAGAB	S310,S311	5.60	4.75	4.69	4.59	4.32	4.90	4.90	5.48	A

1.4 Computing p-values using Tukey’s HSD Test

We want to calculate the p-values per peptide. This tells us how likely changes in abundance happen by chance.

We use Tukey’s Honest Significant Difference (HSD) test to compare all time points and correct for multiple testing to get a p-value for every pair of time points.

peptide	protein	2min vs 0min	4min vs 0min	8min vs 0min	16min vs 0min	32min.vs.0min	64min.vs.0min	128min.vs.0min	4min.vs.2min	8min.vs.2min	16min.vs.2min	32min.vs.2min	64min.vs.2min	128min.vs.2min	8min.vs.4min	16min.vs.4min	32min.vs.4min	64min.vs.4min	128min.vs.4min	16min.vs.8min	32min.vs.8min	64min.vs.8min	128min.vs.8min	32min.vs.16min	64min.vs.16min	128min.vs.16min	64min.vs.32min	128min.vs.32min	128min.vs.64min
K.n[305.21]ADVLEAHEAEAEEPEAGK[432.30]S[167.00]EAEDDEDEVDDLPSSR.R	QQ6PD74,B4DG44,Q5JPJ4,Q6AWA0	0.67	0.25	0.14	0.12	0.52	0.76	0.84	0.99	0.93	0.90	1.00	1.00	1.00	1.00	1.00	1.00	0.97	0.94	1.00	0.98	0.87	0.80	0.96	0.83	0.75	1.00	1.00	1.00

Peptides with lower p-values are more statistically significant and may represent biologically meaningful changes in phosphorylation over time.

1.5 From p-values to prizes

P-values are transformed using -log10(p-value) so smaller p-values give larger prize scores.

For each peptide, the smallest p-value is selected (representing the most significant change) between each time point to the baseline (0 min) and between consecutive time points. This is because the ultimate network analysis will not use the temporal information.

For each protein mapped to multiple peptides, the maximum prize value across all its peptides is assigned.

Finally, all protein identifiers (using the first one listed for each protein) are converted to UniProt Entry Names to match the identifiers that will be used in the interactome.

Note

All node identifiers should use the same namespace across every part of the data in a dataset.

peptide	protein	uniprot entry name	min p-value	-log10(min p-value)
K.n[305.21]AFWMAIGGDRDEIEGLS[167.00]S[167.00]DEEH.-	Q6PD74,B4DG44,Q5JPJ4,Q6AWA0	AAGAB_HUMAN	0.12392034609392	0.906857382317364

Input node data put into a SPRAS-standardized format:

NODE_ID     prize
AAGAB_HUMAN 0.906857382

1.6 From Prizes to Source and Targets / Actives

The KEGG ErbB signaling pathway (has04012).

Using known pathway knowledge [1] [2] [3]:

EGF serves as a source for the pathway and was the experimental treatment.
EGF is known to initiate signaling, so it can be added and assigned a high score (greater than all other nodes) to emphasize its importance and guide algorithms to start reconstruction from this point. (EGF is currently not in the data)
EGFR is in the current data. Looking at the pathway, we can see that EGFR directly interacts with EGF in the pathway.
All other downstream proteins detected in the data can also treated as targets.
All proteins in the data can be considered active since they correspond to proteins that are active under the given biological condition.

Input node data put into a SPRAS-standardized format:

NODE_ID     prize       source  target  active
AAGAB_HUMAN 0.906857382         True   True
... more nodes
EGF_HUMAN   10              True    True    True
EGFR_HUMAN  6.787874699                 True        True
... more nodes

1.8 Finding an Interactome to use

To connect our proteins, we use a background protein-protein interaction (PPI) network (the interactome). For this dataset, two interactomes are merged (directed edges prioritized when available):

iRefIndex v13 (159,095 undirected interactions)
PhosphoSitePlus (4,080 directed kinase-substrate interactions)

The combined interactome of iRefIndex v13 and PhosphoSitePlus

The final network has 15,677 proteins and 157,984 edges (~4k of them are directed), and covers 653 of our 702 prize proteins. The proteins identifiers in the interactome are converted to use UniProt Entry Names.

Interactome data put into a SPRAS-standardized format:

TACC1_HUMAN RUXG_HUMAN      0.736771        U
TACC1_HUMAN KAT2A_HUMAN     0.292198        U
TACC1_HUMAN CKAP5_HUMAN     0.724783        U
TACC1_HUMAN YETS4_HUMAN     0.542597        U
TACC1_HUMAN LSM7_HUMAN      0.714823        U
AURKC_HUMAN TACC1_HUMAN     0.553333        D
TACC1_HUMAN AURKA_HUMAN     0.401165        U
TACC1_HUMAN KDM1A_HUMAN     0.367850        U
TACC1_HUMAN MEMO1_HUMAN     0.367850        U
TACC1_HUMAN HD_HUMAN        0.367850        U
... more edges

Note

Many databases exist that provide interactomes. One is STRING, which contains known protein-protein interactions across different species.

1.9 This SPRAS-standardized data is already saved into SPRAS

spras/
├── .snakemake/
│   └── log/
│       └── ... snakemake log files ...
├── config/
│   └── ...
├── inputs/
│   ├── phosphosite-irefindex13.0-uniprot.txt # pre-defined in SPRAS already, used by the intermediate.yaml file
│   └── tps-egfr-prizes.txt # pre-defined in SPRAS already, used by the intermediate.yaml file
├── outputs/
│   └── basic/
│       └── ... output files ...

The data used in this part of the tutorial can be found in the supplementary materials under data supplement 2 and supplement 3 [4].

Step 2: Running multiple algorithms

We can begin running multiple pathway reconstruction algorithms.

For this part of the tutorial, we’ll use a pre-defined configuration file that includes additional algorithms and post-analysis steps available in SPRAS. Download it here: Intermediate Config File

Save the file into the config/ folder of your SPRAS installation.

After adding this file, your directory structure will look like this (ignoring the rest of the folders):

spras/
├── .snakemake/
│   └── log/
│       └── ... snakemake log files ...
├── config/
│   ├── basic.yaml
│   ├── intermediate.yaml
│   └── ... other configs ...
├── inputs/
│   ├── phosphosite-irefindex13.0-uniprot.txt # pre-defined in SPRAS already, used by the intermediate.yaml file
│   ├── tps-egfr-prizes.txt # pre-defined in SPRAS already, used by the intermediate.yaml file
│   └── ... other input data ...
├── outputs/
│   └── basic/
│       └── ... output files ...

2.1 Algorithms in SPRAS

SPRAS supports a wide range of algorithms, each designed around different biological assumptions and optimization strategies (See Pathway Reconstruction Methods for SPRAS’s list of integrated algorithms.)

Wrapped algorithms

Each pathway reconstruction algorithm within SPRAS has been wrapped for SPRAS, meaning it has been prepared for the SPRAS framework.

For an algorithm-specific wrapper, the wrapper includes a module that will create and format the input files required by the algorithm using the SPRAS-standardized input data.

Each algorithm has an associated Docker image located on DockerHub that contains all necessary software dependencies needed to run it. For an algorithm-specific wrapper, it contains a module that will call each image to launch a container for a specified parameter combination, set of prepared algorithm-specific inputs and an output filename (raw-pathway.txt).

With each of the raw-pathway.txt files, an algorithm-specific wrapper includes a module that will convert the algorithm-specific format into a standardized SPRAS format.

2.3 Running SPRAS with multiple algorithms

In the intermediate.yaml configuration file, it is set up to have SPRAS run multiple algorithms with multiple parameter settings on a single dataset.

From the root directory, run the command below from the command line:

snakemake --cores 4 --configfile config/intermediate.yaml

What happens when you run this command

SPRAS will run “slower” when using the intermediate.yaml configuration.

Similar automated steps from the previous tutorial runs behind the scenes for intermediate.yaml. However, this configuration now runs multiple algorithms with different parameter combinations, which takes longer to complete. By increasing the number of cores to 4, it allows Snakemake to parallelize the work locally, speeding up execution when possible. (See Using SPRAS for more information on SPRAS’s parallelization.)

Snakemake starts the workflow

Snakemake reads the options set in the intermediate.yaml configuration file and determines which datasets, algorithms, and parameter combinations need to run. It also checks if any post-analysis steps were requested.

Creating algorithm-specific inputs

For each algorithm marked as include: true in the configuration, SPRAS generates input files tailored to that algorithm.

In this case, every algorithm is enabled, so SPRAS formats the input files required for each algorithm.

Organizing results with parameter hashes

Each <dataset>-<algorithm>-params-<hash> combination gets its own folder created in output/intermediate/.

A matching log file in logs/parameters-<algorithm>-params-<hash>.yaml records the exact parameter values used.

Running the algorithm

SPRAS pulls each algorithm’s Docker image from DockerHub if it isn’t already downloaded locally

SPRAS executes each algorithm by launching multiple Docker contatiners using the algorithm-specific Docker image (once for each parameter configuration), sending the prepared input files and specific parameter settings needed for execution.

Each algorithm runs independently within its Docker container and generates an output file named raw-pathway.txt, which contains the reconstructed subnetwork in the algorithm-specific format.

SPRAS then saves these files to the corresponding folder.

Standardizing the results

SPRAS parses each of the raw output into a standardized SPRAS format (pathway.txt) and SPRAS saves this file in its corresponding folder.

Logging the Snakemake run

Snakemake creates a dated log in .snakemake/log/ This log shows what jobs ran and any errors that occurred during the SPRAS run.

What your directory structure should like after this run:

spras/
├── .snakemake/
│   └── log/
│       └── ... snakemake log files ...
├── config/
│   └── basic.yaml
├── inputs/
│   ├── phosphosite-irefindex13.0-uniprot.txt
│   └── tps-egfr-prizes.txt
├── outputs/
│   └── basic/
│       └── dataset-egfr-merged.pickle
│       └── egfr-meo-params-FJBHHNE
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-meo-params-GKEDDFZ
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-meo-params-JQ4DL7K
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-meo-params-OXXIFMZ
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-mincostflow-params-42UBTQI
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-mincostflow-params-4G2PQRB
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-omicsintegrator1-params-FZI2OGW
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-omicsintegrator1-params-GUMLBDZ
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-omicsintegrator1-params-PCWFPQW
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-omicsintegrator2-params-EHHWPMD
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-omicsintegrator2-params-IV3IPCJ
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-pathlinker-params-4YXABT7
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-pathlinker-params-7S4SLU6
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-pathlinker-params-D4TUKMX
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-pathlinker-params-VQL7BDZ
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-rwr-params-34NN6EK
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-rwr-params-GGZCZBU
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-strwr-params-34NN6EK
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-strwr-params-GGZCZBU
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── logs
│            └── datasets-egfr.yaml
│            └── parameters-allpairs-params-BEH6YB2.yaml
│            └── parameters-domino-params-V3X4RW7.yaml
│            └── parameters-meo-params-FJBHHNE.yaml
│            └── parameters-meo-params-GKEDDFZ.yaml
│            └── parameters-meo-params-JQ4DL7K.yaml
│            └── parameters-meo-params-OXXIFMZ.yaml
│            └── parameters-mincostflow-params-42UBTQI.yaml
│            └── parameters-mincostflow-params-4G2PQRB.yaml
│            └── parameters-mincostflow-params-GGT4CVE.yaml
│            └── parameters-omicsintegrator1-params-FZI2OGW.yaml
│            └── parameters-omicsintegrator1-params-GUMLBDZ.yaml
│            └── parameters-omicsintegrator1-params-PCWFPQW.yaml
│            └── parameters-omicsintegrator2-params-EHHWPMD.yaml
│            └── parameters-omicsintegrator2-params-IV3IPCJ.yaml
│            └── parameters-pathlinker-params-4YXABT7.yaml
│            └── parameters-pathlinker-params-7S4SLU6.yaml
│            └── parameters-pathlinker-params-D4TUKMX.yaml
│            └── parameters-pathlinker-params-VQL7BDZ.yaml
│            └── parameters-rwr-params-34NN6EK.yaml
│            └── parameters-rwr-params-GGZCZBU.yaml
│            └── parameters-strwr-params-34NN6EK.yaml
│            └── parameters-strwr-params-GGZCZBU.yaml
│       └── prepared
│            └── egfr-domino-inputs
│                ├── active_genes.txt
│                └── network.txt
│            └── egfr-meo-inputs
│                ├── edges.txt
│                ├── sources.txt
│                └── targets.txt
│            └── egfr-mincostflow-inputs
│                ├── edges.txt
│                ├── sources.txt
│                └── targets.txt
│            └── egfr-omicsintegrator1-inputs
│                ├── dummy_nodes.txt
│                ├── edges.txt
│                └── prizes.txt
│            └── egfr-omicsintegrator2-inputs
│                ├── edges.txt
│                └── prizes.txt
│            └── egfr-pathlinker-inputs
│                ├── network.txt
│                ── nodetypes.txt
│            └── egfr-rwr-inputs
│                ├── network.txt
│                └── nodes.txt
│            └── egfr-strwr-inputs
|                ├── network.txt
|                ├── sources.txt
|                └── targets.txt

2.4 Reviewing the pathway.txt files

After running the intermediate configuration file, the output/intermediate/ directory will contain many more subfolders and files.

Again, each pathway.txt file contains the standardized reconstructed subnetworks and can be used at face value, or for further post analysis.

Locate the files

Navigate to the output directory output/intermediate/. Inside, you will find subfolders corresponding to each <dataset>-<algorithm>-params-<hash> combination.

Open a pathway.txt file

Each file lists the network edges that were reconstructed for that specific run. The format includes columns for the two interacting nodes, the rank, and the edge direction.

For example, the file egfr-mincostflow-params-42UBTQI/pathway.txt contains the following reconstructed subnetwork:

Node1       Node2   Rank    Direction
CBL_HUMAN   EGFR_HUMAN      1       U
EGFR_HUMAN  EGF_HUMAN       1       U
EMD_HUMAN   LMNA_HUMAN      1       U
FYN_HUMAN   KS6A3_HUMAN     1       U
EGF_HUMAN   HDAC6_HUMAN     1       U
HDAC6_HUMAN HS90A_HUMAN     1       U
KS6A3_HUMAN SRC_HUMAN       1       U
EGF_HUMAN   LMNA_HUMAN      1       U
MYH9_HUMAN  S10A4_HUMAN     1       U
EGF_HUMAN   S10A4_HUMAN     1       U
EMD_HUMAN   SRC_HUMAN       1       U

And the file egfr-omicsintegrator1-params-GUMLBDZ/pathway.txt contains the following reconstructed subnetwork:

Node1       Node2   Rank    Direction
CBLB_HUMAN  EGFR_HUMAN      1       U
CBL_HUMAN   CD2AP_HUMAN     1       U
CBL_HUMAN   CRKL_HUMAN      1       U
CBL_HUMAN   EGFR_HUMAN      1       U
CBL_HUMAN   PLCG1_HUMAN     1       U
CDK1_HUMAN  NPM_HUMAN       1       D
CHD4_HUMAN  HDAC2_HUMAN     1       U
EGFR_HUMAN  EGF_HUMAN       1       U
EGFR_HUMAN  GRB2_HUMAN      1       U
EIF3B_HUMAN EIF3G_HUMAN     1       U
FAK1_HUMAN  PAXI_HUMAN      1       U
GAB1_HUMAN  PTN11_HUMAN     1       U
GRB2_HUMAN  PTN11_HUMAN     1       U
GRB2_HUMAN  SHC1_HUMAN      1       U
HDAC2_HUMAN SIN3A_HUMAN     1       U
HGS_HUMAN   STAM2_HUMAN     1       U
KS6A1_HUMAN MK01_HUMAN      1       U
MK01_HUMAN  ABI1_HUMAN      1       D
MK01_HUMAN  ERF_HUMAN       1       D
MRE11_HUMAN RAD50_HUMAN     1       U

Step 3: Use ML post-analysis

3.1 Adding ML post-analysis to the intermediate configuration

To enable the ML analysis, update the analysis section in your configuration file by setting ml to true. Your analysis section in the configuration file should look like this:

analysis:
    ml:
        include: true
        ... (other parameters preset)

ml will perform unsupervised analyses such as principal component analysis (PCA), hierarchical agglomerative clustering (HAC), ensembling, and jaccard similarity comparisons of the pathways.

The ml section includes configurable parameters that let you adjust the behavior of the analyses performed.

With these updates, SPRAS will run the full set of unsupervised machine learning analyses across all outputs for a given dataset.

After saving the changes in the configuration file, rerun with:

snakemake --cores 4 --configfile config/intermediate.yaml

What happens when you run this command

Reusing cached results

Snakemake reads the options set in intermediate.yaml and checks for any requested post-analysis steps. It reuses cached results; here the pathway.txt files generated from the previously executed algorithms on the egfr dataset are reused.

Running the ml analysis

SPRAS aggregates all the reconstructed subnetworks produced across the specified algorithms for a given dataset. SPRAS then performs machine learning analyses on each these groups and saves the results in the <dataset>-ml/ (egfr-ml/) folder.

What your directory structure should like after this run:

spras/
├── .snakemake/
│   └── log/
│       └── ... snakemake log files ...
├── config/
│   └── basic.yaml
├── inputs/
│   ├── phosphosite-irefindex13.0-uniprot.txt
│   └── tps-egfr-prizes.txt
├── outputs/
│   └── basic/
│       └── dataset-egfr-merged.pickle
│       └── egfr-meo-params-FJBHHNE
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-meo-params-GKEDDFZ
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-meo-params-JQ4DL7K
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-meo-params-OXXIFMZ
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-mincostflow-params-42UBTQI
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-mincostflow-params-4G2PQRB
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-omicsintegrator1-params-FZI2OGW
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-omicsintegrator1-params-GUMLBDZ
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-omicsintegrator1-params-PCWFPQW
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-omicsintegrator2-params-EHHWPMD
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-omicsintegrator2-params-IV3IPCJ
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-pathlinker-params-4YXABT7
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-pathlinker-params-7S4SLU6
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-pathlinker-params-D4TUKMX
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-pathlinker-params-VQL7BDZ
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-rwr-params-34NN6EK
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-rwr-params-GGZCZBU
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-strwr-params-34NN6EK
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-strwr-params-GGZCZBU
│            └── pathway.txt
│            └── raw-pathway.txt
│       └── egfr-ml
│            └── ensemble-pathway.txt
│            └── hac-clusters-horizontal.txt
│            └── hac-clusters-vertical.txt
│            └── hac-horizontal.png
│            └── hac-vertical.png
│            └── jaccard-heatmap.png
│            └── jaccard-matrix.txt
│            └── pca-coordinates.txt
│            └── pca-variance.txt
│            └── pca.png
│       └── logs
│            └── datasets-egfr.yaml
│            └── parameters-allpairs-params-BEH6YB2.yaml
│            └── parameters-domino-params-V3X4RW7.yaml
│            └── parameters-meo-params-FJBHHNE.yaml
│            └── parameters-meo-params-GKEDDFZ.yaml
│            └── parameters-meo-params-JQ4DL7K.yaml
│            └── parameters-meo-params-OXXIFMZ.yaml
│            └── parameters-mincostflow-params-42UBTQI.yaml
│            └── parameters-mincostflow-params-4G2PQRB.yaml
│            └── parameters-mincostflow-params-GGT4CVE.yaml
│            └── parameters-omicsintegrator1-params-FZI2OGW.yaml
│            └── parameters-omicsintegrator1-params-GUMLBDZ.yaml
│            └── parameters-omicsintegrator1-params-PCWFPQW.yaml
│            └── parameters-omicsintegrator2-params-EHHWPMD.yaml
│            └── parameters-omicsintegrator2-params-IV3IPCJ.yaml
│            └── parameters-pathlinker-params-4YXABT7.yaml
│            └── parameters-pathlinker-params-7S4SLU6.yaml
│            └── parameters-pathlinker-params-D4TUKMX.yaml
│            └── parameters-pathlinker-params-VQL7BDZ.yaml
│            └── parameters-rwr-params-34NN6EK.yaml
│            └── parameters-rwr-params-GGZCZBU.yaml
│            └── parameters-strwr-params-34NN6EK.yaml
│            └── parameters-strwr-params-GGZCZBU.yaml
│       └── prepared
│            └── egfr-domino-inputs
│                ├── active_genes.txt
│                └── network.txt
│            └── egfr-meo-inputs
│                ├── edges.txt
│                ├── sources.txt
│                └── targets.txt
│            └── egfr-mincostflow-inputs
│                ├── edges.txt
│                ├── sources.txt
│                └── targets.txt
│            └── egfr-omicsintegrator1-inputs
│                ├── dummy_nodes.txt
│                ├── edges.txt
│                └── prizes.txt
│            └── egfr-omicsintegrator2-inputs
│                ├── edges.txt
│                └── prizes.txt
│            └── egfr-pathlinker-inputs
│                ├── network.txt
│                ── nodetypes.txt
│            └── egfr-rwr-inputs
│                ├── network.txt
│                └── nodes.txt
│            └── egfr-strwr-inputs
|                ├── network.txt
|                ├── sources.txt
|                └── targets.txt

Step 3.2: Reviewing the ML outputs

Ensembles

Open the ensemble file

In your file explorer, go to output/intermediate/egfr-ml/ensemble-pathway.txt and open it locally.

After running multiple algorithms or parameter settings on the same dataset, SPRAS can ensemble the resulting pathways to identify consistent, high-frequency interactions. SPRAS calculates the edge frequency by calculating the proportion of times each edge appears across the outputs.

Node1       Node2   Frequency       Direction
EGF_HUMAN   EGFR_HUMAN      0.42857142857142855     D
EGF_HUMAN   S10A4_HUMAN     0.38095238095238093     D
S10A4_HUMAN MYH9_HUMAN      0.38095238095238093     D
K7PPA8_HUMAN        MDM2_HUMAN      0.09523809523809523     D
MDM2_HUMAN  P53_HUMAN       0.19047619047619047     D
S10A4_HUMAN K7PPA8_HUMAN    0.19047619047619047     D
K7PPA8_HUMAN        SIR1_HUMAN      0.19047619047619047     D
MDM2_HUMAN  MDM4_HUMAN      0.09523809523809523     D
MDM4_HUMAN  P53_HUMAN       0.09523809523809523     D
CD2A2_HUMAN CDK4_HUMAN      0.09523809523809523     D
CDK4_HUMAN  RB_HUMAN        0.09523809523809523     D
MDM2_HUMAN  CD2A2_HUMAN     0.09523809523809523     D
EP300_HUMAN P53_HUMAN       0.2857142857142857      D
K7PPA8_HUMAN        EP300_HUMAN     0.09523809523809523     D
...

High frequency edges indicate interactions consistently recovered by multiple algorithms. Low frequency edges may reflect noise or algorithm-specific connections.

Hierarchical agglomerative clustering

Open the HAC image(s)

In your file explorer, go to output/intermediate/egfr-ml/hac-horizontal.png and/or output/intermediate/egfr-ml/hac-vertical.png and open it locally.

SPRAS includes HAC to group similar pathways outputs based on shared edges. This helps identify clusters of algorithms that produce comparable subnetworks and highlights distinct reconstruction behaviors.

In the plots below, each branch represents a cluster of related pathways. Shorter distances between branches indicate outputs with greater similarity.

Hierarchical agglomerative clustering horizontal view

Hierarchical agglomerative clustering vertical view with colors only

HAC visualizations help compare which algorithms and parameter settings produce similar pathway structures. Tight clusters indicate similar output behavior, while isolated branches may reveal unique results.

Principal component analysis

Open the PCA image

In your file explorer, go to output/intermediate/egfr-ml/pca.png and open it locally.

SPRAS also includes PCA to visualize variation across pathway outputs. Each point represents a pathway, placed based on its overall network structure. Pathways that cluster together in PCA space are more similar, while those farther apart differ in their reconstructed subnetworks.

Principal component analysis visualization across pathway outputs

PCA may help identify patterns such as clusters of similar algorithms outputs, parameter sensitivities, and/or outlier outputs.

Jaccard similarity

Open the jaccard heatmap image

In your file explorer, go to output/intermediate/egfr-ml/jaccard-heatmap.png and open it locally.

SPRAS computes pairwise jaccard similarity between pathway outputs to measure how much overlap exists between their reconstructed subnetworks. The heatmap visualizes how similar the output pathways are between algorithms and their parameter settings.

Jaccard heatmap of the overlap between pathway outputs

Higher similarity values indicate that pathways share many of the same edges, while lower values suggest distinct reconstructions.

Intermediate Tutorial - Prepare Data & Multi-Algorithm Runs

Step 1: Transforming high throughput experimental data into SPRAS compatible input data

1.1 What is the SPRAS-standardized input data?

1.2 Example high throughput data

1.3 Filtering and normalizing the replicates

1.4 Computing p-values using Tukey’s HSD Test

1.5 From p-values to prizes

1.6 From Prizes to Source and Targets / Actives

1.8 Finding an Interactome to use

1.9 This SPRAS-standardized data is already saved into SPRAS

Step 2: Running multiple algorithms

2.1 Algorithms in SPRAS

Wrapped algorithms

2.3 Running SPRAS with multiple algorithms

What happens when you run this command

What your directory structure should like after this run:

2.4 Reviewing the pathway.txt files

Step 3: Use ML post-analysis

3.1 Adding ML post-analysis to the intermediate configuration

What happens when you run this command

What your directory structure should like after this run:

Step 3.2: Reviewing the ML outputs

Ensembles

Hierarchical agglomerative clustering

Principal component analysis

Jaccard similarity

References