Written by: Vineet K. Raghu PhD
© 2018 Dimitrios Manatakis, Vineet K. Raghu, Panagiotis V. Benos and the University of Pittsburgh
This release describes a method called piMGM, which learns an undirected graphical model from data with categorical and continuous variables. The method uses prior knowledge in the form of edge probabilities, which can be curated from several different sources. Instead of treating this knowledge as definite "truth," the method evaluates how reliable each source is before integrating the knowledge from all sources to learn a final model.
Thus, the prior knowledge used can be both unreliable and incomplete.
Software availability: http://www.benoslab.pitt.edu/Software/pimgm
-run
Name of the current job. All results are placed in a folder with this name.
-priors
Path to the directory where the prior knowledge files are stored. Prior knowledge can be supplied either as .sif files or as TSV matrices. See the example prior knowledge files for details.
-data
The filename of the dataset. Variables (genes) should be in columns and samples in rows.
-ns
Number of subsamples used to compute edge stabilities.
This is usually set to around 10-20, depending on the number of samples available.
Default: 20
-nl
Number of sparsity parameter values to test.
A larger number may increase accuracy, but will greatly increase runtime.
Default: 40
-sif
Specify this switch if the prior information is in the form of an .sif file instead of a matrix
-loocv
Use this switch if you have fewer than about 50 samples. piMGM then performs leave-one-out cross-validation instead of subsampling.
-rm <Variable_1>, ..., <Variable_N>
Remove the listed variables from the dataset to be analyzed
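The options above can be assembled into a command line programmatically. The following is a minimal Python sketch, not part of piMGM itself: the function name build_pimgm_command and its parameter names are hypothetical, and only the flags documented above are used. How multiple variables should be separated after -rm is an assumption here (each variable is passed as its own argument), so check this against the behavior of your release before relying on it.

```python
from typing import List, Optional, Sequence


def build_pimgm_command(
    run: str,
    priors: str,
    data: str,
    ns: Optional[int] = None,
    nl: Optional[int] = None,
    sif: bool = False,
    loocv: bool = False,
    rm: Sequence[str] = (),
    jar: str = "runPriors.jar",
) -> List[str]:
    """Return the argument list for one piMGM run (hypothetical helper)."""
    cmd = ["java", "-jar", jar, "-run", run, "-priors", priors, "-data", data]
    if ns is not None:
        cmd += ["-ns", str(ns)]  # number of subsamples
    if nl is not None:
        cmd += ["-nl", str(nl)]  # number of sparsity parameter values
    if sif:
        cmd.append("-sif")  # priors are .sif files, not matrices
    if loocv:
        cmd.append("-loocv")  # leave-one-out cross-validation
    if rm:
        # ASSUMPTION: each variable to remove is passed as a separate argument.
        cmd += ["-rm", *rm]
    return cmd


if __name__ == "__main__":
    # Print a command that could be pasted into a shell.
    print(" ".join(build_pimgm_command(
        "exampleRun", "Example_Coexpression_Priors", "All_Data.txt")))
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting problems when file names contain spaces.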
Please download "Data for piMGM Examples" in order to run the following examples. These examples use All_Data.txt as pseudo gene expression data with just one clinical variable, "Outcome."
java -jar runPriors.jar -run exampleRun -priors Example_Coexpression_Priors -data All_Data.txt
The output of this run is written to the "exampleRun" directory. It includes a list of connections among genes and between genes and the outcome of interest, along with an evaluation of each prior knowledge source. This evaluation is in tabular format with the following four fields.
Prior Weight refers to the normalized weight given to each prior. These weights should sum to one, and they specify the relative confidence the algorithm has in each prior information source.
The corrected p-value is the probability of obtaining this piMGM deviance score if the prior were a random prior of the same size, corrected for multiple comparisons. It can be thought of as the probability that a truly random prior would appear equally poor.
The uncorrected p-value has the same meaning as the corrected p-value, but is not corrected for multiple comparisons.
The deviance score reflects how much more strongly the information in this prior is present in the data than the information in a random prior of equal size. A deviance score of 1 indicates a prior deviance equal to the mean over all random priors, whereas a score below 1 indicates a prior with better information about the data than random.
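To make the relationship between these fields concrete, the following Python sketch shows one way such quantities could be computed. It is an illustration under stated assumptions, not piMGM's actual implementation: the deviance score is taken to be the ratio of the prior's deviance to the mean deviance of random priors (consistent with a score of 1 meaning "equal to the random mean"), the p-value is a simple empirical estimate, and all function names are hypothetical.

```python
from statistics import mean
from typing import Dict, Sequence


def deviance_score(prior_deviance: float,
                   random_deviances: Sequence[float]) -> float:
    """Ratio of a prior's deviance to the mean deviance of random priors.

    ASSUMPTION: a plain ratio, so 1.0 means "no better than random" and
    values below 1.0 mean "better than random".
    """
    return prior_deviance / mean(random_deviances)


def empirical_p_value(prior_deviance: float,
                      random_deviances: Sequence[float]) -> float:
    """Fraction of random priors that look at least as good as this prior.

    ASSUMPTION: lower deviance is better; the +1 terms keep the estimate
    away from exactly zero. piMGM may compute its p-values differently.
    """
    as_good = sum(1 for d in random_deviances if d <= prior_deviance)
    return (as_good + 1) / (len(random_deviances) + 1)


def normalize_weights(raw: Dict[str, float]) -> Dict[str, float]:
    """Rescale non-negative raw weights so that they sum to one."""
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}
```

For instance, a prior with deviance 2.0 compared against random priors with mean deviance 4.0 would receive a score of 0.5 under this sketch, which the text above would read as carrying better-than-random information.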
java -jar runPriors.jar -run examplePositive -priors Example_Pathway_Priors -data Good_Outcome.txt -sif -rm Outcome
Note that -sif must be specified since the pathways in this case are presented as .sif files (lists of edges). Also, the outcome variable must be removed since it is constant among the patients with Good Outcomes.
The output of this run of piMGM can be used to determine "active" pathways in the good outcome patients. A similar procedure can be run on the poor outcome patients to find differences in pathway activity between the two groups.
If the goal is to evaluate pathway activity, every included pathway should have at least 20 or so pieces of information (gene-gene associations) in which both genes are present in the final expression data. All other pathways should be excluded.
If the goal is to evaluate pathway activity, pathways should also have a reasonable percentage of their information present in the expression data (roughly 25% of a pathway's edges should have both genes included in the expression data).
If the goal is to learn an informed graphical model only, then the quantity of the prior information is not so important, and smaller priors can be used without harm.
For expression datasets with more than 500 genes, we recommend running piMGM on a computing cluster, since such runs take too long on a laptop.
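The pathway coverage guidelines above (about 20 usable edges and roughly 25% of edges covered) can be checked before running piMGM. The sketch below is a hypothetical helper, not part of the release. It assumes that each .sif line has the form "source relation target [target ...]" separated by whitespace, as in the Cytoscape simple interaction format, and that gene names contain no spaces. The thresholds are the approximate values from the guidelines and can be adjusted.

```python
from typing import Iterable, List, Set, Tuple


def parse_sif_edges(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse lines of the assumed form 'source relation target [target ...]'."""
    edges = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank lines and nodes listed without edges
        source = fields[0]
        for target in fields[2:]:
            edges.append((source, target))
    return edges


def pathway_coverage(
    edges: List[Tuple[str, str]],
    genes: Set[str],
    min_edges: int = 20,
    min_fraction: float = 0.25,
) -> Tuple[int, float, bool]:
    """Return (covered edge count, covered fraction, meets both guidelines)."""
    covered = sum(1 for a, b in edges if a in genes and b in genes)
    fraction = covered / len(edges) if edges else 0.0
    usable = covered >= min_edges and fraction >= min_fraction
    return covered, fraction, usable


if __name__ == "__main__":
    # Toy pathway and gene list, for illustration only.
    toy_edges = parse_sif_edges(["A pp B", "B pp C D"])
    print(pathway_coverage(toy_edges, {"A", "B", "C"}))
```

In practice, the gene set would be read from the header row of the expression file, and pathways for which the last returned value is False would be left out of the priors directory when the goal is to evaluate pathway activity.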