InformMe.jl Documentation
RUNNING informME
Run the informME software using the following steps in the indicated order (see below for the detailed documentation of each function).
1. REFERENCE GENOME ANALYSIS:
fastaToCpg(FASTAfilename)
This step analyzes the reference genome FASTA_FILE (in FASTA format) and produces a julia JLD2 file CpGlocationChr#.jld2 for each chromosome, which is stored by default in REFGENEDIR, and contains the following information:
location of CpG sites
CpG density for each CpG site
distance between neighboring CpG sites
location of the last CpG site in the chromosome
length of chromosome (in base pairs)
NOTE1: This step only needs to be completed one time for a given reference genome. Start analyzing samples at step 2 if you have previously completed step 1 for your sample's reference genome.
NOTE2: At this time the statistical model of informME has been designed to work only with autosomes, and so the informME software is not recommended for mitochondrial chromosomes, lambda spike-ins, partial contigs, sex chromosomes, et cetera.
2-5. COMBINED SINGLE SAMPLE ANALYSIS:
convertBAMtoBits(bamFilenames,phenoName)
Steps 2 through 5 below can be invoked by the single command above, where bamFilenames is a list of bam files to be modeled and assigned the phenotype name given by the argument phenoName which should be unique to this sample.
2. METHYLATION DATA MATRIX GENERATION:
matrixFromBam(BAM_FILE,CHR_NUM)
This step takes the BAM file BAM_FILE as input and generates the methylation data matrix for chromosome number CHR_NUM. By default, the file BAM_FILE and its associated index file (with extension .bai) is expected to be in BAMDIR, and the output file produced by this step is stored in a subdirectory in INTERDIR named after the chromosome number CHR_NUM. The output file preserves the prefix from the file BAM_FILE and the suffix '_matrices.mat' is appended to it (e.g. if BAM_FILE is normal_sample.bam and CHR_NUM is 10, then the output file is saved as INTERDIR/chr10/normal_sample_matrices.mat). The file produced contains the following information for each genomic region, which is subsequently used for model estimation:
data matrix with -1,0,1 values for methylation status
CpG locations broken down by region
NOTE1: See reference [1], "Online Methods: Quality control and alignment" for our suggested preprocessing steps when generating a sorted, indexed, deduplicated BAM file to input to informME.
3. MODEL ESTIMATION:
estParamsForChr(MAT_FILES,PHENO,matrices_path,reference_path,CHR_NUM)
informME learns the parameters of the Ising probability distribution by combining the methylation data matrices provided through the argument MAT_FILES (comma-separated list) for chromosome number CHR_NUM. By default, the MAT_FILES are expected to be in a subdirectory named after CHR_NUM in "./results/". The output generated during this phase is also stored in a subdirectory in results named after chromosome number CHR_NUM. The output file has as prefix PHENO and the suffix '_fit.jld2' appended to it (e.g. if 'normal' is the PHENO, and CHR_NUM is 10, then the output is stored as ./results/chr10/normal_fit.jld2). The file produced contains the following information:
CpG distances
CpG densities
estimated alpha, beta, and gamma parameters of the Ising model
initial and transition probabilities of the inhomogeneous Markov chain representation of the Ising model
marginal probabilities at each CpG site
the log partition function of the estimated Ising model
4. MODEL ANALYSIS:
methAnalysisForChr(prefix,chr_num,reference_path,estimation_path)
This step consists of analyzing the model learned by computing a number of statistical summaries of the methylation state, including probability distributions of methylation levels, mean methylation levels, and normalized methylation entropies, as well as mean and entropy based classifications. This step also computes entropic sensitivity indices, methylation sensitivity indices, as well information-theoretic quantities associated with methylation channels, such as turnover ratios, channel capacities, and relative dissipated energies. The output generated during this phase is stored in the same directory as the output generated during the first phase, using the same prefix as before. However, the suffix is now '_analysis.mat' (e.g. following the previous example, the output file of this phase is stored as ./results/chr10/normal_analysis.mat). This file contains the following information:
the locations of the CpG sites within the genomic region
numbers of CpG sites within the analysis subregions
which analysis subregions are modeled and which are not
estimated parameters of Ising model in genomic region
methylation level probabilities in modeled subregions
coarse methylation level probabilities
mean methylation levels
normalized methylation entropies
entropic sensitivity indices
methylation sensitivity indices
turnover ratios
channel capacities
relative dissipated energies
5. GENERATE BED FILES FOR SINGLE ANALYSIS:
makeBedsForMethAnalysis(PHENO,analysis_path,reference_path)
This function makes BED files from the methylation analysis results obtained after running methAnalysisForChr for a given phenotype PHENO. By default, the input file (analysis file) is expected to be located in ./results/chr#/PHENO_analysis.mat. In addition, the output files are stored in "./singleMethAnalysisToBed_out/" and have the following names and content:
MML-PHENO.bed: mean methylation levels
NME-PHENO.bed: normalized methylation entropy
METH-PHENO.bed: methylation-based classification (non-variable)
VAR-PHENO.bed: methylation-based classification (variable)
ENTR-PHENO.bed: entropy-based classification
ESI-PHENO.bed (if ESIflag passed): entropic sensitivity indices
MSI-PHENO.bed (if MSIflag passed): methylation sensitivity indices
TURN-PHENO.bed (if MCflag passed): turnover ratios
CAP-PHENO.bed (if MCflag passed): channel capacities
RDE-PHENO.bed (if MCflag passed): relative dissipated energies
6. GENERATE BED FILES FOR DIFFERENTIAL ANALYSIS:
makeBedsForDiffMethAnalysis(PHENO1,PHENO2,analysis_path_1, analysis_path_2,reference_path)
This function makes BED files for the differential methylation analysis results obtained after running methAnalysisForChr for two given phenotypes PHENO1 and PHENO2. The input files (both analysis files) are expected to be located in analysispath1/chr#/PHENO1_analysis.jld2 and analysispath2/chr#/PHENO2_analysis.jld2 respectively. In addition, the output files are stored in "./makeBedsForDiffMethAnalysis_out/" and have the following names and content:
dMML-PHENO1-VS-PHENO2.bed: differences in mean methylation levels
DMU-PHENO1-VS-PHENO2.bed: differential mean-based classification
dNME-PHENO1-VS-PHENO2.bed: differences in normalized methylation entropies
DEU-PHENO1-VS-PHENO2.bed: differential entropy-based classification
JSD-PHENO1-VS-PHENO2.bed: Jensen-Shannon distances
dESI-PHENO1-VS-PHENO2.bed (if –ESI flag passed): differences in entropic sensitivity indices
dMSI-PHENO1-VS-PHENO2.bed (if –MSI flag passed): differences in methylation sensitivity indices
dCAP-PHENO1-VS-PHENO2.bed (if –MC flag passed): differences in channel capacities
dRDE-PHENO1-VS-PHENO2.bed (if –MC flag passed): differences in relative dissipated energies
7. POST PROCESSING:
See the original informME package for post processing scripts in R for gene/region rankings and DMR finding.
Functions
InformMe.fastaToCpG
— MethodThis function is used to analyze a reference genome in order to find and store the locations of all CpG sites within each chromosome and to compute the CpG densities at each CpG site as well as the distances between neighboring CpG sites. A 1-based coordinate system is used, in which the first base is assigned to position 1 and the location of a CpG site is defined by the position of the C nucleotide on the forward strand of the reference genome.
This function must be run ONLY ONCE before proceeding with analysis of BAM files.
USAGE (default):
fastaToCpg(FASTAfilename)
USAGE (optional):
Example of optional usage with additional input parameters.
FastaToCpG(FASTAfilename,outdir="/path/to/outputdir/")
MADATORY INPUT:
FASTAfilename
Full path of FASTA-formatted reference genome to which
available BAM files have been aligned to.
OPTIONAL INPUTS:
outdir
Path where the output will be stored at.
Default value: "./"
wsize
Window size used in CpG density calculations.
Default value: 1000
InformMe.convertBAMtoBits
— MethodRuns the entire MethToBits pipeline.
This function depends on a working instalation of SAMtools that is on the system path PATH.
Before running this function, FastaToCpG.m must be run ONCE.
USAGE (default):
convertBAMtoBits(bamFilenames,phenoName)
USAGE (optional):
Example of optional usage with additional input parameters.
matrixFromBam(bam_prefix,chr_num,reference_path="/path/to/ref")
MADATORY INPUTS:
bamFilenames
A comma seperated string with list of input bam file
names without the ".bam" extension. These
files must be sorted from the least to the greatest base
pair position along the reference sequence and must be
indexed (i.e., the associated BAI file must be available).
The file name must not contain "." characters, but can
contain "_" instead. Moreover, the file name should be
unique.
phenoName
A string which will be the unique identifier of this
sample/model that is built.
##OPTIONAL INPUTS:
reference_path
Path to the root subdirectory where the outputs of this
function are stored.
Default value: "./genome/"
bamfile_path
Path to the subdirectory where the BAM file is located.
Default value: "./indexedBAMfiles/"
matrices_path
Path to the subdirectory where the output of this function
is stored.
Default value: "./matrices/"
estimation_path
A string that specifies the path to the directory that
contains the results of parameter estimation performed
by estParamsForChr.jl.
Default value: "./estimation/"
outdir
A string that specifies the path of the directory in which
the methylation analysis results are written.
Default value: "./output/"
pairedEnds
Flag for paired end read support. A value of 1 indicates
that the sequencer employed paired end reads, whereas a
value of 0 indicates that the sequencer employed single
end reads.
Default value: true
numBasesToTrim
A vector of integers specifying how many bases should be
trimmed from the begining of each read. If the vector
contains two integers, then the first integer specifies
how many bases to trim from the first read in a read pair,
whereas the second integer specifies how many bases should
be trimmed from the second read in the pair. If the
vector contains one integer, then all reads will have
that number of bases trimmed from the beginning of the
read. If no bases are to be trimmed, then this input
must be set to 0.
Default value: 0
regionSize
The size of the genomic regions for which methylation
information is produced (in number of base pairs).
Default value: 3000
minCpGsReqToModel
The minimum number of CpG sites within a genomic region
required to perform statistical estimation.
Default value: 10
boundaryConditions
Flag to decide if boundary conditions should be estimated
freely in MLE.
Default value: false
MSIflag
Flag that determines whether this function performs
computation of the methylation sensitivity index (MSI).
false: no MSI computation.
true: allow MSI computation.
Default value: false
ESIflag
Flag that determines whether this function performs
computation of the entropic sensitivity index (ESI).
false: no ESI computation.
true: allow ESI computation.
Default value: false
MCflag
Flag that determines whether this function performs
computation of turnover ratios, CpG entropies, capacities,
and relative dissipated energies of methylation
channels (MCs).
false: no MC computations.
true: allow MC computations.
Default value: false
chr_nums
A vector with the chromosomes to be processed (without
"chr" string).
Default value: 1:22
numProcessors
The number of processors to use in the computations.
Note that julia must be started as "julia -p 4" if
four processors are desired. The nprocs() function
tells how many cores are available in julia, and
we default to use them all.
Default value: nprocs()
The default values of regionSize and minCpGsReqToModel should only be changed by an expert with a detailed understanding of the code and the methods used.
InformMe.matrixFromBam
— MethodThis function processes a BAM file with aligned reads to a reference genome and produces methylation information for nonoverlapping genomic regions (containing the same number of base pairs) in a given chromosome. The final output for each genomic region is a matrix with -1,0,1 values. Each row of the matrix is a methylation read, whereas each column represents a CpG site within the genomic region. A value of -1 indicates no methylation information is available for the CPG site, 0 indicates that the CpG site is unmethylated, and 1 indicates that the CpG site is methylated.
This function depends on a working instalation of SAMtools that is on the system path PATH.
Before running this function, FastaToCpG.m must be run ONCE.
USAGE (default):
matrixFromBam(bam_prefix,chr_num)
USAGE (optional):
Example of optional usage with additional input parameters.
matrixFromBam(bam_prefix,chr_num; reference_path="/path/to/ref")
MADATORY INPUTS:
bam_prefix
Prefix of the BAM file (without the .bam extension). This
file must be sorted from the least to the greatest base
pair position along the reference sequence and must be
indexed (i.e., the associated BAI file must be available).
The file name must not contain "." characters, but can
contain "_" instead. Moreover, the file name should be
unique.
chr_num
Number representing the chromosome to be processed.
OPTIONAL INPUTS:
reference_path
Path to the root subdirectory where the outputs of this
function are stored.
Default value: "./genome/"
bamfile_path
Path to the subdirectory where the BAM file is located.
Default value: "./indexedBAMfiles/"
matrices_path
Path to the subdirectory where the output of this function
is stored.
Default value: "./matrices/"
pairedEnds
Flag for paired end read support. A value of true indicates
that the sequencer employed paired end reads, whereas a
value of false indicates that the sequencer employed single
end reads.
Default value: true
numBasesToTrim
A vector of integers specifying how many bases should be
trimmed from the begining of each read. If the vector
contains two integers, then the first integer specifies
how many bases to trim from the first read in a read pair,
whereas the second integer specifies how many bases should
be trimmed from the second read in the pair. If the
vector contains one integer, then all reads will have
that number of bases trimmed from the beginning of the
read. If no bases are to be trimmed, then this input
must be set to 0.
Default value: 0
regionSize
The size of the genomic regions for which methylation
information is produced (in number of base pairs).
Default value: 3000
minCpGsReqToModel
The minimum number of CpG sites within a genomic region
required to perform statistical estimation.
Default value: 10
The default values of regionSize and minCpGsReqToModel should only be changed by an expert with a detailed understanding of the code and the methods used.
InformMe.estParamsForChr
— MethodThis function takes a list of BAM files (which correspond to the same phenotype) and performs statistical model estimation within a specific chromosome of interest. The function can be used on a computing cluster to break the work of model estimation to many independent parallel job processes. This is performed only after matrixFromBam.m.
USAGE (default):
estParamsForChr(mat_files,prefix,matrices_path,reference_path,chr_num)
USAGE (optional):
Example of optional usage with additional input parameters.
estParamsForChr(mat_files,prefix,matrices_path,reference_path,chr_num, regionSize=2000)
MANDATORY INPUTS:
mat_files
All the .mat files to be included in the model. This can be a
single .mat file or multiple files in the form of a comma-sepa-
rated list of files.
prefix
A string that specifies the name of the modeled phenotype.
The output files produced will contain this prefix.
matrices_path
A string that specifies the path to the directory that
where the output will be stored.
reference_path
A string that specifies the path to the directory that
contains the results of analysis of the reference genome
performed by FastaToCpG.m as well as the results of
methylation calling performed by matrixFromBam.m.
chr_num
Chromosome number 1 to 22 (in humans) specifying the
chromosome for which statistical estimation must be
performed.
OPTIONAL INPUTS:
regionSize
The size of the genomic region used for parameter
estimation (in number of base pairs).
Default value: 3000
boundaryConditions
Flag to decide if boundary conditions should be estimated
freely in MLE.
Default value: false
The default value of regionSize should only be changed by an expert with a detailed understanding of the code and the methods used.
InformMe.methAnalysisForChr
— MethodThis function performs methylation analysis of a given chromosome in a single phenotype. The function can be used on a computing cluster to break the analysis work to many independent parallel job processes. This is performed only after estParamsForChr.m in the Modeling subdirectory is run to build the Ising models for the phenotype.
USAGE (default):
methAnalysisForChr(prefix,chr_num,reference_path,estimation_path)
USAGE (optional):
Example of optional usage with additional input parameters.
methAnalysisForChr(prefix,chr_num,reference_path,estimation_path, outdir="/path/to/output")
MANDATORY INPUTS:
prefix
A string that specifies the name of the phenotype to be
analyzed.
chr_num
Chromosome number (1 to 22 in humans) specifying the
chromosome for which methylation analyis must be
performed.
reference_path
A string that specifies the path to the directory that
contains the results of analysis of the reference genome
performed by FastaToCpG.m as well as the results of
methylation calling performed by matrixFromBam.jl.
estimation_path
A string that specifies the path to the directory that
contains the results of parameter estimation performed
by estParamsForChr.jl.
OPTIONAL INPUTS:
outdir
A string that specifies the path of the directory in which
the methylation analysis results are written.
Default value: "./results/"
MSIflag
Flag that determines whether this function performs
computation of the methylation sensitivity index (MSI).
false: no MSI computation.
true: allow MSI computation.
Default value: false
ESIflag
Flag that determines whether this function performs
computation of the entropic sensitivity index (ESI).
false: no ESI computation.
true: allow ESI computation.
Default value: false
MCflag
Flag that determines whether this function performs
computation of turnover ratios, CpG entropies, capacities,
and relative dissipated energies of methylation
channels (MCs).
false: no MC computations.
true: allow MC computations.
Default value: false
regionSize
The size of the genomic regions used for parameter
estimation (in number of base pairs).
Default value: 3000
subregionSize
The size of the subregions of a genomic region used
for methylation analysis (in number of base pairs).
The ratio regionSize/subregionSize must be an integer.
Default value: 150
The default values of regionSize and subregionSize should only be changed by an expert with a detailed understanding of the code and the methods used.
InformMe.makeBedsForMethAnalysis
— MethodThis function makes BED files for the methylation analysis results obtained by means of MethAnalysisForChr.m for a single phenotype.
USAGE (default):
makeBedsForMethAnalysis(prefix,analysis_path,reference_path)
USAGE (optional):
Example of optional usage with additional input parameters.
makeBedsForMethAnalysis(prefix,analysis_path,reference_path, outdir="/path/to/output")
MANDATORY INPUTS:
prefix
A string that specifies the name of the phenotype.
analysis_path
A string that specifies the path of the directory in which
the model was constructed.
reference_path
A string that specifies the path to the directory that
contains the results of analysis of the reference genome
performed by FastaToCpG.m as well as the results of
methylation calling performed by matrixFromBam.jl.
OPTIONAL INPUTS:
chrs
A vector of strings for the chromosomes to output to the
final bed files. Default value: `[string("chr",i) for i=1:22]`
outdir
A string that specifies the path of the directory in which
the output BED files are written.
Default value: "./"
MSIflag
Flag that determines whether this function performs
computation of the methylation sensitivity index (MSI).
false: no MSI computation.
true: allow MSI computation.
Default value: false
ESIflag
Flag that determines whether this function performs
computation of the entropic sensitivity index (ESI).
false: no ESI computation.
true: allow ESI computation.
Default value: false
MCflag
Flag that determines whether this function performs
computation of turnover ratios, CpG entropies, capacities,
and relative dissipated energies of methylation
channels (MCs).
false: no MC computations.
true: allow MC computations.
Default value: false
thresh
A scalar used as a threshold in methylation-based
classification.
Default value: 0.4
regionSize
The size of the genomic regions used for parameter
estimation (in number of base pairs).
Default value: 3000
subregionSize
The size of the subregions of a genomic region used
for methylation analysis (in number of base pairs).
The ratio regionSize/subregionSize must be an integer.
Default value: 150
The default values of thresh, regionSize, and subregionSize should only be changed by an expert with a detailed understanding of the code and the methods used.
InformMe.diffMethAnalysisToBed
— FunctionThis function makes BED files for the differential version of the methylation analysis results obtained by means of MethAnalysisForChr.m applied on two dinstict phenotypes.
USAGE (default):
makeBedsForDiffMethAnalysis(prefix_1,prefix_2,analysis_path_1, analysis_path_2,reference_path)
MANDATORY INPUTS:
prefix_X
Strings with the first string specifying
the name of the first phenotype and the second string
specifying the name of the second phenotype used for
differential methylation analysis. Both phenotypes
must have already been analyzed with methAnalysisForChr.jl.
analysis_path_X
A string that specifies the path of the directory in which
the methylation analysis results obtained by
MethAnalysisForChr.jl are stored.
reference_path
A string that specifies the path to the directory that
contains the results of analysis of the reference genome
performed by FastaToCpG.m as well as the results of
methylation calling performed by matrixFromBam.jl.
OPTIONAL INPUTS:
chrs
A vector of strings of chromosomes to be output to the final
bed files. Default value: `[string("chr",i) for i=1:22]`
outdir
A string that specifies the path of the directory in which
the output BED files are written.
Default value: "./makeBedsForDiffMethAnalysis_out/"
MSIflag
Flag that determines whether this function performs
computation of the methylation sensitivity index (MSI).
false: no MSI computation.
true: allow MSI computation.
Default value: false
ESIflag
Flag that determines whether this function performs
computation of the entropic sensitivity index (ESI).
false: no ESI computation.
true: allow ESI computation.
Default value: false
MCflag
Flag that determines whether this function performs
computation of turnover ratios, CpG entropies, capacities,
and relative dissipated energies of methylation
channels (MCs).
false: no MC computations.
true: allow MC computations.
Default value: false
regionSize
The size of the genomic regions used for parameter
estimation (in number of base pairs).
Default value: 3000
subregionSize
The size of the subregions of a genomic region used
for methylation analysis (in number of base pairs).
The ratio regionSize/subregionSize must be an integer.
Default value: 150
minNumCpG
The minimum number of CpG sites within an analysis
subregion required for performing full methylation-based
differential classification.
Default value: 2
thresh
A scalar used as a threshold in methylation-based
differential classification.
Default value: 0.55
threshDMU
A 1x6 vector containing threshold values used for
methylation-based differential classification.
Default value: [-1,-0.55,-0.1,0.1,0.55,1]
threshDEU
A 1x8 vector containing threshold values used for
entropy-based differential classification.
Default value: [-1,-0.5,-0.3,-0.05,0.05,0.3,0.5,1]
The default values of regionSize, subregionSize, minNumCpG, thresh, threshDMU, and threshDEU should only be changed by an expert with a detailed understanding of the code and the methods used.