InformMe.jl Documentation

RUNNING informME

Run the informME software using the following steps in the indicated order (see below for the detailed documentation of each function).

1. REFERENCE GENOME ANALYSIS:

fastaToCpg(FASTAfilename)

This step analyzes the reference genome FASTA_FILE (in FASTA format) and produces a julia JLD2 file CpGlocationChr#.jld2 for each chromosome, which is stored by default in REFGENEDIR, and contains the following information:

  • location of CpG sites

  • CpG density for each CpG site

  • distance between neighboring CpG sites

  • location of the last CpG site in the chromosome

  • length of chromosome (in base pairs)

NOTE1: This step only needs to be completed one time for a given reference genome. Start analyzing samples at step 2 if you have previously completed step 1 for your sample's reference genome.

NOTE2: At this time the statistical model of informME has been designed to work only with autosomes, and so the informME software is not recommended for mitochondrial chromosomes, lambda spike-ins, partial contigs, sex chromosomes, et cetera.

2-5. COMBINED SINGLE SAMPLE ANALYSIS:

convertBAMtoBits(bamFilenames,phenoName)

Steps 2 through 5 below can be invoked by the single command above, where bamFilenames is a list of bam files to be modeled and assigned the phenotype name given by the argument phenoName which should be unique to this sample.

2. METHYLATION DATA MATRIX GENERATION:

matrixFromBam(BAM_FILE,CHR_NUM)

This step takes the BAM file BAM_FILE as input and generates the methylation data matrix for chromosome number CHR_NUM. By default, the file BAM_FILE and its associated index file (with extension .bai) is expected to be in BAMDIR, and the output file produced by this step is stored in a subdirectory in INTERDIR named after the chromosome number CHR_NUM. The output file preserves the prefix from the file BAM_FILE and the suffix '_matrices.mat' is appended to it (e.g. if BAM_FILE is normal_sample.bam and CHR_NUM is 10, then the output file is saved as INTERDIR/chr10/normal_sample_matrices.mat). The file produced contains the following information for each genomic region, which is subsequently used for model estimation:

  • data matrix with -1,0,1 values for methylation status

  • CpG locations broken down by region

NOTE1: See reference [1], "Online Methods: Quality control and alignment" for our suggested preprocessing steps when generating a sorted, indexed, deduplicated BAM file to input to informME.

3. MODEL ESTIMATION:

estParamsForChr(MAT_FILES,PHENO,matrices_path,reference_path,CHR_NUM)

informME learns the parameters of the Ising probability distribution by combining the methylation data matrices provided through the argument MAT_FILES (comma-separated list) for chromosome number CHR_NUM. By default, the MAT_FILES are expected to be in a subdirectory named after CHR_NUM in "./results/". The output generated during this phase is also stored in a subdirectory in results named after chromosome number CHR_NUM. The output file has as prefix PHENO and the suffix '_fit.jld2' appended to it (e.g. if 'normal' is the PHENO, and CHR_NUM is 10, then the output is stored as ./results/chr10/normal_fit.jld2). The file produced contains the following information:

  • CpG distances

  • CpG densities

  • estimated alpha, beta, and gamma parameters of the Ising model

  • initial and transition probabilities of the inhomogeneous Markov chain representation of the Ising model

  • marginal probabilities at each CpG site

  • the log partition function of the estimated Ising model

4. MODEL ANALYSIS:

methAnalysisForChr(prefix,chr_num,reference_path,estimation_path)

This step consists of analyzing the model learned by computing a number of statistical summaries of the methylation state, including probability distributions of methylation levels, mean methylation levels, and normalized methylation entropies, as well as mean and entropy based classifications. This step also computes entropic sensitivity indices, methylation sensitivity indices, as well information-theoretic quantities associated with methylation channels, such as turnover ratios, channel capacities, and relative dissipated energies. The output generated during this phase is stored in the same directory as the output generated during the first phase, using the same prefix as before. However, the suffix is now '_analysis.mat' (e.g. following the previous example, the output file of this phase is stored as ./results/chr10/normal_analysis.mat). This file contains the following information:

  • the locations of the CpG sites within the genomic region

  • numbers of CpG sites within the analysis subregions

  • which analysis subregions are modeled and which are not

  • estimated parameters of Ising model in genomic region

  • methylation level probabilities in modeled subregions

  • coarse methylation level probabilities

  • mean methylation levels

  • normalized methylation entropies

  • entropic sensitivity indices

  • methylation sensitivity indices

  • turnover ratios

  • channel capacities

  • relative dissipated energies

5. GENERATE BED FILES FOR SINGLE ANALYSIS:

makeBedsForMethAnalysis(PHENO,analysis_path,reference_path)

This function makes BED files from the methylation analysis results obtained after running methAnalysisForChr for a given phenotype PHENO. By default, the input file (analysis file) is expected to be located in ./results/chr#/PHENO_analysis.mat. In addition, the output files are stored in "./singleMethAnalysisToBed_out/" and have the following names and content:

  • MML-PHENO.bed: mean methylation levels

  • NME-PHENO.bed: normalized methylation entropy

  • METH-PHENO.bed: methylation-based classification (non-variable)

  • VAR-PHENO.bed: methylation-based classification (variable)

  • ENTR-PHENO.bed: entropy-based classification

  • ESI-PHENO.bed (if ESIflag passed): entropic sensitivity indices

  • MSI-PHENO.bed (if MSIflag passed): methylation sensitivity indices

  • TURN-PHENO.bed (if MCflag passed): turnover ratios

  • CAP-PHENO.bed (if MCflag passed): channel capacities

  • RDE-PHENO.bed (if MCflag passed): relative dissipated energies

6. GENERATE BED FILES FOR DIFFERENTIAL ANALYSIS:

makeBedsForDiffMethAnalysis(PHENO1,PHENO2,analysis_path_1, analysis_path_2,reference_path)

This function makes BED files for the differential methylation analysis results obtained after running methAnalysisForChr for two given phenotypes PHENO1 and PHENO2. The input files (both analysis files) are expected to be located in analysispath1/chr#/PHENO1_analysis.jld2 and analysispath2/chr#/PHENO2_analysis.jld2 respectively. In addition, the output files are stored in "./makeBedsForDiffMethAnalysis_out/" and have the following names and content:

  • dMML-PHENO1-VS-PHENO2.bed: differences in mean methylation levels

  • DMU-PHENO1-VS-PHENO2.bed: differential mean-based classification

  • dNME-PHENO1-VS-PHENO2.bed: differences in normalized methylation entropies

  • DEU-PHENO1-VS-PHENO2.bed: differential entropy-based classification

  • JSD-PHENO1-VS-PHENO2.bed: Jensen-Shannon distances

  • dESI-PHENO1-VS-PHENO2.bed (if –ESI flag passed): differences in entropic sensitivity indices

  • dMSI-PHENO1-VS-PHENO2.bed (if –MSI flag passed): differences in methylation sensitivity indices

  • dCAP-PHENO1-VS-PHENO2.bed (if –MC flag passed): differences in channel capacities

  • dRDE-PHENO1-VS-PHENO2.bed (if –MC flag passed): differences in relative dissipated energies

7. POST PROCESSING:

See the original informME package for post processing scripts in R for gene/region rankings and DMR finding.

Functions

InformMe.fastaToCpGMethod

This function is used to analyze a reference genome in order to find and store the locations of all CpG sites within each chromosome and to compute the CpG densities at each CpG site as well as the distances between neighboring CpG sites. A 1-based coordinate system is used, in which the first base is assigned to position 1 and the location of a CpG site is defined by the position of the C nucleotide on the forward strand of the reference genome.

This function must be run ONLY ONCE before proceeding with analysis of BAM files.

USAGE (default):

fastaToCpg(FASTAfilename)

USAGE (optional):

Example of optional usage with additional input parameters.

FastaToCpG(FASTAfilename,outdir="/path/to/outputdir/")

MADATORY INPUT:

FASTAfilename

           Full path of FASTA-formatted reference genome to which
           available BAM files have been aligned to.

OPTIONAL INPUTS:

outdir

           Path where the output will be stored at.
           Default value: "./"

wsize

           Window size used in CpG density calculations.
           Default value: 1000
source
InformMe.convertBAMtoBitsMethod

Runs the entire MethToBits pipeline.

This function depends on a working instalation of SAMtools that is on the system path PATH.

Before running this function, FastaToCpG.m must be run ONCE.

USAGE (default):

convertBAMtoBits(bamFilenames,phenoName)

USAGE (optional):

Example of optional usage with additional input parameters.

matrixFromBam(bam_prefix,chr_num,reference_path="/path/to/ref")

MADATORY INPUTS:

bamFilenames

            A comma seperated string with list of input bam file
            names without the ".bam" extension. These
            files must be sorted from the least to the greatest base
            pair position along the reference sequence and must be
            indexed (i.e., the associated BAI file must be available).
            The file name must not contain "." characters, but can
            contain "_" instead. Moreover, the file name should be
            unique.

phenoName

            A string which will be the unique identifier of this 
            sample/model that is built.

##OPTIONAL INPUTS:

reference_path

            Path to the root subdirectory where the outputs of this
            function are stored.
            Default value: "./genome/"

bamfile_path

            Path to the subdirectory where the BAM file is located.
            Default value: "./indexedBAMfiles/"

matrices_path

            Path to the subdirectory where the output of this function
            is stored.
            Default value: "./matrices/"

estimation_path

            A string that specifies the path to the directory that
            contains the results of parameter estimation performed
            by estParamsForChr.jl.
            Default value: "./estimation/"

outdir

           A string that specifies the path of the directory in which
           the methylation analysis results are written.
           Default value: "./output/"

pairedEnds

            Flag for paired end read support. A value of 1 indicates
            that the sequencer employed paired end reads, whereas a
            value of 0 indicates that the sequencer employed single
            end reads.
            Default value: true

numBasesToTrim

            A vector of integers specifying how many bases should be
            trimmed from the begining of each read. If the vector
            contains two integers, then the first integer specifies
            how many bases to trim from the first read in a read pair,
            whereas the second integer specifies how many bases should
            be trimmed from the second read in the pair. If the
            vector contains one integer, then all reads will have
            that number of bases trimmed from the beginning of the
            read. If no bases are to be trimmed, then this input
            must be set to 0.
            Default value: 0

regionSize

            The size of the genomic regions for which methylation
            information is produced (in number of base pairs).
            Default value: 3000

minCpGsReqToModel

            The minimum number of CpG sites within a genomic region
            required to perform statistical estimation.
            Default value: 10

boundaryConditions

            Flag to decide if boundary conditions should be estimated
            freely in MLE.
            Default value: false

MSIflag

           Flag that determines whether this function performs
           computation of the methylation sensitivity index (MSI).
           false: no MSI computation.
           true: allow MSI computation.
           Default value: false

ESIflag

           Flag that determines whether this function performs
           computation of the entropic sensitivity index (ESI).
           false: no ESI computation.
           true: allow ESI computation.
           Default value: false

MCflag

           Flag that determines whether this function performs
           computation of turnover ratios, CpG entropies, capacities,
           and relative dissipated energies of methylation
           channels (MCs).
           false: no MC computations.
           true: allow MC computations.
           Default value: false

chr_nums

          A vector with the chromosomes to be processed (without 
          "chr" string). 
          Default value: 1:22

numProcessors

          The number of processors to use in the computations.
          Note that julia must be started as "julia -p 4" if
          four processors are desired. The nprocs() function
          tells how many cores are available in julia, and
          we default to use them all.
          Default value: nprocs()

The default values of regionSize and minCpGsReqToModel should only be changed by an expert with a detailed understanding of the code and the methods used.

source
InformMe.matrixFromBamMethod

This function processes a BAM file with aligned reads to a reference genome and produces methylation information for nonoverlapping genomic regions (containing the same number of base pairs) in a given chromosome. The final output for each genomic region is a matrix with -1,0,1 values. Each row of the matrix is a methylation read, whereas each column represents a CpG site within the genomic region. A value of -1 indicates no methylation information is available for the CPG site, 0 indicates that the CpG site is unmethylated, and 1 indicates that the CpG site is methylated.

This function depends on a working instalation of SAMtools that is on the system path PATH.

Before running this function, FastaToCpG.m must be run ONCE.

USAGE (default):

matrixFromBam(bam_prefix,chr_num)

USAGE (optional):

Example of optional usage with additional input parameters.

matrixFromBam(bam_prefix,chr_num; reference_path="/path/to/ref")

MADATORY INPUTS:

bam_prefix

            Prefix of the BAM file (without the .bam extension). This
            file must be sorted from the least to the greatest base
            pair position along the reference sequence and must be
            indexed (i.e., the associated BAI file must be available).
            The file name must not contain "." characters, but can
            contain "_" instead. Moreover, the file name should be
            unique.

chr_num

            Number representing the chromosome to be processed.

OPTIONAL INPUTS:

reference_path

            Path to the root subdirectory where the outputs of this
            function are stored.
            Default value: "./genome/"

bamfile_path

            Path to the subdirectory where the BAM file is located.
            Default value: "./indexedBAMfiles/"

matrices_path

            Path to the subdirectory where the output of this function
            is stored.
            Default value: "./matrices/"

pairedEnds

            Flag for paired end read support. A value of true indicates
            that the sequencer employed paired end reads, whereas a
            value of false indicates that the sequencer employed single
            end reads.
            Default value: true

numBasesToTrim

            A vector of integers specifying how many bases should be
            trimmed from the begining of each read. If the vector
            contains two integers, then the first integer specifies
            how many bases to trim from the first read in a read pair,
            whereas the second integer specifies how many bases should
            be trimmed from the second read in the pair. If the
            vector contains one integer, then all reads will have
            that number of bases trimmed from the beginning of the
            read. If no bases are to be trimmed, then this input
            must be set to 0.
            Default value: 0

regionSize

            The size of the genomic regions for which methylation
            information is produced (in number of base pairs).
            Default value: 3000

minCpGsReqToModel

            The minimum number of CpG sites within a genomic region
            required to perform statistical estimation.
            Default value: 10

The default values of regionSize and minCpGsReqToModel should only be changed by an expert with a detailed understanding of the code and the methods used.

source
InformMe.estParamsForChrMethod

This function takes a list of BAM files (which correspond to the same phenotype) and performs statistical model estimation within a specific chromosome of interest. The function can be used on a computing cluster to break the work of model estimation to many independent parallel job processes. This is performed only after matrixFromBam.m.

USAGE (default):

estParamsForChr(mat_files,prefix,matrices_path,reference_path,chr_num)

USAGE (optional):

Example of optional usage with additional input parameters.

estParamsForChr(mat_files,prefix,matrices_path,reference_path,chr_num, regionSize=2000)

MANDATORY INPUTS:

mat_files

    All the .mat files to be included in the model. This can be a
    single .mat file or multiple files in the form of a comma-sepa-
    rated list of files.

prefix

    A string that specifies the name of the modeled phenotype.
    The output files produced will contain this prefix.

matrices_path

    A string that specifies the path to the directory that
    where the output will be stored.

reference_path

    A string that specifies the path to the directory that
    contains the results of analysis of the reference genome
    performed by FastaToCpG.m as well as the results of
    methylation calling performed by matrixFromBam.m.

chr_num

    Chromosome number 1 to 22 (in humans) specifying the
    chromosome for which statistical estimation must be
    performed.

OPTIONAL INPUTS:

regionSize

    The size of the genomic region used for parameter
    estimation (in number of base pairs).
    Default value: 3000

boundaryConditions

    Flag to decide if boundary conditions should be estimated
    freely in MLE.
    Default value: false

The default value of regionSize should only be changed by an expert with a detailed understanding of the code and the methods used.

source
InformMe.methAnalysisForChrMethod

This function performs methylation analysis of a given chromosome in a single phenotype. The function can be used on a computing cluster to break the analysis work to many independent parallel job processes. This is performed only after estParamsForChr.m in the Modeling subdirectory is run to build the Ising models for the phenotype.

USAGE (default):

methAnalysisForChr(prefix,chr_num,reference_path,estimation_path)

USAGE (optional):

Example of optional usage with additional input parameters.

methAnalysisForChr(prefix,chr_num,reference_path,estimation_path, outdir="/path/to/output")

MANDATORY INPUTS:

prefix

           A string that specifies the name of the phenotype to be
           analyzed.

chr_num

           Chromosome number (1 to 22 in humans) specifying the
           chromosome for which methylation analyis must be
           performed.

reference_path

           A string that specifies the path to the directory that
           contains the results of analysis of the reference genome
           performed by FastaToCpG.m as well as the results of
           methylation calling performed by matrixFromBam.jl.

estimation_path

           A string that specifies the path to the directory that
           contains the results of parameter estimation performed
           by estParamsForChr.jl.

OPTIONAL INPUTS:

outdir

           A string that specifies the path of the directory in which
           the methylation analysis results are written.
           Default value: "./results/"

MSIflag

           Flag that determines whether this function performs
           computation of the methylation sensitivity index (MSI).
           false: no MSI computation.
           true: allow MSI computation.
           Default value: false

ESIflag

           Flag that determines whether this function performs
           computation of the entropic sensitivity index (ESI).
           false: no ESI computation.
           true: allow ESI computation.
           Default value: false

MCflag

           Flag that determines whether this function performs
           computation of turnover ratios, CpG entropies, capacities,
           and relative dissipated energies of methylation
           channels (MCs).
           false: no MC computations.
           true: allow MC computations.
           Default value: false

regionSize

           The size of the genomic regions used for parameter
           estimation (in number of base pairs).
           Default value: 3000

subregionSize

           The size of the subregions of a genomic region used
           for methylation analysis (in number of base pairs).
           The ratio regionSize/subregionSize must be an integer.
           Default value: 150

The default values of regionSize and subregionSize should only be changed by an expert with a detailed understanding of the code and the methods used.

source
InformMe.makeBedsForMethAnalysisMethod

This function makes BED files for the methylation analysis results obtained by means of MethAnalysisForChr.m for a single phenotype.

USAGE (default):

makeBedsForMethAnalysis(prefix,analysis_path,reference_path)

USAGE (optional):

Example of optional usage with additional input parameters.

makeBedsForMethAnalysis(prefix,analysis_path,reference_path, outdir="/path/to/output")

MANDATORY INPUTS:

prefix

           A string that specifies the name of the phenotype.

analysis_path

           A string that specifies the path of the directory in which
           the model was constructed.

reference_path

           A string that specifies the path to the directory that
           contains the results of analysis of the reference genome
           performed by FastaToCpG.m as well as the results of
           methylation calling performed by matrixFromBam.jl.

OPTIONAL INPUTS:

chrs

          A vector of strings for the chromosomes to output to the
          final bed files. Default value: `[string("chr",i) for i=1:22]`

outdir

           A string that specifies the path of the directory in which
           the output BED files are written.
           Default value: "./"

MSIflag

           Flag that determines whether this function performs
           computation of the methylation sensitivity index (MSI).
           false: no MSI computation.
           true: allow MSI computation.
           Default value: false

ESIflag

           Flag that determines whether this function performs
           computation of the entropic sensitivity index (ESI).
           false: no ESI computation.
           true: allow ESI computation.
           Default value: false

MCflag

           Flag that determines whether this function performs
           computation of turnover ratios, CpG entropies, capacities,
           and relative dissipated energies of methylation
           channels (MCs).
           false: no MC computations.
           true: allow MC computations.
           Default value: false

thresh

           A scalar used as a threshold in methylation-based
           classification.
           Default value: 0.4

regionSize

           The size of the genomic regions used for parameter
           estimation (in number of base pairs).
           Default value: 3000

subregionSize

           The size of the subregions of a genomic region used
           for methylation analysis (in number of base pairs).
           The ratio regionSize/subregionSize must be an integer.
           Default value: 150

The default values of thresh, regionSize, and subregionSize should only be changed by an expert with a detailed understanding of the code and the methods used.

source
InformMe.diffMethAnalysisToBedFunction

This function makes BED files for the differential version of the methylation analysis results obtained by means of MethAnalysisForChr.m applied on two dinstict phenotypes.

USAGE (default):

makeBedsForDiffMethAnalysis(prefix_1,prefix_2,analysis_path_1, analysis_path_2,reference_path)

MANDATORY INPUTS:

prefix_X

           Strings with the first string specifying
           the name of the first phenotype and the second string
           specifying the name of the second phenotype used for
           differential methylation analysis. Both phenotypes
           must have already been analyzed with methAnalysisForChr.jl.

analysis_path_X

           A string that specifies the path of the directory in which
           the methylation analysis results obtained by
           MethAnalysisForChr.jl are stored.

reference_path

           A string that specifies the path to the directory that
           contains the results of analysis of the reference genome
           performed by FastaToCpG.m as well as the results of
           methylation calling performed by matrixFromBam.jl.

OPTIONAL INPUTS:

chrs

           A vector of strings of chromosomes to be output to the final
           bed files. Default value: `[string("chr",i) for i=1:22]`

outdir

           A string that specifies the path of the directory in which
           the output BED files are written.
           Default value: "./makeBedsForDiffMethAnalysis_out/"

MSIflag

           Flag that determines whether this function performs
           computation of the methylation sensitivity index (MSI).
           false: no MSI computation.
           true: allow MSI computation.
           Default value: false

ESIflag

           Flag that determines whether this function performs
           computation of the entropic sensitivity index (ESI).
           false: no ESI computation.
           true: allow ESI computation.
           Default value: false

MCflag

           Flag that determines whether this function performs
           computation of turnover ratios, CpG entropies, capacities,
           and relative dissipated energies of methylation
           channels (MCs).
           false: no MC computations.
           true: allow MC computations.
           Default value: false

regionSize

           The size of the genomic regions used for parameter
           estimation (in number of base pairs).
           Default value: 3000

subregionSize

           The size of the subregions of a genomic region used
           for methylation analysis (in number of base pairs).
           The ratio regionSize/subregionSize must be an integer.
           Default value: 150

minNumCpG

           The minimum number of CpG sites within an analysis
           subregion required for performing full methylation-based
           differential classification.
           Default value: 2

thresh

           A scalar used as a threshold in methylation-based
           differential classification.
           Default value: 0.55

threshDMU

           A 1x6 vector containing threshold values used for
           methylation-based differential classification.
           Default value: [-1,-0.55,-0.1,0.1,0.55,1]

threshDEU

           A 1x8 vector containing threshold values used for
           entropy-based differential classification.
           Default value: [-1,-0.5,-0.3,-0.05,0.05,0.3,0.5,1]

The default values of regionSize, subregionSize, minNumCpG, thresh, threshDMU, and threshDEU should only be changed by an expert with a detailed understanding of the code and the methods used.

source

Index