InformMe.jl Documentation

Functions

InformMe.fastaToCpG - Method

This function is used to analyze a reference genome in order to find and store the locations of all CpG sites within each chromosome and to compute the CpG densities at each CpG site as well as the distances between neighboring CpG sites. A 1-based coordinate system is used, in which the first base is assigned to position 1 and the location of a CpG site is defined by the position of the C nucleotide on the forward strand of the reference genome.

This function must be run ONLY ONCE before proceeding with analysis of BAM files.

USAGE (default):

fastaToCpG(FASTAfilename)

USAGE (optional):

Example of optional usage with additional input parameters.

fastaToCpG(FASTAfilename,outdir="/path/to/outputdir/")

MANDATORY INPUT:

FASTAfilename

           Full path of the FASTA-formatted reference genome to which
           the available BAM files have been aligned.

OPTIONAL INPUTS:

outdir

           Path where the output will be stored.
           Default value: "./"

wsize

           Window size used in CpG density calculations.
           Default value: 1000
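
The quantities described for this function can be illustrated with a minimal sketch. It is not the InformMe.jl implementation: the helper names `cpg_sites` and `cpg_density` are hypothetical, and the density definition used here (number of CpG sites within wsize/2 bases of a site, divided by wsize) is an assumption made for illustration only.

```julia
# Hypothetical helpers for illustration; not part of InformMe.jl.

# Return the 1-based positions of the C in every "CG" dinucleotide
# on the forward strand of an ASCII nucleotide sequence.
function cpg_sites(seq::AbstractString)
    s = uppercase(seq)
    return [i for i in 1:length(s)-1 if s[i] == 'C' && s[i+1] == 'G']
end

# Assumed density definition: number of CpG sites within wsize/2 bases
# of each site, divided by the window size.
function cpg_density(sites::Vector{Int}, wsize::Int)
    half = div(wsize, 2)
    return [count(p -> abs(p - s) <= half, sites) / wsize for s in sites]
end

sites = cpg_sites("ACGTTCGCGA")   # 1-based positions of the C in each CpG
distances = diff(sites)           # distances between neighboring CpG sites
density = cpg_density(sites, 4)   # small window chosen for the toy sequence
```

For a real genome the default window of 1000 would replace the toy value of 4.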

InformMe.convertBAMtoBits - Method

Runs the entire MethToBits pipeline.

This function depends on a working installation of SAMtools that is on the system PATH.

Before running this function, fastaToCpG must be run ONCE.

USAGE (default):

convertBAMtoBits(bamFilenames,phenoName)

USAGE (optional):

Example of optional usage with additional input parameters.

convertBAMtoBits(bamFilenames,phenoName; reference_path="/path/to/ref")

MANDATORY INPUTS:

bamFilenames

            A comma-separated string listing the input BAM file
            names without the ".bam" extension. These
            files must be sorted from the least to the greatest base
            pair position along the reference sequence and must be
            indexed (i.e., the associated BAI file must be available).
            The file name must not contain "." characters, but can
            contain "_" instead. Moreover, the file name should be
            unique.

phenoName

            A string which will be the unique identifier of this 
            sample/model that is built.
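
The naming rules for bamFilenames can be checked before starting a long run. The following is a minimal sketch, assuming the string is split on commas; the helper name `parse_bam_prefixes` is hypothetical and is not part of InformMe.jl.

```julia
# Hypothetical helper for illustration; not part of InformMe.jl.
# Split a comma-separated list of BAM prefixes and check the naming rules
# stated above: no "." characters and no duplicate names.
function parse_bam_prefixes(bamFilenames::AbstractString)
    prefixes = String.(strip.(split(bamFilenames, ',')))
    for p in prefixes
        isempty(p) && throw(ArgumentError("empty BAM prefix in list"))
        occursin(".", p) && throw(ArgumentError("prefix \"$p\" must not contain \".\""))
    end
    allunique(prefixes) || throw(ArgumentError("BAM prefixes must be unique"))
    return prefixes
end

prefixes = parse_bam_prefixes("normal_rep1,normal_rep2")

# A name that still carries the ".bam" extension is rejected.
rejected = try
    parse_bam_prefixes("normal_rep1.bam")
    false
catch
    true
end
```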

OPTIONAL INPUTS:

reference_path

            Path to the root subdirectory where the outputs of this
            function are stored.
            Default value: "./genome/"

bamfile_path

            Path to the subdirectory where the BAM file is located.
            Default value: "./indexedBAMfiles/"

matrices_path

            Path to the subdirectory where the output of this function
            is stored.
            Default value: "./matrices/"

estimation_path

            A string that specifies the path to the directory that
            contains the results of parameter estimation performed
            by estParamsForChr.jl.
            Default value: "./estimation/"

outdir

           A string that specifies the path of the directory in which
           the methylation analysis results are written.
           Default value: "./output/"

pairedEnds

            Flag for paired end read support. A value of true indicates
            that the sequencer employed paired end reads, whereas a
            value of false indicates that the sequencer employed single
            end reads.
            Default value: true

numBasesToTrim

            A vector of integers specifying how many bases should be
            trimmed from the beginning of each read. If the vector
            contains two integers, then the first integer specifies
            how many bases to trim from the first read in a read pair,
            whereas the second integer specifies how many bases should
            be trimmed from the second read in the pair. If the
            vector contains one integer, then all reads will have
            that number of bases trimmed from the beginning of the
            read. If no bases are to be trimmed, then this input
            must be set to 0.
            Default value: 0

regionSize

            The size of the genomic regions for which methylation
            information is produced (in number of base pairs).
            Default value: 3000

minCpGsReqToModel

            The minimum number of CpG sites within a genomic region
            required to perform statistical estimation.
            Default value: 10

boundaryConditions

            Flag to decide if boundary conditions should be estimated
            freely in MLE.
            Default value: false

MSIflag

           Flag that determines whether this function performs
           computation of the methylation sensitivity index (MSI).
           false: no MSI computation.
           true: allow MSI computation.
           Default value: false

ESIflag

           Flag that determines whether this function performs
           computation of the entropic sensitivity index (ESI).
           false: no ESI computation.
           true: allow ESI computation.
           Default value: false

MCflag

           Flag that determines whether this function performs
           computation of turnover ratios, CpG entropies, capacities,
           and relative dissipated energies of methylation
           channels (MCs).
           false: no MC computations.
           true: allow MC computations.
           Default value: false

chr_nums

          A vector with the chromosomes to be processed (without 
          "chr" string). 
          Default value: 1:22

numProcessors

          The number of processors to use in the computations.
          Note that Julia must be started as "julia -p 4" if
          four worker processes are desired. The nprocs() function
          reports how many Julia processes are available, and
          all of them are used by default.
          Default value: nprocs()

The default values of regionSize and minCpGsReqToModel should only be changed by an expert with a detailed understanding of the code and the methods used.
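
To see what the numProcessors default would be in a given session, the process counts can be inspected with the standard Distributed library. This is a minimal sketch that only queries the session; the variable names are illustrative. Note that nprocs() counts the main process as well, so a session started with "julia -p 4" reports 5 processes and 4 workers.

```julia
using Distributed

# Total number of Julia processes in this session, including the main process.
total_processes = nprocs()

# Number of worker processes available for parallel work
# (equal to 1 when no extra workers were started).
available_workers = nworkers()

println("processes: ", total_processes, ", workers: ", available_workers)
```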

InformMe.matrixFromBam - Method

This function processes a BAM file with reads aligned to a reference genome and produces methylation information for nonoverlapping genomic regions (containing the same number of base pairs) in a given chromosome. The final output for each genomic region is a matrix with values -1, 0, or 1. Each row of the matrix is a methylation read, whereas each column represents a CpG site within the genomic region. A value of -1 indicates that no methylation information is available for the CpG site, 0 indicates that the CpG site is unmethylated, and 1 indicates that the CpG site is methylated.
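
The matrix encoding can be illustrated with a small sketch. The read representation used here (a dictionary from CpG position to methylation call) and the helper name `reads_to_matrix` are assumptions for illustration; they do not describe how InformMe.jl parses BAM records.

```julia
# Hypothetical helper for illustration; not part of InformMe.jl.
# Each read is represented as a Dict mapping a CpG position to `true`
# (methylated) or `false` (unmethylated). Sites a read does not cover
# are left at -1.
function reads_to_matrix(reads::Vector{Dict{Int,Bool}}, cpg_positions::Vector{Int})
    M = fill(-1, length(reads), length(cpg_positions))
    for (row, calls) in enumerate(reads)
        for (col, pos) in enumerate(cpg_positions)
            if haskey(calls, pos)
                M[row, col] = calls[pos] ? 1 : 0
            end
        end
    end
    return M
end

cpg_positions = [105, 131, 162]                 # CpG sites in one region
reads = [Dict(105 => true, 131 => false),       # read covering sites 1 and 2
         Dict(131 => true, 162 => true)]        # read covering sites 2 and 3
M = reads_to_matrix(reads, cpg_positions)
```

Here the first row records a methylated call, an unmethylated call, and a missing observation, in that order.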

This function depends on a working installation of SAMtools that is on the system PATH.

Before running this function, fastaToCpG must be run ONCE.

USAGE (default):

matrixFromBam(bam_prefix,chr_num)

USAGE (optional):

Example of optional usage with additional input parameters.

matrixFromBam(bam_prefix,chr_num; reference_path="/path/to/ref")

MANDATORY INPUTS:

bam_prefix

            Prefix of the BAM file (without the .bam extension). This
            file must be sorted from the least to the greatest base
            pair position along the reference sequence and must be
            indexed (i.e., the associated BAI file must be available).
            The file name must not contain "." characters, but can
            contain "_" instead. Moreover, the file name should be
            unique.

chr_num

            Number representing the chromosome to be processed.

OPTIONAL INPUTS:

reference_path

            Path to the root subdirectory where the outputs of this
            function are stored.
            Default value: "./genome/"

bamfile_path

            Path to the subdirectory where the BAM file is located.
            Default value: "./indexedBAMfiles/"

matrices_path

            Path to the subdirectory where the output of this function
            is stored.
            Default value: "./matrices/"

pairedEnds

            Flag for paired end read support. A value of true indicates
            that the sequencer employed paired end reads, whereas a
            value of false indicates that the sequencer employed single
            end reads.
            Default value: true

numBasesToTrim

            A vector of integers specifying how many bases should be
            trimmed from the beginning of each read. If the vector
            contains two integers, then the first integer specifies
            how many bases to trim from the first read in a read pair,
            whereas the second integer specifies how many bases should
            be trimmed from the second read in the pair. If the
            vector contains one integer, then all reads will have
            that number of bases trimmed from the beginning of the
            read. If no bases are to be trimmed, then this input
            must be set to 0.
            Default value: 0

regionSize

            The size of the genomic regions for which methylation
            information is produced (in number of base pairs).
            Default value: 3000

minCpGsReqToModel

            The minimum number of CpG sites within a genomic region
            required to perform statistical estimation.
            Default value: 10

The default values of regionSize and minCpGsReqToModel should only be changed by an expert with a detailed understanding of the code and the methods used.
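
The numBasesToTrim rule described above (one value for all reads, or one value per read in a pair) can be sketched as follows. The helper names `bases_to_trim` and `trim_start` are hypothetical, and the sketch operates on plain strings rather than on BAM records.

```julia
# Hypothetical helpers for illustration; not part of InformMe.jl.

# Resolve how many bases to trim for the first (1) or second (2) read
# in a pair, following the one-value / two-value rule described above.
function bases_to_trim(numBasesToTrim, read_in_pair::Int)
    v = numBasesToTrim isa Integer ? [numBasesToTrim] : numBasesToTrim
    if length(v) == 1
        return v[1]
    elseif length(v) == 2
        return v[read_in_pair]
    else
        throw(ArgumentError("numBasesToTrim must contain one or two integers"))
    end
end

# Drop the first n bases of an ASCII read sequence.
trim_start(read::AbstractString, n::Int) = read[n+1:end]

n_second = bases_to_trim([5, 3], 2)        # second read in a pair
trimmed = trim_start("ACGTACGT", n_second)
```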

InformMe.estParamsForChr - Method

This function takes a list of BAM files (which correspond to the same phenotype) and performs statistical model estimation within a specific chromosome of interest. The function can be used on a computing cluster to split the work of model estimation into many independent parallel jobs. It must be run only after matrixFromBam.

USAGE (default):

estParamsForChr(mat_files,prefix,matrices_path,reference_path,chr_num)

USAGE (optional):

Example of optional usage with additional input parameters.

estParamsForChr(mat_files,prefix,matrices_path,reference_path,chr_num, regionSize=2000)

MANDATORY INPUTS:

mat_files

    All the .mat files to be included in the model. This can be a
    single .mat file or multiple files in the form of a
    comma-separated list of files.

prefix

    A string that specifies the name of the modeled phenotype.
    The output files produced will contain this prefix.

matrices_path

    A string that specifies the path to the directory
    where the output will be stored.

reference_path

    A string that specifies the path to the directory that
    contains the results of analysis of the reference genome
    performed by fastaToCpG as well as the results of
    methylation calling performed by matrixFromBam.

chr_num

    Chromosome number 1 to 22 (in humans) specifying the
    chromosome for which statistical estimation must be
    performed.

OPTIONAL INPUTS:

regionSize

    The size of the genomic region used for parameter
    estimation (in number of base pairs).
    Default value: 3000

boundaryConditions

    Flag to decide if boundary conditions should be estimated
    freely in MLE.
    Default value: false

The default value of regionSize should only be changed by an expert with a detailed understanding of the code and the methods used.
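
To make the role of regionSize concrete, the sketch below assigns CpG positions to nonoverlapping regions of regionSize base pairs. It assumes that regions start at position 1 of the chromosome, which is not stated in this documentation; the helper names are hypothetical.

```julia
# Hypothetical helpers for illustration; not part of InformMe.jl.
# Assumption: region 1 covers positions 1..regionSize, region 2 covers
# regionSize+1..2*regionSize, and so on.
region_index(pos::Int, regionSize::Int) = div(pos - 1, regionSize) + 1

# Group 1-based CpG positions by the region that contains them.
function group_by_region(sites::Vector{Int}, regionSize::Int)
    groups = Dict{Int,Vector{Int}}()
    for s in sites
        push!(get!(groups, region_index(s, regionSize), Int[]), s)
    end
    return groups
end

groups = group_by_region([10, 2999, 3000, 3001, 6500], 3000)
```

Regions with too few CpG sites could then be filtered out by comparing `length(groups[k])` with a minimum count.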

InformMe.methAnalysisForChr - Method

This function performs methylation analysis of a given chromosome in a single phenotype. The function can be used on a computing cluster to split the analysis work into many independent parallel jobs. It must be run only after estParamsForChr has built the Ising models for the phenotype.

USAGE (default):

methAnalysisForChr(prefix,chr_num,reference_path,estimation_path)

USAGE (optional):

Example of optional usage with additional input parameters.

methAnalysisForChr(prefix,chr_num,reference_path,estimation_path, outdir="/path/to/output")

MANDATORY INPUTS:

prefix

           A string that specifies the name of the phenotype to be
           analyzed.

chr_num

           Chromosome number (1 to 22 in humans) specifying the
           chromosome for which methylation analysis must be
           performed.

reference_path

           A string that specifies the path to the directory that
           contains the results of analysis of the reference genome
           performed by fastaToCpG as well as the results of
           methylation calling performed by matrixFromBam.jl.

estimation_path

           A string that specifies the path to the directory that
           contains the results of parameter estimation performed
           by estParamsForChr.jl.

OPTIONAL INPUTS:

outdir

           A string that specifies the path of the directory in which
           the methylation analysis results are written.
           Default value: "./results/"

MSIflag

           Flag that determines whether this function performs
           computation of the methylation sensitivity index (MSI).
           false: no MSI computation.
           true: allow MSI computation.
           Default value: false

ESIflag

           Flag that determines whether this function performs
           computation of the entropic sensitivity index (ESI).
           false: no ESI computation.
           true: allow ESI computation.
           Default value: false

MCflag

           Flag that determines whether this function performs
           computation of turnover ratios, CpG entropies, capacities,
           and relative dissipated energies of methylation
           channels (MCs).
           false: no MC computations.
           true: allow MC computations.
           Default value: false

regionSize

           The size of the genomic regions used for parameter
           estimation (in number of base pairs).
           Default value: 3000

subregionSize

           The size of the subregions of a genomic region used
           for methylation analysis (in number of base pairs).
           The ratio regionSize/subregionSize must be an integer.
           Default value: 150

The default values of regionSize and subregionSize should only be changed by an expert with a detailed understanding of the code and the methods used.
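
The integer-ratio requirement on regionSize and subregionSize can be checked with a few lines. This is a minimal sketch; the helper name `subregions_per_region` is hypothetical, and placing subregion starts at offsets 1, 1+subregionSize, and so on is an assumption for illustration.

```julia
# Hypothetical helper for illustration; not part of InformMe.jl.
# Return how many subregions fit in one region, or throw if the ratio
# regionSize/subregionSize is not an integer.
function subregions_per_region(regionSize::Int, subregionSize::Int)
    regionSize % subregionSize == 0 ||
        throw(ArgumentError("regionSize must be a multiple of subregionSize"))
    return div(regionSize, subregionSize)
end

n_sub = subregions_per_region(3000, 150)   # default values from above

# Assumed 1-based start offsets of the subregions within one region.
starts = collect(1:150:3000)

# A subregion size that does not divide the region size is rejected.
rejected = try
    subregions_per_region(3000, 140)
    false
catch
    true
end
```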

InformMe.makeBedsForMethAnalysis - Method

This function makes BED files for the methylation analysis results obtained by means of methAnalysisForChr for a single phenotype.

USAGE (default):

makeBedsForMethAnalysis(prefix,analysis_path,reference_path)

USAGE (optional):

Example of optional usage with additional input parameters.

makeBedsForMethAnalysis(prefix,analysis_path,reference_path, outdir="/path/to/output")

MANDATORY INPUTS:

prefix

           A string that specifies the name of the phenotype.

analysis_path

           A string that specifies the path of the directory in which
           the model was constructed.

reference_path

           A string that specifies the path to the directory that
           contains the results of analysis of the reference genome
           performed by fastaToCpG as well as the results of
           methylation calling performed by matrixFromBam.jl.

OPTIONAL INPUTS:

chrs

          A vector of strings for the chromosomes to output to the
          final bed files. Default value: `[string("chr",i) for i=1:22]`

outdir

           A string that specifies the path of the directory in which
           the output BED files are written.
           Default value: "./"

MSIflag

           Flag that determines whether this function performs
           computation of the methylation sensitivity index (MSI).
           false: no MSI computation.
           true: allow MSI computation.
           Default value: false

ESIflag

           Flag that determines whether this function performs
           computation of the entropic sensitivity index (ESI).
           false: no ESI computation.
           true: allow ESI computation.
           Default value: false

MCflag

           Flag that determines whether this function performs
           computation of turnover ratios, CpG entropies, capacities,
           and relative dissipated energies of methylation
           channels (MCs).
           false: no MC computations.
           true: allow MC computations.
           Default value: false

thresh

           A scalar used as a threshold in methylation-based
           classification.
           Default value: 0.4

regionSize

           The size of the genomic regions used for parameter
           estimation (in number of base pairs).
           Default value: 3000

subregionSize

           The size of the subregions of a genomic region used
           for methylation analysis (in number of base pairs).
           The ratio regionSize/subregionSize must be an integer.
           Default value: 150

The default values of thresh, regionSize, and subregionSize should only be changed by an expert with a detailed understanding of the code and the methods used.
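
As background for reading the output, the BED format uses 0-based, half-open intervals, whereas the coordinates in this documentation are 1-based. The sketch below builds the default chrs vector and formats one BED line from a 1-based inclusive interval. The helper name `bed_line` and the four-column layout are assumptions for illustration; the columns InformMe.jl actually writes are not described here.

```julia
# Default chromosome names, as given for the chrs input above.
chrs = [string("chr", i) for i in 1:22]

# Hypothetical helper for illustration; not part of InformMe.jl.
# Convert a 1-based inclusive interval [first_pos, last_pos] into a
# tab-separated BED line with a 0-based start and an exclusive end.
function bed_line(chr::AbstractString, first_pos::Int, last_pos::Int, value::Real)
    return join((chr, first_pos - 1, last_pos, value), '\t')
end

line = bed_line(chrs[1], 1, 150, 0.25)
```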

InformMe.diffMethAnalysisToBed - Function

This function makes BED files for the differential version of the methylation analysis results obtained by means of methAnalysisForChr applied to two distinct phenotypes.

USAGE (default):

makeBedsForDiffMethAnalysis(prefix_1,prefix_2,analysis_path_1, analysis_path_2,reference_path)

MANDATORY INPUTS:

prefix_X

           Strings with the first string specifying
           the name of the first phenotype and the second string
           specifying the name of the second phenotype used for
           differential methylation analysis. Both phenotypes
           must have already been analyzed with methAnalysisForChr.jl.

analysis_path_X

           A string that specifies the path of the directory in which
           the methylation analysis results obtained by
           methAnalysisForChr.jl are stored.

reference_path

           A string that specifies the path to the directory that
           contains the results of analysis of the reference genome
           performed by fastaToCpG as well as the results of
           methylation calling performed by matrixFromBam.jl.

OPTIONAL INPUTS:

chrs

           A vector of strings of chromosomes to be output to the final
           bed files. Default value: `[string("chr",i) for i=1:22]`

outdir

           A string that specifies the path of the directory in which
           the output BED files are written.
           Default value: "./makeBedsForDiffMethAnalysis_out/"

MSIflag

           Flag that determines whether this function performs
           computation of the methylation sensitivity index (MSI).
           false: no MSI computation.
           true: allow MSI computation.
           Default value: false

ESIflag

           Flag that determines whether this function performs
           computation of the entropic sensitivity index (ESI).
           false: no ESI computation.
           true: allow ESI computation.
           Default value: false

MCflag

           Flag that determines whether this function performs
           computation of turnover ratios, CpG entropies, capacities,
           and relative dissipated energies of methylation
           channels (MCs).
           false: no MC computations.
           true: allow MC computations.
           Default value: false

regionSize

           The size of the genomic regions used for parameter
           estimation (in number of base pairs).
           Default value: 3000

subregionSize

           The size of the subregions of a genomic region used
           for methylation analysis (in number of base pairs).
           The ratio regionSize/subregionSize must be an integer.
           Default value: 150

minNumCpG

           The minimum number of CpG sites within an analysis
           subregion required for performing full methylation-based
           differential classification.
           Default value: 2

thresh

           A scalar used as a threshold in methylation-based
           differential classification.
           Default value: 0.55

threshDMU

           A 1x6 vector containing threshold values used for
           methylation-based differential classification.
           Default value: [-1,-0.55,-0.1,0.1,0.55,1]

threshDEU

           A 1x8 vector containing threshold values used for
           entropy-based differential classification.
           Default value: [-1,-0.5,-0.3,-0.05,0.05,0.3,0.5,1]

The default values of regionSize, subregionSize, minNumCpG, thresh, threshDMU, and threshDEU should only be changed by an expert with a detailed understanding of the code and the methods used.
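
A vector of n sorted thresholds, such as threshDMU or threshDEU, divides its range into n-1 consecutive intervals. The sketch below shows one way to find the interval that contains a differential value. How InformMe.jl maps these intervals to classes, and how it treats values that fall exactly on a threshold, is not stated here; the helper name `threshold_bin` and the choice to assign boundary values to the interval on their right are assumptions.

```julia
# Default thresholds, as listed above.
threshDMU = [-1, -0.55, -0.1, 0.1, 0.55, 1]
threshDEU = [-1, -0.5, -0.3, -0.05, 0.05, 0.3, 0.5, 1]

# Hypothetical helper for illustration; not part of InformMe.jl.
# Return the index of the interval [thresholds[k], thresholds[k+1])
# containing x; the largest threshold is assigned to the last interval.
function threshold_bin(x::Real, thresholds::AbstractVector{<:Real})
    first(thresholds) <= x <= last(thresholds) ||
        throw(DomainError(x, "value lies outside the threshold range"))
    return min(searchsortedlast(thresholds, x), length(thresholds) - 1)
end

dmu_bin = threshold_bin(0.0, threshDMU)   # middle interval of five
deu_bin = threshold_bin(0.0, threshDEU)   # middle interval of seven
```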
