
Last updated 2015-02-25 20:30:21 EST


 

What is TruHmm?

TruHmm is a reference-based transcriptome assembler for prokaryotes, suitable for assembling transcripts from directional RNA-seq libraries. The software package is written in standard C++ and Perl. The C++ code can be compiled with the GNU C++ compiler on Linux. For details of the algorithm, please refer to the publication:

Shan Li, Xia Dong and Zhengchang Su. Directional RNA-seq reveals highly complex condition-dependent transcriptomes in E. coli K12 through accurate full-length transcripts assembling. BMC Genomics, 2013, 14:520.  doi:10.1186/1471-2164-14-520.

The project was funded by the National Institutes of Health (1R01GM106013) and the National Science Foundation (EF0849615 and CCF1048261).

 

How to use TruHmm?

* Download the software and decompress the file using unzip:

       unzip   TruHmm.zip

*  The TruHmm package includes five programs:

          1) Training program: TruHmm_training.cpp

          2) Operon reconstruction program: reconstruct_operons.pl

          3) Accuracy evaluation program: validate_TruHmm_accuracy.pl

          4) Antisense RNA reconstruction program: reconstruct_antisenseRNAs.pl

          5) Non-coding RNA reconstruction program: reconstruct_ncRNAs.pl

 

* Use the provided executable file named truHmm_training, or compile the source code TruHmm_training.cpp on Linux using the command:

       g++ TruHmm_training.cpp  -o  truHmm_training

 

* Training program command line:

******

USAGE:

 

******

./truHmm_training <SAMfile> <chrLength> <geneFile> <sampleFolderName> [OPTIONS]

 

<SAMfile>              mapping file in SAM format

 

<chrLength>          length of the chromosome of the reference genome

 

<geneFile>             genes/tRNA/rRNA/ncRNA of the reference genome (format: example.gene)

 

<sampleFolderName>    name of the sample folder. Make a directory/folder for each sample. You can name it after
                           the sample or simply call it Sample_ID, where ID = 1, 2, etc. It is also the prefix of
                           the output file names for each sample (example: Sample_1.w11.parameters, the
                           parameters file for sample 1 when using window size 11).

 

 

[OPTIONS]:

-op      known operons in the reference genome (format: example.newop)

 

-opL    This option is available only when '-op' is used.

          Train the transition probabilities from known operons (T) or from contigs (F).

          Default is F. Use 'T' when there are enough known operons.

 

-pcc     cutoff of the correlation between interoperon and operon pair. Default is 0.3

 

-bin     number of bins used for calculation of PCC between interoperon and extended interoperon.

          Default is 4.

 

-c        coverage cutoff for the sufficiently covered genes. Default is 0.5

 

-top     percentage of the highly differentially expressed genes. Default is 0.1. The appropriate value may vary between species.

 

-e        This option is available only when '-op' is used.

          Train the emission parameters from known operons (T); otherwise use the emission

          parameters provided later (F). Default is F. Use 'T' when there are enough known operons.

 

-w      This option is available only when '-op' is used and option -e is 'T'.

          Size of the sliding window used to calculate the centroid value for each base.

          Default is 11

 

-rRNA   File containing tRNAs and rRNAs (format: example.trRNA). If you are only interested in mRNAs, use this

            option to remove reads mapped to tRNA/rRNA regions when the tRNAs/rRNAs are not in your

            reconstructed operon file or your original known operon file.

 

******
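
If you do not know the value to pass as <chrLength>, one option is to read it from the SAM header, where each @SQ line carries the reference name (SN) and its length (LN). The sketch below assumes that your SAM file was written with a header; chr_length_from_sam is a hypothetical helper name and is not part of the TruHmm package.

```shell
# Hypothetical helper (not part of TruHmm): print "name<TAB>length" for every
# @SQ header line of a SAM file. Assumes the SAM file contains a header.
chr_length_from_sam() {
    awk -F'\t' '
        /^@SQ/ {
            name = ""; len = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^SN:/) name = substr($i, 4)   # reference name
                if ($i ~ /^LN:/) len  = substr($i, 4)   # reference length in bp
            }
            print name "\t" len
        }
        !/^@/ { exit }   # stop at the first alignment record
    ' "$1"
}

# Example call (file name taken from the sample data):
# chr_length_from_sam Sample_1/Sample_1.sam
```

If nothing is printed, the SAM file has no @SQ header lines and the length has to be obtained from the genome sequence instead.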

 

* Running TruHmm

    1)  Before running TruHmm on your own mapped reads (in SAM format), you can download the sample data, Sample_1.sam.tar.gz and Sample_2.sam.tar.gz, and go through the whole procedure. Please extract the compressed sample data first using the following command:

         tar  -zxvf   Sample_1.sam.tar.gz

    2)  Make a directory for each sample and put each SAM file into its folder. You can name the folders Sample_1, Sample_2, etc. Put the executable file truHmm_training, together with the Perl scripts, example.gene, example.newop and example.trRNA, in the directory that contains the folders Sample_1, Sample_2, etc. When using your own data, rename each SAM file to Sample_1.sam, Sample_2.sam, etc., or to Sample_1.bowtie.sam or Sample_1.*.sam, where * can be any tag you want to attach to the SAM file of each sample:

        mkdir  Sample_1

        mv  Your_sam_file  Sample_1/Sample_1.sam
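
For several samples, the two commands above can be wrapped in a small helper. This is only a sketch: stage_sample is a hypothetical function name (not part of TruHmm), and it assumes that you want the plain Sample_ID.sam naming.

```shell
# Hypothetical helper (not part of TruHmm): move one SAM file to
# Sample_<ID>/Sample_<ID>.sam, creating the sample folder if needed.
stage_sample() {
    sam_file=$1
    id=$2
    # Refuse to continue if the input file does not exist.
    [ -f "$sam_file" ] || { echo "no such file: $sam_file" >&2; return 1; }
    mkdir -p "Sample_${id}"
    mv "$sam_file" "Sample_${id}/Sample_${id}.sam"
}

# Example calls (input file names are placeholders):
# stage_sample Your_first_sam_file  1
# stage_sample Your_second_sam_file 2
```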

    3) A) If you have enough known operons in the genome of interest and at least two samples, you can provide the operon file (with the -op option) when running truHmm_training. Run the compiled executable file using the command below as an example:

       ./truHmm_training  Sample_ID.sam  4639675  example.gene  Sample_ID  -op  example.newop  -opL  T  -e  T  -rRNA  example.trRNA

This step generates a Sample_ID.parameters file in each Sample_ID folder (ID = 1, 2, etc.), which is used in the operon reconstruction step. The training step also generates the negative testing sets based on adjacent operon pairs (NOP_neighbor) and on entire operon structures (NOP), which are used in the validation step. Note that you can also use the length of contigs (-opL F) to train the transition probabilities, but this requires much more CPU time.

       B) If you do not have enough known operons but have more than one sample, you can still use the training program to train the transition probabilities with the command:

      ./truHmm_training  Sample_ID.sam  4639675  example.gene  Sample_ID   -rRNA  example.trRNA

The output includes the Sample_ID.parameters file, but no negative testing set files are generated. Alternatively, you can use the default settings in the operon reconstruction step without any training.

      C) If you do not have enough known operons or have no more than two samples, you still need to run truHmm_training first to get the normalization factor in the Sample_1.parameters file, and then move on to the Viterbi decoding step (step 4).
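
Whichever case applies, the training command has to be run once per sample. The sketch below prints one command per Sample_* folder so that you can review the commands before running them; it reuses the arguments of example A) unchanged. print_training_commands is a hypothetical helper, not part of the package, and the file names example.gene, example.newop and example.trRNA are simply those used in the example above.

```shell
# Hypothetical helper (not part of TruHmm): print the case A) training command
# for every Sample_* folder in the current directory. Nothing is executed.
print_training_commands() {
    chr_len=$1
    for dir in Sample_*/; do
        [ -d "$dir" ] || continue      # no matching folders: print nothing
        sample=${dir%/}                # strip the trailing slash
        echo "./truHmm_training ${sample}.sam ${chr_len} example.gene ${sample} -op example.newop -opL T -e T -rRNA example.trRNA"
    done
}

# Example call, using the chromosome length from the examples above:
# print_training_commands 4639675
```

For case B), drop the -op, -opL and -e arguments from the printed lines.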

 

    4) Decode the hidden states for each sample using the Viterbi algorithm with leave-one-out cross validation, and reconstruct operons from the decoded adjacent operon pairs:

 

* Reconstruct operons command line:

******

USAGE:

******

reconstruct_operons.pl [options]:

 

-sample         Sample Name: e.g. Sample_1

-sam               SAM File Name for the sample: e.g. Sample_1.sam or Sample_1.bowtie.sam or Sample_1.bwa.sam

-g                   Gene File: the same gene file used when running truHmm_training.

-w                  Window Size: default is 11. It has to be an odd number.

-L                   Length of the chromosome in bp.

### NOTE: -sample, -sam, -g must be specified ###

example:

perl  reconstruct_operons.pl   -sample  Sample_1  -sam  Sample_1.sam  -g  example.gene  -L 4639675

This program outputs a file of the longest possible operons/suboperons for the sample, e.g. Sample_1.w11.predicted.newop.with.TSS. Each assembled multi-gene operon can contain several alternative Transcription Start Sites (TSSs) associated with certain internal genes, so all predicted TSSs for an operon are listed in the output file. The operons are sorted by their order on the chromosome.
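
Because -w has to be an odd number, it can be worth checking the value before starting a long run. A minimal sketch, assuming the window size is given as a plain positive integer; check_window is a hypothetical helper and not part of the package:

```shell
# Hypothetical helper (not part of TruHmm): succeed only if the argument is a
# non-empty string of digits that ends in an odd digit.
check_window() {
    case $1 in
        ''|*[!0-9]*)
            echo "window size must be a positive integer: $1" >&2
            return 1 ;;
        *[13579])
            return 0 ;;
        *)
            echo "window size must be odd: $1" >&2
            return 1 ;;
    esac
}

# Example: only build the command line if the window size is acceptable.
# check_window 11 && echo "perl reconstruct_operons.pl -sample Sample_1 -sam Sample_1.sam -g example.gene -w 11 -L 4639675"
```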

 

    5) Evaluate the accuracy of your predicted operons. Use this step only if you have a known operon set; otherwise, skip it.

******

USAGE:

******

validate_TruHmm_accuracy.pl [options]:

 

-sample         Sample Name: e.g. Sample_1

-op                 Experimentally verified operons in the reference genome, e.g. the file example.newop in the package

-g                   Gene File: the same gene file used when running truHmm_training.

-w                  Window Size: default is 11. It has to be an odd number.

### NOTE: -sample, -op, -g must be specified ###

example:  perl   validate_TruHmm_accuracy.pl  -sample  Sample_1  -op  example.newop  -g example.gene 

It will generate the accuracy files based on two evaluation metrics: gene pairs and entire operon structure.

 

 

    6) Reconstruct the small RNAs (antisense RNAs and non-coding RNAs). This step must be run after operon reconstruction. Please also consider including the predicted 'hidden' antisense/ncRNAs within each folder in your study.

 

******

USAGE:

******

reconstruct_antisenseRNAs.pl [options]:

-sample         Sample Name: e.g. Sample_1

-g                   Gene File: the same gene file used when running truHmm_training.

-w                  Window Size: default is 11. It has to be an odd number.

### NOTE: -sample, -g must be specified ###

example: perl  reconstruct_antisenseRNAs.pl   -sample  Sample_1   -g  example.gene

 

reconstruct_ncRNAs.pl [options]:

-sample         Sample Name: e.g. Sample_1

-g                   Gene File: the same gene file used when running truHmm_training.

-w                  Window Size: default is 11. It has to be an odd number.

### NOTE: -sample, -g must be specified ###

example: perl  reconstruct_ncRNAs.pl   -sample  Sample_1   -g  example.gene

 

Once the operons/suboperons in each sample are reconstructed, these two scripts can be used to predict the potential antisense and non-coding transcripts.
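
The order of steps 4) and 6) matters, since both small RNA scripts have to run after operon reconstruction. The sketch below prints the three commands for one sample in that order, using only the options documented above. print_downstream_commands is a hypothetical helper name, not part of the package.

```shell
# Hypothetical helper (not part of TruHmm): print, in the required order, the
# operon reconstruction command followed by the two small RNA commands.
# Nothing is executed; review the lines before running them.
print_downstream_commands() {
    sample=$1
    gene_file=$2
    chr_len=$3
    echo "perl reconstruct_operons.pl -sample ${sample} -sam ${sample}.sam -g ${gene_file} -L ${chr_len}"
    echo "perl reconstruct_antisenseRNAs.pl -sample ${sample} -g ${gene_file}"
    echo "perl reconstruct_ncRNAs.pl -sample ${sample} -g ${gene_file}"
}

# Example call with the values used throughout this page:
# print_downstream_commands Sample_1 example.gene 4639675
```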

Dr. Zhengchang Su Lab

 

Department of Bioinformatics and Genomics

 

UNC Charlotte

9201 University City Blvd,

Charlotte, NC 28223

 

 

 

 

 

 

Contact Information

 

Shan Li                        sli13@uncc.edu

 

Xia Dong                     xdong4@uncc.edu

 

Zhengchang Su           zcsu@uncc.edu

 

 

 

TRanscription Units assembly by a Hidden Markov Model
