Kankan Zhao (zhaokk@zju.edu.cn)
Up to now, there is no state-of-the-art pipeline of metagenomics analysis (especially for environmental samples) because of the huge number of microbial dark matters and a wide variety of emerging bioinformatic tools. Hence, this pipeline is a assemble-based method of paired-end metagenomic data. It assumes you have a basic knowledge of metagenomics and Linux commands. Also, it is important to read the references listed to have a better understanding of each parameters (and it is better to read more articles cited the tools you used), because sometimes there is a lack of consensus among different articles. I hope it works for you and please feel free to contact me with any questions or tell me the mistakes of this pipeline.
export PATH=$PATH:/public/home/bma/application/FastQC
fastqc -o 01.rawdata/ -t 28 01.rawdata/S1_raw_R1.fq.gz
FastQC Manual and Tutorial.
If the results do not reach the standard, you should do it repeatedly.
java -jar <path to trimmomatic.jar> PE [-threads <threads>] | <input 1> <input 2>] |<paired output 1> <unpaired output 1> <paired output 2> <unpaired output 2> | ILLUMINACLIP:<path to adapter-PE.fa>:2:30:10 | LEADING:3 TRAILING:3 | SLIDINGWINDOW:<windowSize>:<requiredQuality> | MINLEN:50
# ILLUMINACLIP: Cut adapter and other illumina-specific sequences from the read.
# SLIDINGWINDOW: Perform a sliding window trimming approach. It starts scanning at the 5‟ end and clips the read once the average quality within the window falls below a threshold.
# LEADING: Cut bases off the start of a read, if below a threshold quality.
# TRAILING: Cut bases off the end of a read, if below a threshold quality.
# CROP: Cut the read to a specified length by removing bases from the end.
# HEADCROP: Cut the specified number of bases from the start of the read.
# MINLEN: Drop the read if it is below a specified length.
Example:
java -jar /public/home/bma/application/Trimmomatic-0-2.39/trimmomatic-0.39.jar PE -threads 28 01.rawdata/S1_raw_R1.fq.gz 01.rawdata/S1_raw_R2.fq.gz 02.paired.fq.gz/S1_paired_R1.clean.fastq.gz 02.unpaired.fq.gz/S1_unpaired_R1.clean.fastq.gz 02.paired.fq.gz/S1_paired_R2.clean.fastq.gz 02.unpaired.fq.gz/S1_unpaired_R2.clean.fastq.gz ILLUMINACLIP:/public/home/bma/application/Trimmomatic-0-2.39/adapter-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:50
Bolger, A. M., Lohse, M., & Usadel, B. Trimmomatic: A flexible trimmer for Illumina Sequence Data. Bioinformatics, 2014, 30(15): 2114-2120.
export PATH=$PATH:/public/home/bma/application/FastQC
fastqc -o 02.paired.fq.gz/ -t 28 02.paired.fq.gz/S1_paired_R1.clean.fastq.gz
export PATH=$PATH:/public/home/bma/application/bowtie2-2.3.2
bowtie2-build 03.dehost/01.ref/host_ref_genome.fasta 03.dehost/01.ref/index
bowtie2 -x 03.dehost/01.ref/index -1 02.paired.fq.gz/S1_paired_R1.clean.fastq.gz -2 02.paired.fq.gz/S1_paired_R2.clean.fastq.gz -S 03.dehost/02.sam/S1.sam --threads 28 --un-conc-gz 03.dehost/03.cleandata/S1.dehost.fq.gz
# BWA is also a great tool.
# S1.dehost.fq.gz would be devided into S1.dehost.fq.1.gz and S1.dehost.fq.2.gz automatically. So you could rename them as S1.dehost_R1.fq.gz and S1.dehost_R2.fq.gz for further analysis.
#export PATH=$PATH:/share/home/bmalab/samtools-1.11/
#export PATH=$PATH:/share/home/bmalab/bwa-master/
#bwa index haishen.fasta
#bwa mem -t 36 00.host/haishen.fasta 01.rawdata/SRR8524814_paired_R1.clean.fastq.gz 01.rawdata/SRR8524814_paired_R2.clean.fastq.gz|samtools sort -O bam -@ 36 -o - > 02.cleandata/SRR8524814.bam
#samtools view -@ 36 -b -f 12 -F 256 02.cleandata/SRR8524814.bam > 02.cleandata/SRR8524814_unmap.bam
#samtools sort -n 02.cleandata/SRR8524814_unmap.bam -o 02.cleandata/SRR8524814_unmap.sorted.bam
#bedtools bamtofastq -i SRR8524814_unmap.sorted.bam -fq SRR8524814_R1.fq -fq2 SRR8524814_R2.fq
Langmead B, Salzberg S. Fast gapped-read alignment with Bowtie 2. Nature Methods. 2012, 9:357-359.
Li H. and Durbin R. Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, (2009), 25:1754-60.
After host reads removal, the reads would be unpaired again. You could remove host reads before quality control or repair them with repair.sh in BBMap or other tools like Samtools.
Before repairation, decompression is needed.
export PATH=$PATH:/public/home/bma/application/pigz-2.4/
unpigz -k -p 28 *.gz
# -k: keep the package files.
# -p: thread number.
export PATH=$PATH:/public/home/bma/application/bbmap/
repair.sh in=S1_paired_R1.clean.fq in2=S1_paired_R1.clean.fq out1=S1_R1.fq out2=S1_R2.fq
Then, it is better to compress the fastq files again.
export PATH=$PATH:/public/home/bma/application/pigz-2.4/
pigz -p 28 *.fq
Bushnell B, Rood J, Singer E. BBMerge–Accurate paired shotgun read merging via overlap. PloS one, 2017, 12(10): e0185056.
Li H, Handsaker B, Wysoker A, et al. The sequence alignment/map format and SAMtools. Bioinformatics, 2009, 25(16): 2078-2079.
If you have enough computational resources and time, you could choose metaSPAdes, believed to be the best metagenome assembler. Besides, to improve reads retrieval rate, it is better to assemble contigs by single sample, divided groups and all samples, respectively. Mash could be used to separated samples into groups based on microbial community similarities like Jin et al. (2018).
export PATH=$PATH:/public/home/bma/application/MEGAHIT-1.2.9-Linux-x86_64-static/bin/
megahit -t 28 -m 0.95 --min-contig-len 1500 --k-min 21 --k-step 10 -1 S1_R1_fq.gz -2 S1_R2_fq.gz -o 04.assemble/S1/
To avoid unnecessary problems, you could remove the characters after space of contain name.
./seqkit replace -p " .+" -i assemble_contigs.fa > contigs.fa
Li D, Liu C M, Luo R, et al. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics, 2015, 31(10): 1674-1676.
Xu J, Zhang Y, Zhang P, et al. The structure and function of the global citrus rhizosphere microbiome. Nature communications, 2018, 9(1): 1-10.
If you need the information of the relationship between contigs and genes, you should skip this step.
I apply a stratagy based on depth-of-coverage by dividing the summed depth per base by the length of the respective sequence, which is same as Woodcroft et al.(2018), Salazar et al. (2019) and Herold et al. (2020). Other methods like Kraken2, Kaiju and MetaPhlAn2 are also feasible. the comparision of their performances could be seen in Sun et al. (2021). However, they usually perform not well in complex environment samples with a large amount of microbial dark matters.
export PATH=$PATH:/share/home/bmalab/bwa-master/
bwa index contigs.fa
export PATH=$PATH:/share/home/bmalab/samtools-1.11/
bwa mem -t 28 contigs.fa S1_R1.fq S1_R2.fq|samtools sort -O bam -@ 28 -o - > S1_contig.bam
BamM is no longer being maintained. Instead try CoverM which is easier to both install and use, and is faster. Parameters is same as Woodcroft et al. (2018), which is much stricter than Salazar et al. (2019) and Herold et al. (2020). The relative abundance of each contig in each sample could be calculated as its coverage divided by the total coverage of all contigs. Other coverage measure methods could also be applied.
export PATH=$PATH:/public/home/bma/application/coverm-x86_64-unknown-linux-musl-0.2.0-alpha7/
coverm filter --min-read-percent-identity 0.95 --min-read-aligned-percent 0.75 -b S1_contig.bam -o S1_contig_filter.bam -t 28
coverm contig --trim-max 90 --trim-min 10 --bam-files S1_contig_filter.bam > S1_contig_coverage.csv -t 28
Sun Z, Huang S, Zhang M, et al. Challenges in benchmarking metagenomic profilers. Nature methods, 2021, 18(6): 618-626.
Woodcroft B J, Singleton C M, Boyd J A, et al. Genome-centric view of carbon processing in thawing permafrost. Nature, 2018, 560(7716): 49-54.
Salazar G, Paoli L, Alberti A, et al. Gene expression changes and community turnover differentially shape the global ocean metatranscriptome. Cell, 2019, 179(5): 1068-1083. e21.
Herold M, Arbas S M, Narayanasamy S, et al. Integration of time-series meta-omics data reveals how microbial ecosystems respond to disturbance. Nature communications, 2020, 11(1): 1-14.
Here, you could choose the raw contigs or non-redundant contigs to predict ORFs. If you need the information of the source contigs of ORFs, you should choose the raw contigs. To improve the efficiency, you could split the contig file into several parts and combine the results after prediction.
export PATH=$PATH:/public/home/bma/application/Prodigal/
prodigal -i contigs.fa -a amino.faa -d nucl.fnn -o genes.txt -p meta
# faa file contains amino acids
# fnn file contains nucleotides
To avoid unnecessary problems, you could remove the characters after space of gene name.
./seqkit replace -p " .+" -i nucl.fnn > nucl_rename.fnn
./seqkit replace -p " .+" -i amino.faa > amino_rename.faa
Hyatt D, Chen G L, LoCascio P F, et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC bioinformatics, 2010, 11(1): 1-11.
If you use the non-redundant contigs before and need the information of the source contigs of ORFs, you could skip this step.
CD-HIT is also a great tool, but much slower than MMseqs2. In MMseqs2, --cluster-mode 1 or 2 is similar to greedy incremental clustering strategy to cluster sequences in CD-HIT. Here, I choose parameters --min-seq-id 0.95 -c 0.9 --cluster-mode 2 --cov-mode 1, which is same as Salazar et al. (2019).
export PATH=$PATH:/share/home/bmalab/mmseqs/bin/
mmseqs easy-linclust amino_rename.faa clusterRes tmp --min-seq-id 0.95 -c 0.9 --threads 36 --cluster-mode 2 --cov-mode 1
# clusterRes_rep_seq.fasta is representative amino acids
Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature biotechnology, 2017, 35(11): 1026-1028.
Salazar G, Paoli L, Alberti A, et al. Gene expression changes and community turnover differentially shape the global ocean metatranscriptome. Cell, 2019, 179(5): 1068-1083. e21.
If you already have target gene databases, you could mapping the reads to databases directly by BLAST-like tools (such as DIAMOND, MMseqs2 and USEARCH11) or HMMER. Here is a comparasion of BLAST-like tools. Briefly, MMseqs2 had the lowest error rates among the programs tested and DIAMOND (ultra-sensitive mode) offered the best balance between speed and quality. Then you could filter the mapping reads by some parameters like e-value, identity and so on. The remaining reads could be used to calculate gene abundance by RPKM, TPM or other methods.
I extract representative genes’ name and nucleotide sequences, because I apply a stratagy based on depth-of-coverage by dividing the summed depth per base by the length of the respective sequence, which is same as Salazar et al. (2019) and Herold et al. (2020).
To generate gene index, extract representative nucleotide sequence first.
./seqkit seq -n clusterRes_rep_seq.fasta > gene_rep.txt
sed -i 's/ //g' gene_rep.txt
./seqkit grep -f gene_rep.txt nucl_rename.fnn > gene_rep.fa
export PATH=$PATH:/share/home/bmalab/bwa-master/
bwa index gene_rep.fa
export PATH=$PATH:/share/home/bmalab/samtools-1.11/
bwa mem -t 28 gene_rep.fa S1_R1.fq S1_R2.fq|samtools sort -O bam -@ 28 -o - > S1_gene.bam
Parameters is same as Woodcroft et al. (2018), which is much stricter than Salazar et al. (2019) and Herold et al. (2020). The relative abundance of each gene in each sample could be calculated as its coverage divided by the total coverage of all genes. Other coverage measure methods could also be applied.
export PATH=$PATH:/public/home/bma/application/coverm-x86_64-unknown-linux-musl-0.2.0-alpha7/
coverm filter --min-read-percent-identity 0.95 --min-read-aligned-percent 0.75 -b S1_gene.bam -o S1_gene_filter.bam -t 28
coverm contig --trim-max 90 --trim-min 10 --bam-files S1_gene_filter.bam > S1_gene_coverage.csv
Woodcroft B J, Singleton C M, Boyd J A, et al. Genome-centric view of carbon processing in thawing permafrost. Nature, 2018, 560(7716): 49-54.
Salazar G, Paoli L, Alberti A, et al. Gene expression changes and community turnover differentially shape the global ocean metatranscriptome. Cell, 2019, 179(5): 1068-1083. e21.
Herold M, Arbas S M, Narayanasamy S, et al. Integration of time-series meta-omics data reveals how microbial ecosystems respond to disturbance. Nature communications, 2020, 11(1): 1-14.
Hernández-Salmerón J E, Moreno-Hagelsieb G. Progress in quickly finding orthologs as reciprocal best hits: comparing blast, last, diamond and MMseqs2. BMC genomics, 2020, 21(1): 1-9.