Maintained by me
Most software I developed is available on my GitHub page.
Sequence alignment
- minimap2 maps short and long
genomic reads and long RNA-seq reads to a large reference database. It also
performs full-genome pairwise alignment and finds overlaps between long
reads.
- BWA aligns short sequence reads to
a reference genome. Minimap2 is recommended for long-read mapping.
De novo assembly
- miniasm is a very fast
overlap-based de novo assembler for noisy long reads.
- FermiKit is a de novo assembly-based variant calling pipeline for deep
Illumina resequencing data. It consists of the following components:
  - fermi2 finds overlaps and does assembly for short reads.
  - BFC is a standalone high-performance tool for correcting sequencing
    errors in Illumina sequencing data.
  - ropeBWT2 is a tool for constructing the FM-index for a collection of
    DNA sequences.
- fermi-lite is a standalone
C library as well as a command-line tool for assembling Illumina short reads
in regions from 100bp to 10 million bp in size. Fermi-lite is largely
a miniature of FermiKit.
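
For readers unfamiliar with the FM-index mentioned under ropeBWT2, the following is a minimal Python sketch of the idea. It is an illustration only, not ropeBWT2's algorithm: it indexes a single string rather than a collection, builds the Burrows-Wheeler transform by naively sorting rotations, and stores a full occurrence table. The function names (`bwt`, `build_fm_index`, `count_occurrences`) are hypothetical and chosen for this example.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via naive rotation sorting.

    '$' is appended as a terminator that sorts before A/C/G/T.
    """
    assert "$" not in text
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def build_fm_index(text: str):
    """Return (C, occ, n) for a single string (toy-sized inputs only)."""
    last = bwt(text)
    counts = {}
    for ch in last:
        counts[ch] = counts.get(ch, 0) + 1
    # C[c]: number of symbols in the text that are strictly smaller than c
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    # occ[c][i]: number of occurrences of c in last[:i]
    occ = {ch: [0] * (len(last) + 1) for ch in counts}
    for i, ch in enumerate(last):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ, len(last)


def count_occurrences(index, pattern: str) -> int:
    """Count exact matches of a non-empty pattern using backward search."""
    C, occ, n = index
    lo, hi = 0, n
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("GATTACATTA")
print(count_occurrences(index, "TTA"))  # 2
```

The sketch uses memory and time that grow far too quickly for real sequencing data; it only shows what kind of query an FM-index answers.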
Miscellaneous
- seqtk is a set of fast and
lightweight tools for processing sequences in the FASTA or FASTQ format.
- bioawk is an extension to Brian Kernighan's
original awk, adding support for several common biological data
formats, including optionally gzip'ed BED, GFF, SAM, VCF, FASTA/Q and
TAB-delimited formats with column names.
- htsbox is a fork of
samtools/htslib, adding a few missing features.
- lianti is a set of tools
for short reads sequenced from the LIANTI single-cell protocol. It is a fork
of htsbox.
- adna is a set of tools to
process ancient DNA reads, such as merging overlapping reads and
barcode-aware duplicate marking. It is a fork of lianti.
- tabtk is a set of tools for CSV
or TAB-delimited formats.
- BGT is a compact file format for
efficiently storing and querying whole-genome genotypes of tens to hundreds
of thousands of samples.
- dna-nn efficiently identifies
alpha satellite and hsat2,3 microsatellite sequences based on a deep learning
model.
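
As a rough illustration of the FASTA and FASTQ formats that seqtk and bioawk above operate on, here is a minimal Python reader. It is a sketch under stated assumptions, not code from either tool: `read_fastx` is a hypothetical name, FASTA sequences may span several lines, and each FASTQ record is assumed to occupy exactly four lines.

```python
import io
from typing import Iterator, Optional, TextIO, Tuple


def read_fastx(handle: TextIO) -> Iterator[Tuple[str, str, Optional[str]]]:
    """Yield (name, sequence, quality); quality is None for FASTA records.

    Assumption: FASTQ records are exactly four lines (header, sequence,
    '+' separator, quality). Multi-line FASTQ is not handled.
    """
    line = handle.readline()
    while line:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # FASTA: header, then sequence lines until the next header
            name = line[1:].split(" ", 1)[0]
            parts = []
            line = handle.readline()
            while line and not line.startswith((">", "@")):
                parts.append(line.strip())
                line = handle.readline()
            yield name, "".join(parts), None
        elif line.startswith("@"):
            # FASTQ: read the remaining three lines of the record
            name = line[1:].split(" ", 1)[0]
            seq = handle.readline().strip()
            handle.readline()  # '+' separator line, ignored
            qual = handle.readline().strip()
            yield name, seq, qual
            line = handle.readline()
        else:
            # Blank or unexpected line between records: skip it
            line = handle.readline()


example = ">r1 first read\nACGT\nTTGA\n@r2\nGGCC\n+\nIIII\n"
for name, seq, qual in read_fastx(io.StringIO(example)):
    print(name, seq, qual)
```

Returning `None` for the quality of FASTA records lets the same loop handle both formats, at the cost of callers having to check for it.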
Maintained by others
- SAMtools/htslib is a suite of programs
for interacting with high-throughput sequencing data. I created this project.
It is now maintained by a team at the Sanger Institute.
Discontinued
- MAQ is a first-generation
short-read mapper, now replaced by BWA.
- TreeSoft provides several
programs for reconstructing or manipulating phylogenetic trees. Some of its
components are still used by Ensembl-Compara.
- TreeFam is a database of phylogenetic
trees of animal genes. I am a key founder of this database. It is no
longer maintained.
- TraceUtils is a
program that calls heterozygotes from capillary sequencing trace files.
Contributed
- wtdbg2 is a novel de novo
assembler for long noisy reads.
Discontinued
- PigGIS (Pig Genomics Informatics
System) is a pig genome database based on some ancient data.
- SNAP (SNP Annotation Platform) integrates various SNP information from
several databases.
- CAT (Cross-species
Alignment Tool) is a capable cDNA-to-genome aligner.