Introduction to the Course
Searching for similarities between biological sequences is the principal means by which bioinformatics contributes to our understanding of biology. Of the various informatics tools developed to accomplish this task, the most widely used is BLAST, the basic local alignment search tool. This course discusses the principles, workings, applications and potential pitfalls of BLAST, focusing on the implementation developed at the National Center for Biotechnology Information.
Similarity searching, including sequence comparison, is one of the principal techniques used by computational biologists and has found widespread use among biologists in general. The most popular tool for this purpose is BLAST (basic local alignment search tool), which performs comparisons between pairs of sequences, searching for regions of local similarity. In the 11 years since its publication, the original paper describing BLAST has been cited over 12,000 times, and BLAST has become a fundamental tool of biology. It is therefore important to know how it works and what it accomplishes, how to use it properly and how to interpret someone else's published results. Today there are several implementations of the BLAST algorithm, with two that share a common ancestry - NCBI BLAST and WU-BLAST - enjoying the broadest use. NCBI BLAST is available from the National Center for Biotechnology Information (NCBI), while WU-BLAST is available from Washington University in St. Louis. This course discusses the principles, workings, applications and potential pitfalls of BLAST, focusing on the NCBI version. Further details can be found in several excellent resources, and additional BLAST-based programs are listed in the upcoming lectures.
BLAST is one of the more popular bioinformatics tools. Researchers use command-line applications to perform searches locally, often searching custom databases and performing searches in bulk, possibly distributing the searches on their own computer cluster. The current BLAST command-line applications became available to the public in late 1997. They are part of the NCBI C toolkit and are supported on a number of platforms that currently include Linux, various flavors of UNIX (including Mac OS X), and Microsoft Windows.
The initial BLAST applications from 1997 lacked many features that are presently taken for granted. Within three years of the initial public release, BLAST was modified to handle databases with more than 2 billion letters, to limit a search by a list of GenInfo Identifiers (GIs), and to simultaneously search multiple databases. PHI-BLAST, IMPALA, and composition-based statistics were also introduced within this time period, followed by MegaBLAST and the concept of query concatenation (whereby the database is scanned once for many queries). Chris Joerg of Compaq Computer Corporation suggested performance enhancements in 1999. A group at Apple, Inc. suggested other enhancements in 2002. These and other features were of great importance to BLAST users, but the continual addition of unforeseen modifications made the BLAST code fragile and difficult to maintain.
Many mammalian genomes contain a large fraction of interspersed repeats, with 38.5% of the mouse genome and 46% of the human genome reported as interspersed repeats. Traditionally, the only supported method available to mask interspersed repeats in stand-alone BLAST has been to execute a separate tool (e.g., RepeatMasker) on a query, produce a FASTA file with the masked region in lower-case letters, and have BLAST treat the lower-case letters as masked query sequence. This requires separate processing on each query before the BLAST search.
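To make the lower-case masking convention concrete, the following is a minimal Python sketch of how a program might recover the masked regions from a soft-masked FASTA query. It is an illustration only, not the BLAST implementation: the function names `read_fasta` and `lowercase_mask_intervals`, the example record, and the choice of 0-based half-open intervals are all assumptions made for this example.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def lowercase_mask_intervals(sequence):
    """Return 0-based, half-open (start, end) intervals of lower-case runs.

    Assumption for this sketch: lower-case letters mark masked residues,
    as in the output of a separate repeat-masking step.
    """
    intervals = []
    start = None
    for i, ch in enumerate(sequence):
        if ch.islower():
            if start is None:
                start = i  # a masked run begins here
        elif start is not None:
            intervals.append((start, i))  # the masked run ended at i
            start = None
    if start is not None:
        intervals.append((start, len(sequence)))
    return intervals


# Hypothetical soft-masked query, invented for illustration.
example = ">query1 soft-masked example\nACGTacgtacACGT\nTTggccAA\n"
for name, seq in read_fasta(example):
    print(name, lowercase_mask_intervals(seq))
```

For the example record, the two lower-case runs are reported as the intervals (4, 10) and (16, 20). A search program could then skip those positions when looking for initial matches while still using the full sequence in the final alignment.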
NCBI recently redesigned the BLAST web site to improve usability, which helped to identify issues that might also occur in the stand-alone BLAST command-line applications. These changes have, unfortunately, made it more difficult to match parameters used in a stand-alone search with default parameters on the NCBI web site.
The advent of complete genomes resulted in much longer query and subject sequences, leading to new challenges that the current framework cannot handle. At the same time, increases in generally available computer memory made other approaches to similarity searching viable. BLAT, for example, uses an index stored in memory. Cameron and collaborators designed a "cache-conscious" implementation of the initial word finding module of BLAST. The concerns listed in this section and the start of a new C++ toolkit at the NCBI motivated us to rewrite the BLAST code and release a completely new set of command-line applications. Here we report on the design of the new BLAST code, the resulting improvements, and a new set of BLAST command-line applications.
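The idea of an in-memory index for initial word finding, mentioned above, can be sketched in a few lines of Python. This is a deliberately simplified illustration under stated assumptions, not the NCBI code: it indexes only exact nucleotide words of the query, uses a toy word size of 4, and the names `build_word_index` and `find_word_hits` are invented for the example.

```python
from collections import defaultdict


def build_word_index(query, word_size):
    """Map every word of length word_size in the query to its start offsets."""
    index = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        index[query[i:i + word_size]].append(i)
    return index


def find_word_hits(query, subject, word_size=4):
    """Scan the subject once and report exact word matches.

    Returns a list of (query_offset, subject_offset) pairs. Hits with the
    same difference subject_offset - query_offset lie on one diagonal and
    are natural seeds for extension into a longer local alignment.
    """
    index = build_word_index(query, word_size)
    hits = []
    for j in range(len(subject) - word_size + 1):
        word = subject[j:j + word_size]
        # .get avoids inserting empty entries into the defaultdict.
        for i in index.get(word, ()):
            hits.append((i, j))
    return hits


# Toy sequences, invented for illustration.
hits = find_word_hits("ACGTTGCA", "TTACGTTGG", word_size=4)
print(hits)
```

With these toy sequences the hits are (0, 2), (1, 3) and (2, 4), all on the same diagonal. Because the index is a hash table held entirely in memory, each subject word costs one lookup. Memory-access behavior of this step is the kind of concern a "cache-conscious" design addresses.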