There are increasingly many biotechnology protocols which involve hybridizing small fragments of DNA to a large pool - methods that are sensitive to cross-hybridization. (PCR is less sensitive to this because it requires a pair of matches within a reasonable space and correct orientation, and bad matches can get weeded out by the exponential growth involved.)
So how do we check for a close match? Well, usually people run BLAST. But BLAST has been designed to look for evolutionarily close matches to a sequence -- not matches close in free energy of hybridization. It's the wrong tool for the job. It penalizes mismatches too heavily, and it treats A-T and C-G bonding as equivalent.
But I didn't want to rewrite BLAST. It'd take me months or years. Well, we can hack BLAST to fit our needs.
BLAST, and other sequence alignment algorithms, uses a penalty matrix to determine the optimum alignment of any two sequences - nucleotide or amino acid. It gives you a score to add for matches and mismatches. Why aren't all mismatches equal? Because some amino acid changes are closer in "evolutionary space" than others because, in the genetic code, they share a more similar DNA triplet.
Thus, comparing two amino acid strings:
DHRC DQRK |
| You get (D-D match = +4, H-Q mismatch = +3, R-R match = +2, C-K mismatch = -5) a total of +4. The H-Q mismatch is actually considered positive because the H codons (CAU,CAC) are very similar to the Q codons (CAA,CAG). In contrast the C codons (UGU,UGC) are pretty different from the K codons (AAA,AAG). |
Thus, BLAST optimizes for matches on this basis. What we want is a scoring matrix for nucleotides, with less penalties for mismatches and a higher score for C & G matches than A & T ones. Here's one I roughly inferred from the dinucleotide free energies of hybridization at 37C (in other words I pulled it out of my ass):
A N C G T * A 5 -6 -10 -11 -11 -11 N -6 -6 -6 -6 -6 -11 C -10 -6 10 -10 -11 -11 G -11 -6 -10 10 -11 -11 T -11 -6 -11 -11 5 -11 * -11 -11 -11 -11 -11 1
Roughly speaking, a C-G bond is counted twice as much as an A-T bond, and any mismatch penalizes as much as a C-G bond adds.
Unfortunately BLAST doesn't let you use a matrix for nucleotides. Only proteins. :(
So we cheat. Here's what you do:
1) format your nucleotide database as "protein", not "nucleotide", with formatdb.
2) make a custom matrix with your own name, say, HYBBLASTMAT, and place it into the data/ directory alongside the normal matrices: BLOSUM45, BLOSUM62, PAM70, etc.
3) run blastall with blastp and the flag "-M" and your matrix name, ie:
blastall -p blastp -d testseq.fa -i test -o test.out -M HYBBLASTMAT
(4) Hah hah hah, if life were only so simple. You'll need to turn off simple-sequence filters, because the nucleotide looks like a degenerate peptide to BLAST. Secret combo:
-F F
(5) It's like molasses in January! I wish we could increase the wordsize, but we can't, because they're stupidly setting a max at 3 and we don't want to edit the source. One thing you can do is limit the distance between paired matches such that it's effectively looking for single words of 6 and building from those. This doesn't help much, maybe doubles the speed. Secret combo:
-A 4
(6) What e score to use? Hell if I know. Play around with it. I'm currently using e=1. Don't imagine that score is meaningful anymore, since blast thinks you're playing with proteins.
So you should end up with a command line that looks like this:
blastall -p blastp -d testseq.fa -i test -o test.out -M HYBBLASTMATRIX -F F -A 4 -e 1
Try it out. I've found some surprisingly close hybridization targets that normal blast was missing.