# MySQL queries in UCSC databases

I recently wanted to retrieve all SNP’s from dbSNP overlapping a set of regions using UCSC’s MySQL interface. First we connect to the database:

And then we try to retrieve the SNP’s within a region:

Taking some 40 seconds this is surprisingly slow. (Notice that no data is actually fetched before the collect command, dplyr is designed to be as lazy as possible with fetching data, see dplyr’s database vignette.)

Let’s investigate a bit and look at how the table is indexed:

mysql> SHOW INDEX in hg19.snp142;
+--------+----------+--------------+-------------+
| Table  | Key_name | Seq_in_index | Column_name |
+--------+----------+--------------+-------------+
| snp142 | name     |            1 | name        |
| snp142 | chrom    |            1 | chrom       |
| snp142 | chrom    |            2 | bin         |
+--------+----------+--------------+-------------+
3 rows in set (0.00 sec)


The (edited) output of the SHOW INDEX MySQL command reveals that the table is indexed after name, which is useless for our purposes, chrom and bin, whatever that is.

It turns out that the binning scheme was described in the original paper on the genome browser.

The bin assigned to a feature is the smallest bin the contains the entire feature. So in the figure the feature A will be assigned to bin 1, feature B to bin 4 and feature C to bin 20. SNP’s will always fall into the lowest order of bins.

To calculate the bin(s) we need to consider, we can use a slightly modified version of Jim Kent’s C script and use Rcpp.

We see that both the start and the end coordinate falls into bin 615. It is now much faster to query

And that is it.