MySQL queries in UCSC databases

I recently wanted to retrieve all SNP’s from dbSNP overlapping a set of regions using UCSC’s MySQL interface. First we connect to the database:

library(dplyr)
my_db <- src_mysql("hg19",
                   host = "genome-mysql.cse.ucsc.edu", 
                   user = "genomep", 
                   password = "password")
snpDbMySQL <- tbl(my_db, "snp142")

And then we try to retrieve the SNP’s within a region:

# NOT RUN
my_snps <- snpDbMySQL %>%
  filter(chrom == "chr1") %>%
  filter(chromStart >= 4000000) %>%
  filter(chromEnd <= 4001000) %>%
  collect

Taking some 40 seconds this is surprisingly slow. (Notice that no data is actually fetched before the collect command, dplyr is designed to be as lazy as possible with fetching data, see dplyr’s database vignette.)

Let’s investigate a bit and look at how the table is indexed:

mysql> SHOW INDEX in hg19.snp142;
+--------+----------+--------------+-------------+
| Table  | Key_name | Seq_in_index | Column_name |
+--------+----------+--------------+-------------+
| snp142 | name     |            1 | name        |
| snp142 | chrom    |            1 | chrom       |
| snp142 | chrom    |            2 | bin         |
+--------+----------+--------------+-------------+
3 rows in set (0.00 sec)

The (edited) output of the SHOW INDEX MySQL command reveals that the table is indexed after name, which is useless for our purposes, chrom and bin, whatever that is.

It turns out that the binning scheme was described in the original paper on the genome browser.

Source: Kent, et. al. Genome Research 2002

The bin assigned to a feature is the smallest bin the contains the entire feature. So in the figure the feature A will be assigned to bin 1, feature B to bin 4 and feature C to bin 20. SNP’s will always fall into the lowest order of bins.

To calculate the bin(s) we need to consider, we can use a slightly modified version of Jim Kent’s C script and use Rcpp.

library(Rcpp)
sourceCpp("UCSCBinFromRange.cpp")

(binRange <- c(binFromRangeStandard(4000000,4000001), binFromRangeStandard(4001000,4001001)))
## [1] 615 615

We see that both the start and the end coordinate falls into bin 615. It is now much faster to query

my_snps <- snpDbMySQL %>%
  filter(chrom == "chr1") %>%
  filter(bin == 615) %>%
  filter(chromStart >= 4000000) %>%
  filter(chromEnd <= 4001000) %>%
  collect
my_snps %>% select(bin, chrom, chromStart, chromEnd)
## Source: local data frame [54 x 4]
## 
##      bin chrom chromStart chromEnd
##    (dbl) (chr)      (dbl)    (dbl)
## 1    615  chr1    4000015  4000016
## 2    615  chr1    4000040  4000041
## 3    615  chr1    4000080  4000081
## 4    615  chr1    4000090  4000091
## 5    615  chr1    4000091  4000092
## 6    615  chr1    4000105  4000106
## 7    615  chr1    4000110  4000111
## 8    615  chr1    4000111  4000112
## 9    615  chr1    4000112  4000113
## 10   615  chr1    4000136  4000137
## ..   ...   ...        ...      ...

And that is it.

blog comments powered by Disqus