I recently wanted to retrieve all SNP’s from dbSNP overlapping a set of regions using UCSC’s MySQL interface. This post shows you how this can be done and how to increase the speed of queries by using UCSC’s binning scheme.
First things first - we connect to the database:
library(dplyr)
my_db <- src_mysql("hg19",
host = "genome-mysql.cse.ucsc.edu",
user = "genomep",
password = "password")
snpDbMySQL <- tbl(my_db, "snp142")
Then we try to retrieve all the SNP’s within a short region of 1kb:
# NOT RUN
my_snps <- snpDbMySQL %>%
filter(chrom == "chr1") %>%
filter(chromStart >= 4000000) %>%
filter(chromEnd <= 4001000) %>%
collect
Taking some 40 seconds this is surprisingly slow. (Notice that no data is actually fetched before the collect
command, dplyr
is designed to be as lazy as possible with fetching data, see dplyr
’s database vignette.)
Let’s investigate a bit and look at how the table is indexed:
mysql> SHOW INDEX in hg19.snp142;
+--------+----------+--------------+-------------+
| Table | Key_name | Seq_in_index | Column_name |
+--------+----------+--------------+-------------+
| snp142 | name | 1 | name |
| snp142 | chrom | 1 | chrom |
| snp142 | chrom | 2 | bin |
+--------+----------+--------------+-------------+
3 rows in set (0.00 sec)
The output of the SHOW INDEX
MySQL command reveals that the table is indexed after name
, chrom
and something called bin
.
It turns out that the binning scheme was described in the original paper on the genome browser.
The bin assigned to a feature is the smallest bin the contains the entire feature. So in the figure the feature A will be assigned to bin 1, feature B to bin 4 and feature C to bin 20. SNP’s will always fall into the lowest order of bins.
To calculate the bin(s) we need to consider, we can use a slightly modified version of Jim Kent’s C script and use Rcpp
.
#include <Rcpp.h>
using namespace Rcpp;
int binOffsets[] = {512+64+8+1, 64+8+1, 8+1, 1, 0};
// [[Rcpp::export]]
int binFromRangeStandard(int start, int end) {
int startBin = start, endBin = end-1, i;
startBin >>= 17;
endBin >>= 17;
for (i=0; i<5; ++i)
{
if (startBin == endBin)
return binOffsets[i] + startBin;
startBin >>= 3;
endBin >>= 3;
}
return 0;
}
library(Rcpp)
sourceCpp("UCSCBinFromRange.cpp")
(binRange <- c(binFromRangeStandard(4000000,4000001), binFromRangeStandard(4001000,4001001)))
## [1] 615 615
We see that both the start and the end coordinate falls into bin 615
.
It is now much faster to query for the SNP’s within the region
my_snps <- snpDbMySQL %>%
filter(chrom == "chr1") %>%
filter(bin == 615) %>%
filter(chromStart >= 4000000) %>%
filter(chromEnd <= 4001000) %>%
collect
my_snps %>% select(bin, chrom, chromStart, chromEnd)
## Source: local data frame [54 x 4]
##
## bin chrom chromStart chromEnd
## (dbl) (chr) (dbl) (dbl)
## 1 615 chr1 4000015 4000016
## 2 615 chr1 4000040 4000041
## 3 615 chr1 4000080 4000081
## 4 615 chr1 4000090 4000091
## 5 615 chr1 4000091 4000092
## 6 615 chr1 4000105 4000106
## 7 615 chr1 4000110 4000111
## 8 615 chr1 4000111 4000112
## 9 615 chr1 4000112 4000113
## 10 615 chr1 4000136 4000137
## .. ... ... ... ...
And that is it.