Wednesday, November 20, 2013

Using Database Joins to Compare Results Sets

One of the most powerful tools you can learn to use in genomics research is a relational database system, such as MySQL.  These systems are fairly easy to setup and use, and provide users the ability to organize and manipulate data and statistical results with simple commands.  As a graduate student (during the height of GWAS), this single skill quickly turned me into an “expert”.  By storing the SNP lists for common GWAS platforms and some simple annotations from the UCSC and ENSEMBL databases, I could quickly provide lists of SNPs within a gene or collection of genes, or pull a list of SNPs that overlap two genotyping platforms.  We even developed database modules that allowed us to easily define LD blocks within a database query (called LD-Spline).

Once you learn the basics of defining tables and loading data, you can start to join tables together, matching them on a common field.  This is where the true power of a database system lies.  Suppose you have two sets of results from a PLINK analysis, one from a discovery dataset and another from a replication.  Rather than clumsily matching two sets of results within a spreadsheet application, a few simple queries within MySQL will tell you which SNPs are in common between the two sets, which were not found in the replication set, which SNPs were significant in the first set but not the second, etc.  

The concept that makes these operations work is the idea of a primary key.  A primary key is some field of a dataset that uniquely identifies each row of the table/dataset.  In the above example of PLINK results, a good primary key might be the RS number of the SNP.  You can also uniquely identify rows based on two columns, a concept known as a composite key – for example, the chromosome AND position of a SNP.  Establishing a primary key allows MySQL to keep data stored in a sorted order and allows the matching operations for table joins to be performed much faster. 

Having this sorted order from a primary key prevents MySQL from having to scan an entire table to find a specific value.  Much like the index of a book, a primary key lets MySQL find a value within a table very quickly.  If a table is small, having a primary key is not as critical; the computer can quickly scan the entire contents of the table for any query.  If the table is large, however, a full scan of the entire table could be a costly operation, and the number of table scans required increases when doing a join.  For example, if we join tables for our discovery and replication results sets, the database system will take the RS number for each entry from the discovery table and attempt to find a matching RS number in the replication table.  If the replication table has the RS number as a primary key, the database system can very quickly find this entry. There is a fantastic post on the various types of database joins here.

Let's start by creating our database tables.  A typical PLINK association output contains 12 columns (CHR, SNP, BP, A1, TEST, NMISS, OR, SE, L95, U95, STAT, P).  In these tables, we've established the SNP column as the primary key.  Recall that the primary key must uniquely identify each row of the table, so if there are multiple rows per SNP -- sometimes PLINK will report multiple TEST rows per SNP.  If this is the case, we may need to either establish a composite key using PRIMARY KEY (`snp`,`test`), or simply eliminate these rows from the data file using an AWK command.

CREATE TABLE `discovery` (
 `chr` varchar(1),
        `snp` varchar(32),
        `bp` int, 
        `a1` varchar(1),
        `test` varchar(3),
        `nmiss` int,
        `or` float,
        `se` float,
        `l95` float,
        `u95` float,
        `stat` float,
 `p` float,
 PRIMARY KEY (`snp`)
);

CREATE TABLE `replication` (
       `chr` varchar(1),
       `snp` varchar(32),
       `bp` int,
       `a1` varchar(1),
       `test` varchar(3),
       `nmiss` int,
       `or` float,
       `se` float,
       `l95` float,
       `u95` float,
       `stat` float,
       `p` float,
       PRIMARY KEY (`snp`)
);

Before loading our data into these tables, a little pre-processing is helpful.  To ensure that results are easy to read on the screen, PLINK developers used leading spaces in the column format for many PLINK outputs.  These make loading the results into a database difficult.  We can resolve this by running a simple SED command:
sed -r -e 's/\s+/\t/' -e 's/^\t//g' input-file.assoc.logistic > discovery.load
This will convert all spaces to tabs and will eliminate the leading spaces and write the results to a new file, discovery.load.  Now lets load this file into our table, and repeat the procedure for our replication file.

LOAD DATA LOCAL INFILE '{PathToFile}/discovery.load' INTO TABLE 
discovery FIELDS TERMINATED BY '\t' IGNORE 1 LINES;
Now we should have two MySQL database tables with the discovery and results sets loaded into them.  We can view their contents with a simple select statement.  Then, finally, we can join these two tables to easily compare the results from the discovery and replication analyses.


SELECT * FROM discovery INNER JOIN replication ON 
discovery.snp = replication.snp;
The syntax is simple: select a set of fields -- in this case all of them (represented by the *) -- from the first table (discovery), matching each row from this table to a row in the second table (replication) where the discovery SNP equals the replication SNP.  MySQL also supports a table alias which can make these queries a bit easier to write.  An alias is simply a label specified after a table name which can be used in the rest of the query in place of the full table name.  For example, in the query below, we use a for the discovery table and b for the replication table.

SELECT * FROM discovery a INNER JOIN replication b ON 
a.snp = b.snp;
With practice and additional data, join operations can be used to annotate results by gene or region, and to match these to results from other studies, such as the NHGRI GWAS catalog.


5 comments:

  1. Do you think Bedtool's does most of these operations without the need for a bulky database?

    Now even multiple bed files can be intersected etc...

    ReplyDelete
    Replies
    1. from my limited experience, mysql will be much powerful and convenient:) although bedtools is very handy in many cases.

      Delete
  2. Hi, I think there is a typo there for the second select command

    it should be:
    SELECT * FROM discovery a INNER JOIN replication b ON
    a.snp = b.snp;

    ReplyDelete
  3. Isn't that using MySQL a little bit overkill? SQLite is totally sufficient.

    ReplyDelete

Note: Only a member of this blog may post a comment.

Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.