NIEHS Exome Project

The National Institute of Environmental Health Sciences (NIEHS) Environmental Genome Project (EGP) explores the relationship between common genetic variations and environmentally induced disease in humans. To expand the set of reference polymorphism data available from NIEHS research, the NIEHS SNPs program at the University of Washington has initiated work to comprehensively scan all gene-coding regions (i.e., the exome) on the NIEHS reference panel of 95 individuals from diverse U.S. populations (i.e., EGP-Panel 2).


We request that any use of data obtained from the NIEHS Exome Variant Server be cited in publications.


NIEHS Environmental Genome Project, Seattle, WA (URL: [date (month, yr) accessed].

Acknowledgment for Publication

The authors would like to thank the NIEHS Environmental Genome Project for providing support for this project under contract No. HHSN273200800010C.

Public Data Release

The current data release includes complete exomes from 22 EGP Caucasian samples, 14 EGP African-American samples, 24 EGP Asian samples, 22 EGP Hispanic samples, and 13 EGP Yoruban samples (95 in total). The data are analyzed in reference to the UCSC hg19 human genome reference sequence. Reads were aligned with BWA (Burrows-Wheeler Aligner). The putative variants and genotypes were called by the GATK UnifiedGenotyper. Visual inspection of the read data is highly recommended to confirm a variation before choosing it as a potential research target.

Complete exome sequencing data for 95 EGP samples are available to download. Reads were aligned with BWA (Burrows-Wheeler Aligner) to the human reference sequence version hg19 (NCBI build 37). The VCF file contains SNPs and genotypes called by the GATK UnifiedGenotyper, which was run for all 95 samples in multi-sample mode. For a description of the VCF format see this link. Variant data lines in the VCF file without "PASS" in the FILTER column (e.g., GATK_filter, SnpCluster) are of questionable quality. Visual inspection of the read data is highly recommended to confirm a variation before choosing it as a potential research target.
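As a minimal sketch of the FILTER check described above, the following Python keeps only the data lines whose FILTER column is exactly "PASS". It assumes the standard tab-delimited VCF layout in which FILTER is the seventh column; the function name and the example records are made up for illustration and are not taken from the released file.

```python
def passing_variants(vcf_lines):
    """Yield the split fields of VCF data lines whose FILTER column is PASS.

    Assumes the standard tab-delimited VCF layout, where FILTER is the
    7th column (index 6). Header lines starting with '#' are skipped.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 6 and fields[6] == "PASS":
            yield fields


# Illustrative records only; they are not real calls from the data release.
example = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t1000\t.\tA\tG\t50\tPASS\t.\n",
    "1\t2000\t.\tC\tT\t12\tGATK_filter\t.\n",
    "1\t3000\t.\tG\tA\t30\tSnpCluster\t.\n",
]

kept = [(f[0], int(f[1])) for f in passing_variants(example)]
print(kept)
```

Filtering this way only removes calls that were flagged; it does not replace visual inspection of the reads.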

Each BAM file contains the read alignments for one exome and is more than 10 GB in size. Each also comes with a BAM index file with a .bai extension. Because of the large file sizes, the files must be downloaded through our high-speed Aspera file transfer server.

To download our publicly accessible data through our Aspera server, please use the following authentication information.

username: public-aspera2

password: pub123GS

GRC-Aspera Public Download

How to Use the Data Browser
The current release has been tested successfully with Firefox v.3.0 and IE v.7.0. To use this site, your browser must have cookies and JavaScript enabled.
The gene model is that of NCBI, 2010. Chromosome positions are those of NCBI build 37 (UCSC hg19).
Please follow the steps below to query SNP, INDEL and coverage data:
    1. select search type to query
    2. select data type
    3. select population(s)
    4. display results
    5. download results
1. Select Search Type
There are three ways to query variations:
A. gene name (HUGO, upper or lower case)
B. gene ID (from NCBI Entrez Gene)
C. chromosomal location
For A and B, you have the option to extend the chromosome region. The choice "upstream" is on the 5' end, and "downstream" is on the 3' end of the gene.
When a search by gene name or gene ID is made, there are sometimes alternative transcripts. A region large enough to cover all transcripts is chosen.
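The region logic described above can be sketched as follows. This is an illustration under stated assumptions, not the server's actual code: the function name, the 1-based inclusive coordinates, and the strand-aware handling (for a minus-strand gene the 5' end lies at the higher coordinate) are assumptions made for the example.

```python
def query_region(transcripts, strand, upstream=0, downstream=0):
    """Return (start, end) covering all transcripts plus optional flanks.

    transcripts: list of (start, end) pairs on one chromosome,
                 assumed 1-based and inclusive (hypothetical convention).
    strand:      "+" or "-"; upstream is the 5' side of the gene,
                 downstream is the 3' side.
    """
    # A region large enough to cover every alternative transcript.
    start = min(s for s, _ in transcripts)
    end = max(e for _, e in transcripts)

    if strand == "+":
        start -= upstream      # 5' end is at the lower coordinate
        end += downstream
    elif strand == "-":
        start -= downstream    # 3' end is at the lower coordinate
        end += upstream
    else:
        raise ValueError("strand must be '+' or '-'")

    # Do not run off the start of the chromosome.
    return max(start, 1), end


# Two made-up alternative transcripts of one gene.
isoforms = [(1000, 2000), (1200, 2500)]
print(query_region(isoforms, "+", upstream=100, downstream=50))
print(query_region(isoforms, "-", upstream=100, downstream=50))
```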
2. Select Data Type
SNP, INDEL, and coverage data are currently provided. Each data type is summarized under an individual tab.
3. Select Populations
Database queries give genotype search results in a table of data sets categorized by the population (YORUB, CEPH, AD, ASIAN, HISP) in which the variations were identified. From the top table select one or more data sets.
4. Display Results
Once the data sets are chosen, you have a choice of 3 buttons to click (they can be clicked consecutively without re-starting the search).

The first is "display genotypes", which lists the genotypes for all individuals and all variations in the data set. If "Table/Image" is chosen, a visual genotype graph shows color-coded genotypes. Note that the color code in this graph is relative to the allele in the human genome reference sequence, not to the common allele.

In the genotype graphic display, each color-coded square can be clicked to retrieve the read data that support the genotype call. After a square is clicked, a sequence alignment window (see below) will be displayed.


The bases in the "Alt.Seq." row can also be clicked to display all read alignments for each position (see below).


The second button "display snp summary" presents a large number of calculated values and annotations for the variations. The page "SNP Summary Columns" details the quantities displayed.

If "Text" has been chosen, some browsers can save the output as a text file. If your browser does not have a save-as-text option (e.g., Mac Safari), you will have to copy and paste. The fields will be space-delimited. If you import the saved file to Excel, it will be necessary to choose "Data/Get External Data/Import Text File" and select "Delimited" and "Space". If the output has columns that are comma-separated numbers, it will be necessary to force Excel to treat those columns as text.
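For readers who would rather process the saved text outside Excel, here is a minimal Python sketch that keeps every field as text, so columns of comma-separated numbers are not reinterpreted. It assumes the first non-empty line is a header row and that no field contains a space; the column names in the sample are hypothetical and do not reflect the actual output columns.

```python
def parse_space_delimited(text):
    """Parse space-delimited text into a list of dicts of strings.

    Assumes the first non-empty line is a header and that fields never
    contain spaces. Every value is kept as a string.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        rows.append(dict(zip(header, ln.split())))
    return rows


# Hypothetical column names and values, for illustration only.
sample = "pos ref alt depths\n1000 A G 12,15,9\n2000 C T 30,28,31\n"
rows = parse_space_delimited(sample)

# A comma-separated column stays intact as text and can be split on demand.
depths = [int(x) for x in rows[0]["depths"].split(",")]
print(rows[0]["depths"], depths)
```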

The third button leads to a chromosome-region map, where there is a help button. Here it is possible to view the assembled reads for a particular genome location.

In this browsing view, you should first see a presentation of the alignment between the "NCBI 2010 Gene Model" and our "exome target" (see below).


The green regions representing our exome targets can be clicked to retrieve the SNPs in a specific target, and from there the mapped reads, as described above for the graphic genotype display mode.

5. Download Results
SNP, genotype, and coverage data can be downloaded through the download options at the top of the display pages. SNP and genotype data can be downloaded in either text or VCF format. Both summary coverage data and detailed locus coverage data can be downloaded. The downloaded data are compressed in either gzip or zip format.
If you import the text-formatted file to Excel, it will be necessary to choose "Data/Get External Data/Import Text File" and select "Delimited" and "Space". Make sure to import all columns as text in Excel.
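Because a download may arrive as either gzip or zip, a small helper can read both. The sketch below uses only the Python standard library; the function name is hypothetical, and it assumes UTF-8 text and, for zip archives, that the first archive member is the data file.

```python
import gzip
import io
import zipfile


def read_download(path):
    """Return the text lines of a gzip- or zip-compressed download.

    Assumes UTF-8 text. For a zip archive, only the first member is read,
    which is an assumption about how the archive is laid out.
    """
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            first_member = zf.namelist()[0]
            with zf.open(first_member) as fh:
                text = io.TextIOWrapper(fh, encoding="utf-8").read()
        return text.splitlines()

    # Anything that is not a zip archive is treated as gzip.
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return fh.read().splitlines()
```

The returned lines can then be split on spaces for the text format, or on tabs for VCF.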

current release version: v.0.0.8.   (April 22, 2012)

Changes made in v.0.0.8,
   1) Browsing the exome targets through the chromosome map did not work in newer versions of the Firefox browser (after 3.5); this is now fixed.

version: v.0.0.7.   (Jan. 13, 2012)

Changes made in v.0.0.7,
   1) Set the color of homozygous reference allele genotypes to blue in the genotype graphic view.
   2) Incorporate INDEL alignments based on CIGAR strings in BAM files into the read trace views.
   3) Fix a bug in read alignment views and display SNPs and INDELs separately.

version: v.0.0.6.   (Sept. 30, 2011)

Changes made in v.0.0.6,
   1) The average sample sequence read depth can now be viewed graphically through the UCSC genome browser.

version: v.0.0.5.   (Sept. 21, 2011)

Changes made in v.0.0.5,
   1) release of all 95 NIEHS exomes.

version: v.0.0.4.   (Sept. 7, 2011)

Changes made in v.0.0.4,
   1) Add sorting ability for SNP summary table.

version: v.0.0.3.   (August 11, 2011)

Changes made in v.0.0.3,
   1) Add download options for variation output and coverage output.
   2) Fix a bug in the summary coverage output that omitted the very last block and reported the sequencing span of each block 1 bp too short.

Contact us:

Colleen Davis

Funded by the National Institute of Environmental Health Sciences.