NHLBI Exome Sequencing Project (ESP)

Exome Variant Server
Column Description for Variant Summary Table

Variant Pos:

The SNV location on the chromosome (NCBI 37 or hg19) is 1-based. The INDEL location is also 1-based, but it is reported as one base before the actual insertion/deletion event.

rs ID:

dbSNP reference SNP identifier (if available)

Alleles:

The alleles are listed in HGVS variant notation; for the ESP project, this always refers to a change from a reference allele to an alternate allele. For INDELs, the alleles are listed with aliases such as A1, A2, or An, referring to the N-th alternate allele, while R refers to the reference allele.

EA Allele Count

The observed allele counts for the listed alleles in the European American population (delimited by /). For INDELs, the alleles are listed with aliases such as A1, A2, or An, referring to the N-th alternate allele, while R refers to the reference allele.

AA Allele Count

The observed allele counts for the listed alleles in the African American population (delimited by /). For INDELs, the alleles are listed with aliases such as A1, A2, or An, referring to the N-th alternate allele, while R refers to the reference allele.

Allele Count

The observed allele counts for the listed alleles in all populations (delimited by /). For INDELs, the alleles are listed with aliases such as A1, A2, or An, referring to the N-th alternate allele, while R refers to the reference allele.

MAF (%) (EA/AA/All):

The minor-allele frequency in percent, listed in the order European American (EA), African American (AA), and all populations (All) (delimited by /). For multi-allelic variants, the MAF is defined as the allele frequency in percent for all the minor alleles combined.
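
As a minimal sketch of the multi-allelic definition above, the following Python computes the MAF in percent from an allele-count string. The `ALLELE=COUNT` token format (for example `T=3/C=8597`) and the function name `maf_percent` are assumptions made for illustration; this page only states that counts are delimited by `/`. The sketch also assumes that every allele other than the most frequent one counts as a minor allele.

```python
def maf_percent(allele_counts: str) -> float:
    """Return the combined frequency (in percent) of all minor alleles.

    Assumes a string such as "T=3/C=8597". The ALLELE=COUNT token format
    is an assumption, not something this page specifies.
    """
    counts = []
    for token in allele_counts.split("/"):
        # Keep only the count; the allele label is not needed for the MAF.
        _allele, _sep, count = token.partition("=")
        counts.append(int(count))

    total = sum(counts)
    if total == 0:
        raise ValueError("no observed alleles")

    # Assumption: everything except the most frequent allele is "minor".
    return 100.0 * (total - max(counts)) / total
```

For a bi-allelic site this reduces to the usual minor-allele frequency; for an INDEL with aliases, a string such as `A1=10/A2=30/R=60` would sum the A1 and A2 counts.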

EA Genotype Count

The observed genotype counts for the listed genotypes in the European American population (delimited by /). For INDELs, the alleles are listed with aliases such as A1, A2, or An, referring to the N-th alternate allele, while R refers to the reference allele.

AA Genotype Count

The observed genotype counts for the listed genotypes in the African American population (delimited by /). For INDELs, the alleles are listed with aliases such as A1, A2, or An, referring to the N-th alternate allele, while R refers to the reference allele.

Genotype Count

The observed genotype counts for the listed genotypes in all populations (delimited by /). For INDELs, the alleles are listed with aliases such as A1, A2, or An, referring to the N-th alternate allele, while R refers to the reference allele.

Avg. Sample Read Depth:

the average sample read depth

Genes:

one or more genes for which the SNP is in the coding region based on NCBI Gene.

mRNA Accession #:

NCBI mRNA transcript accession number

GVS Function:

the GVS functions are calculated locally and stored in our local database; they are based on the alleles for all populations and individuals; the bases in the coding region are divided into codons (if the number of bases is a multiple of 3), and the resulting amino acids are examined:
  • intergenic: between genes
  • intron: in an intron region
  • near-gene-3: near the 3' end of a gene
  • near-gene-5: near the 5' end of a gene
  • utr-3: in a 3'-utr region
  • utr-5: in a 5'-utr region
  • coding-notMod3: in a coding region leading to an ambiguous assignment of synonymous versus nonsynonymous
  • coding-synonymous: leading to no amino acid change
  • splice-3: in the 3' end of a splice site
  • splice-5: in the 5' end of a splice site
  • missense: leading to an amino acid change
  • stop-lost: leading to a loss of a stop codon
  • stop-gained: leading to a gain of a stop codon
  • coding (INDEL): in a coding region with the number of base changes a multiple of 3
  • codingComplex (INDEL): indel spanning through more than one exon involving a coding region
  • frameshift (INDEL): in a coding region with the number of base changes not a multiple of 3
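
The coding categories in the list above can be sketched as two small decision functions. This is an illustration only, not the server's GVS annotation code: the function names, their inputs, and the use of the standard genetic code are assumptions, and the non-coding categories (intergenic, intron, UTR, splice, near-gene) are not covered because they depend on gene models that are not described here.

```python
from itertools import product

# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def classify_coding_snv(ref_codon: str, alt_codon: str) -> str:
    """Classify a coding SNV by comparing the translated codons."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "coding-synonymous"   # no amino acid change
    if alt_aa == "*":
        return "stop-gained"         # variant creates a stop codon
    if ref_aa == "*":
        return "stop-lost"           # variant removes a stop codon
    return "missense"                # amino acid change


def classify_coding_indel(bases_changed: int, exons_spanned: int = 1) -> str:
    """Classify a coding INDEL by its size and the number of exons it spans."""
    if bases_changed <= 0:
        raise ValueError("bases_changed must be positive")
    if exons_spanned > 1:
        return "codingComplex"       # spans more than one exon
    if bases_changed % 3 == 0:
        return "coding"              # in-frame: multiple of 3
    return "frameshift"              # not a multiple of 3
```

For example, `classify_coding_snv("TGG", "TGA")` compares tryptophan with a stop codon, and `classify_coding_indel(4)` describes a 4-base insertion or deletion within a single exon.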

cDNA Change:

Variant represented in the HGVS notation at the coding DNA level for a transcript.

cDNA Size:

The size of the coding DNA for a transcript.

Protein Change:

The protein change, represented in HGVS notation, is translated based on the specific transcript listed in the "mRNA Accession #" column.

NCBI 37 Allele:

The allele of the NCBI human reference sequence (also hg19).

Chimp Allele:

Chimp alleles are acquired from UCSC human/chimp alignment files. If the variation does not fall within an alignment block, or if it is an indel, the chimp allele is listed as "unknown". If the variation falls within a gap in the alignment, it is listed as "-".

Conservation (phastCons):

a number between 0 and 1 that describes the degree of sequence conservation among 17 vertebrate species; these numbers are downloaded from the UCSC Genome site and are defined as the "posterior probability that the corresponding alignment column was generated by the conserved state of the phylo-HMM, given the model parameters and the multiple alignment" (see UCSC description).

Conservation (GERP):

The Genomic Evolutionary Rate Profiling (GERP) score was obtained from the GERP website in September of 2011. It ranges from -12.3 to 6.17, with 6.17 being the most conserved. The detailed description can be found in this publication.

Grantham Score

Grantham scores categorize codon replacements into classes of increasing chemical dissimilarity, based on the publication by Grantham R. in 1974, Amino acid difference formula to help explain protein evolution. Science 1974 185:862-864.

PolyPhen2 (Class:Score):

Prediction of the possible impact of an amino acid substitution on protein structure and function, based on the Polymorphism Phenotyping (PolyPhen2) program. It lists both the PolyPhen2 prediction class and the PolyPhen2 score, separated by a ":".
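
A minimal sketch of splitting this column into its two parts is shown below. The function name `parse_polyphen2`, the example values, and the choice to return `None` for entries without a numeric score are assumptions; only the `Class:Score` layout comes from the description above.

```python
from typing import Optional, Tuple


def parse_polyphen2(field: str) -> Optional[Tuple[str, float]]:
    """Split a "Class:Score" value into (class, score).

    Returns None when the value has no ":" or the score is not numeric
    (how missing predictions are written is an assumption here).
    """
    # Split on the last ":" so the class name is kept intact.
    cls, sep, score = field.strip().rpartition(":")
    if not sep:
        return None
    try:
        return cls, float(score)
    except ValueError:
        return None
```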

Clinical Link:

The potential clinical implications associated with a SNP (limited).

On Exome Chip:

Whether a SNP is on the Illumina HumanExome chip.

Filter Status:

A machine-learning technique called support vector machine (SVM) classification was applied for SNP variant filtering. After the initial SNP calls were generated, we re-examined the BAM files to collect additional information about each variant site. Based on this information, variants were initially filtered by individual thresholds. For example, variants were marked as problematic SNPs if they had posterior probability <99% (glfMultiples SNP quality <20), were <5bp away from an indel detected in the 1000 Genomes Pilot Project, had total depth across samples of <5,379 or >5,379,000 reads (~1-1000 reads per sample), had >65% of reads in heterozygotes carrying the variant allele, or had an absolute squared correlation between allele (variant or reference) and strand (forward or reverse) of >0.15. Sites that failed 3 or more criteria were used as negative examples to train the SVM classifier. HapMap3 and OMNI polymorphic sites were used as positive examples. The SVM classifier produces a score for each site, and we marked ~8.5% of sites at threshold 0.3 as SVM filter-failed. The unfiltered set had Ti/Tv = 2.63, and the filtered set had Ti/Tv = 2.78.
  • SVM: Failed SVM-based filtering at threshold 0.3.
  • INDEL5: Nearby 1000 Genomes Pilot Indels within 5bp
The INDELs were filtered with the GATK VQSR model.
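
The initial per-site threshold step for SNPs can be sketched as follows. This is an illustration of the criteria listed in the paragraph above, not the pipeline's actual code: the `SiteStats` record, its field names, and the criterion labels are hypothetical, and the SVM training and scoring steps are not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SiteStats:
    """Per-site summary statistics (all field names are hypothetical)."""
    snp_quality: float            # glfMultiples SNP quality
    dist_to_pilot_indel_bp: int   # distance to nearest 1000 Genomes Pilot indel
    total_depth: int              # read depth summed across samples
    het_alt_read_fraction: float  # fraction of heterozygote reads with variant allele
    allele_strand_corr_sq: float  # squared correlation between allele and strand


def failed_criteria(site: SiteStats) -> List[str]:
    """Return the labels of the individual thresholds that a site fails."""
    failed = []
    if site.snp_quality < 20:                       # posterior probability <99%
        failed.append("low_quality")
    if site.dist_to_pilot_indel_bp < 5:             # <5bp from a pilot indel
        failed.append("near_indel")
    if site.total_depth < 5379 or site.total_depth > 5379000:
        failed.append("depth")
    if site.het_alt_read_fraction > 0.65:           # >65% variant reads in hets
        failed.append("allele_balance")
    if abs(site.allele_strand_corr_sq) > 0.15:      # strand correlation >0.15
        failed.append("strand_bias")
    return failed


def is_negative_training_example(site: SiteStats) -> bool:
    """Sites failing 3 or more criteria serve as negative SVM examples."""
    return len(failed_criteria(site)) >= 3
```

A site with SNP quality 10, 2bp from a pilot indel, and total depth 100 fails three criteria and would be used as a negative example under this sketch.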

GWAS Hits:

Links to known PubMed records of GWAS studies associated with a SNP, based on the NHGRI gwascatalog.txt.

EA Est. Age (kyrs) and AA Est. Age (kyrs):

The estimated variant age in the European-American and the African-American populations, in kilo-years, from the study published in Nature 493: 216-220, 2013 by Fu W, O'Connor TD, Jun G, Kang HM, Abecasis G, et al. Analysis of 6,515 exomes reveals a very recent origin of most human protein-coding variants.

GRCh38 Position:

The GRCh38 chromosomal position, lifted over from the variant's original GRCh37 chromosomal position.

Description of Sequence Coverage

    Several important variables are used for identifying sequence variants from second-generation sequencing data. Most importantly, a minimum read depth (coverage) is needed at each genomic position for a given individual. For one individual, the criterion for a position to be considered covered is that the read depth must be 8 or higher. For a read (of about 76 bp) to be counted, it must have a mapping quality of 20 or higher. For a single read base-call to be counted, the sequence base quality must be 20 or higher.
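
The three thresholds above can be sketched for a single individual at a single position. The input layout, a list of `(mapping_quality, base_quality)` pairs for the reads overlapping the position, and the function names are assumptions made for illustration.

```python
from typing import Iterable, Tuple

MIN_READ_DEPTH = 8        # depth needed for a position to count as covered
MIN_MAPPING_QUALITY = 20  # per-read threshold
MIN_BASE_QUALITY = 20     # per-base-call threshold


def counted_depth(reads: Iterable[Tuple[int, int]]) -> int:
    """Count reads passing both the mapping and base quality thresholds.

    Each read is given as (mapping_quality, base_quality_at_position).
    """
    return sum(
        1
        for mapping_quality, base_quality in reads
        if mapping_quality >= MIN_MAPPING_QUALITY
        and base_quality >= MIN_BASE_QUALITY
    )


def is_covered(reads: Iterable[Tuple[int, int]]) -> bool:
    """A position is covered when the counted depth is 8 or higher."""
    return counted_depth(reads) >= MIN_READ_DEPTH
```

With eight overlapping reads of which one has base quality 10 at the position, the counted depth is 7 and the position is not considered covered.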

    To consolidate the results for all individuals, blocks of regions (contiguous chromosome locations) were identified, for which at least one individual was covered for each location in the block. The number of individuals covered was averaged over the block locations, and a standard deviation was calculated. In addition, the read depths were averaged over the number of individuals and over the block locations, and again a standard deviation was calculated.
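
    The block consolidation described above can be sketched as follows. The input, a mapping from chromosome position to the number of individuals covered at that position, and the names used are assumptions; the text does not say whether a population or sample standard deviation is used, so the population form (`statistics.pstdev`) is assumed here. The same averaging would apply to read depths.

```python
from statistics import mean, pstdev
from typing import Dict, List


def coverage_blocks(samples_covered: Dict[int, int]) -> List[dict]:
    """Group contiguous covered positions into blocks and summarize them.

    samples_covered maps a chromosome position to the number of
    individuals covered there. A position belongs to a block only if at
    least one individual is covered.
    """
    blocks: List[dict] = []
    current: List[tuple] = []  # (position, n_covered) pairs of the open block

    def close_block() -> None:
        counts = [n for _pos, n in current]
        start, stop = current[0][0], current[-1][0]
        blocks.append({
            "start": start,
            "stop": stop,
            "span": stop - start + 1,          # block length in bp
            "avg_samples_covered": mean(counts),
            "sd_samples_covered": pstdev(counts),
        })

    for pos in sorted(samples_covered):
        n = samples_covered[pos]
        if n < 1:
            continue  # uncovered position: it leaves a gap that ends the block
        if current and pos != current[-1][0] + 1:
            close_block()
            current = []
        current.append((pos, n))

    if current:
        close_block()
    return blocks
```

For example, positions 100 and 101 covered in 5 and 7 individuals, followed by an uncovered position 102, would form one block of span 2 with an average of 6 samples covered.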

Column Description for Summary of Coverage table

Chromosome, Chr. Start Pos., and Chr. Stop Pos.:

a contiguous region on a chromosome for a block as described above.

Sequencing Span:

the length of a sequencing block in base-pairs (bp).

# of Samples Covered:

the average number of samples considered to be covered, averaged over the block locations.

Avg. Sample Read Depth:

the read depths averaged over individuals and block locations.

# of EA Samples Covered:

the average number of European American samples considered to be covered, averaged over the block locations.

Avg. EA Sample Read Depth:

the read depths averaged over European American individuals and block locations.

# of AA Samples Covered:

the average number of African American samples considered to be covered, averaged over the block locations.

Avg. AA Sample Read Depth:

the read depths averaged over African American individuals and block locations.