Download data

The Genome of the Netherlands data distributed on this page is publicly available and can be used under the condition of citation (main paper, additional papers when appropriate).

The Genome of the Netherlands Consortium. Whole-genome sequence variation, population structure and demographic history of the Dutch population. Nature Genetics (2014) doi:10.1038/ng.3021. See (pubmed)
Francioli et al. Genome-wide patterns and properties of de novo mutations in humans. Nature Genetics (2015) doi:10.1038/ng.3292. See (pubmed)

In addition, the following acknowledgements can be used when reusing public data on a website:

“This study/database makes use of data generated by the Genome of the Netherlands Project. A full list of the investigators is available from Funding for the project was provided by the Netherlands Organization for Scientific Research under award number 184021007, dated July 9, 2009 and made available as a Rainbow Project of the Biobanking and Biomolecular Research Infrastructure Netherlands (BBMRI-NL). The sequencing was carried out in collaboration with the Beijing Institute for Genomics (BGI).”

Single nucleotide variants (SNVs)

These files contain a total of 20.4M SNVs and the complete information output by the GATK UnifiedGenotyper v1.4 on all 767 GoNL samples. It is important to note that these calls are not trio-aware and that all genotypes were reported regardless of their quality. Both filtered and passing calls are reported in these files. Filtered calls include (1) calls failing our VQSR threshold and (2) calls in the GoNL inaccessible genome.

The following pipeline was used to call SNVs:

  1. SNVs were called with GATK UnifiedGenotyper v1.4 on the data from all individuals simultaneously. All calls with a quality >Q10 were written to the VCF file.
  2. SNVs were filtered using GATK VQSR (using HapMap, Omni and 1000 Genomes Project (1KGP) sites for training, and QD, HaplotypeScore, MQRankSum, ReadPosRankSum, FS, InbreedingCoeff, DP and MQ as features).
  3. SNVs present in 1KGP-EU were kept regardless of their VQSR filtering.
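
Because these files report both passing and filtered calls, most analyses will first select records by the FILTER column. The following is a minimal Python sketch, assuming the standard tab-delimited VCF column layout and that passing records carry PASS in the FILTER column. The filter labels used in the GoNL files are not stated on this page, so the label in the example lines is hypothetical and the records themselves are invented for illustration.

```python
def passing_records(vcf_lines):
    """Yield (chrom, pos, ref, alt) for records whose FILTER column is PASS."""
    for line in vcf_lines:
        if line.startswith("#"):  # skip meta-information and header lines
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt, _qual, flt = fields[:7]
        if flt == "PASS":
            yield chrom, int(pos), ref, alt


# Invented example records; "HypotheticalFilter" is not a real GoNL label.
example = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t1000\t.\tA\tG\t50\tPASS\t.\n",
    "1\t2000\t.\tC\tT\t12\tHypotheticalFilter\t.\n",
]
kept = list(passing_records(example))
```

For the compressed files, the same generator can be fed from gzip.open(path, "rt") in the Python standard library.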

Current release:

Older releases:

  • GoNL SNPs release 2: Summary counts of alternative alleles from only the parents (498 individuals) in VCF format
  • GoNL SNPs release 4: Summary counts of alternative alleles and genotypes from only the parents (498 individuals) in VCF format

Short insertions and deletions (indels)

The files contain a total of 1.1M short indels (1bp-20bp) called by PINDEL, CLEVER, SOAPdenovo and GATK UnifiedGenotyper. All annotations produced by these methods were kept and the “set” annotation in the INFO field shows which methods called the indel. As part of this release, you will find all indels that passed the method-specific filters. Note that sites filtered as part of the inaccessible genome were kept but flagged as filtered. Genotypes and genotype likelihoods were called using the GATK UnifiedGenotyper.

The following pipeline was used to call the indels:

  1. Indels were called using PINDEL, CLEVER, SOAPdenovo and GATK UnifiedGenotyper.
  2. Each method was filtered using best practices for that method.
  3. Indels of size <=20bp and detected by both GATK UnifiedGenotyper and one other method were kept.
  4. Indels were phased/imputed using MVNCall.
  5. Indels in inaccessible parts of the genome were filtered.
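
The consensus rule in step 3 can be sketched as a small predicate. This is an illustration, not the project's actual code: the caller labels are hypothetical, "one other method" is read here as "at least one other method", and indel size is taken as the length difference between the REF and ALT alleles. How caller names are encoded in the "set" annotation is not stated on this page, so parsing that field is left out.

```python
def keep_indel(ref, alt, callers, max_size=20):
    """Return True if an indel satisfies the size and consensus criteria.

    ref, alt : allele strings as in a VCF record (including the anchor base)
    callers  : set of caller labels that reported the indel (hypothetical labels)
    """
    size = abs(len(ref) - len(alt))      # inserted or deleted bases
    others = set(callers) - {"GATK"}     # callers other than GATK UnifiedGenotyper
    return 1 <= size <= max_size and "GATK" in callers and len(others) >= 1


# A 1 bp deletion seen by GATK and one other caller is kept.
example_kept = keep_indel("AT", "A", {"GATK", "PINDEL"})
# The same deletion without GATK support is dropped.
example_dropped = keep_indel("AT", "A", {"PINDEL", "CLEVER"})
```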

Current release:

Structural variants

Current release:

The following pipeline was used to call the indels and structural variants:

Twelve tools representing five different algorithmic approaches to variant calling (gapped alignment, split-read mapping, discordant read pairs, read depth, de novo assembly) were used: Pindel, GATK UnifiedGenotyper, GATK HaplotypeCaller, 123SV, BreakDancer, DWAC-Seq, CNVNator, FAÇADE, Mate-Clever, GenomeSTRiP, SOAPdenovo de novo assembly, Mobster. Calls from each of the methods were filtered according to the method best practices. The publicly available data contain:

  • Simple InDel set – A data set with simple indels (1-20 base pairs) was constructed by merging four individual callsets obtained by running GATK HaplotypeCaller, Pindel, Mate-Clever and SOAPdenovo assembly (n=1,739,300). (File: [date]_GoNL_AF_simple_indels.vcf.gz)
  • Complex InDel set – Genomic regions showing a high density of polymorphisms (distance between adjacent polymorphisms below 30 base pairs) were tested for complex events, that is, alleles that potentially arose from a single mutational event but were called as separate adjacent events (n=52,913). (File: [date]_GoNL_AF_complex_indels.vcf.gz)
  • Structural Variants set – After creation of the algorithm-specific call sets, a consensus set of InDels and SVs was made for each of the SV types (indels, deletions, insertions, duplications, inversions, interchromosomal events, and mobile element insertions). Events were merged per variant type using an algorithm-aware merging strategy. A consensus region was defined when overlapping regions were identified by 2 different detection strategies (for example split read and discordant read pair, stratified by AF and event length), and the boundaries of the event were determined by the algorithm with the highest breakpoint accuracy (as determined by the calling strategy) in combination with a 50% reciprocal overlap. The resulting set consists of 54,696 genotyped and 4,662 non-genotyped structural variants. (Files: [date]_GoNL_AF_genotyped_SVs.vcf.gz and [date]_GoNL_AF_nongenotyped_SVs.vcf.gz)
  • Novel segments – We realigned the individual-specific sets of discordant reads using these new segments as a reference sequence, in order to determine their presence/absence in the libraries of each individual. The dataset of new segments (n=11,350, total length=7.8Mbp) was divided based on their population frequency (Fixed, > 95%; Common, 5-95%; Rare, < 5%), gender (Male-specific, >5% population) and a match to herpesvirus DNA (4 individuals from two families). Finally, we used NCBI BLAST to check if these segments were present in the most recent GRCh38/hg38 genome reference or a decoy dataset hg38d1. We required 99% identity in the alignment between assembled segment and latest genome reference to discard a segment as unreported by GRCh38 (total length=4.3Mbp). (File: [date]_GoNL_novel_segments.fa.gz)
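
The 50% reciprocal overlap criterion mentioned for the structural variants set can be illustrated as follows. This is a generic sketch of reciprocal overlap between two calls, assuming half-open (start, end) coordinates on the same chromosome. It does not reproduce the algorithm-aware merging or the breakpoint selection described above, and the function names are hypothetical.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Return the smaller of the two overlap fractions of intervals a and b."""
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return 0.0  # intervals do not overlap
    return min(overlap / (a_end - a_start), overlap / (b_end - b_start))


def same_event(a, b, threshold=0.5):
    """True if two (start, end) calls overlap reciprocally by at least threshold."""
    return reciprocal_overlap(*a, *b) >= threshold


# Two 100 bp calls shifted by 50 bp overlap each other by exactly 50%.
merged = same_event((100, 200), (150, 250))
# A short call barely touching a long call fails the reciprocal criterion.
not_merged = same_event((100, 200), (190, 400))
```

Taking the minimum of the two fractions is what makes the criterion reciprocal: a small call lying inside a much larger one covers 100% of itself but only a small share of the larger call, so the pair is not merged.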

Older releases:

The files contain 27.8k SV calls (>20bp). Calling was performed using 10 different approaches (see below) and a consensus strategy was used to produce this set. The SOURCE field in the INFO column lists all methods that called each of the events. As most methods do not report genotypes but rather presence/absence of an SV in an individual, we report here either a homozygous reference (0/0) in case of the absence of an SV or a genotype with one alternative allele and one unknown allele (./1) in case of the presence of an SV.
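
Since the genotypes in these files encode presence or absence rather than zygosity, a reader will usually collapse them to a boolean. A minimal Python sketch, assuming the genotype strings are exactly those described above (0/0 and ./1); the function names are hypothetical:

```python
def sv_present(genotype):
    """True if the genotype string carries at least one alternative allele."""
    alleles = genotype.replace("|", "/").split("/")
    return "1" in alleles


def carrier_count(genotypes):
    """Count individuals in which the SV was reported as present."""
    return sum(sv_present(g) for g in genotypes)


# Three individuals, two of which carry the SV.
n_carriers = carrier_count(["0/0", "./1", "./1"])
```

Note that a carrier count is not an allele count here: because one allele of each carrier is unknown, allele frequencies cannot be derived directly from these genotypes.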

The following pipeline was used to call the SVs:

  1. SVs were called using 123SV, Breakdancer, CNVnator, DWACSeq, FACADE, GATK Unified Genotyper, GenomeSTRiP, MATE-CLEVER, PINDEL and SOAPdenovo.
  2. Each method was filtered according to the method best practices.
  3. SVs called by at least 2 different approaches AND present in at least 3 families AND transmitted to at least 1 offspring were kept.
  4. Sites in the inaccessible genome were filtered.
  • GoNL SV release 5: Summary counts of consensus deletion regions >=20 basepairs in size from only the parents (498 individuals)
  • GoNL SV release 1: Summary counts of consensus deletion regions 100 basepairs and larger in size in txt format
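
The consensus rule in step 3 of the SV pipeline above combines three conditions, which can be sketched as a single predicate. The record layout below (SVCandidate and its field names) is hypothetical and only serves the illustration; it is not the format of the released files.

```python
from dataclasses import dataclass


@dataclass
class SVCandidate:
    approaches: set             # approaches that detected the event
    families: set               # hypothetical identifiers of families carrying it
    transmitted_offspring: int  # number of offspring that inherited the event


def keep_sv(candidate):
    """Apply the three consensus conditions: approaches, families, transmission."""
    return (
        len(candidate.approaches) >= 2
        and len(candidate.families) >= 3
        and candidate.transmitted_offspring >= 1
    )


supported = SVCandidate({"split-read", "read-pair"}, {"F1", "F2", "F3"}, 1)
single_approach = SVCandidate({"read-depth"}, {"F1", "F2", "F3"}, 2)
kept_flags = [keep_sv(supported), keep_sv(single_approach)]
```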

De novo mutations and mutation rate map

A total of 11,020 de novo mutations were called in the 250 families from GoNL, representing the largest set of human mutations to date, and the first in healthy individuals. The list of de novo mutations is available as a tab-delimited file here: GoNL_DNMs.txt
We analysed the distribution of these mutations throughout the genome and used this information in a model based on human-chimpanzee divergence rates and including local sex-averaged recombination rates (UCSC deCODE track), mutation type and transcribed/non-transcribed strand in coding regions to create a genome-wide human mutation rate map. We defined 2,339 non-overlapping 1 Mb windows across the genome after excluding windows where (1) >10% of the window lacked recombination rates, (2) the average sex-averaged local recombination rate across the window was 0 cM/Mb or greater than 3 cM/Mb, (3) the primate local substitution rate was estimated to be 0 or extremely high, or (4) more than 80% of the window had less than 90% estimated de novo mutation calling sensitivity. The data are provided as BED-formatted files. The neutral mutation rates are available normalized to a total mutation rate of 1.2 x 10^-8 (local_mutation_rate.bias_corrected.SEXAVG.bed).
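
The four window exclusion criteria can be written as one predicate. This is a sketch under stated assumptions: the argument names are hypothetical, all fractions are expressed between 0 and 1, and because the page does not define "extremely high" for the substitution rate, that cut-off is left as a caller-supplied parameter (max_substitution_rate) rather than guessed.

```python
def exclude_window(frac_missing_recomb, mean_recomb_cm_per_mb,
                   substitution_rate, frac_low_sensitivity,
                   max_substitution_rate):
    """Return True if a 1 Mb window meets any of the four exclusion criteria."""
    return (
        frac_missing_recomb > 0.10                    # (1) recombination rates missing
        or mean_recomb_cm_per_mb == 0                 # (2) no recombination ...
        or mean_recomb_cm_per_mb > 3                  #     ... or above 3 cM/Mb
        or substitution_rate == 0                     # (3) substitution rate zero ...
        or substitution_rate > max_substitution_rate  #     ... or above the assumed cap
        or frac_low_sensitivity > 0.80                # (4) poor DNM calling sensitivity
    )


# Invented values: an ordinary window and one with too many missing rates.
ordinary = exclude_window(0.02, 1.1, 0.012, 0.05, max_substitution_rate=0.05)
sparse = exclude_window(0.25, 1.1, 0.012, 0.05, max_substitution_rate=0.05)
```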
In addition, we computed the mutation rate for coding sequences of autosomal protein-coding transcripts downloaded from Ensembl v74. In total, we could compute the mutation rates for 54,310 transcripts (out of 78,818), including 15,462 canonical transcripts (out of 19,270). The canonical transcript was defined as the transcript with the longest coding sequence for each gene. The genomic window i of each transcript was determined by the genomic coordinate of the midpoint between its transcription start and end sites. For each transcript, all potential nonsense, missense and silent mutations were scanned with respect to the reference sequence, and their mutation rates were aggregated over the sequence. The data are provided as tab-delimited files and contain mutation rates for synonymous, missense and nonsense mutations separately. As with the whole-genome data, the neutral mutation rates are available normalized to a total mutation rate of 1.2 x 10^-8 (All: functional_mut_rate.bias_corrected.local.bed, Canonical only: functional_mut_rate.bias_corrected.local.canonical_tx_only.bed).
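
The two definitions used for transcripts, the canonical transcript and the genomic window, can be sketched as follows. The dictionary keys and transcript identifiers are hypothetical. The sketch also assumes that windows are aligned at multiples of 1 Mb from the start of the chromosome and that ties in coding sequence length go to the first transcript encountered; neither detail is stated on this page.

```python
def canonical_transcripts(transcripts):
    """Map each gene to its transcript with the longest coding sequence."""
    best = {}
    for tx in transcripts:
        current = best.get(tx["gene"])
        if current is None or tx["cds_length"] > current["cds_length"]:
            best[tx["gene"]] = tx  # strictly longer, so the first one wins ties
    return best


def window_index(tx_start, tx_end, window_size=1_000_000):
    """Index of the window containing the midpoint of the transcript."""
    midpoint = (tx_start + tx_end) // 2
    return midpoint // window_size


# Invented transcripts for one gene; identifiers are placeholders.
txs = [
    {"id": "tx_a", "gene": "gene_1", "cds_length": 900},
    {"id": "tx_b", "gene": "gene_1", "cds_length": 1500},
]
canonical_id = canonical_transcripts(txs)["gene_1"]["id"]
win = window_index(1_900_000, 2_300_000)  # midpoint at 2,100,000
```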
A detailed description of the data and methods used to derive this mutation rate map is available in: Francioli et al. Genome-wide patterns and properties of de novo mutations in humans. Nature Genetics (2015) doi:10.1038/ng.3292. See (pubmed)