HapFABIA: Identification of very short segments of identity by descent characterized by rare variants in large sequencing data

This site contains supporting material for the manuscript "HapFABIA: Identification of very short segments of identity by descent characterized by rare variants in large sequencing data".

Summary

Our method HapFABIA identifies short identity by descent (IBD) segments that are tagged by rare variants in large sequencing data. Two haplotypes are identical by descent (IBD) if they share a segment that both inherited from a common ancestor. Current IBD methods reliably detect long IBD segments because many minor alleles in the segment are concordant between the two haplotypes. However, many cohort studies contain unrelated individuals who share only short IBD segments. Short IBD segments contain too few minor alleles to distinguish IBD from random allele sharing caused by recurrent mutations. New sequencing techniques improve the situation by providing rare variants, which convey more information on IBD than common variants because random sharing of a minor allele is less likely for rare variants than for common variants.
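To make the last point concrete, the following is a minimal sketch of how quickly chance sharing becomes unlikely as variants get rarer. It assumes a deliberately simplified model that is not HapFABIA's own: the two haplotypes are drawn independently, SNVs are independent of each other (no linkage disequilibrium), and the chance that both haplotypes carry the minor allele at an SNV with minor allele frequency p is p squared. The function name and the example frequencies are hypothetical.

```python
def chance_sharing_probability(minor_allele_freqs):
    """Probability that two independently drawn haplotypes both carry the
    minor allele at every listed SNV.

    Simplifying assumption (not part of HapFABIA): SNVs are independent,
    so the per-SNV probabilities p * p simply multiply.
    """
    prob = 1.0
    for p in minor_allele_freqs:
        prob *= p * p
    return prob


# Five shared minor alleles at common SNVs (hypothetical MAF of 30%)
common = chance_sharing_probability([0.3] * 5)

# Five shared minor alleles at rare SNVs (hypothetical MAF of 1%)
rare = chance_sharing_probability([0.01] * 5)

print(f"chance sharing, common SNVs: {common:.3g}")
print(f"chance sharing, rare SNVs:   {rare:.3g}")
```

Under these assumptions, the same number of concordant minor alleles is many orders of magnitude less likely to arise by chance when the variants are rare, which is why a handful of rare tagSNVs can support a short segment.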


IBD segment (yellow) that descended from a founder to different individuals.

Short IBD segments are of interest because (i) they resolve the genetic structure on a fine scale and (ii) they can be assumed to be old. In order to detect short IBD segments, both the information supplied by rare variants and information from more than two individuals should be utilized. These two characteristics are the basis for detecting short IBD segments by HapFABIA.

We propose biclustering to detect very short IBD segments that are shared among multiple individuals. Biclustering simultaneously clusters rows and columns of a matrix. In particular, it clusters row elements that are similar to each other on a subset of column elements. A genotype matrix has individuals (unphased) or chromosomes (phased) as row elements and SNVs as column elements. Entries in the genotype matrix usually count how often the minor allele of a particular SNV is present in a particular individual. Alternatively, minor allele likelihoods or dosages may be used. Individuals that share an IBD segment are similar to each other at the minor alleles of the SNVs (tagSNVs) which tag the IBD segment (see the figure below). Therefore, an IBD segment that is shared among individuals corresponds to a bicluster because these individuals are similar to one another at this segment. Identifying a bicluster means identifying tagSNVs (column bicluster elements) that tag an IBD segment and, simultaneously, identifying individuals (row bicluster elements) that possess the IBD segment.


Biclustering of a genotyping matrix. Left: original genotyping data matrix with individuals as row elements and SNVs as column elements. Minor alleles are indicated by violet bars and major alleles by yellow bars for each individual-SNV pair. Right: after sorting the rows, the detected bicluster can be seen in the top three individuals. They contain the same IBD segment which is marked in gold. Biclustering simultaneously clusters rows and columns of a matrix so that row elements (here individuals) are similar to each other on a subset of column elements (here the tagSNVs).
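The correspondence between a shared IBD segment and a bicluster can be illustrated on a toy matrix. The sketch below is a naive greedy search written only for illustration; it is not the HapFABIA algorithm and does not use the hapFabia package. The function name, the thresholds `max_carriers` and `min_overlap`, and the toy data are all hypothetical. It keeps SNVs whose minor allele is carried by only a few haplotypes, seeds a bicluster with the pair of haplotypes sharing the most rare minor alleles, and then adds every haplotype that carries most of those tagSNVs.

```python
from itertools import combinations


def find_toy_bicluster(genotypes, max_carriers=3, min_overlap=0.75):
    """Greedy toy bicluster search on a 0/1 haplotype-by-SNV matrix.

    Illustrative stand-in only, not HapFABIA. Returns a pair
    (row indices, column indices): the haplotypes and the tagSNVs.
    """
    n_rows = len(genotypes)
    n_cols = len(genotypes[0])

    # Keep rare SNVs: minor allele carried by 2..max_carriers haplotypes.
    rare_cols = [
        j for j in range(n_cols)
        if 2 <= sum(row[j] for row in genotypes) <= max_carriers
    ]

    # Seed: the pair of haplotypes sharing the most rare minor alleles.
    best_shared = []
    for a, b in combinations(range(n_rows), 2):
        shared = [j for j in rare_cols if genotypes[a][j] and genotypes[b][j]]
        if len(shared) > len(best_shared):
            best_shared = shared
    if not best_shared:
        return [], []

    # Extend: add every haplotype carrying enough of the seed's tagSNVs.
    needed = min_overlap * len(best_shared)
    rows = [
        i for i in range(n_rows)
        if sum(genotypes[i][j] for j in best_shared) >= needed
    ]
    return rows, best_shared


# Toy data: rows are haplotypes, columns are SNVs, 1 = minor allele.
# Haplotypes 0, 2 and 5 share a segment tagged by SNVs 1, 3, 4 and 6;
# haplotype 5 lacks the minor allele at SNV 4 (e.g. a genotyping error).
toy_matrix = [
    [1, 1, 0, 1, 1, 0, 1, 0],
    [1, 0, 0, 0, 0, 0, 0, 1],
    [0, 1, 0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 1, 0],
]

rows, tag_snvs = find_toy_bicluster(toy_matrix)
print("haplotypes in bicluster:", rows)
print("tagSNVs in bicluster:   ", tag_snvs)
```

Requiring only a fraction of the tagSNVs (`min_overlap`) lets a haplotype with a missing minor allele still join the bicluster, which mirrors the tolerance that real data with sequencing or genotyping errors would need.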

Publication

Research Report: IBD between Humans, Neandertals, and Denisovans

Software, Data, Source Codes




Examples of Short IBD Segments in Chromosome 1 of the 1000 Genomes Project

Figures 1-6: Examples of IBD segments that were extracted from chromosome 1 of the 1000 Genomes Project. For these phased genotype data, phasing errors can be seen (yellow lines from the left hand side). Click on any of these thumbnails to view full-size images.


Fig. 1: IBD segment exclusively found in Africans. The third and fourth lines very likely show a phasing error, as both chromosomes belong to the same individual. The same applies to the fifth-to-last and sixth-to-last lines.

Fig. 2: IBD segment observed in all populations, including one African. However, this might also be a region of sequencing errors because the tagSNV pattern is not very clear.

Fig. 3: IBD segment observed in all populations.

Fig. 4: IBD segment shared by Africans and one admixed American. Phasing errors again occur in the last two lines (NA20299) and in lines 11 and 12 (NA19248).

Fig. 5: IBD segment shared by Africans and Asians. Phasing errors at lines 6 and 7 (NA18636).

Fig. 6: IBD segment shared by Africans and Europeans. Phasing errors at lines 8 and 9 (NA18516), lines 23 and 24 (NA19310), and lines 32 and 33 (NA19384).



Short IBD Segments Found in Data from the Korean Personal Genome Project (KPGP)

The Korean Personal Genome Project (KPGP) is part of the international Personal Genome Project (PGP) and was established by the Genome Research Foundation (GRF). 39 human genomes were sequenced on an Illumina HiSeq 2000 platform with 30x to 40x coverage. The genotypes of these 38 Koreans and one Caucasian female are combined with the genotype data of the 1000 Genomes Project to extract short IBD segments by HapFABIA.


Data and results of the hapFabia IBD segment extraction on the KPGP data (using hapFabia 0.90.0):
Data, results, and KPGP IBD segments


The KPGP data contain two twin pairs (KPGP88/KPGP89 and KPGP90/KPGP91) and a family (KPGP1-KPGP12). KPGP10 is a Caucasian female from the US. The relations are given in the following pedigree charts:

Pedigree charts for the KPGP individuals. Click on thumbnail to view full-size image.


Figures K1-K7: Examples of short IBD segments from chromosome 1 of the KPGP combined with the 1000 Genomes Project. Click on any of these thumbnails to view full-size images.


Fig. K1: IBD segment caused by systematic sequencing errors. Note that this segment is observed in all KPGP individuals and only those, though KPGP10 is a Caucasian female.

Fig. K2: IBD segment with sequencing errors for KPGP individuals at the right-hand side. Some Koreans are classified as having this segment only because they agree with other Koreans at the sequencing errors.

Fig. K3: IBD segment that matches the Denisova genome and is shared among Asians, in particular Koreans.

Fig. K4: Another IBD segment that matches the Denisova genome and is shared by Asians, in particular observed in Koreans.

Fig. K5: IBD segment exclusively shared by Koreans.

Fig. K6: IBD segment that is shared by both Korean twin pairs. Sequencing errors are visible because twins should have the same IBD segments.

Fig. K7: IBD segment that is shared by many members of the Korean family. The IBD segment was passed down from KPGP1 to all her children (KPGP3, KPGP5, KPGP9) and some of her grandchildren (KPGP7, KPGP11, KPGP12).



Correlation between population proportions and ancient genomes based on short IBD segments


Pearson's correlation between the Denisova genome and different populations.



Fisher test for dependencies between the Denisova genome and different populations.



Pearson's correlation between the Neandertal genome and different populations.



Fisher test for dependencies between the Neandertal genome and different populations.