HapRFN: a deep learning method for identifying short IBD segments

A segment of DNA is called identical by descent (IBD) in two or more individuals if it is identical because it was inherited from a common ancestor. IBD segments can be used to uncover hidden familial relationships, to determine the population of origin of an individual, or to detect interbreeding between humans and ancient hominins such as Neandertals [1]. IBD segments can also be used to find the causes of diseases via a technique called IBD mapping.

We recently introduced “Rectified Factor Networks” (RFNs) as an unsupervised deep learning approach [3]. Each code unit of the RFN represents a bicluster and therefore an IBD segment: the samples for which the code unit is active share the bicluster (IBD segment), and the features (DNA variants) that have activating weights to the code unit tag the IBD segment. HapRFN overcomes the problems of HapFABIA. (1) RFNs provide sparser codes via their rectified linear units, which directly yield bicluster memberships: a sample belongs to a bicluster when the corresponding factor is different from zero. (2) RFNs can learn thousands of factors and therefore many IBD segments simultaneously. The resulting IBD segments are mutually decorrelated, so they are not redundant and do not overlap. (3) RFNs allow for much faster processing of very large data sets using techniques from deep learning, such as efficient matrix multiplications and implementations of networks on graphics processing units (GPUs).
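The mapping from code units to IBD segments described above can be illustrated with a minimal NumPy sketch. It is not the HapRFN implementation: the function name `extract_biclusters`, the `weight_threshold` parameter, the requirement of at least two samples, and the toy matrices are all assumptions made for illustration.

```python
import numpy as np


def extract_biclusters(H, W, weight_threshold=0.5):
    """Read candidate biclusters from a rectified code matrix.

    H: (n_samples, n_units) non-negative codes; H[i, k] > 0 means
       code unit k is active for sample i.
    W: (n_features, n_units) weights; features whose weight to unit k
       exceeds weight_threshold (hypothetical cutoff) tag the segment.
    """
    biclusters = []
    for k in range(H.shape[1]):
        samples = np.flatnonzero(H[:, k] > 0)
        features = np.flatnonzero(W[:, k] > weight_threshold)
        # Assumption: a shared segment needs at least two samples
        # and at least one tagging feature.
        if samples.size >= 2 and features.size > 0:
            biclusters.append((k, samples, features))
    return biclusters


# Toy data: 4 samples, 5 DNA variants, 2 code units.
H = np.array([[0.9, 0.0],
              [1.2, 0.0],
              [0.0, 0.0],
              [0.0, 0.7]])
W = np.array([[0.8, 0.0],
              [0.9, 0.0],
              [0.0, 0.6],
              [0.1, 0.0],
              [0.0, 0.0]])

result = extract_biclusters(H, W)
for unit, samples, features in result:
    print(unit, samples.tolist(), features.tolist())
```

In this toy example only code unit 0 is reported, because unit 1 is active in a single sample and therefore cannot describe a segment shared between individuals.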

To keep feature membership vectors sparse, we introduce a Laplace prior on the parameters. As a consequence, only a few features contribute to activating a code unit, that is, only a few features belong to a bicluster. To enforce more sparseness of the sample membership vectors, we introduce dropout of code units. With dropout, some code units are set to zero during training, at the same step in which they are rectified. Dropout avoids co-adaptation of code units and reduces the correlation between them.
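The two sparsity mechanisms can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the RFN training procedure: a Laplace prior corresponds to an L1 penalty on the weights, and soft-thresholding is the standard proximal step for such a penalty, but whether HapRFN applies it in exactly this form is an assumption. The names `rectify_with_dropout`, `soft_threshold`, `drop_rate`, and `strength` are hypothetical.

```python
import numpy as np


def rectify_with_dropout(pre_activation, drop_rate, rng):
    """Rectify code units and zero a random subset in the same step."""
    # keep[i, k] is False with probability drop_rate.
    keep = rng.random(pre_activation.shape) >= drop_rate
    return np.maximum(pre_activation, 0.0) * keep


def soft_threshold(W, strength):
    """Proximal step for an L1 penalty (Laplace prior on the weights).

    Weights with magnitude below `strength` become exactly zero,
    the remaining weights shrink toward zero by `strength`.
    """
    return np.sign(W) * np.maximum(np.abs(W) - strength, 0.0)


rng = np.random.default_rng(0)

# Codes for 6 samples and 4 code units: non-negative, partly dropped.
pre = rng.normal(size=(6, 4))
codes = rectify_with_dropout(pre, drop_rate=0.5, rng=rng)

# Small weights vanish, so fewer features belong to each bicluster.
W = np.array([[0.05, -0.8],
              [0.3, -0.02]])
W_sparse = soft_threshold(W, strength=0.1)
```

Here the two weights with magnitude below 0.1 are set to zero, while the other two shrink to -0.7 and 0.2.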

As a result, HapRFN makes it possible to process very large data sets and to determine the size and number of IBD segments more precisely. With HapRFN we are able to accurately detect familial relationships, populations of origin, or interbreeding with ancient genomes in data sets with thousands of individuals. Furthermore, finding disease associations via IBD mapping becomes more reliable, which might be the key to uncovering unknown hereditary causes of multifactorial diseases.

Please cite:

Clevert D.-A., Mayr A., Unterthiner T., and Hochreiter S. (2015) Rectified factor networks. Advances in Neural Information Processing Systems 28 (NIPS 2015), eds. Cortes C., Lawrence N. D., Lee D. D., Sugiyama M., and Garnett R. (Montreal, QC), 1846–1854.