The Mysterious Case of the Delayed Twin: using research data to resolve linkage questions.

Introduction: In a large biobank of over half a million people, we have several pairs of participants who appear to share their genome. As more individuals are sequenced, more pairs are likely to be found. If these are twins then this is great news, but it isn’t quite that simple. Objectives and...

Bibliographic Details
Main Author: Daniel Avery
Format: Article
Language: English
Published: Swansea University 2018-08-01
Series: International Journal of Population Data Science
Online Access: https://ijpds.org/article/view/643
id doaj-205473d01e6940ef9022a3c470c6cb6c
record_format Article
spelling doaj-205473d01e6940ef9022a3c470c6cb6c; 2020-11-24T21:49:05Z; eng; Swansea University; International Journal of Population Data Science; 2399-4908; 2018-08-01; vol. 3, issue 4; 10.23889/ijpds.v3i4.643; The Mysterious Case of the Delayed Twin: using research data to resolve linkage questions; Daniel Avery (University of Oxford); https://ijpds.org/article/view/643
collection DOAJ
language English
format Article
sources DOAJ
author Daniel Avery
title The Mysterious Case of the Delayed Twin: using research data to resolve linkage questions.
publisher Swansea University
series International Journal of Population Data Science
issn 2399-4908
publishDate 2018-08-01
description Introduction: In a large biobank of over half a million people, we have several pairs of participants who appear to share their genome. As more individuals are sequenced, more pairs are likely to be found. If these are twins then this is great news, but it isn’t quite that simple.
Objectives and Approach: Where two people share a genome, we need to be able to confirm that they are twins. However, a number of issues could cause two people to appear to share a genome, for example being recruited twice or donating blood on another’s behalf. We already identify and exclude participant data based on these conditions. We developed our methodology by examining the first identified pair in great detail, looking for evidence that specifically ruled out possible alternative explanations, and then applying and refining the method on later pairs.
Results: We were able to demonstrate that the first pair were almost certainly twins, using their biochemistry and family questionnaire data as principal sources. We also identified a number of variables that were useful in indicating the likelihood that a pair are twins, and these now form part of a methodology which we are still developing. Even more usefully, we identified a number of variables that seemed like useful measures but proved extremely misleading. To date we have 26 pairs of possible twins, with 9 confirmed as twins and the remaining 17 looking likely to be twins but falling short of a threshold for confidence. We also have 75 pairs which confirm duplicate participants we have already excluded.
Conclusion/Implications: We drew two lessons: even very simple linkages come with pitfalls, and you should gather more administrative data than you think you need. We are proposing the collection of additional familial relationship data in our third resurvey. We are also looking into machine learning and statistical techniques to better identify twins and duplicates.
url https://ijpds.org/article/view/643