The Mysterious Case of the Delayed Twin: using research data to resolve linkage questions.
Introduction In a large biobank of over half a million people, we have several pairs of participants who appear to share their genome. As more individuals are sequenced, more pairs are likely to be found. If these are twins then this is great news, but it isn’t quite that simple. Objectives and...
Main Author: | Daniel Avery |
---|---|
Format: | Article |
Language: | English |
Published: | Swansea University, 2018-08-01 |
Series: | International Journal of Population Data Science |
Online Access: | https://ijpds.org/article/view/643 |
id |
doaj-205473d01e6940ef9022a3c470c6cb6c |
---|---|
record_format |
Article |
spelling |
doaj-205473d01e6940ef9022a3c470c6cb6c; 2020-11-24T21:49:05Z; eng; Swansea University; International Journal of Population Data Science; ISSN 2399-4908; 2018-08-01; vol. 3, no. 4; doi:10.23889/ijpds.v3i4.643; The Mysterious Case of the Delayed Twin: using research data to resolve linkage questions; Daniel Avery (University of Oxford); https://ijpds.org/article/view/643 |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Daniel Avery |
title |
The Mysterious Case of the Delayed Twin: using research data to resolve linkage questions. |
publisher |
Swansea University |
series |
International Journal of Population Data Science |
issn |
2399-4908 |
publishDate |
2018-08-01 |
description |
Introduction
In a large biobank of over half a million people, we have several pairs of participants who appear to share a genome. As more individuals are sequenced, more such pairs are likely to be found. If these pairs are twins, this is great news, but it isn’t quite that simple.
Objectives and Approach
Where two people share a genome, we need to be able to confirm that they are twins. However, several other issues can make two records appear to share a genome, for example one person being recruited twice or one person donating blood on another’s behalf. We already identify and exclude participant data affected by these conditions. We developed our methodology by examining the first identified pair in great detail, looking for evidence that specifically ruled out the alternative explanations, and then applying and refining the method on later pairs.
Results
We were able to demonstrate that the first pair were almost certainly twins, using their biochemistry and family questionnaire data as principal sources. We also identified a number of variables that were useful in indicating the likelihood of a twin; these now form part of a methodology which we are still developing. Even more usefully, we identified a number of variables that seemed like useful measures but proved extremely misleading. To date we have 26 pairs of possible twins: 9 are confirmed as twins, and the remaining 17 look likely to be twins but fall short of our threshold for confidence. We also have 75 pairs which confirm duplicate participants we have already excluded.
Conclusion/Implications
We drew two lessons: even very simple linkages come with pitfalls, and you should gather more administrative data than you think you need. We are proposing the collection of additional familial relationship data in our third resurvey. We are also looking into machine learning and statistical techniques to better identify twins and duplicates.
|
url |
https://ijpds.org/article/view/643 |