Summary: | Master's === National Yang-Ming University === Institute of Medical Biotechnology === 90 === Changes in genetic information are an important research resource in the study of evolution and genetic disease, and the establishment of variation databases is critical for the storage and exchange of variation information. Such databases are also essential for the study of functional genomics. Variation databases can be categorized into Locus-Specific Databases (LSDBs), which collect variations for one or a few gene loci; Central Mutation & SNP Databases, such as Online Mendelian Inheritance in Man (OMIM) and the Human Gene Mutation Database (HGMD); and National & Ethnic Variation Databases, such as the Arab Genetic Disease Database (AGDDB), which collect variations from a specific nation or population. The Chinese Gene Mutation Database (CGMD, http://cgmd.nhri.org.tw) was established by the research team of Professor Kuang-Dong Wuu and Professor Kwang-Jen Hsiao at the Institute of Genetics, National Yang-Ming University. It was compiled to provide information on inherited gene mutations in the Chinese population. The aim of this thesis is to establish a Chinese Gene Variation Database (CGVdb) on the basis of CGMD.
The establishment of CGVdb followed the recommendations of the Human Genome Variation Society (HGVS, http://www.hgvs.org), including the Mutation Nomenclature Recommendations and the Database Content Recommendations for minimum (core) content, essential fields, and suggested fields. The source data of CGVdb are Chinese Gene Variation Reports (CVRs), defined as reports of “inherited” mutations and variations related to a phenotype studied in the Chinese population; somatic variations are not included. The reports were collected from the publicly available MEDLINE database through PubMed (http://www.ncbi.nlm.nih.gov/PubMed/), established by the National Library of Medicine (http://www.nlm.nih.gov). By combining PubMed search field descriptions with MeSH (Medical Subject Headings), we customized a search string for semi-automated electronic data collection. We also established a “Text Analysis Standard Operation Procedure” (Text Analysis SOP) to create the content of CGVdb from the abstracts of CVRs. To evaluate the effectiveness of our electronic data collection method and the Text Analysis SOP, we established a “standard data set” containing true CVRs.
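As a rough illustration of how such a search string might be composed from PubMed field tags and MeSH headings, the sketch below builds a query from lists of terms. The function name `build_query` and the example terms are hypothetical and are not the search string used in the thesis; only the field tags `[MeSH Terms]`, `[Title/Abstract]`, and `[dp]` (publication date) are standard PubMed syntax.

```python
def build_query(mesh_terms, text_terms, start_year, end_year):
    """Compose a PubMed-style query string (illustrative only).

    mesh_terms -- MeSH headings, combined with OR
    text_terms -- free-text words searched in title/abstract, combined with OR
    start_year, end_year -- inclusive publication-year range
    """
    mesh = " OR ".join(f'"{term}"[MeSH Terms]' for term in mesh_terms)
    text = " OR ".join(f'"{term}"[Title/Abstract]' for term in text_terms)
    dates = f"{start_year}:{end_year}[dp]"
    return f"({mesh}) AND ({text}) AND ({dates})"


# Hypothetical example terms; the thesis does not list its actual terms here.
query = build_query(["Mutation"], ["Chinese", "Taiwan"], 1997, 1998)
```

The resulting string could be pasted into the PubMed search box; retrieving and screening the hits is the semi-automated part that the Text Analysis SOP then takes over.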
By manually reviewing 6,942 papers published in 1997 and 1998 in 18 selected journals (9 published outside the Chinese region and 9 published within it), we found 115 CVRs. Our electronic data collection method, by comparison, detected 266 papers from MEDLINE. Measured against the manual review, it produced 3 False Negatives (FN, rate: 0.026) and 154 False Positives (FP, rate: 0.579). Using the optimized search string to search all papers in MEDLINE for 1997 and 1998, we found 782 reports, of which 251 were CVRs and 531 were False Positives (FP rate: 0.679). This result indicates that the total number of CVRs in the MEDLINE database from 1960 to 2001 can be estimated at around 1,308, based on the 4,076 reports detected by our electronic method on Aug. 28, 2002. Applying the Text Analysis SOP to the 251 CVRs produced 502 Variation Records (V.R.). Of these, 43.2% (217/502) required full-text processing to obtain the minimal field information for CGVdb, while 56.8% (285/502) passed the Text Analysis SOP and could be put online in CGVdb. Combining the V.R. that describe the same variation reduced the 285 V.R. to 222 CGV records. These records can be searched by gene name, gene symbol, phenotype, or CGV ID, and browsed by gene or phenotype, via the CGVdb web site: http://www.CGVdb.org.tw. We also established a value-added table, GeneLink, that provides direct links to bioinformatics databases such as LocusLink, OMIM, HGMD, RefSeq, GDB, Swiss-Prot, GeneLynx, and Locus-Specific Databases.
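The rates and the extrapolated total above can be reproduced from the reported counts. The short sketch below assumes, as the figures suggest, that the FN rate is taken over the manually confirmed CVRs and the FP rate over the electronically detected papers; the variable names are illustrative.

```python
# Counts reported in the abstract.
MANUAL_CVRS = 115                  # true CVRs found by manual review
DETECTED_IN_JOURNALS = 266         # papers detected electronically
FALSE_NEGATIVES = 3
FALSE_POSITIVES = 154

MEDLINE_DETECTED_1997_1998 = 782   # optimized search, all of MEDLINE
MEDLINE_FP_1997_1998 = 531
MEDLINE_DETECTED_1960_2001 = 4076  # detected on Aug. 28, 2002


def ratio(part: int, whole: int) -> float:
    """Return part/whole rounded to three decimals, as in the abstract."""
    return round(part / whole, 3)


# Assumed denominators: true CVRs for FN, detected papers for FP.
fn_rate = ratio(FALSE_NEGATIVES, MANUAL_CVRS)
fp_rate = ratio(FALSE_POSITIVES, DETECTED_IN_JOURNALS)
medline_fp_rate = ratio(MEDLINE_FP_1997_1998, MEDLINE_DETECTED_1997_1998)

# Extrapolate: detected reports times the fraction that are true CVRs.
estimated_total_cvrs = round(MEDLINE_DETECTED_1960_2001 * (1 - medline_fp_rate))
```

The extrapolation assumes that the 1997 to 1998 FP rate holds across the whole 1960 to 2001 period, which is the assumption implicit in the estimate of about 1,308 CVRs.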
|