Mining a shared concept space for domain adaptation in text mining.
Format: Others
Language: English, Chinese
Published: 2011
Subjects: Data mining--Mathematical models
Online Access: http://library.cuhk.edu.hk/record=b6075126 ; http://repository.lib.cuhk.edu.hk/en/item/cuhk-344759
id: ndltd-cuhk.edu.hk-oai-cuhk-dr-cuhk_344759
record_format: oai_dc
collection: NDLTD
language: English, Chinese
format: Others
sources: NDLTD
topic: Data mining--Mathematical models
description:
In many text mining applications involving a high-dimensional feature space, it is difficult to collect sufficient training data for different domains. One strategy for tackling this problem is to intelligently adapt a model trained on one domain with labeled data to another domain with only unlabeled data. This strategy is known as domain adaptation. However, existing domain adaptation approaches have two major limitations. The first is that they all split the domain adaptation framework into two separate steps: the first step minimizes the domain gap, and the second trains the predictive model on the reweighted instances or the transformed feature representation. Such a transformed representation, however, may discard information that matters for predictive performance. The second limitation is that they are restricted to first-order statistics in a Reproducing Kernel Hilbert Space (RKHS) when measuring the distribution difference between the source domain and the target domain. In this thesis, we focus on developing solutions for these two limitations, which hinder the progress of domain adaptation techniques.

We develop a novel model that learns a low-rank shared concept space with respect to two criteria simultaneously: the empirical loss in the source domain, and the embedded distribution gap between the source domain and the target domain. In addition, we can transfer predictive power from the extracted common features to the characteristic features in the target domain via the feature graph Laplacian. Moreover, we can kernelize the proposed method in an RKHS so as to generalize the model by making use of powerful kernel functions. We theoretically analyze the expected error, evaluated by common convex loss functions in the target domain under the empirical risk minimization framework, showing that the error bound can be controlled by the expected loss in the source domain and the embedded distribution gap.

We then propose an improved symmetric Stein's loss (SSL) function, which combines the mean and covariance discrepancies into a unified Bregman matrix divergence, of which the Jensen-Shannon divergence between normal distributions is a particular case. Based on this second-order distribution gap measure, we present another new domain adaptation method, called Location and Scatter Matching. The goal is to find a feature representation that reduces the embedded distribution gap, measured by SSL, between the source domain and the target domain, while ensuring that the derived representation encodes sufficient discriminative information with respect to the label information. A standard machine learning algorithm, such as the Support Vector Machine (SVM), can then be adapted to train classifiers in the new feature subspace across domains.

We conduct a series of experiments on real-world datasets to compare the performance of our proposed approaches with other competitive methods. The results show significant improvement over existing domain adaptation approaches.

Chen, Bo.
Adviser: Wai Lam.
Source: Dissertation Abstracts International, Volume: 73-04, Section: B.
Thesis (Ph.D.)--Chinese University of Hong Kong, 2011.
Includes bibliographical references (leaves 87-95).
Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012]. System requirements: Adobe Acrobat Reader. Available via World Wide Web.
Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [201-]. System requirements: Adobe Acrobat Reader. Available via World Wide Web.
Abstract also in Chinese.
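The second-order distribution gap described above can be illustrated with a small sketch. This is not the thesis's exact SSL formulation, only an assumed stand-in with the same ingredients: the symmetric KL divergence between Gaussians fitted to the source and target features combines a covariance (scatter) term of the Stein/LogDet type with a mean (location) term, and the log-determinant parts of the two KL directions cancel. The function name `gaussian_sym_kl` and the regularizer `eps` are hypothetical choices for this sketch.

```python
import numpy as np

def gaussian_sym_kl(Xs, Xt, eps=1e-6):
    """Symmetric KL divergence between Gaussians fitted to source (Xs)
    and target (Xt) feature matrices (rows = instances).

    Combines a mean (location) term and a covariance (scatter) term;
    the scatter part, tr(St^-1 Ss) + tr(Ss^-1 St) - 2d, is a
    Stein-type trace divergence between the two covariance matrices.
    """
    d = Xs.shape[1]
    ms, mt = Xs.mean(axis=0), Xt.mean(axis=0)
    # Regularize the covariances so they are safely invertible.
    Ss = np.cov(Xs, rowvar=False) + eps * np.eye(d)
    St = np.cov(Xt, rowvar=False) + eps * np.eye(d)
    Ss_inv, St_inv = np.linalg.inv(Ss), np.linalg.inv(St)
    scatter = np.trace(St_inv @ Ss) + np.trace(Ss_inv @ St) - 2 * d
    dm = mt - ms
    location = dm @ (Ss_inv + St_inv) @ dm
    return 0.5 * (scatter + location)

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(500, 5))
tgt = rng.normal(0.5, 1.5, size=(500, 5))  # shifted and rescaled "target domain"
print(gaussian_sym_kl(src, src))  # identical samples: gap is essentially 0
print(gaussian_sym_kl(src, tgt))  # positive; grows with the domain gap
```

A first-order measure such as mean matching would see only the `location` term; the `scatter` term is what a second-order measure like SSL adds, which is why it can detect a change of scale between domains even when the means coincide.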
author: Chen, Bo
title: Mining a shared concept space for domain adaptation in text mining.
title_sort: mining a shared concept space for domain adaptation in text mining
publishDate: 2011
url: http://library.cuhk.edu.hk/record=b6075126 ; http://repository.lib.cuhk.edu.hk/en/item/cuhk-344759
_version_: 1718977984823558144
spelling:
ndltd-cuhk.edu.hk-oai-cuhk-dr-cuhk_344759 (2019-02-19T03:42:35Z)
Collection: CUHK electronic theses & dissertations collection
Institution: Chinese University of Hong Kong Graduate School. Division of Systems Engineering and Engineering Management.
Physical description: 1 online resource (xiv, 95 leaves : ill.); Text; theses; electronic resource; microform; microfiche
Identifiers: cuhk:344759; isbn: 9781267099570
Languages: eng, chi
Use of this resource is governed by the terms and conditions of the Creative Commons "Attribution-NonCommercial-NoDerivatives 4.0 International" License (http://creativecommons.org/licenses/by-nc-nd/4.0/)
http://repository.lib.cuhk.edu.hk/en/islandora/object/cuhk%3A344759/datastream/TN/view/Mining%20a%20shared%20concept%20space%20for%20domain%20adaptation%20in%20text%20mining.jpg
http://repository.lib.cuhk.edu.hk/en/item/cuhk-344759