Summary: | Urban data is an essential resource for urban computing: it supports the establishment of urban knowledge, the collection of urban information, and the construction of smart cities. Because urban data collected through hardware sensors and crowdsourcing suffers from uneven information distribution, poor comprehensiveness, and high resource costs, we turn to Internet resources, which are updated in real time and offer extensive information coverage. We propose an approach to Sensing Urban text Data from Internet Resources (SUDIR) and address two key issues: urban data recognition in the Chinese context and urban data sensing from multi-source web resources. First, we design a Chinese urban data recognition model that combines a Bidirectional Encoder Representations from Transformers embedding model with Whole Word Masking (BERT-WWM) and a Bidirectional Long Short-Term Memory network with a Conditional Random Field (BLSTM-CRF) sequence labeling model. We introduce the concept of Chinese Word Segmentation (CWS) into the BERT embedding model so that the text embeddings better represent semantic information in Chinese text. The deep-learning-based BLSTM-CRF model is used to achieve high-quality encoding and prediction. Second, we propose a method for Extracting Urban text data based on Web page features and Clustering operation (EUWC). EUWC corrects the false negative samples labeled by the BERT-WWM+BLSTM-CRF recognition model and enables SUDIR to sense more accurate and comprehensive urban data from multi-source web resources. The experimental results show that our approach outperforms the baseline methods and demonstrate that SUDIR, by using Internet resources and deep learning, achieves low-cost, high-quality urban data sensing.
|