Thursday, February 6, 2014

The study of web information extraction technology based on VietSpider

Web information extraction is currently an active and difficult area of web data mining. This paper introduces VietSpider, a new open-source information collection tool, covering its system structure, core technology, and a usage case. The paper also compares VietSpider with another approach, the Heritrix+HtmlParser combination, and analyzes the advantages and disadvantages of both to help users and researchers choose between them. Finally, it gives a solution to the garbled-character problem that arises when collecting Chinese-language information.

Authors: Gao Tao (Beijing Inst. of Technol., Beijing, China); Wu Hongna