Topic-Oriented Collaborative Web Crawling
dc.contributor.author | Chung, Chiasen | en |
dc.date.accessioned | 2006-08-22T14:26:53Z | |
dc.date.available | 2006-08-22T14:26:53Z | |
dc.date.issued | 2001 | en |
dc.date.submitted | 2001 | en |
dc.description.abstract | A <i>web crawler</i> is a program that "walks" the Web to gather web resources. In order to scale to the ever-increasing Web, multiple crawling agents may be deployed in a distributed fashion to retrieve web data co-operatively. A common approach is to divide the Web into many partitions, with an agent assigned to crawl within each one. If an agent obtains a web resource that is not from its partition, the resource is transferred to its rightful owner. This thesis proposes a novel approach to distributed web data gathering that partitions the Web by topic. The proposed approach employs multiple focused crawlers to retrieve pages on various topics. When a crawler retrieves a page belonging to another topic, it transfers the page to the appropriate crawler. This approach is known as <i>topic-oriented collaborative web crawling</i>. An implementation of the system was built and experimentally evaluated. In order to identify the topic of a web page, a topic classifier was incorporated into the crawling system. As the classifier categorizes only English pages, a language identifier was also introduced to distinguish English pages from non-English ones. From the experimental results, we found that redundant retrieval was low and that a resource retrieved by an agent is six times more likely to be retained than in a system that uses a conventional hashing approach. These numbers were viewed as strong indications that <i>topic-oriented collaborative web crawling</i> is a viable approach to web data gathering. | en |
dc.format | application/pdf | en |
dc.format.extent | 733407 bytes | |
dc.format.mimetype | application/pdf | |
dc.identifier.uri | http://hdl.handle.net/10012/1040 | |
dc.language.iso | en | en |
dc.pending | false | en |
dc.publisher | University of Waterloo | en |
dc.rights | Copyright: 2001, Chung, Chiasen. All rights reserved. | en |
dc.subject | Computer Science | en |
dc.subject | Web Crawling | en |
dc.subject | Distributed System | en |
dc.subject | Text Categorization | en |
dc.title | Topic-Oriented Collaborative Web Crawling | en |
dc.type | Master Thesis | en |
uws-etd.degree | Master of Mathematics | en |
uws-etd.degree.department | School of Computer Science | en |
uws.peerReviewStatus | Unreviewed | en |
uws.scholarLevel | Graduate | en |
uws.typeOfResource | Text | en |
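
The abstract contrasts two ways of assigning a retrieved page to a crawling agent: conventional hash-based partitioning and partitioning by topic. The following is a minimal Python sketch of that contrast, not the thesis implementation. The keyword table, the function names (`hash_owner`, `classify_topic`, `route_page`), and the choice to hash the host name are all assumptions made for illustration; the thesis uses a real topic classifier and a language identifier, both of which are omitted here.

```python
import hashlib
from urllib.parse import urlparse

# Hypothetical keyword table standing in for the thesis's topic classifier.
TOPIC_KEYWORDS = {
    "sports": {"football", "league", "match"},
    "science": {"physics", "experiment", "theory"},
}


def hash_owner(url: str, n_agents: int) -> int:
    """Conventional partitioning (assumed variant): hash the host name to pick an agent."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_agents


def classify_topic(text: str) -> str:
    """Toy classifier: choose the topic with the most keyword hits, else 'other'."""
    words = set(text.lower().split())
    best, best_hits = "other", 0
    for topic, keywords in TOPIC_KEYWORDS.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best, best_hits = topic, hits
    return best


def route_page(crawler_topic: str, text: str) -> tuple[str, str]:
    """Return (action, owner): keep the page, or transfer it to its topic's crawler."""
    topic = classify_topic(text)
    action = "keep" if topic == crawler_topic else "transfer"
    return action, topic
```

Under these assumptions, a crawler assigned to `"sports"` keeps a page whose text matches its own topic and hands any other page to the crawler that owns the classified topic, whereas `hash_owner` assigns pages purely by host name regardless of content.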