Construction and Implementation of a Dynamic Crawler Management Platform

Abstract

With the rapid development of the Internet, the amount of information on the Web keeps growing. People usually look for the information they want through search engines such as Baidu, Google, and Sogou. Engines of this type are called general search engines: they aim to serve the needs of all users. Because the information on the Internet is now so varied and hard to tell apart, the results a user gets back may differ greatly from what the user actually wanted. Solving this problem requires more specialized search engines oriented toward a particular field. The topic web crawler is a key part of such a vertical search engine. This paper studies the key technologies of the topic web crawler. The main research contents are as follows:

(1) Extracting the topic content is an important step in identifying the topic of a web page. This paper combines the distribution characteristics of web page content with the related features of topic content to design a method for extracting the topic content of a web page.

(2) A topic identification algorithm based on entity links is proposed to identify the topic of web pages. A knowledge-base-driven entity linking method is applied to feature extraction. Experiments show that this method improves the accuracy of topic web page identification.

(3) A platform construction based on the Best-First algorithm is proposed. Platform construction is the key to guiding a topic web crawler as it crawls web pages.

Keywords: topic web crawler; entity link; Best-First algorithm; platform construction

Contents

Chapter 1 Introduction
1.1 Background and significance
1.2 Research status of topic web crawlers in China and abroad
1.2.1 Topic identification algorithms and platform construction
1.2.2 Topic crawler systems
1.3 Research content of this paper
Chapter 2 Architecture of the topic web crawler
2.1 Component modules