About the Book

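If programming is magic, then web scraping is surely a form of wizardry. By writing a simple automated program, you can query web servers, request data, and parse it to extract the information you need. The expanded edition of this practical book not only introduces web scraping but also serves as a comprehensive guide to scraping almost every type of data from the modern web.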
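Part I of Web Scraping with Python (2nd edition, English-language reprint) focuses on the mechanics of web scraping: using Python to request information from a web server, performing basic handling of the server's response, and interacting with sites in an automated fashion. Part II explores a variety of more specific tools and applications to fit any web scraping scenario you are likely to encounter.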

About the Author

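Ryan Mitchell is a senior software engineer at HedgeServ in Boston, where she develops the company's APIs and data analysis tools. She is a graduate of Olin College of Engineering and holds a master's degree in software engineering and a certificate in data science from Harvard University Extension School. Before joining HedgeServ, she worked at Abine, developing web scraping and automation tools in Python. She regularly consults on web scraping projects in the retail, finance, and pharmaceutical industries, and has worked as a curriculum consultant and adjunct faculty member at Northeastern University and Olin College of Engineering.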

Table of Contents

Preface

Part I. Building Scrapers
1. Your First Web Scraper
Connecting
An Introduction to BeautifulSoup
Installing BeautifulSoup
Running BeautifulSoup
Connecting Reliably and Handling Exceptions
2. Advanced HTML Parsing
You Don't Always Need a Hammer
Another Serving of BeautifulSoup
find() and find_all() with BeautifulSoup
Other BeautifulSoup Objects
Navigating Trees
Regular Expressions
Regular Expressions and BeautifulSoup
Accessing Attributes
Lambda Expressions
3. Writing Web Crawlers
Traversing a Single Domain
Crawling an Entire Site
Collecting Data Across an Entire Site
Crawling Across the Internet
4. Web Crawling Models
Planning and Defining Objects
Dealing with Different Website Layouts
Structuring Crawlers
Crawling Sites Through Search
Crawling Sites Through Links
Crawling Multiple Page Types
Thinking About Web Crawler Models
5. Scrapy
Installing Scrapy
Initializing a New Spider
Writing a Simple Scraper
Spidering with Rules
Creating Items
Outputting Items
The Item Pipeline
Logging with Scrapy
More Resources
6. Storing Data
Media Files
Storing Data to CSV
MySQL
Installing MySQL
Some Basic Commands
Integrating with Python
Database Techniques and Good Practice
"Six Degrees" in MySQL
Email

Part II. Advanced Scraping
7. Reading Documents
Document Encoding
Text
Text Encoding and the Global Internet
CSV
Reading CSV Files
PDF
Microsoft Word and .docx
8. Cleaning Your Dirty Data
Cleaning in Code
Data Normalization
Cleaning After the Fact
OpenRefine
9. Reading and Writing Natural Languages
Summarizing Data
Markov Models
Six Degrees of Wikipedia: Conclusion
Natural Language Toolkit
Installation and Setup
Statistical Analysis with NLTK
Lexicographical Analysis with NLTK
Additional Resources
10. Crawling Through Forms and Logins
Python Requests Library
Submitting a Basic Form
Radio Buttons, Checkboxes, and Other Inputs
Submitting Files and Images
Handling Logins and Cookies
HTTP Basic Access Authentication
Other Form Problems
11. Scraping JavaScript
A Brief Introduction to JavaScript
Common JavaScript Libraries
Ajax and Dynamic HTML
Executing JavaScript in Python with Selenium
Additional Selenium Webdrivers
Handling Redirects
A Final Note on JavaScript
12. Crawling Through APIs
A Brief Introduction to APIs
HTTP Methods and APIs
More About API Responses
Parsing JSON
Undocumented APIs
Finding Undocumented APIs
Documenting Undocumented APIs
Finding and Documenting APIs Automatically
Combining APIs with Other Data Sources
More About APIs
13. Image Processing and Text Recognition
Overview of Libraries
Pillow
Tesseract
NumPy
Processing Well-Formatted Text
Adjusting Images Automatically
Scraping Text from Images on Websites
Reading CAPTCHAs and Training Tesseract
Training Tesseract
Retrieving CAPTCHAs and Submitting Solutions
14. Avoiding Scraping Traps
A Note on Ethics
Looking Like a Human
Adjust Your Headers
Handling Cookies with JavaScript
Timing Is Everything
Common Form Security Features
Hidden Input Field Values
Avoiding Honeypots
The Human Checklist
15. Testing Your Website with Scrapers
An Introduction to Testing
What Are Unit Tests?
Python unittest
Testing Wikipedia
Testing with Selenium
Interacting with the Site
unittest or Selenium?
16. Web Crawling in Parallel
Processes versus Threads
Multithreaded Crawling
Race Conditions and Queues
The threading Module
Multiprocess Crawling
Multiprocess Crawling
Communicating Between Processes
Multiprocess Crawling - Another Approach
17. Scraping Remotely
Why Use Remote Servers?
Avoiding IP Address Blocking
Portability and Extensibility
Tor
PySocks
Remote Hosting
Running from a Website-Hosting Account
Running from the Cloud
Additional Resources
18. The Legalities and Ethics of Web Scraping
Trademarks, Copyrights, Patents, Oh My!
Copyright Law
Trespass to Chattels
The Computer Fraud and Abuse Act
robots.txt and Terms of Service
Three Web Scrapers
eBay versus Bidder's Edge and Trespass to Chattels
United States v. Auernheimer and The Computer Fraud and Abuse Act
Field v. Google: Copyright and robots.txt
Moving Forward
Index
