Spidering hacks
Personal Author: Hemenway, Kevin.
Publication Information: Beijing ; Cambridge : O'Reilly, [2004]
Physical Description: xix, 402 pages ; 23 cm
Added Author: Calishain, Tara.

Call Number: QA76.9.D343 H46 2004
Material Type: Adult Non-Fiction
Home Location: Central Closed Stacks

The Internet, with its profusion of information, has made us hungry for ever more, ever better data. Out of necessity, many of us have become pretty adept with search engine queries, but there are times when even the most powerful search engines aren't enough. If you've ever wanted your data in a different form than it's presented, or wanted to collect data from several sites and see it side-by-side without the constraints of a browser, then Spidering Hacks is for you. Spidering Hacks takes you to the next level in Internet data retrieval--beyond search engines--by showing you how to create spiders and bots to retrieve information from your favorite sites and data sources. You'll no longer feel constrained by the way host sites think you want to see their data presented--you'll learn how to scrape and repurpose raw data so you can view it in a way that's meaningful to you.

Written for developers, researchers, technical assistants, librarians, and power users, Spidering Hacks provides expert tips on spidering and scraping methodologies. You'll begin with a crash course in spidering concepts, tools (Perl, LWP, out-of-the-box utilities), and ethics (how to know when you've gone too far: what's acceptable and unacceptable). Next, you'll collect media files and data from databases. Then you'll learn how to interpret and understand the data, repurpose it for use in other applications, and even build authorized interfaces to integrate the data into your own content. By the time you finish Spidering Hacks, you'll be able to:

Aggregate and associate data from disparate locations, then store and manipulate the data as you like
Gain a competitive edge in business by knowing when competitors' products are on sale, and comparing sales ranks and product placement on e-commerce sites
Integrate third-party data into your own applications or web sites
Make your own site easier to scrape and more usable to others
Keep up-to-date with your favorite comic strips, news stories, stock tips, and more without visiting the site every day

Like the other books in O'Reilly's popular Hacks series, Spidering Hacks brings you 100 industrial-strength tips and tools from the experts to help you master this technology. If you're interested in data retrieval of any type, this book provides a wealth of data for finding a wealth of data.

Author Notes

Kevin Hemenway, coauthor of Mac OS X Hacks, is better known as Morbus Iff, the creator of, which bills itself as "content for the discontented." Publisher and developer of more home cooking than you could ever imagine, he'd love to give you a Fry Pan of Intellect upside the head. Politely, of course. And with love.

Tara Calishain is the creator of the site, ResearchBuzz. She is an expert on Internet search engines and how they can be used effectively in business situations.

Table of Contents

Credits
Preface
Chapter 1. Walking Softly
1. A Crash Course in Spidering and Scraping
2. Best Practices for You and Your Spider
3. Anatomy of an HTML Page
4. Registering Your Spider
5. Preempting Discovery
6. Keeping Your Spider Out of Sticky Situations
7. Finding the Patterns of Identifiers
Chapter 2. Assembling a Toolbox
Perl Modules
Resources You May Find Helpful
8. Installing Perl Modules
9. Simply Fetching with LWP::Simple
10. More Involved Requests with LWP::UserAgent
11. Adding HTTP Headers to Your Request
12. Posting Form Data with LWP
13. Authentication, Cookies, and Proxies
14. Handling Relative and Absolute URLs
15. Secured Access and Browser Attributes
16. Respecting Your Scrapee's Bandwidth
17. Respecting robots.txt
18. Adding Progress Bars to Your Scripts
19. Scraping with HTML::TreeBuilder
20. Parsing with HTML::TokeParser
21. WWW::Mechanize 101
22. Scraping with WWW::Mechanize
23. In Praise of Regular Expressions
24. Painless RSS with Template::Extract
25. A Quick Introduction to XPath
26. Downloading with curl and wget
27. More Advanced wget Techniques
28. Using Pipes to Chain Commands
29. Running Multiple Utilities at Once
30. Utilizing the Web Scraping Proxy
31. Being Warned When Things Go Wrong
32. Being Adaptive to Site Redesigns
Chapter 3. Collecting Media Files
33. Detective Case Study: Newgrounds
34. Detective Case Study: iFilm
35. Downloading Movies from the Library of Congress
36. Downloading Images from Webshots
37. Downloading Comics with dailystrips
38. Archiving Your Favorite Webcams
39. News Wallpaper for Your Site
40. Saving Only POP3 Email Attachments
41. Downloading MP3s from a Playlist
42. Downloading from Usenet with nget
Chapter 4. Gleaning Data from Databases
43. Archiving Yahoo! Groups Messages with yahoo2mbox
44. Archiving Yahoo! Groups Messages with WWW::Yahoo::Groups
45. Gleaning Buzz from Yahoo!
46. Spidering the Yahoo! Catalog
47. Tracking Additions to Yahoo!
48. Scattersearch with Yahoo! and Google
49. Yahoo! Directory Mindshare in Google
50. Weblog-Free Google Results
51. Spidering, Google, and Multiple Domains
52. Scraping Product Reviews
53. Receive an Email Alert for Newly Added Reviews
54. Scraping Customer Advice
55. Publishing Associates Statistics
56. Sorting Recommendations by Rating
57. Related Products with Alexa
58. Scraping Alexa's Competitive Data with Java
59. Finding Album Information with FreeDB and Amazon.com
60. Expanding Your Musical Tastes
61. Saving Daily Horoscopes to Your iPod
62. Graphing Data with RRDTOOL
63. Stocking Up on Financial Quotes
64. Super Author Searching
65. Mapping O'Reilly Best Sellers to Library Popularity
66. Using All Consuming to Get Book Lists
67. Tracking Packages with FedEx
68. Checking Blogs for New Comments
69. Aggregating RSS and Posting Changes
70. Using the Link Cosmos of Technorati
71. Finding Related RSS Feeds
72. Automatically Finding Blogs of Interest
73. Scraping TV Listings
74. What's Your Visitor's Weather Like?
75. Trendspotting with Geotargeting
76. Getting the Best Travel Route by Train
77. Geographic Distance and Back Again
78. Super Word Lookup
79. Word Associations with Lexical Freenet
80. Reformatting Bugtraq Reports
81. Keeping Tabs on the Web via Email
82. Publish IE's Favorites to Your Web Site
83. Spidering Game Prices
84. Bargain Hunting with PHP
85. Aggregating Multiple Search Engine Results
86. Robot Karaoke
87. Searching the Better Business Bureau
88. Searching for Health Inspections
89. Filtering for the Naughties
Chapter 5. Maintaining Your Collections
90. Using cron to Automate Tasks
91. Scheduling Tasks Without cron
92. Mirroring Web Sites with wget and rsync
93. Accumulating Search Results Over Time
Chapter 6. Giving Back to the World
94. Using XML::RSS to Repurpose Data
95. Placing RSS Headlines on Your Site
96. Making Your Resources Scrapable with Regular Expressions
97. Making Your Resources Scrapable with a REST Interface
98. Making Your Resources Scrapable with XML-RPC
99. Creating an IM Interface
100. Going Beyond the Book
Index