Scrape the Web: Strategies for programming websites that don't expect it

Presenter: Asheesh Laroia (scrape-pycon@asheesh.org, +1-585-506-8865)

Intended audience: Intermediate (or better) Python programmers, probably without extensive web testing experience

Tutorial format: Interactive lecture with lots of examples and, hopefully, lively Q&A, interspersed with longer lab and Q&A sessions. If attendees email me with requests, we can take a look at specific websites.

Recording: I give permission to record and publish my tutorial for free distribution.

Prerequisites: None.

Requirements: Attendees are welcome to bring their laptops with Python installed (version 2.5 or higher, preferably 2.6). You will want BeautifulSoup and mechanize installed. Having Firefox and Firebug installed is a bonus.

Suggestions: Attendees are encouraged to email me before the talk with suggestions of websites they want to see scraped.

Promotional Summary

Do you find yourself faced with websites that have data you need to extract? Would your life be simpler if you could programmatically input data into web applications, even those tuned to resist interaction by bots? We'll discuss the basics of web scraping, and then dive into the details of different methods and where they are most applicable. You'll leave with an understanding of when to apply different tools, and learn about a "heavy hammer" for screen scraping that I picked up on a project for the Electronic Frontier Foundation. Attendees should bring a laptop, if possible, to try the examples we discuss and optionally take notes.

Outline for Review

(Note: As I practice the talk in the coming months, I may update the times or make small content modifications.)

1. Introduction (5 minutes)
   - You will learn neat tricks
   - DO NOT BECOME AN EVIL COMMENT SPAMMER
   - Theory, then practice, then a full overview of a few screen-scraping apps
   - Brittle? Sometimes.
2. Theory of HTML data extraction (5 minutes)
   - HTML
     - vs. XHTML
   - "What is a grammar?"
   - "What does it mean to parse?"
   - Web HTML is not always parseable
3. Reading data from HTML in practice, starting with pages on disk (12 minutes)
   - Treating web pages like text
     - String properties (split(), string in page)
     - Regular expressions
   - Taking advantage of structure
     - XPath: Only works with valid documents!
     - BeautifulSoup: Works with all sorts of trash.
4. Can you tell me how to GET? (10 minutes)
   - Introduce urllib.urlopen for getting pages
     - Basic usage
     - Show the HTTP headers it generates
     - Contrast this to Firefox with LiveHTTPHeaders
   - The masquerade begins: Setting user-agent
5. A little more about HTTP (5 minutes)
   - Response codes
6. Cookies are delicious delicacies (5 minutes)
   - HTTP is stateless (often idempotent, in theory)
   - Sessions work due to the client handing back extra information
   - This information is called "cookies"
7. Talking back with forms (with urllib2) (5 minutes)
   - GET forms: query string parameters
     - Demonstrations: Google, Yahoo
   - POST forms: URL encoded
8. Automating the nitty-gritty with "mechanize" (5 minutes)
   - mechanize handles cookies
   - mechanize handles forms!
9. Recap: Basic screen-scraping (5 minutes)
   - Parsing or scraping web pages
   - Dealing with cookies manually
   - Submitting forms by hand with urllib2
10. Example: Stern Grove music downloader (5 minutes)
   - Simple application with no UI
11. Example: Mydomain.com - change my DNS settings on the fly (4 minutes)
   - Simple application with forms but no UI
12. Example: Cepstral - Text to speech (4 minutes)
   - Simple application with forms and a UI
13. Example: Emusic - Download music everywhere, and "save for later" (6 minutes)
   - Show more complex backend class
   - Discuss user interface
   - Demonstrate new "Save for later"-based workflow
14. XPath: Queries for XML (8 minutes)
   - Some web services actually give you back XML!
   - XPath demonstration of basic queries
   - Demonstration of XPath finding with Firebug
15. Example: Weather notification bot (5 minutes)
   - Simple bot that uses XPath
16. "Play nice" on the web (5 minutes)
   - Ignore Terms of Service at your own peril
   - robots.txt
   - DO NOT BECOME AN EVIL COMMENT SPAMMER
17. Bot-evasion techniques used by websites (8 minutes)
   - Requiring cookies
   - JavaScript
   - CAPTCHAs
   - Requiring CAPTCHAs randomly
   - Demonstrate getting blocked by Google (not on the conference IP address, I promise)
18. In-depth BeautifulSoup workshopping (15 minutes)
19. A novel issue: The Patent Office's AJAX (12 minutes)
   - A CAPTCHA: Solve it by hand and save a cookie
   - ...now re-implement the entire website
20. JavaScript is frustrating (5 minutes)
   - Python is not JS
   - onclick=broken();
21. A novel solution: Selenium RC (10 minutes)
   - Demonstrate Selenium RC
   - Demonstrate Selenium Recorder
22. Example: Patent office PDF downloader (10 minutes)
   - Demonstrate full PDF downloading
23. Scaling and stability (5 minutes)
   - "The user interface IS the API"
   - Choosing reliable queries from web pages
   - Expanding to more IP addresses when necessary using SSH and Python 2.6 multiprocessing
24. Summary (5 minutes)
   - If it's on a web page, you can scrape it out.
   - "Now you have an API for everything."
25. Bonus time (? minutes)
   If we have time:
   - Greasemonkey demo: scraping in the browser
   - Audience-suggested scraping lab

Outline for Website

- Introduction: Be nice to the web, but get the better of it
- Structure of HTML and XHTML
- Extracting information with regular expressions, parsers, and XPath
- HTTP: Setting User-Agent, dealing with cookies, and handling errors
- Filling out forms with urllib2 and mechanize
- Discuss example applications: Text-to-speech, music store scraping, and more
- In-depth BeautifulSoup query discussion
- Expanding to more computers with Python 2.6 multiprocessing and keeping your scraping stable over time
- Scraping dynamic "AJAXy" websites by truly automating a web browser
- ☃
- Q&A

Presenter Bio

By day, Asheesh Laroia is a software engineer at Creative Commons, where he uses Python extensively. He began scraping web sites in 2001 and honed his skills in 2008 when confronted with a US Patent Office website the Electronic Frontier Foundation needed information from. At other times, he juggles, maintains software in Debian, and leads the Students for Free Culture web team. Asheesh lives in San Francisco with his stuffed dog Herbert.

Presenter's Previous Experience

- For new programmers, I co-teach an Electronic Frontier Foundation Python class and personally teach a San Francisco Linux Users Group class.
- I led a Python web scraping class in 2004 at Johns Hopkins (JHU).
- I regularly mentor other budding Python programmers at work and at hacker spaces (http://noisebridge.net, http://www.acm.jhu.edu).
- As for general public speaking, I have spoken about Creative Commons and its technology at OSCON 2008, at the Creative Commons Technology Summit in July 2007, at Red Emma's Bookstore in Baltimore, and on CC and myriad other topics to the JHU Association for Computing Machinery.