turbin3
So you're trying to collect data from the internet, huh? As we learn, we usually start with manual methods, typically variations on copy/pasting our way to success. Eventually you start looking for tools to remove steps from the process and collect more data in less time. A few websites with tools here, a browser plugin there, and you scale up a small amount. Then you might pick up some more technical methods, such as using XPaths and different programs to scrape data from multiple pages. At some point, it just won't be enough and won't be fast enough. The impatience is maddening at times. We live in the age of big data and machine learning, where those things and more are advancing to the level of auto/semi-automation, such that manually working in Excel is like bringing a Smart car to the 24 Hours of Le Mans. For small niches, manual methods can still work just fine. For big niches/industries, and/or highly competitive ones, it's innovate or DIE. Eventually, you will want to learn a programming language and build your own customized, optimized crawlers to do more work in less time. One widely used language that is well suited to these efforts is Python.
Python is often dismissed as just a "scripting language", but it is in fact a general-purpose programming language. There are websites and web apps out there that run on it (Django being one popular web framework). You can do a great many things with Python. One of the great things about the language is that it is highly readable and its syntax is very logical, making it an easier language to program in than more verbose and complex low-level languages such as C. While Python code is normally run by an interpreter, there are also compilers and libraries available to translate it to C, Java, JS, and other languages. That can be really useful if you ever take on additional languages, as it can help make your code portable to many others. A lot can be said of Python, and not everyone is a fan of the language, but the main takeaway here is that it is an incredibly flexible language with significant community support and a wide array of open-source libraries and frameworks available. The reason this matters to YOU is that many of the problems you might be trying to solve have likely been solved before and already have code, frameworks, or libraries available to meet those needs. Why reinvent the wheel when someone may have already built an extremely high-performance wheel that you can install in 2 seconds for free?! GitHub is a veritable treasure trove of free, open-source Python projects people have released, which can often solve a lot of your problems. Instead of learning a completely new (to you) language, in its entirety, from the ground up, you can simply take the same copy/pasta concept to the next level.
Python and Scraping
Python can be used for a great many things, but one of its most common uses is building scrapers and bots to perform various functions. For example, with a framework or two such as Selenium and/or Mechanize, you could build a bot that mimics a browser and a real user. Imagine your scraper providing the site a real user agent, executing JavaScript, utilizing cookies... There are all sorts of ways to build some rather creative bots. The thing I want to highlight here is that you can build bots and scrapers directly in Python, without necessarily having to use many other frameworks or libraries. That being said, there's working hard, and there's working smart. I'm about to show you how you can work smart.
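To give you an idea of the "directly in Python" approach, here's a minimal sketch using nothing but the Python 2.7 standard library, sending a browser-like User-Agent and handling cookies. The URL and User-Agent string are just placeholders:
Code:
# Fetch a page with a custom User-Agent and cookie handling (Python 2.7 stdlib only)
import urllib2
import cookielib

cookie_jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
opener.addheaders = [('User-Agent',
                      'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36')]

response = opener.open('http://www.example.com/')
html = response.read()
print response.getcode(), len(html)  # status code and size of the page we pulled down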
The Scrapy Framework
So what's the point to all of this? Scrapy, for most of us, is a great way to start working smart. It's an open-source web-scraping framework for Python. It's quick and easy to install and get up and running. While some people like building custom bots directly in Python, without being restricted to one framework, I would say for MOST people's needs, Scrapy probably has you covered, and might be the most efficient way you can get up and scraping with Python.
Installing Python and Scrapy
In the vein of not reinventing the wheel, I'm going to hit the highlights here, linking to some other concise install tutorials I've come across.
First, download Python. I'd highly recommend Python 2.7, as that's what I'm basing this tutorial on, and it's what a significant percentage of open-source frameworks are written and optimized for. It'll probably still be a few years before much of the industry really transitions to Python 3+. Make sure you choose the appropriate version for a 32-bit or 64-bit system.
https://www.python.org/downloads/
Next, choose the right tutorial for the system you're using:
http://docs.python-guide.org/en/latest/starting/install/win/
http://docs.python-guide.org/en/latest/starting/install/osx/
**One golden nugget here: when installing Python on Windows from the installer (unless you're doing it through the command line, which will be a bit different), there will be an option to add Python to your PATH. Select it and you can skip the next section.
Adding Python to Your PATH
Ensure you have Python added to your PATH. Both tutorials above cover this. If Python hasn't been added to your PATH, you won't be able to run Python commands easily from the command line. When it is in the PATH, any time you have a command window or terminal open, you can run Python from anywhere, even if you're not in the Python installation directory.
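A quick way to confirm it's working: open a new command prompt or terminal from any directory and run the following. If you get a version number back instead of an error, you're set.
Code:
python --version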
Installing Setuptools and Pip
Ensure you have Setuptools and Pip installed. Pip is an installation tool for Python packages, and you're going to need it for Scrapy, as well as potentially some other frameworks and libraries, as you begin to grow your bots. The nice thing about Pip is that it lets you quickly install packages you might be missing. For example, when you attempt to run a Python program or Scrapy bot, you will often see errors and the run will fail. Sometimes these errors are because a component of the program requires a library you don't have installed. Just like with package managers on the Linux CLI (command line interface), you can easily install many of those packages right from your Windows CLI, terminal, etc. Here's a decent tutorial for verifying and/or installing:
http://dubroy.com/blog/so-you-want-to-install-a-python-package/
Run "pip install scrapy" from the command line to install the Scrapy framework. http://scrapy.org/
Creating Your First Scraper
First you need to determine where you're going to store all of your Scrapy projects, and make a directory for it. Go into that directory in your CLI (in Windows, open the Run dialog, type CMD, then cd into the directory). From there, run "scrapy startproject <yourprojectname>" to create your first Scrapy project (you don't need the greater/less than symbols). This will create the base files and folder structure for you to start from (see the sketch after the list below), which makes it quick and easy to get up and running. In your new directory, you'll have 3 main files where you'll get started:
- Items.py
- Pipelines.py
- Settings.py
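For example, assuming you named your project "myproject" (purely a placeholder), the command and the structure it generates would look roughly like this:
Code:
scrapy startproject myproject

myproject/
    scrapy.cfg            # deploy/config file
    myproject/            # your project's Python module
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/          # your spiders will live here
            __init__.py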
The Major Components of Scrapy
Now I'm probably going to butcher this description, and anyone feel free to correct me if I'm a bit off, as this is just my own working understanding. Basically, "Items.py" is where you define the structure of the data you want to collect: each Item is a class, and each piece of data you want (a URL, a title, a price) is a Field on that class. Think of it as the "what", but not necessarily the "how". There are, I believe, additional ways Items.py can be used, but that's what I usually use it for. Here's more info to learn about Items.py:
http://doc.scrapy.org/en/latest/topics/items.html
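As a rough illustration (the item and field names are hypothetical, not anything Scrapy generates for you), an Items.py might look like:
Code:
import scrapy

class PageItem(scrapy.Item):
    # Each Field is a slot your spider will fill in as it scrapes
    url = scrapy.Field()
    title = scrapy.Field()
    meta_description = scrapy.Field()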
Pipelines.py is basically "post-processing" for the things you've scraped with a spider. When the data is actually being scraped, you may run into issues such as duplicated data, unformatted or incorrectly formatted data, or some other condition that's less than desirable. A Scrapy Pipeline is basically a class where you define how to format, improve, and shape that data into something usable for your needs. For example, you can build a Pipeline to dedupe data. You can also build a Pipeline to output your data as JSON, CSV, or another file format you prefer. If you're scraping a lot of data, you might develop a Pipeline that post-processes that data for quality control and then exports it into a database. Feel free to let your mind run wild here, because the more data you scrape, the more you will need to "shape" that output into a usable and efficient format. If you get really in-depth and creative, you might use some Pipelines to do post-processing and file management of your data, while separately building an efficient and attractive front-end to query, manage, and derive insights from it. For example, Pipelines that refine and export to a database, with a local or web app using a D3 JavaScript front-end that turns those tables into insightful visualizations, letting you find the hidden value in that data in the blink of an eye. That certainly sounds a hell of a lot more productive than looking at hundreds of thousands of rows in Excel! You can sort of think of Scrapy Pipelines as the "why" behind what you're doing. Here's more on Pipelines:
http://doc.scrapy.org/en/latest/topics/item-pipeline.html
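Here's a minimal sketch of the dedupe idea, assuming items with a "url" field like the hypothetical PageItem above. You'd activate it through ITEM_PIPELINES in Settings.py:
Code:
from scrapy.exceptions import DropItem

class DedupePipeline(object):
    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        # Scrapy calls this for every item a spider yields
        if item['url'] in self.seen_urls:
            raise DropItem("Duplicate item: %s" % item['url'])
        self.seen_urls.add(item['url'])
        return item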
Settings.py, as the name denotes, is the primary configuration file for that particular Scrapy project. If you choose, you could create a separate project for every single bot. You might even create a separate project for just specific functions of a larger bot comprised of multiple projects, though for most people's uses that would be over-complicating things significantly. In many cases, for general scraper bots, you can probably combine most or all of them into one single project that continues to grow over time. Don't overthink things here, and try to focus on one single project to start with. Your ultimate goal, as you learn more Python (and any programming language for that matter), is to write highly reusable code as opposed to one-off projects that only ever get used for small edge-cases. For example, if you build a bot that scrapes the primary meta tags from a web page using the simplest type of XPath, that bot, or at least those specific item classes, will likely work on the vast majority of sites you might crawl. Highly reusable code pays for itself in the long run, because it keeps you from constantly reinventing the wheel. The easiest way to think of Settings.py is like the .htaccess file on an Apache server, or nginx.conf on NGINX. This is where you configure the major components and modules available to you in Scrapy, or the custom modules you decide to build into it. Here's more on Settings.py:
http://doc.scrapy.org/en/latest/topics/settings.html
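To make that concrete, a few typical Settings.py entries might look like this (the values are just illustrative, and the pipeline path assumes the DedupePipeline sketch above lives in a project called "myproject"):
Code:
BOT_NAME = 'myproject'

# Identify your bot to the sites you crawl
USER_AGENT = 'Mozilla/5.0 (compatible; MyBot/1.0)'

# Be polite: wait a second between requests
DOWNLOAD_DELAY = 1.0

# Activate pipelines; the number sets the order they run in
ITEM_PIPELINES = {
    'myproject.pipelines.DedupePipeline': 300,
}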
Another primary component of Scrapy is the "Middleware". There are both Downloader Middlewares and Spider Middlewares, and you can probably guess the difference by the names. Middlewares are hooks that sit between the Scrapy engine and the downloader (or the spider), letting you process requests and responses as they pass through. For example, you can create a custom Downloader Middleware that lets you customize some of the major "behaviors" your bot exhibits, as well as how it processes requests and the responses to those requests. Say you want to scrape a list of URLs, except that site blocks all requests that don't accept cookies. Downloader Middlewares will allow you to enable cookies for your bots, as well as define how the bots use those cookies. You might choose to let a bot keep a cookie for 5 URL requests, then dump it and take the next one, all in an effort to obfuscate certain traffic-modeling and blocking methods. Another example might be using the HTTP Cache Middleware to store a log of HTTP response headers. You might strategically crawl certain types of pages on a competitor's site that uses complex cookies, so you can later analyze the response header logs and get a feel for what they're doing and why. Here's more on Downloader Middlewares:
https://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
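As an example of those request/response hooks, here's a minimal sketch of a custom Downloader Middleware that rotates user agents (the class name and agent strings are hypothetical; you'd enable it via DOWNLOADER_MIDDLEWARES in Settings.py):
Code:
import random

class RotateUserAgentMiddleware(object):
    user_agents = [
        'Mozilla/5.0 (Windows NT 6.1; Win64; x64)',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10)',
    ]

    def process_request(self, request, spider):
        # Called for every outgoing request before it hits the downloader
        request.headers['User-Agent'] = random.choice(self.user_agents)
        return None  # returning None lets the request continue normally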
Spider Middlewares allow you to do things such as setting a depth limit, so your bot doesn't recursively crawl an insane number of pages on a site when you only care about a small percentage, or about certain subdirectories to a particular level. They also let you do cool things such as defining how the bot should handle certain types of HTTP responses. For example, say you have an extremely large site that's a bear to maintain, and you're always dealing with pages being 404'd all over the place. You might create a Scrapy bot, set it to run on a schedule with a cron job, and periodically crawl your site. You could then have it ignore every 200, 301, 302, etc. and simply output a CSV of every 404 it detects, so the problems come to you and you can create your redirects as necessary. Are you starting to see how this Python stuff can seriously improve your life?! Really start thinking about the things you need to do, come up with ways to do them repeatedly, then figure out a way you can do that with Python, then AUTOMATE it and enjoy life. Here's more on Spider Middlewares:
https://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
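Both of those examples are driven by spider-middleware-related settings. A sketch of what you might drop into Settings.py (values are illustrative):
Code:
# DepthMiddleware honors this: stop following links past 3 levels deep
DEPTH_LIMIT = 3

# HttpErrorMiddleware normally filters out non-200 responses before your
# spider sees them; whitelisting 404 lets you catch and log those pages
HTTPERROR_ALLOWED_CODES = [404]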
So to recap, you have Items, Pipelines, Middlewares, and Settings. Those are the bulk of what I'll call the "structural" components to define, activate, and/or deactivate to set up your bots' core behaviors. One thing we haven't talked about yet is the real core of the "how" part of What, Why, and How. While you may define some of the fundamentals in your settings, pipelines, and middlewares, the real meat and potatoes is what you will define in the bot you create in the "spiders" folder of your project directory. To create your first spider, open your CLI in the folder with your items, pipelines, and settings files, and type:
Code:
scrapy genspider <yourspidername> <website>
For example, to create a spider named "testspider" that's restricted to example.com:
Code:
scrapy genspider testspider example.com
That command generates a new file, testspider.py, inside your project's "spiders" folder, containing a basic template:
Code:
# -*- coding: utf-8 -*-
import scrapy


class TestspiderSpider(scrapy.Spider):
    name = "testspider"
    allowed_domains = ["example.com"]
    start_urls = (
        'http://www.example.com/',
    )

    def parse(self, response):
        pass
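Just to hint at where this is headed (actually building out bots is covered next), here's a rough sketch of what that template might grow into, reusing the hypothetical PageItem from the Items.py example above. The import path assumes a project named "myproject":
Code:
# -*- coding: utf-8 -*-
import scrapy

from myproject.items import PageItem


class TestspiderSpider(scrapy.Spider):
    name = "testspider"
    allowed_domains = ["example.com"]
    start_urls = (
        'http://www.example.com/',
    )

    def parse(self, response):
        # Pull the page's title and meta description with simple XPaths
        item = PageItem()
        item['url'] = response.url
        item['title'] = response.xpath('//title/text()').extract()
        item['meta_description'] = response.xpath(
            "//meta[@name='description']/@content").extract()
        yield item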
So now you have Python and Scrapy installed, as well as your first Scrapy bot template created. Next up will be Building Bots, where we'll learn how to actually create the functionality to scrape various things. After that will be Learning How To FAIL, which is one of the most useful and fundamental things you can learn when learning to program.