Scrape Stack Overflow Jobs Using Scrapy | Python Tutorial

1. What the Hell is Scraping, Anyway?
2. Install Scrapy
3. Write Your First Script
4. Put It To Work!
For this tutorial, we’re going to write a Web Spider to scrape Stack Overflow Jobs.

1. What the Hell is Scraping, Anyway?

Web scraping (web harvesting or web data extraction) is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser.

While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler.[1]

What’s a “Web Spider”?

A Web crawler, sometimes called a spider, is an Internet bot which systematically browses the World Wide Web, typically for the purpose of Web indexing (web spidering).[2]

2. Install Scrapy

We’ll be using Scrapy to write our first website scraping script. There is extensive documentation for Scrapy here.

Before we begin, ensure that you set up a Python virtual environment for this project.

There is also a detailed Installation Guide on the Scrapy docs website.
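If you haven't created a virtual environment yet, Python 3's built-in venv module will do the job. The directory and environment names below are just examples, chosen to match the commands that follow:

$ mkdir scrapy-project
$ python3 -m venv scrapy-project/scrapy-project-venv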

Once you’ve done that, activate your virtual environment and install Scrapy:

$ cd scrapy-project
$ source scrapy-project-venv/bin/activate
(scrapy-project-venv) $ pip install scrapy

From there, we can create our Scrapy project. Note that Scrapy requires the project name to be a valid Python identifier, so use an underscore rather than a hyphen:

$ scrapy startproject scrapy_project
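Scrapy will generate a small project skeleton for you. The exact contents vary a little between versions, but it should look roughly like this:

scrapy_project/
    scrapy.cfg            # deploy configuration file
    scrapy_project/       # the project's Python module
        __init__.py
        items.py          # item definitions
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # where your spiders live
            __init__.py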

3. Write Your First Script

If you’re new to Python or Scrapy (like me), I’d recommend taking a look at the Scrapy Tutorial in the documentation.

Create a new file in your project directory, under [PROJECT_NAME]/spiders/ named stackoverflow_spider.py.

The first part of our file will be declaring our class:

import scrapy

class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = [
        'http://stackoverflow.com/jobs?l=Remote&d=20&u=Miles&sort=p'
    ]

  1. First we have import scrapy.
  2. Give your class a relevant name, i.e. StackOverflowSpider.
  3. The name class variable is the name you will use to tell Scrapy which crawler to run.
  4. Scrapy will automatically run the crawler on each URL in the start_urls array.

The URL I’ve used here is searching only for remote jobs, and sorting them by date.

You can also add queries to search for specific types of jobs.
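For instance, you could add a keyword filter to the URL. The q parameter below is an assumption on my part; run a search in your browser and copy the resulting URL to confirm the exact parameter names:

start_urls = [
    # hypothetical example: also filter for "python" jobs
    'http://stackoverflow.com/jobs?q=python&l=Remote&d=20&u=Miles&sort=p'
]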

The next step is to define the parse method, which Scrapy calls to handle the response from your URL(s). It belongs inside the StackOverflowSpider class:

def parse(self, response):
    for result in response.css('.jobs div.-job'):
        yield {
            'company': result.css('.-job-info .employer::text').extract_first(),
            'title': result.css('.-job-info .-title .job-link::text').extract_first(),
            'date': result.css('.-job-info .posted::text').extract_first(),
            'url': result.css('.-job-info .-title .job-link::attr(href)').extract_first(),
        }

  • The response variable will hold all of the HTML from the URL(s) called.
  • Here we have a for loop to iterate through each job listing, selected by its CSS selector: .jobs div.-job
  • We use ::text to only extract the text within the tags; otherwise it will grab all of the HTML within the element.
  • We’re using extract_first() instead of extract(), so that each field comes back as a single string rather than a list.

To learn more about CSS and XPATH selectors in Scrapy, read this section of the docs.
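If you want to experiment with selectors before putting them in your spider, the Scrapy shell is handy: it fetches a page and drops you into an interactive session with the response object ready to query:

$ scrapy shell 'http://stackoverflow.com/jobs?l=Remote&d=20&u=Miles&sort=p'
>>> response.css('.jobs div.-job')
>>> response.css('.jobs div.-job .-job-info .employer::text').extract_first()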

For this spider, we don’t actually need to visit each listing, but you can learn how here.

4. Put It To Work!

So let’s run our crawler and see what we get:

$ scrapy crawl stackoverflow -o stackoverflow.json

The -o flag tells Scrapy to output the results to a file, stackoverflow.json.
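Scrapy picks the export format based on the file extension, so you could just as easily write CSV or JSON Lines instead:

$ scrapy crawl stackoverflow -o stackoverflow.csv
$ scrapy crawl stackoverflow -o stackoverflow.jl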


Your file should look something like this:

[
{
  "date": "\r\n                        yesterday\r\n                    ",
  "url": "http://stackoverflow.com/jobs/125185/software-engineer-cloud-technology-partners",
  "company": "\r\n                        Cloud Technology Partners\r\n                    ",
  "title": "Software Engineer"
},
{
  "date": "\r\n                        3 hours ago\r\n                    ",
  "url": "http://stackoverflow.com/jobs/132800/dev-ops-engineer-confluence",
  "company": "\r\n                        Confluence\r\n                    ",
  "title": "Dev Ops Engineer"
},
{
  "date": "\r\n                        5 hours ago\r\n                    ",
  "url": "http://stackoverflow.com/jobs/119653/lead-javascript-developer-planity",
  "company": "\r\n                        Planity\r\n                    ",
  "title": "Lead Javascript Developer"
},

...

]

Looks good! However, we haven’t told our web spider to follow any links yet.

Right now, it’s only pulling the first page of results.

Let’s do something about that…

Place the following code within your parse function, just below the for loop:

# follow pagination links
next_page = response.css('.pagination .test-pagination-next::attr(href)').extract_first()
if next_page is not None:
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, callback=self.parse)

This code is taken from the Scrapy Tutorial, modified for the Stack Overflow site.

If all goes well, this code will tell the crawler to follow the “Next” link at the bottom of each page, and continue yielding results until there are no more result pages.
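For reference, here is the complete parse method with the pagination code in place (indented inside the StackOverflowSpider class):

def parse(self, response):
    # extract the fields we want from each job listing on the current page
    for result in response.css('.jobs div.-job'):
        yield {
            'company': result.css('.-job-info .employer::text').extract_first(),
            'title': result.css('.-job-info .-title .job-link::text').extract_first(),
            'date': result.css('.-job-info .posted::text').extract_first(),
            'url': result.css('.-job-info .-title .job-link::attr(href)').extract_first(),
        }

    # follow the "Next" pagination link, if there is one
    next_page = response.css('.pagination .test-pagination-next::attr(href)').extract_first()
    if next_page is not None:
        next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, callback=self.parse)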

Let’s run the crawler again. Note: if you output to the same file, Scrapy will append the new results to the end of it, so either delete the JSON file first or output to a different file name.

$ rm stackoverflow.json
$ scrapy crawl stackoverflow -o stackoverflow.json
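If you’re on a more recent Scrapy release (2.0 or later), the capital -O flag overwrites the output file instead of appending, which saves the rm step:

$ scrapy crawl stackoverflow -O stackoverflow.json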

You should have many more results now.

Now it gets really fun! Let’s do something with our data!

I’ve created an Angular app to display the results from multiple sites on one page, and added a search function, too.


View on GitHub Pages

You can also view the code in my GitHub Repository.


About the Author

Kimberly is a software engineer, and currently works as a Test Engineering Intern for Mozilla.

While she always enjoys learning new technologies, her current focus is Python, and when she has free time (she doesn't), Angular & Node.

When not coding, Kimberly spends time with her three young boys in Durango, CO.