Scrapy is a great tool for scraping information from websites. Recently I was trying to pull data from EventBrite’s API with Scrapy. I say "trying" because instead of the JSON response I was expecting, I got back a full HTML webpage, which is not very helpful when you are trying to parse JSON.
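To make the failure concrete, here is a small sketch using only the standard library. The `html_body` string and the `try_parse` helper are made up for illustration; the string is a stand-in for the page that came back, not EventBrite's actual response.

```python
import json

# Hypothetical stand-in for the HTML page returned instead of JSON.
html_body = "<!DOCTYPE html><html><body>API browser</body></html>"


def try_parse(body):
    """Return the decoded JSON, or None if the body is not valid JSON."""
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        return None


print(try_parse(html_body))         # the HTML body cannot be decoded
print(try_parse('{"events": []}'))  # a JSON body decodes to a dict
```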
EventBrite’s API is a little unusual in that it comes with a very useful web interface for exploring queries as you build them. When the client is Scrapy, though, that interface is more of a hindrance than a help.
I suspected EventBrite was inspecting the request headers and returning a different view depending on whether the client asked for HTML or JSON. Scrapy, being a web scraper, asks for the HTML version of pages by default.
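This is only a guess at the mechanism, not EventBrite's actual code, but the general idea (content negotiation on the `Accept` header) can be sketched in a few lines. The `choose_view` function is purely hypothetical, and it simplifies real negotiation by ignoring q-values and taking the first listed media type it recognizes.

```python
def choose_view(accept_header):
    """Pick a response format from an Accept header (simplified: ignores q-values)."""
    for item in accept_header.split(","):
        media_type = item.split(";")[0].strip().lower()
        if media_type == "application/json":
            return "json"
        if media_type in ("text/html", "application/xhtml+xml"):
            return "html"
    # Nothing recognized: fall back to the browsable HTML interface.
    return "html"


# An HTML-first Accept header, similar to what a browser sends.
print(choose_view("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"))
# An Accept header that lists application/json first.
print(choose_view("application/json,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"))
```

Under this simplified model, a client that lists an HTML type first gets the web page, which would match the behavior described here.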
Setting the headers in Scrapy is straightforward:
scrapy_header.py
import scrapy
import json


class scrapyHeaderSpider(scrapy.Spider):
    name = "scrapy_header"

    # This is a built-in Scrapy function that runs first where we'll override the default headers
    # Documentation: https://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.start_requests
    def start_requests(self):
        url = "https://www.eventbriteapi.com/v3/organizers/[ORG_ID]/events/?token=[YOUR_TOKEN]"
        # Set the headers here. The important part is "application/json"
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.84 Safari/537.36',
            'Accept': 'application/json,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate, sdch',
            'Accept-Language': 'en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4',
        }
        yield scrapy.http.Request(url, headers=headers)

    def parse(self, response):
        parsedJson = json.loads(response.body)
Then run via:
Terminal
scrapy runspider scrapy_header.py
That’s it!
If you want to learn more about Scrapy's default settings, see the settings page of the Scrapy documentation.
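One of those settings is `DEFAULT_REQUEST_HEADERS`, which applies headers to every request rather than to a single one. As a minimal sketch, assuming a Scrapy project with a `settings.py` file (the header values here are illustrative, not copied from the spider in this post):

```python
# settings.py (project-wide), or the same dict under the key
# "DEFAULT_REQUEST_HEADERS" in a spider's custom_settings attribute.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "application/json",
    "Accept-Language": "en",
}
```

Headers passed directly to an individual Request take precedence over these defaults for that request, so the two approaches can be combined.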
"Why are you using Scrapy for something that could easily be solved by just using Requests?"
That's true. In most cases, doing something like this is much simpler:
import requests
response = requests.get("http://api.open-notify.org/iss-now.json")
However, there may be cases where you need to set a header in Scrapy, so hopefully this tutorial is useful to someone.