I recently needed to find all the broken links on a website. Using a little Python and Scrapy, I was able to crawl the whole site quickly.
You'll need to have Scrapy installed in order to run the following code. It was tested on Scrapy 1.1.2 and Python 2.7.10, but newer versions should work as well.
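If you don't have Scrapy yet, it's available on PyPI, so (assuming pip is set up) installing it is typically just:

Terminal
pip install scrapy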
404_scraper.py
from scrapy.item import Item, Field
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class LinkItem(Item):
    url = Field()
    referer = Field()
    status = Field()


class LinkSpider(CrawlSpider):
    name = "linkSpider"

    # Filter out other sites. No need to dig into outside websites and check their links.
    allowed_domains = ["matthewhoelter.com"]

    # List multiple start URLs if needed, separated by a comma (,)
    # i.e. ['https://www.matthewhoelter.com', 'https://blog.matthewhoelter.com']
    start_urls = ['https://www.matthewhoelter.com']

    # Scrapy normally filters out 404 responses before they reach the spider;
    # listing the status here lets parse_item() see them.
    handle_httpstatus_list = [404]

    # Follow every link found on each page and pass each response to parse_item()
    rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)

    def parse_item(self, response):
        if response.status == 404:
            link = LinkItem()
            link['url'] = response.url
            link['status'] = response.status
            # The Referer header tells us which page contained the broken link.
            # It comes back as bytes on Python 3, so decode it for clean JSON output.
            referer = response.request.headers.get('Referer')
            link['referer'] = referer.decode('utf-8') if referer else None
            return link
Then run it via:
Terminal
scrapy runspider 404_scraper.py -o output.json
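Each item the spider returns ends up in output.json as an object with url, status, and referer keys. Here's a minimal sketch (not part of the original post, and the filename is just illustrative) that reads that file and groups each broken URL under the page that links to it:

summarize_404s.py
import json
from collections import defaultdict

# Load the items the spider exported (a JSON array of objects).
with open('output.json') as f:
    broken_links = json.load(f)

# Group each broken URL by the page that linked to it (the Referer header).
by_referer = defaultdict(list)
for item in broken_links:
    by_referer[item.get('referer') or 'unknown'].append(item['url'])

# Print each referring page followed by the broken links it contains.
for referer, urls in sorted(by_referer.items()):
    print(referer)
    for url in urls:
        print('  -> ' + url)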
This was adapted from the code found on alecxe's GitHub.