Parsing a site's robots.txt with robotparser

In Python 2, robotparser is a standalone standard-library module; in Python 3 it was folded into the urllib package as urllib.robotparser.
It parses a site's robots.txt and tells you whether the site asks a given user-agent not to fetch a given URL (robots.txt is advisory, so compliance is up to the crawler).
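
Because the module moved between versions, code that must run on both can guard the import (a minimal compatibility sketch):

```python
try:
    from urllib import robotparser  # Python 3: lives inside the urllib package
except ImportError:
    import robotparser  # Python 2: top-level standard-library module

# Either way, the same class is available under the same name
rp = robotparser.RobotFileParser()
```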


Take Douban Movie's robots.txt as an example:

User-agent: *
Disallow: /subject_search
Disallow: /amazon_search
Disallow: /search
Disallow: /group/search
Disallow: /event/search
Disallow: /celebrities/search
Disallow: /location/drama/search
Disallow: /forum/
Disallow: /new_subject
Disallow: /service/iframe
Disallow: /j/
Disallow: /link2/
Disallow: /recommend/
Disallow: /trailer/
Disallow: /doubanapp/card
Sitemap: https://www.douban.com/sitemap_index.xml
Sitemap: https://www.douban.com/sitemap_updated_index.xml
# Crawl-delay: 5

User-agent: Wandoujia Spider
Disallow: /
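
These rules can also be tested offline by feeding them to RobotFileParser.parse() as a list of lines, with no network access. Below is a minimal sketch using a subset of the rules quoted above (the user-agent name my_crawler is made up for illustration):

```python
from urllib import robotparser

# A subset of the robots.txt rules quoted above, as a literal string
rules = """\
User-agent: *
Disallow: /subject_search
Disallow: /j/

User-agent: Wandoujia Spider
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())  # parse() takes an iterable of lines

# Under the '*' entry, /subject/... is allowed but /j/... is disallowed
print(rp.can_fetch('my_crawler', 'https://movie.douban.com/subject/24987018/'))   # True
print(rp.can_fetch('my_crawler', 'https://movie.douban.com/j/new_search_subjects'))  # False
# 'Wandoujia Spider' is disallowed everywhere
print(rp.can_fetch('Wandoujia Spider', 'https://movie.douban.com/'))  # False
```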

Sample code in Python 3:

>>> from urllib import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url('https://movie.douban.com/robots.txt')
>>> rp.read()
>>> url = 'https://movie.douban.com/subject/24987018/'
>>> user_agent = 'molock_crawler'
>>> rp.can_fetch(user_agent, url)
True
>>> user_agent = 'Wandoujia Spider'
>>> rp.can_fetch(user_agent, url)
False
>>> url = 'https://movie.douban.com/j/new_search_subjects?tags=电影&range=0,10&start=0'
>>> rp.can_fetch(user_agent, url)
False
>>> user_agent = 'molock_crawler'
>>> url = 'https://movie.douban.com/subject/24987018/'
>>> rp.can_fetch(user_agent, url)
True
>>> url = 'https://movie.douban.com/j/new_search_subjects?tags=电影&range=0,10&start=0'
>>> rp.can_fetch(user_agent, url)
False

As the session shows, robotparser is a handy tool whenever you want your crawler to respect robots.txt.
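
In a real crawler, the check is usually wrapped in small helpers so that every request is vetted first. The sketch below is hypothetical (the helper names and the user-agent are not from the original); it derives the robots.txt URL for a site and builds a can-fetch predicate from already-downloaded rule lines, so it stays testable without a network:

```python
from urllib import robotparser
from urllib.parse import urlparse


def robots_url_for(url):
    """Derive the robots.txt URL for the site that serves the given URL."""
    parts = urlparse(url)
    return '{}://{}/robots.txt'.format(parts.scheme, parts.netloc)


def make_checker(user_agent, rules_lines):
    """Build a predicate from robots.txt lines that were already fetched."""
    rp = robotparser.RobotFileParser()
    rp.parse(rules_lines)
    return lambda url: rp.can_fetch(user_agent, url)
```

A crawler would fetch robots_url_for(seed_url) once per site (e.g. with rp.read() or any HTTP client), build the checker from the response body, and consult it before every request.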