Tuesday, 17 September 2013

Why is my Scrapy scraper only returning the second page of results?

College is starting soon for me, so I decided to build a web scraper for
Rate My Professor to help me find the highest rated teachers at my school.
The scraper works perfectly well... but only for the second page! No
matter what I try, I can't get it to work properly.
This is the URL that I am scraping from:
http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311&pageNo=3 (not
my actual college, but has the same type of URL structure)
And here is my spider:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from rmp.items import RmpItem

class MySpider(CrawlSpider):
    name = "rmp"
    allowed_domains = ["ratemyprofessors.com"]
    start_urls = ["http://www.ratemyprofessors.com/SelectTeacher.jsp?sid=2311"]

    # Follow only the "next" link; the allow pattern matches any URL
    # carrying a pageNo query parameter.
    rules = (
        Rule(SgmlLinkExtractor(allow=('&pageNo=\d',),
                               restrict_xpaths=('//a[@id="next"]',)),
             callback='parser', follow=True),
    )

    def parser(self, response):
        hxs = HtmlXPathSelector(response)
        # Each professor sits in an alternating odd/even "entry" div.
        html = hxs.select("//div[@class='entry odd vertical-center'] | "
                          "//div[@class='entry even vertical-center']")
        profs = []
        for line in html:
            prof = RmpItem()
            prof["name"] = line.select("div[@class='profName']/a/text()").extract()
            prof["dept"] = line.select("div[@class='profDept']/text()").extract()
            prof["ratings"] = line.select("div[@class='profRatings']/text()").extract()
            prof["avg"] = line.select("div[@class='profAvg']/text()").extract()
            profs.append(prof)
        return profs
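One thing I did pick up from the Scrapy docs while debugging: CrawlSpider never runs a rule's callback on the start_urls response itself, so even with a working rule, page one would be skipped. parse_start_url is the documented CrawlSpider hook for that first response; here is the minimal override I'm planning to try on the spider above (the hook name is real Scrapy API, but routing it into my parser method is just a sketch):

    def parse_start_url(self, response):
        # CrawlSpider applies rule callbacks only to responses reached
        # through extracted links, never to the start_urls response,
        # so send page one through the same parsing logic.
        return self.parser(response)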
Some things I have tried include removing the restrict_xpaths keyword
argument (which sent the scraper after the first, last, next, and back
buttons, since all four share the &pageNo=\d URL structure; see the quick
regex check below) and changing the regex passed to the allow keyword
argument (which made no difference).
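Just to double-check my understanding of why dropping restrict_xpaths went
wrong, I threw together this quick standalone test (the URLs are made-up
examples in the style of the SelectTeacher.jsp links, not ones I pulled
from the actual page):

import re

PATTERN = r'&pageNo=\d'

# Hypothetical pagination URLs in the SelectTeacher.jsp style; every
# one contains '&pageNo=<digit>', so the allow pattern alone cannot
# tell the next button apart from first, back, or last.
urls = [
    "SelectTeacher.jsp?sid=2311&pageNo=1",  # first
    "SelectTeacher.jsp?sid=2311&pageNo=2",  # back
    "SelectTeacher.jsp?sid=2311&pageNo=4",  # next
    "SelectTeacher.jsp?sid=2311&pageNo=9",  # last
]
for url in urls:
    print("%s -> %s" % (url, re.search(PATTERN, url) is not None))
# Prints True for all four, which matches what I saw: without
# restrict_xpaths, the rule follows every pagination button.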
Does anybody have any suggestions? This seems like it should be a simple
problem, but I've already spent an hour and a half trying to figure it
out! Any help would be much appreciated.
