python - Scrape a website using Scrapy
I am trying to scrape a website with Scrapy, but I am having a problem scraping the products from this site because it uses endless scrolling.
With the selectors below I can scrape the data for only 52 of the 3824 items.
    hxs.select("//span[@class='itm-catbrand strong']").extract()
    hxs.select("//span[@class='itm-price ']").extract()
    hxs.select("//span[@class='itm-title']").extract()

If I use hxs.select("//div[@id='content']/div/div/div").extract(), it extracts the whole list of items, but then I can't filter it any further. How do I scrape all the items?
I have tried the following, but I get the same result. Please tell me where I am going wrong.
    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
        for n in [2,3,4,5,6]:
            req = Request(url="http://www.jabong.com/men/shoes/?page=" + n,
                          headers = {"referer": "http://www.jabong.com/men/shoes/",
                                     "x-requested-with": response.header['x-requested-with']})
        return req
As you have guessed, this website uses JavaScript to load more items when you scroll the page.
Using the developer tools included in my browser (Ctrl+Shift+I in Chromium), I saw in the Network tab that the JavaScript included in the page performs the following requests to load more items:
    GET http://www.website-your-are-crawling.com/men/shoes/?page=2 # 2,3,4,5,6 etc...

The web server responds with documents of the following type:
    <li id="ph969sh70hptindfas" class="itm hasoverlay unit size1of4 ">
      <div id="qa-quick-view-btn" class="quickviewzoom itm-quickview ui-buttonquickview l-absolute pos-t" title="quick view" data-url ="phosphorus-black-moccasins-233629.html" data-sku="ph969sh70hptindfas" onclick="_gaq.push(['_trackevent', 'badgeqv','shown','offer inside']);">quick view</div>
      <div class="itm-qlinsert tooltip-qlist highlightstar" onclick="javascript:rocket.quicklist.insert('ph969sh70hptindfas', 'catalog'); return false;" >
        <div class="starhrmsg">
          <span class="starhrmsgarrow"> </span> save later
        </div>
      </div>
      <a id='cat_105_ph969sh70hptindfas' class="itm-link sobrtxt" href="/phosphorus-black-moccasins-233629.html" onclick="firegaq('_trackevent', 'catalog pdp', 'men--shoes--moccasins', 'ph969sh70hptindfas--1699.00--', this),firegaq('_trackevent', 'badgepdp','shown','offer inside', this);">
        <span class="lazyimage">
          <span style="width:176px;height:255px;" class="itm-imagewrapper itm-imagewrapper-ph969sh70hptindfas" id="http://static4.jassets.com/p/phosphorus-black-moccasins-6668-926332-1-catalog.jpg" itm-img-width="176" itm-img-height="255" itm-img-sprites="4">
            <noscript><img src="http://static4.jassets.com/p/phosphorus-black-moccasins-6668-926332-1-catalog.jpg" width="176" height="255" class="itm-img"></noscript>
          </span>
        </span>
        <span class="itm-budgeflag offinside"><span class="flagbrdleft"></span>offer inside</span>
        <span class="itm-catbrand strong">phosphorus</span>
        <span class="itm-title"> black moccasins </span>

These documents contain more items.
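To illustrate that each of these fragments carries the same per-item fields as the first page, here is a minimal sketch that pulls the brand and title out of such a fragment. It uses only the standard library's html.parser as a stand-in for Scrapy's selectors, so it is not the approach used in the spider itself. The class names come from the fragment above; the ItemFieldParser name and the "brand"/"title" keys are hypothetical.

```python
from html.parser import HTMLParser


class ItemFieldParser(HTMLParser):
    """Collect brand and title text for every <li class="itm ..."> element."""

    # CSS class (from the fragment) -> output key (made up for this sketch).
    FIELDS = {"itm-catbrand": "brand", "itm-title": "title"}

    def __init__(self):
        super().__init__()
        self.items = []      # one dict per product <li>
        self._field = None   # key currently being read, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "li" and "itm" in classes:
            # A new product starts here.
            self.items.append({})
        elif tag == "span" and self.items:
            for css_class, field in self.FIELDS.items():
                if css_class in classes:
                    self._field = field

    def handle_data(self, data):
        # Store the first non-empty text found inside a matching <span>.
        if self._field and data.strip():
            self.items[-1][self._field] = data.strip()
            self._field = None

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


sample = (
    '<li id="ph969sh70hptindfas" class="itm hasoverlay unit size1of4 ">'
    '<span class="itm-catbrand strong">phosphorus</span>'
    '<span class="itm-title"> black moccasins </span>'
    '</li>'
)
parser = ItemFieldParser()
parser.feed(sample)
parser.close()
print(parser.items)
```

Inside a spider you would normally run your existing XPath expressions against each response instead; the point is only that the ?page=N responses can be parsed the same way as the initial page.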
So, to get the full list of items, you have to return Request objects in the parse method of your spider (see the Spider class documentation) to tell Scrapy that it should load more data:
    import re
    from scrapy.http import Request

    def parse(self, response):
        # ... extract the items in the page using your extractors

        # n is the number of the next "page" to parse. Get it from
        # response.url by extracting the number at the end and adding 1.
        # The first page has no ?page= parameter, so treat it as page 1.
        match = re.search(r"[?&]page=(\d+)", response.url)
        n = (int(match.group(1)) if match else 1) + 1

        # It is important to set the Referer and X-Requested-With headers
        # here because that's how the website detects whether the request
        # was made by JavaScript or by directly following a link.
        req = Request(url="http://www.website-your-are-crawling.com/men/shoes/?page=" + str(n),
                      headers={"Referer": "http://www.website-your-are-crawling.com/men/shoes/",
                               "X-Requested-With": "XMLHttpRequest"})
        return req  # and your items

Oh, and by the way (in case you want to test), you can't load http://www.website-your-are-crawling.com/men/shoes/?page=2 in your browser to see what it returns, because the website will redirect you to the global page (i.e. http://www.website-your-are-crawling.com/men/shoes/) if the X-Requested-With header is different from XMLHttpRequest.
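The "extract the number at the end and add 1" step can be sketched on its own, without Scrapy, using the standard library's urllib.parse. The helper names next_page_url and next_page_request_args are hypothetical, and treating a URL with no page parameter as page 1 is an assumption about this site rather than something the server documents.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def next_page_url(current_url):
    """Return current_url with its ?page= value increased by one.

    Assumption: a URL without a page parameter is page 1.
    """
    parts = urlsplit(current_url)
    query = parse_qs(parts.query)
    page = int(query.get("page", ["1"])[0])
    query["page"] = [str(page + 1)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


def next_page_request_args(current_url, listing_url):
    """Build the url and headers for the next AJAX-style page request."""
    return {
        "url": next_page_url(current_url),
        "headers": {
            # Both headers are needed for the site to treat this as a
            # JavaScript request rather than a directly followed link.
            "Referer": listing_url,
            "X-Requested-With": "XMLHttpRequest",
        },
    }


listing = "http://www.website-your-are-crawling.com/men/shoes/"
print(next_page_request_args(listing + "?page=2", listing))
```

The returned dictionary uses the same url and headers keywords that the Request in parse is built with, so a spider could unpack it into that constructor. You would still need a stopping condition, for example when a response contains no more items.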