python - Scrape a website using Scrapy

May 15, 2015

I am trying to scrape a website with Scrapy, but I am having a problem scraping the products because the site uses endless scrolling.

I can scrape the data below for only 52 items out of 3824 items.

hxs.select("//span[@class='itm-catbrand strong']").extract()
hxs.select("//span[@class='itm-price ']").extract()
hxs.select("//span[@class='itm-title']").extract()

If I use hxs.select("//div[@id='content']/div/div/div").extract(), it extracts the whole item list, but I can't filter it any further. How can I scrape all the items?

I have tried this, but I get the same result. Please tell me what is wrong.

def parse(self, response):
    filename = response.url.split("/")[-2]
    open(filename, 'wb').write(response.body)
    for n in [2, 3, 4, 5, 6]:
        req = Request(url="http://www.jabong.com/men/shoes/?page=" + n,
                      headers={"Referer": "http://www.jabong.com/men/shoes/",
                               "X-Requested-With": response.header['X-Requested-With']})
    return req

As you have guessed, the website uses JavaScript to load more items when you scroll the page.

Using the developer tools included in my browser (Ctrl+Shift+I in Chromium), I saw in the Network tab that the JavaScript included in the page performs the following requests to load more items:

GET http://www.website-your-are-crawling.com/men/shoes/?page=2 # 2, 3, 4, 5, 6, etc.

The web server responds with documents of the following type:

<li id="ph969sh70hptindfas" class="itm hasoverlay unit size1of4 ">
  <div id="qa-quick-view-btn" class="quickviewzoom itm-quickview ui-buttonquickview l-absolute pos-t" title="quick view" data-url ="phosphorus-black-moccasins-233629.html" data-sku="ph969sh70hptindfas" onclick="_gaq.push(['_trackevent', 'badgeqv','shown','offer inside']);">quick view</div>
  <div class="itm-qlinsert tooltip-qlist  highlightstar"
       onclick="javascript:rocket.quicklist.insert('ph969sh70hptindfas', 'catalog');
                return false;" >
    <div class="starhrmsg">
      <span class="starhrmsgarrow">&nbsp;</span>
      save later
    </div>
  </div>
  <a id='cat_105_ph969sh70hptindfas' class="itm-link sobrtxt" href="/phosphorus-black-moccasins-233629.html"
     onclick="firegaq('_trackevent', 'catalog pdp', 'men--shoes--moccasins', 'ph969sh70hptindfas--1699.00--', this),firegaq('_trackevent', 'badgepdp','shown','offer inside', this);">
    <span class="lazyimage">
      <span style="width:176px;height:255px;" class="itm-imagewrapper itm-imagewrapper-ph969sh70hptindfas" id="http://static4.jassets.com/p/phosphorus-black-moccasins-6668-926332-1-catalog.jpg" itm-img-width="176" itm-img-height="255" itm-img-sprites="4">
        <noscript><img src="http://static4.jassets.com/p/phosphorus-black-moccasins-6668-926332-1-catalog.jpg" width="176" height="255" class="itm-img"></noscript>
      </span>
    </span>
    <span class="itm-budgeflag offinside"><span class="flagbrdleft"></span>offer inside</span>
    <span class="itm-catbrand strong">phosphorus</span>
    <span class="itm-title">
      black moccasins
    </span>

These documents contain more items.

So, to get the full list of items, you have to return Request objects in the parse method of your spider (see the Spider class documentation) to tell Scrapy that it should load more data:
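To illustrate how the brand and title could be pulled out of a fragment like the one above, here is a minimal sketch. It uses the standard library's html.parser instead of Scrapy's selectors so that it runs without Scrapy installed. The names FIELDS, ItemParser and parse_items are made up for this example, and it assumes that every product sits in its own li element whose class list contains "itm" and that the two target spans contain only text.

```python
from html.parser import HTMLParser

# Class attribute values copied from the fragment, mapped to field names.
# The field names themselves are an assumption made for this sketch.
FIELDS = {"itm-catbrand strong": "brand", "itm-title": "title"}


class ItemParser(HTMLParser):
    """Collect one dict per <li class="itm ..."> element."""

    def __init__(self):
        super().__init__()
        self.items = []        # one dict per product
        self._current = None   # product currently being filled in
        self._field = None     # field whose text is being read, if any
        self._text = []        # text chunks collected for that field

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class") or ""
        if tag == "li" and "itm" in css_class.split():
            # A new product starts here.
            self._current = {}
            self.items.append(self._current)
        elif tag == "span":
            # None for spans we are not interested in.
            self._field = FIELDS.get(css_class)
            self._text = []

    def handle_data(self, data):
        if self._field:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._field and self._current is not None:
            # Collapse the layout whitespace around the text.
            self._current[self._field] = " ".join("".join(self._text).split())
        if tag == "span":
            self._field = None


def parse_items(html):
    """Return the list of products found in an HTML fragment."""
    parser = ItemParser()
    parser.feed(html)
    return parser.items
```

In a real spider the same fields would more likely be read with Scrapy's own selectors, applied relative to each item node.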

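import re
from scrapy.http import Request

def parse(self, response):
    # ... extract the items in the page using extractors

    # n is the number of the next "page" to parse. You get it from
    # response.url by extracting the number at the end and adding 1
    # (the first page has no number, so the next one is 2).
    match = re.search(r"[?&]page=(\d+)", response.url)
    n = int(match.group(1)) + 1 if match else 2

    # It is very important to set the Referer and X-Requested-With headers
    # here, because that is how the website detects whether the request was
    # made by JavaScript or directly by following a link.
    req = Request(url="http://www.website-your-are-crawling.com/men/shoes/?page=" + str(n),
                  headers={"Referer": "http://www.website-your-are-crawling.com/men/shoes/",
                           "X-Requested-With": "XMLHttpRequest"})
    return req  # and your items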
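The step of getting n from response.url can also be done by parsing the query string rather than the raw text. A rough sketch of that, using only the standard library; next_page_url is a hypothetical helper name, and it assumes that a URL without a page parameter is page 1:

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def next_page_url(url):
    """Return the URL of the page that follows `url`.

    Assumption: a URL without a `page` parameter is page 1.
    """
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    current = int(query.get("page", ["1"])[0])
    query["page"] = [str(current + 1)]
    # Rebuild the URL with the updated query string.
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))
```

Inside parse, the result of next_page_url(response.url) could then be passed as the url of the next Request.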

Oh, and by the way (in case you want to test), you can't load http://www.website-your-are-crawling.com/men/shoes/?page=2 in your browser to see what it returns, because the website will redirect you to the global page (i.e. http://www.website-your-are-crawling.com/men/shoes/) if the X-Requested-With header is different from XMLHttpRequest.
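If you do want to look at such a response outside of Scrapy, one option is a small script that sends the same headers. The sketch below uses urllib.request from the standard library; BASE_URL, build_ajax_request and fetch_page are names made up for this example, and whether the site accepts the request depends on its current behaviour.

```python
import urllib.request

# Placeholder URL from the answer above; replace it with the real site.
BASE_URL = "http://www.website-your-are-crawling.com/men/shoes/"


def build_ajax_request(page):
    """Build (but do not send) a request that mimics the page's own AJAX call."""
    return urllib.request.Request(
        BASE_URL + "?page=" + str(page),
        headers={"Referer": BASE_URL, "X-Requested-With": "XMLHttpRequest"},
    )


def fetch_page(page):
    """Send the request and return the raw response body as bytes."""
    with urllib.request.urlopen(build_ajax_request(page)) as response:
        return response.read()
```

Calling fetch_page(2) should then return the fragment for the second page instead of the redirect, provided the headers are all the site checks.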
