php - How to write a bot that does not consume much RAM?


I have a web bot that consumes a lot of memory. After some time, memory usage hits 50% and the process gets killed. I have no idea why the memory usage keeps increasing like that; all I did was include the "para.php" library for parallel cURL requests. I want to know more about web crawlers: I searched a lot, but could not find any helpful document or methods I can use.

The library I am using is para.php.

My code:

    require_once "para.php";

    class crawling {
        public $montent;

        public function crawl_page($url) {
            // Skip URLs that have already been crawled.
            $m = new Mongo();
            $muun = $m->howto->en->findOne(array("_id" => $url));
            if (isset($muun)) {
                return;
            }
            $m->howto->en->save(array("_id" => $url));
            echo $url;
            echo "\n";

            // Fetch the page; on_request_done() stores the body in $this->montent.
            $para = new ParallelCurl(10);
            $para->startRequest($url, array($this, 'on_request_done'));
            $para->finishAllRequests();

            preg_match_all("(<a href=\"(.*)\")siu", $this->montent, $matk);
            foreach ($matk[1] as $longu) {
                $href = $longu;
                if (0 !== strpos($href, 'http')) {
                    // Turn a relative href into an absolute URL.
                    $path = '/' . ltrim($href, '/');
                    if (extension_loaded('http')) {
                        $href = http_build_url($url, array('path' => $path));
                    } else {
                        $parts = parse_url($url);
                        $href = $parts['scheme'] . '://';
                        if (isset($parts['user']) && isset($parts['pass'])) {
                            $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                        }
                        $href .= $parts['host'];
                        if (isset($parts['port'])) {
                            $href .= ':' . $parts['port'];
                        }
                        $href .= $path;
                    }
                }
                $this->crawl_page($href);
            }
        }

        public function on_request_done($content) {
            $this->montent = $content;
        }
    }

    $moj = new crawling;
    $moj->crawl_page("http://www.example.com/");

You call the crawl_page function on one URL. Its content is fetched ($this->montent) and checked for links ($matk).

While these are not yet destroyed, you go recursive, starting a new call to crawl_page. $this->montent is overwritten with the new content (that's OK). A bit further down, $matk (a new variable) is populated with the links from the new $this->montent. At that point, there are two $matk's in memory: one with the links of the document you started processing first, and one with the links of the first document linked from the original document.
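The effect can be shown with a minimal sketch that has nothing to do with crawling: each recursive call holds a large local variable (a stand-in for $this->montent and $matk) until the whole call chain unwinds, so peak memory grows with recursion depth. The function and the 1 MB payload are invented purely for illustration:

```php
<?php
// Illustration only: each recursive call keeps its locals alive until
// the whole chain unwinds, so peak memory grows with recursion depth.
function recurse(int $depth): void {
    $big = str_repeat('x', 1000000); // stand-in for $this->montent / $matk
    if ($depth > 0) {
        recurse($depth - 1); // $big stays allocated during the inner call
    }
    // $big is only freed here, when this call returns
}

recurse(20); // ~21 MB of $big strings are live at once at the deepest call
// An iterative loop would reuse the same ~1 MB on every pass instead.
```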

I'd suggest you find the links and save them to a database instead of going recursive. Then clear the queue of links in the database one by one, with each new document adding new entries to that database.
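A minimal sketch of that queue-based approach, using an in-memory array in place of the MongoDB queue and a caller-supplied fetch callback in place of ParallelCurl; extract_links, resolve_href and crawl are illustrative names, not part of para.php:

```php
<?php
// Iterative, queue-based crawler: no recursion, so memory stays flat.

function extract_links(string $html): array {
    // Same idea as the regex in the question, made non-greedy.
    preg_match_all('(<a href="(.*?)")siu', $html, $m);
    return $m[1];
}

function resolve_href(string $base, string $href): string {
    // Turn a relative href into an absolute URL, as in the question.
    if (strpos($href, 'http') === 0) {
        return $href;
    }
    $parts = parse_url($base);
    $url = $parts['scheme'] . '://';
    if (isset($parts['user'], $parts['pass'])) {
        $url .= $parts['user'] . ':' . $parts['pass'] . '@';
    }
    $url .= $parts['host'];
    if (isset($parts['port'])) {
        $url .= ':' . $parts['port'];
    }
    return $url . '/' . ltrim($href, '/');
}

function crawl(string $start, callable $fetch_page): void {
    $queue = [$start];         // pending URLs (could be a DB collection)
    $seen  = [$start => true]; // visited set (the question uses MongoDB _id)

    while ($queue) {
        $url  = array_shift($queue); // take one URL at a time off the queue
        $html = $fetch_page($url);
        echo $url, "\n";
        foreach (extract_links($html) as $href) {
            $abs = resolve_href($url, $href);
            if (!isset($seen[$abs])) { // enqueue only unseen URLs
                $seen[$abs] = true;
                $queue[] = $abs;
            }
        }
        // $html and the link list go out of scope on each iteration,
        // so memory no longer grows with crawl depth.
    }
}
```

To keep the sketch self-contained, $fetch_page is any callable that returns a page body for a URL; in the real bot it would wrap the ParallelCurl request, and $queue/$seen would live in the MongoDB collection so the crawl survives restarts.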

