php - How to write a bot that does not consume much RAM?
I have a web bot that consumes a lot of memory. After some time, memory usage hits 50% and the process gets killed. I have no idea why memory usage keeps increasing like that. I did not include the source of "para.php" here; it is a library for parallel cURL requests. I would also like to know more about web crawlers in general: I have searched a lot, but could not find any helpful documents or methods I can use.
This is the library I obtained para.php from.
My code:
require_once "para.php";

class crawling
{
    // Body of the most recently fetched page, set by on_request_done().
    public $montent;

    public function crawl_page($url)
    {
        $m = new Mongo();

        // Skip URLs that have already been stored.
        $muun = $m->howto->en->findOne(array("_id" => $url));
        if (isset($muun)) {
            return;
        }
        $m->howto->en->save(array("_id" => $url));

        echo $url;
        echo "\n";

        // Fetch the page; the callback stores the body in $this->montent.
        $para = new ParallelCurl(10);
        $para->startRequest($url, array($this, 'on_request_done'));
        $para->finishAllRequests();

        // Collect every href on the page.
        preg_match_all("(<a href=\"(.*)\")siU", $this->montent, $matk);

        foreach ($matk[1] as $longu) {
            $href = $longu;
            // Turn relative links into absolute ones.
            if (0 !== strpos($href, 'http')) {
                $path = '/' . ltrim($href, '/');
                if (extension_loaded('http')) {
                    $href = http_build_url($url, array('path' => $path));
                } else {
                    $parts = parse_url($url);
                    $href = $parts['scheme'] . '://';
                    if (isset($parts['user']) && isset($parts['pass'])) {
                        $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                    }
                    $href .= $parts['host'];
                    if (isset($parts['port'])) {
                        $href .= ':' . $parts['port'];
                    }
                    $href .= $path;
                }
            }
            // Recurse into the linked page.
            $this->crawl_page($longu);
        }
    }

    public function on_request_done($content)
    {
        $this->montent = $content;
    }
}

$moj = new crawling;
$moj->crawl_page("http://www.example.com/");
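You call the crawl_page function on one URL. Its content is fetched ($this->montent) and checked for links ($matk).
While these are not yet destroyed, you go recursive, starting a new call to crawl_page. $this->montent is overwritten with the new content (that's OK). A bit further down, $matk (a new variable) is populated with the links from the new $this->montent. At this point, there are two $matk's in memory: one with the links from the document you started processing first, and one with the links from the document that was first linked to in the original document. Every further level of recursion adds another $matk, and none of them is freed until its own call returns.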
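The effect described above can be observed with a small experiment. This is only a sketch: the function names are hypothetical, and range(1, 10000) stands in for the $matk array that a real page would produce. It measures how far memory rises above its starting point at recursion depth 1 and at depth 20.

```php
<?php
// Hypothetical demo: each call keeps its own $links array alive
// until the calls below it have returned.
function hold_links_and_recurse(int $depth, int $maxDepth, int &$peak): int
{
    $links = range(1, 10000);               // stand-in for $matk
    $peak = max($peak, memory_get_usage()); // record the highest usage seen

    $below = $depth < $maxDepth
        ? hold_links_and_recurse($depth + 1, $maxDepth, $peak)
        : 0;

    // $links is still needed here, so it cannot be freed earlier.
    return count($links) + $below;
}

// Returns how many bytes above the starting point memory usage climbed.
function peak_growth(int $maxDepth): int
{
    $before = memory_get_usage();
    $peak = $before;
    hold_links_and_recurse(1, $maxDepth, $peak);
    return $peak - $before;
}

$shallow = peak_growth(1);
$deep = peak_growth(20);

printf("depth 1: %d bytes, depth 20: %d bytes\n", $shallow, $deep);
```

The exact numbers depend on the PHP version and build, but the deeper run should hold roughly one array per level at its peak, which is the growth pattern the crawler shows as it follows links from page to page.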
I'd suggest you find all the links and save them to the database (instead of going recursive). Then work through the queue of links in the database one by one (with each new document adding new entries to the database).
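A minimal sketch of that queue-based loop follows. To keep it runnable without MongoDB or network access, it makes several assumptions that are not in the original code: an in-memory SplQueue and a $seen array stand in for the database queue and the _id lookup, $fetch is a hypothetical callable replacing the ParallelCurl request, $maxPages is an illustrative safety limit, and all links are assumed to be absolute already.

```php
<?php
// Hypothetical helper: return the href value of every <a href="..."> tag.
function extract_links(string $html): array
{
    preg_match_all('~<a\s+href="([^"]*)"~i', $html, $matches);
    return $matches[1];
}

// Crawl with a loop and a queue instead of recursion.
// $fetch is a hypothetical callable(string $url): string.
function crawl_iteratively(string $startUrl, callable $fetch, int $maxPages = 100): array
{
    $queue = new SplQueue();     // stand-in for the queue of links in the database
    $seen = [$startUrl => true]; // stand-in for the findOne() lookup on _id
    $visited = [];

    $queue->enqueue($startUrl);

    while (!$queue->isEmpty() && count($visited) < $maxPages) {
        $url = $queue->dequeue();
        $visited[] = $url;

        $html = $fetch($url);
        foreach (extract_links($html) as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue->enqueue($link); // remember it for later, do not recurse
            }
        }
        unset($html); // release the page body before the next iteration
    }

    return $visited;
}

// Usage with three fake pages instead of real HTTP requests.
$pages = [
    'http://www.example.com/'  => '<a href="http://www.example.com/a">a</a> <a href="http://www.example.com/b">b</a>',
    'http://www.example.com/a' => '<a href="http://www.example.com/">home</a>',
    'http://www.example.com/b' => '<a href="http://www.example.com/a">a</a>',
];
$fetch = function (string $url) use ($pages): string {
    return $pages[$url] ?? '';
};

$visited = crawl_iteratively('http://www.example.com/', $fetch);
print_r($visited);
```

Only one page body and one list of links are alive at any time, whatever the depth of the site. Moving the queue and the seen set into the database, as suggested, would also keep those two structures out of the PHP process's memory.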