Archive for the ‘scraper’ Category

Copy Paste Google Scraper

Tired of using the DOM to parse your data? That library is huge overkill for simple tasks. It's pretty simple to write a program to scrape Google anyway, but to make it even easier, here's how I do it. Make sure the scrape_page code from the Simple Copy Paste Scraper Function post is in the same PHP file or included. Feel free to use this code for any tool you want.

function scrape_google($url)
{
    // get a page of results
    $page = scrape_page($url);

    // get the block of organic SE links
    if (!preg_match("/<h2 class=r>(.*)<\/h2>/", $page, $matches)) {
        return array(); // no results block found on this page
    }
    $link_list = $matches[1];

    // put each link on its own line, then get a list of URLs
    $link_list = str_replace("</a>", "</a>\n", $link_list);
    preg_match_all("/<a href=\"(.*?)\" class(.*)<\/a>/", $link_list, $matches);
    $link_list = $matches[1];

    return $link_list;

    // DEBUG: All this below is debugging stuff I've left in

    // create a string to print to screen
    //$str_link_list = implode("\n", $link_list);
    //echo "<pre>" . $str_link_list . "</pre>";

    // save all links to a file
    //$fp = fopen("out", "a");
    //fwrite($fp, $str_link_list . "\n");
    //fclose($fp);

    // get the next page
    //preg_match("/<td nowrap class=b><a href=\"(.*)\"><div id=nn><\/div>Next<\/a>/", $page, $matches);
    //echo "<a href='?googleurl=" . urlencode("http://www.google.com" . $matches[1]) . "'>Next Page - " . $matches[1] . "</a>";
}
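If you want to see what the regex steps in that function do without hitting Google at all, here's a rough sketch that runs them on a made-up snippet of HTML. The markup and the example.com URLs are invented for illustration; they only mimic the `<h2 class=r>` structure the function expects, and a real results page may look different.

```php
<?php
// Made-up markup for illustration only; real result pages will differ.
$page = '<div><h2 class=r><a href="http://example.com/one" class=l>One</a>'
      . '<a href="http://example.org/two" class=l>Two</a></h2></div>';

// Step 1: grab everything inside the results heading.
preg_match('/<h2 class=r>(.*)<\/h2>/', $page, $matches);
$link_list = $matches[1];

// Step 2: put each link on its own line, so a match never spans two links.
$link_list = str_replace('</a>', "</a>\n", $link_list);

// Step 3: capture the href of each link (the lazy match stops at the first closing quote).
preg_match_all('/<a href="(.*?)" class(.*)<\/a>/', $link_list, $matches);
$links = $matches[1];

echo implode("\n", $links), "\n";
```

The newline trick in step 2 matters because `.` doesn't match a newline by default, which keeps the greedy `(.*)` at the end of the pattern from swallowing the next link.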

So I grabbed this URL from the address bar and stuffed it into this function:

Tuesday, March 11th, 2008

Simple Copy Paste Scraper Function

If you looked at the scraping code in the guest post on BlueHatSeo.com, it's a little incomplete (in my opinion :P ). It's missing cookie handling and has a couple of flexibility issues. The code below lets you send POST variables easily and store session data in cookies.

It's really simple to use. To get a page with no proxy and no POST variables:

$htmlcode = scrape_page("http://www.google.com/");

With POST variables:

$htmlcode = scrape_page("http://www.google.com/", 1, "var1=1&var2=2&var3=3");
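If you'd rather not glue the POST string together by hand, PHP's built-in `http_build_query` can build it from an array and URL-encode the values for you. This is just a sketch; the field names are the same placeholder ones used above, and `$fields` is my own variable name.

```php
<?php
// Placeholder form fields; swap in whatever the target form expects.
$data = array('var1' => 1, 'var2' => 2, 'var3' => 3);

// Build "var1=1&var2=2&var3=3"; the '&' separator is passed explicitly
// so the result doesn't depend on the arg_separator.output ini setting.
$fields = http_build_query($data, '', '&');

echo $fields, "\n";

// Then pass it along as the third argument:
// $htmlcode = scrape_page("http://www.google.com/", 1, $fields);
```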

With a proxy (defaults to 127.0.0.1:8118, the usual Privoxy port in front of TOR):

$htmlcode = scrape_page("http://www.google.com/", 0, "", 1);

Here's the code. All you need to do is point the cookie path at a text file (with the correct permissions on Linux) and set the proxy to your proxy address.

<?php

function scrape_page($page, $post=0, $fields=null, $proxy=0)
{
    // cookie path
    $file_cookie = "/path/to/cookie/file/cookies";

    $ch = curl_init($page);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    // read cookies from, and save them back to, the same file
    curl_setopt($ch, CURLOPT_COOKIEJAR, $file_cookie);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $file_cookie);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

    if($proxy==1)
        curl_setopt($ch, CURLOPT_PROXY, "127.0.0.1:8118");

    // skip SSL certificate checks
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($ch, CURLOPT_USERAGENT,
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)");

    if($post==1)
    {
        curl_setopt($ch, CURLOPT_POST, 1);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $fields);
    }

    $response = curl_exec($ch);

    // must run before curl_close() or the handle is gone
    //echo curl_error($ch);

    curl_close($ch);

    return $response;
}

?>

The only thing it is missing is some good old error checking.
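For what it's worth, here's a rough sketch of what that error checking could look like. The function name `scrape_page_checked`, the `$timeout` parameter and the shape of the returned array are all my own assumptions, not part of the code above, and the cookie, proxy and POST options are left out to keep it short. The main point is that the error details have to be read from the handle right after `curl_exec`.

```php
<?php
// Hypothetical helper: a stripped-down fetch that reports failures
// instead of silently returning false.
function scrape_page_checked($page, $timeout = 10)
{
    $ch = curl_init($page);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);

    $response = curl_exec($ch);

    // Read the error details while the handle is still around;
    // it is released when $ch goes out of scope.
    $errno  = curl_errno($ch);
    $error  = curl_error($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    // Transport-level failure: DNS, refused connection, timeout, etc.
    if ($response === false || $errno !== 0) {
        return array('ok' => false, 'error' => $error, 'status' => $status, 'body' => null);
    }

    // The server answered, but with an error status.
    if ($status >= 400) {
        return array('ok' => false, 'error' => "HTTP $status", 'status' => $status, 'body' => null);
    }

    return array('ok' => true, 'error' => '', 'status' => $status, 'body' => $response);
}

// Failure path demo: nothing should be listening on port 1 of localhost.
$result = scrape_page_checked('http://127.0.0.1:1/', 2);
if (!$result['ok']) {
    echo "Request failed: " . $result['error'] . "\n";
}
```

Returning an array rather than just the body means the caller can tell "empty page" apart from "request failed", which matters once you start looping over lots of URLs.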

Tuesday, March 11th, 2008