Simple Copy Paste Scraper Function
If you looked at the guest post on BlueHatSeo.com with the scraping code, it’s a little incomplete (in my opinion). It’s missing cookies and has a couple of flexibility issues. The code below lets you send POST variables simply and store session data in cookies.
It’s really simple to use. To get a page with no proxy and no POST variables:
$htmlcode = scrape_page("http://www.google.com/");
with POST variables:
$htmlcode = scrape_page("http://www.google.com/", 1, "var1=1&var2=2&var3=3");
with proxy (defaults to 127.0.0.1:8118 - Tor):
$htmlcode = scrape_page("http://www.google.com/", 0, "", 1);
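The POST example above passes the fields as a hand-written query string. As a minimal sketch, PHP's built-in http_build_query() can build that string from an array and URL-encode the values, which matters once a value contains spaces or an ampersand. The field names and values here are made up for illustration.

```php
<?php
// Hypothetical field names and values, for illustration only.
$data = array(
    "var1" => 1,
    "var2" => "two words",
    "var3" => "a&b",
);

// Build an URL-encoded string; "&" is passed explicitly as the separator
// so the result does not depend on the arg_separator.output ini setting.
$fields = http_build_query($data, "", "&");

echo $fields; // var1=1&var2=two+words&var3=a%26b
```

The resulting string can then be passed as the third argument, as in scrape_page("http://www.google.com/", 1, $fields).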
Here’s the code. All you need to do is change the cookie path to a text file (with the correct permissions on Linux, so PHP can write to it) and set the proxy to your proxy address.
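<?php
// Fetch a page with cURL and return the response body (false on failure).
// $post = 1 sends $fields as the POST body; $proxy = 1 uses the proxy below.
function scrape_page($page, $post = 0, $fields = null, $proxy = 0)
{
    // cookie path: must be writable by the PHP process
    $file_cookie = "/path/to/cookie/file/cookies";

    $ch = curl_init($page);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);        // return the body instead of printing it
    curl_setopt($ch, CURLOPT_COOKIEJAR, $file_cookie);  // write cookies here when the handle closes
    curl_setopt($ch, CURLOPT_COOKIEFILE, $file_cookie); // send cookies saved by earlier requests
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);        // follow redirects

    if ($proxy == 1) {
        curl_setopt($ch, CURLOPT_PROXY, "127.0.0.1:8118");
    }

    // Certificate checks are switched off, so HTTPS responses are not verified.
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($ch, CURLOPT_USERAGENT,
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)");

    if ($post == 1) {
        curl_setopt($ch, CURLOPT_POST, 1);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $fields);
    }

    $response = curl_exec($ch);
    // echo curl_error($ch); // uncomment to debug; must run before curl_close()
    curl_close($ch);
    return $response;
}
?>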
The only thing it is missing is some good old error checking.
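As a rough idea of what that error checking could look like, here is a minimal sketch. The names check_scrape_result() and scrape_page_checked() are hypothetical and not part of the function above, and treating any HTTP status of 400 or higher as a failure is an assumption you may want to change. It uses only standard cURL calls: curl_errno(), curl_error(), and curl_getinfo() with CURLINFO_HTTP_CODE.

```php
<?php
// Hypothetical helper: turns the raw values reported by cURL into either
// the page body or an exception.
function check_scrape_result($response, $errno, $error, $http_code)
{
    if ($response === false || $errno !== 0) {
        throw new RuntimeException("cURL error $errno: $error");
    }
    if ($http_code >= 400) {
        // Assumption: any 4xx/5xx status counts as a failed scrape.
        throw new RuntimeException("HTTP error $http_code");
    }
    return $response;
}

// Hypothetical wrapper: a stripped-down request (no cookies, proxy, or POST)
// that shows where the checks fit in.
function scrape_page_checked($page)
{
    $ch = curl_init($page);
    if ($ch === false) {
        throw new RuntimeException("Could not initialise cURL");
    }
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

    $response  = curl_exec($ch);
    // Read the error details while the handle is still available.
    $errno     = curl_errno($ch);
    $error     = curl_error($ch);
    $http_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    return check_scrape_result($response, $errno, $error, $http_code);
}
```

Calling scrape_page_checked() inside a try/catch block for RuntimeException lets the calling script log the message and move on to the next URL instead of working with an empty or false response.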



March 11th, 2008 at 7:08 am
Thanks for the code.
March 11th, 2008 at 9:08 am
I’m going to put this function to use. I hate cURL, but this is a lot better than file_get_contents.
March 11th, 2008 at 2:22 pm
[…] scrape google, but just to make it easier here’s how I do it. Make sure that the scraper code from here is in the same php file or included. Feel free to use this code for any tool you […]
March 26th, 2008 at 1:57 pm
[…] a free email service that supports webmail, and a page scraping utility. Hmmmm… Guess what, my page scraping code will work excellently with webmail services. What’s really handy is as long as you point the […]
April 1st, 2008 at 5:42 pm
I will test this code next week. It looks ok, besides the lack of error checking!
April 11th, 2008 at 1:36 pm
Thanks for the code!
I’ll try it this weekend
Rgds,
Trond
July 29th, 2008 at 5:13 pm
Just wanted to let you know that the link to your guest post is down
July 30th, 2008 at 5:11 am
It wasn’t my guest post. Somebody else’s. I dunno what happened to it :S
January 19th, 2009 at 4:59 pm
Great example and tutorial. I’ll try this
March 23rd, 2009 at 4:28 am
How to use this code?
June 10th, 2009 at 4:33 am
Great example and tutorial. I’ll try this.