Copy Paste Google Scraper
Bored of trying to use the DOM to parse your data? That library is immense for simple tasks. Anyway, it’s pretty simple to write a program to scrape Google, but just to make it easier, here’s how I do it. Make sure that the scraper code from here is in the same PHP file or included. Feel free to use this code for any tool you want.
function scrape_google($url)
{
    // get a page of results
    $page = scrape_page($url);
    // get a list of organic SE links
    if (!preg_match("/<h2 class=r>(.*)<\/h2>/", $page, $matches)) {
        // no results block found: return an empty list rather than hit an undefined index
        return array();
    }
    $link_list = $matches[1];
    // get a list of URLs: put each link on its own line so the next pattern matches them one by one
    $link_list = str_replace("</a>", "</a>\n", $link_list);
    // non-greedy capture so each URL stops at its own closing quote
    preg_match_all("/<a href=\"(.*?)\" class(.*)<\/a>/", $link_list, $matches);
    $link_list = $matches[1];
    return $link_list;
    // DEBUG: All this below is debugging stuff I've left in
    // create a string to print to screen
    //$str_link_list = implode("\n", $link_list);
    //echo "<pre>" . $str_link_list . "</pre>";
    // save all links to a file
    //$fp = fopen("out", "a");
    //fwrite($fp, $str_link_list . "\n");
    //fclose($fp);
    // get the next page
    //preg_match("/<td nowrap class=b><a href=\"(.*)\"><div id=nn><\/div>Next<\/a>/", $page, $matches);
    //echo "<a href='?googleurl=" . urlencode("http://www.google.com" . $matches[1]) . "'>Next Page - " . $matches[1] . "</a>";
}
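To see what the two regex passes do without hitting Google, here is a minimal offline sketch. The helper name extract_result_links() and the sample markup are made up for illustration; the markup only imitates the h2 class=r structure the patterns above expect, and real result pages may well differ.

```php
<?php
// Hypothetical helper: the same two-step extraction as scrape_google(),
// but working on an HTML string passed in directly (no network call).
function extract_result_links($page)
{
    // Step 1: grab everything between the first <h2 class=r> and the last </h2>
    if (!preg_match("/<h2 class=r>(.*)<\/h2>/", $page, $matches)) {
        return array();
    }
    // Step 2: one link per line, then pull the href out of each line
    $link_list = str_replace("</a>", "</a>\n", $matches[1]);
    preg_match_all("/<a href=\"(.*?)\" class(.*)<\/a>/", $link_list, $matches);
    return $matches[1];
}

// Made-up sample markup, not a real results page
$sample = '<div><h2 class=r><a href="http://example.com/one" class=l>One</a></h2>'
        . '<h2 class=r><a href="http://example.org/two" class=l>Two</a></h2></div>';

print_r(extract_result_links($sample));
```

The newline inserted after each closing anchor tag is what keeps the second pattern from running across several links, since the dot does not match a newline by default.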
So I grabbed this URL from the address bar and stuffed it into this function:
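Instead of copying from the address bar, a search URL can also be put together in code. This is only a sketch: build_google_url() is a hypothetical helper, and the q and start parameter names are assumptions based on how Google search URLs usually look.

```php
<?php
// Hypothetical helper: builds a search URL from a query and a result offset.
// The "q" and "start" parameter names are assumptions, not taken from the post.
function build_google_url($query, $start = 0)
{
    $params = array('q' => $query, 'start' => $start);
    // Pass '&' explicitly so the separator does not depend on php.ini settings
    return 'http://www.google.com/search?' . http_build_query($params, '', '&');
}

$url = build_google_url('php scraper', 10);
echo $url . "\n";

// Only call the scraper if it has been defined or included in this file
if (function_exists('scrape_google')) {
    $links = scrape_google($url);
    echo implode("\n", $links) . "\n";
}
```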



March 11th, 2008 at 3:22 pm
good stuff…welcome to my feed reader
March 11th, 2008 at 8:53 pm
Thanks for the useful code. Keep up the good work, I’m subscribing
March 12th, 2008 at 8:32 am
I’ve subscribed too, very useful stuff!
March 12th, 2008 at 11:03 am
Some moron left in a bug (no guesses who). I ripped this code out of a script I wrote and converted it to a function. I tested it, but didn’t realise I had left in a piece of code that referenced a variable which doesn’t exist anymore.
Fixed. Sorry about that.
March 12th, 2008 at 4:19 pm
I keep getting an empty array… the page scraper function works… do you think I may need to change the regex?
March 12th, 2008 at 6:19 pm
I just tested it and it works. However, if you downloaded the code before I removed the $url=$_GET['googleurl']; line at the top, then that’s definitely screwing it up. That’s me being stupid.
The only other thing I noticed, and I’m not sure if this is just on Linux: I copied and pasted my code back off this website into a test program and it told me it contained Unicode characters, which were the quotes. The quotes were changed to ? signs. I think that might just be Linux though. I’ll have to look at why that’s happening. If you’ve got the page scraper function working, that’s probably not the issue.
Failing that, I only tested the program on google.com and google.co.uk.
Is any of this helping?
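If the pasted code does end up with typographic quotes in it, something along these lines should turn them back into plain ones. It is only a sketch: straighten_quotes() is a hypothetical helper and it assumes the pasted text is UTF-8.

```php
<?php
// Hypothetical helper: replaces UTF-8 typographic quotes with plain ASCII quotes.
function straighten_quotes($code)
{
    $map = array(
        "\xE2\x80\x9C" => '"', // left double quotation mark (U+201C)
        "\xE2\x80\x9D" => '"', // right double quotation mark (U+201D)
        "\xE2\x80\x98" => "'", // left single quotation mark (U+2018)
        "\xE2\x80\x99" => "'", // right single quotation mark (U+2019)
    );
    return strtr($code, $map);
}

// Build a line of "pasted" code containing curly double quotes
$open   = "\xE2\x80\x9C";
$close  = "\xE2\x80\x9D";
$pasted = '$fp = fopen(' . $close . 'out' . $close . ', ' . $open . 'a' . $close . ');';

echo straighten_quotes($pasted) . "\n";
```

Running the whole pasted file through this before saving it would be one way to avoid the ? signs.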
March 12th, 2008 at 8:32 pm
Great Stuff! Must have taken a while :O
March 13th, 2008 at 6:22 pm
Took ages… j/k. Actually it didn’t take too long thanks to PHP’s amazing string matching functions.
March 14th, 2008 at 3:30 am
That is really a nice thing. I have used it also. I really liked it.
March 15th, 2008 at 8:29 am
Does it really work? It looks very simple. Thanks for this.
March 19th, 2008 at 2:17 am
If this thing works then this is cool.
March 19th, 2008 at 2:18 am
Looking very cool. I am gonna try this.
March 26th, 2008 at 9:45 am
That is really a nice thing. I will try it.
March 29th, 2008 at 12:47 am
For elusid: have you tried it using a valid proxy?
March 30th, 2008 at 7:36 am
Thanks for the useful code. Keep up the good work.
March 30th, 2008 at 7:37 am
That is really a nice thing. I have used it also.
March 30th, 2008 at 7:38 am
It works really well. It looks very simple.
April 3rd, 2008 at 5:11 am
Yeah!! (Wrings hands)! Nice blog you have here. I’ve enjoyed much reading your last posts. Keep it that way.
April 5th, 2008 at 8:39 am
lol.. that is a nice post. I really enjoyed reading it.
April 10th, 2008 at 4:07 am
Hmm… that code works really nicely. Thanks for the code.
May 24th, 2008 at 7:28 pm
For screen scraping, I enjoy using Perl with HTML::TreeBuilder. There’s a good tutorial here:
http://search.cpan.org/~petek/HTML-Tree-3.23/lib/HTML/Tree/Scanning.pod
May 25th, 2008 at 5:31 am
For a minute there I thought this was shameless self promotion from Will. If it is a decent tutorial I’d have left it there anyway.
May 25th, 2008 at 8:00 am
Hi everybody..
I’m pretty n00b about this scraping stuff. Is it possible to run this script on a Windows platform? And how? Sorry for this stupid question, but I really don’t know how.
July 11th, 2008 at 8:13 am
Try my version … looks a bit better