Archive for the ‘mini script’ Category

Instant GOCR Training

A while back I said you *may* be able to train GOCR to recognise PHPBB2 captchas instantly thanks to its excellent database layout. Now for the moment of truth. Several hours later after travelling through much shrubbery with only my trusty whip and bent fedora for company (I think I may be insane but I don’t have the paper to prove it or the jacket)…

It works. The only downside is that if you fill the database with too many characters, it is very likely to slow GOCR down immensely. So go easy, and try to prune excess duplicates of the same letter.

So here’s how it works. Inside the custom database directory is a file called db.lst. This file is literally just a list of pictures with their correct answers, as seen below (note this is my custom database, normally it gives the files sensible names :D ):

30402199be694d0330735cb3de4df778.pbm G
852f04abf55c904fdb977dc297c630ec.pbm Z
1cbc984624ca1673132afead5d6f518a.pbm G
297a35232ba803cd6675a38a29453828.pbm D

The first entry is the filename, and it can literally be any pbm/png file. The second entry is the correct letter. That simple. All we have to do is rip the letters out and put them in the same directory. Unfortunately I haven’t got the script cleaned in a nice easy to use format to just download, but I’ll post what I used to build my custom database very quickly. I use the retrieve.php include which is somewhere on this site. I should be more organised. I think it’s here.

Now this code is written to run on Windows/Linux so it uses png files because we can’t export pbm files from GD in php. It was either that or have the script not work in Windows at all. All you Linux folks can easily convert them to pbm files and do it the way it’s supposed to be done. (The script runs from the command line only… like this… “php script.php answer.txt captcha.png”) (Also I just thought… Make sure you have the directory ‘data’ in the same directory as you run the script. Don’t run the script from the ‘data’ directory but the directory just above it)

<?php

require_once("retrieve.php");

// extract the letters out
$letters = get_letter_array($argv[$argc-1]);

// get the answer to the captcha
$fp = fopen($argv[$argc-2], "r") or die("Need a solved answer in " . $argv[$argc-2]);
$str_answer = fgets($fp);
fclose($fp);
$answer = str_split($str_answer);

// give them unique names and save them in .png format
$unique_name = array();
for($index=0; $index<count($letters); $index++)
{
    $unique_name[] = md5(uniqid());
    imagepng($letters[$index], "data/" . $unique_name[$index] . ".png");
}

// link them from the db.lst file
$fp = fopen("data/db.lst", "a");
for($index=0; $index<count($letters); $index++)
{
    fwrite($fp, $unique_name[$index] . ".png " . $answer[$index] . "\n");
}
fclose($fp);

?>
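
Since too many duplicates of the same letter are the thing to watch for, it can help to see how many samples each character has before the list grows. The snippet below is only a sketch: count_db_entries is a made-up helper name, not part of GOCR or of the script above, and it assumes every line of db.lst has the "filename character" shape shown earlier.

```php
<?php
// Hypothetical helper (not part of the original script or of GOCR):
// count how many samples each character has in a db.lst-style list.
function count_db_entries(array $lines): array
{
    $counts = array();
    foreach ($lines as $line) {
        // assumed layout: "<filename> <character>", separated by whitespace
        $parts = preg_split('/\s+/', trim($line));
        if (count($parts) !== 2) {
            continue; // skip blank or malformed lines
        }
        $char = $parts[1];
        $counts[$char] = ($counts[$char] ?? 0) + 1;
    }
    arsort($counts); // most frequent characters first
    return $counts;
}

// sample lines copied from the db.lst excerpt in this post
$sample = array(
    "30402199be694d0330735cb3de4df778.pbm G",
    "852f04abf55c904fdb977dc297c630ec.pbm Z",
    "1cbc984624ca1673132afead5d6f518a.pbm G",
    "297a35232ba803cd6675a38a29453828.pbm D",
);

$counts = count_db_entries($sample);
foreach ($counts as $char => $n) {
    echo $char . ": " . $n . "\n";
}
```

To run it against a real list, something like file("data/db.lst", FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) would supply the lines.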

And now for some link love to the spamhuntress.

I actually have a plan in mind for my next post, which is damn unusual. I’ll let you know how it goes in several days time :D . Oh yeah and it’ll be in Java so it’ll run nicely on your Windows install too.

Wednesday, April 16th, 2008

Forum registration

I get a message in my comments:

“BTW? have you got the rest of the scripts you need to use your captcha breaking code? i.e. the forum spam stuff?”

Don’t think I’m not listening :D . So here we go, a script that will register at a phpBB2 forum. It works automatically for Linux if you run it from the command line. I know half of you probably use Windows but it’s such a pain trying to port code and the necessary code is in my guest post on BlueHatSeo.com.

The workings behind the functions are stored in regfunctions.php, and you use the script by either running “php regphpbb2.php” or navigating to it in your browser if you’re on Windows.

Anyway at the top of the code is our list of variables that we can change for registering at different forums.

<?php

require_once("regfunctions.php");

// set our sign up variables like username and so on
$sign_user = "user";
$sign_email = "test@test.localhost";
$sign_pass = "aaa";
$sign_sig = "My spammy signature";
$site_name = "http://localhost/phpBB2/";

Now we download the captcha and if we’re running inside a browser we show the captcha to the user, otherwise we run our C program to crack it.

// make sure we haven’t already sent an answer to our captcha
if(!isset($_GET['captchacode']))
{
    // begin to register an account this will save the captcha to downloadedcaptcha/captcha.png
    // it will return a necessary session/confirm id we’ll need later
    $ids = get_register_captcha($site_name);
    $sid = $ids[0];
    $cid = $ids[1];

    // crack the captcha or get a human to solve it
    if(!isset($_SERVER['_']))
    {
        // if we are running in a web page show the captcha to the user
        echo "<h2>PHPBB2 Captcha</h2> You can crack this automatically by running this script from the command line in Linux with ImageMagick libraries installed.<br />";

        echo "<img src='downloadedcaptcha/captcha.png' /><br />";
        echo "<FORM action='" . $_SERVER['PHP_SELF'] . "' method='GET'>";
        echo "Type in the code <input type='text' size='15' name='captchacode' /><br />";
        echo "<input type='hidden' name='sid' value='" . $sid . "' />";
        echo "<input type='hidden' name='cid' value='" . $cid . "' />";
        echo "<input type='submit' value='submit answer' />";
        echo "</FORM>";

        exit(1);
    }
    else
    {
        // if we are running from the command line solve it in code
        echo "Solving captcha…\n";
        $solved_captcha = str_replace(" ", "", exec("./cleanpic downloadedcaptcha/captcha.png"));
        $solved_captcha = str_replace("\n", "", $solved_captcha);
    }
}

// if we have a solved captcha put it in the correct variable
if(isset($_GET['captchacode']))
{
    $solved_captcha = $_GET['captchacode'];
    $sid = $_GET['sid'];
    $cid = $_GET['cid'];
}

The important bit here is the $solved_captcha = exec("./cleanpic… ) part. exec allows us to run a program and returns the last line of its output, in this case our cracked captcha. You need to replace this program with its Windows version if you are running Windows. The str_replace around the call to exec is just to clean the string up in case it sends back a string with spaces or carriage returns. Now we just send some post variables to the server with all the necessary data.

// finish the sign up
$success = sign_up($sid, $cid, $solved_captcha, $site_name, $sign_user, $sign_email, $sign_pass, $sign_sig);

if($success)
    echo "account created\n";
else
    echo "account failed to be created\n";

// now verify the email, note: this is a stub, no code in it
// gotta write it yourself :D
verify_email();

?>

I haven’t written in the email verification code but you don’t always need it for phpBB2. It’s dependent on the mail server you use anyway.

How do you work these scripts out? I have a trick :D . LiveHTTP Headers for Firefox. Take a look below. I register first manually and it prints out everything I need to send to the server to register automatically next time.

LiveHTTP headers

The highlighted part is all the post variables that allow us to register. Just exchange them for our own variables. From here it’s pretty simple to add on the pieces that post messages on the forum.

Forum Registration Code

Friday, April 11th, 2008

Email Verification

So you’ve cracked that phpBB2 or phpBB3 captcha and registered an account, and now it wants you to verify your account by email. Foiled again.

Actually this is pretty easy to get around. All you need is a free email service that supports webmail, and a page scraping utility. Hmmmm… Guess what, my page scraping code will work excellently with webmail services. What’s really handy is as long as you point the cookies string to a proper empty file it will keep the session details allowing you to log on as if you were using a normal web browser. So then you would just use preg_match to find important parts of the page (like login buttons, inbox, and so on), follow these links, until you find the link that says “Confirm your email address” or similar.

Or you could use temporary email…

$output = scrape_page("http://www.mytrashmail.com/myTrashMail_inbox.aspx?email=" . $temp_email_name);

That’ll dump the html of your temporary inbox. You can even delete the email promptly to save them space.

If you’re really good you can download a POP3 PHP class and log into GoogleMail directly ;)

Wednesday, March 26th, 2008

PHP Preg_match without the BS

Your guide to becoming a preg_match/regex genius. Or just coding like me :D . Seriously though, when all you want to do is prototype a script and you need to match a string, regular expressions seem complicated. To write a proper regular expression there are a huge number of symbols you can use, which make the expression more efficient all around.

The only symbols I ever really remember are:

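(.*) - Matches anything (except a newline, by default) in the same way as searching for a file by the name *.* and it is greedy, so it grabs as many characters as possible.
(.*?) - Same as above except it is lazy, so it defaults to the fewest number of characters possible.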

So if we had the phrase:

I am a pretty crazy guy. I am a pretty crazy guy.

and our regular expression was (note: all regular expressions start and end with / unless you know what you’re doing and you have another plan):

/I am a(.*)guy/

our output would be:

 pretty crazy guy. I am a pretty crazy

now if we did this:

/I am a(.*?)guy/

our output is:

 pretty crazy

Welcome to the lazy way of programming ;)

Just to further clarify the full php would be:

preg_match("/I am a(.*)guy/", "I am a pretty crazy guy. I am a pretty crazy guy.", $matches);
echo $matches[1];
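
To see both forms side by side, here is a small sketch using the same phrase. The variable names are mine, and the preg_match_all call at the end is an extra I have added to show the lazy form picking up every occurrence rather than just the first.

```php
<?php
$text = "I am a pretty crazy guy. I am a pretty crazy guy.";

// greedy: (.*) runs all the way to the LAST "guy" in the string
preg_match('/I am a(.*)guy/', $text, $greedy);

// lazy: (.*?) stops at the FIRST "guy" it can reach
preg_match('/I am a(.*?)guy/', $text, $lazy);

// preg_match_all with the lazy form collects every occurrence
preg_match_all('/I am a(.*?)guy/', $text, $all);

// index 1 holds whatever the first set of brackets captured
var_dump($greedy[1]);
var_dump($lazy[1]);
var_dump($all[1]);
```

Note the captured strings keep the spaces on either side, since the brackets sit directly between "I am a" and "guy".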

Saturday, March 15th, 2008

Copy Paste Google Scraper

Bored trying to use DOM to parse your data? That library is immense for simple tasks. Well anyway it’s pretty simple to write a program to scrape google, but just to make it easier here’s how I do it. Make sure that the scraper code from here is in the same php file or included. Feel free to use this code for any tool you want.

function scrape_google($url)
{
    // get a page of results
    $page = scrape_page($url);

    // get a list of organic SE links
    preg_match("/<h2 class=r>(.*)<\/h2>/", $page, $matches);
    $link_list = $matches[1];

    // get a list of URLS, one link per line so each match stays on its own line
    $link_list = str_replace("</a>", "</a>\n", $link_list);
    // (.*?) is the lazy form; the original (.*)? was a greedy group made optional
    preg_match_all("/<a href=\"(.*?)\" class(.*)<\/a>/", $link_list, $matches);
    $link_list = $matches[1];

    return $link_list;

    // DEBUG: All this below is debugging stuff I’ve left in

    // create a string to print to screen
    //$str_link_list = implode("\n", $link_list);
    //echo "<pre>" . $str_link_list . "</pre>";

    // save all links to a file
    //$fp = fopen("out", "a");
    //fwrite($fp, $str_link_list . "\n");
    //fclose($fp);

    // get the next page
    //preg_match("/<td nowrap class=b><a href=\"(.*)\"><div id=nn><\/div>Next<\/a>/", $page, $matches);
    //echo "<a href='?googleurl=" . urlencode("http://www.google.com" . $matches[1]) . "'>Next Page - " . $matches[1] . "</a>";
}

So I grabbed this URL from the address bar and stuffed it into this function:

Tuesday, March 11th, 2008

Simple Copy Paste Scraper Function

If you looked over at the guest post code on BlueHatSeo.com for scraping it’s a little incomplete (in my opinion :P ). It’s missing cookies, and has a couple of flexibility issues. The code below will let you use POST variables simply, as well as allowing you to store session data in cookies etc.

It’s really simple to use. To get a page without proxy and no post variables:

$htmlcode = scrape_page("http://www.google.com/");

with post variables:

$htmlcode = scrape_page("http://www.google.com/", 1, "var1=1&var2=2&var3=3");

with proxy (defaults to 127.0.0.1:8118 - TOR):

$htmlcode = scrape_page("http://www.google.com/", 0, "", 1);

Here’s the code… all you need to do is change the cookie path to a text file (with the correct permissions on linux), and set the proxy to your proxy address.

<?php

function scrape_page($page, $post=0, $fields=null, $proxy=0)
{
    // cookie path
    $file_cookie = "/path/to/cookie/file/cookies";

    $ch = curl_init($page);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $file_cookie);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $file_cookie);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

    if($proxy==1)
        curl_setopt($ch, CURLOPT_PROXY, "127.0.0.1:8118");

    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($ch, CURLOPT_USERAGENT,
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)");

    if($post==1)
    {
        curl_setopt($ch, CURLOPT_POST, 1);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $fields);
    }

    $response = curl_exec($ch);

    // uncomment to debug; this has to run before curl_close()
    //echo curl_error($ch);

    curl_close($ch);

    return $response;
}

?>

The only thing it is missing is some good old error checking.
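
For what it’s worth, here is a rough sketch of what that error checking could look like. It is a stripped-down, hypothetical variant (scrape_page_checked and its $timeout parameter are names I made up for illustration, and it leaves out the cookie, proxy and POST handling), so treat it as a starting point rather than a drop-in replacement. The main points are that curl_exec returns false on failure and that the error details are read while the handle is still open.

```php
<?php
// Hypothetical variant, not the original function: same basic fetch,
// but it throws instead of quietly returning false.
function scrape_page_checked($page, $timeout = 10)
{
    $ch = curl_init($page);
    if ($ch === false) {
        throw new RuntimeException("could not initialise curl for " . $page);
    }
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);

    $response = curl_exec($ch);

    // read the error and status while the handle is still open
    $errno  = curl_errno($ch);
    $error  = curl_error($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if ($response === false) {
        throw new RuntimeException("curl error " . $errno . ": " . $error);
    }
    if ($status >= 400) {
        throw new RuntimeException("HTTP status " . $status . " for " . $page);
    }
    return $response;
}

// port 1 on localhost is almost certainly closed, so this should hit
// the failure branch without touching the outside world
try {
    $html = scrape_page_checked("http://127.0.0.1:1/", 2);
    echo strlen($html) . " bytes\n";
} catch (RuntimeException $e) {
    echo "scrape failed: " . $e->getMessage() . "\n";
}
```

Throwing an exception is just one choice; returning false and logging the message would fit the original function’s style equally well.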

Tuesday, March 11th, 2008