Archive for the ‘captcha’ Category

PHPBB3 Captcha difficulty

Is phpBB3 more secure than phpBB2? Here is a default phpBB3 sample.

PHPBB3 captcha

This is a lot stronger than a phpBB2 captcha. We can’t separate a letter based purely on its colour anymore. Notice how there is a line running underneath the B that is the same colour as the B. The background colour is annoying as anything but only from a person’s point of view. Our PC doesn’t really mind.

One of its weaknesses is that no lines cut across the squares; they all go underneath them. That means there are no breaks in the squares that we have to detect. The only other weakness I can see is that the lines go directly across without intersecting at any point. That means there are no noise objects that look like the squares making up the letters.

So here’s my algorithm, which I think would solve it. Admittedly I haven’t tested this but I don’t see why it wouldn’t work. All the letters are made up of squares. We need to test whether, starting at one pixel, we can get back to the start by following pixels of the same colour. That obviously would make a square :P , or something close like a distorted rectangle. It’s almost like a dot to dot puzzle. If we can get back to the start, keep the line and colour it in using the previous post’s fill function (or PHP GD’s one :D ). If we can’t get back to the start, or the line keeps travelling too far, then we remove it and find another coloured pixel that doesn’t match the background colour.

The main issues we would have to overcome are lines which are thicker than 1 pixel and small blocks of colour found at the side of some of the letters. The other issue would be making sure we don’t recheck the part we just shaded in (Maybe use a unique colour for it?).
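As a rough, untested-against-real-captchas sketch of the idea above, here is one way to decide whether a run of same-coloured pixels closes back on itself. It swaps the "follow the line back to the start" walk for a simpler check: collect the connected pixels, then see whether they fence off any area inside their bounding box. The function name enclosesHole and the text grid (one character per pixel) are assumptions made for illustration, and the "line keeps travelling too far" rule is not covered.

```php
<?php
// Hypothetical helper: $grid is an array of equal-length strings, one character per pixel.
// Returns true if the same-coloured pixels connected to ($sx, $sy) enclose an area.
function enclosesHole(array $grid, int $sx, int $sy): bool
{
    $h = count($grid);
    $w = strlen($grid[0]);
    $colour = $grid[$sy][$sx];

    // 1. Collect the 4-connected component of same-coloured pixels and its bounding box.
    $shape = [];
    $minX = $maxX = $sx;
    $minY = $maxY = $sy;
    $stack = [[$sx, $sy]];
    while ($stack) {
        [$x, $y] = array_pop($stack);
        if ($x < 0 || $y < 0 || $x >= $w || $y >= $h) continue;
        if (isset($shape["$x,$y"]) || $grid[$y][$x] !== $colour) continue;
        $shape["$x,$y"] = true;
        $minX = min($minX, $x); $maxX = max($maxX, $x);
        $minY = min($minY, $y); $maxY = max($maxY, $y);
        array_push($stack, [$x + 1, $y], [$x - 1, $y], [$x, $y + 1], [$x, $y - 1]);
    }

    // 2. Flood everything that is not part of the shape, starting from a
    //    one-pixel margin around the bounding box (the margin is always "outside").
    $outside = [];
    $stack = [[$minX - 1, $minY - 1]];
    while ($stack) {
        [$x, $y] = array_pop($stack);
        if ($x < $minX - 1 || $x > $maxX + 1 || $y < $minY - 1 || $y > $maxY + 1) continue;
        if (isset($outside["$x,$y"]) || isset($shape["$x,$y"])) continue;
        $outside["$x,$y"] = true;
        array_push($stack, [$x + 1, $y], [$x - 1, $y], [$x, $y + 1], [$x, $y - 1]);
    }

    // 3. A pixel inside the box that is neither shape nor outside is fenced in.
    for ($y = $minY; $y <= $maxY; $y++) {
        for ($x = $minX; $x <= $maxX; $x++) {
            if (!isset($shape["$x,$y"]) && !isset($outside["$x,$y"])) {
                return true;
            }
        }
    }
    return false;
}

// A square outline on the left and a stray line of the same colour on the right.
$grid = [
    "..........",
    ".####.....",
    ".#..#.####",
    ".#..#.....",
    ".####.....",
];
var_dump(enclosesHole($grid, 1, 1)); // the square outline
var_dump(enclosesHole($grid, 6, 2)); // the stray line
```

On a real captcha the grid would come from imagecolorat(), and thick lines or blobs at the side of a letter would still need the extra handling mentioned above.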

Friday, March 21st, 2008

A custom floodfill routine

Last post I said that to separate characters you simply need to flood fill them and calculate the extreme points to find out where to fit your rectangle. Hahaha. I’ve spent the last few days trying to port optimized floodfill functions to PHP. Normally I’d just take the sane, easy option and use pre-written code like GOCR but apparently I like pain.

The problem is the optimized bit. I can write a simple recursive floodfill function that calls itself until it’s done but I have no idea how to write something that will be reasonably fast. The more complicated captchas will require a fair bit of speed because you will want to crack a fair few of them in a short period of time. This site is where I eventually found a simple routine that worked. My issue was I used another routine, ported it, and then it broke. Miserably. After filling only two lines.

Here is my code. First it loads in an image of a captcha. It then scans two lines along the horizontal axis, one at 1/4 of the way down the picture and one at 3/4 of the way down. When it hits a character it floodfills it. The floodfill function returns the extreme positions of pixels which gives us a rectangle around that letter.

<?php

function floodFillScanlineStack($image, $x, $y)
{
    // the colour we are shading in - black letters
    $oldColour = 0;
    // the colour we want to shade the letters in - red just because we can
    $fillColour = imagecolorallocate($image, 255, 0, 0);

    // we need the image width & height
    $w = imagesx($image);
    $h = imagesy($image);

    // set the rectangle co-ords
    $rectangle = array("x1" => $x, "x2" => $x, "y1" => $y, "y2" => $y);

    // nothing to fill if both colours are the same
    if($oldColour == $fillColour) return $rectangle;

    $stack = array();
    $stack[] = array("x" => $x, "y" => $y);

    while(count($stack)>0)
    {
        $pos = array_pop($stack);
        $x = $pos['x'];
        $y = $pos['y'];

        // walk up to the top of this vertical run of pixels
        $y1 = $y;
        while($y1 >= 0 && imagecolorat($image, $x, $y1) == $oldColour) $y1--;
        $y1++;
        $spanLeft = 0;
        $spanRight = 0;
        while($y1 < $h && imagecolorat($image, $x, $y1) == $oldColour )
        {

            // here we set the pixel colour
            // use these to find our rectangle around the letter
            imagesetpixel($image, $x, $y1, $fillColour);
            if($x<$rectangle['x1'])
                $rectangle['x1'] = $x;
            if($y1<$rectangle['y1'])
                $rectangle['y1'] = $y1;
            if($x>$rectangle['x2'])
                $rectangle['x2'] = $x;
            if($y1>$rectangle['y2'])
                $rectangle['y2'] = $y1;

            if($spanLeft==0 && $x > 0 && imagecolorat($image, $x - 1, $y1) == $oldColour)
            {
                $stack[] = array("x" => $x - 1, "y" => $y1);
                $spanLeft = 1;
            }
            else if($spanLeft==1 && $x > 0 && imagecolorat($image, $x - 1, $y1) != $oldColour)
            {
                $spanLeft = 0;
            }
            // $x + 1 must stay inside the image, so compare against $w - 1
            if($spanRight==0 && $x < $w - 1 && imagecolorat($image, $x + 1, $y1) == $oldColour)
            {
                $stack[] = array("x" => $x + 1, "y" => $y1);
                $spanRight = 1;
            }
            else if($spanRight==1 && $x < $w - 1 && imagecolorat($image, $x + 1, $y1) != $oldColour)
            {
                $spanRight = 0;
            }
            $y1++;
        }
    }

    return $rectangle;
}

function floodfill_char($image, $x)
{
    if((imagecolorat($image, $x, 12)==0))
        return floodFillScanlineStack($image, $x, 12);

    if((imagecolorat($image, $x, 38)==0))
        return floodFillScanlineStack($image, $x, 38);
}

function split_chars_along_vertical($image)
{
    $w = imagesx($image);
    $h = imagesy($image);

    /* $rgb = imagecolorat($img, $x, $y);
    $r += $rgb >> 16;
    $g += $rgb >> 8 & 255;
    $b += $rgb & 255; */

    // scan along each vertical line looking for black pixels
    // we'll only scan two lines of pixels to save time. Both along the center
    // split slightly apart
    $letters = array();
    for($index=0; $index<$w; $index++)
    {
        // check two lines of pixels one at 12 down, one at 38 down
        // the picture is 50 pixels tall by the way
        if((imagecolorat($image, $index, 12)==0) || (imagecolorat($image, $index, 38)==0))
        {
            // fill the character and return a rectangle around the image
            $rectangle = floodfill_char($image, $index);

            // pull this letter out into a new image
            $singleLetter = imagecreatetruecolor($rectangle['x2'] - $rectangle['x1'] + 1,
                $rectangle['y2'] - $rectangle['y1'] + 1);
            imagecopy($singleLetter, $image, 0, 0, $rectangle['x1'], $rectangle['y1'],
                $rectangle['x2'] - $rectangle['x1'] + 1,
                $rectangle['y2'] - $rectangle['y1'] + 1);
            $letters[] = $singleLetter;

            // find the next character
            $index = $rectangle['x2']+1;
        }
    }

    return $letters;
}

$image = @imagecreatefrompng('82.clean.png');
if ($image === false) { die ('Unable to open image'); }

$letters = split_chars_along_vertical($image);

// dump the first letter to the screen
header("Content-Type: image/png");
imagepng($letters[0]);

?>

To run through the code and show how it works I made this neat little gif. To be honest it’s probably just a waste of my bandwidth but it looks pretty cool.

Floodfill gif

Now I can finally sort this damn neural network out :P . You just know something is going to break. Of course that’s the fun part, right? Anyone?

Scripts to separate characters

Friday, March 21st, 2008

Separating characters manually

Eli of Bluehatseo.com is spending a week breaking captchas after they ddos’d his server. I’m interested.

I’m going to explain how I would separate the characters in an image after removing the noise and lines and stuff. It’s funny, this post is going to be pretty damn short :D but I need time to work out the post with neural networks. This will just be the theory of separating characters and I’ll post code next time.

So first things first we need to scan through the vertical lines until we hit a pixel. From that pixel we flood fill. Yep. That’s it. As long as it’s a custom flood fill routine it will give us a start and end point from which we can fit a polygon or rectangle around the letter and extract it out. Now just de-rotate it and throw it at the neural network I don’t have yet.
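The theory can be sketched on a tiny text grid rather than a real image. This is a minimal illustration, not the optimized routine: it uses a plain stack-based fill, '#' stands for a letter pixel, and the names fillAndBound and findLetterBoxes are made up for the example.

```php
<?php
// Fill the letter that contains ($sx, $sy) and return the rectangle around it.
// Filled pixels are overwritten with '*' so they are never visited twice.
function fillAndBound(array &$grid, int $sx, int $sy): array
{
    $h = count($grid);
    $w = strlen($grid[0]);
    $box = ['x1' => $sx, 'y1' => $sy, 'x2' => $sx, 'y2' => $sy];
    $stack = [[$sx, $sy]];
    while ($stack) {
        [$x, $y] = array_pop($stack);
        if ($x < 0 || $y < 0 || $x >= $w || $y >= $h || $grid[$y][$x] !== '#') {
            continue;
        }
        $grid[$y][$x] = '*';
        // track the extreme points reached by the fill
        $box['x1'] = min($box['x1'], $x);
        $box['y1'] = min($box['y1'], $y);
        $box['x2'] = max($box['x2'], $x);
        $box['y2'] = max($box['y2'], $y);
        array_push($stack, [$x + 1, $y], [$x - 1, $y], [$x, $y + 1], [$x, $y - 1]);
    }
    return $box;
}

// Scan the vertical lines left to right; every unfilled '#' starts a new letter.
function findLetterBoxes(array $grid): array
{
    $boxes = [];
    $h = count($grid);
    $w = strlen($grid[0]);
    for ($x = 0; $x < $w; $x++) {
        for ($y = 0; $y < $h; $y++) {
            if ($grid[$y][$x] === '#') {
                $boxes[] = fillAndBound($grid, $x, $y);
            }
        }
    }
    return $boxes;
}

$grid = [
    "#..##",
    "#...#",
    "#..##",
];
print_r(findLetterBoxes($grid));
```

Each returned rectangle is what you would cut out of the image before de-rotating the letter.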

Tuesday, March 18th, 2008

Replacing GOCR part 1

This is related to my guest post on Eli’s BlueHatSeo on cracking phpBB2 captcha. Many thanks to him for posting it.

Ok so maybe you’ve had issues using GOCR and it won’t run on your server or <insert problem here>. Well anyway, let’s remove GOCR from the program altogether. I’m going to insert a neural network in its place. The advantages of a neural network are that it is relatively simple to understand and code, especially with the help of the libraries available for many different languages, and depending on the algorithm it can be faster. However, I was reading a whitepaper the other day comparing a complex OCR algorithm against a neural network, and the OCR algorithm performed quite a bit better. So it should be interesting.

Basic Background

I guess I should put in these basics just in case you’re not familiar with neural networks. A neural network is made up of neurons connected to each other in different arrangements. A neuron has dendrites which accept signals from other neurons and an axon which sends a signal depending on the signals received from these other neurons. The neuron will send a signal when the inputs from the other neurons add up to reach a threshold. Now what makes the neuron really powerful is that each dendrite can be weighted so that the signal received can be more or less likely to make the neuron send a signal down its axon. Some dendrites could also inhibit the sending of a signal down the axon of the neuron, depending on the library we use.
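To make the description concrete, here is a minimal sketch of a single neuron. The function name neuronFires and the example weights are made up for illustration; a negative weight plays the role of an inhibiting dendrite.

```php
<?php
// One artificial neuron: each input signal is multiplied by the weight of its
// "dendrite", the results are summed, and the neuron fires if the sum
// reaches the threshold.
function neuronFires(array $inputs, array $weights, float $threshold): bool
{
    $sum = 0.0;
    foreach ($inputs as $i => $signal) {
        $sum += $signal * $weights[$i];
    }
    return $sum >= $threshold;
}

$weights = [0.5, 0.5, -0.8]; // the third dendrite inhibits the neuron

var_dump(neuronFires([1, 1, 0], $weights, 0.5)); // two excitatory inputs only
var_dump(neuronFires([1, 1, 1], $weights, 0.5)); // same, plus the inhibiting input
```

With the inhibiting input switched on, the weighted sum drops below the threshold and the neuron stays quiet.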

We can train the network by adjusting the weighting of each of these dendrites until the outputs at the end of the network give us the correct answer. I plan to use the simplest form of network, called a feed forward network. Good old Wikipedia explains. The input neurons will be the pixels of an individual character.
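Training by adjusting weights can be shown on the smallest possible case: one neuron learning the logical AND of two inputs with the classic perceptron rule. This is only a sketch of the idea, not the feed forward network planned for the captchas, and the names predict and train are made up for the example.

```php
<?php
// Output 1 if the weighted sum plus the bias is positive, otherwise 0.
function predict(array $weights, float $bias, array $inputs): int
{
    $sum = $bias;
    foreach ($inputs as $i => $value) {
        $sum += $weights[$i] * $value;
    }
    return $sum > 0 ? 1 : 0;
}

// Perceptron rule: nudge each weight in the direction that reduces the error.
function train(array $samples, int $epochs = 20, float $rate = 1.0): array
{
    $weights = array_fill(0, count($samples[0][0]), 0.0);
    $bias = 0.0;
    for ($epoch = 0; $epoch < $epochs; $epoch++) {
        foreach ($samples as [$inputs, $target]) {
            $error = $target - predict($weights, $bias, $inputs);
            foreach ($inputs as $i => $value) {
                $weights[$i] += $rate * $error * $value;
            }
            $bias += $rate * $error;
        }
    }
    return [$weights, $bias];
}

// Each sample is [inputs, expected output]; here the expected output is AND.
$samples = [
    [[0, 0], 0],
    [[0, 1], 0],
    [[1, 0], 0],
    [[1, 1], 1],
];
[$weights, $bias] = train($samples);

foreach ($samples as [$inputs, $target]) {
    echo implode(',', $inputs), ' => ', predict($weights, $bias, $inputs), "\n";
}
```

For character recognition the inputs would be the pixels of one letter, and there would be one output per possible character instead of a single yes/no.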

My Plan

For this post literally all I’m going to do is compile some training data in preparation for our neural network. Using GOCR I did it all by hand and it wasn’t too slow. But I don’t want to sit here retraining the network by hand every time I make a change to the number of neurons or something in it, so I’m going to get around 10,000 phpBB2 captchas with their solved text. Assuming 5 characters in each captcha, that is around 50,000 characters to train the network with. I’m figuring that’ll be enough (hopefully it will be overkill). It’ll be interesting to see how many it prefers because obviously on a non-open source captcha like Google’s you can’t just grab captchas this easily.

Obviously I highly recommend you run the program that grabs these captchas on your home PC and not on your shared hosting because I’m not sure how much CPU it’s going to “borrow”, and I hear shared hosts really don’t appreciate programs that hog CPU time (hmmm, phproxy?). It brought my PC to its knees (Although I did try to open a few programs at the same time as running it :D ).

Getting the Captcha Producing Code

I obviously need a script to extract these captchas, so first things first let’s find out how phpBB2 works. Using a copy of phpBB2 you’ve installed on your home WAMP server, open it up in Firefox and browse to register a new user. Looking at the address bar you see that we are in the file profile.php with this string after it, “?mode=register&agreed=true”. So we know that two variables are set, “mode” and “agreed”. We’re looking for something in profile.php that checks if mode is equal to register as set in the address bar of Firefox.

We see the lines:

$mode = ( isset($HTTP_GET_VARS['mode']) ) ? $HTTP_GET_VARS['mode'] : $HTTP_POST_VARS['mode'];
$mode = htmlspecialchars($mode);

So that just put the word “register” into the variable $mode. Scroll down:

else if ( $mode == 'editprofile' || $mode == 'register' )
{
    if ( !$userdata['session_logged_in'] && $mode == 'editprofile' )
    {
        redirect(append_sid("login.$phpEx?redirect=profile.$phpEx&mode=editprofile", true));
    }

    include($phpbb_root_path . 'includes/usercp_register.'.$phpEx);
    exit;
}

This checks if mode is equal to “editprofile” or “register” and, if it is either one, includes the file “includes/usercp_register.php”. This literally means run all the code in usercp_register.php. So open this file up now.

There are a few ways to find what we’re looking for, but scanning over the code I noticed they called the captcha a confirmation code a lot of the time, so I just opened up a search box, typed confirm into it and searched through the file until I hit something interesting:

// Generate the required confirmation code
// NB 0 (zero) could get confused with O (the letter) so we make change it
$code = dss_rand();
$code = substr(str_replace('0', 'Z', strtoupper(base_convert($code, 16, 35))), 2, 6);

$confirm_id = md5(uniqid($user_ip));

$sql = 'INSERT INTO ' . CONFIRM_TABLE . " (confirm_id, session_id, code)
    VALUES ('$confirm_id', '". $userdata['session_id'] . "', '$code')";
if (!$db->sql_query($sql))
{
    message_die(GENERAL_ERROR, 'Could not insert new confirm code information', '', __LINE__, __FILE__, $sql);
}

unset($code);

$confirm_image = '<img src="' . append_sid("profile.$phpEx?mode=confirm&id=$confirm_id") . '" alt="" title="" />';

Well, this code makes our captcha. It puts the sequence of letters and numbers it wants into $code, then produces a unique ID generated from the user’s IP address in $confirm_id. After all that it inserts both into the database, and the last line builds the image tag that retrieves the captcha.
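The one-liner that builds $code is easier to follow when it is unpacked step by step. In this sketch the function name makeConfirmCode is made up, and $hex stands in for whatever dss_rand() returns, which is assumed here to be a hexadecimal string.

```php
<?php
// Unpacked version of the phpBB2 one-liner that turns a hex string into the captcha text.
// $hex is a stand-in for the value returned by dss_rand() (assumed to be hexadecimal).
function makeConfirmCode(string $hex): string
{
    // base 35 uses the digits 0-9 and the letters a-y, so the letter z never appears
    $base35 = base_convert($hex, 16, 35);
    $upper = strtoupper($base35);
    // zero is swapped for Z so it cannot be confused with the letter O
    $noZero = str_replace('0', 'Z', $upper);
    // skip the first two characters and keep at most the next six
    return substr($noZero, 2, 6);
}

echo makeConfirmCode('deadbeef12345678'), "\n";
```

Because base 35 never produces a Z, replacing 0 with Z cannot collide with an existing character. Note that base_convert() may lose precision on very large numbers, which does not matter for a throwaway code like this.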

Of course we don’t need to know how it works; we just need to copy this code into a new file which downloads the captcha and saves its solution. Put this file in the same directory as your phpBB2 install, create a new directory called output/, and run it from the command line. I run it from the command line because it’s going to take a while to get all the captchas, I can cancel it from the command line if it goes wrong, and it would time out if I ran it from the browser. Just use Start->Run, type cmd. Then type:

C:
cd\directoryofphpbb2install
php getcaptcha.php

I used two new scripts. The first one is basically the code seen above, modified to quickly download as many captchas as possible. Save it as getcaptcha.php in the same directory as your phpBB2 install:

<?php
////////////// START UP CODE ///////////////////////////////////////////
define('IN_PHPBB', true);
$phpbb_root_path = './';
include($phpbb_root_path . 'extension.inc');
include($phpbb_root_path . 'common.'.$phpEx);

$userdata = session_pagestart($user_ip, PAGE_PROFILE);
init_userprefs($userdata);

// we need to get the current directory so we can get the download address for the captcha
echo "Please enter address of phpbb2 install.\nFor instance http://localhost/phpBB2: ";
$phpbb2dir = fgets(STDIN);
// make sure there is no return on the end of the string
$phpbb2dir = str_replace("\r", "", $phpbb2dir);
$phpbb2dir = str_replace("\n", "", $phpbb2dir);

echo "Started\n";

// let's loop over and get all of our captchas
for($index=0; $index<10000; $index++)
{

    ////////////// OUR BORROWED CODE ///////////////////////////////////////////

    // Generate the required confirmation code
    // NB 0 (zero) could get confused with O (the letter) so we make change it
    $code = dss_rand();
    $code = substr(str_replace('0', 'Z', strtoupper(base_convert($code, 16, 35))), 2, 6);

    $confirm_id = md5(uniqid("127.0.0.1"));

    $sql = 'INSERT INTO ' . CONFIRM_TABLE . " (confirm_id, session_id, code)
        VALUES ('$confirm_id', 'fakesessionid', '$code')";
    if (!$db->sql_query($sql))
    {
        message_die(GENERAL_ERROR, 'Could not insert new confirm code information', '', __LINE__, __FILE__, $sql);
    }

    $confirm_image = $phpbb2dir . append_sid("/getcaptcha.confirm.php?id=$confirm_id");

    ////////////// DOWNLOAD CAPTCHA CODE ///////////////////////////////////////////
    // write the captcha image to a file
    $captcha = file_get_contents($confirm_image) or die("Error downloading captcha");
    $fp = fopen("output/$index.png", "w") or die("Can't create output file");
    fwrite($fp, $captcha) or die("Error writing to file");
    fclose($fp);
    // write the solved text next to it
    $fp = fopen("output/$index.txt", "w") or die("Can't create output file");
    fwrite($fp, $code) or die("Error writing to file");
    fclose($fp);

    echo "Written $index\n";
}

echo "Done.\n";
?>

The second script is needed because when we download the captcha with file_get_contents(…) we don’t have the cookies that phpBB2 set in Firefox. A plain file_get_contents(…) call sends no cookies, and we need them because phpBB2 uses cookies to track which captcha it has shown you. The file includes/usercp_confirm.php from phpBB2 is what produces the captcha. We copy it to the same place as getcaptcha.php, naming it getcaptcha.confirm.php, and change these lines:

if ( !defined('IN_PHPBB') )
{
    die('Hacking attempt');
    exit;
}

to this:

define('IN_PHPBB', true);
$phpbb_root_path = './';
include($phpbb_root_path . 'extension.inc');
include($phpbb_root_path . 'common.'.$phpEx);
$userdata['session_id'] = 'fakesessionid';

This tricks the captcha code into producing a captcha for a session named ‘fakesessionid’.
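For what it’s worth, a different route from the fake session trick is to give file_get_contents() a stream context that carries a Cookie header. The sketch below only builds the context; the helper name cookieContext and the cookie name example_sid are placeholders, not the real phpBB2 cookie name, and whether phpBB2 would accept the session this way has not been tested.

```php
<?php
// Build a stream context that sends the given cookies with an HTTP request.
function cookieContext(array $cookies)
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . rawurlencode($value);
    }
    return stream_context_create([
        'http' => [
            'method' => 'GET',
            'header' => 'Cookie: ' . implode('; ', $pairs) . "\r\n",
        ],
    ]);
}

// 'example_sid' is a hypothetical cookie name used only for illustration
$context = cookieContext(['example_sid' => 'fakesessionid']);

// The context would then be passed as the third argument:
// $captcha = file_get_contents($confirm_image, false, $context);
print_r(stream_context_get_options($context));
```

The modified copy of usercp_confirm.php avoids having to know the cookie names at all, which is why it is the simpler option here.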

Well, that’s it. I have my set of training data ready. Hopefully you got all that to work without too many problems. Below is my compilation of captchas. It’s in bzip2 format because I needed to compress it heavily and I can never get rar to work on Linux. I also removed the backgrounds etc from them as outlined in my guest post on BlueHatSeo. And I also deleted the 10,000 captcha records you now have in your MySQL database :P .

10,000 phpBB2 captchas without noise

Scripts to download captchas

I was looking at my GOCR database as well and it looks like using this massive set of captchas we should be able to instantly train GOCR to recognise phpBB2 without having to correct the errors that it makes. Over the weekend I’m going to look at both of these things.

Thursday, March 13th, 2008