Archive for March, 2008

GOCR to Neural Nets Pt 2

As per usual these posts don’t go as smoothly as I would like. The idea was to use FANN for PHP to make a simple neural network that would work easily. Hahahaha. You might think I make this stuff up as I go along. Oh wait. I do.

Anyway FANN requires PEAR to be installed and I figured that it’d be much simpler than installing PEAR modules to find something that was completely PHP to do the job. I did that. However.

1. It’s slow.

Ok so it’s slow. We can live with that right? I mean we’ve got some time.

2. It’s slow.

Ok it really is slow, and I’m getting impatient.

3. It’s painfully slow.

PHP sucks for some things. I do like C’s direct memory access. On the plus side this little neural net class is so simple and easy to understand.

Anyway I went ahead and I fitted all my modules together around this PHP neural net and I walked away whilst my computer attempted to learn the alphabet at a ridiculously slow pace. 900 captchas later I give it up.

The neural net does sort of work now for quite a few characters. Although I did notice a character failed to segment properly which set the learning back a bit. A lot of the characters it fails on are things like producing P’s for R’s or vice versa. So you can understand where the problem lies.

I won’t post code because there’s a fair bit spread over a few modules. nnbreak.php is the main module and must be run from the command line. Like this:

php nnbreak.php captcha.jpg answer.txt train=1/0

Pass train=1 to train the network using the answer stored in answer.txt. Pass train=0 to guess what the captcha in captcha.jpg says; this ignores answer.txt, but the argument must still be given (although the file can be blank or not exist etc).
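I said I wouldn't post the real code, so here is only a rough sketch of how the three command-line arguments could be picked apart. The function name and the array keys are made up for illustration; nnbreak.php itself may do this differently.

```php
<?php
// Hypothetical helper: turns
//   ["nnbreak.php", "captcha.jpg", "answer.txt", "train=1"]
// into named settings. The key names are invented for this sketch.
function parse_nnbreak_args(array $argv): array
{
    if (count($argv) < 4) {
        throw new InvalidArgumentException(
            'usage: php nnbreak.php captcha.jpg answer.txt train=1|0'
        );
    }
    // accept only "train=1" or "train=0"
    if (!preg_match('/^train=([01])$/', $argv[3], $m)) {
        throw new InvalidArgumentException('last argument must be train=1 or train=0');
    }
    return [
        'image'  => $argv[1],
        'answer' => $argv[2],
        'train'  => $m[1] === '1',
    ];
}

$args = parse_nnbreak_args(['nnbreak.php', 'captcha.jpg', 'answer.txt', 'train=1']);
var_dump($args['train']);
```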

So analysis… Here’s how it works roughly:

Allocate memory for the neural network, and load in previous neuron weight values

Extract the letters as said in that post about extracting letters.

Convert the letters into a 10×10 matrix of averaged values, maximum being 1.0

Loop through each letter and send the matrix of values to the input neurons

Check the output neurons

Possibly teach the network the right answer using backpropagation.
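The 10×10 step above can be sketched without GD at all. This is a minimal version that assumes the letter has already been turned into a plain 2D array ($pixels, indexed [y][x], 1.0 for ink and 0.0 for background); that array layout and the function name are my assumptions for the example, not what the modules actually use.

```php
<?php
// Shrink a letter to a $size x $size grid of averaged values (0.0 to 1.0).
// $pixels is a hypothetical 2D array indexed [y][x]: 1.0 = ink, 0.0 = background.
function downsample_letter(array $pixels, int $size = 10): array
{
    $h = count($pixels);
    $w = count($pixels[0]);
    $grid = [];
    for ($gy = 0; $gy < $size; $gy++) {
        // source rows covered by this cell (always at least one row)
        $y0 = intdiv($gy * $h, $size);
        $y1 = max($y0 + 1, intdiv(($gy + 1) * $h, $size));
        for ($gx = 0; $gx < $size; $gx++) {
            // source columns covered by this cell (always at least one column)
            $x0 = intdiv($gx * $w, $size);
            $x1 = max($x0 + 1, intdiv(($gx + 1) * $w, $size));
            $sum = 0.0;
            $count = 0;
            for ($y = $y0; $y < min($y1, $h); $y++) {
                for ($x = $x0; $x < min($x1, $w); $x++) {
                    $sum += $pixels[$y][$x];
                    $count++;
                }
            }
            $grid[$gy][$gx] = $count > 0 ? $sum / $count : 0.0;
        }
    }
    return $grid;
}

// A 20x20 letter whose left half is solid ink
$pixels = [];
for ($y = 0; $y < 20; $y++) {
    for ($x = 0; $x < 20; $x++) {
        $pixels[$y][$x] = $x < 10 ? 1.0 : 0.0;
    }
}
$grid = downsample_letter($pixels);
// Flatten the grid into the 100 values fed to the input neurons
$inputs = array_merge(...$grid);
echo count($inputs), "\n";
```

Flattening the grid gives one value per input neuron, so a 10×10 matrix means 100 inputs.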

We can tweak a lot of things such as the number of neurons in each layer and the size of the matrix. I did mess with the default backpropagation teaching speed because it didn’t seem to be learning fast enough to me :D . I’m guessing that has some drawbacks to it but I’m not sure exactly what they are. In this line the 0.5 is the learning speed, which has been moved up from the default 0.1:

$nn = new nn(3, $layer_structure,1,0.5,0.9);
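On the drawbacks of a bigger learning speed: the usual one is overshooting. Here is a toy sketch, nothing to do with the nn class itself, that nudges a single weight towards 3.0 by gradient descent on (w - 3)^2. The function name and the target value are invented for the example. On this toy 0.5 happens to work nicely and 1.1 blows up; a real network is messier, so treat it as an illustration only.

```php
<?php
// Toy problem: slide one weight towards 3.0 by gradient descent on (w - 3)^2.
// This is NOT the nn class from the post, just an illustration of step size.
function descend(float $rate, int $steps = 10): float
{
    $w = 0.0;
    for ($i = 0; $i < $steps; $i++) {
        $gradient = 2 * ($w - 3.0); // slope of the error at the current weight
        $w -= $rate * $gradient;    // bigger rate = bigger step
    }
    return abs($w - 3.0);           // how far we still are from the right answer
}

foreach ([0.1, 0.5, 1.1] as $rate) {
    printf("rate %.1f -> error %.4f\n", $rate, descend($rate));
}
```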

Now if anyone happens to train the network to recognise all of the letters properly send me the file so I can claim it as my own and pretend I did it all perfectly ;) . J/k. Seriously though I do think it should eventually learn all the letters properly it’s just taking a long time.

In conclusion. I don’t trust neural networks; they do too much stuff that I don’t know about. I’m betting that robot from the Terminator was probably built from neural nets.

The code to guess a captcha using neural nets - Already populated with some weights so it sort of works

Sunday, March 30th, 2008

I install plugins

I haven’t installed any plugins since I installed this blog. I was lazy and rushed and didn’t really think it through. I meant to install feedburner and forgot. I just got this well crafted comment which made me think:

“Hahaha, you have “nofollow”

No wonder you have no readers!!!!!!!!!

HAHAHA”

He’s probably right. So notice the dofollow plugin, top commentators, and a contact page that actually works.

Update: I just got spammed…Plugins go back off, however I’ll leave the contact page working :D

Friday, March 28th, 2008

10 Steps to Solving a Captcha

Lately I’ve been stuck on one of my projects, hitting my head against the wall. I’m going to keep doing it until I solve the problem, however it’s got in the way of my posts. So I’ve taken some time out to put together two posts.

10 Steps to Solving a Captcha

1. Start up The GIMP, Photoshop, or <insert favourite paint program here>

2. Mess around with all the plugins and filters you have available and see how much noise you can get rid of just with these. If a filter does a good job but removes too much information remember you can always use another filter on the same image and paste these two images together at the end.

3. If the letters are getting sketchy with all these filters, and too thin you can try combining them with the original image. You can do this by floodfilling from the remaining pixels on the altered image onto the original image. Now copy these floodfilled parts of the original image back into your altered image. This won’t work if straight lines cut through the letters though.

4. If “artefacts” or noise are still on the image then you need to list their common properties. Does a line have to start from an edge? Is it always straight? Is it within a set angle? Are dots spaced a certain distance apart or entirely random? And so on. We can then take all this into consideration when we write our custom noise removal algorithm.

5. Get a programming language you can throw ideas around in quickly. Basically this is just a draft. Save a rough image off from photoshop and maybe even use Visual Basic or something to test ideas out. Even if you don’t know how to apply all those filters you used in Photoshop in code who cares. If you know it works you can move mountains compared to writing a ton of code that you don’t even know will work.

6. Write that custom noise removal algorithm I talked about earlier. ;) Don’t delete code!!! Ever. Even if it doesn’t work just archive it somewhere because sometime you just know you’ll want it again.

7. Once you have the noise removed you may need to break overlapping letters up. All letters are made up of certain types of strokes (and loops in handwriting). As I understand it this is actually the basis behind natural handwriting recognition. For instance in a captcha if a letter ends with a curve then often that will be the end of the letter, depending on the length of the curve and as long as the letter isn’t rotated at a strange angle. You’ll probably need to consider these types of things if the letters overlap.

*** This is possibly one of the hardest steps

8. Once you have single letters you have to de-rotate them. Read SlightlyShadySeo. Again think about common properties. Don’t accidentally rotate a letter upside down if there is no way a letter would be that way up in the first place. Although that was just an example and probably wouldn’t happen.

9. Write an automated script that downloads a ton of captchas and only requires your input to train it with GOCR or phpOCR. It’s way more fun watching your computer do something than doing it yourself.

10. Optimize the algorithm. If your algorithm is slow but works you can probably save time by removing parts of your code that are running more times than they need to etc.

Abuse each of these 10 steps and change them to suit your personality. Receive flashes of inspiration. They’re fun. Don’t drink high caffeine drinks. They make me talk about absolute rubbish non-stop…

Wednesday, March 26th, 2008

Email Verification

So you’ve cracked that phpBB2 or phpBB3 captcha, registered an account, and now it wants you to verify your account by email. Foiled again.

Actually this is pretty easy to get around. All you need is a free email service that supports webmail, and a page scraping utility. Hmmmm… Guess what, my page scraping code will work excellently with webmail services. What’s really handy is as long as you point the cookies string to a proper empty file it will keep the session details allowing you to log on as if you were using a normal web browser. So then you would just use preg_match to find important parts of the page (like login buttons, inbox, and so on), follow these links, until you find the link that says “Confirm your email address” or similar.
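For the last step, finding the confirmation link, a minimal sketch with preg_match might look like this. The function name, the sample HTML and the forum.example address are all made up; real webmail pages will need their own pattern.

```php
<?php
// Hypothetical helper: pull the href of the first link whose text mentions "confirm".
function find_confirm_link(string $html): ?string
{
    $pattern = '/<a\s[^>]*href="([^"]+)"[^>]*>[^<]*confirm[^<]*<\/a>/i';
    if (preg_match($pattern, $html, $matches)) {
        // links inside HTML usually have & written as &amp;
        return html_entity_decode($matches[1]);
    }
    return null;
}

// Made-up email body for the example
$html = '<p>Hi!</p><a href="http://forum.example/profile.php?mode=activate&amp;u=5">'
      . 'Confirm your email address</a>';
echo find_confirm_link($html), "\n";
```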

Or you could use temporary email…

$output = scrape_page("http://www.mytrashmail.com/myTrashMail_inbox.aspx?email=" . $temp_email_name);

That’ll dump the html of your temporary inbox. You can even delete the email promptly to save them space.

If you’re really good you can download a POP3 PHP class and log into GoogleMail directly ;)

Wednesday, March 26th, 2008

PHPBB3 Captcha difficulty

Is phpBB3 more secure than phpBB2? Here is a default phpBB3 sample.

PHPBB3 captcha

This is a lot stronger than a phpBB2 captcha. We can’t separate a letter based purely on its colour anymore. Notice how there is a line running underneath the B that is the same colour as the B. The background colour is annoying as anything but only from a person’s point of view. Our PC doesn’t really mind.

One of its issues/weaknesses is that there are no lines that cut across the squares; they all go underneath them. That means there are no breaks in the squares we have to detect. The only other weakness I can see is that the lines go directly across without intersecting at any point. That means that there are no objects that look like the squares of the letters that are just noise.

So here’s my algorithm which I think would solve it. Admittedly I haven’t tested this but I don’t see why it wouldn’t work. All the letters are made up of squares. We need to test whether, starting at one pixel, we can get back to the start by following the same colour pixels. That obviously would make a square :P , or something close like a distorted rectangle. It’s almost like dot to dot puzzles. If we can get back to the start, keep the line and colour it in using the previous post’s fill function (or php GD’s one :D ). If we can’t get back to the start or the line keeps travelling too far then we remove it and find another coloured pixel that doesn’t match the background colour.

The main issues we would have to overcome are lines which are thicker than 1 pixel and small blocks of colour found at the side of some of the letters. The other issue would be making sure we don’t recheck the part we just shaded in (Maybe use a unique colour for it?).
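Here is a rough, untested-on-real-captchas sketch of the "can we get back to the start" check. It works on a plain 2D array of colour values ($grid, indexed [y][x]) rather than a GD image, and it is a greedy walk that only copes with 1-pixel-thick outlines without branches, so the thick-line issue above is not handled. The function name and the step limit are my own assumptions.

```php
<?php
// Greedy walk along 4-connected pixels of one colour.
// Returns true if the walk gets back to the starting pixel (a closed shape),
// false if it hits a dead end (an open line) or travels too far.
// $grid is a hypothetical 2D array of colour values indexed [y][x].
function traces_closed_loop(array $grid, int $startX, int $startY, int $maxSteps = 200): bool
{
    $colour  = $grid[$startY][$startX];
    $visited = ["$startX,$startY" => true];
    $x = $startX;
    $y = $startY;
    $steps = 0;
    $moves = [[1, 0], [0, 1], [-1, 0], [0, -1]]; // right, down, left, up

    while ($steps < $maxSteps) {
        $moved = false;
        foreach ($moves as [$dx, $dy]) {
            $nx = $x + $dx;
            $ny = $y + $dy;
            if (!isset($grid[$ny][$nx]) || $grid[$ny][$nx] !== $colour) {
                continue; // off the image or a different colour
            }
            if ($nx === $startX && $ny === $startY && $steps >= 3) {
                return true; // back at the start after a real trip
            }
            if (isset($visited["$nx,$ny"])) {
                continue; // don't walk backwards
            }
            $visited["$nx,$ny"] = true;
            $x = $nx;
            $y = $ny;
            $steps++;
            $moved = true;
            break;
        }
        if (!$moved) {
            return false; // dead end: this was just a line
        }
    }
    return false; // travelled too far
}

$square = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
];
$line = [[1, 1, 1, 1, 1]];

var_dump(traces_closed_loop($square, 1, 1));
var_dump(traces_closed_loop($line, 0, 0));
```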

Friday, March 21st, 2008

A custom floodfill routine

Last post I said to separate characters you simply need to flood fill them and calculate the extreme points to find out where to fit your rectangle around. Hahaha. I’ve spent the last few days trying to port optimized floodfill functions to php. Normally I’d just take the sane easy option and use pre-written code like GOCR but apparently I like pain.

The problem is the optimized bit. I can write a simple recursive floodfill function that calls itself until it’s done but I have no idea how to write something that will be reasonably fast. The more complicated captchas will require a fair bit of speed because you will be thinking about cracking a fair few of them in a period of time. This site is where I eventually found a simple routine that worked. My issue was I used another routine, ported it, and then it broke. Miserably. After filling only two lines.

Here is my code. First it loads in an image of a captcha. It then scans two lines along the horizontal axis, one at 1/4 of the way down the picture and one at 3/4 of the way down. When it hits a character it floodfills it. The floodfill function returns the extreme positions of pixels which gives us a rectangle around that letter.

<?php

function floodFillScanlineStack($image, $x, $y)
{
// the colour we are shading in - black letters
$oldColour = 0;
// the colour we want to shade the letters in - red just because we can
$fillColour = imagecolorallocate($image, 255, 0, 0);

// we need the image width & height
$w = imagesx($image);
$h = imagesy($image);

// set the rectangle co-ords
$rectangle = array("x1" => $x, "x2" => $x, "y1" => $y, "y2" => $y);

// nothing to fill, so hand back the one-pixel rectangle
if($oldColour == $fillColour) return $rectangle;

$stack = array();
$stack[] = array("x" => $x, "y" => $y);

while(count($stack)>0)
{
$pos = array_pop($stack);
$x = $pos['x'];
$y = $pos['y'];

$y1 = $y;
while($y1 >= 0 && imagecolorat($image, $x, $y1) == $oldColour) $y1--;
$y1++;
$spanLeft = 0;
$spanRight = 0;
while($y1 < $h && imagecolorat($image, $x, $y1) == $oldColour )
{

// here we set the pixel colour
// use these to find our rectangle around the letter
imagesetpixel($image, $x, $y1, $fillColour);
if($x<$rectangle['x1'])
$rectangle['x1'] = $x;
if($y1<$rectangle['y1'])
$rectangle['y1'] = $y1;
if($x>$rectangle['x2'])
$rectangle['x2'] = $x;
if($y1>$rectangle['y2'])
$rectangle['y2'] = $y1;

if($spanLeft==0 && $x > 0 && imagecolorat($image, $x - 1, $y1) == $oldColour)
{
$stack[] = array("x" => $x - 1, "y" => $y1);
$spanLeft = 1;
}
else if($spanLeft==1 && $x > 0 && imagecolorat($image, $x - 1, $y1) != $oldColour)
{
$spanLeft = 0;
}
// $x + 1 must stay inside the image, so the limit is $w - 1
if($spanRight==0 && $x < $w - 1 && imagecolorat($image, $x + 1, $y1) == $oldColour)
{
$stack[] = array("x" => $x + 1, "y" => $y1);
$spanRight = 1;
}
else if($spanRight==1 && $x < $w - 1 && imagecolorat($image, $x + 1, $y1) != $oldColour)
{
$spanRight = 0;
}
$y1++;
}
}

return $rectangle;
}

function floodfill_char($image, $x)
{
if((imagecolorat($image, $x, 12)==0))
return floodFillScanlineStack($image, $x, 12);

if((imagecolorat($image, $x, 38)==0))
return floodFillScanlineStack($image, $x, 38);
}

function split_chars_along_vertical($image)
{
$w = imagesx($image);
$h = imagesy($image);

/* $rgb = imagecolorat($img, $x, $y);
$r += $rgb >> 16;
$g += $rgb >> 8 & 255;
$b += $rgb & 255; */

// scan along each vertical line looking for black pixels
// we’ll only scan two lines of pixels to save time. Both along the center
// split slightly apart
$letters = array();
for($index=0; $index<$w; $index++)
{
// check two lines of pixels one at 12 down, one at 38 down
// the picture is 50 pixels tall by the way
if((imagecolorat($image, $index, 12)==0) || (imagecolorat($image, $index, 38)==0))
{
// fill the character and return a rectangle around the image
$rectangle = floodfill_char($image, $index);

// pull this letter out into a new image
$singleLetter = imagecreatetruecolor($rectangle['x2'] - $rectangle['x1'] + 1,
$rectangle['y2'] - $rectangle['y1'] + 1);
imagecopy($singleLetter, $image, 0, 0, $rectangle['x1'], $rectangle['y1'],
$rectangle['x2'] - $rectangle['x1'] + 1,
$rectangle['y2'] - $rectangle['y1'] + 1);
$letters[] = $singleLetter;

// find the next character
$index = $rectangle['x2']+1;
}
}

return $letters;
}

$image = @imagecreatefrompng('82.clean.png');
if ($image == false) { die ('Unable to open image'); }

$letters = split_chars_along_vertical($image);

// dump the first letter to the screen
header("content-type:image/png");
imagepng($letters[0]);

?>
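The script above hard-codes the scan rows as 12 and 38 because the picture is 50 pixels tall. If your captchas are a different size, a tiny helper (the name is made up) could work the rows out from the height instead. Note that integer division gives 12 and 37 for a 50-pixel image, so it lands one row off the hand-picked 38.

```php
<?php
// Pick the two scan rows at 1/4 and 3/4 of the image height.
// (The original script hard-codes 12 and 38 for a 50-pixel-tall image.)
function scan_rows(int $height): array
{
    return [intdiv($height, 4), intdiv(3 * $height, 4)];
}

print_r(scan_rows(50));
```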

To run through the code and show how it works I made this neat little gif. To be honest it’s probably just a waste of my bandwidth but it looks pretty cool.

Floodfill gif

Now I can finally sort this damn neural network out :P . You just know something is going to break. Of course that’s the fun part, right? Anyone?

Scripts to separate characters

Friday, March 21st, 2008

Separating characters manually

Eli at Bluehatseo.com is spending a week breaking captchas after they ddos’d his server. I’m interested.

I’m going to explain how I would separate the characters in an image after removing the noise and lines and stuff. It’s funny this post is going to be pretty damn short :D but I need time to work out the post with neural networks. This will just be the theory of separating characters and I’ll post code next time.

So first things first we need to scan through the vertical lines until we hit a pixel. From that pixel we flood fill. Yep. That’s it. As long as it’s a custom flood fill routine it will give us a start and end point from which we can fit a polygon or rectangle around the letter and extract it out. Now just de-rotate it and throw it at the neural network I don’t have yet.
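As a bare-bones sketch of that theory (no GD, no speed tricks), here is a simple stack-based flood fill over a plain 2D array that returns the rectangle around whatever it filled. The array layout ([y][x], 0 for background, 1 for ink), the function names and the fill value 2 are assumptions for the example.

```php
<?php
// Simple stack-based 4-way flood fill over a hypothetical 2D array indexed [y][x].
// Fills every connected pixel equal to $old with $new and returns the
// bounding rectangle of the filled area (or null if nothing was filled).
function flood_fill_bounds(array &$grid, int $x, int $y, int $old, int $new): ?array
{
    if ($old === $new || !isset($grid[$y][$x]) || $grid[$y][$x] !== $old) {
        return null;
    }
    $box = ['x1' => $x, 'y1' => $y, 'x2' => $x, 'y2' => $y];
    $stack = [[$x, $y]];
    while ($stack) {
        [$cx, $cy] = array_pop($stack);
        if (!isset($grid[$cy][$cx]) || $grid[$cy][$cx] !== $old) {
            continue; // off the image, already filled, or not ink
        }
        $grid[$cy][$cx] = $new;
        // stretch the rectangle to include this pixel
        $box['x1'] = min($box['x1'], $cx);
        $box['y1'] = min($box['y1'], $cy);
        $box['x2'] = max($box['x2'], $cx);
        $box['y2'] = max($box['y2'], $cy);
        // queue the four neighbours
        $stack[] = [$cx + 1, $cy];
        $stack[] = [$cx - 1, $cy];
        $stack[] = [$cx, $cy + 1];
        $stack[] = [$cx, $cy - 1];
    }
    return $box;
}

// Scan the vertical lines left to right; the first ink pixel starts the fill.
function first_letter_box(array $grid, int $ink = 1): ?array
{
    $h = count($grid);
    $w = count($grid[0]);
    for ($x = 0; $x < $w; $x++) {
        for ($y = 0; $y < $h; $y++) {
            if ($grid[$y][$x] === $ink) {
                return flood_fill_bounds($grid, $x, $y, $ink, 2);
            }
        }
    }
    return null;
}

$grid = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 1, 0],
];
print_r(first_letter_box($grid));
```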

Tuesday, March 18th, 2008

Bluehatseo goes down

So Bluehatseo goes down because he’s been ddos’d. I’m scouring the Internet and I find this below. Now I don’t know whether this is made up crap or for real so don’t blame me if it’s not, I’m just saying what I see:

  1. (8:25:42 PM) youkn0w: this si being done to eli, or bluehatseo.com to be specific

  2. (8:25:58 PM) youkn0w: he is “outing” too many people and the people’s methods, which are meant to be private

  3. (8:26:13 PM) youkn0w: messing with their business, you know

  4. (8:26:24 PM) windfox: very interesting

  5. (8:26:44 PM) kaveman: youkn0w

  6. (8:26:47 PM) kaveman: he aint outing no one

  7. (8:27:07 PM) youkn0w: not the people themselves, but the people’s methods

  8. (8:27:33 PM) youkn0w: he was given some information which was meant to be private, and he ended up posting it on his website

  9. (8:27:38 PM) kaveman: what? breaking a captcha

  10. (8:27:49 PM) youkn0w: so a few people got mad, and thats why his site is down

  11. (8:27:58 PM) Smaxy [n=Smaxildo@12-184-75-66.att-inc.com] entered the room.

  12. (8:28:18 PM) youkn0w: there’s a few specifics in that article which he stole from other people, but he posted things in the past

  13. (8:28:30 PM) kaveman: hwat that captcha breaking post? it was a guest post

  14. (8:28:38 PM) youkn0w: a proxy-captcha-solver thing to be specific

  15. (8:29:05 PM) youkn0w: if he was around, he would know what i am referring to

  16. (8:29:31 PM) kaveman: so what, still doesnt deserve the attacks

  17. (8:29:35 PM) adam-_-: yeh but (my guess) people who read it either a) already know about it b) are gonna be too lazy to get off their arse and implement it

If this is true some folks need to grow up. I mean come on, life’s just a game, why do they feel like they need to pull down someone’s server because he posts something they don’t like. Maybe they’re just jealous that Eli has a decent site with fantastic loyal readers.

According to this posted stuff I’ve gone down real well too:

  1. Cdogg: which is the article in question the phpbb captcha breaker?

  2. (8:49:31 PM) windfox: this one? http://64.233.167.104/search?q=cache:Vgqq8RCE_FAJ:www.bluehatseo.com/user-contributed-captcha-breaking-w-phpbb2-example/+%22bluehatseo.com%22+captcha&hl=en&ct=clnk&cd=1&gl=us&client=firefox-a

  3. (8:49:38 PM) Gnaser: I’m guessing the one in which he mentions setting up a web proxy to get kids to break the captchas

  4. (8:49:52 PM) youkn0w: not the whole article, but pieces in it. in addition to things he has posted int he past

  5. (8:49:59 PM) Tobsn: that guy is so stupid

  6. (8:50:01 PM) youkn0w: in the*

  7. (8:50:27 PM) d0nuts [n=nnscript@71.231.1.74] entered the room.

  8. (8:50:38 PM) Tobsn: funny thing is, hes so stupid he even converts the pic into pnm

  9. (8:50:39 PM) kaveman: what things he posted in the past

  10. (8:51:02 PM) Tobsn: nothing

  11. (8:51:04 PM) Tobsn: hes a moron

  12. (8:51:09 PM) trophaeum: youkn0w, you realize this will just get this article more attention?

  13. (8:51:10 PM) kaveman: cause if its over that captcha thats some gay ass shit

  14. (8:51:12 PM) kaveman: grow up

  15. (8:51:22 PM) Tobsn: http://www.darkseoprogramming.com/

  16. (8:51:40 PM) foucist: Tobsn: there’s an OCR for pnm files tho ?

  17. (8:51:49 PM) Tobsn: …

  18. (8:51:58 PM) YoungMaster [n=69@S0106000e08ed0133.vc.shawcable.net] entered the room.

  19. (8:51:58 PM) Tobsn: okay we have a new dummy of the month

  20. (8:51:59 PM) Tobsn: ;)

So before people all start taking chunks outta me what I will say is this. I ain’t even close to the best blackhat SEO, but I know a fair bit about programming so I chucked this site up. I’m learning more and more every day and I hope this site helps some folks out. Yeah, you can probably dig this stuff up from over the net in different places but here it is organised with downloadable zips to help if you aren’t so good at coding. Ah damn it, read it if you want and don’t if you don’t.

Hope you guys are keeping busy and enjoying life ;)

Update: http://www.wickedfire.com/shooting-shit/25354-eli-working-white-house.html youkn0w tells us why it’s happening. I found the text here http://pastebin.ca/943272 and I didn’t post it earlier because I didn’t know if Eli would want what is highlighted as his pic broadcast widely, but heck it’s on WickedFire. Let’s all hope this gets sorted.

Saturday, March 15th, 2008

PHP Preg_match without the BS

Your guide to becoming a preg_match/regex genius. Or just coding like me :D . Seriously though, when all you want to do is prototype a script and you need to match a string, regular expressions seem complicated. To write a proper regular expression there are a huge number of symbols you can use which make the expression more efficient all around.

The only symbols I ever really remember are:

(.*) - Matches anything in the same way as searching for a file by the name *.*
(.*?) - Same as above except it defaults to the fewest number of characters possible.

So if we had the phrase:

I am a pretty crazy guy. I am a pretty crazy guy.

and our regular expression was (note: all regular expressions start and end with / unless you know what you’re doing and you have another plan):

/I am a(.*)guy/

our output would be:

 pretty crazy guy. I am a pretty crazy

now if we did this:

/I am a(.*?)guy/

our output is:

 pretty crazy

Welcome to the lazy way of programming ;)

Just to further clarify the full php would be:

preg_match("/I am a(.*)guy/", "I am a pretty crazy guy. I am a pretty crazy guy.", $matches);
echo $matches[1];
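Here are both versions side by side so you can see the difference for yourself, plus preg_match_all, which grabs every lazy match instead of just the first. The variable names are just for this example. One thing to keep in mind: the dot doesn't match newlines unless you add the s modifier after the closing /.

```php
<?php
$text = 'I am a pretty crazy guy. I am a pretty crazy guy.';

// greedy: (.*) grabs as much as it can, right up to the LAST "guy"
preg_match('/I am a(.*)guy/', $text, $greedy);

// lazy: (.*?) stops at the FIRST "guy"
preg_match('/I am a(.*?)guy/', $text, $lazy);

// preg_match_all collects every lazy match in $all[1]
preg_match_all('/I am a(.*?)guy/', $text, $all);

var_dump($greedy[1]);
var_dump($lazy[1]);
var_dump(count($all[1]));
```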

Saturday, March 15th, 2008

Replacing GOCR part 1

This is related to my guest post on Eli’s BlueHatSeo on cracking phpBB2 captcha. Many thanks to him for posting it.

Ok so maybe you’ve had issues using GOCR and it won’t run on your server or <insert problem here>. Well anyway let’s remove GOCR from the program altogether. I’m going to insert a neural network in its place. Now the advantages of a neural network are that they are relatively simple to understand and code, especially with the help of the libraries available for many different languages, and depending on the algorithm they can be faster. However I was reading a whitepaper the other day comparing a complex OCR recognition algorithm against a neural network, and the OCR algorithm performed quite a bit better. So it should be interesting.

Basic Background

I guess I should put in these basics just in case you’re not familiar with neural networks. A neural network is comprised of neurons connected to each other in different sequences. A neuron has dendrites which accept signals from other neurons and an axon which sends a signal depending on the signals received from these other neurons. The neuron will send a signal when the inputs from the other neurons add up to reach a threshold. Now what makes the neuron really powerful is that each dendrite can be weighted so that the signal received can be more or less likely to make the neuron send a signal down its axon. Some dendrites could also inhibit the sending of a signal down the axon of the neuron, depending on the library we use.

We can train the network by adjusting the weighting of each of these dendrites until the outputs at the end of the network give us the correct answer. I plan to use the simplest form of network called a feed forward network. Good old wikipedia explains. The input neurons will be the pixels of an individual character.
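To make the neuron idea a bit more concrete, here is a minimal sketch of one neuron as described above: a weighted sum that "fires" when it reaches a threshold. It is a simplification (most libraries use a smooth activation function rather than a hard threshold), and the function name and numbers are made up for the example.

```php
<?php
// One artificial neuron: add up the weighted inputs and fire when the
// total reaches the threshold. A negative weight acts as an inhibitor.
function neuron_fires(array $inputs, array $weights, float $threshold): bool
{
    $sum = 0.0;
    foreach ($inputs as $i => $signal) {
        $sum += $signal * $weights[$i];
    }
    return $sum >= $threshold;
}

$weights = [0.6, 0.6, -1.0]; // the third "dendrite" inhibits the neuron

var_dump(neuron_fires([1, 1, 0], $weights, 1.0)); // two excitatory inputs active
var_dump(neuron_fires([1, 1, 1], $weights, 1.0)); // the inhibitor is active too
```

Training is then just a matter of nudging those weights until the outputs come out right.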

My Plan

For this post literally all I’m going to do is compile some training data in preparation for our neural network. Using GOCR I did it all by hand and it wasn’t too slow. But I don’t want to sit here retraining the network by hand every time I make a change to the number of neurons or something in it so I’m going to get around 10,000 phpBB2 captchas with their solved text. Assuming 5 characters in each captcha that is around 50,000 characters to train the network with. I’m figuring that’ll be enough (Hopefully it will be overkill). It’ll be interesting to see how many it prefers because obviously on a non-open source captcha like google you can’t just grab captchas this easily.

Obviously I highly recommend you run the program that grabs these captchas on your home PC and not on your shared hosting because I’m not sure how much CPU it’s going to “borrow”, and I hear shared hosts really don’t appreciate programs that hog CPU time (hmmm, phproxy?). It brought my PC to its knees (Although I did try to open a few programs at the same time as running it :D ).

Getting the Captcha Producing Code

I obviously need a script to extract these captchas so first things first let’s find out how phpBB2 works. Using a phpBB2 you’ve installed on your home WAMP server you open it up in Firefox and browse to register a new user. Looking at the address bar you see that we are in the file profile.php with this string after it, “?mode=register&agreed=true”. So we know that two variables are set, “mode” and “agreed”. We’re looking for something in profile.php that checks if mode is equal to register as set in the address bar of Firefox.

We see the lines:

$mode = ( isset($HTTP_GET_VARS['mode']) ) ? $HTTP_GET_VARS['mode'] : $HTTP_POST_VARS['mode'];
$mode = htmlspecialchars($mode);

So that just put the word “register” into the variable $mode. Scroll down:

else if ( $mode == 'editprofile' || $mode == 'register' )
{
if ( !$userdata['session_logged_in'] && $mode == 'editprofile' )
{
redirect(append_sid("login.$phpEx?redirect=profile.$phpEx&mode=editprofile", true));
}

include($phpbb_root_path . 'includes/usercp_register.'.$phpEx);
exit;
}

This checks if mode is equal to “editprofile” or “register” and, if it is (and we haven’t been redirected to the login page), includes the file “includes/usercp_register.php”. This literally means run all the code in usercp_register.php. So open this file up now.

There’s a few ways to find what we’re looking for but scanning over the code I noticed they called the captcha a confirmation code a lot of the time, so I just opened up a search box and typed confirm into it and searched through the file until I hit something interesting:

// Generate the required confirmation code
// NB 0 (zero) could get confused with O (the letter) so we make change it
$code = dss_rand();
$code = substr(str_replace('0', 'Z', strtoupper(base_convert($code, 16, 35))), 2, 6);

$confirm_id = md5(uniqid($user_ip));

$sql = 'INSERT INTO ' . CONFIRM_TABLE . " (confirm_id, session_id, code)
VALUES ('$confirm_id', '". $userdata['session_id'] . "', '$code')";
if (!$db->sql_query($sql))
{
message_die(GENERAL_ERROR, 'Could not insert new confirm code information', '', __LINE__, __FILE__, $sql);
}

unset($code);

$confirm_image = '<img src="' . append_sid("profile.$phpEx?mode=confirm&id=$confirm_id") . '" alt="" title="" />';

Well, this code makes our captcha. It puts the sequence of letters and numbers it wants into $code, then produces a unique ID seeded from the user’s IP address in $confirm_id. After all that it inserts it all into the database, and the last line builds the <img> tag that fetches the image.
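If you want to play with the code-generating line outside phpBB2, here is a standalone sketch of just that step. dss_rand() only exists inside phpBB2, so I feed in a slice of an md5 hash instead; that stand-in assumes dss_rand() returns a hex string, which the base_convert(…, 16, 35) call suggests. The function name is made up. Also be aware that base_convert loses precision on numbers this large, which doesn't matter for random-looking codes.

```php
<?php
// Standalone version of phpBB2's code line.
// $hexSeed stands in for dss_rand(), which is assumed to return a hex string.
function make_confirm_code(string $hexSeed): string
{
    // hex -> base 35 (digits 0-9 and a-y), then upper case
    $base35 = strtoupper(base_convert($hexSeed, 16, 35));
    // swap zeros for Z so nobody confuses 0 with O, then take up to 6 characters
    return substr(str_replace('0', 'Z', $base35), 2, 6);
}

// a made-up seed: 16 hex characters cut out of an md5 hash
$code = make_confirm_code(substr(md5('example'), 4, 16));
echo $code, "\n";
```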

Of course we don’t need to know how it works we just need to copy this code into a new file which downloads and saves the solve for the captcha. Put this file in the same directory as your phpBB2 install, create a new directory called output/, and run it from the command line. I run it from the command line because it’s going to take a while to get all the captchas, and I can cancel it from the command line if it goes wrong, as well as the issue that it will timeout if you run it from your browser. Just use Start->Run, type cmd. Then type:

C:
cd\directoryofphpbb2install
php getcaptcha.php

I used two new scripts. The first one is basically the code seen above modified to quickly download as many captchas as possible, save it as getcaptcha.php in the same directory as your phpBB2 install:

<?php
////////////// START UP CODE ///////////////////////////////////////////
define('IN_PHPBB', true);
$phpbb_root_path = './';
include($phpbb_root_path . 'extension.inc');
include($phpbb_root_path . 'common.'.$phpEx);

$userdata = session_pagestart($user_ip, PAGE_PROFILE);
init_userprefs($userdata);

// we need to get the current directory so we can get the download address for the captcha
echo "Please enter address of phpbb2 install.\nFor instance http://localhost/phpBB2: ";
$phpbb2dir = fgets(STDIN);
// make sure there is no return on the end of the string
$phpbb2dir = str_replace("\r", "", $phpbb2dir);
$phpbb2dir = str_replace("\n", "", $phpbb2dir);

echo "Started\n";

// let's loop over and get all of our captchas
for($index=0; $index<10000; $index++)
{

////////////// OUR BORROWED CODE ///////////////////////////////////////////

// Generate the required confirmation code
// NB 0 (zero) could get confused with O (the letter) so we make change it
$code = dss_rand();
$code = substr(str_replace('0', 'Z', strtoupper(base_convert($code, 16, 35))), 2, 6);

$confirm_id = md5(uniqid("127.0.0.1"));

$sql = 'INSERT INTO ' . CONFIRM_TABLE . " (confirm_id, session_id, code)
VALUES ('$confirm_id', 'fakesessionid', '$code')";
if (!$db->sql_query($sql))
{
message_die(GENERAL_ERROR, 'Could not insert new confirm code information', '', __LINE__, __FILE__, $sql);
}

$confirm_image = $phpbb2dir . append_sid("/getcaptcha.confirm.php?id=$confirm_id");

////////////// DOWNLOAD CAPTCHA CODE ///////////////////////////////////////////
// write the captcha image to a file
$captcha = file_get_contents($confirm_image) or die("Error downloading captcha");
$fp = fopen("output/$index.png", "w") or die("Can't create output file");
fwrite($fp, $captcha) or die("Error writing to file");
fclose($fp);
$fp = fopen("output/$index.txt", "w") or die("Can't create output file");
fwrite($fp, $code) or die("Error writing to file");
fclose($fp);

echo "Written $index\n";
}

echo "Done.\n";
?>

The second script is necessary because when we download the captcha with file_get_contents(…) we don’t have the cookies that were set by phpBB2 in Firefox. Called like this, file_get_contents(…) sends no cookies, and we need them because phpBB2 uses cookies to track which captcha it has asked you. The file includes/usercp_confirm.php from phpBB2 is what produces the captcha. We copy it to the same place as getcaptcha.php, naming it getcaptcha.confirm.php and change these lines:

if ( !defined('IN_PHPBB') )
{
die('Hacking attempt');
exit;
}

to this:

define('IN_PHPBB', true);
$phpbb_root_path = './';
include($phpbb_root_path . 'extension.inc');
include($phpbb_root_path . 'common.'.$phpEx);
$userdata['session_id'] = 'fakesessionid';

This tricks the captcha code into producing a captcha for a session named ‘fakesessionid’.
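For what it's worth, there is another route I haven't tried here: file_get_contents can be handed a stream context that carries a Cookie header, so in principle you could send a session cookie instead of patching the confirm script. A minimal sketch follows; the helper name is made up and the cookie name phpbb2mysql_sid is an assumption about your board's cookie settings, so check your own install.

```php
<?php
// Build a stream context that sends the given cookies with an HTTP request.
function cookie_context(array $cookies)
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . rawurlencode($value);
    }
    return stream_context_create([
        'http' => ['header' => 'Cookie: ' . implode('; ', $pairs) . "\r\n"],
    ]);
}

// The cookie name below is an assumption; yours may differ.
$context = cookie_context(['phpbb2mysql_sid' => 'fakesessionid']);

// Then pass it as the third argument when downloading:
// $captcha = file_get_contents($confirm_image, false, $context);
```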

Well, that’s it. I have my set of training data ready. Hopefully you got all that to work without too many problems. Below is my compilation of captchas; it’s in bzip2 format because I needed to compress it heavily and I can never get rar to work on Linux. I also removed the backgrounds etc from them as outlined in my guest post on BlueHatSeo. And I also deleted the 10,000 records for captchas you now have in your MySQL database :P .

10,000 phpBB2 captchas without noise

Scripts to download captchas

I was looking at my GOCR database as well and it looks like using this massive set of captchas we should be able to instantly train GOCR to recognise phpBB2 without having to correct the errors that it makes. Over the weekend I’m going to look at both of these things.

Thursday, March 13th, 2008