Replacing GOCR part 1

This is related to my guest post on Eli’s BlueHatSeo on cracking phpBB2 captcha. Many thanks to him for posting it.

Ok so maybe you’ve had issues using GOCR and it won’t run on your server or <insert problem here>. Well anyway, let’s remove GOCR from the program altogether. I’m going to insert a neural network in its place. The advantages of a neural network are that it is relatively simple to understand and to code, especially with the help of the libraries available for many different languages, and depending on the algorithm it can be faster. However, I was reading a whitepaper the other day comparing a complex OCR algorithm against a neural network, and the OCR algorithm performed quite a bit better. So it should be interesting.

Basic Background

I guess I should put in these basics just in case you’re not familiar with neural networks. A neural network is made up of neurons connected to each other in different arrangements. A neuron has dendrites, which accept signals from other neurons, and an axon, which sends a signal depending on the signals received from those other neurons. The neuron sends a signal when the inputs from the other neurons add up to reach a threshold. What makes the neuron really powerful is that each dendrite can be weighted, so that the signal it receives is more or less likely to make the neuron send a signal down its axon. Some dendrites can also inhibit the sending of a signal down the axon, depending on the library we use.
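To make the threshold idea concrete, here is a minimal sketch of a single neuron in PHP. The function name, weights and threshold are made up for illustration and don't come from any library; a negative weight plays the part of an inhibiting dendrite.

```php
<?php
// Hypothetical helper: one threshold neuron.
// $inputs and $weights are parallel arrays; a negative weight inhibits.
function neuron_fires(array $inputs, array $weights, float $threshold): bool
{
    $sum = 0.0;
    foreach ($inputs as $i => $signal) {
        $sum += $signal * $weights[$i];
    }
    // The axon only "fires" once the weighted inputs reach the threshold.
    return $sum >= $threshold;
}

// Two excitatory dendrites and one inhibitory one (illustrative values).
$weights = [0.6, 0.5, -0.8];
var_dump(neuron_fires([1, 1, 0], $weights, 1.0)); // inhibitor silent
var_dump(neuron_fires([1, 1, 1], $weights, 1.0)); // inhibitor active
```

With the inhibitor silent the weighted sum clears the threshold and the neuron fires; switch the inhibitor on and the sum drops below it.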

We can train the network by adjusting the weighting of each of these dendrites until the outputs at the end of the network give us the correct answer. I plan to use the simplest form of network, called a feed forward network. Good old Wikipedia explains it. The input neurons will be the pixels of an individual character.
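As a rough picture of what "feed forward" means, here is a small sketch of a forward pass in PHP. It is not the network I'll end up using: the layer layout, the sigmoid activation and the hand-picked weights are all assumptions for illustration, and there is no training in it at all.

```php
<?php
// Sketch of a feed-forward pass. Names and layout are illustrative only.
function sigmoid(float $x): float
{
    return 1.0 / (1.0 + exp(-$x));
}

// $layers is a list of layers; each layer is a list of neurons; each neuron
// is an array of weights (one per input) with a bias as its last element.
function feed_forward(array $inputs, array $layers): array
{
    $signal = $inputs;
    foreach ($layers as $layer) {
        $next = [];
        foreach ($layer as $neuron) {
            $sum = array_pop($neuron); // start from the bias
            foreach ($neuron as $i => $weight) {
                $sum += $weight * $signal[$i];
            }
            $next[] = sigmoid($sum);
        }
        $signal = $next; // this layer's outputs feed the next layer
    }
    return $signal;
}

// A 2x2 "character" flattened to four pixel inputs (1 = ink, 0 = background).
$pixels = [1, 0, 0, 1];
$layers = [
    [[0.5, -0.5, -0.5, 0.5, 0.0], [-0.5, 0.5, 0.5, -0.5, 0.0]], // hidden layer
    [[1.0, -1.0, 0.0]],                                          // output layer
];
print_r(feed_forward($pixels, $layers));
```

Signals only ever travel from the pixels towards the output, never backwards, which is all the name means. Training would be the job of nudging those weight numbers until the output matches the known character.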

My Plan

For this post literally all I’m going to do is compile some training data in preparation for our neural network. With GOCR I did the training all by hand and it wasn’t too slow. But I don’t want to sit here retraining the network by hand every time I change the number of neurons or something else in it, so I’m going to get around 10,000 phpBB2 captchas along with their solved text. Assuming 5 characters in each captcha, that is around 50,000 characters to train the network with. I’m figuring that’ll be enough (hopefully it will be overkill). It’ll be interesting to see how many it actually needs, because obviously on a non-open source captcha like Google’s you can’t just grab captchas this easily.

Obviously I highly recommend you run the program that grabs these captchas on your home PC and not on your shared hosting because I’m not sure how much CPU it’s going to “borrow”, and I hear shared hosts really don’t appreciate programs that hog CPU time (hmmm, phproxy?). It brought my PC to its knees (Although I did try to open a few programs at the same time as running it :D ).

Getting the Captcha Producing Code

I obviously need a script to extract these captchas, so first things first, let’s find out how phpBB2 works. Using a copy of phpBB2 installed on your home WAMP server, open it up in Firefox and browse to register a new user. Looking at the address bar you see that we are in the file profile.php with this string after it: “?mode=register&agreed=true”. So we know that two variables are set, “mode” and “agreed”. We’re looking for something in profile.php that checks whether mode is equal to register, as set in the address bar of Firefox.

We see the lines:

$mode = ( isset($HTTP_GET_VARS['mode']) ) ? $HTTP_GET_VARS['mode'] : $HTTP_POST_VARS['mode'];
$mode = htmlspecialchars($mode);

So that just put the word “register” into the variable $mode. Scroll down:

else if ( $mode == 'editprofile' || $mode == 'register' )
{
    if ( !$userdata['session_logged_in'] && $mode == 'editprofile' )
    {
        redirect(append_sid("login.$phpEx?redirect=profile.$phpEx&mode=editprofile", true));
    }

    include($phpbb_root_path . 'includes/usercp_register.'.$phpEx);
    exit;
}

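This checks if mode is equal to “editprofile” or “register” and, in either case, includes the file “includes/usercp_register.php” (an “editprofile” request from someone who isn’t logged in gets redirected to the login page first). Including a file literally means running all the code in usercp_register.php. So open this file up now.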

There are a few ways to find what we’re looking for, but scanning over the code I noticed they called the captcha a confirmation code a lot of the time, so I just opened up a search box, typed confirm into it and searched through the file until I hit something interesting:

// Generate the required confirmation code
// NB 0 (zero) could get confused with O (the letter) so we make change it
$code = dss_rand();
$code = substr(str_replace('0', 'Z', strtoupper(base_convert($code, 16, 35))), 2, 6);

$confirm_id = md5(uniqid($user_ip));

$sql = 'INSERT INTO ' . CONFIRM_TABLE . " (confirm_id, session_id, code)
    VALUES ('$confirm_id', '". $userdata['session_id'] . "', '$code')";
if (!$db->sql_query($sql))
{
    message_die(GENERAL_ERROR, 'Could not insert new confirm code information', '', __LINE__, __FILE__, $sql);
}

unset($code);

$confirm_image = '<img src="' . append_sid("profile.$phpEx?mode=confirm&id=$confirm_id") . '" alt="" title="" />';

Well, this code makes our captcha. It puts the sequence of letters and numbers it wants into $code, then generates a unique ID in $confirm_id, seeded with the user’s IP address. After that it stores both in the database along with the session ID, and the last line builds the <img> tag that the browser uses to fetch the captcha image from profile.php?mode=confirm.
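If you want to see the code-generation line taken apart, here is a sketch that splits it into steps. dss_rand() belongs to phpBB2, so I’m standing in a 16-character hex string for its return value; that length is my assumption, and the only thing the snippet above tells us for sure is that the value gets fed to base_convert as base 16. The function name make_confirm_code is made up.

```php
<?php
// Sketch of the code-generation step, one operation per line.
// $hex stands in for the value phpBB2's dss_rand() would return (assumption).
function make_confirm_code(string $hex): string
{
    $base35 = base_convert($hex, 16, 35);     // digits 0-9 and a-y
    $upper  = strtoupper($base35);
    $noZero = str_replace('0', 'Z', $upper);  // avoid 0 vs O confusion
    return substr($noZero, 2, 6);             // skip 2 chars, keep up to 6
}

// A stand-in random hex string, not phpBB2's own generator.
$fakeRand = substr(md5(uniqid('', true)), 4, 16);
echo make_confirm_code($fakeRand), "\n";
```

Base 35 only uses the digits 0-9 and the letters a-y, so after the uppercase and the 0-to-Z swap every character in the result is one of 1-9 or A-Z.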

Of course we don’t need to know exactly how it works; we just need to copy this code into a new file which downloads each captcha and saves its solution. Put this file in the same directory as your phpBB2 install, create a new directory called output/, and run it from the command line. I run it from the command line because it’s going to take a while to get all the captchas and I can cancel it there if it goes wrong, and because it will time out if you run it from your browser. Just use Start->Run and type cmd. Then type:

C:
cd\directoryofphpbb2install
php getcaptcha.php

I used two new scripts. The first one is basically the code seen above, modified to quickly download as many captchas as possible. Save it as getcaptcha.php in the same directory as your phpBB2 install:

<?php
////////////// START UP CODE ///////////////////////////////////////////
define('IN_PHPBB', true);
$phpbb_root_path = './';
include($phpbb_root_path . 'extension.inc');
include($phpbb_root_path . 'common.'.$phpEx);

$userdata = session_pagestart($user_ip, PAGE_PROFILE);
init_userprefs($userdata);

// we need the address of the install so we can build the download address for the captcha
echo "Please enter address of phpbb2 install.\nFor instance http://localhost/phpBB2: ";
$phpbb2dir = fgets(STDIN);
// make sure there is no carriage return or newline on the end of the string
$phpbb2dir = str_replace("\r", "", $phpbb2dir);
$phpbb2dir = str_replace("\n", "", $phpbb2dir);

echo "Started\n";

// let's loop over and get all of our captchas
for($index=0; $index<10000; $index++)
{

    ////////////// OUR BORROWED CODE ///////////////////////////////////////////

    // Generate the required confirmation code
    // NB 0 (zero) could get confused with O (the letter) so we make change it
    $code = dss_rand();
    $code = substr(str_replace('0', 'Z', strtoupper(base_convert($code, 16, 35))), 2, 6);

    $confirm_id = md5(uniqid("127.0.0.1"));

    $sql = 'INSERT INTO ' . CONFIRM_TABLE . " (confirm_id, session_id, code)
        VALUES ('$confirm_id', 'fakesessionid', '$code')";
    if (!$db->sql_query($sql))
    {
        message_die(GENERAL_ERROR, 'Could not insert new confirm code information', '', __LINE__, __FILE__, $sql);
    }

    $confirm_image = $phpbb2dir . append_sid("/getcaptcha.confirm.php?id=$confirm_id");

    ////////////// DOWNLOAD CAPTCHA CODE ///////////////////////////////////////////
    // write the captcha image to a file
    $captcha = file_get_contents($confirm_image) or die("Error downloading captcha");
    $fp = fopen("output/$index.png", "w") or die("Can't create output file");
    fwrite($fp, $captcha) or die("Error writing to file");
    fclose($fp);
    // write the solution next to it
    $fp = fopen("output/$index.txt", "w") or die("Can't create output file");
    fwrite($fp, $code) or die("Error writing to file");
    fclose($fp);

    echo "Written $index\n";
}

echo "Done.\n";
?>

The second script is necessary because when we download the captcha with file_get_contents(…) we don’t have the cookies that were set by phpBB2 in Firefox. A plain file_get_contents(…) call sends no cookies, and we need them because phpBB2 uses the session cookie to track which captcha it has shown you. The file includes/usercp_confirm.php from phpBB2 is what produces the captcha. We copy it to the same place as getcaptcha.php, name it getcaptcha.confirm.php and change these lines:

if ( !defined('IN_PHPBB') )
{
    die('Hacking attempt');
    exit;
}

to this:

define('IN_PHPBB', true);
$phpbb_root_path = './';
include($phpbb_root_path . 'extension.inc');
include($phpbb_root_path . 'common.'.$phpEx);
$userdata['session_id'] = 'fakesessionid';

This tricks the captcha code into producing a captcha for a session named ‘fakesessionid’.
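As an aside on the cookie problem: PHP does let you hand file_get_contents(…) a stream context that carries a Cookie header, so that would be another way around it. I didn’t go that route, so treat this as an untested sketch. The function name and the cookie name example_sid are placeholders I made up, not phpBB2’s real cookie.

```php
<?php
// Sketch: build a stream context that sends a Cookie header with the request.
// Cookie names and values here are placeholders, not phpBB2's real ones.
function cookie_context(array $cookies)
{
    $parts = [];
    foreach ($cookies as $name => $value) {
        $parts[] = $name . '=' . rawurlencode($value);
    }
    return stream_context_create([
        'http' => ['header' => 'Cookie: ' . implode('; ', $parts) . "\r\n"],
    ]);
}

$context = cookie_context(['example_sid' => 'fakesessionid']);
// The context goes in as the third argument, for example:
// $image = file_get_contents($url, false, $context);
```

Hard-coding the session ID in a copy of the confirm script, as above, is cruder but it means you never have to work out which cookie phpBB2 expects.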

Well, that’s it. I have my set of training data ready. Hopefully you got all that to work without too many problems. Below is my compilation of captchas. It’s in bzip2 format because I needed to compress it heavily and I can never get rar to work on Linux. I also removed the backgrounds etc. from them, as outlined in my guest post on BlueHatSeo. And I also deleted the 10,000 captcha records you now have in your MySQL database :P .

10,000 phpBB2 captchas without noise

Scripts to download captchas
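If you grabbed your own set with getcaptcha.php, the output/ directory holds numbered pairs: N.png for the image and N.txt for its solution. Here is a small sketch for reading those pairs back in before training. The function name is made up, and it assumes nothing beyond that file layout.

```php
<?php
// Sketch: collect [image path, solution text] pairs from the output directory.
// Assumes the layout written by getcaptcha.php: N.png next to N.txt.
function load_training_pairs(string $dir): array
{
    $pairs = [];
    foreach (glob($dir . '/*.txt') ?: [] as $txtFile) {
        $pngFile = substr($txtFile, 0, -4) . '.png';
        if (!is_file($pngFile)) {
            continue; // skip solutions whose image is missing
        }
        $pairs[] = [
            'image' => $pngFile,
            'code'  => trim(file_get_contents($txtFile)),
        ];
    }
    return $pairs;
}

$pairs = load_training_pairs('output');
echo count($pairs), " captchas ready for training\n";
```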

I was looking at my GOCR database as well and it looks like using this massive set of captchas we should be able to instantly train GOCR to recognise phpBB2 without having to correct the errors that it makes. Over the weekend I’m going to look at both of these things.

7 Responses to “Replacing GOCR part 1”

  1. online tv Says:

    Hmm… U must have spent a lot of time for this. Would u mind if I follow U???

  2. Jez Says:

    I have never done this kind of programming, but I worked with someone who used “AI / NN” to identify the author of a book… then did another to predict racing results. I’m sure you’re familiar with this stuff….

    What interested me most was the program to identify book authors. I am sure you are familiar with the output synonym generators / re-writers currently produce… if you were able to use this technology to create a high quality article re-writer I think you would enjoy a pretty nice lifestyle ;-)

  3. Harry Says:

    I remember reading somewhere that by compressing music, I think it was using BZip2 compression, they could identify the genre of music. It sounds like a similar sort of process.

    It sounds like the program that identified book authors might make a great article “rater”, it’s just an interesting problem of how do you reverse its function to produce articles.

    I always wonder when article re-writers will become popular software for students, and how they will identify them in the future as they improve.

  4. Jez Says:

    I don’t mean create new articles, but mash them up enough to get indexed properly, i.e. not show up as dupe.

    It would only be a matter of “learning” when it is appropriate to use a particular synonym… the problem with rewriters I have seen is that they choose synonyms which are out of context / grammatically incorrect, making the spun text easily identifiable…

  5. Harry Says:

    Ok here’s why I think it won’t work exactly as you hope. These neural networks can identify the author of a book but they obviously can’t write a book in the style of that author. There are simply too many possibilities. It’s like you can’t get a password from its hash. You have to test passwords against the hash until you find the correct one.

    I didn’t mean produce new articles either. But the only way I see it working is if you mash up an article and then rate the entire thing, and if it fails try again.

    Don’t get me wrong you might be right, but I just can’t see it working in my head at the moment. I’m going through possible algorithms but I can’t find one I believe will do everything.

  6. Jep Says:

    Jep…

    I have seen many sites before and most of them do not look this good. I cannot wait to let my friends know about this site. Thanks for the excellent content….

  7. raikol Says:

    After we have all images? how can we train gocr automatically? :S
    is there any script? or …. we have to tell them this is a
    this is b
    this c … and so on? with all images … it is a pain :S
