Instant GOCR Training

A while back I said you *may* be able to train GOCR to recognise PHPBB2 captchas instantly thanks to its excellent database layout. Now for the moment of truth. Several hours later, after travelling through much shrubbery with only my trusty whip and bent fedora for company (I think I may be insane, but I don’t have the paper or the jacket to prove it)…

It works. The only downside is that if you fill the database with too many characters, it is very likely to slow GOCR down immensely. So go easy, and consider pruning excess duplicates of the same letter.

So here’s how it works: inside the custom database directory is a file called db.lst. This file is literally just a list of pictures, each with its correct answer, as seen below (note this is my custom database; normally the files get sensible names :D ):

30402199be694d0330735cb3de4df778.pbm G
852f04abf55c904fdb977dc297c630ec.pbm Z
1cbc984624ca1673132afead5d6f518a.pbm G
297a35232ba803cd6675a38a29453828.pbm D

The first entry is the filename, and it can literally be any pbm/png file. The second entry is the correct letter. That simple. All we have to do is rip the letters out and put them in the same directory as db.lst. Unfortunately I haven’t got the script cleaned up into a nice, easy-to-use download, but I’ll quickly post what I used to build my custom database. It uses the retrieve.php include, which is somewhere on this site. I should be more organised. I think it’s here.
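Because the format is that simple, it is also easy to check how many samples of each letter the database holds before it grows big enough to slow GOCR down. Here is a minimal sketch, assuming one whitespace-separated “filename letter” pair per line as in my listing above. The function name count_letters is made up for illustration and has nothing to do with GOCR itself.

```php
<?php
// Hypothetical helper: tally the samples per letter in a db.lst-style listing.
// Assumes each line is "filename letter", separated by whitespace.
function count_letters(array $lines): array
{
    $counts = array();
    foreach ($lines as $line) {
        $parts = preg_split('/\s+/', trim($line));
        if (count($parts) < 2) {
            continue; // skip blank or malformed lines
        }
        $letter = $parts[1];
        $counts[$letter] = ($counts[$letter] ?? 0) + 1;
    }
    arsort($counts); // most common letters first
    return $counts;
}

// Print the tally if a database exists in ./data (does nothing otherwise).
$lines = is_file('data/db.lst') ? file('data/db.lst', FILE_IGNORE_NEW_LINES) : array();
foreach (count_letters($lines) as $letter => $n) {
    echo $letter . ": " . $n . "\n";
}
```

For the four-line listing above this would report two samples of G and one each of Z and D. It is only a sanity check; the script that actually builds the database follows.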

The script below is written to run on both Windows and Linux, so it uses png files, because GD in PHP can’t export pbm files. It was either that or have the script not work on Windows at all. All you Linux folks can easily convert them to pbm files and do it the way it’s supposed to be done. The script runs from the command line only, like this: “php script.php answer.txt captcha.png”. Also, make sure there is a directory called ‘data’ in the directory you run the script from. Don’t run the script from inside ‘data’; run it from the directory just above it.

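<?php

require_once("retrieve.php");

// usage: php script.php answer.txt captcha.png
if ($argc < 3) {
    die("Usage: php " . $argv[0] . " answer.txt captcha.png\n");
}

// extract the letters out
$letters = get_letter_array($argv[$argc-1]);

// get the answer to the captcha
$fp = fopen($argv[$argc-2], "r") or die("Need a solved answer in " . $argv[$argc-2]);
// trim the trailing newline so it doesn't get counted as a letter
$str_answer = trim((string) fgets($fp));
fclose($fp);
$answer = str_split($str_answer);

// every extracted letter needs an answer, otherwise db.lst gets bad rows
if (count($answer) != count($letters)) {
    die("Found " . count($letters) . " letters but the answer has " . count($answer) . " characters\n");
}

// give them unique names and save them in .png format
$unique_name = array();
for ($index = 0; $index < count($letters); $index++)
{
    // more_entropy makes duplicate ids less likely when uniqid() is called in a tight loop
    $unique_name[] = md5(uniqid("", true));
    imagepng($letters[$index], "data/" . $unique_name[$index] . ".png");
}

// link them from the db.lst file
$fp = fopen("data/db.lst", "a") or die("Can't open data/db.lst, does the 'data' directory exist?");
for ($index = 0; $index < count($letters); $index++)
{
    fwrite($fp, $unique_name[$index] . ".png " . $answer[$index] . "\n");
}
fclose($fp);

?>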

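For the Linux folks who want real pbm files without leaving PHP: the plain-text variant of PBM (“P1”) is simple enough to write by hand. Here is a rough sketch, assuming GD is loaded, the letters are dark on a light background, and GOCR accepts the plain P1 variant, which I have not verified here. The names bits_to_pbm and gd_to_bits and the threshold of 128 are made up for illustration; they are not part of retrieve.php.

```php
<?php
// Hypothetical helpers, not part of retrieve.php.

// Turn rows of 0/1 values (1 = black) into plain-text PBM ("P1") data.
function bits_to_pbm(array $rows): string
{
    $height = count($rows);
    $width = $height > 0 ? count($rows[0]) : 0;
    $out = "P1\n" . $width . " " . $height . "\n";
    foreach ($rows as $row) {
        // Plain PBM lines should be no longer than 70 characters, so wrap long rows.
        $out .= chunk_split(implode('', $row), 70, "\n");
    }
    return $out;
}

// Threshold a GD image into rows of 0/1 values.
// Assumption: anything darker than $threshold is part of the letter.
function gd_to_bits($img, int $threshold = 128): array
{
    $rows = array();
    for ($y = 0; $y < imagesy($img); $y++) {
        $row = array();
        for ($x = 0; $x < imagesx($img); $x++) {
            $c = imagecolorsforindex($img, imagecolorat($img, $x, $y));
            $grey = ($c['red'] + $c['green'] + $c['blue']) / 3;
            $row[] = $grey < $threshold ? 1 : 0;
        }
        $rows[] = $row;
    }
    return $rows;
}

// Possible use inside the save loop, in place of imagepng():
// file_put_contents("data/" . $name . ".pbm", bits_to_pbm(gd_to_bits($letter)));
```

If you go this way, the lines written to db.lst need the .pbm extension instead of .png. The threshold will probably need tuning for whatever captcha you are ripping apart.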
And now for some link love to the spamhuntress.

I actually have a plan in mind for my next post, which is damn unusual. I’ll let you know how it goes in several days’ time :D . Oh yeah, and it’ll be in Java, so it’ll run nicely on your Windows install too.

5 Responses to “Instant GOCR Training”

  1. Jez Says:

    Interesting stuff, looking forward to your next post!

    Checked out the spamhuntress link… looked at her bio and her mail address is client side Doh!

  2. Harry Says:

    Yep, I think she must like spam. It’s always the best way to broadcast your email to spammers, leave your email address in a mailto command. I think I should post a script that grabs emails from webpages :D . Shame I don’t do email spam.

  3. Blackhat SEO » Blog Archive » How to break captchas Says:

    […] - PHPBB3 Captcha is super easy DarkSeoProgramming - Instant GOCR Training DarkSeoProgramming - Letter Derotation DarkSeoProgramming - GOCR to Neural Nets Pt 2 […]

  4. mat Says:

    I really enjoy more to read then write my opinion because i don`t want to get someone`s attention but i liked all your thoughts and ideas thanks to the owner because he created this blog to share with us his knowledge.

  5. bedava sinema izle Says:

    We’ve added a really useful feature. Cheers to the statcounter team once again. THANKS / Thanks.. :D
