Archive for the ‘captcha’ Category

Pythagoras was a spammer?

I don’t remember blogging about this before, so here goes. There are times when you’ll be processing lines algorithmically and a line of a certain length should trigger a certain function in your application. Think Digg’s captcha ;) .

Now the issue is that you can’t just measure the horizontal distance or the vertical distance on its own, because the line might be at an arbitrary angle. So we use Pythagoras’ theorem. Yes, it’s simple, but sometimes you can miss this obvious stuff when you’re coding. I did for ages ;) , which is why I’m blogging about it.

[Image: Pythagoras’ theorem]

Simply take the horizontal distance and the vertical distance, square them both, add them together, find the square root and you have the length of your line. Easy. Incidentally you need this in digg because a couple of short lines come off of F’s and so on.
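Here is a minimal sketch of that calculation in PHP. The function name lineLength() and the 6 pixel cut-off are made up for illustration; they are not from any of the captcha code on this site.

```php
<?php
// Length of a line between two pixel positions using Pythagoras' theorem.
// lineLength() is a hypothetical helper name used only for this example.
function lineLength(int $x1, int $y1, int $x2, int $y2): float
{
    $dx = $x2 - $x1; // horizontal distance
    $dy = $y2 - $y1; // vertical distance
    return sqrt($dx * $dx + $dy * $dy);
}

// Example: throw away lines shorter than an assumed cut-off of 6 pixels.
$minLength = 6.0;
$segments = [
    [0, 0, 3, 4],   // a short stub
    [2, 1, 14, 6],  // a longer line
];
$kept = array_filter($segments, function ($s) use ($minLength) {
    return lineLength($s[0], $s[1], $s[2], $s[3]) >= $minLength;
});
```

The cut-off would need tuning per captcha; the point is only that the length test works at any angle.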

Thursday, June 12th, 2008

Maxxed out Server

Let’s make a long story short. I’ve come into contact with a server which I need to max out the resources on. I’ve been trying my best but it needs something more. If I ran some captcha cracking stuff on it as an API service, how many would be interested? I’d be a lot cheaper than employing bored students. What captchas would you all want destroyed :D

Monday, June 9th, 2008

Removing Lines Across Letters

Squidoo and Recaptcha.

Both have an annoying line going through them which joins the letters together. But how effective is this really? The thing about reCAPTCHA is that one of the two words is already known because it OCRed successfully, while the other is a word that OCR couldn’t read. So in reCAPTCHA we can simply type in the one known word correctly and it won’t be able to check the other one. We’ll probably need some pretty decent OCR software and an approximation module that guesses how close a word is to a proper English word.

But that line through it destroys any chance of standard OCR software recognising anything. However here’s a weakness. The line generally starts somewhere approximately in the middle and often sticks out from the end of the letter slightly. It shouldn’t be too hard to pick up where the line starts and possibly ends. From there we can assume that it won’t ever be thicker than a certain amount and will move by a limited amount. We can roughly track the line whenever it exits a letter. From that we can estimate where it has been travelling and what part is letter and what part is line.

Incidentally this works pretty well on digg except the vast quantity of lines and differing shades make it harder to pick them all up. Often you’ll pick up what looks like a line and have to flip 180 degrees to make sure you haven’t missed anything. The other problem with digg is it’s easy to end up with breaks in the lines and have to “trace blind”. If you have already traced enough of the line that’s not too hard because it’s a pretty basic algorithm to keep tracing through blank space with straight lines until we hit the rest of the line. Just be aware you might be a pixel or really rarely two away from the actual line.

With Squidoo the line is a lot easier to identify, but there’s a lot more distortion in the letters. The distortion could be an issue; it might need another algorithm to beat it if it won’t train out.

Anyway, below is some code with a line detection algorithm. It assumes the furthest points left and right of a series of letters are part of the line. It then tries to trace along the line. It’s nowhere near perfect at the moment: it has some issues when building the line at the end that cause it to favour travelling upwards. But it shows that with a bit more tweaking those lines can be removed. I compiled it on Linux, so it’ll be easier to test inside a Linux VM if you’re running Windows. The pics below show a perfect case scenario.

[Image: Goopwoot Squidoo Captcha]

[Image: Squidoo Cleaned Up]

Download Code with Link Below:

Code to detect and draw over the line in Squidoo captchas

==================================
ALGORITHM OVERVIEW
==================================

- Try and draw as many lines as possible only allowing the line to move up or down by one pixel with each pixel travelled in the horizontal axis. If we hit a blank space stop drawing this line. Carry on drawing the others.

- Find the parts in the image where the line is most likely to skip. See pic below.

[Image: First Stage Squidoo Breaking]

- Calculate the average incline per pixel movement in the horizontal axis between each section where the line “jumps”. Height/Width

- Use this incline to latch onto the closest shaded pixel. Then finally smooth the line.

==================================
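The first and third steps of the overview can be sketched roughly like this. This is not the downloadable code, just an illustration in PHP under my own assumptions: the image is a plain 2D array where 1 is a shaded pixel and 0 is blank, and the names traceLine() and averageIncline() are invented for the example.

```php
<?php
// Assumption: $grid[$y][$x] is 1 for a shaded pixel and 0 for a blank one.

// Follow a line to the right, one column at a time, letting it move up or
// down by at most one pixel per column. Stops when it hits blank space.
function traceLine(array $grid, int $x, int $y): array
{
    $h = count($grid);
    $w = count($grid[0]);
    $path = [];
    if ($x < 0 || $x >= $w || $y < 0 || $y >= $h || $grid[$y][$x] !== 1) {
        return $path;
    }
    $path[] = [$x, $y];
    while ($x + 1 < $w) {
        $moved = false;
        // Try straight ahead first, then one up, then one down.
        // This order is an arbitrary choice and biases the trace upwards.
        foreach ([0, -1, 1] as $dy) {
            $ny = $y + $dy;
            if ($ny >= 0 && $ny < $h && $grid[$ny][$x + 1] === 1) {
                $x++;
                $y = $ny;
                $path[] = [$x, $y];
                $moved = true;
                break;
            }
        }
        if (!$moved) {
            break; // blank space: stop drawing this line
        }
    }
    return $path;
}

// Average incline (height / width) between the two ends of a traced section.
function averageIncline(array $path): float
{
    $first = $path[0];
    $last = $path[count($path) - 1];
    $width = $last[0] - $first[0];
    return $width === 0 ? 0.0 : ($last[1] - $first[1]) / $width;
}

$grid = [
    [0, 0, 0, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
];
$path = traceLine($grid, 0, 1);
$incline = averageIncline($path);
```

The incline from one traced section is what you would use to guess where the line picks up again after a gap.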

What about other captchas like myspace where the letters actually touch? Hmmm… I wonder if they considered the best choice of font ;) ?

Sunday, June 1st, 2008

PHPBB3 Captcha is super easy

[Image: PHPbb3 Captcha 2]

A while back I presented a long-winded algorithm that would crack phpBB3 captchas. However, I have since cracked it, and it’s even simpler than I said before. My floodfill routine returns the size of the area it colours in. Soooo… I flood fill background coloured pixels and if it’s a small area we assume it must be part of a letter and keep it. That gives us lots of small segments to join together.
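To show what I mean by a floodfill that returns its size, here is a rough sketch. It is not my actual routine: it works on a plain 2D array of colour values rather than a GD image, and floodFillArea() and the example colours are invented for illustration.

```php
<?php
// Assumption: $grid[$y][$x] holds an integer colour value.

// 4-connected flood fill that recolours a region and returns how many
// pixels it coloured in.
function floodFillArea(array &$grid, int $x, int $y, int $fill): int
{
    $h = count($grid);
    $w = count($grid[0]);
    $target = $grid[$y][$x];
    if ($target === $fill) {
        return 0;
    }
    $area = 0;
    $stack = [[$x, $y]];
    while (count($stack) > 0) {
        list($cx, $cy) = array_pop($stack);
        if ($cx < 0 || $cx >= $w || $cy < 0 || $cy >= $h) {
            continue;
        }
        if ($grid[$cy][$cx] !== $target) {
            continue;
        }
        $grid[$cy][$cx] = $fill;
        $area++;
        $stack[] = [$cx + 1, $cy];
        $stack[] = [$cx - 1, $cy];
        $stack[] = [$cx, $cy + 1];
        $stack[] = [$cx, $cy - 1];
    }
    return $area;
}

// 0 = background colour, 1 = anything else (lines, letter outlines).
$grid = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 0, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
];
$inner = floodFillArea($grid, 2, 2, 2); // background trapped inside a square
$outer = floodFillArea($grid, 0, 0, 3); // the open background around it
```

A small area like $inner is the kind of region we keep as part of a letter; a large one like $outer is just background. Where to draw the line between small and large is something you would have to tune.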

Incidentally we find the background colour by reading the pixels along the top and finding the most frequently occurring colour.
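A quick sketch of the background colour check, assuming the image has already been read into a 2D array of colour values (with GD you would read the top row using imagecolorat($image, $x, 0) for each $x up to imagesx($image)). The function name is made up for this example.

```php
<?php
// Assumption: $grid[$y][$x] holds an integer colour value.

// Returns the colour that appears most often along the top row.
function mostCommonTopRowColour(array $grid): int
{
    $counts = array_count_values($grid[0]);
    $best = null;
    $bestCount = 0;
    foreach ($counts as $colour => $count) {
        if ($count > $bestCount) {
            $best = $colour;
            $bestCount = $count;
        }
    }
    return (int) $best;
}

$grid = [
    [7, 7, 3, 7, 9],
    [7, 3, 3, 3, 7],
];
$background = mostCommonTopRowColour($grid);
```

If two colours tie, this version just takes whichever it meets first, which is good enough for a sketch.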

Now we have some small segments we make them touch each other by blurring them and then we force the picture into only two colours. Then using the average density of vertical lines in each letter we rotate them to an approximately correct position. It may throw a few upside down but as long as that letter always comes out that way up the computer doesn’t care.
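The blur and two-colour step looks something like the sketch below. It assumes grey values in a 2D array (0 is black ink, 255 is white background) and uses a plain 3x3 box blur, which may not be the blur you would pick in practice. The function names and the cut-off of 200 are invented for the example.

```php
<?php
// Assumption: $grid[$y][$x] is a grey value, 0 = black ink, 255 = white.

// 3x3 box blur. Pixels outside the image are left out of the average.
function boxBlur(array $grid): array
{
    $h = count($grid);
    $w = count($grid[0]);
    $out = [];
    for ($y = 0; $y < $h; $y++) {
        for ($x = 0; $x < $w; $x++) {
            $sum = 0;
            $n = 0;
            for ($dy = -1; $dy <= 1; $dy++) {
                for ($dx = -1; $dx <= 1; $dx++) {
                    $ny = $y + $dy;
                    $nx = $x + $dx;
                    if ($ny >= 0 && $ny < $h && $nx >= 0 && $nx < $w) {
                        $sum += $grid[$ny][$nx];
                        $n++;
                    }
                }
            }
            $out[$y][$x] = (int) round($sum / $n);
        }
    }
    return $out;
}

// Force the picture into two colours: 1 = ink, 0 = background.
function toTwoColours(array $grid, int $cutoff): array
{
    $out = [];
    foreach ($grid as $y => $row) {
        foreach ($row as $x => $value) {
            $out[$y][$x] = ($value < $cutoff) ? 1 : 0;
        }
    }
    return $out;
}

// Two ink pixels with a one pixel gap between them.
$row = [[255, 255, 255, 0, 255, 0, 255, 255, 255]];
$joined = toTwoColours(boxBlur($row), 200);
```

After the blur the gap has darkened enough to fall under the cut-off, so the two segments end up as one solid run.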

Now just train GOCR or a neural network or <<insert cunning program here>> to read those letters. Simple. And surprisingly accurate too. We could further improve it with colour checking routines etc but hey, it works.

Monday, May 12th, 2008

Instant GOCR Training

A while back I said you *may* be able to train GOCR to recognise PHPBB2 captchas instantly thanks to its excellent database layout. Now for the moment of truth. Several hours later after travelling through much shrubbery with only my trusty whip and bent fedora for company (I think I may be insane but I don’t have the paper to prove it or the jacket)…

It works. The only downside is that if you fill the database with too many characters it is very likely to slow GOCR down immensely. So go easy, and possibly try to remove excess duplicates of the same letter.

So here’s how it works: inside the custom database directory is a file called db.lst. This file is literally just a list of pictures with their correct answer, as seen below (note this is my custom database; normally it gives the files sensible names :D ):

30402199be694d0330735cb3de4df778.pbm G
852f04abf55c904fdb977dc297c630ec.pbm Z
1cbc984624ca1673132afead5d6f518a.pbm G
297a35232ba803cd6675a38a29453828.pbm D

The first entry is the filename, and it can literally be any pbm/png file. The second entry is the correct letter. That simple. All we have to do is rip the letters out and put them in the same directory. Unfortunately I haven’t got the script cleaned in a nice easy to use format to just download, but I’ll post what I used to build my custom database very quickly. I use the retrieve.php include which is somewhere on this site. I should be more organised. I think it’s here.
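Since too many duplicates slow GOCR down, it can help to see how many samples you have per letter. Here is a small sketch that reads the two-column layout shown above. It assumes every line is just "filename answer" and nothing else, which is all I have described here; parseDbList() and samplesPerLetter() are names I made up for the example.

```php
<?php
// Assumption: each non-empty line is "<filename> <answer>" as in the
// sample above. Any other layout is not handled.
function parseDbList(string $text): array
{
    $entries = [];
    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue;
        }
        $parts = preg_split('/\s+/', $line);
        if (count($parts) < 2) {
            continue; // skip lines that don't have both columns
        }
        $entries[] = ['file' => $parts[0], 'answer' => $parts[1]];
    }
    return $entries;
}

// Count how many sample images exist for each letter.
function samplesPerLetter(array $entries): array
{
    $counts = [];
    foreach ($entries as $entry) {
        $letter = $entry['answer'];
        $counts[$letter] = ($counts[$letter] ?? 0) + 1;
    }
    return $counts;
}

$text = "30402199be694d0330735cb3de4df778.pbm G\n"
      . "852f04abf55c904fdb977dc297c630ec.pbm Z\n"
      . "1cbc984624ca1673132afead5d6f518a.pbm G\n"
      . "297a35232ba803cd6675a38a29453828.pbm D\n";
$entries = parseDbList($text);
$counts = samplesPerLetter($entries);
```

In real use you would feed it file_get_contents() of your own db.lst and thin out whichever letters have piled up.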

Now this code is written to run on Windows/Linux so it uses png files because we can’t export pbm files from GD in php. It was either that or have the script not work in Windows at all. All you Linux folks can easily convert them to pbm files and do it the way it’s supposed to be done. (The script runs from the command line only… like this… “php script.php answer.txt captcha.png”) (Also I just thought… Make sure you have the directory ‘data’ in the same directory as you run the script. Don’t run the script from the ‘data’ directory but the directory just above it)

<?php

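require_once("retrieve.php");

// extract the letters out (the last argument is the captcha image)
$letters = get_letter_array($argv[$argc-1]);

// get the answer to the captcha (the second-to-last argument)
$fp = fopen($argv[$argc-2], "r") or die("Need a solved answer in " . $argv[$argc-2]);
// trim the trailing newline so it isn't split out as an extra letter
$str_answer = trim((string) fgets($fp));
fclose($fp);
$answer = str_split($str_answer);

// every extracted letter needs exactly one answer character
if (count($letters) != count($answer))
    die("Extracted " . count($letters) . " letters but the answer has " . count($answer));

// give them unique names and save them in .png format
$unique_name = array();
for($index=0; $index<count($letters); $index++)
{
    // more_entropy = true avoids identical names inside a fast loop
    $unique_name[] = md5(uniqid('', true));
    imagepng($letters[$index], "data/" . $unique_name[$index] . ".png");
}

// link them from the db.lst file
$fp = fopen("data/db.lst", "a");
for($index=0; $index<count($letters); $index++)
{
    fwrite($fp, $unique_name[$index] . ".png " . $answer[$index] . "\n");
}
fclose($fp);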

?>

And now for some link love to the spamhuntress.

I actually have a plan in mind for my next post, which is damn unusual. I’ll let you know how it goes in several days time :D . Oh yeah and it’ll be in Java so it’ll run nicely on your Windows install too.

Wednesday, April 16th, 2008

Letter Derotation

I’m getting kind of done with captchas but here goes another post on them. You may have read Slightly Shady SEO about how to derotate letters. Here’s my easier technique. Add up all the black in the vertical lines of the letter, find the average and then check for spikes above that average. These spikes are probably vertical lines in the letter like the back of a ‘d’ or a ‘p’ etc. Then we simply rotate it around by a few degrees at a time until we find the rotation with the largest vertical spike above the average. We then need some extra checks for symmetry and so on, but that’s the basics.
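The scoring part of that can be sketched as below. I am assuming each candidate rotation has already been turned into a 2D array where 1 is black (with GD you could produce the candidates using imagerotate()), and the names verticalSpikeScore() and bestRotation() are invented for this example. The symmetry checks are left out.

```php
<?php
// Assumption: $grid[$y][$x] is 1 for a black pixel and 0 otherwise.

// Add up the black in every vertical line, find the average, and return
// how far the tallest column pokes above that average.
function verticalSpikeScore(array $grid): float
{
    $w = count($grid[0]);
    $sums = array_fill(0, $w, 0);
    foreach ($grid as $row) {
        foreach ($row as $x => $value) {
            $sums[$x] += $value;
        }
    }
    $average = array_sum($sums) / $w;
    return max($sums) - $average;
}

// Given candidate rotations keyed by angle, pick the angle whose image
// has the largest spike.
function bestRotation(array $candidates): int
{
    $bestAngle = 0;
    $bestScore = -1.0;
    foreach ($candidates as $angle => $grid) {
        $score = verticalSpikeScore($grid);
        if ($score > $bestScore) {
            $bestScore = $score;
            $bestAngle = $angle;
        }
    }
    return $bestAngle;
}

// A leaning stroke versus the same stroke stood upright.
$leaning = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
];
$upright = [
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
];
$angle = bestRotation([0 => $leaning, 45 => $upright]);
```

The angles here are only labels for the two hand-made grids; real candidates would come from actually rotating the letter a few degrees at a time.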

Saturday, April 5th, 2008

GOCR to Neural Nets Pt 2

As per usual these posts don’t go as smoothly as I would like. The idea was to use FANN for PHP to make a simple neural network that would work easily. Hahahaha. You might think I make this stuff up as I go along. Oh wait. I do.

Anyway FANN requires PEAR to be installed and I figured that it’d be much simpler than installing PEAR modules to find something that was completely PHP to do the job. I did that. However.

1. It’s slow.

Ok so it’s slow. We can live with that right? I mean we’ve got some time.

2. It’s slow.

Ok it really is slow, and I’m getting impatient.

3. It’s painfully slow.

PHP sucks for some things. I do like C’s direct memory access. On the plus side this little neural net class is so simple and easy to understand.

Anyway I went ahead and I fitted all my modules together around this PHP neural net and I walked away whilst my computer attempted to learn the alphabet at a ridiculously slow pace. 900 captchas later I give it up.

The neural net does sort of work now for quite a few characters. Although I did notice a character failed to segment properly which set the learning back a bit. A lot of the characters it fails on are things like producing P’s for R’s or vice versa. So you can understand where the problem lies.

I won’t post code because there’s a fair bit spread over a few modules. nnbreak.php is the main module and must be run from the command line. Like this:

php nnbreak.php captcha.jpg answer.txt train=1/0

1 means train the network using the answer stored in answer.txt. 0 means guess what the captcha is in captcha.jpg, which will ignore answer.txt but answer.txt must still be included (although it can be blank or not exist etc).

So analysis… Here’s how it works roughly:

Allocate memory for the neural network, and load in previous neuron weight values

Extract the letters as said in that post about extracting letters.

Convert the letters into a 10×10 matrix of averaged values, maximum being 1.0

Loop through each letter and send the matrix of values to the input neurons

Check the output neurons

Possibly teach the network the right answer using backpropagation.

We can tweak a lot of things such as the number of neurons in each layer and the size of the matrix. I did mess with the default backpropagation teaching speed because it didn’t seem to be learning fast enough to me :D . I’m guessing that has some drawbacks to it but I’m not sure exactly what they are. In this line the 0.5 is the learning speed which has been moved up from the default 0.1:
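The 10×10 matrix step from the list above could look roughly like this. It is a sketch rather than the code in nnbreak.php: it assumes the letter is a 2D array where 1 is ink and 0 is background, and toInputMatrix() is a name made up for the example.

```php
<?php
// Assumption: $grid[$y][$x] is 1 for ink and 0 for background.

// Shrink a letter of any size down to a $size x $size matrix where each
// cell is the average of the pixels it covers, so the maximum is 1.0.
function toInputMatrix(array $grid, int $size = 10): array
{
    $h = count($grid);
    $w = count($grid[0]);
    $matrix = [];
    for ($row = 0; $row < $size; $row++) {
        $y0 = intdiv($row * $h, $size);
        $y1 = max($y0 + 1, intdiv(($row + 1) * $h, $size));
        for ($col = 0; $col < $size; $col++) {
            $x0 = intdiv($col * $w, $size);
            $x1 = max($x0 + 1, intdiv(($col + 1) * $w, $size));
            $sum = 0;
            for ($y = $y0; $y < $y1; $y++) {
                for ($x = $x0; $x < $x1; $x++) {
                    $sum += $grid[$y][$x];
                }
            }
            $matrix[$row][$col] = (float) ($sum / (($y1 - $y0) * ($x1 - $x0)));
        }
    }
    return $matrix;
}

// A 20x20 test letter whose left half is solid ink.
$letter = [];
for ($y = 0; $y < 20; $y++) {
    for ($x = 0; $x < 20; $x++) {
        $letter[$y][$x] = ($x < 10) ? 1 : 0;
    }
}
$matrix = toInputMatrix($letter);

// Flatten to one value per input neuron.
$inputs = array_merge(...$matrix);
```

With a 10×10 matrix that gives 100 values, one for each input neuron.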

$nn = new nn(3, $layer_structure,1,0.5,0.9);

Now if anyone happens to train the network to recognise all of the letters properly send me the file so I can claim it as my own and pretend I did it all perfectly ;) . J/k. Seriously though, I do think it should eventually learn all the letters properly; it’s just taking a long time.

In conclusion. I don’t trust neural networks; they do too much stuff that I don’t know about. I’m betting that robot from the Terminator was probably built from neural nets.

The code to guess a captcha using neural nets - Already populated with some weights so it sort of works

Sunday, March 30th, 2008

10 Steps to Solving a Captcha

Lately I’ve been stuck on one of my projects, hitting my head against the wall. I’m going to keep doing it until I solve the problem, however it’s got in the way of my posts. So I’ve taken some time out to put together two posts.

10 Steps to Solving a Captcha

1. Start up The GIMP, Photoshop, or <insert favourite paint program here>

2. Mess around with all the plugins and filters you have available and see how much noise you can get rid of just with these. If a filter does a good job but removes too much information remember you can always use another filter on the same image and paste these two images together at the end.

3. If the letters are getting sketchy with all these filters, and too thin you can try combining them with the original image. You can do this by floodfilling from the remaining pixels on the altered image onto the original image. Now copy these floodfilled parts of the original image back into your altered image. This won’t work if straight lines cut through the letters though.

4. If “artefacts” or noise are still on the image then you need to list down its common properties. Does a line have to start from an edge? Is it always straight? Is it within a set angle? Are dots spaced a certain distance apart or entirely random? And so on. We can then take all this into consideration when we write our custom noise removal algorithm.

5. Get a programming language you can throw ideas around in quickly. Basically this is just a draft. Save a rough image off from photoshop and maybe even use Visual Basic or something to test ideas out. Even if you don’t know how to apply all those filters you used in Photoshop in code who cares. If you know it works you can move mountains compared to writing a ton of code that you don’t even know will work.

6. Write that custom noise removal algorithm I talked about earlier. ;) Don’t delete code!!! Ever. Even if it doesn’t work just archive it somewhere because sometime you just know you’ll want it again.

7. Once you have the noise removed you may need to break overlapping letters up. All letters are made up of certain types of strokes (and loops in handwriting). As I understand it this is actually the basis behind natural handwriting recognition. For instance, in a captcha, if a letter ends with a curve then often that will be the end of the letter, depending on the length of the curve and as long as the letter isn’t rotated at a strange angle. You’ll probably need to consider these types of things if the letters overlap.

*** This is possibly one of the hardest steps

8. Once you have single letters you have to de-rotate them. Read SlightlyShadySeo. Again think about common properties. Don’t accidentally rotate a letter upside down if there is no way a letter would be that way up in the first place. Although that was just an example and probably wouldn’t happen.

9. Write an automated script that downloads a ton of captchas and only requires your input to train it with GOCR or phpOCR. It’s way more fun watching your computer do something than doing it yourself.

10. Optimize the algorithm. If your algorithm is slow but works you can probably save time by removing parts of your code that are running more times than they need to etc.
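Step 3 is probably the least obvious one, so here is a rough sketch of it. It assumes both the original and the filtered image are 2D arrays where 1 is ink and 0 is background; restoreFromSeeds() is a name invented for the example and this is not code from any of my scripts.

```php
<?php
// Assumption: both grids are the same size, 1 = ink, 0 = background.

// Floodfill on the ORIGINAL image starting from every pixel that survived
// in the FILTERED image, and copy whatever gets filled into the result.
function restoreFromSeeds(array $original, array $filtered): array
{
    $h = count($original);
    $w = count($original[0]);
    $result = [];
    $stack = [];
    for ($y = 0; $y < $h; $y++) {
        for ($x = 0; $x < $w; $x++) {
            $result[$y][$x] = 0;
            if ($filtered[$y][$x] === 1) {
                $stack[] = [$x, $y]; // a surviving pixel is a seed
            }
        }
    }
    while (count($stack) > 0) {
        list($x, $y) = array_pop($stack);
        if ($x < 0 || $x >= $w || $y < 0 || $y >= $h) {
            continue;
        }
        if ($original[$y][$x] !== 1 || $result[$y][$x] === 1) {
            continue;
        }
        $result[$y][$x] = 1;
        $stack[] = [$x + 1, $y];
        $stack[] = [$x - 1, $y];
        $stack[] = [$x, $y + 1];
        $stack[] = [$x, $y - 1];
    }
    return $result;
}

// The original has a letter shape plus a stray noise dot on the right.
$original = [
    [1, 1, 1, 0, 0],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 0, 0],
];
// The filters removed the dot but also left only one pixel of the letter.
$filtered = [
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
];
$restored = restoreFromSeeds($original, $filtered);
```

The letter comes back at full thickness and the dot stays gone, because nothing that survived the filters touches it. As step 3 says, this falls apart if a noise line cuts through the letter, since the fill will happily run along the line too.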

Abuse each of these 10 steps and change them to suit your personality. Receive flashes of inspiration. They’re fun. Don’t drink high caffeine drinks. They make me talk about absolute rubbish non-stop…

Wednesday, March 26th, 2008

PHPBB3 Captcha difficulty

Is phpBB3 more secure than phpBB2? Here is a default phpBB3 sample.

[Image: PHPBB3 captcha]

This is a lot stronger than a phpBB2 captcha. We can’t separate a letter based purely on its colour anymore. Notice how there is a line running underneath the B that is the same colour as the B. The background colour is annoying as anything but only from a person’s point of view. Our PC doesn’t really mind.

One of its issues/weaknesses is that there are no lines that cut across the squares; they all go underneath them. That means there are no breaks in the squares we have to detect. The only other weakness I can see is that the lines go directly across without intersecting at any point. That means that there are no objects that look like the squares of the letters that are just noise.

So here’s my algorithm which I think would solve it. Admittedly I haven’t tested this but I don’t see why it wouldn’t work. All the letters are made up of squares. We need to test whether, starting at one pixel, we can get back to the start by following the same colour pixels. That obviously would make a square :P , or something close like a distorted rectangle. It’s almost like dot to dot puzzles. If we can get back to the start, keep the line and colour it in using the previous post’s fill function (or PHP GD’s one :D ). If we can’t get back to the start, or the line keeps travelling too far, then we remove it and find another coloured pixel that doesn’t match the background colour.
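Here is one way the "can we get back to the start" test might be sketched. Like the idea itself this is untested against real phpBB3 captchas. It assumes a 2D array of colour values and lines that are one pixel thick, and it swaps the pixel-by-pixel walk for a breadth-first search: block the start pixel, then see whether one same-coloured neighbour can reach another. The function name and the length limits are made up.

```php
<?php
// Assumption: $grid[$y][$x] holds an integer colour value and outlines
// are one pixel thick. Thicker lines would produce short false loops.
function isOnClosedLoop(array $grid, int $sx, int $sy, int $minLen = 8, int $maxLen = 200): bool
{
    $h = count($grid);
    $w = count($grid[0]);
    $colour = $grid[$sy][$sx];
    $steps = [[1, 0], [-1, 0], [0, 1], [0, -1]];

    // Same-coloured neighbours of the start pixel.
    $neighbours = [];
    foreach ($steps as $s) {
        $nx = $sx + $s[0];
        $ny = $sy + $s[1];
        if ($nx >= 0 && $nx < $w && $ny >= 0 && $ny < $h && $grid[$ny][$nx] === $colour) {
            $neighbours[] = [$nx, $ny];
        }
    }

    foreach ($neighbours as $from) {
        // Breadth-first search from this neighbour with the start blocked.
        $dist = ["$sx,$sy" => -1, "{$from[0]},{$from[1]}" => 0];
        $queue = [$from];
        for ($i = 0; $i < count($queue); $i++) {
            list($x, $y) = $queue[$i];
            $d = $dist["$x,$y"];
            if ($d + 2 >= $maxLen) {
                continue; // the line has travelled too far, give up on it
            }
            foreach ($steps as $s) {
                $nx = $x + $s[0];
                $ny = $y + $s[1];
                if ($nx < 0 || $nx >= $w || $ny < 0 || $ny >= $h) {
                    continue;
                }
                if ($grid[$ny][$nx] !== $colour || isset($dist["$nx,$ny"])) {
                    continue;
                }
                $dist["$nx,$ny"] = $d + 1;
                $queue[] = [$nx, $ny];
            }
        }
        // Reaching a different neighbour means we can get back to the start.
        foreach ($neighbours as $to) {
            $key = "{$to[0]},{$to[1]}";
            if ($to !== $from && isset($dist[$key]) && $dist[$key] + 2 >= $minLen) {
                return true;
            }
        }
    }
    return false;
}

// 0 = background, 5 = one colour used for both a square and a stray line.
$grid = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 5, 5, 5, 5, 0, 0],
    [0, 5, 0, 0, 5, 0, 0],
    [0, 5, 0, 0, 5, 0, 0],
    [0, 5, 5, 5, 5, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [5, 5, 5, 5, 5, 5, 5],
];
$onSquare = isOnClosedLoop($grid, 1, 1);
$onLine = isOnClosedLoop($grid, 3, 6);
```

The minimum length is there so a tiny 2x2 blob doesn’t count as a square, and the maximum length plays the part of "the line keeps travelling too far".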

The main issues we would have to overcome are lines which are thicker than 1 pixel and small blocks of colour found at the side of some of the letters. The other issue would be making sure we don’t recheck the part we just shaded in (Maybe use a unique colour for it?).

Friday, March 21st, 2008

A custom floodfill routine

Last post I said to separate characters you simply need to flood fill them and calculate the extreme points to find out where to fit your rectangle around. Hahaha. I’ve spent the last few days trying to port optimized floodfill functions to php. Normally I’d just take the sane easy option and use pre-written code like GOCR but apparently I like pain.

The problem is the optimized bit. I can write a simple recursive floodfill function that calls itself until it’s done but I have no idea how to write something that will be reasonably fast. The more complicated captchas will require a fair bit of speed because you will be thinking about cracking a fair few of them in a period of time. This site is where I eventually found a simple routine that worked. My issue was I used another routine, ported it, and then it broke. Miserably. After filling only two lines.

Here is my code. First it loads in an image of a captcha. It then scans two lines along the horizontal axis, one at 1/4 of the way down the picture and one at 3/4 of the way down. When it hits a character it floodfills it. The floodfill function returns the extreme positions of pixels which gives us a rectangle around that letter.

<?php

function floodFillScanlineStack($image, $x, $y)
{
    // the colour we are shading in - black letters
    $oldColour = 0;
    // the colour we want to shade the letters in - red just because we can
    $fillColour = imagecolorallocate($image, 255, 0, 0);

    // we need the image width & height
    $w = imagesx($image);
    $h = imagesy($image);

    // set the rectangle co-ords
    $rectangle = array("x1" => $x, "x2" => $x, "y1" => $y, "y2" => $y);

    // nothing to do if the fill colour is the colour we are replacing
    if($oldColour == $fillColour) return $rectangle;

    $stack = array();
    $stack[] = array("x" => $x, "y" => $y);

    while(count($stack)>0)
    {
        $pos = array_pop($stack);
        $x = $pos['x'];
        $y = $pos['y'];

        // walk up to the top of this vertical run of pixels
        $y1 = $y;
        while($y1 >= 0 && imagecolorat($image, $x, $y1) == $oldColour) $y1--;
        $y1++;
        $spanLeft = 0;
        $spanRight = 0;
        while($y1 < $h && imagecolorat($image, $x, $y1) == $oldColour )
        {

            // here we set the pixel colour
            // use these to find our rectangle around the letter
            imagesetpixel($image, $x, $y1, $fillColour);
            if($x<$rectangle['x1'])
                $rectangle['x1'] = $x;
            if($y1<$rectangle['y1'])
                $rectangle['y1'] = $y1;
            if($x>$rectangle['x2'])
                $rectangle['x2'] = $x;
            if($y1>$rectangle['y2'])
                $rectangle['y2'] = $y1;

            if($spanLeft==0 && $x > 0 && imagecolorat($image, $x - 1, $y1) == $oldColour)
            {
                $stack[] = array("x" => $x - 1, "y" => $y1);
                $spanLeft = 1;
            }
            else if($spanLeft==1 && $x > 0 && imagecolorat($image, $x - 1, $y1) != $oldColour)
            {
                $spanLeft = 0;
            }
            // $x + 1 must stay inside the image, so compare against $w - 1
            if($spanRight==0 && $x < $w - 1 && imagecolorat($image, $x + 1, $y1) == $oldColour)
            {
                $stack[] = array("x" => $x + 1, "y" => $y1);
                $spanRight = 1;
            }
            else if($spanRight==1 && $x < $w - 1 && imagecolorat($image, $x + 1, $y1) != $oldColour)
            {
                $spanRight = 0;
            }
            $y1++;
        }
    }

    return $rectangle;
}

function floodfill_char($image, $x)
{
    // try the upper scan line first, then the lower one
    if((imagecolorat($image, $x, 12)==0))
        return floodFillScanlineStack($image, $x, 12);

    if((imagecolorat($image, $x, 38)==0))
        return floodFillScanlineStack($image, $x, 38);
}

function split_chars_along_vertical($image)
{
    $w = imagesx($image);
    $h = imagesy($image);

    /* $rgb = imagecolorat($img, $x, $y);
    $r += $rgb >> 16;
    $g += $rgb >> 8 & 255;
    $b += $rgb & 255; */

    // scan along each vertical line looking for black pixels
    // we'll only scan two lines of pixels to save time. Both along the center
    // split slightly apart
    $letters = array();
    for($index=0; $index<$w; $index++)
    {
        // check two lines of pixels one at 12 down, one at 38 down
        // the picture is 50 pixels tall by the way
        if((imagecolorat($image, $index, 12)==0) || (imagecolorat($image, $index, 38)==0))
        {
            // fill the character and return a rectangle around the image
            $rectangle = floodfill_char($image, $index);

            // pull this letter out into a new image
            $singleLetter = imagecreatetruecolor($rectangle['x2'] - $rectangle['x1'] + 1,
                $rectangle['y2'] - $rectangle['y1'] + 1);
            imagecopy($singleLetter, $image, 0, 0, $rectangle['x1'], $rectangle['y1'],
                $rectangle['x2'] - $rectangle['x1'] + 1,
                $rectangle['y2'] - $rectangle['y1'] + 1);
            $letters[] = $singleLetter;

            // find the next character
            $index = $rectangle['x2']+1;
        }
    }

    return $letters;
}

$image = @imagecreatefrompng('82.clean.png');
if ($image === false) { die ('Unable to open image'); }

$letters = split_chars_along_vertical($image);

// dump the first letter to the screen
header("Content-Type: image/png");
imagepng($letters[0]);

?>

To run through the code and show how it works I made this neat little gif. To be honest it’s probably just a waste of my bandwidth but it looks pretty cool.

[Image: Floodfill gif]

Now I can finally sort this damn neural network out :P . You just know something is going to break. Of course that’s the fun part, right? Anyone?

Scripts to separate characters

Friday, March 21st, 2008