An idea I once had

I don’t have the skills to set this up myself, as it’s a large project and I don’t know how you’d go about setting up an open-source project on this scale. However, here it is anyway.

Have you ever used the ALICE bot? I think it’s pretty amazing yet ridiculously simple. Literally anyone who knows basic English can make their own. And that’s my idea: a website like Wikipedia, based on moderated user input, where people all work together, each putting in a little time, to make many different personality bots for a game. Then we get some open-source coders to add the finishing touches to the game, like the GUI. Sure, it might not make money, but hopefully it’d push the boundaries of games in the future. I don’t know. Good idea?

April 12th, 2008, posted by Harry

Forum registration

I get a message in my comments:

“BTW? have you got the rest of the scripts you need to use your captcha breaking code? i.e. the forum spam stuff?”

Don’t think I’m not listening :D . So here we go, a script that will register at a phpBB2 forum. It works automatically for Linux if you run it from the command line. I know half of you probably use Windows but it’s such a pain trying to port code and the necessary code is in my guest post on BlueHatSeo.com.

The workings behind the functions are stored in regfunctions.php, and you use the script by either running “php regphpbb2.php” or navigating to it in your browser if you’re on Windows.

Anyway at the top of the code is our list of variables that we can change for registering at different forums.

<?php

require_once("regfunctions.php");

// set our sign up variables like username and so on
$sign_user = "user";
$sign_email = "test@test.localhost";
$sign_pass = "aaa";
$sign_sig = "My spammy signature";
$site_name = "http://localhost/phpBB2/";

Now we download the captcha and if we’re running inside a browser we show the captcha to the user, otherwise we run our C program to crack it.

// make sure we haven't already sent an answer to our captcha
if(!isset($_GET['captchacode']))
{
    // begin to register an account this will save the captcha to downloadedcaptcha/captcha.png
    // it will return a necessary session/confirm id we'll need later
    $ids = get_register_captcha($site_name);
    $sid = $ids[0];
    $cid = $ids[1];

    // crack the captcha or get a human to solve it
    if(!isset($_SERVER['_']))
    {
        // if we are running in a web page show the captcha to the user
        echo "<h2>PHPBB2 Captcha</h2> You can crack this automatically by running this script from the command line in Linux with ImageMagick libraries installed.<br />";

        echo "<img src='downloadedcaptcha/captcha.png' /><br />";
        echo "<FORM action='" . $_SERVER['PHP_SELF'] . "' method='GET'>";
        echo "Type in the code <input type='text' size='15' name='captchacode' /><br />";
        echo "<input type='hidden' name='sid' value='" . $sid . "' />";
        echo "<input type='hidden' name='cid' value='" . $cid . "' />";
        echo "<input type='submit' value='submit answer' />";
        echo "</FORM>";

        exit(1);
    }
    else
    {
        // if we are running from the command line solve it in code
        echo "Solving captcha…\n";
        $solved_captcha = str_replace(" ", "", exec("./cleanpic downloadedcaptcha/captcha.png"));
        $solved_captcha = str_replace("\n", "", $solved_captcha);
    }
}

// if we have a solved captcha put it in the correct variable
if(isset($_GET['captchacode']))
{
    $solved_captcha = $_GET['captchacode'];
    $sid = $_GET['sid'];
    $cid = $_GET['cid'];
}

The important bit here is the $solved_captcha = exec("./cleanpic… ) part. exec runs a program and returns the last line of its output, in this case our cracked captcha. You need to replace this program with its Windows version if you are running Windows. The str_replace around the call to exec just cleans the string up in case it comes back with spaces or carriage returns. Now we just send some post variables to the server with all the necessary data.

// finish the sign up
$success = sign_up($sid, $cid, $solved_captcha, $site_name, $sign_user, $sign_email, $sign_pass, $sign_sig);

if($success)
    echo "account created\n";
else
    echo "account failed to be created\n";

// now verify the email, note: this is a stub, no code in it
// gotta write it yourself :D
verify_email();

?>

I haven’t written in the email verification code but you don’t always need it for phpBB2. It’s dependent on the mail server you use anyway.

How do you work these scripts out? I have a trick :D . LiveHTTP Headers for Firefox. Take a look below. I register first manually and it prints out everything I need to send to the server to register automatically next time.

LiveHTTP headers

The highlighted part is all the post variables that allow us to register. Just exchange them for our own variables. From here it’s pretty simple to add on the pieces that post messages on the forum.

Forum Registration Code

April 11th, 2008, posted by Harry

Letter Derotation

I’m getting kind of done with captchas but here goes another post on them. You may have read Slightly Shady SEO about how to derotate letters. Here’s my easier technique. Add up all the black in each vertical line of pixels in the letter, find the average, and then check for spikes above that average. These spikes are probably vertical strokes in the letter, like the back of a ‘d’ or a ‘p’. Then we simply rotate the letter a few degrees at a time until we find the rotation with the largest vertical spike above the average. We then need some extra checks for symmetry and so on, but those are the basics.
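The idea above can be sketched in plain PHP. This is only a rough sketch under assumptions that are not in the original post: the letter is a hypothetical 2D array of 0/1 values indexed [row][column] (1 = black), the rotation is a crude nearest-neighbour one, and the function names, the 30 degree range and the 5 degree step are all made up for illustration. The extra symmetry checks are left out.

```php
<?php
// Score a letter: tallest column of black pixels, measured above the average column.
function vertical_spike_score(array $img): float
{
    $width = count($img[0]);
    $columns = array_fill(0, $width, 0);
    foreach ($img as $row) {
        foreach ($row as $x => $pixel) {
            $columns[$x] += $pixel;
        }
    }
    $average = array_sum($columns) / $width;
    return max($columns) - $average;
}

// Nearest-neighbour rotation about the image centre (hypothetical helper).
function rotate_binary(array $img, float $degrees): array
{
    $h = count($img);
    $w = count($img[0]);
    $cx = ($w - 1) / 2;
    $cy = ($h - 1) / 2;
    $cos = cos(deg2rad($degrees));
    $sin = sin(deg2rad($degrees));
    $out = array_fill(0, $h, array_fill(0, $w, 0));
    for ($y = 0; $y < $h; $y++) {
        for ($x = 0; $x < $w; $x++) {
            // map each destination pixel back to the source pixel it came from
            $sx = (int) round($cx + ($x - $cx) * $cos + ($y - $cy) * $sin);
            $sy = (int) round($cy - ($x - $cx) * $sin + ($y - $cy) * $cos);
            if ($sx >= 0 && $sx < $w && $sy >= 0 && $sy < $h) {
                $out[$y][$x] = $img[$sy][$sx];
            }
        }
    }
    return $out;
}

// Try each angle and keep the one with the biggest vertical spike.
function best_derotation_angle(array $img, int $maxAngle = 30, int $step = 5): int
{
    $bestAngle = 0;
    $bestScore = -INF;
    for ($angle = -$maxAngle; $angle <= $maxAngle; $angle += $step) {
        $score = vertical_spike_score(rotate_binary($img, $angle));
        if ($score > $bestScore) {
            $bestScore = $score;
            $bestAngle = $angle;
        }
    }
    return $bestAngle;
}
```

Calling rotate_binary($letter, best_derotation_angle($letter)) would then give the straightened letter. A letter with no strong vertical stroke (an ‘o’ or an ‘s’, say) will not have a clear winner, which is where the extra checks come in.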

April 5th, 2008, posted by Harry

OK Cool

U R Gay

April 2nd, 2008, posted by Harry

GOCR to Neural Nets Pt 2

As per usual these posts don’t go as smoothly as I would like. The idea was to use FANN for PHP to make a simple neural network that would work easily. Hahahaha. You might think I make this stuff up as I go along. Oh wait. I do.

Anyway FANN requires PEAR to be installed and I figured that it’d be much simpler than installing PEAR modules to find something that was completely PHP to do the job. I did that. However.

1. It’s slow.

Ok so it’s slow. We can live with that right? I mean we’ve got some time.

2. It’s slow.

Ok it really is slow, and I’m getting impatient.

3. It’s painfully slow.

PHP sucks for some things. I do like C’s direct memory access. On the plus side this little neural net class is so simple and easy to understand.

Anyway I went ahead and I fitted all my modules together around this PHP neural net and I walked away whilst my computer attempted to learn the alphabet at a ridiculously slow pace. 900 captchas later I give it up.

The neural net does sort of work now for quite a few characters. Although I did notice a character failed to segment properly which set the learning back a bit. A lot of the characters it fails on are things like producing P’s for R’s or vice versa. So you can understand where the problem lies.

I won’t post code because there’s a fair bit spread over a few modules. nnbreak.php is the main module and must be run from the command line. Like this:

php nnbreak.php captcha.jpg answer.txt train=1/0

1 means train the network using the answer stored in answer.txt. 0 means guess what the captcha is in captcha.jpg, which will ignore answer.txt but answer.txt must still be included (although it can be blank or not exist etc).

So analysis… Here’s how it works roughly:

Allocate memory for the neural network, and load in previous neuron weight values

Extract the letters as said in that post about extracting letters.

Convert the letters into a 10×10 matrix of averaged values, maximum being 1.0

Loop through each letter and send the matrix of values to the input neurons

Check the output neurons

Possibly teach the network the right answer using backpropagation.
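The “10×10 matrix of averaged values” step could look roughly like this. It is a sketch, not the code from the download: it assumes the extracted letter is a hypothetical 2D array of 0/1 values indexed [row][column], and the function name to_input_matrix is made up. Each of the 100 cells is the fraction of black pixels in its patch of the letter, so the maximum is 1.0.

```php
<?php
// Shrink a letter of any size down to $size x $size averaged values,
// returned as a flat list ready to feed to the input neurons.
function to_input_matrix(array $img, int $size = 10): array
{
    $h = count($img);
    $w = count($img[0]);
    $inputs = [];
    for ($i = 0; $i < $size; $i++) {
        // rows covered by this cell (always at least one row)
        $y0 = intdiv($i * $h, $size);
        $y1 = max($y0 + 1, intdiv(($i + 1) * $h, $size));
        for ($j = 0; $j < $size; $j++) {
            // columns covered by this cell (always at least one column)
            $x0 = intdiv($j * $w, $size);
            $x1 = max($x0 + 1, intdiv(($j + 1) * $w, $size));
            $sum = 0;
            for ($y = $y0; $y < $y1; $y++) {
                for ($x = $x0; $x < $x1; $x++) {
                    $sum += $img[$y][$x];
                }
            }
            // fraction of black pixels in the cell, between 0.0 and 1.0
            $inputs[] = (float) $sum / (($y1 - $y0) * ($x1 - $x0));
        }
    }
    return $inputs;
}
```

A bigger matrix keeps more detail (which might help with the P versus R mix-ups) but means more input neurons and even slower training.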

We can tweak a lot of things such as the number of neurons in each layer and the size of the matrix. I did mess with the default backpropagation learning speed because it didn’t seem to be learning fast enough to me :D . I’m guessing that has some drawbacks but I’m not sure exactly what they are. In this line the 0.5 is the learning speed, which has been moved up from the default 0.1:

$nn = new nn(3, $layer_structure,1,0.5,0.9);
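One well-known drawback of a high learning speed is overshooting: each correction is so big that the weights jump past the value they were heading for. Here is a toy illustration, and only that. It does not use the nn class at all; it runs plain gradient descent on a single made-up weight with the made-up error function f(w) = 2w², and the function name descend is hypothetical. The rates at which things go wrong depend entirely on the function, so don’t read the numbers as applying to the network above.

```php
<?php
// Toy stand-in for training: minimise f(w) = 2 * w^2, whose gradient is 4 * w.
// The best possible weight is 0.
function descend(float $rate, int $steps = 20, float $w = 1.0): float
{
    for ($i = 0; $i < $steps; $i++) {
        // step against the gradient, scaled by the learning rate
        $w -= $rate * 4 * $w;
    }
    return $w;
}
```

On this toy function descend(0.1) creeps steadily towards 0, descend(0.5) bounces between 1 and -1 forever without getting any closer, and descend(0.6) gets further away with every step.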

Now if anyone happens to train the network to recognise all of the letters properly, send me the file so I can claim it as my own and pretend I did it all perfectly ;) . J/k. Seriously though, I do think it should eventually learn all the letters properly; it’s just taking a long time.

In conclusion: I don’t trust neural networks. They do too much stuff that I don’t know about. I’m betting that robot from the Terminator was probably built from neural nets.

The code to guess a captcha using neural nets - Already populated with some weights so it sort of works

March 30th, 2008, posted by Harry

I install plugins

I haven’t installed any plugins since I installed this blog. I was lazy and rushed and didn’t really think it through. I meant to install feedburner and forgot. I just got this well crafted comment which made me think:

“Hahaha, you have “nofollow”

No wonder you have no readers!!!!!!!!!

HAHAHA”

He’s probably right. So notice the dofollow plugin, top commentators, and a contact page that actually works.

Update: I just got spammed…Plugins go back off, however I’ll leave the contact page working :D

March 28th, 2008, posted by Harry

10 Steps to Solving a Captcha

Lately I’ve been stuck on one of my projects, hitting my head against the wall. I’m going to keep doing it until I solve the problem, however it’s got in the way of my posts. So I’ve taken some time out to put together two posts.

1. Start up The GIMP, Photoshop, or <insert favourite paint program here>

2. Mess around with all the plugins and filters you have available and see how much noise you can get rid of just with these. If a filter does a good job but removes too much information remember you can always use another filter on the same image and paste these two images together at the end.

3. If the letters are getting sketchy with all these filters, and too thin you can try combining them with the original image. You can do this by floodfilling from the remaining pixels on the altered image onto the original image. Now copy these floodfilled parts of the original image back into your altered image. This won’t work if straight lines cut through the letters though.

4. If “artefacts” or noise are still on the image then you need to list their common properties. Does a line have to start from an edge? Is it always straight? Is it within a set angle? Are dots spaced a certain distance apart or entirely random? And so on. We can then take all this into consideration when we write our custom noise removal algorithm.

5. Get a programming language you can throw ideas around in quickly. Basically this is just a draft. Save a rough image from Photoshop and maybe even use Visual Basic or something to test ideas out. Even if you don’t know how to apply all those Photoshop filters in code, who cares? Knowing that an idea works before you build it beats writing a ton of code that you don’t even know will work.

6. Write that custom noise removal algorithm I talked about earlier. ;) Don’t delete code!!! Ever. Even if it doesn’t work just archive it somewhere because sometime you just know you’ll want it again.

7. Once you have the noise removed you may need to break overlapping letters up. All letters are made up of certain types of strokes (and loops in handwriting). As I understand it this is actually the basis behind natural handwriting recognition. For instance, in a captcha, if a letter ends with a curve then often that will be the end of the letter, depending on the length of the curve and as long as the letter isn’t rotated at a strange angle. You’ll probably need to consider these kinds of things if the letters overlap.

*** This is possibly one of the hardest steps

8. Once you have single letters you have to de-rotate them. Read SlightlyShadySeo. Again think about common properties. Don’t accidentally rotate a letter upside down if there is no way a letter would be that way up in the first place. Although that was just an example and probably wouldn’t happen.

9. Write an automated script that downloads a ton of captchas and only requires your input to train it with GOCR or phpOCR. It’s way more fun watching your computer do something than doing it yourself.

10. Optimize the algorithm. If your algorithm is slow but works you can probably save time by removing parts of your code that are running more times than they need to etc.

Abuse each of these 10 steps and change them to suit your personality. Receive flashes of inspiration. They’re fun. Don’t drink high caffeine drinks. They make me talk about absolute rubbish non-stop…

March 26th, 2008, posted by Harry

Email Verification

So you’ve cracked that phpBB2 or phpBB3 captcha, registered an account, and now it wants you to verify your account by email. Foiled again.

Actually this is pretty easy to get around. All you need is a free email service that supports webmail, and a page scraping utility. Hmmmm… Guess what, my page scraping code will work excellently with webmail services. What’s really handy is as long as you point the cookies string to a proper empty file it will keep the session details allowing you to log on as if you were using a normal web browser. So then you would just use preg_match to find important parts of the page (like login buttons, inbox, and so on), follow these links, until you find the link that says “Confirm your email address” or similar.

Or you could use temporary email…

$output = scrape_page("http://www.mytrashmail.com/myTrashMail_inbox.aspx?email=" . $temp_email_name);

That’ll dump the html of your temporary inbox. You can even delete the email promptly to save them space.

If you’re really good you can download a POP3 PHP class and log into GoogleMail directly ;)

March 26th, 2008, posted by Harry

PHPBB3 Captcha difficulty

Is phpBB3 more secure than phpBB2? Here is a default phpBB3 sample.

PHPBB3 captcha

This is a lot stronger than a phpBB2 captcha. We can’t separate a letter based purely on its colour anymore. Notice how there is a line running underneath the B that is the same colour as the B. The background colour is annoying as anything but only from a person’s point of view. Our PC doesn’t really mind.

One of its weaknesses is that no lines cut across the squares; they all go underneath them. That means there are no breaks in the squares that we have to detect. The only other weakness I can see is that the lines go directly across without intersecting at any point. That means no bits of noise end up looking like the squares that make up the letters.

So here’s my algorithm which I think would solve it. Admittedly I haven’t tested this but I don’t see why it wouldn’t work. All the letters are made up of squares. We need to test whether, starting at one pixel, we can get back to the start by following pixels of the same colour. That obviously would make a square :P , or something close like a distorted rectangle. It’s almost like a dot to dot puzzle. If we can get back to the start, keep the line and colour it in using the fill function from the previous post (or PHP GD’s one :D ). If we can’t get back to the start, or the line keeps travelling too far, then we remove it and find another coloured pixel that doesn’t match the background colour.
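One possible way to code the “can we get back to the start” test, swapping the pixel-by-pixel tracing for a different technique that answers a similar question: flood fill the background inwards from the image border, and if any background pixel is never reached, something has closed a loop around it. Like the idea itself, this is untested against real phpBB3 captchas. It assumes a hypothetical 2D array $mask indexed [row][column], where 1 marks pixels of the one colour being checked and 0 marks everything else; the function name encloses_area is made up.

```php
<?php
// Returns true if the 1-pixels enclose at least one 0-pixel that cannot
// reach the image border, i.e. an outline that closes on itself.
function encloses_area(array $mask): bool
{
    $h = count($mask);
    $w = count($mask[0]);
    $seen = array_fill(0, $h, array_fill(0, $w, false));
    $stack = [];

    // seed the fill with every background pixel on the border
    for ($y = 0; $y < $h; $y++) {
        for ($x = 0; $x < $w; $x++) {
            $onBorder = ($x === 0 || $y === 0 || $x === $w - 1 || $y === $h - 1);
            if ($onBorder && $mask[$y][$x] === 0) {
                $seen[$y][$x] = true;
                $stack[] = [$x, $y];
            }
        }
    }

    // 4-connected flood fill of the outside background
    while ($stack) {
        [$x, $y] = array_pop($stack);
        foreach ([[1, 0], [-1, 0], [0, 1], [0, -1]] as [$dx, $dy]) {
            $nx = $x + $dx;
            $ny = $y + $dy;
            if ($nx >= 0 && $nx < $w && $ny >= 0 && $ny < $h
                && !$seen[$ny][$nx] && $mask[$ny][$nx] === 0) {
                $seen[$ny][$nx] = true;
                $stack[] = [$nx, $ny];
            }
        }
    }

    // any background pixel the fill never reached is enclosed
    for ($y = 0; $y < $h; $y++) {
        for ($x = 0; $x < $w; $x++) {
            if ($mask[$y][$x] === 0 && !$seen[$y][$x]) {
                return true;
            }
        }
    }
    return false;
}
```

This only says that a loop exists somewhere in the mask, not where it is, so it would need running on one connected blob of colour at a time. It doesn’t care how thick the lines are, which sidesteps one of the issues below.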

The main issues we would have to overcome are lines which are thicker than 1 pixel and small blocks of colour found at the side of some of the letters. The other issue would be making sure we don’t recheck the part we just shaded in (Maybe use a unique colour for it?).

March 21st, 2008, posted by Harry

A custom floodfill routine

Last post I said that to separate characters you simply need to flood fill them and calculate the extreme points to find out where to fit your rectangle. Hahaha. I’ve spent the last few days trying to port optimized floodfill functions to PHP. Normally I’d just take the sane easy option and use pre-written code like GOCR but apparently I like pain.

The problem is the optimized bit. I can write a simple recursive floodfill function that calls itself until it’s done but I have no idea how to write something that will be reasonably fast. The more complicated captchas will require a fair bit of speed because you will be thinking about cracking a fair few of them in a period of time. This site is where I eventually found a simple routine that worked. My issue was I used another routine, ported it, and then it broke. Miserably. After filling only two lines.

Here is my code. First it loads in an image of a captcha. It then scans two lines along the horizontal axis, one at 1/4 of the way down the picture and one at 3/4 of the way down. When it hits a character it floodfills it. The floodfill function returns the extreme positions of pixels which gives us a rectangle around that letter.

<?php

function floodFillScanlineStack($image, $x, $y)
{
    // the colour we are shading in - black letters
    $oldColour = 0;
    // the colour we want to shade the letters in - red just because we can
    $fillColour = imagecolorallocate($image, 255, 0, 0);

    // we need the image width & height
    $w = imagesx($image);
    $h = imagesy($image);

    // set the rectangle co-ords
    $rectangle = array("x1" => $x, "x2" => $x, "y1" => $y, "y2" => $y);

    if($oldColour == $fillColour) return;

    $stack = array();
    $stack[] = array("x" => $x, "y" => $y);

    while(count($stack)>0)
    {
        $pos = array_pop($stack);
        $x = $pos['x'];
        $y = $pos['y'];

        // walk up to the top of this vertical run of black pixels
        $y1 = $y;
        while($y1 >= 0 && imagecolorat($image, $x, $y1) == $oldColour) $y1--;
        $y1++;
        $spanLeft = 0;
        $spanRight = 0;
        // now walk back down the run, filling as we go
        while($y1 < $h && imagecolorat($image, $x, $y1) == $oldColour )
        {

            // here we set the pixel colour
            // use these to find our rectangle around the letter
            imagesetpixel($image, $x, $y1, $fillColour);
            if($x<$rectangle['x1'])
                $rectangle['x1'] = $x;
            if($y1<$rectangle['y1'])
                $rectangle['y1'] = $y1;
            if($x>$rectangle['x2'])
                $rectangle['x2'] = $x;
            if($y1>$rectangle['y2'])
                $rectangle['y2'] = $y1;

            // queue the column to the left once per run of black pixels
            if($spanLeft==0 && $x > 0 && imagecolorat($image, $x - 1, $y1) == $oldColour)
            {
                $stack[] = array("x" => $x - 1, "y" => $y1);
                $spanLeft = 1;
            }
            else if($spanLeft==1 && $x > 0 && imagecolorat($image, $x - 1, $y1) != $oldColour)
            {
                $spanLeft = 0;
            }
            // same for the column to the right; $w - 1 keeps $x + 1 inside the image
            if($spanRight==0 && $x < $w - 1 && imagecolorat($image, $x + 1, $y1) == $oldColour)
            {
                $stack[] = array("x" => $x + 1, "y" => $y1);
                $spanRight = 1;
            }
            else if($spanRight==1 && $x < $w - 1 && imagecolorat($image, $x + 1, $y1) != $oldColour)
            {
                $spanRight = 0;
            }
            $y1++;
        }
    }

    return $rectangle;
}

function floodfill_char($image, $x)
{
    if((imagecolorat($image, $x, 12)==0))
        return floodFillScanlineStack($image, $x, 12);

    if((imagecolorat($image, $x, 38)==0))
        return floodFillScanlineStack($image, $x, 38);
}

function split_chars_along_vertical($image)
{
    $w = imagesx($image);
    $h = imagesy($image);

    /* $rgb = imagecolorat($img, $x, $y);
    $r += $rgb >> 16;
    $g += $rgb >> 8 & 255;
    $b += $rgb & 255; */

    // scan along each vertical line looking for black pixels
    // we'll only scan two lines of pixels to save time. Both along the center
    // split slightly apart
    $letters = array();
    for($index=0; $index<$w; $index++)
    {
        // check two lines of pixels one at 12 down, one at 38 down
        // the picture is 50 pixels tall by the way
        if((imagecolorat($image, $index, 12)==0) || (imagecolorat($image, $index, 38)==0))
        {
            // fill the character and return a rectangle around the image
            $rectangle = floodfill_char($image, $index);

            // pull this letter out into a new image
            $singleLetter = imagecreatetruecolor($rectangle['x2'] - $rectangle['x1'] + 1,
                $rectangle['y2'] - $rectangle['y1'] + 1);
            imagecopy($singleLetter, $image, 0, 0, $rectangle['x1'], $rectangle['y1'],
                $rectangle['x2'] - $rectangle['x1'] + 1,
                $rectangle['y2'] - $rectangle['y1'] + 1);
            $letters[] = $singleLetter;

            // find the next character; the loop's $index++ moves us past x2
            $index = $rectangle['x2'];
        }
    }

    return $letters;
}

$image = @imagecreatefrompng('82.clean.png');
if ($image == false) { die ('Unable to open image'); }

$letters = split_chars_along_vertical($image);

// dump the first letter to the screen
header("content-type:image/png");
imagepng($letters[0]);

?>

To run through the code and show how it works I made this neat little gif. To be honest it’s probably just a waste of my bandwidth but it looks pretty cool.

Floodfill gif

Now I can finally sort this damn neural network out :P . You just know something is going to break. Of course that’s the fun part, right? Anyone?

Scripts to separate characters

March 21st, 2008, posted by Harry