Archive for March 26th, 2008

10 Steps to Solving a Captcha

Lately I’ve been stuck on one of my projects, hitting my head against the wall. I’m going to keep doing it until I solve the problem, but it has got in the way of my posts. So I’ve taken some time out to put together two posts.

1. Start up The GIMP, Photoshop, or <insert favourite paint program here>

2. Mess around with all the plugins and filters you have available and see how much noise you can get rid of with these alone. If a filter does a good job but removes too much information, remember you can always run a different filter on another copy of the same image and paste the two results together at the end.

3. If the letters are getting sketchy and too thin after all these filters, you can try combining them with the original image. You can do this by floodfilling from the remaining pixels on the altered image onto the original image. Now copy these floodfilled parts of the original image back into your altered image. This won’t work if straight lines cut through the letters, though.

4. If “artefacts” or noise are still on the image then you need to list their common properties. Does a line have to start from an edge? Is it always straight? Is it within a set angle? Are dots spaced a certain distance apart or entirely random? And so on. We can then take all this into consideration when we write our custom noise removal algorithm.

5. Get a programming language you can throw ideas around in quickly. Basically this is just a draft. Save a rough image out of Photoshop and maybe even use Visual Basic or something to test ideas out. Even if you don’t know how to apply all those filters you used in Photoshop in code, who cares? If you know it works you can move mountains compared to writing a ton of code that you don’t even know will work.

6. Write that custom noise removal algorithm I talked about earlier. ;) Don’t delete code!!! Ever. Even if it doesn’t work, just archive it somewhere, because you just know that some time you’ll want it again.

7. Once you have the noise removed you may need to break overlapping letters up. All letters are made up of certain types of strokes (and loops in handwriting). As I understand it this is actually the basis behind natural handwriting recognition. For instance, in a captcha, if a letter ends with a curve then often that will be the end of the letter, depending on the length of the curve and as long as the letter isn’t rotated at a strange angle. You’ll probably need to consider these kinds of things if the letters overlap.

*** This is possibly one of the hardest steps

8. Once you have single letters you have to de-rotate them. Read SlightlyShadySeo. Again think about common properties. Don’t accidentally rotate a letter upside down if there is no way a letter would be that way up in the first place. Although that was just an example and probably wouldn’t happen.

9. Write an automated script that downloads a ton of captchas and only requires your input to train it with GOCR or phpOCR. It’s way more fun watching your computer do something than doing it yourself.

10. Optimize the algorithm. If your algorithm works but is slow, you can probably save time by cutting out parts of your code that run more times than they need to.
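To make steps 4 and 6 a bit more concrete, here is a rough Python sketch of one possible custom noise filter. It assumes the noise is small stray specks and that the image has already been thresholded into rows of 1 (ink) and 0 (background). The function name `remove_small_blobs` and the `min_size` cut-off are made up for this example, and it won’t help at all with lines that cut through letters.

```python
from collections import deque


def remove_small_blobs(pixels, min_size):
    """Return a copy of a binary image with small ink blobs erased.

    pixels is a list of rows; 1 means ink, 0 means background.
    Any 8-connected blob with fewer than min_size pixels is treated as noise.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    cleaned = [row[:] for row in pixels]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if pixels[y][x] != 1 or seen[y][x]:
                continue
            # Flood fill outwards from this pixel to collect one whole blob.
            blob = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and pixels[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Blobs below the size threshold are assumed to be specks.
            if len(blob) < min_size:
                for by, bx in blob:
                    cleaned[by][bx] = 0
    return cleaned


if __name__ == "__main__":
    noisy = [
        [1, 0, 0, 0, 0, 0],
        [0, 0, 1, 1, 0, 0],
        [0, 0, 1, 1, 0, 1],
        [0, 0, 1, 1, 0, 0],
    ]
    for row in remove_small_blobs(noisy, min_size=3):
        print("".join("#" if p else "." for p in row))
```

The input list is left untouched, so you can try several thresholds on the same image and compare the results, in the same spirit as keeping your old code around.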

Abuse each of these 10 steps and change them to suit your personality. Receive flashes of inspiration. They’re fun. Don’t drink high caffeine drinks. They make me talk about absolute rubbish non-stop…

Wednesday, March 26th, 2008

Email Verification

So you’ve cracked that phpBB2 or phpBB3 captcha, registered an account, and now it wants you to verify your account by email. Foiled again.

Actually this is pretty easy to get around. All you need is a free email service that supports webmail, and a page scraping utility. Hmmmm… Guess what, my page scraping code will work excellently with webmail services. What’s really handy is that, as long as you point the cookies string to a proper empty file, it will keep the session details, allowing you to log on as if you were using a normal web browser. So then you would just use preg_match to find the important parts of the page (like login buttons, inbox, and so on) and follow these links until you find the link that says “Confirm your email address” or similar.

Or you could use temporary email…

$output = scrape_page("http://www.mytrashmail.com/myTrashMail_inbox.aspx?email=" . $temp_email_name);

That’ll dump the HTML of your temporary inbox. You can even delete the email promptly to save them space.

If you’re really good you can download a POP3 PHP class and log into GoogleMail directly ;)

Wednesday, March 26th, 2008