10 Steps to Solving a Captcha
Lately I’ve been stuck on one of my projects, hitting my head against the wall. I’m going to keep at it until I solve the problem, but it has been getting in the way of my posts. So I’ve taken some time out to put together two posts.
1. Start up The GIMP, Photoshop, or <insert favourite paint program here>
2. Mess around with all the plugins and filters you have available and see how much noise you can get rid of with these alone. If a filter does a good job but removes too much information, remember you can always run a different filter on another copy of the same image and paste the two results together at the end.
3. If all these filters leave the letters sketchy and too thin, you can try combining the result with the original image. Floodfill on the original image, starting from the pixels that remain in the altered image. Then copy those floodfilled parts of the original back into your altered image. This won’t work if straight lines cut through the letters, though.
4. If “artefacts” or noise are still on the image, then you need to list their common properties. Does a line have to start from an edge? Is it always straight? Is it within a set angle? Are dots spaced a certain distance apart or entirely random? And so on. We can then take all this into consideration when we write our custom noise removal algorithm.
5. Get a programming language you can throw ideas around in quickly. Basically this is just a draft. Save a rough image out of Photoshop and maybe even use Visual Basic or something to test ideas out. Even if you don’t know how to apply all those Photoshop filters in code, who cares? Knowing that an idea works gets you much further than writing a ton of code that you don’t even know will work.
6. Write that custom noise removal algorithm I talked about earlier.
Don’t delete code!!! Ever. Even if it doesn’t work, just archive it somewhere, because you just know you’ll want it again sometime.
7. Once you have the noise removed you may need to break overlapping letters up. All letters are made up of certain types of strokes (and loops in handwriting). As I understand it, this is actually the basis behind natural handwriting recognition. For instance, in a captcha, if a letter ends with a curve then often that will be the end of the letter, depending on the length of the curve and as long as the letter isn’t rotated at a strange angle. You’ll probably need to consider these kinds of things if the letters overlap.
*** This is possibly one of the hardest steps
8. Once you have single letters you have to de-rotate them. Read SlightlyShadySeo. Again, think about common properties. Don’t accidentally rotate a letter upside down if there is no way a letter would be that way up in the first place. That was just an example, though, and probably wouldn’t happen.
9. Write an automated script that downloads a ton of captchas and only requires your input to train it with GOCR or phpOCR. It’s way more fun watching your computer do something than doing it yourself.
10. Optimize the algorithm. If your algorithm works but is slow, you can probably save time by cutting out parts of your code that run more times than they need to.
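To make steps 3, 4 and 6 a bit more concrete, here is a rough Python sketch of the kind of draft code I mean. It is only one possible approach, not the definitive one, and it rests on a few assumptions of mine: the captcha has already been thresholded to a grid of 0s and 1s (1 = ink), pixels count as connected only up, down, left and right, and "noise" means blobs smaller than a made-up `min_pixels` threshold. The function names are hypothetical too.

```python
from collections import deque

def components(grid):
    """Return the 4-connected blobs of ink as lists of (row, col) pixels."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    found = []
    for r in range(rows):
        for c in range(cols):
            if not grid[r][c] or seen[r][c]:
                continue
            # Floodfill outwards from this unvisited ink pixel.
            seen[r][c] = True
            queue, blob = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                blob.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            found.append(blob)
    return found

def remove_small_blobs(grid, min_pixels):
    """Steps 4 and 6: drop blobs with fewer than min_pixels pixels (stray dots)."""
    out = [[0] * len(grid[0]) for _ in grid]
    for blob in components(grid):
        if len(blob) >= min_pixels:  # min_pixels is an assumed, hand-tuned value
            for y, x in blob:
                out[y][x] = 1
    return out

def restore_from_original(filtered, original):
    """Step 3: keep every blob of the original that still has a pixel in filtered."""
    out = [[0] * len(original[0]) for _ in original]
    for blob in components(original):
        if any(filtered[y][x] for y, x in blob):
            for y, x in blob:
                out[y][x] = 1
    return out

# Tiny made-up example: one 5-pixel "letter" plus two stray dots.
original = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0],
]
cleaned = remove_small_blobs(original, min_pixels=3)

# Pretend a harsh filter left only one pixel of the letter behind.
thinned = [
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
restored = restore_from_original(thinned, original)

for row in restored:
    print("".join("#" if pixel else "." for pixel in row))
```

In this toy example both `cleaned` and `restored` end up containing just the 5-pixel letter. As step 3 warns, the restore trick falls apart when a noise line touches a letter, because the floodfill then drags the whole line back in along with it.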
Abuse each of these 10 steps and change them to suit your personality. Receive flashes of inspiration. They’re fun. Don’t drink high caffeine drinks. They make me talk about absolute rubbish non-stop…
Wednesday, March 26th, 2008
