Removing Lines Across Letters
Squidoo and Recaptcha.
Both have an annoying line running through the text which joins the letters together. But how effective is this really? The thing about Recaptcha is that one of the two words is already known to OCR successfully, while the other is a word that OCR could not read. So in Recaptcha we can simply type in the one known word correctly and it won't be able to check the other one. We'll probably still need some pretty decent OCR software and an approximation module that guesses how close a word is to a proper English word.
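As a rough idea of what that approximation module could look like, here is a minimal Python sketch using the standard library's difflib. It is not part of the download, and the word list and the `closest_word` name are made up for illustration; a real run would load a full dictionary.

```python
import difflib

# Hypothetical tiny word list; a real run would load a full dictionary file.
WORDS = ["morning", "captcha", "letter", "through", "overlooks"]

def closest_word(ocr_guess, words=WORDS, cutoff=0.6):
    """Return (best_match, similarity) for an OCR guess, or (None, 0.0)."""
    guess = ocr_guess.lower()
    # get_close_matches ranks candidates by SequenceMatcher similarity
    # and drops anything scoring below `cutoff`.
    matches = difflib.get_close_matches(guess, words, n=1, cutoff=cutoff)
    if not matches:
        return None, 0.0
    best = matches[0]
    score = difflib.SequenceMatcher(None, guess, best).ratio()
    return best, score

# "rn" is a classic OCR misread of "m".
print(closest_word("rnorning"))
```

The cutoff of 0.6 is just difflib's default. It would need tuning against real OCR output.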
But that line through the text destroys any chance of standard OCR software recognising anything. However, here's a weakness. The line generally starts somewhere near the middle and often sticks out slightly past the end of the letters. It shouldn't be too hard to pick up where the line starts and possibly where it ends. From there we can assume that it won't ever be thicker than a certain amount and will only move by a limited amount from one pixel to the next. We can roughly track the line whenever it exits a letter. From that we can estimate where it has been travelling and which part is letter and which part is line.
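To show how the thickness assumption separates line from letter, here is a small Python sketch. It is my own illustration rather than the code in the download: it assumes the image is a list of rows of 0/1 pixels and that an estimated path along the line is already known. The `erase_thin_line` name and the `max_thickness` value are made up.

```python
def erase_thin_line(img, path, max_thickness=3):
    """Blank out line pixels along `path`, leaving thick (letter) areas alone.

    img  -- list of rows of 0 (blank) / 1 (dark) pixels, modified in place
    path -- list of (x, y) points estimated to lie on the line
    """
    height = len(img)
    for x, y in path:
        if not img[y][x]:
            continue
        # Measure the vertical run of dark pixels passing through (x, y).
        top = y
        while top > 0 and img[top - 1][x]:
            top -= 1
        bottom = y
        while bottom < height - 1 and img[bottom + 1][x]:
            bottom += 1
        # A thin run is assumed to be line only; a thick run means the
        # line is crossing a letter stroke, so it is left untouched.
        if bottom - top + 1 <= max_thickness:
            for row in range(top, bottom + 1):
                img[row][x] = 0
    return img
```

Where the line crosses a letter the stroke is kept whole, which leaves the line's pixels inside the letter. That is probably the lesser evil for OCR compared with cutting a gap into the letter.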
Incidentally, this works pretty well on Digg too, except that the vast quantity of lines and their differing shades make it harder to pick them all up. Often you'll pick up what looks like a line and have to flip 180 degrees and trace the other way to make sure you haven't missed anything. The other problem with Digg is that it's easy to end up with breaks in the lines and have to "trace blind". If you have already traced enough of the line, that's not too hard: it's a pretty basic algorithm to keep tracing through blank space in a straight line until you hit the rest of the line. Just be aware you might be a pixel, or very rarely two, away from the actual line.
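Tracing blind can be sketched roughly like this in Python. It is an illustration under my own assumptions, not the downloadable code: the image is a list of rows of 0/1 pixels, the trace runs left to right, and the `bridge_gap` name plus the `max_gap`, `window` and `tolerance` values are invented for the example.

```python
def bridge_gap(img, traced, max_gap=10, window=5, tolerance=2):
    """Extend a traced line across blank pixels by straight-line extrapolation.

    img    -- list of rows, each a list of 0 (blank) / 1 (dark) pixels
    traced -- list of (x, y) points already traced, left to right
    Returns the (x, y) where the line is picked up again, or None.
    """
    # Estimate the direction from the last few traced points.
    recent = traced[-window:]
    (x0, y0), (x1, y1) = recent[0], recent[-1]
    slope = (y1 - y0) / (x1 - x0) if x1 != x0 else 0.0
    height, width = len(img), len(img[0])
    for step in range(1, max_gap + 1):
        x = x1 + step
        if x >= width:
            break
        guess = round(y1 + slope * step)
        # Check the predicted row first, then rows up to `tolerance` away,
        # since the real line may be a pixel or two off the prediction.
        for off in sorted(range(-tolerance, tolerance + 1), key=abs):
            y = guess + off
            if 0 <= y < height and img[y][x]:
                return x, y
    return None
```

The tolerance of two rows matches the "a pixel, or very rarely two" error mentioned above. A wider tolerance risks latching onto a neighbouring line.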
With Squidoo the line is a lot easier to identify, but there is a lot more distortion in the letters. The distortion could be an issue, and it might need another algorithm to beat it if it can't be trained out.
Anyway, below is some code with a line detection algorithm. It assumes the furthest points left and right of a series of letters are part of the line. It then tries to trace along the line. It's nowhere near perfect at the moment, as it suffers from some issues when building the line at the end that cause it to favour travelling upwards. But it shows that with a bit more tweaking those lines can be removed. I compiled it on Linux, so if you're running Windows it'll be easier to test inside a Linux VM. The pics below show a perfect-case scenario.
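The endpoint assumption is simple enough to show in a few lines of Python. This is a sketch for illustration only, not an extract from the download: it assumes a list of rows of 0/1 pixels, and `line_endpoints` is a made-up name.

```python
def line_endpoints(img):
    """Return ((x_left, y_left), (x_right, y_right)) for the outermost dark
    columns, assuming the line sticks out past the letters on both sides.
    Returns None for a blank image."""
    columns = {}
    for y, row in enumerate(img):
        for x, v in enumerate(row):
            if v:
                columns.setdefault(x, []).append(y)
    if not columns:
        return None

    def middle(x):
        ys = columns[x]  # ascending, because rows are scanned top to bottom
        return x, ys[len(ys) // 2]  # middle pixel if the line is thick

    return middle(min(columns)), middle(max(columns))
```

If a letter reaches further out than the line does, this picks the letter instead, which is why it only holds in the perfect case.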
Download the code with the link below:
Code to detect and draw over the line in Squidoo captchas
==================================
ALGORITHM OVERVIEW
==================================
- Try to draw as many lines as possible, only allowing each line to move up or down by one pixel for each pixel travelled along the horizontal axis. If a line hits a blank space, stop drawing that line and carry on drawing the others.
- Find the parts of the image where the line is most likely to skip. See pic below.
- Calculate the average incline per pixel of horizontal movement between each section where the line "jumps" (height / width).
- Use this incline to latch onto the closest shaded pixel. Then finally smooth the line.
==================================
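The first and third steps of the overview can be sketched in Python roughly as follows. This is a simplified illustration, not the downloadable code: it assumes a list of rows of 0/1 pixels and traces a single line from a given dark starting pixel. The `trace_segment` and `average_incline` names are made up.

```python
def trace_segment(img, start):
    """Follow dark pixels rightwards from `start`, moving at most one row
    up or down per column, and stop at the first blank space."""
    height, width = len(img), len(img[0])
    x, y = start
    path = [(x, y)]
    while x + 1 < width:
        # Prefer staying level, then one row up, then one row down.
        # The up-before-down order is arbitrary and can bias the trace.
        for ny in (y, y - 1, y + 1):
            if 0 <= ny < height and img[ny][x + 1]:
                x, y = x + 1, ny
                path.append((x, y))
                break
        else:
            break  # nothing dark within one row: this segment ends here
    return path

def average_incline(a, b):
    """Rows moved per column between points a and b (height / width).
    Assumes the two points are in different columns."""
    (ax, ay), (bx, by) = a, b
    return (by - ay) / (bx - ax)
```

For example, a segment running from (0, 2) to (4, 4) has an incline of 0.5, so the search for the next shaded pixel would drop about one row for every two columns.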
What about other captchas like MySpace where the letters actually touch? Hmmm… I wonder if they considered the best choice of font?

June 11th, 2008 at 9:18 am
You have mentioned interesting captcha lines. I'm wondering if this has so many things in making it!
June 12th, 2008 at 2:05 am
It's quite interesting to know…
June 29th, 2008 at 9:38 pm
Very interesting captcha technique. Thanks
July 1st, 2008 at 4:33 am
wow!!, nice tips dude. I can use it for my captcha lines.