Tuesday, May 23, 2006

Obscuring Your Email From Spammers

Fighting spam is a full-time job for some and at least an annoyance for others. Those of us who have a website and want to plainly display an email address for the convenience of our readers have a problem: we are also plainly displaying it to the web scrapers that feed spammers. Over the years a good number of techniques have arisen to deal with this, which I will outline in this article.

For each technique I will discuss any drawbacks. There is no perfect method but even a small effort is better than none. The theory is this: spammers operate on the basis of volume. It is not worth their while to slow down to do any sort of complicated parsing since the payoff is only a few more addresses out of millions (how many people bother with these techniques?).

That said, I will embark upon this wonderful journey of discovery as a testament to the inventiveness of those who have pioneered these methods, whether they are justified or not!

Plain Text
Code as: <a href="mailto:first.last@domain.com">
first.last@domain.com</a>


Looks like: first.last@domain.com

This amounts to not doing anything. Your email address is displayed openly. Any spam can be dealt with on the receiving end. Your readers need no special browsers or capabilities and can click on the link with expected results.
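
To see why a bare address is such an easy target, here is a minimal sketch of what a harvester might do. The function name harvestAddresses and the regular expression are my own illustration, not taken from any real scraper, and the pattern is deliberately loose.

```javascript
// Hypothetical harvester: pull anything that looks like an email address
// out of a chunk of page markup.
function harvestAddresses(markup) {
  var pattern = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
  var found = markup.match(pattern) || [];
  // Remove duplicates (the mailto link and the link text both match).
  return found.filter(function (addr, i) {
    return found.indexOf(addr) === i;
  });
}

var page = '<a href="mailto:first.last@domain.com">first.last@domain.com</a>';
console.log(harvestAddresses(page));
```

One regular expression and a loop over fetched pages is all it takes, which is why volume harvesting is so cheap.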

Character Entities
Code as: first&#46;last&#64;domain&#46;com

Looks like: first.last@domain.com

HTML character entities are decoded by the browser back into displayable type, but look like some sort of gibberish at the markup level. However, they are easy to automatically decode, and so cannot be recommended as a way of avoiding spammers.
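
How easy is "easy to automatically decode"? A rough sketch, assuming the scraper only cares about decimal numeric entities like the ones above (the function name decodeNumericEntities is hypothetical):

```javascript
// Replace every decimal numeric entity (e.g. &#64;) with the character it stands for.
function decodeNumericEntities(markup) {
  return markup.replace(/&#(\d+);/g, function (match, code) {
    return String.fromCharCode(parseInt(code, 10));
  });
}

var encoded = 'first&#46;last&#64;domain&#46;com';
console.log(decodeNumericEntities(encoded)); // first.last@domain.com
```

A scraper would run something like this over the page before looking for addresses, so the entities cost it almost nothing.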

Typographic Obfuscation
Code as: first dot last at domain dot com
f i r s t . l a s t @ d o m a i n . c o m
first.REMOVE.last@domain.com


By spelling out parts of your address, adding spaces, using synonyms, or including obviously extraneous words, you are relying on a reader to visually decode and rewrite your email. This technique, like most of those that follow, precludes the use of a convenient mailto link, because if it's convenient to the reader then it is to the spider as well.

Unfortunately this means extra work for your reader, which will reduce the number of messages you get. Presuming you want to communicate, this is a bad thing. Also, web scrapers may be smart enough to piece together a valid email, since there are only so many substitutions that must be tried. (Removing whitespace is almost too easy.)
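
To illustrate how few substitutions a scraper would need to try, here is a sketch that undoes the three examples above. It assumes the scraper has already isolated a candidate string; the name deobfuscate and the list of rules are my own guesses, not any real tool's.

```javascript
// Undo a few common typographic disguises on a candidate address string.
function deobfuscate(text) {
  return text
    .replace(/\s+dot\s+/gi, '.')   // "first dot last" becomes "first.last"
    .replace(/\s+at\s+/gi, '@')    // "last at domain" becomes "last@domain"
    .replace(/\.REMOVE\./g, '.')   // drop the obviously extraneous word
    .replace(/\s+/g, '');          // removing whitespace is almost too easy
}

console.log(deobfuscate('first dot last at domain dot com'));
console.log(deobfuscate('f i r s t . l a s t @ d o m a i n . c o m'));
console.log(deobfuscate('first.REMOVE.last@domain.com'));
```

All three calls print first.last@domain.com. Real disguises vary more than this, but each new variant costs the scraper only one more line.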

Substitute With A Graphic
Code as:
<img src="myaddress.png" border="0" alt="my email address">

Looks like:

Converting the text to an image file definitively foils spiders, but is a barrier to your users and may break usability guidelines. The reason is that you cannot safely put your address in the clear in the ALT attribute, so those without a visual display get no useful info. (A similar technique uses Flash files, but has no additional advantages.)

JavaScript Generation

There are many possible variants on the theme of programmatically creating the email link. This is my own, which contains some enhancements.

Note that significant strings that a spider might be set to recognise (eg: "mailto") are broken up. Also, character entities are used for the symbols, plus the components of the email address are listed backwards.
<script language="JavaScript"> <!--
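function InsertEmail(t) {
  // Character entities for "." and "@"; the browser decodes them
  // when it parses the markup written below.
  var chardot = '&#46;';
  var charat = '&#64;';
  // Address components, deliberately listed backwards.
  var commune = new Array('com', 'domain', 'last', 'first');

  // "mailto:" is split across two writes so the literal string
  // never appears in the source.
  document.write('<a href="ma');
  document.write('ilto:');
  document.write(commune[3]);
  document.write(chardot);
  document.write(commune[2]);
  document.write(charat);
  document.write(commune[1]);
  document.write(chardot);
  document.write(commune[0]);
  document.write('">');
  // t is the visible link text.
  document.write(t);
  document.write('</a>');
}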
// --> </script>


In practice one would move the function to an external JS file, making it even less likely to be found and parsed, and call InsertEmail with the link text at the point where the link should appear. The drawback of this technique is that it restricts your readers to JavaScript-enabled browsers, which may not be a significant limitation.
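
For anyone who wants to check the output without a browser, here is a variant of the same idea that returns the markup as a string instead of writing it into the page. The name buildEmailLink and its parameters are hypothetical; it is a sketch of the technique above, not a replacement for it.

```javascript
// Hypothetical string-returning variant of InsertEmail.
// parts holds the address components backwards: [tld, domain, last, first].
function buildEmailLink(parts, linkText) {
  var dot = '&#46;'; // character entity for "."
  var at = '&#64;';  // character entity for "@"
  var address = parts[3] + dot + parts[2] + at + parts[1] + dot + parts[0];
  // "mailto:" is split so the literal string never appears in the source.
  return '<a href="ma' + 'ilto:' + address + '">' + linkText + '</a>';
}

var markup = buildEmailLink(['com', 'domain', 'last', 'first'], 'email me');
console.log(markup);
```

In a page you would pass the returned string to document.write, and the browser would decode the entities when it parses the link.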

JavaScript Generation With Obfuscation

To take the previous technique even further you can obfuscate the JavaScript. The online tool Enkoder creates something like this:

<script type="text/javascript">
/* <![CDATA[ */
function hivelogic_enkoder(){var kode=
"kode=\"oked\\\"=')('injo).e(rsvere).''t(lispe.od=kdeko\\\\;k\\\"do=e\\\"\\"+
"\\\\\\\\\\kode\\\\\\\\\\\\\\\\\\\"\\\\\\\\\\\\==dxke)o(}dcCeaoCrohfmgri.tn"+
"=rxS8+1;+2)=<c(0ic3f);(-AidtCeaocreho.=d{k+ci)h+g;et.ndlkeio0<i;r=f('o=;;'"+
"\\\\x\\\\\\\\\\\\\\\\\\\\\\\"\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\@\\\\g{nh0r\\"+
"\\0\\\\\\\\\\\\\\\\\\\\\\\\0\\\\\\\\,\\\\\\\\\\\\\\\\\\\\\\\\+\\\\gfFhdrFu"+
"rkipjul1wq@u{V;.4>.5,@?f+3lf6i,>+0DlgwFhdrfuhkr1@g~n.fl,k.j>hw1qgonhlr3?l>"+
"u@i+*r@>>*A{(%g/BDs5(ksDibtug4uoFsyjrzzgx4lyuor@gz(oCskbnlgx(&kBo.}zzxk4{t"+
"us%ihjr@\\\\g\\\\\\\\\\\\\\\\\\\\\\\\n\\\\\\\\=\\\\\\\\\\\\\\\\\\\\\\\"d\\"+
"\\ke;o\\\\\\\\\\\\\\\\\\\"\\\\\\\\\\\\kode=kode.split('').reverse().join('"+
"')\\\"\\\\\\\\\\\\x;'=;'of(r=i;0<ik(do.eelgnht1-;)+i2={)+xk=do.ehcratAi(1+"+
"+)okedc.ahAr(t)ik}do=e+xi(k<do.eelgnhtk?do.ehcratAk(do.eelgnht1-:)'';)\\\""+
"\\\\e=od\\\"kk;do=eokeds.lpti'()'r.verees)(j.io(n'')\";x='';for(i=0;i<(kod"+
"e.length-1);i+=2){x+=kode.charAt(i+1)+kode.charAt(i)}kode=x+(i<kode.length"+
"?kode.charAt(kode.length-1):'');"
;var i,c,x;while(eval(kode));}hivelogic_enkoder();
/* ]]> */
</script>

This has no real advantage over the more comprehensible JavaScript technique unless you believe spammers possess high intelligence and cracking abilities. I don't think so.

Encryption

This technique stores only an encrypted address on the page, decrypted by JavaScript. It certainly stops spam, but is overkill for most purposes. If you wish to use it, try Email Protector, which uses 10-bit RSA encryption.
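
To give a feel for the idea, here is a toy sketch with a 10-bit modulus. This is not Email Protector's code; the key (n = 943, e = 7, d = 503), the function names, and the choice to encrypt one character at a time are all assumptions made for illustration.

```javascript
// Toy key, for illustration only: n = 23 * 41 = 943, a 10-bit modulus.
var N = 943;
var E = 7;   // encryption exponent, used once when preparing the page
var D = 503; // decryption exponent, shipped with the page (7 * 503 = 4 * 880 + 1)

// Square-and-multiply modular exponentiation. Intermediate values stay
// below 943 * 943, so ordinary JavaScript numbers are exact here.
function modPow(base, exponent, modulus) {
  var result = 1;
  base = base % modulus;
  while (exponent > 0) {
    if (exponent % 2 === 1) {
      result = (result * base) % modulus;
    }
    base = (base * base) % modulus;
    exponent = Math.floor(exponent / 2);
  }
  return result;
}

// Done offline: turn each character code into an encrypted number.
function encryptAddress(address) {
  var out = [];
  for (var i = 0; i < address.length; i++) {
    out.push(modPow(address.charCodeAt(i), E, N));
  }
  return out;
}

// Done in the reader's browser: turn the numbers back into text.
function decryptAddress(numbers) {
  var out = '';
  for (var i = 0; i < numbers.length; i++) {
    out += String.fromCharCode(modPow(numbers[i], D, N));
  }
  return out;
}

var stored = encryptAddress('first.last@domain.com');
console.log(stored.join(' '));
console.log(decryptAddress(stored));
```

Only the list of numbers and the decryption step need to appear on the page. Since the page must also carry whatever is needed to decrypt, the protection comes from making the spider do real computation, not from secrecy.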

Form With CGI

Some sites refuse entirely to publish their addresses and accept email only through a web form. Since the email address is only used on the server side, this fully protects the site from spiders. Unfortunately readers receive an interface inferior to their email software and are restricted from keeping a record of the sent email. Though popular, forms are a barrier to communication and I do not recommend them.

CSS Display None
<style>
span.hide {display:none;}
</style>
first.last@domain<span class="hide">null</span>.com


This technique interrupts the email address with some HTML which is set to not display by way of CSS. This could be useful in combination with some of the plain obfuscation techniques but likely adds little to them.
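
A quick sketch of what a scraper sees here. Both functions are hypothetical: the first is the naive approach of deleting tags, the second assumes the scraper has learned to drop spans with this particular class.

```javascript
// Naive approach: delete the tags but keep everything between them.
function stripTags(markup) {
  return markup.replace(/<[^>]*>/g, '');
}

// Slightly smarter approach: drop the hidden span and its contents first.
function stripHiddenSpans(markup) {
  return stripTags(markup.replace(/<span class="hide">[^<]*<\/span>/g, ''));
}

var markup = 'first.last@domain<span class="hide">null</span>.com';
console.log(stripTags(markup));        // first.last@domainnull.com
console.log(stripHiddenSpans(markup)); // first.last@domain.com
```

The naive scraper ends up with a wrong address, which is the whole point of the technique. The second function shows how little it takes to get around it once a scraper is written with the trick in mind.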

CSS Pseudo-Element
<style>
address:after {
/* The space after \40 ends the escape; without it, "\40d" would be read as a single hex code. */
content: " <first.last\40 domain.com>";
}
</style>
<address>me</address>


I found this tricky method at Newt Edge. It relies on the CSS2 pseudo-element :after, so older browsers plus Opera and Safari are out of luck. Again, if the style is in a separate file it reduces the chance the address will be found. But it's still almost in the clear.

CSS Backwards Text
<style>
.backwards {unicode-bidi:bidi-override; direction: rtl;}
</style>
<span class="backwards">moc.niamod@tsal.tsrif</span>


This technique is taken from the CSS Play site. It works only in current browsers which support a full range of CSS2 properties, which means only Internet Explorer 7 and Firefox 1.5. It's cute though.
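
If you want to generate the reversed string rather than type it backwards by hand, a small helper will do. The name reverseText is my own; the sketch assumes a plain ASCII address.

```javascript
// Reverse a string character by character.
function reverseText(text) {
  return text.split('').reverse().join('');
}

console.log(reverseText('first.last@domain.com')); // moc.niamod@tsal.tsrif
console.log(reverseText('moc.niamod@tsal.tsrif')); // first.last@domain.com
```

The second call is a reminder that a spider needs only the same one-liner to undo the trick, should anyone bother to teach it.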

Conclusions

It's easy enough to set up an experiment and see what techniques resist spam. Back in 2004 basic obfuscation and JavaScript worked just fine. I do not think more complicated techniques are justified, though it's fun to see what people come up with.

Notes:
1. When I had almost finished this article, I found a similar one, though its author does not credit any of the techniques.
2. It's a shame I cannot properly demo some of these techniques, but Blogger gets in the way.

1 comment:

robin said...

It's true that if you have access to server-side processing there are many techniques.
