Sunday, July 10, 2005

Proposal For An Anti-Spam System That Works

This article was first posted on the web in June 2003.

It is obvious that most current anti-spam mechanisms fall woefully short of doing a good job. Simplistic mechanisms are based on whitelists (addresses you want to receive from), blacklists (those you want to block), and string detection (find bad strings in subject or body). These methods do not, in themselves, work effectively.

It is impossible to recognise all of the spam and, even worse, valid emails (sometimes amusingly called "ham") may be blocked. This is the problem of "false positives".

There are two main areas of development on the anti-spam front. The first uses Bayesian Filtering, which is a more intelligent probabilistic method of determining spam from content strings. PopF and SpamBayes implement this concept in Python. They are both designed as POP3 and SMTP proxies. This allows client-side as well as server-side installation, so that as many people as possible can use the services. On the other hand, tools that require server access do not encourage widespread adoption.
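To make the idea concrete, here is a minimal sketch of a word-based naive Bayes classifier. It is not how PopF or SpamBayes are implemented; the class name `NaiveBayesFilter`, the tokenising rule, and the add-one smoothing are all assumptions chosen for brevity.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9$'!]+", text.lower())


class NaiveBayesFilter:
    """Toy spam/ham classifier (hypothetical, for illustration only)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record the tokens of one message known to be 'spam' or 'ham'."""
        self.counts[label].update(tokenize(text))
        self.messages[label] += 1

    def spam_probability(self, text):
        """Return an estimate of P(spam | text) between 0 and 1."""
        vocab = set(self.counts["spam"]) | set(self.counts["ham"])
        total_messages = sum(self.messages.values())
        log_score = {}
        for label in ("spam", "ham"):
            # Prior: smoothed share of training messages with this label.
            score = math.log((self.messages[label] + 1) / (total_messages + 2))
            total_tokens = sum(self.counts[label].values())
            for token in tokenize(text):
                # Add-one smoothing so unseen tokens never give a zero probability.
                seen = self.counts[label][token] + 1
                score += math.log(seen / (total_tokens + len(vocab) + 1))
            log_score[label] = score
        diff = log_score["ham"] - log_score["spam"]
        if diff > 700:  # avoid overflow in math.exp for extreme scores
            return 0.0
        return 1.0 / (1.0 + math.exp(diff))
```

After training on a handful of labelled messages, `spam_probability("buy cheap pills")` returns a value near 1 when those words have mostly appeared in spam. A real filter needs far more training data and a better tokeniser.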

While I think Bayesian Filtering is a step forward, I am not sold on this technique, because the spam blocker will always be playing catch-up with spam senders as they get more and more clever about how they contact you. Nor will this method stop apparently benign messages, which still amount to a Denial of Service through the time wasted reading and deleting them.

A more robust method uses an authorisation protocol. Simply: I cannot send an email to you unless I have permission. I must get authorised as a valid sender. This thwarts spammers since they almost never use valid return email addresses. Even if they do, they will not take the time to manually send an authorisation email. Their entire business model is based on sending out millions of messages quickly and freely, in an entirely automated fashion. Any manual intervention kills their profit margin.

Some have objected that this is too much work simply to send a message, that it puts too much onus on the valid mail sender. But this form of authorisation is the same as what happens when I try to sign up for a mailing list, or a web site. I get back an email telling me how to complete authorisation for that resource. I only need to go through this process once, then I'm recognised and in the clear. Widespread adoption requires only that people think of email exchanges in the same way.

Besides, as we shall see, if I am already known to the recipient I can be pre-authorised and will never even know that they have such an anti-spam mechanism in place.

Before I realised that others had thought of this method as well, I designed a procedure as a first step towards implementing software.

Process 1: when I send a message...
  • recipient is automatically added to whitelist
Process 2: through an interface I can manually...
  • edit whitelist or blacklist by individual, by domain, or list
  • configure whether blocked messages should be quarantined or discarded
  • check quarantined messages, delete them, or forward them to my POP box
  • edit questionnaire
  • change questionnaire responses required for authorisation
Process 3: a daemon running periodically...
  • checks each pending item in the database, and if too much time has elapsed without a response, adds the sender to the blacklist
Process 4: when a message is received...
  • process it using the following pseudocode:
    is sender on blacklist:
        kill message
    else:
        is sender same as user:
            is checksum present:
                send message on to recipient
            else:
                trash message
        else:
            is checksum in subject:
                decode checksum based on sender
                check ID against database
                if ok:
                    send message on to recipient
                else:
                    report back error to sender
            else:
                is sender on whitelist:
                    send message on to recipient
                else:
                    does mail look like mailing list, bounce, or auto-response:
                        trash message
                    else:
                        generate checksum based on sender & unique incident ID
                        store database entry with message, keyed by ID
                        add checksum to subject
                        send message back to sender with questionnaire in body
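
As a rough illustration, Processes 1, 3 and 4 might be sketched in Python as follows. Everything named here is an assumption rather than part of the original design: the `Gatekeeper` class, the `[auth:ID:checksum]` subject tag, the HMAC-SHA256 checksum, the one-week timeout, and the choice to whitelist a sender once they confirm. The questionnaire itself and real mail delivery are left out; `on_receive` only returns the action to take.

```python
import hashlib
import hmac
import re
import time
import uuid

SECRET = b"change-me"          # hypothetical per-user secret key
PENDING_TTL = 7 * 24 * 3600    # assumed timeout: one week, in seconds
TAG = re.compile(r"\[auth:([0-9a-f]{32}):([0-9a-f]{16})\]")


def make_tag(sender, incident_id):
    """Build the subject tag: incident ID plus a keyed checksum over sender and ID."""
    data = f"{sender.lower()}:{incident_id}".encode()
    mac = hmac.new(SECRET, data, hashlib.sha256).hexdigest()[:16]
    return f"[auth:{incident_id}:{mac}]"


class Gatekeeper:
    def __init__(self, user):
        self.user = user.lower()
        self.whitelist = set()
        self.blacklist = set()
        self.pending = {}  # incident ID -> (sender, original subject, time stored)

    def on_send(self, recipient):
        """Process 1: anyone I write to is whitelisted automatically."""
        self.whitelist.add(recipient.lower())

    def expire(self, now=None):
        """Process 3: blacklist senders who never answered their challenge."""
        now = time.time() if now is None else now
        for incident_id, (sender, _subject, stored) in list(self.pending.items()):
            if now - stored > PENDING_TTL:
                if sender not in self.whitelist:
                    self.blacklist.add(sender)
                del self.pending[incident_id]

    def on_receive(self, sender, subject, automated=False, now=None):
        """Process 4: return (action, subject) for one incoming message."""
        sender = sender.lower()
        if sender in self.blacklist:
            return ("kill", subject)
        match = TAG.search(subject)
        # The tag is valid only if its checksum matches this sender and ID.
        valid = match is not None and hmac.compare_digest(
            match.group(0), make_tag(sender, match.group(1)))
        if sender == self.user:
            # Mail claiming to come from me must carry a valid checksum.
            return ("deliver" if valid else "trash", subject)
        if match is not None:
            entry = self.pending.get(match.group(1))
            if valid and entry is not None and entry[0] == sender:
                del self.pending[match.group(1)]
                self.whitelist.add(sender)  # assumption: confirm once, then trusted
                return ("deliver", entry[1])  # release the stored original
            return ("error", subject)
        if sender in self.whitelist:
            return ("deliver", subject)
        if automated:  # mailing list, bounce, or auto-response
            return ("trash", subject)
        incident_id = uuid.uuid4().hex
        stored = time.time() if now is None else now
        self.pending[incident_id] = (sender, subject, stored)
        return ("challenge", f"{subject} {make_tag(sender, incident_id)}")
```

A stranger's first message yields `("challenge", ...)` with the tagged subject. A reply that quotes that subject from the same address yields `("deliver", ...)`, while the same tag sent from a different address yields `("error", ...)` because the checksum depends on the sender.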

TMDA is a robust server-side system, written in Python, that implements this authorisation method, with a number of enhancements. But it works only on UNIX and is far from easy for a novice to understand. Also, because it is server-side, only those people with access to their own server (or with a willing ISP) can implement it. Certainly there are good reasons for such software to be active on the server, the primary one being that authorisation requests can then be returned instantly after getting an email. If the software were implemented as a POP proxy, these replies would be sent only when you check your mail. This sort of delay makes the system cumbersome and less likely to be used.

However, I still believe there is a case for a client-side implementation, especially given the popularity of persistent broadband connections to the Internet. A client proxy could periodically (every 15 minutes) contact the ISP and get new messages, processing all authorisation requests. This would not place an unreasonable burden on any of the resources of the system: the client, the server's POP, or the sender's patience. Better yet, such an application could actually be used by the vast majority of people who do not have control over their server's email setup.
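
A minimal sketch of such a polling loop, using Python's standard `poplib` and `email` modules, might look like this. The host name, credentials, the `handle` callback and the `max_polls` parameter are all hypothetical. The sketch also re-downloads every message on each poll; a real proxy would delete messages or track which ones it has already seen.

```python
import email
import poplib
import time

POLL_SECONDS = 15 * 60  # the 15-minute interval suggested above


def connect():
    """Open and log in to a POP3 session (hypothetical host and credentials)."""
    conn = poplib.POP3_SSL("pop.example.org")
    conn.user("me")
    conn.pass_("secret")
    return conn


def fetch_messages(conn):
    """Download every message currently in the mailbox of an open session."""
    count, _size = conn.stat()
    messages = []
    for number in range(1, count + 1):  # POP3 numbers messages from 1
        _response, lines, _octets = conn.retr(number)
        messages.append(email.message_from_bytes(b"\r\n".join(lines)))
    return messages


def poll(connect_fn, handle, interval=POLL_SECONDS, max_polls=None,
         sleep=time.sleep):
    """Fetch mail every `interval` seconds and pass each message to `handle`.

    Runs forever unless `max_polls` is given.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        conn = connect_fn()
        try:
            for message in fetch_messages(conn):
                handle(message)
        finally:
            conn.quit()  # always close the session, even if handling fails
        polls += 1
        sleep(interval)
```

Calling `poll(connect, handle)` would then run the authorisation logic on each new message, with `handle` being whatever function applies the whitelist, blacklist and challenge rules.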

To summarise, what is required is an anti-spam tool that:
  • implements an authorisation process like that outlined here
  • works across all major platforms
  • works as a client-side proxy
  • is open source
  • is preferably written in Python
