Fighting foreign language spam

The single largest class of spam that reaches my inbox is foreign-language spam.  It gets through every one of the anti-spam layers I use:

  1. sendmail
  2. SpamCop
  3. Spamhaus
  4. MailScanner -> SpamAssassin / ClamAV
  5. Procmail
  6. Mac OS X Mail.app Junk Filtering
  7. Custom Mail.app Rules

I have no idea why this is such a hard problem for these anti-spam tools.  I should be able to say that any message in a language I cannot read should be marked as spam.

The problem comes in deciding where to apply this check.  The list of languages I can read is not the same as that of other users with accounts on the mail server.  So, this has to be a per-user configuration option.

In a past life, I was a campus email relay administrator.  Our organization used a Proofpoint spam appliance, hooked into sendmail through the milter interface.  The Proofpoint device had user-specific preferences for dealing with spam.  You could choose which spam profile to use: quarantine all spam, tag and forward, or discard it entirely.  You also had control of personal blacklists and whitelists.  There should be another field called “languages you read”.  I would select only “English”, since I do not read Korean, French or Russian, and the Proofpoint would then discard messages in those languages.
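As a rough sketch of that “languages you read” preference (not anything Proofpoint actually offers), the per-user decision could look something like the Python below.  Everything in it is an assumption made for illustration: the `UserPrefs` class, the `PREFS` table, the user names, and the idea that some earlier step has already guessed the message's language as a two-letter code.

```python
from dataclasses import dataclass, field


@dataclass
class UserPrefs:
    # Hypothetical per-user setting: languages this user can read.
    readable_languages: set = field(default_factory=lambda: {"en"})


# Hypothetical preference table keyed by local account name.
PREFS = {
    "alice": UserPrefs({"en"}),
    "bob": UserPrefs({"en", "fr"}),
}


def should_mark_spam(user, detected_language, prefs=PREFS):
    """Return True when the message language is one the recipient cannot read.

    detected_language is assumed to come from some earlier detection step;
    None means that step could not decide.
    """
    user_prefs = prefs.get(user)
    if user_prefs is None or detected_language is None:
        # No preference on file, or no verdict: leave the message alone.
        return False
    return detected_language not in user_prefs.readable_languages
```

With this table, a French message would be marked for alice but delivered normally to bob, which is the per-user behavior described above.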

Currently, some anti-spam systems implement a subset of this wish.  They determine the “language” of the email from its character set encoding.  This works fine for telling western (Latin) languages from non-western ones, but it does nothing to throw out the French spam I get.  SpamAssassin currently does this with its “ok_locales” user preference.  By default, it is set to “all”.  I have set it to “en”, which most Latin-based languages fall into.  The problem with this implementation is that it only looks at the headers, which makes it fast, but it will not work for spam that declares its charset inside the parts of a multipart/mixed message.  Not exactly what I wanted.
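To illustrate the difference between checking only the headers and checking every MIME part, here is a minimal sketch using Python's standard `email` package.  It is not how SpamAssassin works internally; the function names and the `OK_CHARSETS` set are assumptions chosen for the example, and a charset is only a coarse hint about language.

```python
import email
from email.header import decode_header

# Assumed set of charsets a reader of English would accept (illustrative only).
OK_CHARSETS = {"us-ascii", "iso-8859-1", "windows-1252", "utf-8"}


def charsets_in_message(msg):
    """Collect charsets declared on every MIME part and in encoded headers."""
    found = set()
    # walk() visits the top-level message and every nested part.
    for part in msg.walk():
        charset = part.get_content_charset()
        if charset:
            found.add(charset.lower())
    # Encoded words in headers (=?charset?B?...?=) also name a charset.
    for name in ("Subject", "From"):
        raw = msg.get(name)
        if raw:
            for _text, charset in decode_header(raw):
                if charset:
                    found.add(charset.lower())
    return found


def unexpected_charsets(msg, ok=OK_CHARSETS):
    """Return the declared charsets that are not in the accepted set."""
    return charsets_in_message(msg) - ok
```

A message built with `email.message_from_string()` whose top-level headers look harmless but whose second part declares `charset="EUC-KR"` would come back from `unexpected_charsets()` with `euc-kr` in the result, which a header-only check would miss.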