Every now and again I get sufficiently annoyed by spam to want to do something about it.
Today was one of those days where I had enough time to sit down and work on my plan for dealing with spam.
My email setup is a little unusual, so the plan became a bit complicated.
All my incoming email is processed on my mail server; then one copy is forwarded to gmail, the other to my computer at home.
I then read all my email via gmail, and keep the copy at home just for backup purposes.
I’ve been doing this for a while, but found the manual task of scanning through spam in gmail to be tiresome and annoying.
What I wanted to do was cut out a lot of the spam on my server, before it was even passed on to gmail.
I figured I could script a better spam-filtering solution than gmail’s system.
My plan was to install bogofilter on the mail server and use that to filter out a lot of the spam.
There were two problems with this. One was that I needed a body of spam and ham emails to train it with, and I didn’t keep email on the server.
The other was that it would need constant training as new emails came in. This was tricky, as emails were passed on to gmail and my home computer in an automated process with no scope for human intervention.
To deal with the first problem I decided to install bogofilter at home too and train it there, then upload the training database to the mail server.
For the second problem I came up with the following solution:
I would use bogofilter on the mail server, sending anything flagged as non-spam on to gmail and home, and anything classed as spam just to home.
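The server-side step could be sketched roughly like this. It relies on bogofilter's documented exit codes (0 for spam, 1 for non-spam, 2 for unsure, 3 for errors). The two addresses are hypothetical placeholders, and sending unsure or failed classifications to both destinations is an assumption of this sketch rather than something the plan spells out.

```python
import subprocess

# bogofilter's documented exit codes: 0 = spam, 1 = non-spam, 2 = unsure, 3 = error.
SPAM = 0

# Hypothetical forwarding addresses; substitute the real ones.
GMAIL = "me@gmail.example"
HOME = "me@home.example"


def classify(raw_message):
    """Pipe one message (bytes) into bogofilter and return its exit code."""
    return subprocess.run(["bogofilter"], input=raw_message).returncode


def destinations(exit_code):
    """Return the addresses a message should be forwarded to."""
    if exit_code == SPAM:
        # Spam only goes home, where it is classified a second time.
        return [HOME]
    # Assumption: non-spam, unsure and errors all go to both destinations,
    # so a doubtful message still reaches gmail.
    return [GMAIL, HOME]
```

A delivery hook on the server would call `destinations(classify(message))` and forward the message to each address returned.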
Once it got home, it would be passed through bogofilter a second time. This instance would be set up slightly differently to the first one: it would classify emails into one of three folders (ham, spam or unsure).
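A minimal sketch of the three-folder step at home might look like the following. It assumes the folders are mbox files inside a single mail directory (the actual folder format isn't stated here), and the function names are made up for illustration. The exit code is the one bogofilter returns for the message.

```python
import mailbox
import os

# bogofilter exit codes: 0 = spam, 1 = non-spam, 2 = unsure.
FOLDERS = {0: "spam", 1: "ham", 2: "unsure"}


def folder_for(exit_code):
    """Map a bogofilter exit code to a folder name; errors land in "unsure"."""
    return FOLDERS.get(exit_code, "unsure")


def deliver(raw_message, exit_code, mail_dir):
    """Append the message (bytes) to the mbox for its class and return the path."""
    path = os.path.join(mail_dir, folder_for(exit_code))
    box = mailbox.mbox(path)  # created on first use
    try:
        box.lock()
        box.add(raw_message)
    finally:
        box.close()  # flushes to disk and releases the lock
    return path
```

Treating an error code as "unsure" means a failed classification ends up in the folder that gets looked at by hand, rather than being silently filed as spam.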
I would then use mutt to periodically re-train bogofilter telling it that anything in the “unsure” folder was either ham or spam.
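The retraining step boils down to feeding each "unsure" message back to bogofilter with a verdict. Here is a small sketch, assuming the message arrives as bytes (for example, piped out of mutt). It uses bogofilter's `-s` (register as spam) and `-n` (register as non-spam) options; the function names are hypothetical.

```python
import subprocess

# bogofilter registration flags: -s registers a message as spam, -n as non-spam.
TRAIN_FLAGS = {"spam": "-s", "ham": "-n"}


def train_command(verdict):
    """Build the bogofilter command that registers one message under a verdict."""
    if verdict not in TRAIN_FLAGS:
        raise ValueError("verdict must be 'spam' or 'ham'")
    return ["bogofilter", TRAIN_FLAGS[verdict]]


def train(raw_message, verdict):
    """Feed one message (bytes) to bogofilter on stdin; return its exit code.

    An exit code of 3 signals an I/O or other error.
    """
    return subprocess.run(train_command(verdict), input=raw_message).returncode
```

Keeping the command construction separate from the call makes it easy to check what would be run without touching the real training database.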
Finally, the newly trained database would be copied back up to the mail server each night.
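The nightly copy could be as simple as an `scp` run from cron. This sketch assumes bogofilter's default database location (`~/.bogofilter/wordlist.db`) on both machines and a hypothetical host name for the server; adjust both to match the real setup.

```python
import os
import subprocess

# Assumptions: bogofilter's default database path, and a made-up server name.
WORDLIST = os.path.expanduser("~/.bogofilter/wordlist.db")
REMOTE = "mailserver.example:.bogofilter/wordlist.db"


def upload_command(local=WORDLIST, remote=REMOTE):
    """Build the scp command that copies the trained database to the server."""
    return ["scp", "-q", local, remote]


def upload():
    """Run the copy; raises CalledProcessError if scp reports a failure."""
    subprocess.run(upload_command(), check=True)
```

One thing to watch for is the server reading the database while it is being overwritten; copying to a temporary name and renaming it afterwards would be a safer variant.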
The more astute reader may have noticed a problem with this solution.
I seem to have replaced scanning a folder full of spam on gmail with scanning a folder full of spam at home.
This is true; initially I will be dealing with the same amount of spam.
However, I have a longer term plan here.
Once I’m happy with the filtering I’m going to tweak my solution so that anything tagged as spam will be deleted outright once it hits my home PC.
This, I expect, will reduce the amount of spam that I see by about 90%.
I am comfortable with the fact that I will probably lose the occasional non-spam email.
I’m gonna run this system for about a month; if I get through that with zero false positives, I’ll feel brave enough to set it to delete.