Spam Stats

Explanations are further down.

The Current Plots

Hourly spam plot Hourly cumulative spam plot
Weekly spam plot Weekly cumulative spam plot
Monthly spam plot Monthly cumulative spam plot


I get a fair amount of spam here at the base. Not a huge amount from what I hear from others, but a fair amount. The workstations here spend some of their time detecting and deleting it. The last time I reconfigured everything to chase spam away, I added hooks to track how well they were doing. Specifically, every time a piece of mail is actively detected as spam or is manually deleted as spam, I now keep track of it. I say “actively detected” because a fair amount of spam drops through to my spam bucket because it's simply not addressed to me; those messages are not counted.

Reading the Plots

There are 6 plots generated. Two of them are generated hourly and the rest daily. The hourly plots are a bar chart of the number of e-mail messages detected as spam each hour and a cumulative plot of the total number of spam messages detected up to that point.

Each plot is also broken out by detection mechanism. Here on the base, there are 2 detection mechanisms: spamassassin and me. The counts of spam that spamassassin catches are plotted as “automatic” and mine as “manual.” Spamassassin does much better than I do, but that's why it gets the big bucks. But the “manual” plot is a fair annoyance metric.

Each bar in the bar chart represents the number of messages detected in the hour starting with the plotted value. The bar over 1 is the messages detected between 1:00 and 1:59 AM. The bars stack on top of each other, so the total number detected is the combined height of the stacked bars. This can make determining the exact number of manually deleted spam messages tricky, but I think the proportions are more interesting.
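For the curious, here is a minimal Python sketch of that hourly binning. It is not the Perl that actually runs here. It assumes each detection is available as a (timestamp, mechanism) pair, which is a guess at the record shape, and the names `hourly_counts` and `detections` are made up for illustration.

```python
from collections import Counter
from datetime import datetime

def hourly_counts(detections):
    """Count detections per hour of day, split by mechanism.

    `detections` is an iterable of (datetime, mechanism) pairs, where
    mechanism is "automatic" or "manual" (an assumed record shape).
    Returns a dict mapping hour -> Counter of mechanism -> count.
    """
    bins = {hour: Counter() for hour in range(24)}
    for when, mechanism in detections:
        # A message seen from 1:00 through 1:59 lands in the bin for hour 1.
        bins[when.hour][mechanism] += 1
    return bins

# Made-up sample data, purely for illustration.
detections = [
    (datetime(2024, 1, 1, 1, 5), "automatic"),
    (datetime(2024, 1, 1, 1, 40), "automatic"),
    (datetime(2024, 1, 1, 1, 59), "manual"),
    (datetime(2024, 1, 1, 2, 0), "automatic"),
]
bins = hourly_counts(detections)

# The stacked bar over 1 has total height automatic + manual.
total_over_1 = sum(bins[1].values())
```

With this sample data the bar over 1 stacks two automatic detections and one manual one, so its total height is three.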

Cumulative plots show a single point for each class of deletions, plus a point for the running total. When no spam has been manually deleted, the total point will cover the automatic one. As with the bar chart, points are labeled by hour: the point plotted above 2 on the cumulative plot represents the number of messages detected from midnight until 2:59 AM.
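The running totals are easy to sketch in Python (again, a stand-in for the real Perl, with invented counts). The lists `automatic` and `manual` are hypothetical per-hour counts starting at midnight.

```python
from itertools import accumulate

def cumulative(per_hour):
    """Running totals: entry h is the count from midnight through hour h."""
    return list(accumulate(per_hour))

# Made-up per-hour counts for hours 0 through 3.
automatic = [4, 2, 3, 0]
manual = [0, 0, 1, 0]
total = [a + m for a, m in zip(automatic, manual)]

cum_automatic = cumulative(automatic)
cum_manual = cumulative(manual)
cum_total = cumulative(total)
```

In this example nothing is deleted by hand during hours 0 and 1, so `cum_total` and `cum_automatic` agree there and the total point would hide the automatic one. They separate at hour 2, when the first manual deletion shows up.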

The other four plots summarize the number of spam messages detected each day. Again there are cumulative and single-day summaries, broken out by detection mechanism.

Generating the Plots

In principle, generating the plots is easy. In practice, a fair number of systems are brought to bear.

For each message detected as spam, its Subject: line is put into a simple text file, one for spamassassin and one for me. The spamassassin file is actually maintained by the procmail mail filtering program. Procmail does a lot for me, including picking my football pool when I don't get to it. Because it's already putting messages that spamassassin has identified as spam aside, it's easy to add a step that puts a copy of the subject into a file.

I read mail using mutt, and I have a macro to pass a message to spamassassin as spam (so spamassassin can spot similar messages in the future) and delete the message. I've added a line to copy the subject out as well. When I do see a spam message that's gotten through, one keypress teaches spamassassin how to spot it, records my deletion, and deletes it.
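The recording step itself is tiny. Here at the base it lives in a procmail recipe and a mutt macro, neither of which is reproduced on this page, so what follows is only a Python sketch of the same idea: pull the Subject: header out of a message and append it to a per-mechanism text file. The function names and the log path are hypothetical.

```python
import email

def subject_of(raw_message):
    """Return the Subject: header of a mail message, or "" if absent."""
    msg = email.message_from_string(raw_message)
    # Collapse folded (multi-line) headers so each log entry stays on one line.
    return " ".join((msg.get("Subject") or "").split())

def log_subject(raw_message, log_path):
    """Append the message's subject to the log file, one subject per line."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(subject_of(raw_message) + "\n")
```

A filter could call `log_subject(message_text, "automatic.log")` for mail spamassassin flags, and a mail-reader macro could do the same with a second file for manual deletions. Those file names are placeholders, not the ones used here.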

Once all that data is recorded, plotting is the next step. Plots are generated from the files created above using perl and hometown boy grap. They get some help from the netpbm suite of tools for the image conversions.

There are no thrilling breakthroughs in the scripting. The data's analyzed in perl and a grap script is generated. The perl script calls grap, groff and the netpbm tools to create the plots. Another perl script puts links to the most recent versions of the plots into this page after any new plots are generated. All of this is coordinated through cron. The most intensive work is the conversion from PostScript (grap/groff's usual output format) into PNG.
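To give a flavor of the "generate a grap script" step, here is a minimal Python sketch that writes a small grap program for one line of points. The real generator is Perl and is not shown on this page, so the function name, the labels, and the layout of the output are all assumptions. It uses only basic grap statements (`label`, `draw`, and bare "x y" data lines between `.G1` and `.G2`).

```python
def grap_script(points, x_label="hour", y_label="messages"):
    """Build a small grap program that draws one solid line through `points`.

    `points` is a sequence of (x, y) number pairs. The labels are
    illustrative defaults, not the ones used on the real plots.
    """
    lines = [
        ".G1",
        f'label bottom "{x_label}"',
        f'label left "{y_label}"',
        "draw solid",
    ]
    # In grap, a bare line of numbers "x y" plots a point.
    lines += [f"{x} {y}" for x, y in points]
    lines.append(".G2")
    return "\n".join(lines) + "\n"

# Made-up cumulative counts for hours 0 through 2.
script = grap_script([(0, 4), (1, 6), (2, 10)])
```

The resulting text would then be handed to grap and groff to produce PostScript, and on to the netpbm tools for conversion, as described above. The exact command lines are left out here because they are not given on this page.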

This page written and maintained by Ted Faber
Please mail me any problems with, or comments about this page.
PGP Public Keys