I want to start off by saying that I have not done thorough testing on these tweaks. The performance gains are only what I observed while running spamassassin in debug mode against a known spam message (cat known-spam.msg | spamassassin -D).
I will try to explain the configuration options as best I can. Most of them are explained in good detail in the SpamAssassin docs, but a default install doesn't always give you all of these settings.
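The test run described above can be wrapped in a small shell function so the same message is timed the same way after each config change. This is only a rough sketch: `time_scan`, the `SCANNER` override and the `sa-debug.log` file name are names made up for illustration, and `date +%s` only resolves whole seconds.

```shell
# Time a single scan of a saved message and keep the debug log.
# Usage: time_scan MESSAGE [LOGFILE]
# SCANNER is a hypothetical override so another command can be swapped in;
# it defaults to the invocation used in this article.
time_scan() {
    msg=$1
    log=${2:-sa-debug.log}
    start=$(date +%s)
    # -D writes debug output to stderr; the rewritten message on stdout is discarded.
    ${SCANNER:-spamassassin -D} < "$msg" > /dev/null 2> "$log"
    end=$(date +%s)
    # Elapsed wall-clock time in whole seconds.
    echo "$((end - start))"
}
```

Calling `time_scan known-spam.msg` prints the elapsed seconds and leaves the debug output in sa-debug.log. For sub-second numbers, putting `time` in front of the plain `spamassassin -D < known-spam.msg` command is probably the simpler route.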
First off, we use Bayes and auto_learn.
use_bayes 1
Bayes just makes sense, even if it is server-wide. You would be surprised at how accurate it is. You have to wait for the database to be seeded (200 ham and 200 spam messages), but once it has a good starting database you will be all set.
auto_learn 1
Auto_learn is great because it consults the auto_learn repository early on in the scan process, so some of our quickest scans normally come from a hit on auto_learn (sometimes in the 0.4-second range).
use_pyzor 1
Previously I had said that I wasn't using pyzor because of some performance problems we had seen in testing. We recently realized that pyzor does in fact work properly, and that our test message was the reason it seemed to be failing. Apparently there is a bug in the pyzor mail routine that makes pyzor crash with an error on certain formats of mail messages. As a weird coincidence, we were using one of those messages as our test message for debugging spamassassin's performance. We have since turned pyzor back on for our scanning needs.
As an aside with regard to both razor and pyzor, you should make sure that they are hitting the correct servers. Pyzor recently moved to different IPs, so you should occasionally run `pyzor discover` to get the new server list. Razor has an equivalent: `razor-admin -discover`.
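If you want to run both discovery commands on a schedule, a wrapper along these lines might do. Treat it as a sketch: `refresh_servers` is a made-up name, and it assumes you run it as the same user that spamd scans as, since both tools normally keep their server lists under that user's home directory.

```shell
# Refresh the pyzor and razor server lists, skipping a tool that is not installed.
refresh_servers() {
    status=0
    for cmd in "pyzor discover" "razor-admin -discover"; do
        tool=${cmd%% *}   # first word of the command line
        if command -v "$tool" > /dev/null 2>&1; then
            # Word splitting on $cmd is intentional: tool name plus one argument.
            $cmd || status=1
        else
            echo "skipping $tool: not found in PATH" >&2
        fi
    done
    return $status
}
```

The function returns non-zero if either discovery run fails, so a cron job calling it can mail you about the failure.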
dns_available yes
This setting is really funny. SpamAssassin actually does an MX lookup on 13 "major" sites on the internet (google, yahoo, etc.), and if a response is received it decides that DNS is available for doing additional lookups. Setting this to "yes" skips that check and simply assumes that DNS is available. We run a dedicated DNS machine in the same VLAN as the spamassassin server, so if they can't talk to each other, I have bigger problems to deal with.
dcc_dccifd_path /var/dcc/dccifd
This directive tells SpamAssassin where to find the socket file for communicating with the dccifd daemon. The daemon talks to the DCC servers and also reports high-scoring spam to the DCC network. I am a stickler about forking processes and the overhead that creates, so I try to avoid it wherever I can. Plus we run our own local DCC server, so that helps a lot too. (And no, that doesn't help our performance as much as you might think: even though our server is local, it can't handle all the DCC requests, and some end up being sent to anonymous DCC servers anyway.) With this set, SpamAssassin talks to dccifd over the socket instead of running dccproc as a separate process.
rbl_timeout 10
This setting is a little funky. Basically it works as an average to determine when RBL responses should be considered late. It works on a sliding scale based on the total number of queries left to complete and the timeout value provided. The default is 15 seconds. I changed it to 10 and wasn't sure I saw any performance increase, but I also didn't notice any decrease. I think this mainly helps when an RBL isn't reachable.
razor_timeout 5
Default is 10 seconds. I set this to 5 because I would rather have a message skip the razor check than have it passed through entirely unscanned because spamassassin didn't finish in time (by default spamc will wait 30 seconds for a response and then send the message along unchecked).
pyzor_timeout 5
Default is 10 seconds. Again... Same as razor.
check_mx_attempts 1
This one is cut and dried. SpamAssassin takes the From: address, strips off the username@ and checks domain.com for a valid MX record. For some reason (and I really don't understand why) the default is to check twice. While I am a forgiving person, if your DNS server doesn't respond the first time I ask it about an MX record... Well... I think you have other problems, and I shouldn't waste my time giving you the benefit of the doubt.
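For convenience, here are all of the settings above in one place. The sketch below writes them to a file of your choosing so you can review them before merging them into your real config. `write_tuning_cf` is just an illustrative name, and the directive names are copied exactly as they appear in this article, so check them against the docs for the SpamAssassin version you actually run.

```shell
# Write the tuning directives from this article to the file named in $1.
write_tuning_cf() {
    cat > "$1" <<'EOF'
# SpamAssassin tuning settings collected from this article
use_bayes 1
auto_learn 1
use_pyzor 1
dns_available yes
dcc_dccifd_path /var/dcc/dccifd
rbl_timeout 10
razor_timeout 5
pyzor_timeout 5
check_mx_attempts 1
EOF
}
```

Something like `write_tuning_cf ./tuning.cf` produces the file. After merging the lines into your site config (often local.cf, though the location depends on your install), `spamassassin --lint` should tell you whether the parser is happy with them.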
One last thing before we go: if you run spamd as a user (in my case I run it as the user spamd), be CERTAIN the user has a valid home directory. SpamAssassin requires a writable location for its Bayesian databases as well as its autolearn databases. Without this directory SpamAssassin will be slower, and will not be able to accurately detect spam.
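A quick way to check the home directory is sketched below. The helper names `home_of` and `is_writable_dir` are made up for illustration, and `getent` is assumed to be available (it is on most Linux systems). Note that the writability test applies to whoever runs it, so run it from a shell started as the spamd user.

```shell
# Print the home directory recorded for a user in the passwd database.
home_of() {
    getent passwd "$1" | cut -d: -f6
}

# Succeed only if the directory exists and the *current* user can write to it.
is_writable_dir() {
    [ -d "$1" ] && [ -w "$1" ]
}

# Example, run as the spamd user:
#   home=$(home_of spamd)
#   is_writable_dir "$home" || echo "spamd home '$home' is missing or not writable" >&2
```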
And now for a cheap plug for myself and my company: If you are looking for a managed dedicated mail server, we are putting together a new service which includes true virtual domain mail hosting complete with spam filtering (courtesy of SpamAssassin) and virus filtering. If you are interested, give us a shout. Take a look at some of the other services we provide too.
Thanks for your time, and I hope you enjoyed this article. It is the fruit of many hours of testing and debugging.