Mike's Notes
Paul Graham wrote this neat article about dealing with spam, introducing the use of Bayesian filtering.
Resources
- https://paulgraham.com/spamfaq.html
- https://en.wikipedia.org/wiki/Paul_Graham_(programmer)
- https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering
- https://en.wikipedia.org/wiki/Bayes%27_theorem
"Paul Graham (/ɡræm/; born November 13, 1964) is an English-American computer scientist, writer, entrepreneur and investor. His work has included the programming language Arc, the startup Viaweb (later renamed Yahoo! Store), co-founding the startup accelerator and seed capital firm Y Combinator, his essays, and Hacker News.
He is the author of the computer programming books On Lisp, ANSI Common Lisp, and Hackers & Painters. Technology journalist Steven Levy has described Graham as a "hacker philosopher".
Graham was born in England, where he and his family have maintained a permanent residence since 2016. He is also a citizen of the United States, where he attended all of his schooling and lived for 48 years prior to returning to England." - Wikipedia
Plan for Spam FAQ
Is this code available anywhere?
No; it's written in Arc, which is itself not released yet.
Is there a Bayesian filter for Outlook?
I know of nine so far: Spammunition, SpamBayes, Spam Bully, InboxShield, Junk-Out, Outclass, Disruptor OL, SpamTiger, and JunkChief.
Is there anything that can protect my company's server?
The best commercial server-level Bayesian filter is probably Death2Spam. SpamProbe, one of the best open-source Bayesian filters, can also be run on the server.
Most commercial server-level spam filters are still rule-based. But there are starting to be some that use Bayesian filtering. The way to find them is probably to search in Google.
The question to ask the salesman is, does the filter learn to recognize spam based on the spam and nonspam mail we receive? If it doesn't learn, it isn't Bayesian.
Does Arc/Lisp have some advantage for writing this kind of software?
Lisp's symbol type is useful in manipulating databases of words, because it lets you test for equality by just comparing pointers. (If you find yourself using checksums to identify words, it's symbols you mean to be using.) Prolog, Erlang, and Ruby also support symbols. So does Python, to a degree, though I don't think it has any syntax for them yet.
More generally, Lisp was designed for rapid prototyping, and this application involved a lot of that. I probably spent 95% of the development time typing expressions into the toplevel, trying variously tweaked algorithms on individual emails.
Do you mind if I write filters based on this algorithm?
Of course not. I don't claim to have invented anything new here. Bayesian text classification is an old and established field. If there is anything new in this article (at least, it was news to me) it is that such a simple variant of it is so effective at filtering spam.
So by all means go write spam filters. It is a very rewarding hack. If you end up creating something other people can use, let me know and I'll make a link to it.
I don't know Lisp; can you explain the algorithm to me?
It's expressed in math notation in Hackers & Painters.
Could you use Bayesian filtering to make Web content filters?
I've never tried it, but yes, I think it would work well.
Will this algorithm filter out 99.5% of my spam with no false positives?
It does filter out 99.5% of my spam. I got similar results for the one other spam collection I've tested it on so far. (I couldn't measure false positives, because I only had the spams.)
I do worry a bit that my email might just be especially easy to filter. A lot of my spam comes from "opt-in" spammers like Qves and Virtumundo, and that stuff is really easy to detect. Plus my own mail is full of references to programming that make it easy to distinguish from spam (though not "programming" itself, ironically, which often occurs in spams selling satellite dishes).
So if you do try implementing this filter for yourself, I'd appreciate it if you could let me know how well it works for you. My gut feel is that it will work pretty effectively for almost everyone, but it would be reassuring to hear numbers.
Could spammers fool Bayesian filters by filling their spams with random words?
They would have to get rid of the bad words as well as adding neutral ones. Only the most interesting fifteen words contribute to the probability, and neutral words like "onion", no matter how many there are of them, can't compete with the incriminating "viagra" for statistical significance.
To outweigh incriminating words, the spammers would need to dilute their emails with especially innocent words, i.e. those that are not merely neutral but occur disproportionately often in the user's legitimate email. But these words (e.g. the nicknames of one's friends, terms one uses in one's work) are different for each recipient, and the spammers have no way of figuring out what they are.
Once this software was available, couldn't spammers just tune their spams to get through it?
They couldn't necessarily tune their emails and still say what they wanted to say. If they wanted to send you to a url that is known to the filters, for example, they would find it hard to tune their way around that.
Second, tune using what? Each user's filters will be different, and the innocent words will vary especially. At most, spammers will be able to dilute their mails with merely neutral words, and those will not tend to be much use because they won't be among the fifteen most interesting.
If the spammers did try to get most of the incriminating words out of their messages, they would all have to use different euphemisms, because if they all started saying "adolescents" instead of "teens", then "adolescents" would start to have a high spam probability.
Finally, even if spammers worked to get all the incriminating words out of the message body, that wouldn't be enough, because in a typical spam a lot of the incriminating words are in the headers.
What if spammers sent their messages as images?
They already do sometimes, and we are able to catch them. Such emails actually include a lot of damning content. The headers, to start with, will be as bad as ever. And remember that we scan html as well as text. Within the message body there will probably be a link as well as the image, both containing urls, which would probably score high. "Href" and "img" themselves both have spam probabilities approaching those of pornographic words. And within the url, the image has to have some kind of name, and these names are usually far from random.
Can your program deal with the spam trick of inserting html comments in the middle of words?
Yes, I ignore html comments down at the level of scanning tokens, not even considering them as separators.
Spammers sometimes use randomly generated tags to break up tokens, since html rendering software will usually ignore meaningless tags. I do allow these to separate tokens, and it works fine. Broken bits of words simply get high spam probabilities.
Would this article itself be filtered out by your filters?
No. Someone sent it to me as a test, and it wasn't. Although the article contains a lot of spam words, it also contains a lot of even more provably innocent words (like "spammers", ironically, which occurs in a lot of my real email and never in my spam). Since only the most interesting words count, the innocent words crowded out the spam words in the rankings, and this mail ended up with the minimum possible probability.
That makes sense, because an article someone writes is more likely to resemble the content of their own email than it would a spam, even if the article happens to be about spam.
How well can your software filter out mail that's about spam?
This is a problem for anyone who works on spam filtering. In the extreme case, if someone you've never heard from merely forwards you a spam with no additional comment, it's going to be hard to filter out. There it will all come down to the headers.
But after all, if someone could forward you spam with impunity, then spammers could forward you spam too. So for these situations we may have to have a special password that people could include in mail to get it past the filters, and also (I have one already) a special trash folder for metaspam so that it doesn't contribute to the probabilities.
If there is a good side to this, it's that if we can create filters that work acceptably well for us, they'll work even better for everyone else.
Will this software filter out good automated email like order receipts and newsgroup faqs?
Mostly not. That kind of mail may be automated, but the text usually has a very different character from spam.
The only automated responses that tend to get filtered out are the ones that contain a lot of additional advertising. In effect, these emails consist of an automated response with a spam appended to the end, so it is not surprising that filters catch them.
Will the kind of people who would respond to a spam use filters?
I think so. I'm guessing here, but I suspect that people stupid enough to respond to a spam will often get email through one of the free services like Yahoo Mail or Hotmail, or through big ISPs like AOL or Earthlink. Once word spreads that it is possible to filter out most spam, they'll be forced to offer effective filters.
If filters catch most spam, won't spammers just send more to compensate?
Spammers already operate at capacity.
How do you feel about blacklists?
As ISPs use them now, they're equivalent to very bad filters. But the information they supply could be incorporated into Bayesian filtering.
Doesn't Bayes' Rule assume the predictors are independent?
Yes; strictly speaking, it is only valid assuming the probability of, e.g., "sex" occurring in an email is unrelated to the probability of "sexy" occurring in it. Obviously that is not the case.
On the other hand, there is a long tradition of violating this requirement. When you do that it's called a "naive Bayesian" algorithm and in practice it works pretty well, just as in practice (if you stay away from the edges of precision) it works pretty well to treat floating point numbers as if they were reals.
Does this algorithm consider the prior probability of a mail being a spam?
No. It assumes it's equally likely for a mail to be a spam or not. So accuracy could be slightly improved by taking into account the actual proportion of spam received by each user. You would have to do this by hour, though, because the proportion varies so much depending on the time of day that the overall prior probability is useless as a predictor.
Why do you consider 15 words? What's special about 15?
Actually, in the latest version of the filter I use 20, because marking tokens by context means the software now sees more distinct tokens per mail.
Nearly all spams tend to have that many fairly incriminating words in them. If you look at more than that, you'll start to include neutral words, which are noise at best. But you want to look at as many words as you can, because legitimate emails may well contain two or three words with high spam probabilities-- if you used only the top 5 words, for example, you'd start to get false positives.
Will this work for languages other than English?
As long as they use whitespace to separate tokens, it should.