mail complexity
Aug. 27th, 2005 09:59 pm
It's not like I get all that much spam. But some of it causes Eudora 4.2 to crash, and I don't feel like using the adware version or paying all over again for a program that's fairly similar in features. So a while ago I installed CRM114, and I've been training it since. It got to 98 or 99% accuracy quickly, but stubbornly refuses to get any better than that. And the failures happen in both directions: spam that gets through, and real mail that gets flagged as spam.
Recently I've been trying a compromise between TOE and TEFT; if a message comes through with a confidence level under 100, I'll train on it. That hasn't really helped CRM114 converge any quicker. I think fast convergence really requires shared data sets, a la Google.
Speaking of which, I've also got a Gmail account (with the same username as this one). I use it for signing up for commercial services that I think will sell my address or otherwise be annoying, and I mostly only check that address when I am expecting a particular piece of mail. I thought of giving up on maintaining my own Bayesian filters and just forwarding all my mail to Gmail (which, near as I can tell, uses pattern-based filtering and ever-vigilant professional pattern authors), but Eudora 4.2 doesn't talk POP over SSL, and I want to be able to read mail offline, and to search current and historic mail together. And I do like Bayesian filters' ability to give me only that portion of a mailing list's traffic that will actually interest me, even when the non-interesting parts aren't spam per se.
So, why not filter just the spam to Gmail? Forwarding just the high-confidence messages should keep my Eudora from crashing, and I'll still get the false positives on my Eudora client where I can see them. But the Eudora mailbox filter that separated low-confidence from high-confidence spam ran after whatever was making Eudora crash, so the sorting had to happen before the mail ever reached Eudora. Fortunately, rewriting the filter in procmail wasn't too hard, and now my high-confidence spam goes to my Gmail account, where it can rot for 30 days before Google automatically deletes it.
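Roughly, the routing works like the Python sketch below. It is a stand-in for the procmail recipe, which isn't reproduced here; the names `Route` and `route_message`, and the 0 to 100 confidence scale with 100 as the forwarding cutoff, are assumptions made for illustration.

```python
from enum import Enum


class Route(Enum):
    INBOX = "deliver to the normal inbox"
    LOCAL_SPAM = "keep in a local spam folder for review"
    FORWARD = "forward to the Gmail account"


def route_message(is_spam: bool, confidence: float, cutoff: float = 100.0) -> Route:
    """Decide where a classified message goes.

    is_spam    -- the filter's verdict (hypothetical input)
    confidence -- how sure the filter is, assumed to run from 0 to 100
    cutoff     -- assumed threshold at or above which spam is forwarded
    """
    if not is_spam:
        return Route.INBOX
    if confidence >= cutoff:
        # High-confidence spam never reaches the local mail client.
        return Route.FORWARD
    # Low-confidence spam stays local so false positives remain visible.
    return Route.LOCAL_SPAM
```

The point of the split is that only messages the filter is sure about leave the machine; anything doubtful still lands where it can be checked by hand.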
CRM114: Crash's Bayesian mail filtering program.
TOE: Train On Error.
TEFT: Train Everything.
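The compromise between TOE and TEFT described above can be sketched in a few lines of Python. This is only an illustration of the rule, not anything CRM114 provides: the function name `should_train` and the 0 to 100 confidence scale are assumptions.

```python
def should_train(predicted_spam: bool, actually_spam: bool,
                 confidence: float, full_confidence: float = 100.0) -> bool:
    """Return True if the message should be fed back to the filter.

    TOE alone would train only when the verdict was wrong.
    TEFT would train on every message.
    The compromise trains on errors and on any verdict that was
    right but scored below full confidence (assumed scale: 0 to 100).
    """
    misclassified = predicted_spam != actually_spam
    unsure = confidence < full_confidence
    return misclassified or unsure
```

Under this rule a correctly classified message at full confidence is skipped, which is what keeps the training volume below TEFT.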
no subject
Date: 2005-08-28 03:49 am (UTC)
Greylisting has eliminated most of the spam we would otherwise get. SpamAssassin catches most of the Nigerian scam stuff.
no subject
Date: 2005-08-28 06:26 am (UTC)
no subject
Date: 2005-08-28 11:06 pm (UTC)