More thoughts Re: spam filters as a general sorter
zhouhui at wam.umd.edu
Sat Mar 12 07:36:15 PST 2005
On Sat, Mar 12, 2005 at 10:07:46AM -0500, Jason Gurtz wrote:
>On 12-Mar-05 09:29, Hui Zhou wrote:
>> Reading my own mail (this one that I just sent :)) and I realize that
>> simple token treatment definitely won't work well enough to sort
>> my post into interesting (How shameless :). It may work for
>> categorization of regular notifications and alerts, but for a general
>> chat list, something more needs to be taken into account. Maybe
>> the length of the original post? Or the proportion of quotes to reply? Or
>> the average length of sentences?
>I think the hard part is really to come up with the heuristics that do
>the sorting. Beyond that, it's just separating those heuristics into
>classes that each do the sort. I personally find it harder to come up
>with regexes that generically match non-spam mail because I seem to
If the regexes are easy, or even possible, to come up with, procmail
should be sufficient.
>think more in terms of what I don't want. Maybe you can take a similar
>approach in a hierarchy from "least want to read" to "most want to read"
>You may even want to look at something like MIMEDefang which gives you
>access via perl to many different message qualities.
I tend to access the mail stream directly and pull out whatever
information is worth considering.
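
A minimal sketch of what pulling such information straight from a raw
message could look like, using only Python's standard email package. The
function name message_qualities and the particular qualities chosen are
illustrative assumptions, not an existing tool. Note that it reads the
From: header, not the SMTP envelope sender, which is not part of the
message text.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime


def message_qualities(raw):
    """Extract a few simple qualities from a raw RFC 2822 message string.

    Hypothetical helper for illustration only.
    """
    msg = message_from_string(raw)
    # Count every address listed in To: and Cc:.
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    date_header = msg.get("Date")
    sent = parsedate_to_datetime(date_header) if date_header else None
    return {
        "from": msg.get("From", ""),          # header From:, not the envelope
        "n_recipients": len(recipients),
        "hour_sent": sent.hour if sent else None,  # hour in the sender's zone
    }
```

Each of these values could then be fed to whatever does the sorting.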
> Number of
>recipients, time it was sent, envelope From:, etc.... That may give you
> more options in developing the heuristics and then you can just use it
>to add a custom header which procmail will then use for its sorting job.
No, I am not talking about heuristic methods. All heuristic methods seem
simple-minded and require way more maintenance than initial
development. What I am thinking of is treating heuristic characteristics
as equivalent to word tokens and applying the same statistics to them.
Hopefully, over a large number of sample mails, the filter can figure out
some patterns based on statistics.
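
A rough sketch of the idea, assuming a naive-Bayes-style scorer over
token presence with add-one smoothing. The characteristics (post length,
proportion of quoted lines, average sentence length) are bucketed and
turned into pseudo-tokens such as QUOTE:3, so they are counted exactly
like words. The bucket edges, the class name TokenStats and the labels
are all made up for illustration; nothing here is tuned or tested on
real mail.

```python
import math
import re
from collections import Counter, defaultdict


def bucket(value, edges):
    """Return the index of the first edge that value falls below."""
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)


def tokens(body):
    """Word tokens plus pseudo-tokens for the post's characteristics."""
    lines = body.splitlines()
    quoted = [ln for ln in lines if ln.lstrip().startswith(">")]
    own_text = " ".join(ln for ln in lines if not ln.lstrip().startswith(">"))
    words = own_text.lower().split()
    sentences = [s for s in re.split(r"[.!?]+", own_text) if s.strip()]
    quote_ratio = len(quoted) / len(lines) if lines else 0.0
    avg_sentence = len(words) / len(sentences) if sentences else 0.0

    toks = [w.strip(".,!?;:()\"'") for w in words]
    toks = [t for t in toks if t]
    # Upper-case prefixes cannot collide with the lower-cased word tokens.
    toks.append("LEN:%d" % bucket(len(words), [50, 200, 500]))
    toks.append("QUOTE:%d" % bucket(quote_ratio, [0.25, 0.5, 0.75]))
    toks.append("SENT:%d" % bucket(avg_sentence, [8, 15, 25]))
    return toks


class TokenStats:
    """Counts, per label, how many training mails contained each token."""

    def __init__(self):
        self.counts = defaultdict(Counter)
        self.docs = Counter()

    def train(self, label, body):
        self.docs[label] += 1
        self.counts[label].update(set(tokens(body)))

    def score(self, label, body):
        # log P(label) + sum of log P(token present | label), add-one smoothed.
        n = self.docs[label]
        logp = math.log(n / sum(self.docs.values()))
        for tok in set(tokens(body)):
            logp += math.log((self.counts[label][tok] + 1) / (n + 2))
        return logp

    def classify(self, body):
        # Only labels seen during training are considered.
        return max(self.docs, key=lambda label: self.score(label, body))
```

After a few calls to train() with hand-labelled mails, classify() picks
the label with the highest score. No weights are tuned by hand; adding
another characteristic only means adding another pseudo-token.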
Thinking of SpamAssassin: it is a sophisticated, heuristic, rule-based
filter. However, to use its potential to the fullest, the users are
asked to tune the weights of way too many individual rules. Frankly, I
think no expert can get that right unless the subject is
scientifically studied on a case-by-case basis, which is to do
>Sounds like an interesting project anyway.
Good to hear.