More thoughts Re: spam filters as a general sorter

Hui Zhou zhouhui at wam.umd.edu
Sat Mar 12 07:36:15 PST 2005


On Sat, Mar 12, 2005 at 10:07:46AM -0500, Jason Gurtz wrote:
>On 12-Mar-05 09:29, Hui Zhou wrote:
>
>> Reading my own mail (this one that I just sent :)) and I realize that 
>> simple token treatment definitely won't work well enough to sort 
>> my post into interesting (How shameless :). It may work for 
>> categorization of regular notifications and alerts, but for a general 
>> chatting list, something more needs to be taken into account. Maybe 
>> the length of the original post? Or the proportion of quotes against 
>> reply? Or the average length of sentences?
>
>I think the hard part is really to come up with the heuristics that do
>the sorting.  Beyond that, it's just separating those heuristics into
>classes that each do the sort.  I personally find it harder to come up
>with regexes that generically match non-spam mail because I seem to

If the regexes are easy, or even possible, to come up with, procmail 
should be sufficient. 

>think more in terms of what I don't want.  Maybe you can take a similar
>approach in a hierarchy from "least want to read" to "most want to read"
>
>You may even want to look at something like MIMEDefang which gives you
>access via perl to many different message qualities.

I tend to access the mail stream directly and pull out whatever 
information seems worth considering.
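
As a rough illustration of reading the mail stream directly, here is a 
minimal Python sketch that pulls a few structural measurements out of one 
raw message. The function name message_features and the particular 
measurements are assumptions made for the example, and multipart messages 
are simply ignored.

```python
import email
import re
from email.utils import getaddresses

def message_features(raw):
    """Return a few structural measurements of one raw mail message."""
    msg = email.message_from_string(raw)
    body = msg.get_payload()
    if not isinstance(body, str):   # multipart mail: not handled in this sketch
        body = ""

    # Split non-empty body lines into quoted (">" prefix) and own text.
    quoted, own = [], []
    for line in body.splitlines():
        if not line.strip():
            continue
        (quoted if line.lstrip().startswith(">") else own).append(line)

    own_text = " ".join(own)
    sentences = [s for s in re.split(r"[.?!]+\s*", own_text) if s.strip()]
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    total = len(quoted) + len(own)

    return {
        "n_recipients": len(recipients),
        "n_lines": total,
        "quote_ratio": len(quoted) / total if total else 0.0,
        "avg_sentence_words": (len(own_text.split()) / len(sentences)
                               if sentences else 0.0),
    }
```

Calling message_features on the raw text of a message returns a plain 
dictionary, so the numbers can be logged or fed to whatever does the 
sorting.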

>  Number of
>recipients, time it was sent, envelope From:, etc....  That may give you
> more options in developing the heuristics and then you can just use it
>to add a custom header which procmail will then use for its sorting job.

No, I am not talking about a heuristic method. All heuristic methods seem 
simple-minded and require far more maintenance than initial 
development. What I am thinking of is treating heuristic characteristics 
as equivalent to word tokens and applying the statistics to them. 
Hopefully, over a large number of sample mails, the filter can figure out 
some patterns based on statistics.
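
To make the idea concrete, here is a minimal Python sketch, assuming a 
naive-Bayes-style score with add-one smoothing. Numeric characteristics 
are bucketed into pseudo-tokens such as "quote_ratio:2" and counted 
exactly like words. The names bucket and TokenStats, the bucket edges, 
and the labels are all made up for the example; only tokens present in a 
mail contribute to its score.

```python
import math
from collections import Counter

def bucket(name, value, edges):
    """Turn a number into a pseudo-token, e.g. 'quote_ratio:2'."""
    return f"{name}:{sum(value >= e for e in edges)}"

class TokenStats:
    def __init__(self):
        self.counts = {}          # label -> Counter(token -> mails containing it)
        self.totals = Counter()   # label -> number of training mails

    def train(self, tokens, label):
        self.counts.setdefault(label, Counter()).update(set(tokens))
        self.totals[label] += 1

    def score(self, tokens, label):
        n = self.totals[label]
        seen = self.counts[label]
        prior = math.log(n / sum(self.totals.values()))
        # Add-one smoothing keeps unseen tokens from giving log(0).
        return prior + sum(math.log((seen[t] + 1) / (n + 2))
                           for t in set(tokens))

    def classify(self, tokens):
        return max(self.totals, key=lambda label: self.score(tokens, label))
```

A mail is then described by its words plus whatever pseudo-tokens one 
cares to add, for instance ["thanks", bucket("quote_ratio", 0.9, [0.3, 
0.7])], and the filter itself works out which of them matter.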

Take SpamAssassin: it is a sophisticated, heuristic, rule-based 
filter. However, to use its full potential, users are 
asked to tune the weights of far too many individual rules. Frankly, I 
think no expert can get that right unless the subject is 
studied scientifically on a case-by-case basis, which means doing 
statistical analysis. 
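
As a sketch of what deriving weights from statistics could look like, the 
Python below computes a smoothed log-likelihood ratio per rule from 
labelled counts. This is only an illustration, not a description of how 
SpamAssassin assigns its scores, and the rule names used with it are 
hypothetical.

```python
import math

def rule_weights(hits, n_spam, n_ham):
    """hits maps rule name -> (spam mails that fired it, ham mails that fired it)."""
    weights = {}
    for rule, (spam_hits, ham_hits) in hits.items():
        # Smoothed estimates of P(rule fires | spam) and P(rule fires | ham).
        p_spam = (spam_hits + 1) / (n_spam + 2)
        p_ham = (ham_hits + 1) / (n_ham + 2)
        weights[rule] = math.log(p_spam / p_ham)
    return weights
```

A rule that fires equally often on both kinds of mail ends up with a 
weight near zero, while a rule that fires mostly on spam gets a positive 
weight, with no hand tuning involved.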

>Sounds like an interesting project anyway.

Good to hear.

-- 
Hui Zhou



More information about the lfs-chat mailing list