Improved SPAM detector (GSOC2009 proposal)

by Priyanka M

Hi,

My name is Priyanka Menghani. I am in the second year of a B. Tech (Bachelor of Technology) in Computer Science at Banasthali University, India. I like C, C++, PHP and Ruby on Rails. I have some open source experience; please see my GitHub page [0].

I have written a fair amount of PHP, using it for fun to build stock exchange simulation games, online puzzles and so on. I am currently going through the Moodle code base. I have also suggested a patch for a bug [1], and to understand Moodle better I am helping to fix a couple more.

I am interested in the Spam-Cleaner project idea. My suggested implementation of it is as follows.

1. Proactive Marking of Posts

If the same user / IP makes more than a certain number of posts (say Xi, where Xi varies for every user / IP) in a fixed amount of time, the posts are likely to be spam, or the user is flooding the forum.

In such a case, further posts are suspended and require the administrator's approval. If the administrator marks any of these posts as spam / flooding, the value of Xi for that user and IP is lowered.
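A minimal sketch of this per-poster threshold, in Python only for brevity (Moodle itself is PHP). The class name, the window length, the default limit and the "lower Xi by one" rule are all assumptions made for illustration; the key can be a user id, an IP address, or both.

```python
from collections import defaultdict, deque


class PostRateMonitor:
    """Holds posts once a poster exceeds a per-poster limit Xi inside a time window."""

    def __init__(self, window_seconds=600, default_limit=5, min_limit=1):
        self.window_seconds = window_seconds   # assumed window length
        self.default_limit = default_limit     # assumed starting value of Xi
        self.min_limit = min_limit             # Xi is never lowered below this
        self.limits = {}                       # poster key -> current Xi
        self.recent = defaultdict(deque)       # poster key -> recent post timestamps

    def limit_for(self, key):
        return self.limits.get(key, self.default_limit)

    def record_post(self, key, timestamp):
        """Record a post; return True if it should be held for approval."""
        times = self.recent[key]
        times.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while times and timestamp - times[0] > self.window_seconds:
            times.popleft()
        return len(times) > self.limit_for(key)

    def confirm_spam(self, key):
        """The administrator confirmed spam / flooding: lower Xi for this poster."""
        self.limits[key] = max(self.min_limit, self.limit_for(key) - 1)


monitor = PostRateMonitor(window_seconds=60, default_limit=3)
for t in (0, 10, 20, 30):
    held = monitor.record_post("user-42", t)
    print(t, "held for approval" if held else "published")
```

How far Xi drops after each confirmed report, and whether it should recover over time, would need tuning on real forum data.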

The second strategy is Bayesian filtering, which estimates the probability that a message is spam. If a post is highly likely to be spam, it is held for the administrator's approval. This is not done at posting time; instead, a cron job analyses the messages in the background, so that clean messages incur no additional latency.
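To make the Bayesian part concrete, here is a small naive Bayes sketch in Python (used only for brevity; Moodle itself is PHP). The tokenizer, the add-one smoothing and the class and method names are assumptions for illustration, not a finished design. A background job could call spam_probability on each new post and hold those above a chosen threshold.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesSpamFilter:
    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Learn from a post the administrator has labelled 'spam' or 'ham'."""
        self.message_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def spam_probability(self, text):
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        words = tokenize(text)
        log_scores = {}
        for label in ("spam", "ham"):
            # Smoothed prior, so an untrained class never produces log(0).
            prior = (self.message_counts[label] + 1) / (total_messages + 2)
            score = math.log(prior)
            total_words = sum(self.word_counts[label].values())
            denominator = total_words + len(vocabulary) + 1
            for word in words:
                # Add-one smoothing keeps unseen words from zeroing the score.
                score += math.log((self.word_counts[label][word] + 1) / denominator)
            log_scores[label] = score
        # Turn the two log scores into P(spam | text); clamp to avoid overflow.
        diff = max(min(log_scores["ham"] - log_scores["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))


spam_filter = NaiveBayesSpamFilter()
spam_filter.train("buy cheap pills now", "spam")
spam_filter.train("cheap pills online", "spam")
spam_filter.train("the quiz deadline is friday", "ham")
spam_filter.train("forum discussion about the course quiz", "ham")
print(spam_filter.spam_probability("cheap pills"))
print(spam_filter.spam_probability("course quiz deadline"))
```

Each administrator decision becomes one more train call, which is how the weights would adjust over time.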

2. Administrator's Action

The administrator can act on spam in three ways:

a. Decide on the posts that have been held by the proactive marking.

The administrator decides whether the filtering was correct or not. This decision then changes the value of Xi for the user / IP and also affects the weights used in the Bayesian filtering [2]. Thus the system adjusts itself once it has been given some training.

b. Search for keywords

This is the same as the current functionality. The administrator can search for certain keywords and then take action accordingly.

c. Set up a blacklist / graylist

The administrator can blacklist certain users, who are then not allowed to post for a set period of time. Words can also be blacklisted. The administrator has to be very careful when setting up a word-specific blacklist, as words that occur in spam messages may also be part of normal messages. The administrator can also use the graylist, which affects the weights of the Bayesian filter.
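A rough sketch of how the two lists could behave, in Python only for brevity (Moodle itself is PHP). All names here are hypothetical. In particular, treating a graylist entry as an extra score added to the Bayesian filter's result is just one assumed way of letting the graylist affect the weights. Matching whole words is one way to reduce false positives on normal messages.

```python
import re


def words_in(text):
    return set(re.findall(r"[a-z0-9']+", text.lower()))


class PostingPolicy:
    def __init__(self):
        self.banned_until = {}          # user -> timestamp when the ban expires
        self.blacklisted_words = set()
        self.graylist = {}              # word -> extra spam score (assumed mechanism)

    def ban_user(self, user, now, duration_seconds):
        self.banned_until[user] = now + duration_seconds

    def is_banned(self, user, now):
        return now < self.banned_until.get(user, 0)

    def contains_blacklisted_word(self, text):
        # Whole-word match: "specialist" must not trip a rule for "cialis".
        return not words_in(text).isdisjoint(self.blacklisted_words)

    def graylist_adjustment(self, text):
        # Sum of the extra scores for graylisted words present in the text.
        return sum(self.graylist.get(word, 0.0) for word in words_in(text))


policy = PostingPolicy()
policy.ban_user("user-42", now=1000, duration_seconds=3600)
policy.blacklisted_words.add("cialis")
policy.graylist.update({"free": 0.5, "winner": 1.0})

print(policy.is_banned("user-42", now=2000))
print(policy.contains_blacklisted_word("Ask a specialist about the course"))
print(policy.contains_blacklisted_word("Buy Cialis now"))
print(policy.graylist_adjustment("You are a WINNER of a free prize"))
```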

I am not in favor of hashing messages and checking them against a central server. Firstly, it is very easy for a spammer to change one punctuation mark, and the hash changes completely. Secondly, a central server would cost both Moodle and the organization using the application. The Bayesian weights (which will differ for each installation depending on the spam activity on its network) could, however, be shared with upstream, which could help determine which configuration of weights works best against common spam and make it the default in the next version. This, however, needs to be taken up separately.
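The first objection to hashing is easy to demonstrate. The snippet below (Python, with SHA-256 chosen only as an example of a cryptographic hash; the sample messages are made up) hashes two messages that differ in a single punctuation mark and counts how many hex digits of the two digests differ.

```python
import hashlib


def message_hash(text):
    """Return the SHA-256 digest of a message as a hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


original = "Cheap pills available now!"
variant = "Cheap pills available now."   # only the final character changed

hash_a = message_hash(original)
hash_b = message_hash(variant)
differing = sum(1 for a, b in zip(hash_a, hash_b) if a != b)

print(hash_a)
print(hash_b)
print(f"{differing} of {len(hash_a)} hex digits differ")
```

An exact-match lookup against a shared list of hashes would therefore miss the second message, even though a human reader sees the same spam.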

 

[0] http://www.github.com/priyanka-m

[1] http://tracker.moodle.org/browse/MDL-26754

[2] http://en.wikipedia.org/wiki/Bayesian_spam_filtering

I look forward to your suggestions.

 

Warm Regards,

Priyanka M

(priyanka.menghani@gmail.com)