Improved SPAM detector (GSOC2009 proposal)


by Eloy Lafuente (stronk7) -
Number of replies: 5
Hi,

I just wanted to share an idea from my (big) list that could, perhaps, make an interesting GSOC project.

For some weeks now, Moodle has included a nice spam checker that can look for certain words in user profiles and posts.

My proposal is to evolve that into something more complete and (I hope) useful for the Community.

Here is the (general, 10-point) outline of the idea. Feel free to comment / discuss / modify... whatever:

Spam central server
  • 1. each time any registered server marks something as spam, it sends it to the central server (if agreed and configured to do so).
  • 2. the information provided could be: originator IP, hashed username and email, spam content.
  • 3. the central server hashes it, keeping a counter of occurrences of each text.
  • 4. any text with >X occurrences (or selected by another algorithm) is considered "official" spam.
  • 5. a list of "official" spam hashes is publicly available.
  • 6. any Moodle site can download it for use by the spam report.
  • 7. those hashes/texts can be of interest to other FLOSS projects (interchange option).
  • 8. can evolve to more complex ways of detection (not only hash based, but looking for other types of similarities: linguistic, users, IP...).
  • 9. can evolve beyond the limits of the current spam checker, so that anybody on a site has the option (via capability) to mark any content as spam, either for review by admins, for direct sending of the information to the SPAM server, or both.
  • 10. also, content can be checked each time it is about to be saved to the DB (optional), working either as an automated non-blocking reporting tool that informs admins about suspicious senders/contents, or as a blocking one (configurable).
  • ... anything else.


Note I'm not sure which is the best way to achieve this (so I designed it to work with simple hashes). Surely there are many ways to analyse senders / contents; I just don't know about them. But the tool, as outlined in points 1-6, will certainly detect repeated SPAM content based purely on moodlers' interaction with Moodle.

Also, note that points 7-10 are just potential (and possible) "expansions" of the project. The basic project itself is covered by "only" points 1-6.
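
As a rough illustration of points 1-6, here is a minimal in-memory sketch in Python. Everything in it is an assumption made for illustration: the class and function names are hypothetical, SHA-256 stands in for whichever hash would actually be chosen, and the data from point 2 (IP, hashed username and email) is left out.

```python
import hashlib
from collections import Counter


class SpamCentral:
    """Toy in-memory stand-in for the central server (hypothetical)."""

    def __init__(self, threshold):
        self.threshold = threshold   # the "X" from point 4
        self.counts = Counter()      # hash -> number of reports received

    @staticmethod
    def text_hash(text):
        # Point 3: the server keeps a hash of each reported text.
        # SHA-256 is an assumption; the proposal does not name a hash.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def report(self, text):
        # Points 1-3: a registered site reports content it marked as spam.
        self.counts[self.text_hash(text)] += 1

    def official_list(self):
        # Points 4-5: hashes reported more than X times get published.
        return {h for h, n in self.counts.items() if n > self.threshold}


def is_known_spam(text, official_hashes):
    # Point 6: a site checks local content against the downloaded list.
    return SpamCentral.text_hash(text) in official_hashes
```

A site would download the result of `official_list()` periodically and call `is_known_spam()` from its spam report. A real server would also need persistence and some protection against bogus reports, which this sketch ignores.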

Finally, note that I cannot volunteer (time availability) for any sort of mentorship in GSOC so, if anybody is interested and this gets enough interest, just take the baton.

And that's all. I hope it can serve, at least, as the origin of better and more elaborate ideas related to the SPAM problem. I just wanted to share it here (clearing it from my Ideas TODO list).

Ciao :-)
In reply to Eloy Lafuente (stronk7)

Re: Improved SPAM detector (GSOC2009 proposal)

by Hubert Chathi -
It sounds a lot like Vipul's Razor (and pyzor). Those people have probably thought long and hard about how best to do the "hashing", so it may be worth looking at those projects to see how they are doing things.
http://pyzor.sourceforge.net/
http://razor.sourceforge.net/
In reply to Hubert Chathi

Re: Improved SPAM detector (GSOC2009 proposal)

by Eloy Lafuente (stronk7) -
Yup, agreed. There are a lot of possible techniques out there to get "good" hashes.

I can imagine multiple hashes, based on different parts of the message (links, images, paragraphs...). Also whitelisted/blacklisted user hashes and a lot of combinations.
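
To make the "multiple hashes" idea concrete, here is a small Python sketch that hashes links and paragraphs separately, so two messages can match on one part even if the rest differs. The function names, the link regular expression, and the whitespace/case normalisation are all assumptions for illustration, not a researched design.

```python
import hashlib
import re

# Simplistic link matcher, good enough for a sketch.
LINK_RE = re.compile(r"https?://[^\s\"'<>]+")


def _digest(value):
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def part_hashes(message):
    """Return one set of hashes per kind of message part."""
    links = set(LINK_RE.findall(message))
    paragraphs = set()
    for block in re.split(r"\n\s*\n", message):
        # Drop links, lowercase, and collapse whitespace before hashing.
        text = " ".join(LINK_RE.sub("", block).lower().split())
        if text:
            paragraphs.add(text)
    return {
        "links": {_digest(link) for link in links},
        "paragraphs": {_digest(p) for p in paragraphs},
    }


def shared_parts(a, b):
    """Count the hashes two messages have in common, per kind of part."""
    ha, hb = part_hashes(a), part_hashes(b)
    return {kind: len(ha[kind] & hb[kind]) for kind in ha}
```

With this split, a spammer who rewrites the introduction but keeps the same link still produces a matching link hash.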

My steps 1-10 above were just an outline; if anybody decides to go for it, this should certainly be researched seriously.

Ciao :-)
In reply to Eloy Lafuente (stronk7)

Re: Improved SPAM detector (GSOC2009 proposal)

by Priyanka M -

Hi,

My name is Priyanka Menghani. I am in my second year of B. Tech (Bachelor of Technology) in Computer Science at Banasthali University, India. I like C, C++, PHP and Ruby on Rails. I have a bit of Open Source experience; please see my GitHub page for details [0].

I have done quite a bit of PHP, and used it for fun to make Stock Exchange simulation games, online puzzles etc. I am currently going through the Moodle code base. I also suggested a patch for a bug [1]. To understand Moodle even better, I am helping fix a couple more bugs.

I am interested in the Spam-Cleaner project idea. My suggested implementation of it is as follows.

1. Pro-active Marking of Posts

If there has been a large number of posts (say, Xi, where Xi varies for every user / IP) from the same user / IP in a fixed amount of time, then the posts are likely to be spam, or the user is flooding the forum.

In such a case, further posts are suspended and require the administrator's approval. If the administrator marks any of the posts as spam / flooding, the value of Xi for that user and IP is lowered.
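
The threshold idea above could be sketched as a sliding-window counter, shown here in Python. The class name, the default values, and the rule of lowering Xi by one per confirmed spam report are assumptions for illustration; the key can be a user id or an IP address, and a caller would check both.

```python
from collections import defaultdict, deque


class PostThrottle:
    """Per-key post limit (Xi) over a fixed time window (hypothetical)."""

    def __init__(self, default_limit=5, window=600, min_limit=1):
        self.default_limit = default_limit  # starting Xi for every key
        self.window = window                # window length in seconds
        self.min_limit = min_limit          # Xi never drops below this
        self.limits = {}                    # key -> current Xi
        self.recent = defaultdict(deque)    # key -> timestamps of recent posts

    def limit_for(self, key):
        return self.limits.get(key, self.default_limit)

    def allow(self, key, now):
        """Record a post at time `now`; return False if it must be held."""
        stamps = self.recent[key]
        # Forget posts that have fallen out of the window.
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()
        stamps.append(now)
        return len(stamps) <= self.limit_for(key)

    def confirm_spam(self, key):
        """Administrator confirmed spam or flooding: lower Xi for this key."""
        self.limits[key] = max(self.min_limit, self.limit_for(key) - 1)
```

A post for which `allow()` returns False would go to the approval queue rather than being published.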

The second strategy is a Bayesian filtering technique, which estimates the probability that a message is spam. If a message is highly likely to be spam, it is held for the administrator's approval. This is not done instantaneously; instead, a cron job analyses the messages in the background, so that there is no additional latency for clean messages.
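
For readers unfamiliar with the technique, here is a minimal naive Bayes sketch in Python. It is only an illustration under stated assumptions: the class name, the tokeniser, and the add-one smoothing are choices made here, and a production filter would need far more training data and tuning.

```python
import math
import re
from collections import Counter


class NaiveBayesFilter:
    """Tiny naive Bayes spam scorer with add-one smoothing (hypothetical)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokens(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, label):
        # label is "spam" or "ham", e.g. taken from an administrator's decision.
        self.message_counts[label] += 1
        self.word_counts[label].update(self.tokens(text))

    def spam_probability(self, text):
        vocab = len(set(self.word_counts["spam"]) | set(self.word_counts["ham"]))
        total = sum(self.message_counts.values())
        words = self.tokens(text)
        log_score = {}
        for label in ("spam", "ham"):
            # Smoothed class prior, then smoothed per-word likelihoods.
            score = math.log((self.message_counts[label] + 1) / (total + 2))
            n_words = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (n_words + vocab + 1))
            log_score[label] = score
        # Turn the two log scores into P(spam | text); clamp to avoid overflow.
        diff = max(-700.0, min(700.0, log_score["ham"] - log_score["spam"]))
        return 1.0 / (1.0 + math.exp(diff))
```

In the setup described above, the cron job would call `spam_probability()` on new messages and hold those above some cut-off for approval.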

2. Administrator's Action

The administrator can act on spam in three ways:

a. Decide on the posts that were filtered by the proactive marking.

The administrator can decide whether the filtering was correct or not. That decision then changes the value of Xi for the user/IP and also affects the weights used in the Bayesian filtering [2]. Thus, the system adjusts itself once it has been given some training.

b. Search for keywords

This is the same as the current functionality. The administrator can search for certain keywords and then take action accordingly.

c. Set up a blacklist / graylist

Blacklist certain users / words. Blacklisted users would not be allowed to post for a set period of time. The administrator has to be very careful when setting up a word blacklist, as words that occur in spam messages might also appear in normal messages. The administrator can also use the graylist, which affects the weights of the Bayesian filter.

I am not in favor of hashing messages and checking them against a central server. First, it is very easy for a spammer to change one punctuation mark, which changes the hash completely. Second, setting up a central server would cost both Moodle and the organization using the application. However, the Bayesian weights (which will differ for each installation depending on the spam activity on its network) could be shared upstream, which would help determine which configuration of weights works best against common spam so that it can be set as the default in the next version. This, however, needs to be taken up separately.
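
The first objection is easy to see with any cryptographic hash. The short Python sketch below uses SHA-256 as an assumed example hash (the thread does not name one) and compares the digests of two messages that differ only in the final punctuation mark.

```python
import hashlib


def sha256_hex(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def differing_hex_digits(a, b):
    """Count positions where two equal-length hex digests differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


original = sha256_hex("Buy cheap pills now!")
altered = sha256_hex("Buy cheap pills now.")
print(differing_hex_digits(original, altered), "of", len(original), "hex digits differ")
```

Normalising the text before hashing (for example, stripping punctuation) narrows this gap but does not close it, since a spammer can vary words as easily as punctuation.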

 

[0] http://wwww.github.com/priyanka-m

[1] http://tracker.moodle.org/browse/MDL-26754

[2] http://en.wikipedia.org/wiki/Bayesian_spam_filtering

Looking forward to your suggestions,

 

Warm Regards,

Priyanka M

(priyanka.menghani@gmail.com)