László Kozma




Wikipedia vandalism - data mining approach

A significant portion of Wikipedia edits consists of vandalism. This can be followed on wikipedivision, among other places. I'm not familiar with the work of Wikipedia maintainers and volunteers, but I assume it takes a lot of effort to spot and revert all of these edits. A (semi-)automated approach could be useful, in which data mining and machine learning algorithms estimate the probability that an edit is vandalism (much as email filters detect spam). Such a model could be trained on an annotated corpus of legitimate edits and vandalism, and then used to aid editors. Possible features include the time of the edit, the length of the removed text, the length of the added text, the edit history of the user (or IP address), the edit history of the article, the contents of the edit, and many others. From the vast Wikipedia data it would be quite easy to produce benchmark data sets for learning and for data mining challenges.
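To make the idea concrete, here is a minimal sketch of such a filter: a handful of the features listed above fed into a small logistic regression trained by gradient descent. Everything in it is an assumption made for illustration: the `Edit` record and its field names are hypothetical (not Wikipedia's schema), the particular feature encodings are one choice among many, and the eight training edits are made up rather than taken from real data.

```python
import math
from dataclasses import dataclass


@dataclass
class Edit:
    """Hypothetical edit record; the field names are illustrative assumptions."""
    hour: int                # hour of day of the edit (0-23)
    removed_text: str
    added_text: str
    user_prior_edits: int    # earlier edits by this user or IP address
    user_prior_reverts: int  # how many of those were reverted


def features(edit: Edit) -> list[float]:
    """Turn an edit into a numeric feature vector (bias term first)."""
    added = edit.added_text
    letters = sum(c.isalpha() for c in added)
    upper = sum(c.isupper() for c in added)
    angle = 2 * math.pi * edit.hour / 24  # cyclic encoding of the time of edit
    return [
        1.0,                                   # bias
        math.log1p(len(edit.removed_text)),    # length of removed text
        math.log1p(len(added)),                # length of added text
        upper / letters if letters else 0.0,   # crude content signal: share of capitals
        edit.user_prior_reverts / (edit.user_prior_edits + 1),  # user history
        1.0 if edit.user_prior_edits == 0 else 0.0,             # first-time editor
        math.sin(angle),
        math.cos(angle),
    ]


def predict_proba(weights: list[float], x: list[float]) -> float:
    """Estimated probability that the edit with features x is vandalism."""
    z = sum(w * xi for w, xi in zip(weights, x))
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, labels, epochs=2000, lr=0.1):
    """Fit logistic regression weights with plain stochastic gradient descent."""
    weights = [0.0] * len(samples[0])
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = predict_proba(weights, x) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
    return weights


# Made-up toy corpus (label 1 = vandalism, 0 = legitimate); not real statistics.
blanked = "A long paragraph about the history of the town. " * 3
corpus = [
    (Edit(10, "teh", "the", 120, 2), 0),
    (Edit(14, "", "The bridge was completed in 1932.", 45, 0), 0),
    (Edit(20, "colour", "color", 300, 5), 0),
    (Edit(9, "", "See also the list of rivers.", 12, 0), 0),
    (Edit(2, blanked, "", 0, 0), 1),
    (Edit(23, "", "BOB WAS HERE!!!", 3, 3), 1),
    (Edit(1, "The mayor is elected every four years.", "LOL LOL LOL", 0, 0), 1),
    (Edit(3, "", "THIS ARTICLE IS STUPID", 5, 4), 1),
]
weights = train([features(e) for e, _ in corpus], [y for _, y in corpus])

# Score two unseen edits.
suspicious = Edit(2, "", "HAHAHA NOBODY READS THIS", 1, 1)
routine = Edit(11, "recieve", "receive", 80, 1)
p_suspicious = predict_proba(weights, features(suspicious))
p_routine = predict_proba(weights, features(routine))
print(f"suspicious edit: {p_suspicious:.2f}, routine edit: {p_routine:.2f}")
```

On this toy corpus the all-caps edit scores well above the spelling fix, but that says nothing about real performance. In practice the score would more plausibly be used to rank edits for human review than to revert anything automatically, in keeping with the semi-automated approach described above.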

Is something like this already in use? Has anyone looked at Wikipedia's historical edit data from this point of view?


