Mass deletion of spam revisions in MediaWiki


Solution 1

If you don't want to use the export-and-reinstall method suggested by danlefree, you might also find the Nuke extension useful. Once installed, visiting the special page Special:Nuke as an administrator gives you a form like this:

[Screenshot: MediaWiki Nuke extension interface]

There are also several built-in MediaWiki maintenance scripts that could be useful, including:

  • cleanupSpam.php, which can be used to roll back and/or delete all revisions containing a link to a particular hostname,

  • deleteBatch.php, which can be used to delete all pages listed in a file, and

  • rollbackEdits.php (which doesn't currently seem to have proper on-wiki documentation), which can be used to roll back all edits of a specified user.
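For deleteBatch.php, the list file has to be prepared first. The following is a minimal sketch of one way to do that, assuming the script expects one page title per line (check the script's --help output on your MediaWiki version). The helper name `write_delete_list`, the file name and the example titles are made up for illustration.

```python
import os
import tempfile
from pathlib import Path


def write_delete_list(titles, path):
    """Write unique, non-empty page titles to `path`, one per line.

    Returns the number of titles written.
    """
    seen = set()
    cleaned = []
    for title in titles:
        title = title.strip()
        # Skip blank entries and titles that were already listed.
        if title and title not in seen:
            seen.add(title)
            cleaned.append(title)
    Path(path).write_text("".join(t + "\n" for t in cleaned), encoding="utf-8")
    return len(cleaned)


# Demo with invented spam page titles, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    list_file = os.path.join(tmp, "spam_pages.txt")
    count = write_delete_list(
        ["Buy pills", " Cheap watches ", "", "Buy pills"], list_file
    )
    contents = Path(list_file).read_text(encoding="utf-8")
```

The resulting file is what you would then hand to deleteBatch.php as its list of pages.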


Spam cleanup using direct database access

It's also possible to do what you want by directly manipulating the database. The details can vary a bit depending on your situation, but the basic steps would go something like this:

  1. Set your wiki to read-only mode. You do not want someone to try editing the wiki while you're messing with the database.

  2. Make a backup of your wiki. (This is highly recommended before any irreversible mass deletions anyway.)

  3. Delete all user accounts created by the spammers. If, as in the question above, you were the only valid user, you can just do:

    DELETE FROM user WHERE user_id != YOUR_USER_ID;
    

    Alternatively, if no new valid accounts were created after the spammers discovered the wiki, you can find the highest valid user ID number and do:

    DELETE FROM user WHERE user_id > LAST_VALID_USER_ID;
    

    Or you can use an admin tool like phpMyAdmin to manually pick out the valid accounts and delete the rest.

  4. Clean up the extra data associated with the deleted accounts. This is not strictly necessary, but those orphaned records have no use and will just clutter your database if you don't delete them:

    DELETE FROM user_groups WHERE ug_user NOT IN (SELECT user_id FROM user);
    DELETE FROM user_properties WHERE up_user NOT IN (SELECT user_id FROM user);
    DELETE FROM user_newtalk WHERE user_id NOT IN (SELECT user_id FROM user);
    
  5. Delete any revisions not made by a valid user:

    This is the big step; everything before it was preparation, everything after it is cleanup. With all the spam accounts deleted, you can simply do:

    DELETE FROM revision WHERE rev_user > 0 AND rev_user NOT IN (SELECT user_id FROM user);
    

    If your wiki had anonymous editing disabled (which I strongly recommend for private / test wikis), the query above should be enough to get rid of all the spam revisions. If you had anon editing enabled, though, you'll have to nuke the anonymous spam separately.

    If you're sure that all anon edits on your wiki are spam, the only edits made by UID 0 that we may need to preserve are those made by MediaWiki itself (such as pages imported from outside the wiki). In that case, something like the following query should work:

    DELETE FROM revision WHERE rev_user = 0 AND rev_user_text BETWEEN '1' AND '999';
    

    This will delete any revisions by UID 0 where the username looks (vaguely) like an IPv4 address; that is, it starts with a digit between 1 and 9.

    If your wiki has some actual legitimate anon edits, you may have to get a bit more creative. If the number of IP addresses used by legitimate unregistered editors is limited, you can just add a clause like AND rev_user_text NOT IN ('1.2.3.4', '5.6.7.8', '9.10.11.12') to the query above to exclude contributions by those IPs from deletion. You can also add conditions like, say, AND rev_user_text NOT LIKE '192.168.%' to save all edits from IP addresses beginning with a particular prefix.

  6. The queries above will get rid of the spam revisions (although their content will still remain in the text table), but will leave the page_latest field of any affected pages pointing to a nonexistent revision. This could cause confusion, so we'd better fix it.

    First, we need to wipe out the page_latest column for all pages:

    UPDATE page SET page_latest = 0;
    
  7. Next, we'll rebuild the column, either by running the attachLatest.php maintenance script (recommended; remember to use the --fix parameter so that the script actually changes the database) or with a manual SQL query:

    -- COALESCE leaves page_latest at 0 (not NULL) for pages with no remaining revisions
    UPDATE page SET page_latest =
        COALESCE((SELECT MAX(rev_id) FROM revision WHERE rev_page = page_id), 0);
    
  8. Finally, we'll delete all pages for which no valid revisions could be found (because they were created by spammers, and never had any valid content):

    DELETE FROM page WHERE page_latest = 0;
    
  9. For a final touch, rebuild the links, text index and recent changes tables by running the rebuildall.php maintenance script. You may also want to remove the content of the deleted spam revisions from the database, so that they won't take up unnecessary space there, by running the purgeOldText.php maintenance script.
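To see what the queries in steps 3 and 5 to 8 do before touching a live wiki, they can be rehearsed on a throwaway database. The following is a minimal sketch, assuming Python's built-in sqlite3 module and a drastically simplified toy schema containing only the columns used above. A real MediaWiki database has many more columns, and newer versions record authorship through the actor table rather than rev_user. The sample rows and the name `rehearse_cleanup` are invented for illustration. The sketch wraps the step 7 subquery in COALESCE, since the bare subquery yields NULL rather than 0 for a page with no revisions left.

```python
import sqlite3


def rehearse_cleanup(valid_user_id):
    """Run the cleanup queries on a toy in-memory database.

    Returns the surviving (page_id, page_latest) rows.
    """
    db = sqlite3.connect(":memory:")
    db.executescript("""
        -- Toy schema: only the columns the cleanup queries touch.
        CREATE TABLE user (user_id INTEGER PRIMARY KEY, user_name TEXT);
        CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_latest INTEGER);
        CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER,
                               rev_user INTEGER, rev_user_text TEXT);

        INSERT INTO user VALUES (1, 'Admin'), (2, 'Spambot1'), (3, 'Spambot2');
        INSERT INTO page VALUES (10, 103), (11, 104), (12, 105);
        INSERT INTO revision VALUES
            (100, 10, 1, 'Admin'),              -- legitimate edit
            (101, 10, 2, 'Spambot1'),           -- spam on a legitimate page
            (102, 10, 1, 'Admin'),              -- legitimate edit
            (103, 10, 0, '203.0.113.7'),        -- anonymous spam
            (104, 11, 3, 'Spambot2'),           -- page created by a spammer
            (105, 12, 0, 'MediaWiki default');  -- system edit made as UID 0
    """)

    # Step 3: delete every account except the one valid user.
    db.execute("DELETE FROM user WHERE user_id != ?", (valid_user_id,))

    # Step 5: delete revisions by registered users that no longer exist ...
    db.execute("DELETE FROM revision WHERE rev_user > 0 "
               "AND rev_user NOT IN (SELECT user_id FROM user)")
    # ... and anonymous revisions whose author name starts with a digit 1-9.
    db.execute("DELETE FROM revision WHERE rev_user = 0 "
               "AND rev_user_text BETWEEN '1' AND '999'")

    # Steps 6-7: reset page_latest, then point it at the newest surviving
    # revision (0 if the page has none left).
    db.execute("UPDATE page SET page_latest = 0")
    db.execute("UPDATE page SET page_latest = COALESCE("
               "(SELECT MAX(rev_id) FROM revision WHERE rev_page = page_id), 0)")

    # Step 8: delete pages that have no surviving revisions.
    db.execute("DELETE FROM page WHERE page_latest = 0")

    rows = db.execute(
        "SELECT page_id, page_latest FROM page ORDER BY page_id").fetchall()
    db.close()
    return rows


remaining = rehearse_cleanup(valid_user_id=1)
```

In this toy data set, the spammer-created page 11 disappears, page 10 falls back to the admin's last edit, and the system-made page 12 is untouched, because 'MediaWiki default' sorts after '999' in a string comparison.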

Once that's all done, check that everything looks good, and if so, turn off read-only mode, hopefully after installing some anti-spam features to keep the problem from recurring.
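One way to check that everything looks good is to search for the kind of inconsistency that step 6 warns about. The following is a minimal sketch, assuming Python's built-in sqlite3 module and a toy two-table schema; on a real wiki you would run the equivalent SELECT statements in your MySQL client instead. The helper names `find_dangling_pages` and `find_orphaned_revisions` and the sample rows are invented for illustration.

```python
import sqlite3


def find_dangling_pages(db):
    """Return IDs of pages whose page_latest points to a missing revision."""
    rows = db.execute(
        "SELECT page_id FROM page "
        "WHERE page_latest NOT IN (SELECT rev_id FROM revision) "
        "ORDER BY page_id"
    ).fetchall()
    return [page_id for (page_id,) in rows]


def find_orphaned_revisions(db):
    """Return IDs of revisions that belong to a page that no longer exists."""
    rows = db.execute(
        "SELECT rev_id FROM revision "
        "WHERE rev_page NOT IN (SELECT page_id FROM page) "
        "ORDER BY rev_id"
    ).fetchall()
    return [rev_id for (rev_id,) in rows]


# Toy data with one problem of each kind.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE page (page_id INTEGER PRIMARY KEY,
                       page_latest INTEGER NOT NULL);
    CREATE TABLE revision (rev_id INTEGER PRIMARY KEY,
                           rev_page INTEGER NOT NULL);
    INSERT INTO page VALUES (10, 102), (11, 104);  -- revision 104 is missing
    INSERT INTO revision VALUES (100, 10), (102, 10), (200, 99);  -- page 99 is gone
""")

dangling = find_dangling_pages(db)
orphaned = find_orphaned_revisions(db)
```

Both lists should come back empty on a wiki that has been cleaned up consistently.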

For small wikis, I highly recommend the QuestyCaptcha extension, which allows you to configure a simple custom text-based CAPTCHA. The trick is that, with every wiki having its own set of questions, programming a spambot to answer them correctly would be a lot of work for very little gain. I installed it on my own wiki after getting hit by XRumer a couple of times, and have seen no spam ever since.

P.S. I have used these instructions to nuke about 35,000 spam revisions created by equally many users from a small wiki. Everything went fine. In this particular case, the wiki (fortunately!) did not allow anonymous editing, and almost all of the legitimate users were created before the spammers found the wiki, so I could fairly easily first delete all the spam accounts, and then all the revisions they'd created. (I did accidentally delete one legitimate account at first, so I had to restore from backup and redo the process more carefully.) I've updated the instructions above to better reflect what I actually ended up doing, and to be a bit more generic.

Solution 2

The easiest way to handle this situation (if you don't mind a nuke'n'pave) would be to export all wiki pages created or edited by your username, reinstall the wiki, and import the export file you'd generated.

"Reinstall" in this context would mean:

  1. Export articles created by you (presumably logged in as the WikiSysop user or similar)
  2. Drop the MW database
  3. Create an empty MW database
  4. Copy your LocalSettings.php file to a safe location
  5. Re-upload the /config/ directory
  6. Run the installation process on the new MW database (note that you will want to re-create your old admin user)
  7. Delete the /config/ directory and move your old LocalSettings.php file back to the MW root
  8. Import the file created at Step #1

Edit: You may want to pull down a database backup (including spam revisions) in case you encounter any problems with this process or would like to experiment with alternate ways to purge the spam.

Solution 3

In theory, you could write a MediaWiki extension to do whatever you like to a MediaWiki instance, including the things you mentioned.

Short of that, and short of the "nuke'n'pave" suggested by danlefree, you might find the User Merge and Delete extension useful: you can use it to consolidate multiple spambot accounts into a single account whose edits can then be addressed more easily.

Solution 4

The easiest way to handle this situation is to install the DeleteBatch extension. Use Special:AllPages on your wiki to get a file of the page names you want deleted, and load it into Special:DeleteBatch.

Solution 5

I strongly recommend against messing with MediaWiki's SQL! MediaWiki is a complex beast, heavily optimized for Wikipedia. There are some weird things going on in the SQL, and if you simply DELETE rows, things might lose consistency.

If you have some programming skills, go through the API. Pywikibot is a good choice.
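As a rough idea of what going through the API looks like, here is a minimal sketch that only builds the URL of a request listing one user's contributions via the `list=usercontribs` query module. It does not use Pywikibot and does not send anything. The endpoint https://wiki.example.org/api.php, the user name and the helper name `build_usercontribs_url` are placeholders, not values from the question.

```python
import urllib.parse


def build_usercontribs_url(api_url, user, limit=50):
    """Build a MediaWiki API URL that lists the contributions of `user`."""
    params = {
        "action": "query",
        "list": "usercontribs",
        "ucuser": user,           # account whose edits should be listed
        "uclimit": str(limit),    # maximum number of results per request
        "ucprop": "ids|title",    # return revision/page IDs and page titles
        "format": "json",
    }
    return api_url + "?" + urllib.parse.urlencode(params)


# Placeholder endpoint and user name, for illustration only.
url = build_usercontribs_url("https://wiki.example.org/api.php", "Spambot1")
```

Fetching that URL should return the user's edits as JSON, which a script could then review before rolling back or deleting anything. A library such as Pywikibot wraps this kind of request for you.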

Otherwise, check the tools in the maintenance/ directory. You could try my own tool, mewsh, to help with that (and I just added "anti-spam tools" as a todo there).

Author: Andrew Bolster

Updated on September 17, 2022

Comments

  • Andrew Bolster almost 2 years

    Basically my 'private' MediaWiki instance was about as secure as a toddler's piggy bank. I've tightened it up now, but am left with about a hundred or so new pages and revisions generated by hundreds of randomly generated users.

    Two-part question: Is there a way to delete all orphaned pages? Can I roll back all revisions NOT made by a particular user (me)?

  • joosthoek over 10 years
    There's no point in trying to delete spam links from externallinks, since that's a redundant metadata table that's basically only used for things like Special:LinkSearch; once you've cleaned up the actual pages, you can just run rebuildall.php to wipe and rebuild it. Ditto for searchindex.
  • Ant6n almost 8 years
    This question is a couple of years old, but it still seems to have worked nicely on a small wiki that had accumulated 100,000 spam bots. Have things changed since then? Are there maybe additional steps?
  • Peter Krauss over 7 years
    Any news here? Are these still the "best practices" and "best tools" nowadays?
  • Jamie Hutber over 5 years
    rebuildall.php isn't in maintenance :O Otherwise thank you
  • Honeybear about 3 years
    Thanks a lot. I've followed all the steps, adjusting for the updated actor table (revisions are now stored with an actor ID instead of a user ID), and everything worked. However, my wiki statistics (Special:Statistics) are still bloated with all the spam pages. Is there a way to update these as well?
  • joosthoek about 3 years
    @Honeybear: I haven't tried it, but based on a quick google search, maybe running initSiteStats.php would help.
  • Honeybear about 3 years
    Thank you for the hint, php initSiteStats.php --update did the trick.
  • NightElfik about 2 years
    I had issues with the DELETE FROM revision since rev_user does not exist, but as a workaround I deleted records using timestamp: DELETE FROM revision WHERE rev_timestamp >= yyyymmddhhmmss. Also, the attachLatest.php did not work for me but the SQL did. Thanks!