How to remove thousands of URLs from Google cache?


Solution 1

Seems like you've already figured out how to request removal of a single URL, which is obviously impractical here. The second step in that process also lets you request removal of an entire directory, provided the file URLs share a predictable path prefix. (If you have thousands of PDFs, I'd hope they're at least somewhat organized.) If not, you're pretty much out of options, unfortunately.

Solution 2

I recently had a hack that added several thousand bogus pages to my site.

I submitted a corrected sitemap to Google Search Console (previously called Webmaster Tools) and configured the server to return 410 (Gone) for all of the bogus URLs, but Google still had most of them indexed.
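In case it helps anyone: here's a minimal sketch of how those 410s could be set up on an Apache/WordPress site, assuming the bogus pages all lived under one path. The /bogus-pages/ prefix is just a placeholder for whatever pattern the hack actually used:

    # .htaccess (Apache, mod_alias): answer the removed spam URLs with 410 Gone
    # so Google treats them as permanently gone rather than temporarily missing.
    RedirectMatch 410 ^/bogus-pages/.*$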

I used the WebMaster Tools - Bulk URL removal Chrome extension to submit the URLs for removal automatically. It is basically a script that takes a list of URLs and submits them for you, one at a time. It will take hours to submit them all, but at least you won't have to do it yourself. Here's an article on how to use it.

You can get a list of the URLs that Google is indexing by downloading the data directly from Search Console. Go to Status > Index Coverage, select the valid results, and scroll down. You will see that Google has indexed a ton of URLs that are not in your sitemap. You can download the first 1,000 results. There is apparently a roundabout way to get all of them, not just the first thousand, but it involves API calls from Excel. I just waited a few days between each thousand as they slowly fell out of the index.

[Screenshot: Google Search Console Index Coverage report]
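If you'd rather skip the Excel route, a similar list can be pulled programmatically. Below is a rough Python sketch using the Search Console (Search Analytics) API; note it reports pages that received search impressions, which is only an approximation of the Index Coverage list, and it assumes you have already authorized the API and installed google-api-python-client. The site URL and token file are placeholders:

    # Sketch: page through the Search Console Search Analytics report to collect URLs.
    from googleapiclient.discovery import build
    from google.oauth2.credentials import Credentials

    SITE = "https://www.example.com/"          # placeholder: your verified property
    creds = Credentials.from_authorized_user_file("token.json")  # however you normally authorize
    service = build("searchconsole", "v1", credentials=creds)

    urls, start = [], 0
    while True:
        resp = service.searchanalytics().query(
            siteUrl=SITE,
            body={
                "startDate": "2022-06-01",
                "endDate": "2022-09-01",
                "dimensions": ["page"],
                "rowLimit": 1000,              # page through in blocks of 1000
                "startRow": start,
            },
        ).execute()
        rows = resp.get("rows", [])
        urls.extend(row["keys"][0] for row in rows)
        if len(rows) < 1000:
            break
        start += 1000

    with open("indexed_urls.txt", "w") as f:
        f.write("\n".join(urls))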

Another route is to have a WP plugin generate a sitemap, then filter it down to just the PDFs or whatever you are targeting. You'll probably have to do a bit of manual copy/paste/delete here. Just to be safe, I slowly scrolled through my list of about 2,700 spam URLs and deleted the legitimate URLs. It only took about 20 minutes.
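That filtering step is also easy to script instead of doing the copy/paste by hand. A rough Python sketch, assuming a standard single sitemap.xml (if your plugin produces a sitemap index, run it once per child sitemap); the sitemap URL is a placeholder:

    # Sketch: pull every .pdf URL out of a sitemap and save them as a removal list.
    import urllib.request
    import xml.etree.ElementTree as ET

    SITEMAP_URL = "https://www.example.com/sitemap.xml"   # placeholder
    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    with urllib.request.urlopen(SITEMAP_URL) as resp:
        root = ET.fromstring(resp.read())

    pdf_urls = [
        loc.text.strip()
        for loc in root.findall(".//sm:url/sm:loc", NS)
        if loc.text and loc.text.strip().lower().endswith(".pdf")
    ]

    with open("urls_to_remove.txt", "w") as f:
        f.write("\n".join(pdf_urls))

    print(f"Found {len(pdf_urls)} PDF URLs")

The resulting one-URL-per-line text file is the sort of list the bulk removal extension takes.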

If you aren't trying to permanently nuke something like spam, and are instead trying to keep premium resources out of search, you should use other methods to prevent those resources from being indexed, such as a robots.txt file or a noindex header. But if it turns out Google didn't listen, or you dropped the ball, at least now you can fix the issue and get them removed from the index in only a few days.
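For PDFs in particular, note that a robots.txt Disallow only stops crawling; the more reliable way to keep files out of the index is a noindex header. A minimal sketch of both, assuming Apache with mod_headers enabled and a placeholder /private-docs/ path:

    # robots.txt — stop crawlers from fetching a hypothetical /private-docs/ directory
    User-agent: *
    Disallow: /private-docs/

    # .htaccess — tell search engines not to index or cache any PDF they do fetch
    <FilesMatch "\.pdf$">
        Header set X-Robots-Tag "noindex, noarchive"
    </FilesMatch>

Don't combine the two for the same URLs, though: if robots.txt blocks crawling, Google never fetches the file and so never sees the noindex header.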

In my particular circumstance, I'm wondering why Google doesn't have a time machine button, or undo, or reset. The idea is that I can tell Google the site was hacked a few days ago, but we've repaired it, therefore undo the last x number of days of crawling and indexing. But that would be too easy.

Solution 3

If the files "shouldn't be public" then they should be on the public internet. You can remove the files from Google listings (via robots.txt and other methods), but if the files are still there then anyone can still download them.

You should keep them behind some kind of authentication. For example, move the files out of the public web directory and serve them from a script that checks if the user is valid first.
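A minimal sketch of that kind of gatekeeper script in Python/Flask, assuming the PDFs have been moved to a directory outside the web root and your app already tracks logged-in users in the session; every name here is a placeholder:

    # Sketch: serve PDFs from outside the web root, but only to authenticated users.
    from pathlib import Path
    from flask import Flask, abort, send_from_directory, session

    app = Flask(__name__)
    app.secret_key = "change-me"              # placeholder; load a real secret from config
    PDF_DIR = Path("/srv/protected-pdfs")     # outside the public document root

    @app.route("/files/<path:filename>")
    def serve_pdf(filename):
        if not session.get("user_id"):        # however your app marks a valid login
            abort(403)
        # send_from_directory rejects paths that escape PDF_DIR, so "../" tricks 404
        return send_from_directory(PDF_DIR, filename)

    if __name__ == "__main__":
        app.run()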




Comments

  • Admin over 1 year

    Google has cached thousands of PDFs from my website which shouldn't be public. I have updated my headers, but I need to remove the existing Quick View cache.

    The Google webmaster tool allows me to remove them one by one - however, this clearly isn't practical given the quantity of files to be removed.

    Does anyone know how I can batch remove PDFs from the Google cache? Ideally I'd like a way to remove everything that matches "site:mysite.com *.pdf".