How to properly (dis)allow the archive.org bot? Did things change, if so when?

web-crawlers robots.txt internet-archive

19,970

Solution 1

Update: As @KevinFegan notes in the comments, their documentation changed. The below part describes how the Internet Archive handled it in the past (at least in 2014).

Their FAQ How can I have my site's pages excluded from the Wayback Machine? refers to Removing Documents From the Wayback Machine, which documents that their bot is called ia_archiver.

So this record should allow their bot to crawl your entire site:

User-agent: ia_archiver
Disallow:

Solution 2

There are really 2 issues here:

Will the robots.txt on your site Disallow (block) Wayback from crawling your site.
Will Wayback crawl your site.

For point #1:
As others have said, the correct entry for robots.txt is:

User-agent: ia_archiver
Disallow:

Keep in mind that it might take a while (perhaps a good long while), for Wayback to notice any changes you have made to robots.txt.

To check if the robots.txt on your site will allow Wayback to crawl your site:

Go to this URL: https://archive.org/web/
In the box at the TOP of the page, enter the URL of a page on your site, and click the "Browse History" button.
Or, in the box under "Save Page Now" (currently near the bottom on the right), and enter the URL of a page on your site, and click the "Save Page" button.

At this point, you should see 1 of 3 things:

You will see an error message indicating that Wayback can't access pages on that site due to "robots.txt".
You will see the "calendar" of historical save points for the page on your site. In this case, you know that Wayback is NOT blocked from crawling your site.
Or, you will see a message indicating that Wayback doesn't have an archive of that page, and an offer to click a link to add the page to Wayback. In this case also, you know that Wayback is NOT blocked from crawling your site.

Now, for point #2:

Will Wayback crawl your site?

Just because you Allow Wayback to crawl your site, doesn't mean that they (ever) will crawl your site.

According to the Wayback FAQ (emphasis added):

How can I get my site included in the Wayback Machine?

Much of our archived web data comes from our own crawls or from Alexa Internet's crawls. Neither organization has a "crawl my site now!" submission process. Internet Archive's crawls tend to find sites that are well linked from other sites. The best way to ensure that we find your web site is to make sure it is included in online directories and that similar/related sites link to you.

Alexa Internet uses its own methods to discover sites to crawl. It may be helpful to install the free Alexa toolbar and visit the site you want crawled to make sure they know about it.

Regardless of who is crawling the site, you should ensure that your site's 'robots.txt' rules and in-page META robots directives do not tell crawlers to avoid your site.

Update: 09-May-2017

Others have left comments/answers indicating that Archive.org no longer honors robots.txt. Perhaps this is a "work-in-progress" and it will eventually be the case, but I have not seen this new behavior yet.

The case for this seems to come from this article: Robots.txt: ROBOTS.TXT IS A SUICIDE NOTE by archiveteam.org. While that page has little if anything good to say about "Robots.txt", it doesn't mention anywhere that Archive.org will no longer honor robots.txt.

Also of note: that article is hosted on archiveteam.org, which is most definitely not archive.org, and I'm not sure there is any (official) relationship between archive.org and archiveteam.org.

In fact, this page on archive.org about Archive Team, seems to declare a distinction between archive.org and archiveteam.org (emphasis added):

Formed in 2009, the Archive Team (not to be confused with the archive.org Archive-It Team) is a rogue archivist collective dedicated to saving copies of rapidly dying or deleted websites for the sake of history and digital heritage. ...

In any case, I decided to give this a try, and I found that, at least at this time, Archive.org STILL honors robots.txt:

I found a random item on eBay: Item #: 131795294232
Click to view the sold items:

The "Items sold" page opens: http://offer.ebay.com/ws/eBayISAPI.dll?ViewBidsLogin&item=131795294232 Copy the link to the clipboard.
Goto web.archive.org, and paste the link from eBay.
You will see that archive.org indicates that the "Page cannot be displayed due to robots.txt."

So, at this time, I remain unconvinced, but I would love to be proven wrong... it would be great if it were true.

Solution 3

I tried the robots.txt method and it didn't work. So I contacted the website on their email [email protected]:

Hello,

Can you please remove my personal website dimitarnestorov.com from your archive?

Thanks!

Dimitar

And I got the following answer:

Hello,

The Internet Archive can exclude websites from the Wayback Machine (web.archive.org), but we first respectfully request that you help us verify that you are the site owner or content author of dimitarnestorov.com by doing any one of the following:

(Note: Some of these options can be in reference to the content located in prior Wayback Machine captures, and/or documentation you may have related to the specified time period.)

post your request on the current version of the site (and send us a link).

send your request from the main email contact listed on the site and show us where it can be located (if one is present).

send a request from the registrant's email (if publicly viewable on a WHOIS lookup you can link us to) or webmaster’s email listed on the site.

point us to where your personal information (name, point of contact, image of self) appears on the site in a way that identifies you as owner of the site or author of the content you wish to have excluded - in this instance, we ask to verify your identity via a scan of a valid photo ID (sensitive information such as birth date, address, or phone number can be redacted).

forward to us communication from a hosting company or registrar addressed to you as owner of the domain.

(Note: The simple mention of someone's name/username, and/or a hyperlink/redirect between sites/pages/accounts in itself is typically not sufficient to have archives excluded.)

If none of these options are available to you, please let us know in a reply to this email.

We would be grateful if you would help us preserve as much of the archive as possible. Therefore, please let us know if there are only specific URLs or directories about which you are concerned so that we may leave the rest of the archives available.

As you may know, Internet Archive is a non-profit digital library, seeking to maintain via the Wayback Machine a freely accessible historical record of the Internet. The material in the archives are not exploited by Internet Archive for commercial profit.

The Internet Archive Team

I created wayback-removal-request.html with the following content (not even valid HTML):

<p>Hello,</p>

<p>Can you remove my website from the Wayback Machine?</p>

<p>Thanks!</p>
<p>Dimitar</p>

Uploaded it and replied to their email with the URL from which the webpage was available and later I received the following reply:

Hello,

The site/URL referenced in your email below has now been submitted for exclusion from the Wayback Machine at http://www.archive.org (in regard to all past captures):

dimitarnestorov.com

Please allow up to a day for the automated portions of the process to run their course and for the changes to take effect.

The Internet Archive Team

When I checked a couple of hours later my website was removed.

Solution 4

Update 2017

Archive bot now does not care about your robots.txt.

If you really want to block it, send them a email according to this page, or block their IP address via htaccess.

Solution 5

The robots.txt ia_archiver Disallow entry (with the "/") should be fine for the need you describe (to "preserve for eternity", but not yet publicly).

I just did a quick test, commenting out the ia_archiver Disallow entry for a site that had it for at least the past 10 years. Then I looked the site up on archive.org/web, and it showed up grabs it had collected in 2007, 2008, 2009, 2011, 2012, 2013, 2014, 2015, 2016 and 2017! This means that Archive.org never strictly honored what others thought to be a "do not archive" statement during these years, it was merely not exposing the archived copies.

View more solutions

19,970

the

For moneyless fun I'm running some wikis, such as Hitchwiki, Trashwiki and Nomadwiki. Here's my personal blog.

Updated on September 18, 2022

Comments

the over 1 year
I have a website that I mostly don't want to be indexed by search engines, but I do want to preserve it for eternity on archive.org. So my robots.txt starts with this:
```
User-agent: *
Disallow: /
```
Today, according to archive.org I have to add the following in my robots.txt to allow their bots:
```
User-agent: ia_archiver
Disallow:
```
But, I already had done what they indicated a couple of years ago, at least, I added the following:
```
User-agent: archive.org_bot
Disallow:
```
Then there's another source claiming that you have to add the two above Disallows, plus another one:
```
User-agent: ia_archiver-web.archive.org 
Disallow:
```
Note that you need to put Disallow: / if you don't want the bot to archive your site.

Has there been a change with the IA bot? If so, when?

What is the recommended way? Should I just allow all three for now and hope that IA will not change their bot name again in the future?
- Admin over 9 years
  
  Updated the question. Hope it's clearer now. The tiny version: I don't want search engines bots on this site, I do want archive.org bots. But maybe I should reverse the question since that's what most people are looking for?
- Admin over 9 years
  
  Actually, if you do not use any of these in, you are allowing archive.org providing that you are not blocking with a blanket statement.
- Admin over 9 years
  
  Using just "ia_archiver" should also block "ia_archiver-web.archive.org", so the later would seem to be unnecessary (providing this bot follows the standard).
- Admin over 9 years
  
  Do you see the ia-archiver (or archive.org_bot) bot in your access logs?
- Admin about 5 years
  
  I confirm that ia_archiver is now ignored.
- Admin about 5 years
  
  Proof that WayBackMachine ignores the removal requests: archive.org/post/345681/removing-site-from-wayback-machine
- Admin about 4 years
  
  I noticed that they also stopped using a proper user-agent on requests from their bots. :-(
MrWhite over 9 years

The order of the groups should not matter. The most specific (ie. longest) user agent that matches is the one that wins. The * group only matches when no other group has matched.
unor over 9 years

@w3d: You’re right, I removed this part. Thanks for the info :)
wortwart almost 7 years

Locking out archive.org with robots.txt won't work anymore:
Kevin Fegan almost 7 years

@wortwart - That would be great if it were so (see the update I added to my answer). Do you have links to info about this?
wortwart almost 7 years

Sure: blog.archive.org/2017/04/17/… "A few months ago we stopped referring to robots.txt files on U.S. government and military web sites (...) We are now looking to do this more broadly."
unor almost 7 years

@KevinFegan: Thanks for your notice! I updated my answer to link to archived versions of the documentation that contained the name.
Gravity about 5 years

Yes. Now the Archive totally ignores the removal requests.
Gravity about 5 years

I like how they purposely make things complicated so they can get away!
John Cowan over 3 years

ia_archiver has always been Alexa. Since Alexa and the Internet Archive were created by the same person, they shared a bot for a long time. That is no longer true.