Bot Database

Historical Web Indexing

These historical web indexing bots are archiving or change-recording bots that examine web content over time. Most users will know the Wayback Machine at archive.org, which records changes to your website over time.

Each entry below lists the vendor, the bot service and a description of the crawler.

Nicecrawler

Nicecrawler

NiceCrawler crawls 330 million websites per month. Its goal is to create an image archive of the entire internet as it changes over time for historical preservation. According to the vendor, the crawler fetches a maximum of 25 pages per domain and never opens more than one page at a time.
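
If you would rather not be archived, the crawler can be disallowed in robots.txt. A minimal sketch, assuming the bot identifies itself with a user-agent token of NiceCrawler (the token is an assumption; check your access logs for the exact value):

    # Block NiceCrawler from the whole site (token assumed; verify in logs)
    User-agent: NiceCrawler
    Disallow: /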

netEstate GmbH

datenbank.de

Datenbank indexes metadata from 5.4 million German websites. If your website is not based in Germany, or if it is and you do not wish to be indexed, do not allow this crawler.
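
One way to disallow it is via robots.txt. A minimal sketch, assuming the crawler matches a user-agent token of netEstate NE Crawler (the exact token is an assumption; verify it against your access logs or netEstate's documentation):

    # Block the netEstate crawler site-wide (token assumed)
    User-agent: netEstate NE Crawler
    Disallow: /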

Common Crawl

Common Crawl

Common Crawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone. The data is stored on AWS, with academic storage links to historical web pages used for research, data labelling and a variety of other big data projects. Respects robots.txt.
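
Because the crawler respects robots.txt, standard directives are enough to opt out or to exclude parts of a site. Common Crawl's crawler is generally reported to identify itself as CCBot; a sketch:

    # Keep CCBot out of a private area, or use "Disallow: /" to opt out entirely
    User-agent: CCBot
    Disallow: /private/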

British Library

Legal Deposit

If you're a publisher, you need to give a copy of every UK publication you make to the British Library. Five other major UK libraries may also ask you to give them a copy. This system is called legal deposit and it's been a part of English law since 1662. Legal deposit has many benefits for publishers and authors. Your deposited publications can be read inside the British Library and will be preserved for future generations. Your works become part of the nation’s heritage, providing inspiration for new books and other publications. This bot crawls sites collecting data to be stored as part of the repository.

Bibliothèque nationale de France

BnF

“The mission of the BnF is to collect, catalog, conserve, enrich and communicate the national documentary heritage. The BnF ensures access for the greatest number of people to the collections, on site and remotely, and develops national and international cooperation.”

If you are seeing this bot, it is because some of your site content is being collected by the National Library of France (BnF). To do this, the BnF operates a crawler that scrapes content. The crawler applies long delays between two requests so as not to interfere with the operation of your web servers. This bot disregards robots.txt: in order to accomplish its legal deposit mission, the BnF may choose to capture some of the files covered by robots.txt when these are necessary to reconstitute the editorial form of the site (in particular image files or style sheets).

Interactive web pages use the JavaScript language, which builds links and triggers actions on events (page loading, navigation in a menu, mouse click or scroll, etc.). Because it cannot accurately interpret all JavaScript code, Heritrix can generate false URLs; this behavior is not considered an error in the functionality of the robot (https://github.com/internetarchive/heritrix3/wiki/crawling%20JavaScript). The BnF does its utmost to avoid generating these false URLs by placing numerous filters in its collection profiles and concentrating on the relevant URLs.

Arquivo.pt

Arquivo.pt

Crawler for Arquivo.pt, the Portuguese national web archive.

Archive.org

Wayback Machine

Way back in 1996, the Internet Archive launched its nonprofit digital library that preserves web data and makes it available for research purposes through the Wayback Machine. The Internet Archive partners with universities, libraries and others to preserve the world's cultural heritage. The Wayback Machine is a service that allows people to visit archived versions of websites: visitors can type in a URL, select a date range, and then begin surfing an archived version of the web. The crawler respects robots.txt, and a robots.txt block will also automatically delete the associated archives.
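
A sketch of such a block, assuming the Internet Archive honors rules addressed to the ia_archiver user-agent token (the token is an assumption based on the archive's historical behaviour; check current Internet Archive documentation, and note that a full block also removes existing snapshots from the Wayback Machine):

    # Exclude the site from the Internet Archive (token assumed; removes existing snapshots too)
    User-agent: ia_archiver
    Disallow: /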