Bot Database

Historical Web Indexing

These historical web indexing bots are archiving or change-recording bots that examine web content over time. Most users will know the Wayback Machine at archive.org, which records changes to your website over time.

Each entry below lists the vendor, the bot service and a description of the crawler.

Nicecrawler

Nicecrawler

NiceCrawler crawls 330 million websites per month. Its goal is to create an image archive of the entire internet as it changes over time for historical preservation. According to the vendor, the crawler fetches a maximum of 25 pages per domain and never opens more than one page at a time.
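
If you would rather not be archived, the crawler can be disallowed in robots.txt. A minimal sketch, assuming the bot identifies itself with a user-agent token of NiceCrawler (the token is an assumption; check your access logs for the exact value):

    # Block NiceCrawler from the whole site (token assumed; verify in logs)
    User-agent: NiceCrawler
    Disallow: /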

netEstate GmbH

datenbank.de

Datenbank indexes metadata from 5.4 million German websites. If your website is not based in Germany, or if it is and you do not wish to be indexed, do not allow this crawler.
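
One way to disallow it is via robots.txt. A minimal sketch, assuming the crawler matches a user-agent token of netEstate NE Crawler (the exact token is an assumption; verify it against your access logs or netEstate's documentation):

    # Block the netEstate crawler site-wide (token assumed)
    User-agent: netEstate NE Crawler
    Disallow: /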

Common Crawl

Common Crawl

Common Crawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone. The data is stored on AWS, with academic storage links to historical web pages used for research, data labelling and a variety of other big data projects. Respects robots.txt.
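
Because the crawler respects robots.txt, standard directives are enough to opt out or to exclude parts of a site. Common Crawl's crawler is generally reported to identify itself as CCBot; a sketch:

    # Keep CCBot out of a private area, or use "Disallow: /" to opt out entirely
    User-agent: CCBot
    Disallow: /private/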

British Library

Legal Deposit

If you're a publisher, you need to give a copy of every UK publication you make to the British Library. Five other major UK libraries may also ask you to give them a copy. This system is called legal deposit and it's been a part of English law since 1662. Legal deposit has many benefits for publishers and authors. Your deposited publications can be read inside the British Library and will be preserved for future generations. Your works become part of the nation’s heritage, providing inspiration for new books and other publications. This bot crawls sites collecting data to be stored as part of the repository.

Bibliothèque nationale de France

BnF

“The mission of the BnF is to collect, catalog, conserve, enrich and communicate the national documentary heritage. The BnF ensures access for the greatest number of people to the collections, on site and remotely, and develops national and international cooperation.”

If you are seeing this bot, it is because some of your site content is being collected by the National Library of France (BnF). To do this, the BnF operates a crawler that scrapes content. The crawler applies long delays between two requests so as not to interfere with the operation of your web servers. This bot disregards robots.txt: in order to accomplish its legal deposit mission, the BnF may choose to capture some of the files covered by robots.txt when these are necessary to reconstitute the editorial form of the site (in particular image files or style sheets).

Interactive web pages use the JavaScript language, which builds links and triggers actions on events (page loading, navigation in a menu, mouse click or scroll, etc.). Because it cannot accurately interpret all JavaScript code, Heritrix can generate false URLs; this behavior is not considered an error in the functionality of the robot (https://github.com/internetarchive/heritrix3/wiki/crawling%20JavaScript). The BnF does its utmost to avoid generating these false URLs by placing numerous filters in its collection profiles and concentrating on the relevant URLs.

Arquivo.pt

Arquivo.pt

Crawler for Arquivo.pt, the Portuguese national web archive.

Archive.org

Wayback Machine

Way back in 1996, the Internet Archive launched its nonprofit digital library that preserves web data and makes it available for research purposes through the Wayback Machine. The Internet Archive partners with universities, libraries and others to preserve the world's cultural heritage. The Wayback Machine is a service that allows people to visit archived versions of websites: visitors can type in a URL, select a date range, and then begin surfing an archived version of the web. The crawler respects robots.txt, and a robots.txt block will also automatically delete the associated archives.
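
A sketch of such a block, assuming the Internet Archive honors rules addressed to the ia_archiver user-agent token (the token is an assumption based on the archive's historical behaviour; check current Internet Archive documentation, and note that a full block also removes existing snapshots from the Wayback Machine):

    # Exclude the site from the Internet Archive (token assumed; removes existing snapshots too)
    User-agent: ia_archiver
    Disallow: /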