Bot Database

Web Site Data Collection

Many bots target IT infrastructure to map the full tech stack and every component in use. In many cases this is harmless data: which web server, Content Distribution Network (CDN), or e-commerce platform a site runs isn’t exactly a state secret. So what, right? Legitimate commercial services such as BuiltWith package this data up, allowing sales and marketing teams to precisely target domains with the exact spec and build they have solutions for. It can be helpful to the entire supply chain: sellers get precise targeting, and buyers get solutions that, god forbid, they may actually need.

However, on the illegitimate side, you can easily see the opportunity for hackers who target known infrastructure vulnerabilities. They can launch illegal bots to quickly and easily find vulnerable versions and weak tech stacks across the web. Of course, this is just another reason to keep software up to date and to have robust version controls in place, but we all know that’s not always the reality. Bots can extract very detailed information, right down to specific releases and versions. Often these generic crawlers will hijack an existing common user agent string, pretending to be a legitimate search or media crawler.
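Because a user agent string is trivial to copy, one common way to check whether a request really comes from the crawler it claims to be is a forward-confirmed reverse DNS lookup: resolve the client IP to a hostname, check that the hostname belongs to the crawler operator's domain, then resolve the hostname back and confirm it returns the same IP. The following is a minimal sketch of that idea, not a complete bot-detection solution. The function name `is_verified_crawler`, the injectable lookup parameters, and the example suffixes are assumptions made for illustration; each operator publishes its own verification domains.

```python
import socket
from typing import Callable, List, Tuple


def _reverse_lookup(ip: str) -> str:
    """Return the PTR hostname for an IP address (performs a DNS query)."""
    return socket.gethostbyaddr(ip)[0]


def _forward_lookup(hostname: str) -> List[str]:
    """Return the IPv4 addresses a hostname resolves to (performs a DNS query)."""
    return socket.gethostbyname_ex(hostname)[2]


def is_verified_crawler(
    ip: str,
    allowed_suffixes: Tuple[str, ...],
    reverse: Callable[[str], str] = _reverse_lookup,
    forward: Callable[[str], List[str]] = _forward_lookup,
) -> bool:
    """Forward-confirmed reverse DNS check for a claimed crawler IP.

    allowed_suffixes should start with a dot (e.g. ".googlebot.com") so that
    a lookalike such as "evilgooglebot.com" does not match.
    """
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:
        # No PTR record, or the lookup failed: treat as unverified.
        return False

    if not hostname.endswith(allowed_suffixes):
        return False

    try:
        # The hostname must resolve back to the original IP, otherwise the
        # PTR record could simply have been set by the client's own network.
        return ip in forward(hostname)
    except OSError:
        return False
```

In practice you would call `is_verified_crawler(client_ip, (".googlebot.com", ".google.com"))` only for requests whose user agent claims to be that crawler, and cache the result, since each check costs two DNS queries.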

Vendor

Bot Service

Recommendation

Description

Dataprovider.com

Dataprovider site explorer

Recommended

Not recommended

ZoomInfo Powered by DiscoverOrg

Datanyze

Recommended

Not recommended

Datanyze is a worldwide leader in technographics. The company uses machine learning and proprietary methodologies to capture technologies that are used or implemented by more than 35 million companies globally. It is part of ZoomInfo, a leading B2B Growth Acceleration Platform for sales and marketing teams. Datanyze gathers information about your website technology and business in order to give its customers a more complete picture of your business to support their sales and marketing efforts. It's likely you'll be crawled if one of their customers is interested in your company or market segment.

Yandex

Yandex Webmaster Bot

Recommended

Not recommended

The Yandex.Webmaster indexing robot. This provides data for the Yandex webmaster platform.

Wappalyzer

Wappalyzer

Recommended

Not recommended

Wappalyzer is a cross-platform utility that uncovers the technologies used on websites. It detects content management systems, ecommerce platforms, web frameworks, server software, analytics tools and many others.

Spaziodati

Spaziodati

Recommended

Not recommended

A big data company that helps businesses with enterprise data and B2B lead generation solutions. Its crawler looks for publicly available data to support those product offerings.

Similartech

Similartech

Recommended

Not recommended

Similartech crawls your site to add it to its database of websites and the technologies used to build them. The company claims to scan more than 30 billion web pages per month and to monitor and analyze over 317 million domains.

SafeDNS

Categorization Crawler

Recommended

Not recommended

SafeDNS offers a wide selection of secure, fast and reliable solutions for content and web filtering. SafeDNS states that its main reason for collecting web pages is to correctly categorize Internet resources and to develop new technologies and products.

Nominet

Nominet .UK Domain Registry

Recommended

Not recommended

If you use a .UK domain you may wish to allow this bot from Nominet, who are one of the largest .UK domain registries. This bot regularly collects data on whether .UK domain names resolve, where they are hosted, whether they are used for email and whether a website is in place. As part of this, they collect information about the landing page and About Us or Contact Us pages of your website so that they can categorize the website type (e.g. blog, parking page etc.). They may also perform additional checks, such as whether you have an SSL certificate and whether there is a matching domain name in a different top level domain (e.g. .com), and collect similar information about those domains in order to see how they differ from the .UK domain name. In addition, they collect information on which content management systems (CMS) websites are using, along with version numbers, in an attempt to identify security vulnerabilities. This bot should be well behaved; for example, they restrict the number of times they visit websites which use the same IP address. Any information gathered is used to help Nominet better understand how .UK domains are used by registrants, identify security vulnerabilities and identify changes over time.

Neticle Labs

Neticle

Recommended

Not recommended

Neticle provides solutions to read and analyze website text. Neticle's crawler will make a small series of GET requests to your site to collect content for its various text analysis products, which are typically used by research and comms teams.

Net Systems Research

Net Systems Research

Recommended

Not recommended

Net Systems Research is an independent research organization focusing on a range of topics in internet security. Their crawler is used to survey and analyze real world network systems to better understand and study internet security problems. The crawler will make a small number of requests to your site, spread over a few minutes.

Netcraft

Netcraft

Recommended

Not recommended

Netcraft has explored the internet since 1995 and is a respected authority on the market share of web servers, operating systems, hosting providers, ISPs, encrypted transactions, electronic commerce, scripting languages and content technologies on the internet.

hyScore.io

hyScore.io

Recommended

Not recommended

hyScore.io offers a platform that allows companies to understand and structure text, website, document, picture, audio and video data based on content. Businesses use hyScore.io to fetch and analyse website content with a crawler that behaves similarly to many search engine crawlers. Pages are only ever visited on demand, so if the hyScore.io crawler has visited your site, someone (in your company or external) requested content analysis and insights for a page where the hyScore.io information was either not yet available or needed to be refreshed. For this reason, you will often see a request from the hyScore.io crawler shortly after a user has visited a page. They state that the crawler is engineered to be as friendly as possible, for example by limiting request rates to any specific site and automatically backing off if a site is down, slow, or repeatedly returning non-200 (OK) responses. The solution is used by a number of third-party platforms, such as Data Management Platforms (DMP), Demand Side Platforms (DSP) and many others. These are in turn often used by other third-party systems (ad server, DMP, brand safety, ad fraud…) as part of the customers’ strategy (agencies, brands, publishers, etc.). Check with your team whether they use hyScore.io, or a platform which uses it, before deciding to allow this bot.

Headline

Headline

Recommended

Not recommended

Headline is a VC investment company based in San Francisco. Their crawler is designed to trawl websites for publicly available information such as home page content, job postings, team pages and location references. They do this to discover interesting companies.

Google

GoogleOther

Recommended

Not recommended

GoogleOther is a new crawler from Google. It's a generic crawler that may be used by Google's internal product teams for fetching publicly accessible content from sites. For example, it may be used for one-off crawls for internal research and development. The GoogleOther crawler always obeys robots.txt rules.

GetProxi.es

GetProxi.es

Recommended

Not recommended

A Spanish site which uses a bot to check for proxy sites that are active and working.

Censys

Internet Measurement

Recommended

Not recommended

Internet Measurement is operated by Driftnet.io. The purpose of this crawler is to measure services that network owners and operators have publicly exposed. If you don't want this third party crawling your site, do not allow this crawler.

BuiltWith

BuiltWith

Recommended

Not recommended

BuiltWith allows anyone to interrogate a website to find out what technologies are in use on it. Users can access this both from the BuiltWith website and via browser plugins, with a freemium offering and paid-for tiers too. If you see this bot, someone in your team, a partner or a potential partner may be checking out your web tech stack!

Babbar

Babbar

Recommended

Not recommended

Babbar crawls the web in order to measure it, calculating helpful metrics (popularity, trust, categorization) along the way. Its goal is to allow its users to estimate the trust, popularity and topic of any website. Babbar helps find the best media in which to place your ads or links.

1&1

IONOS

Recommended

Not recommended

IONOS Crawler is the web crawler of IONOS. Its job is to constantly crawl the web in order to gather information to allow 1&1 to improve their hosting offering. If you are happy for 1&1 to gather information on your site then please allow this bot.
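For crawlers that honour robots.txt (the GoogleOther entry above states that it always does), a disallow rule is the simplest way to opt out. The sketch below uses Python's standard `urllib.robotparser` to check how a given robots.txt would be interpreted before you deploy it. The robots.txt content, the `is_allowed` helper and the `ExampleBot` name are illustrative assumptions, and a robots.txt rule offers no protection against bots that ignore it or spoof another crawler's user agent.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: block GoogleOther everywhere, allow everyone else.
ROBOTS_TXT = """\
User-agent: GoogleOther
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GoogleOther", "ExampleBot"):
        verdict = is_allowed(ROBOTS_TXT, agent, "https://example.com/pricing")
        print(f"{agent}: {'allowed' if verdict else 'blocked'}")
```

The same helper can be run against your live robots.txt for each bot in the table whose operator documents a user agent token, which makes it easy to confirm that a rule blocks only the crawler you intended.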