What are Web crawlers?
We often discuss bad, malicious BOTs, which currently pose a serious threat to everyone who uses the Internet. This time we will focus on good, useful BOTs that aim to make internet users’ lives easier: Web crawlers, also known as spiders, spiderbots, crawlers or web wanderers. But what do these automated programs do?
Web crawlers are programs that browse the web in an automated manner in order to catalog its content. They were created primarily to support search engines (such as Google, Bing or Yahoo), which use crawlers to index websites so that users can quickly find the pages they are interested in. Without crawlers, a search engine’s index would be out of date: it wouldn’t know about changes made to a given website or about new content that has appeared on it. The main functions of spiders are checking the code and examining the content of websites, monitoring updates, creating page mirrors, and sometimes collecting additional information about a given website. It is their hard work that allows your pages to appear and be ranked in search engine results.
What are the most famous Web crawlers?
There are hundreds of more or less well-known web spiders. Some of them are regulars on almost every website. The most popular internet BOTs are:
- Googlebot – the BOTs used by Google can be further divided into “Google’s fresh crawl”, which visits websites frequently and regularly to check what has changed on them, and “Google’s deep crawl”, whose task is to download more data from websites.
- Bingbot – the Bing search engine robot implemented by Microsoft in 2010.
- Slurp Bot, a newer version of Yahoo Slurp – a spider used by the Yahoo search engine. Examination of the page by this BOT allows the site to appear in Yahoo Mobile search results. Additionally, Slurp checks and collects content from partner sites for inclusion in Yahoo News, Yahoo Finance, and Yahoo Sports.
- DuckDuckBot – the web robot created for the DuckDuckGo search engine, which is known for protecting privacy and is becoming more and more popular. It currently handles over 12 million queries per day.
- Baiduspider – the official name of the Web crawler of Baidu, a leading Chinese search engine with an 80% market share in mainland China.
- YandexBot – the web robot of Yandex, the largest Russian search engine.
- Sogou Spider – the BOT working for Sogou.com, a Chinese search engine that was launched in 2004.
- Exabot – the web robot of the French Exalead search engine. It was created in 2000 and currently has over 16 billion indexed pages.
- Facebot – a Facebook BOT that temporarily displays some images or details related to internet content that Facebook users want to share, such as the title of the page or the video embed tag. Facebot retrieves this information only after the user provides the link. Its job is also to improve ad performance.
- ia_archiver – the web crawler of Amazon’s Alexa.
Of course, apart from those working for internet giants, there are also other types of Web crawlers. For example, Xenon is a web robot used by the government tax authorities of the Netherlands, Austria, Canada, Denmark, the UK and Sweden to search for websites (online stores, gambling sites or porn sites) whose owners are evading taxes. WebCrawler, in turn, was used to build the first publicly available index of a subset of the web.
How does a Web crawler work?
Web crawlers either receive or create lists of URLs to visit, known as seeds. When the BOT visits these URLs, it looks for all the hyperlinks on each page and adds them to a list of URLs to visit, known as the crawl frontier. If archiving pages is also among its tasks, it copies and saves the data on a regular basis. The archive, called the repository, contains the latest version of each page downloaded by the crawler. Of course, the sheer number of pages means that the crawler can retrieve only a limited amount of information in a given period of time, so it must prioritize its visits.
What does a crawler do on a site?
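The seed, frontier and repository idea can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular search engine works: the FAKE_WEB dictionary, the LinkExtractor class and the crawl function are hypothetical names invented for this example, and the pages live in memory so that the sketch runs without network access. The frontier here is a plain first-in, first-out queue; a real crawler would more likely use some kind of priority ordering.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web" so the sketch runs without any network access.
FAKE_WEB = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "https://example.com/b": '<a href="/">home</a>',
    "https://example.com/c": "no links here",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Visit pages starting from the seeds; return a {url: html} repository."""
    frontier = deque(seeds)   # URLs waiting to be visited (the crawl frontier)
    seen = set(seeds)         # URLs already queued, to avoid visiting twice
    repository = {}           # latest downloaded copy of each page
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:      # page could not be retrieved, skip it
            continue
        repository[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return repository


repository = crawl(["https://example.com/"], FAKE_WEB.get)
print(list(repository))
```

The max_pages limit stands in for the fact that a crawler can only retrieve a limited amount of data in a given time; swapping the deque for a priority queue would be one way to express a hierarchy of visits.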
Once the spider has selected the page it wants to visit, it identifies itself to the web server with its own unique identifier in the User-Agent header of the HTTP request. In most cases, crawler traffic can be checked in the access logs of the web server. Then, the BOT collects information about the page, such as its content, links, broken links and sitemaps, and verifies its HTML code, according to the rules set by the owner of the page.
Can you keep crawlers from visiting a given page?
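As a rough illustration of how a site owner might recognize crawlers by their User-Agent, the sketch below matches a User-Agent string against a small table of substrings. The KNOWN_CRAWLERS table and the identify_crawler function are hypothetical and made up for this example; the substrings are tokens commonly seen in these crawlers' User-Agent strings, but they are assumptions here and should be checked against each operator's documentation. Note that a User-Agent can be spoofed, so a match is only a hint, not proof.

```python
# Hypothetical lookup table: lowercase substring -> crawler name.
# The tokens are assumptions based on commonly seen User-Agent strings.
KNOWN_CRAWLERS = {
    "googlebot": "Googlebot",
    "bingbot": "Bingbot",
    "slurp": "Slurp Bot",
    "duckduckbot": "DuckDuckBot",
    "baiduspider": "Baiduspider",
    "yandexbot": "YandexBot",
    "sogou": "Sogou Spider",
    "exabot": "Exabot",
    "facebot": "Facebot",
    "ia_archiver": "ia_archiver",
}


def identify_crawler(user_agent):
    """Return the crawler name if a known token appears in the header, else None."""
    ua = user_agent.lower()
    for token, name in KNOWN_CRAWLERS.items():
        if token in ua:
            return name
    return None


# Example User-Agent values, as they might appear in a server access log.
samples = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]

for ua in samples:
    print(identify_crawler(ua))
```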
Sometimes indexing robots poll a site so constantly that its pages become slow to load. To avoid such a situation, the administrator of a given site can define the rules that Internet BOTs are expected to follow on the site, using the robots.txt file. This solution was proposed in February 1994 by Martijn Koster, then working for Nexor, on the www-talk mailing list, and it quickly became a de facto standard that current and future web robots were expected to follow. On July 1, 2019, Google announced that it had proposed the Robots Exclusion Protocol (robots.txt) as an official standard within the Internet Engineering Task Force; the proposal was later published as RFC 9309 in 2022.
By placing an appropriately prepared robots.txt file in the root of your website hierarchy (e.g. https://www.example.com/robots.txt), you can define rules for crawlers, such as which URLs they may visit and which they may not, or which resources they are allowed to index and which are blocked. Well-behaved BOTs follow the rules set out in this file: they should download the file and read the instructions before downloading any other data from the website. Compliance is voluntary, however, and the file itself does not technically block access. You can also specify whether these rules should apply to all BOTs or only to those with a specific User-Agent. If the robots.txt file doesn’t exist, web crawlers assume that the site owner doesn’t want to impose any restrictions on them.
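To make this more concrete, here is a small sketch that uses Python's standard urllib.robotparser module to check URLs against a set of rules. The robots.txt content, the /private/ and /tmp/ paths and the "SomeOtherBot" name are invented for illustration; a real crawler would download the file from the site (for example with the parser's set_url and read methods) instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: one group of rules for a specific
# User-Agent and one group for every other BOT.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

base = "https://www.example.com"

# The Googlebot group applies to Googlebot, so only /private/ is off limits.
print(rules.can_fetch("Googlebot", base + "/private/report.html"))
print(rules.can_fetch("Googlebot", base + "/tmp/cache.html"))

# Any other BOT falls back to the "*" group, so only /tmp/ is off limits.
print(rules.can_fetch("SomeOtherBot", base + "/private/report.html"))
print(rules.can_fetch("SomeOtherBot", base + "/tmp/cache.html"))

# A file with no rules imposes no restrictions at all.
no_rules = RobotFileParser()
no_rules.parse([])
print(no_rules.can_fetch("SomeOtherBot", base + "/anything.html"))
```

A polite crawler would run a check like can_fetch before each request and skip any URL for which it returns False.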