What are Web Robots and robots.txt?

Web Robots (also known as Web Wanderers, Crawlers, or Spiders) are programs that traverse the Web automatically. Search engines such as Google and Bing use them to index web content, spammers use them to scan for email addresses, and they have many other uses.

Web site owners use the robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.

It works like this: a robot wants to visit a Web site URL, say http://example.blogspot.com/p/welcome.html. Before it does so, it first checks for http://example.blogspot.com/robots.txt, and finds:

User-agent: *
Disallow: /

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.

There are two important considerations when using robots.txt:

  • Robots can ignore your robots.txt. In particular, malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers, will pay no attention.
  • The robots.txt file is a publicly available file. Anyone can see what sections of your server you don't want robots to use.

The details

The robots.txt format is a de facto standard and is not owned by any standards body; it is documented by two historical descriptions and a number of external resources.

How to create a robots.txt file for Blogger

  1. Go to your Blogger blog.
  2. Navigate to Settings >> Search Preferences >> Crawlers and indexing >> Custom robots.txt >> Edit >> Yes.
  3. Paste your robots.txt code in the box.
  4. Click the Save Changes button.
  5. You are done! You can verify the result with the quick check below.
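Once saved, you can confirm that the file was published by fetching it straight back from your blog. This is a rough sketch assuming your blog lives at example.blogspot.com (replace it with your own address):

import urllib.request

# Hypothetical blog address; use your own blogspot domain here.
url = "http://example.blogspot.com/robots.txt"

with urllib.request.urlopen(url) as response:
    print(response.read().decode("utf-8"))  # should echo the rules you just saved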

What to put in it

The "robots.txt" file is a text file, with one or more records. Usually contains a single record looking like this:

User-agent: *
Allow: /search/
Disallow: /search
Disallow: /p/sample.html
Disallow: /search/label/somelabel
Disallow: /2018/10/somepost.html

In this example, one URL prefix is allowed and four are excluded.

Note the difference between /search/ and /search: "Disallow: /search" blocks every URL whose path starts with /search, including search results with parameters such as /search?q=keyword or /search?updated-max, while "Allow: /search/" permits URLs that contain /search/, such as /search/label. Note that you need a separate "Disallow" line for every URL prefix you want to exclude -- you cannot say "Disallow: /search/ /p/sample.html/" on a single line. Also, you may not have blank lines within a record, as blank lines are used to delimit multiple records. The '*' in the User-agent field is a special value meaning "any robot".
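To see how these prefix rules play out, here is a small sketch that feeds the record above into Python's urllib.robotparser and tests a few typical blog URLs (the host is the hypothetical example address used throughout). Keep in mind that crawlers differ in how they resolve overlapping Allow and Disallow rules; Python's parser simply checks them in file order.

import urllib.robotparser

rules = """\
User-agent: *
Allow: /search/
Disallow: /search
Disallow: /p/sample.html
Disallow: /search/label/somelabel
Disallow: /2018/10/somepost.html
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules.splitlines())

base = "http://example.blogspot.com"
for path in ("/search?q=keyword", "/search/label/recipes", "/p/sample.html", "/p/welcome.html"):
    # Prints True where the record allows crawling and False where it does not.
    print(path, "->", parser.can_fetch("*", base + path))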

Here are some other examples:

To exclude all robots from the entire server

User-agent: *
Disallow: /

To allow all robots complete access

User-agent: *
Disallow:

(or just leave the "robots.txt" input field empty)

To exclude a single robot

User-agent: BadBot
Disallow: /

To allow a single robot

User-agent: Google
Disallow:

User-agent: *
Disallow: /
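Each robot obeys only the record whose User-agent line matches it, so with this last file Google may crawl everything while every other robot is shut out. A quick sketch, again with Python's urllib.robotparser, illustrates the record selection (the robot names are just the ones from the examples above):

import urllib.robotparser

rules = """\
User-agent: Google
Disallow:

User-agent: *
Disallow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules.splitlines())

url = "http://example.blogspot.com/p/welcome.html"
print("Google:", parser.can_fetch("Google", url))  # matches its own record: allowed
print("BadBot:", parser.can_fetch("BadBot", url))  # falls under "*": disallowed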
