poweredarticle.com

Search Engine Robots or Web Crawlers

 
Most users rely on search engines to find the information they need. But how do search engines provide this information, and where do they collect it from? Most search engines maintain their own database of information. This database covers the sites available on the web and holds detailed information about the pages of each site. Search engines do this background work by using robots to collect information and maintain the database. They catalog the gathered information and then present it publicly or, at times, for private use.

In this article we will discuss the entities that roam the global internet environment, namely the web crawlers that move around in netspace. We will learn:

- What they are all about and what purpose they serve.
- Pros and cons of using these entities.
- How we can keep our pages away from crawlers.
- Differences between the common crawlers and robots.

The rest of the article is divided into the following two sections:

I. Search Engine Spider : Robots.txt.
II. Search Engine Robots : Meta-tags Explained.

I. Search Engine Spider : Robots.txt

What is the robots.txt file?

A web robot is a program, typically search engine software, that visits sites regularly and automatically. It crawls through the web's hypertext structure by fetching a document and then recursively retrieving all the documents it references. Sometimes site owners do not want all of their pages to be crawled by web robots, so they can exclude some pages from crawling. Most robots abide by the "Robots Exclusion Standard", a set of constraints that restricts robot behavior.
The "Robots Exclusion Standard" is a protocol used by the site administrator to control the movement of robots. When a search engine robot comes to a site, it looks for a file named robots.txt in the root of the domain (http://www.anydomain.com/robots.txt). This is a plain text file that implements the "Robots Exclusion Protocol" by disallowing specific files or directories. The site administrator can disallow access to cgi, temporary or private directories, either for all robots or for specific robots named by their user agent.
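
As a small illustration of where a robot looks for the file, the sketch below derives the robots.txt address from any page address on the same site. The function name robots_txt_url and the sample page address are assumptions made for this example only.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt address for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.anydomain.com/links/listing.html?page=2"))
```

Whatever page the robot starts from, the file is always requested from the root of the domain, never from a subdirectory.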

The format of the robots.txt file is very simple. Each record consists of two kinds of field: a User-agent field and one or more Disallow fields.

What is User-agent?

User-agent is the technical name for a client program in the networking environment of the World Wide Web. In the robots.txt file it names the specific search engine robot that a record applies to.
For example:

User-agent: googlebot

We can also use the wildcard character "*" to specify all robots:
User-agent: *

This means that the record applies to all robots.

What is Disallow?

The second field in the robots.txt file is Disallow. These lines tell the robots which files may be crawled and which may not. For example, to prevent robots from downloading email.htm the syntax is:

Disallow: /email.htm

To prevent crawling through a directory the syntax is:

Disallow: /cgi-bin/
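
The effect of such Disallow lines can be checked with a minimal sketch that uses Python's standard urllib.robotparser module. The robot name examplebot and the helper allowed are illustrative assumptions, and the rules are parsed from a string rather than fetched from a live site. Disallow values are written with a leading slash because they are compared against the start of the URL path.

```python
from urllib.robotparser import RobotFileParser

# Sample rules for this sketch only; no network access is needed.
rules = """\
User-agent: *
Disallow: /email.htm
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())


def allowed(path, agent="examplebot"):
    """Return True if the hypothetical robot may fetch the given path."""
    return parser.can_fetch(agent, "http://www.anydomain.com" + path)


for path in ("/email.htm", "/cgi-bin/search.cgi", "/index.htm"):
    print(path, allowed(path))
```

The single file and everything under the directory are refused, while any other page stays open to the robot.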

White Space and Comments:

A line that begins with # in the robots.txt file is treated as a comment. A comment at the top of the file, as in the following example, tells us which site the file applies to.

# robots.txt for www.anydomain.com

Entry Details for robots.txt :

1) User-agent: *
Disallow:

The asterisk (*) in the User-agent field denotes "all robots". As nothing is disallowed, all robots are free to crawl through.

2) User-agent: *
Disallow: /cgi-bin/
Disallow: /temp/
Disallow: /private/

All robots are allowed to crawl through all files except those in the cgi-bin, temp and private directories.

3) User-agent: dangerbot
Disallow: /

Dangerbot is not allowed to crawl through any of the directories. "/" stands for all directories.

4) User-agent: dangerbot
Disallow: /

User-agent: *
Disallow: /temp/

The blank line indicates the start of a new User-agent record. Dangerbot is barred from everything, while all other bots may crawl every directory except the "temp" directory.

5) User-agent: dangerbot
Disallow: /links/listing.html

User-agent: *
Disallow: /email.html

Dangerbot is not allowed to fetch the listing page of the links directory, while all other robots are allowed everywhere except the email.html page.

6) User-agent: abcbot
Disallow: /*.gif$

To exclude all files of a specific file type (e.g. .gif) we use the above robots.txt entry.

7) User-agent: abcbot
Disallow: /*?

To restrict a web crawler from crawling dynamic pages (URLs that contain a "?") we use the above robots.txt entry.

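Note: the Disallow field may contain "*" to match any series of characters and may end with "$" to indicate the end of the name. These pattern characters are an extension honored by Googlebot and some other major crawlers. They are not part of the original standard, so other robots may ignore them.

E.g. to exclude all gif files from Google's image crawler while allowing other image files: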
User-agent: Googlebot-Image
Disallow: /*.gif$
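
To make the "*" and "$" notation concrete, here is a minimal sketch that translates such a Disallow value into a regular expression and tests a path against it. The helper names pattern_to_regex and is_blocked and the sample paths are assumptions for illustration. Real crawlers may differ in detail, and Python's own urllib.robotparser does not interpret these pattern characters.

```python
import re


def pattern_to_regex(pattern):
    """Translate a Disallow value with '*' and a trailing '$' into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces, then let each '*' stand for any characters.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(pattern, path):
    """Return True if the path matches the pattern from its first character."""
    return pattern_to_regex(pattern).match(path) is not None


print(is_blocked("/*.gif$", "/images/logo.gif"))
print(is_blocked("/*.gif$", "/images/logo.gif.bak"))
print(is_blocked("/*?", "/page.php?id=3"))
```

The "$" matters in the second call: without it, any path that merely contains ".gif" would be blocked as well.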

Disadvantages of robots.txt :

Problem with Disallow field:

Disallow: /css/ /cgi-bin/ /images/
Different spiders will read the above field in different ways. Some will ignore the spaces and read /css//cgi-bin//images/, while others may consider only /images/ or /css/ and ignore the rest.

The correct syntax should be :
Disallow: /css/
Disallow: /cgi-bin/
Disallow: /images/
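
A site administrator could catch this mistake before publishing the file. The following is a minimal sketch of such a check, with the function name find_multi_path_disallows chosen for this example. It only flags Disallow lines whose value contains more than one whitespace-separated path.

```python
def find_multi_path_disallows(robots_txt):
    """Return (line_number, line) pairs for Disallow lines listing several paths."""
    problems = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments
        field, separator, value = line.partition(":")
        is_disallow = separator and field.strip().lower() == "disallow"
        if is_disallow and len(value.split()) > 1:
            problems.append((number, line))
    return problems


sample = """\
User-agent: *
Disallow: /css/ /cgi-bin/ /images/
Disallow: /temp/
"""
print(find_multi_path_disallows(sample))
```

Each flagged line should then be split into one Disallow line per path, as in the corrected syntax above.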

All Files listing:

Specifying each and every file name within a directory is a very common mistake:
Disallow: /ab/cdef.html
Disallow: /ab/ghij.html
Disallow: /ab/klmn.html
Disallow: /op/qrst.html
Disallow: /op/uvwx.html

If everything in those directories should be kept away from robots, the above can be written as:
Disallow: /ab/
Disallow: /op/

A trailing slash means a lot: it indicates that the whole directory is off limits.

Capitalization:

USER-AGENT: REDBOT
DISALLOW:

Field names are not case sensitive, but data such as directory and file names are case sensitive.
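
The difference can be observed with Python's standard urllib.robotparser module, as in the sketch below. The directory /Private/ and the page notes.html are hypothetical values added for this example. This shows how one particular parser behaves, not how every robot must behave.

```python
from urllib.robotparser import RobotFileParser

# Upper-case field names and robot name, mixed-case directory (hypothetical).
rules = """\
USER-AGENT: REDBOT
DISALLOW: /Private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The field names and the robot name are matched without regard to case.
blocked_page = parser.can_fetch("redbot", "http://www.anydomain.com/Private/notes.html")
# The path is compared exactly, so a different capitalization is not covered.
other_page = parser.can_fetch("redbot", "http://www.anydomain.com/private/notes.html")

print(blocked_page, other_page)
```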

Conflicting syntax:

User-agent: *
Disallow: /
#
User-agent: Redbot
Disallow:

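What will happen? Redbot is allowed to crawl everything, but will this permission override the Disallow field, or will Disallow override the permission? Under the standard, a robot obeys the record that names it, and the "*" record applies only to robots that match no other record, so Redbot may crawl everything. Not every robot resolves such conflicts the same way, though, so it is safer to avoid them.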
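
One way to see how a given parser settles the question is to feed it the same file. The sketch below uses Python's standard urllib.robotparser module. The name otherbot and the page address are assumptions for this example, and other crawlers may reach a different result.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /
#
User-agent: Redbot
Disallow:
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

page = "http://www.anydomain.com/index.html"
# Redbot matches its own record, whose empty Disallow permits everything.
redbot_allowed = parser.can_fetch("Redbot", page)
# Any other robot falls back to the "*" record, which blocks the whole site.
otherbot_allowed = parser.can_fetch("otherbot", page)

print(redbot_allowed, otherbot_allowed)
```

In this parser the record that names the robot takes precedence over the "*" record, regardless of the order in which the two appear.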

II. Search Engine Robots : Meta-tags Explained

What is the robots meta tag?

Besides robots.txt, search engines have another tool to control how they crawl through web pages. This is the robots META tag, which tells a web spider whether to index a page and follow the links on it. It can be more helpful in some cases because it is used on a page-by-page basis. It is also helpful in case you don't have the requisite permission to access the server's root directory and control the robots.txt file.
We place this tag within the header portion of the HTML.

Format of the Robots Meta tag :

In the HTML document it is placed in the HEAD section.
<html>
<head>
<META NAME="robots" CONTENT="index,follow">
<META NAME="description" CONTENT="Welcome to...">
<title>...</title>
</head>
<body>
...
</body>
</html>

Robots Meta Tag options :

There are four options that can be used in the CONTENT portion of the robots META tag: index, noindex, follow and nofollow.

With "index,follow", the tag allows search engine robots to index a specific page and follow all the links residing on it. If the site admin doesn't want a page to be indexed or any link to be followed, they can replace "index,follow" with "noindex,nofollow".
According to the requirements, the site admin can use the robots META tag with the following options:

<META NAME="robots" CONTENT="index,follow"> Index this page, follow links from this page.
<META NAME="robots" CONTENT="noindex,follow"> Don't index this page but follow links from this page.
<META NAME="robots" CONTENT="index,nofollow"> Index this page but don't follow links from this page.
<META NAME="robots" CONTENT="noindex,nofollow"> Don't index this page, don't follow links from this page.
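
From the robot's side, reading this tag could look like the following minimal sketch, built on Python's standard html.parser module. The class RobotsMetaParser and the function robots_directives are names invented for this example. Treating a page without the tag as "index,follow" is an assumption based on the usual default.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the index/follow decision from a robots META tag."""

    def __init__(self):
        super().__init__()
        self.index = True   # assumed default when no tag is present
        self.follow = True  # assumed default when no tag is present

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)  # tag and attribute names arrive lower-cased
        if tag != "meta" or (attributes.get("name") or "").lower() != "robots":
            return
        for token in (attributes.get("content") or "").lower().split(","):
            token = token.strip()
            if token in ("index", "noindex"):
                self.index = token == "index"
            elif token in ("follow", "nofollow"):
                self.follow = token == "follow"


def robots_directives(html):
    """Return (index, follow) as booleans for the given HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.index, parser.follow


page = '<html><head><META NAME="robots" CONTENT="noindex,follow"></head><body></body></html>'
print(robots_directives(page))
```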

Author: Susmita
 
Author Bio:

Susmita loves researching web marketing and SEO-related issues. She prepared her blog for distributing and gathering knowledge, as well as for sharing her views on different aspects of life.

 
 
 

© www.poweredarticle.com - All Rights Reserved Worldwide