So, this increased my curiosity, and I started looking into what web spiders, web crawlers, robots, and bots are, and how they all work.
One interesting thing I found is that all the above terms mean the same thing; the only difference is that different people use different terminology. So for the rest of this post I am going to use the term web spider.
What is a Web Spider?
Let’s approach this from the other direction: whenever you search on a search engine, it shows you the best results for the input you provided. To return these results so quickly, the search engine must already have the information stored somewhere; when anybody asks for it, the engine consults a program and returns the best matches.
That someone helping search engines find the information you requested among millions of pages is known as a web spider (or web crawler).
Every search engine has web spiders associated with it. The main responsibility of these spiders is to go through every website on the internet, collect the important keywords found on each page, build an index on those keywords, and store it in the search engine’s database.
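The keyword index described above is usually built as an inverted index: a mapping from each keyword to the list of pages containing it. A minimal sketch (the page contents and filenames below are made up for illustration):

```python
# Toy inverted index: maps each keyword to the pages it appears on.
# The page contents below are hypothetical stand-ins for crawled text.
pages = {
    "page1.html": "web spiders crawl the web",
    "page2.html": "search engines use spiders",
}

index = {}
for url, text in pages.items():
    for word in set(text.split()):          # set() avoids duplicate entries per page
        index.setdefault(word, []).append(url)

print(sorted(index["spiders"]))  # ['page1.html', 'page2.html']
```

A real search engine would also normalize words (lowercasing, stemming) and store positions and frequencies, but the core idea is this keyword-to-pages mapping.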
How Does a Web Spider Work?
Generally, every search engine has multiple instances of crawlers running. Whenever a web spider reaches a particular website, it first looks for a file named robots.txt. This file gives instructions to the web spider so that it knows which parts of the website it may crawl to collect information. Today almost all web spiders from the different search engines follow this approach.
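Python ships a robots.txt parser in the standard library, so we can sketch how a spider checks whether it is allowed to crawl a URL. The site and user-agent name below are hypothetical, and the rules are fed in directly rather than fetched over the network:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# rp.set_url(...) + rp.read() would fetch a live robots.txt; for a
# self-contained example we parse a sample rule set directly.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# "MySpider" is a made-up crawler name for illustration.
print(rp.can_fetch("MySpider", "https://example.com/index.html"))    # True
print(rp.can_fetch("MySpider", "https://example.com/private/data"))  # False
```

A well-behaved spider runs this check before every request and simply skips any URL the site has disallowed.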
Now let’s look in a little more detail at how a web crawler works.
After reaching a particular website, the spider crawls all of its pages and indexes the keywords present on them. Once it has finished this activity, it looks for the outbound links (external links present on the site), navigates to them, and starts the same process again. From a programming perspective, this is a recursive process. But you might be wondering what the starting point for a web spider is, i.e. the starting input for our recursive program: generally, most web spiders start their navigation from heavily trafficked servers and the most popular sites. Since multiple instances of crawlers run at the same time, they have to make sure they are not storing duplicate data; otherwise it would hugely impact the performance of the search engine.
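The crawl-and-follow-links loop above, including the duplicate check, can be sketched in a few lines. The link graph here is a made-up stand-in for the web; a real spider would fetch each page over HTTP and extract links from its HTML:

```python
from collections import deque

# Hypothetical link graph: each URL maps to the links found on that page.
PAGES = {
    "https://a.example/":      ["https://a.example/about", "https://b.example/"],
    "https://a.example/about": ["https://a.example/"],
    "https://b.example/":      ["https://a.example/"],
}

def crawl(seed):
    """Crawl outward from a seed URL, never visiting a page twice."""
    visited = set()           # dedup store: prevents indexing a page twice
    frontier = deque([seed])  # URLs discovered but not yet crawled
    order = []
    while frontier:
        url = frontier.popleft()
        if url in visited:    # skip duplicates already crawled
            continue
        visited.add(url)
        order.append(url)     # a real spider would index keywords here
        for link in PAGES.get(url, []):
            frontier.append(link)
    return order

print(crawl("https://a.example/"))
# ['https://a.example/', 'https://a.example/about', 'https://b.example/']
```

This iterative version with an explicit frontier queue behaves like the recursive description in the text, but the `visited` set is what keeps the "recursion" from looping forever on cycles and from storing duplicate data.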
Please share your thoughts and input, and please add anything I have missed.
Thanks for visiting.