
Subject: Search Engine Crawler


A search engine crawler is a program or automated script that browses the World Wide Web in a methodical manner in order to provide up-to-date data to a particular search engine. While search engine crawlers go by many different names, such as web spiders and automatic indexers, the job of the crawler is always the same. Web crawling starts from a set of website URLs to be visited, called seeds; the crawler visits each page, identifies all the hyperlinks on it, and adds them to the list of places to crawl. URLs from this list are re-visited occasionally according to the policies of the particular search engine. These policies differ from one search engine to another, and may include precautionary checks to ensure that pages already added to the index have not turned into spam.
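As a rough illustration of that loop, here is a minimal crawler sketch in Python using only the standard library. The seed URLs, the page limit, and names such as crawl and LinkExtractor are illustrative assumptions for this sketch, not part of any real search engine:

# Minimal sketch of the crawl loop: start from seed URLs, fetch each page,
# extract its hyperlinks, and add newly discovered URLs to the frontier.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, max_pages=50):
    frontier = deque(seeds)          # URLs waiting to be visited
    visited = set()                  # URLs already fetched

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue                 # skip pages that fail to download
        visited.add(url)

        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute = urljoin(url, href)    # resolve relative links
            if absolute not in visited:
                frontier.append(absolute)    # schedule for a later visit
    return visited

A production crawler would add many refinements (robots.txt checks, politeness delays, persistent storage), but the seed-and-frontier structure is the same.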

Search engine crawlers sometimes have a hard time crawling the web, because the Internet has three main characteristics that make it difficult to keep an index up to date: the sheer volume of web pages, the pace and frequency at which pages change, and the addition of dynamically generated pages. Together these produce a massive number of URLs to crawl and force the crawler to prioritize certain web pages and hyperlinks. This prioritization can be summed up in four crawler policies that are found in some form in all search engines, though the details differ slightly.

The selection policy states which pages to download for crawling.

The re-visit policy tells the crawler when to check web pages for changes.

The politeness policy tells the crawler how to avoid overloading the websites it checks.

The parallelization policy states how to coordinate distributed web crawlers (a small scheduling sketch follows this list).
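As a toy illustration, the sketch below shows how the selection, re-visit and politeness policies might interact when deciding whether a single URL may be fetched right now. The allowed domain, the 24-hour re-visit interval and the 5-second delay are made-up values, and the parallelization policy is only hinted at in a comment:

import time
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"example.com"}   # selection policy: only crawl these domains
REVISIT_AFTER = 24 * 3600           # re-visit policy: re-check a page at most daily (seconds)
POLITENESS_DELAY = 5                # politeness policy: seconds between requests to one host
# A parallelization policy would additionally decide which worker handles which
# host, for example by hashing the host name across the distributed crawlers.

last_hit = {}                       # host -> time of the last request to it
last_crawled = {}                   # url  -> time it was last fetched


def may_fetch(url, now=None):
    """Return True if this URL should be fetched right now under all three policies."""
    now = now if now is not None else time.time()
    host = urlparse(url).netloc

    if not any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS):
        return False                # rejected by the selection policy
    if now - last_crawled.get(url, 0.0) < REVISIT_AFTER:
        return False                # too soon to re-visit this page
    if now - last_hit.get(host, 0.0) < POLITENESS_DELAY:
        return False                # be polite: this host was hit very recently
    return True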

A search engine crawler not only needs a good crawling strategy, with policies that let it narrow down and prioritize the web pages to be crawled, but also a highly optimized architecture. This architecture is used to build high-performance systems capable of downloading hundreds of millions of pages over several weeks. The basic design is easy to follow, but it must still deliver high performance. In a well-formed search engine crawler, pages are fetched from the World Wide Web by a multi-threaded downloader; the URLs extracted from those pages feed into a queue, pass through a scheduler that prioritizes them, and go back to the multi-threaded downloader, so that the page text and metadata end up in storage.
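A stripped-down sketch of that pipeline is given below: a scheduler puts prioritized URLs into a queue, several downloader threads pull from the queue, and the fetched text ends up in storage. The thread count, the priorities, and the in-memory storage dictionary are assumptions for illustration only; a real crawler would persist pages to disk and feed newly extracted URLs back into the queue:

import itertools
import queue
import threading
from urllib.request import urlopen

url_queue = queue.PriorityQueue()   # (priority, tiebreak, url) entries from the scheduler
storage = {}                        # url -> downloaded page text
storage_lock = threading.Lock()
tiebreak = itertools.count()        # keeps entries with equal priority comparable


def downloader():
    while True:
        priority, _, url = url_queue.get()
        if url is None:                     # sentinel value: no more work for this thread
            url_queue.task_done()
            break
        try:
            text = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            with storage_lock:
                storage[url] = text         # page text (and metadata) ends up in storage
        except Exception:
            pass                            # a real crawler would log and retry
        url_queue.task_done()


threads = [threading.Thread(target=downloader) for _ in range(4)]
for t in threads:
    t.start()

url_queue.put((1, next(tiebreak), "https://example.com/"))   # scheduler assigns priorities
for _ in threads:
    url_queue.put((99, next(tiebreak), None))                # one stop signal per worker

for t in threads:
    t.join()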

There are many professional search engine crawlers in use today, such as the Google crawler, and they are used to build the list of URLs the search engine draws on. Without search engine crawlers there would be no results on search engine results pages, and new pages would never be listed.

10 Ways to Increase Your Site Crawl Rate

Regular and frequent visits by the crawler are the first sign that your site appeals to Google. The most efficient way to get frequent and deep crawls is therefore to develop a website that search engines see as important and valuable.

Note that you can't force Googlebot to visit you more often; what you can do is invite it to come. Possible measures to increase the crawl rate include:

Update your content often and regularly (and ping Google once you do): an obvious one, so there is not much to describe here. In short, try to add new, unique content as often as you can afford to, and do it regularly (three times a week can be a good compromise if you can't update your site daily and are looking for an optimal update rate).

Make sure your server works correctly: mind the uptime and the Google Webmaster Tools reports of unreachable pages. Two tools I can recommend here are Pingdom and Mon.itor.us.

Mind your page load time: note that the crawler works on a budget; if it spends too much time crawling your huge images or PDFs, there will be no time left to visit your other pages.

Check your site's internal link structure and make sure no duplicate content is returned via different URLs: the more time the crawler spends figuring out your duplicate content, the fewer useful and unique pages it will manage to visit.

Get more back links from regularly crawled sites.

Adjust the crawl speed via Google Webmaster Tools.

Add a sitemap (though it is up for debate whether a sitemap helps with crawling and indexing issues, many webmasters report seeing an increased crawl rate after adding one).

Make sure your server returns the correct header response. Does it handle your error pages properly? Don't make the bot figure out what has happened: explain it clearly (see the small audit sketch after this list).

Make sure you have a unique title and unique meta tags for each of your pages.

Monitor the Google crawl rate for your site and see what works and what doesn't.
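Several of the tips above (correct header responses, page load time, unique titles) can be spot-checked with a small script. The following is a hypothetical sketch using only the Python standard library; the URL list is a placeholder you would replace with your own pages:

# Fetch a few of your URLs and report the HTTP status code, response time
# and <title>, then flag any pages that share the same title.
import re
import time
import urllib.error
import urllib.request

URLS = [
    "https://www.example.com/",
    "https://www.example.com/about",
    "https://www.example.com/no-such-page",   # should return 404, not 200
]

titles = {}
for url in URLS:
    start = time.time()
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            status = response.status
            html = response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        status, html = err.code, ""
    except Exception:
        status, html = "unreachable", ""
    elapsed = time.time() - start

    match = re.search(r"<title>(.*?)</title>", html, re.I | re.S)
    title = match.group(1).strip() if match else "(no title)"
    titles.setdefault(title, []).append(url)

    print(f"{status}  {elapsed:.2f}s  {title[:40]:40}  {url}")

# Pages sharing a title suggest that titles are not unique per page.
for title, shared in titles.items():
    if title != "(no title)" and len(shared) > 1:
        print("Duplicate title:", title, "->", shared)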

Search Engine Indexing

Search engine indexing is done once the search engine spider (web crawler) returns from its web crawl. Refer to our web crawler article to see what information it collects.

Search Engine Indexing Process

The search engine indexing process takes the detailed information collected by the search engine spider (web crawler) and analyses the information.


There is a lot of speculation about how search engines index websites. The exact workings of the indexing process are shrouded in mystery, since most search engines offer only limited information about how they architect it. Webmasters get some clues by checking their log reports for crawler visits, but they do not know how the indexing happens or which pages of their website were really crawled.

While the speculation about the search engine indexing process may continue, here is a theory, based on experience, research and clues, about how search engines may go about indexing 8 to 10 billion web pages every so often, and why there is a delay before newly added pages show up in the index. The discussion is centered on Google, but we believe that other popular search engines such as Yahoo and MSN follow a similar pattern.

Google runs from about 10 Internet Data Centers (IDCs), each with 1,000 to 2,000 Pentium-3 or Pentium-4 servers running Linux.

Google has over 200 (some think over 1,000) crawlers/bots scanning the web each day. They do not necessarily follow an exclusive pattern, which means different crawlers may visit the same site on the same day without knowing that other crawlers have been there before. This is probably what produces a daily visit record in your traffic log reports, keeping webmasters very happy about the frequent visits.

Some crawlers' only job is to grab new URLs (let's call them URL grabbers for convenience). The URL grabbers collect the links and URLs they detect on various websites (including links pointing to your site) and the old and new URLs they detect on your site. They also capture the date stamp of files when they visit your website, so that they can identify new or updated content pages. The URL grabbers respect your robots.txt file and robots META tag, so that they can include or exclude the URLs you do or do not want indexed. (Note: the same URL with different session IDs is recorded as several distinct URLs. For this reason session IDs are best avoided, otherwise they can be mistaken for duplicate content.) The URL grabbers spend very little time and bandwidth on your website, since their job is rather simple. That said, they need to scan 8 to 10 billion URLs on the web each month, which is not a petty job in itself, even for 1,000 crawlers.
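Two of the details above, respecting robots.txt and avoiding duplicate URLs caused by session IDs, can be illustrated with a short sketch. The user agent name and the set of parameter names treated as session IDs are assumptions made for the example:

from urllib.parse import urlparse, urlencode, parse_qsl, urlunparse
from urllib.robotparser import RobotFileParser

SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}


def allowed_by_robots(url, user_agent="MyCrawler"):
    """Respect the site's robots.txt for this user agent."""
    parts = urlparse(url)
    robots = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    robots.read()
    return robots.can_fetch(user_agent, url)


def normalize(url):
    """Drop session-ID parameters so duplicate URLs collapse to one."""
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k.lower() not in SESSION_PARAMS]
    return urlunparse(parts._replace(query=urlencode(query)))


print(normalize("http://example.com/page?id=7&PHPSESSID=abc123"))
# -> http://example.com/page?id=7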

The URL grabbers write the captured URLs, with their date stamps and other status information, into a master URL list so that they can be deep-indexed by other, specialized crawlers.

The master list is then processed and classified somewhat like this (a toy sketch of this classification step follows the list):

New URLs detected

Old URLs with new date stamp

301 & 302 redirected URLs

Old URLs with old date stamp

404 error URLs

Other URLs
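As a toy illustration of this classification step, the sketch below sorts a master-list record into the buckets above. The record format (a dict with 'status', 'last_seen' and 'previously_seen' fields) is an assumption made for the sketch, not Google's actual data structure:

def classify(record):
    """record: dict with 'status', 'last_seen' and 'previously_seen' keys."""
    if record["status"] in (301, 302):
        return "301 & 302 redirected URLs"
    if record["status"] == 404:
        return "404 error URLs"
    if record["previously_seen"] is None:
        return "New URLs detected"
    if record["last_seen"] > record["previously_seen"]:
        return "Old URLs with new date stamp"
    if record["last_seen"] == record["previously_seen"]:
        return "Old URLs with old date stamp"
    return "Other URLs"            # dynamic URLs, documents, multimedia, etc.


print(classify({"status": 200, "last_seen": 20240105, "previously_seen": None}))
# -> New URLs detected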

The real indexing is done by what we're calling Deep Crawlers. A deep crawler's job is to pick URLs from the master list, deep-crawl each URL, and capture all of the content: text, HTML, images, Flash and so on.

Priority is given to "Old URLs with new date stamp", as they relate to already indexed but since-updated content. "301 & 302 redirected URLs" come next in priority, followed by "New URLs detected". High priority is given to URLs whose links appear on several other sites; these are classified as important URLs. Sites and URLs whose date stamp and content change on a daily or hourly basis are flagged as news sites, which are indexed hourly or even on a minute-by-minute basis.

Indexing of "Old URLs with old date stamp" and "404 error URLs" is skipped altogether. There is no point wasting resources re-indexing old URLs with old date stamps, since the search engine already has that content indexed and it has not been updated. "404 error URLs" are URLs collected from various sites that turn out to be broken links or error pages; these URLs do not show any content.

The "Other URLs" bucket may contain dynamic URLs, URLs with session IDs, PDF documents, Word documents, PowerPoint presentations, multimedia files and so on. Google needs to process these further and assess which ones are worth indexing, and to what depth. It perhaps allocates the indexing of these to special crawlers.

When Google schedules the Deep Crawlers to index "New URLs" and "301 & 302 redirected URLs", just the URLs (without descriptions) start appearing in the search engine results pages when you run the search "site:www.domain.com" in Google. These are called supplemental results, which means that the Deep Crawlers will index the content as soon as they get the time to do so.

Since the Deep Crawlers need to crawl billions of web pages each month, they can take as long as 4 to 8 weeks to index even updated content. New URLs may take longer to index.

Once the Deep Crawlers index the content, it goes into their originating IDCs. The content is then processed, sorted and replicated (synchronized) to the rest of the IDCs. A few years back, when the data size was manageable, this synchronization used to happen once a month and lasted about 5 days; it was called the Google Dance. Nowadays the data synchronization happens constantly, which some people call Everflux.

When you hit www.google.com from your browser, you can land at any of the 10 IDCs depending on their speed and availability. Since the data at any given time differs slightly between IDCs, you may get different results at different times or on repeated searches for the same term (the Google Dance).

The bottom line is that one needs to wait for as long as 8 to 12 weeks to see full indexing in Google. Consider this the cooking time in Google's kitchen. Unless you can increase the importance of your web pages by getting several incoming links from good sites, there is no way to speed up the indexing process, unless you personally know Sergey Brin and Larry Page and have significant influence over them.

Dynamic URLs may take longer to index (and sometimes they do not get indexed at all), since even a small amount of data can generate an unlimited number of URLs, which can clutter the Google index with duplicate content.

Recommendations:

Ensure that you have cleared all roadblocks for crawlers, so they can freely visit your site and capture all of its URLs. Help crawlers by creating good interlinking and sitemaps on your website.
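As a minimal illustration of the sitemap suggestion, the sketch below writes a standard XML sitemap using the Python standard library. The URL list and output file name are placeholders:

import xml.etree.ElementTree as ET

urls = [
    "https://www.example.com/",
    "https://www.example.com/articles/search-engine-crawler",
]

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
urlset = ET.Element("urlset", xmlns=NS)
for url in urls:
    entry = ET.SubElement(urlset, "url")
    ET.SubElement(entry, "loc").text = url

ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)

# Google has historically also accepted a ping of the form
# http://www.google.com/ping?sitemap=<sitemap-url> (the "ping Google" tip
# earlier), though support for that endpoint may have changed since this
# article was written.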

Get lots of good incoming links to your pages from other websites to improve the importance of your web pages. There is no special need to submit your website to search engines; links to your website on other websites are sufficient.

by: M Umar



