Search Engines
Where are you searching?
When you use a search engine, you are not searching the live Web. Instead, you are working with the search engine's database of Web page information. Each search engine copies and organizes information available on the Internet into its own database. A database may include partial or complete copies of millions, or even billions, of Web pages. Some databases are broad and general; others are narrow and specific. Clicking on a hyperlink listed in the search results opens the live Web page of interest, assuming the live page is still there.
The search engine's three-stage process: crawl, index, retrieve
In the simplest terms, building a search engine database is a three-stage process. Search engines must find pages, organize the information for fast retrieval, and serve up the information based on a user's request. This is an ongoing process because search engine databases are constantly changing and growing.
Stage 1 - Crawling
Search engines compile huge databases of information. The database is continually updated by robotic 'crawlers' or 'spiders' that automatically copy the contents of hundreds of billions of pages of Web information. Not every page on every site is crawled and copied. Google, for example, crawls only publicly available pages. Crawlers can also be asked to skip pages by a file called 'robots.txt.' Typically a spider program crawls a site directory and follows available links. A spider program might copy 100 pages from a 300-page site, so valuable information may be missed. This un-indexed information is sometimes called the 'opaque' or partially visible Web.
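To make this concrete, here is a minimal sketch of a robots.txt file for a hypothetical site. It shows how a publisher can ask crawlers to skip part of a site; the directory name is made up for the example.

    # Hypothetical robots.txt placed at the root of example.com
    # "User-agent: *" applies the rules to every crawler
    User-agent: *
    # Ask crawlers to skip this directory
    Disallow: /private/
    # Everything else may be crawled
    Allow: /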
Crawlers jump from Web page to Web page by following site links. A site that is linked to many others is more likely to be visited frequently. Isolated pages will be 'crawled' less often, or may be missed altogether. This method of collecting information can lead to gaps in a search engine's database. In part, this explains why different search engines produce different results.
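The link-following behavior described above can be pictured with a short Python sketch. It is only an illustration, not how any commercial crawler works; the fetch_html() helper is assumed to exist and return a page's HTML.

    # A minimal sketch of link-following crawling (illustration only).
    # fetch_html(url) is an assumed helper that returns a page's HTML.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        """Collects the href targets of <a> tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed_url, fetch_html, max_pages=100):
        """Breadth-first crawl: copy a page, then queue the pages it links to."""
        seen, queue, copies = set(), deque([seed_url]), {}
        while queue and len(copies) < max_pages:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            html = fetch_html(url)        # download the live page
            copies[url] = html            # store a copy in the 'database'
            parser = LinkExtractor()
            parser.feed(html)
            for href in parser.links:     # follow the page's links
                queue.append(urljoin(url, href))
        return copies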
It takes time and money to crawl the Net. Economic decisions affect the depth, breadth, and freshness of a search engine's database. While every search engine touts freshness and completeness, users should be aware that results may include outdated copies of pages that have changed since the crawler's last visit.
Stage 2 - Indexing
Indexing or cataloging is the process of organizing a search engine's database of Web pages to maximize retrieval efficiency. The exact methods of indexing used by commercial search engines are closely guarded, proprietary information. Each search engine uses a variety of 'black box' indexing algorithms to sort the contents of a page. In general, the occurrence of keywords, the proximity of keywords to each other, and the contents of HTML elements like meta tags, titles, and headers are all taken into account as the index is created.
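As a rough picture of what an index looks like, the Python sketch below builds a simple inverted index recording which pages contain each keyword and where it appears. Real engines use far more elaborate, proprietary structures; the page texts here are invented.

    # A minimal inverted-index sketch: word -> {url: [positions]}.
    # Page texts are made up; real indexing also weighs titles,
    # headers, meta tags, and keyword proximity.
    from collections import defaultdict

    def build_index(pages):
        """pages: dict of url -> text."""
        index = defaultdict(dict)
        for url, text in pages.items():
            for position, word in enumerate(text.lower().split()):
                index[word].setdefault(url, []).append(position)
        return index

    index = build_index({
        "example.com/a": "search engines build an index of keywords",
        "example.com/b": "crawlers copy pages before the index is built",
    })
    print(index["index"])  # which pages contain 'index', and where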
The raw results delivered by the spider software are 'pre-processed' to eliminate duplicate pages and remove 'spam' pages. Spam pages are designed to fool search engines into assigning a higher ranking by using rigged attributes such as misleading meta tags. Robotic crawling and indexing is an automated process; no human judgment is applied to the actual quality of the information.
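Duplicate removal can be pictured with a simple content-hash check, sketched below. It assumes that two pages with identical text are duplicates; real pre-processing uses far more sophisticated, secret similarity tests.

    # A minimal duplicate-removal sketch using a content hash.
    import hashlib

    def remove_duplicates(pages):
        """pages: dict of url -> text. Keep one page per unique content hash."""
        seen, unique = set(), {}
        for url, text in pages.items():
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                unique[url] = text
        return unique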
Pages are unsearchable until they are indexed. There is a lag between the 'spidering' of a page and the moment it is indexed and made available to the users of the search engine. Once the Web page copy is in place, it remains unchanged in the database until the next 'visit' by the search engine's crawler.
Stage 3 - Retrieving
Finally, a search engine provides a way for a user to retrieve information from its database. Each search engine uses a proprietary 'retrieval algorithm' to process your query. These algorithms are increasingly sophisticated and may involve semantic processing to understand what a search query means. Responding to a user query is usually a two-step procedure: first the system finds records that match the query, then it sorts or ranks the results in some kind of hierarchy. Exactly how each search engine matches queries to records is a carefully guarded trade secret.
In general, search engines look at the frequency of keyword matches and the distribution of those words in a document to determine which information is relevant to a request. Keyword matches found in titles and headings, or early in the page, are given greater weight and considered semantically more relevant. To improve retrieval speed, many search engines filter out high-frequency words such as 'or,' 'to,' 'with,' 'and,' and 'if.' These words are sometimes referred to as 'stop words.'
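A toy version of this two-step retrieval, built on the inverted index sketched earlier, might look like the following. The stop-word list and the extra weight given to matches near the top of a page are made-up stand-ins for each engine's secret weighting rules.

    # A minimal retrieval sketch: match records, then rank them.
    STOP_WORDS = {"or", "to", "with", "and", "if", "the", "a"}

    def search(query, index):
        """index: word -> {url: [positions]}, as built earlier."""
        scores = {}
        for word in query.lower().split():
            if word in STOP_WORDS:       # skip high-frequency 'stop words'
                continue
            for url, positions in index.get(word, {}).items():
                # Invented rule: matches near the top of a page count double
                weight = sum(2 if p < 10 else 1 for p in positions)
                scores[url] = scores.get(url, 0) + weight
        # Step two: sort the matches into a ranked results list
        return sorted(scores, key=scores.get, reverse=True)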
The most likely matches for a query are then displayed in a results list. The order in which matched pages are listed varies. Google lists results based in part on their popularity. Popularity is based on the number of other pages that link to the page in question. Commercial providers might also pay to be ranked at the top of the results list as ads.
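Link-based popularity can be pictured as a simple count of inbound links, as in the sketch below. Google's real ranking combines many more signals than this single number.

    # A minimal popularity sketch: count how many pages link to each page.
    def popularity(link_graph):
        """link_graph: dict of url -> list of urls it links to."""
        inbound = {url: 0 for url in link_graph}
        for source, targets in link_graph.items():
            for target in targets:
                inbound[target] = inbound.get(target, 0) + 1
        return inbound

    ranks = popularity({
        "a.com": ["b.com", "c.com"],
        "b.com": ["c.com"],
        "c.com": [],
    })
    print(sorted(ranks, key=ranks.get, reverse=True))  # c.com ranks first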
FAQs
Why should I use more than a single Search Engine?
No single search engine covers it all. In fact there are billions of pages of information that remain hidden to search engines. Since search engines differ in their crawling, indexing, and retrieval procedures, their results will vary. While overlap exists, each search engine will contain Web pages that have been missed by the others. This means that identical queries made to different search engines will yield different results. For this reason alone, using more than a single search engine is a wise move. If you rely on a single source of information you will get an incomplete picture.
How 'fresh' or 'current' is the information we get from a search engine?
Search engines must first find, copy, and index a Web resource before it is made available for retrieval. This process takes time. Crawlers may return to a page on a daily, weekly, or monthly basis. Additionally, author-submitted Web pages might take weeks or months to process. Once a page is in the database, it is only a copy of the original, which may have already changed. The best way to determine the 'currency' of a Web page is to examine the original material, looking for a record of when the page was last updated.
How do search engines find web pages?
We have already seen how information discovery software, sometimes called spiders or crawlers, automates the process of finding new Web pages. Most search engines also allow authors to submit their pages directly. The author supplies the Web address and some information about the content. The sites are then crawled, indexed, and made available. Some search engines allow Web page authors to buy quick placement in their systems.
How much information on each page is actually indexed?
Some search engines make a copy of the entire Web page. Others take a snapshot consisting of essential address information and the first few hundred words of text on the page. There is no guarantee that all of the pages on a site have been indexed. Only the main pages may be included in a search engine's database. Additionally, common parts of speech may be left out of the indexing process.
Are crawlers, spiders, and robotic information discovery the same thing? Do they work the same way?
Crawlers, spiders, and robotic information discovery all describe the process of automatic Web page copying. This is an essential first step in the process of building and maintaining a search engine database. Because it is an automated process, crawlers work around the clock to find new sites and recheck sites for changes. Crawlers can be set to investigate a website in depth, visiting and copying every page. Crawlers might also just skim the surface content of a site, leaving a lot of information in the shadows.
Authored by Dennis O'Connor 2003 | Revised 2019