

1 Including Your Website in a Search Engine

I will give you a summary of the standard procedure that you will find, in this or a similar form, in every search engine. For the sake of clarity I will not go into specific programming details.

We start with a list of web addresses (URLs) that is handed to a spider, whose job is to query their content. (Figure: Structure of a Search Engine)

You register your page via the appropriate submission form so that it appears on the URL list. Another source for the list are the hyperlinks found when pages are analyzed: the software follows the links, and if it comes across a page that is not yet known, that page is added to the list as well. Google and some other engines offer a particular form of automatic submission for a site, called a "sitemap". You put all the individual pages of your website on a list and upload it to Google. Read here to find out how this works (Google sitemap upload):

http://www.google.com/support/webmasters/bin/answer.py?answer=40318
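
To give you an idea of the format, here is a small Python sketch that writes such a sitemap file. The addresses are made-up placeholders; the real file would of course list the pages of your own website.

# Minimal sketch: building a simple XML sitemap file for upload to Google.
# The domain and page paths are placeholders - replace them with your own URLs.
from xml.sax.saxutils import escape

pages = [
    "http://www.example.com/",
    "http://www.example.com/products.html",
    "http://www.example.com/contact.html",
]

entries = "\n".join(
    "  <url><loc>{}</loc></url>".format(escape(url)) for url in pages
)

sitemap = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    + entries + "\n</urlset>\n"
)

with open("sitemap.xml", "w") as f:
    f.write(sitemap)

print(sitemap)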

The robot starts at the beginning of the URL list and calls up the relevant IP addresses. According to unofficial estimates, the Google URL list contains around 24 billion pages. All these pages are called up sequentially. The robot software reads the textual content of the pages and saves the entire content in compressed form. Take a look at what the search robot detects and records. With your browser this is quite simple: on any page, choose the command View / Source from the browser menu. You then see the raw data the browser receives; from this text the page you see on the screen is built. The language of this text is called HTML (Hypertext Markup Language). (Fig. 3: Source Code)
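
If you want to see the same raw data the robot receives, the following Python sketch fetches a page and prints the beginning of its HTML source. The address is a placeholder, and the compression step only hints at how the content is stored.

# Minimal sketch: fetching a page the way a robot does and looking at the raw
# HTML, similar to "View / Source" in the browser. The URL is a placeholder.
import urllib.request
import zlib

url = "http://www.example.com/"
with urllib.request.urlopen(url) as response:
    html = response.read()

print(html.decode("utf-8", errors="replace")[:500])  # first part of the raw HTML

# The robot stores the page content in compressed form.
compressed = zlib.compress(html)
print(len(html), "bytes raw ->", len(compressed), "bytes compressed")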

Program instructions and textual content are mixed in this page description. The information travels in this form across the Internet and arrives at the robot while it is processing the IP addresses. How the data is processed further is the task of the so-called “parser”.

How can the robot be instructed to gather as much information as possible? Look at the procedure for scanning a website in detail: the robot first checks whether it finds a file called “robots.txt”. This file gives it control instructions about what it should and should not do. Google describes the procedure very precisely:
http://www.google.com/support/webmasters/bin/topic.py?topic=8843
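
To give you an idea of how a robot reads these instructions, here is a small Python sketch using the robots.txt parser from the standard library. The site and paths are made-up placeholders.

# Minimal sketch: how a robot might read robots.txt and check whether it is
# allowed to fetch a page. The site and paths are placeholders.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

# Ask whether the "Googlebot" user agent may crawl these paths.
for path in ["/", "/private/data.html"]:
    allowed = rp.can_fetch("Googlebot", "http://www.example.com" + path)
    print(path, "->", "allowed" if allowed else "disallowed")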

A second possibility to control the robots is the use of meta tags. You place them in the so-called head section (<head>) of the page description.
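
Here is a small, made-up example of such a robots meta tag and a Python sketch of how a robot could pick it up from the head section; the page itself is only an illustration.

# Minimal sketch: a robots meta tag in the <head> section and how a robot
# might extract it. The HTML is a made-up example page.
from html.parser import HTMLParser

page = """
<html>
  <head>
    <title>Example page</title>
    <meta name="robots" content="noindex, nofollow">
  </head>
  <body>Nothing to index here.</body>
</html>
"""

class RobotsMetaReader(HTMLParser):
    def __init__(self):
        super().__init__()
        self.directives = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            self.directives = attrs.get("content", "")

reader = RobotsMetaReader()
reader.feed(page)
print("robots directives:", reader.directives)  # -> noindex, nofollow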

Let’s assume the robot has found your site and now calls up the documents belonging to the website. These are first of all the page description (HTML) files, but also image files, graphics, PDF files and sound or video sequences.

The robot starts at the top level of the website. In this respect it is sufficient to notify the search engine of the homepage (first page) alone, because this page leads the visitor on to the following pages. Those links are also enough food for the robot to find further pages: it autonomously writes any unknown URLs it reads from the pages into the URL list (see above).
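
The following Python sketch shows this link following in a simplified form: it collects the links from a fetched page and adds any unknown addresses to the URL list. The page and addresses are made-up examples.

# Minimal sketch: extracting links from a fetched page and adding unknown
# URLs to the crawl list. The HTML and URLs are made-up examples.
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

page = '<a href="/products.html">Products</a> <a href="/contact.html">Contact</a>'
url_list = ["http://www.example.com/"]  # only the homepage was notified

collector = LinkCollector("http://www.example.com/")
collector.feed(page)

# Unknown URLs found on the page are appended to the list for later visits.
for link in collector.links:
    if link not in url_list:
        url_list.append(link)

print(url_list)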

In this way the robot scans all the pages of your website and sends a copy to the database of the search engine. Google, for example, is said to comb through about 8 billion pages in 4 weeks and pass them on to the parser.

The parser is a kind of sorting engine. It filters out a plethora of material the search engine cannot cope with: coding errors, stray spaces, nested constructs and all other sorts of data that are useless for the subsequent processing.
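
As a rough illustration only, here is a simplified Python sketch of the kind of clean-up such a parser performs - stripping the markup and collapsing stray whitespace so that only usable text remains; real parsers handle far more cases.

# Rough sketch: the kind of clean-up a parser performs before indexing -
# stripping markup and collapsing stray whitespace. Real parsers handle far
# more (broken tags, encodings, scripts, and so on).
import re

raw = "<p>Search   engines\n\n store the <b>text</b> of a page</p>"

text = re.sub(r"<[^>]+>", " ", raw)       # drop the tags
text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

print(text)  # -> Search engines store the text of a page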

The indexer, as the next program is called, works through the pages and extracts the keywords. The keywords are given an index and sorted into a lexicon. If you want to get an idea of which keywords can be found on your page, enter your URL (or any other) into the following tool and you will receive an analysis that corresponds roughly to the result of the indexer: http://www.addme.com/keyword-density.htm. The result is a list for each page with its keywords and their counts: page -> keywords.
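
To make the idea concrete, here is a small Python sketch of an indexer counting keywords per page (page -> keywords). The page texts are made-up, and a real indexer also handles stemming, stop words and word positions.

# Minimal sketch: an "indexer" counting keywords per page (page -> keywords).
# The page texts are made-up examples.
from collections import Counter

pages = {
    "http://www.example.com/adwords.html": "adwords course learn adwords fast",
    "http://www.example.com/seo.html": "seo course search engine basics",
}

page_keywords = {
    url: Counter(text.lower().split()) for url, text in pages.items()
}

for url, counts in page_keywords.items():
    print(url, dict(counts))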

This database is then sorted one more time. Think of it as a matrix that is being transposed. The result is a reversal: keywords -> pages. The keywords form a lexicon, and for each keyword a reference is created pointing to the websites on which the search word appears. Typically there is more than one result for a request: with the large number of sites searched, the reverse index holds thousands or millions of references. For example: if a surfer searches for the term "Adwords course", Google shows 1,210,000 pages on which this word combination occurs.
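
The reversal itself is a simple operation. The following Python sketch turns a page -> keywords list into a keywords -> pages index; the data is a made-up continuation of the previous example.

# Minimal sketch: turning the page -> keywords mapping around into an
# inverted index (keywords -> pages). The data is a made-up example.
page_keywords = {
    "http://www.example.com/adwords.html": ["adwords", "course", "learn"],
    "http://www.example.com/seo.html": ["seo", "course", "search", "engine"],
}

inverted_index = {}
for url, keywords in page_keywords.items():
    for word in keywords:
        inverted_index.setdefault(word, set()).add(url)

print(inverted_index["course"])  # all pages on which "course" occurs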

It should be clear by now that after you enter a search term, the search engine is by no means browsing through the Internet looking for pages that contain these keywords. The result is assembled from a number of databases in the shortest possible time. This can only be achieved if all the recorded data (the contents of the pages) has been presorted. This presorting is the essential, time-consuming work of the search engines: it takes weeks and months to analyze the many billions of pages, even with thousands of computers working in parallel.
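
The following Python sketch illustrates why the answer can come so quickly: a query is answered by looking up the precomputed index and intersecting the page sets of the query words, with no access to the live web. The index is again a made-up example.

# Minimal sketch: answering a query from a precomputed inverted index instead
# of searching the live web. The index is a made-up example.
inverted_index = {
    "adwords": {"http://www.example.com/adwords.html"},
    "course": {"http://www.example.com/adwords.html",
               "http://www.example.com/seo.html"},
}

def lookup(query):
    # Intersect the page sets of all query words; no web access needed.
    sets = [inverted_index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*sets) if sets else set()

print(lookup("adwords course"))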

Then the new index is prepared and the old results from the last indexing run are overwritten. This procedure takes a certain amount of time, and while the data is being rearranged the different computing centers show different results. In webmaster and expert forums the results of each new processing run are anxiously awaited. But even a layman can easily find out whether a new revision is on the way - with the Google Dance Tool (www.google-dance-tool.com).


If you have got to this point without being registered for my Internet Marketing Course, don't miss it. It is a free course about Internet marketing and it gives you deep insights and plenty of know-how.

Look at this page for more information: Internet Marketing Course