Search Engines – a look under the hood

Posted by: Rea Maor In: Internet and SEO - Friday, February 23rd, 2007

It seems impossible that Google was born only nine short years ago. These days, quickly retrieving data from the Internet by keyword has become so important that many have asked whether Google, the current leader, matters more than the TCP/IP protocol itself.

A search engine works in three phases: it crawls the web, moving from link to link; it indexes what it finds; and it returns results. The crawler is an indiscriminate bot that tries to explore every link. Indexing is another story: efficient data storage and indexing are the kind of thing that’s worth money and patents. Some engines cache the pages directly, some create a keyword-frequency-based index, some file everything away in a database, and some save every word of every page.
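To make the three phases concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the tiny in-memory `PAGES` "web", the page names, and the scoring rule (summed word counts). No real engine is this simple, and this is only one of the indexing styles mentioned above, the keyword-frequency kind.

```python
from collections import Counter, deque
import re

# Hypothetical miniature "web": page name -> (text, outgoing links).
PAGES = {
    "home": ("search engines crawl the web", ["about", "news"]),
    "about": ("a crawler follows every link it finds", ["home"]),
    "news": ("engines index the web and return results", ["about", "missing"]),
}

def crawl(start):
    """Phase 1: breadth-first walk from link to link, fetching each page once."""
    seen, queue, fetched = {start}, deque([start]), {}
    while queue:
        url = queue.popleft()
        if url not in PAGES:      # dead link: nothing to fetch
            continue
        text, links = PAGES[url]
        fetched[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched

def build_index(fetched):
    """Phase 2: keyword-frequency index, word -> {page: count}."""
    index = {}
    for url, text in fetched.items():
        words = re.findall(r"[a-z]+", text.lower())
        for word, count in Counter(words).items():
            index.setdefault(word, {})[url] = count
    return index

def search(index, query):
    """Phase 3: rank pages by the summed frequency of the query words."""
    scores = Counter()
    for word in query.lower().split():
        for url, count in index.get(word, {}).items():
            scores[url] += count
    return [url for url, _ in scores.most_common()]

index = build_index(crawl("home"))
print(search(index, "link"))
```

The index maps each word to the pages containing it, so a query never has to rescan the page text. That lookup structure is the part the engines compete on.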

Plain text is still the main thing that search engines see. They can index images, video, and audio only by what you tell them about those files. This is another reason why it’s important to use the “alt” attribute in an HTML “img” tag: the alternative text is what will display if the image itself doesn’t show. Take a look at this screenshot:

[Screenshot: Slashdot in Lynx]

This is the famous Slashdot.org, seen in Lynx, the text-mode web browser (and after five screens of scrolling to get to the headlines). If you are a web developer, you owe it to yourself to try out a copy of Lynx, because this is how a search engine spider sees your page!
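If you don’t have Lynx handy, a rough approximation of that text-only view can be sketched with Python’s standard `html.parser` module. The `SpiderView` class, the `text_view` helper, and the sample markup are all made up for this illustration, and real spiders certainly do more than this. The point is only that an image contributes nothing unless it carries alt text.

```python
from html.parser import HTMLParser

class SpiderView(HTMLParser):
    """Collects what a text-only reader gets: visible text plus image alt text."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside <script>/<style>, whose content is not page text

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "img":
            alt = dict(attrs).get("alt")
            if alt:  # an image without alt text leaves no trace at all
                self.chunks.append(f"[{alt}]")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def text_view(html):
    """Return the page as a single line of text, roughly as a spider might read it."""
    parser = SpiderView()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

# Hypothetical page fragment: one image with alt text, one without.
sample = '<h1>News</h1><img src="a.png" alt="Site logo"><img src="b.png"><p>Top story</p>'
print(text_view(sample))  # prints: News [Site logo] Top story
```

The second image, which has no alt text, disappears from the output entirely.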

Not every search engine does its own work, or even operates under its own name. If you check your server logs and see “inktomi Slurp”, that’s the Yahoo bot. Ask Jeeves, once the name of the search engine that almost became Google before Google, is now just the engine part of Ask.com. AOL and Netscape just borrow Google’s data. AltaVista uses Yahoo’s. Alexa and Lycos use MSN Search results. Meta-search engines such as Mamma and Dogpile combine the results of several engines. Then there is the DMOZ directory project, which many search engines index for use in their results.
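Checking your logs for spiders can be as simple as a substring match. The sketch below is illustrative only: “Slurp” is the marker mentioned above, while “Googlebot” and “msnbot” are added as assumptions, and the sample log lines are invented and simplified. Real log formats depend on how your server is configured.

```python
from collections import Counter

# Substrings to look for in a log line. "Slurp" comes from the article;
# "Googlebot" and "msnbot" are assumptions added for illustration.
BOT_MARKERS = ["Slurp", "Googlebot", "msnbot"]

def count_bot_hits(log_lines, markers=BOT_MARKERS):
    """Count log lines per marker, matching case-insensitively."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for marker in markers:
            if marker.lower() in lowered:
                hits[marker] += 1
                break  # attribute each line to at most one bot
    return hits

# Hypothetical, simplified log lines (not a real server's output).
sample_log = [
    '10.0.0.1 "GET / HTTP/1.0" 200 "Mozilla/5.0 (compatible; Yahoo! Slurp)"',
    '10.0.0.2 "GET /about HTTP/1.0" 200 "Mozilla/4.0 (compatible; MSIE 6.0)"',
    '10.0.0.3 "GET /news HTTP/1.0" 200 "Googlebot/2.1"',
    '10.0.0.1 "GET /news HTTP/1.0" 200 "Mozilla/5.0 (compatible; Yahoo! Slurp)"',
]
print(count_bot_hits(sample_log))
```

Anyone can put a bot’s name in a user-agent string, so treat these counts as a hint rather than proof of who visited.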

These busy spiders crawling the web behind the scenes have a life of their own. Count on them becoming more important as the volume of content on the Internet continues to expand!

