A search engine, at its most basic, is a system built to retrieve information quickly from a computer system. While we usually mean searching the World Wide Web, the definition covers everything from a corporate intranet to a database, right down to a single computer. So Google Desktop Search, Windows Vista’s desktop search, Linux’s Beagle program, and the Unix tools “grep” and “locate” are all examples of search engines as well.
Because grep (loosely, “global regular expression print”) is such an old and established program, it was a synonym for “search” long before we started saying “google” as a verb. Grep also gives us a good starting place for understanding the simplest type of search engine. For instance, firing off the following command in a Unix console:
grep -Hir "muffin" ./
will produce results similar to Google’s for your current directory or hard drive. The options are -H, which prints the filename for each match; -i, which ignores case; and -r, which searches recursively within all directories below the current one. Running this from the root “/” directory will search your entire hard drive for the search term (and may take a long time to complete)! Grep prints the name of each matching document and the relevant lines where the matching text occurred, much as an Internet search engine does.
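To see the output format for yourself, here is a minimal, self-contained run (the directory and file names are made up for the demonstration):

```shell
# Set up a tiny directory tree with two files mentioning "muffin"
mkdir -p /tmp/grep-demo/sub
echo "Bran Muffin recipe" > /tmp/grep-demo/note.txt
echo "muffin tins on sale" > /tmp/grep-demo/sub/ad.txt

cd /tmp/grep-demo
grep -Hir "muffin" ./
# Each match is printed as "filename:matching line", e.g.:
# ./note.txt:Bran Muffin recipe
# ./sub/ad.txt:muffin tins on sale
```

Note that -i let grep match “Muffin” with a capital M, and -r pulled in the file inside the subdirectory.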
Before we go further, let’s define a “string” as the sequence of characters being searched for: in the example above, “muffin” is the string.
The Unix “locate” program simply finds all files whose names contain the string; “locate jpg” will list every JPEG image file on the machine. This is much faster than using the Unix “find” command, and the reason why teaches us another important fact about search engines. The “locate” command searches a pre-built database, while “find” walks the filesystem and checks the name of every file as it goes. In other words, “locate” is optimized for speed at a cost in accuracy: its database is rebuilt (typically once a day, by the “updatedb” program) by crawling the machine’s directory tree.
If you’ve just deleted a bunch of files, “locate” will still list them where “find” won’t; “locate” has to wait until the next scheduled directory crawl before its database gets the news. This small, temporary sacrifice in accuracy is more than made up for by the time saved: scanning a simple list of filenames in one place is many times faster than walking the entire hard drive one directory and file at a time.
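The trade-off is easy to demonstrate by building a crude “index” by hand with “find” (the directory and database file names below are invented for the sketch; on a real system, “updatedb” maintains locate’s database for you):

```shell
# Create a small directory and "index" it once, the way updatedb does
mkdir -p /tmp/locate-demo
touch /tmp/locate-demo/photo.jpg /tmp/locate-demo/notes.txt
find /tmp/locate-demo -type f > /tmp/filelist.db

# Now delete a file. The disk and the index disagree:
rm /tmp/locate-demo/photo.jpg
grep "jpg" /tmp/filelist.db          # the stale index still lists photo.jpg
find /tmp/locate-demo -name "*.jpg"  # a live walk of the disk finds nothing
```

Searching the one flat file is fast but stale; walking the tree is accurate but slow. That is exactly the “locate” versus “find” trade-off.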
By putting these two concepts together, we can apply the same ideas to the much bigger world of connected computers that is the public Internet. The two ideas have to be combined – indexing the massive number of files on the Internet, and matching string patterns within them – to make Internet search engines into the same handy tools they are on the desktop. We’ll be exploring the world of these search engines and the mechanics of how they work in later installments…
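The combination can be sketched in a few lines of shell: one crawl snapshots file contents into an index, and every later query reads only the index, never the files themselves. (All paths and file contents here are assumptions for the demonstration; a real engine would use a far smarter index than a flat text dump.)

```shell
# Two sample "web pages"
mkdir -p /tmp/mini-engine
echo "blueberry muffin recipe" > /tmp/mini-engine/recipes.txt
echo "pancake batter tips"     > /tmp/mini-engine/tips.txt

# The "crawl": grep with an empty pattern dumps every line of every file,
# tagged with its filename, into one index file (kept outside the crawl dir)
cd /tmp/mini-engine
grep -Hr "" . > /tmp/index.txt

# The "query": string matching against the index alone
grep -i "muffin" /tmp/index.txt
# prints the indexed line from recipes.txt, not from tips.txt
```

This is the whole architecture of a search engine in miniature: an indexing pass like locate’s database build, and a pattern-matching pass like grep.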