I am republishing (without permission) here these tips about Google Search from “Google Hacks”, a book by www.oreilly.com.
Search engines for large collections of data preceded the World Wide Web by decades. There were those massive library catalogs, hand-typed with painstaking precision on index cards and eventually, to varying degrees, automated. There were the large data collections of professional information companies such as Dialog and LexisNexis. Then there are the still-extant private, expensive medical, real estate, and legal search services.
Those data collections were not always easy to search, but with a little finesse and a lot of patience, it was always possible to search them thoroughly. Information was grouped according to established ontologies, data preformatted according to particular guidelines.
Then came the Web.
Information on the Web, as anyone who has ever looked at half a dozen web pages knows, is not all formatted the same way. Nor is it necessarily particularly accurate. Nor up to date. Nor spellchecked. Nonetheless, search engines cropped up, trying to make sense of the rapidly increasing index of information online.
Eventually, special syntaxes were added for searching common parts of the average web page (such as title or URL). Search engines evolved rapidly, trying to encompass all the nuances of the billions of documents online, and they continue to evolve today.
Google™ threw its hat into the ring in 1998. The second incarnation of a search engine service known as BackRub, the name “Google” was a play on the word “googol,” a one followed by a hundred zeros. From the beginning, Google was different from the other major search engines online—AltaVista, Excite, HotBot, and others.
Was it the technology?
Partially. The relevance of Google’s search results was outstanding and worthy of comment. But more than that, Google’s focus and more human face made it stand out online.
With its friendly presentation and its constantly expanding set of options, it’s no surprise that Google continues to get lots of fans. There are weblogs devoted to it. Search engine newsletters, such as ResearchBuzz, spend a lot of time covering Google. Legions of devoted fans spend lots of time uncovering undocumented features, creating games (like Google whacking), and even coining new words (like “Googling,” the practice of checking out a prospective date or hire via Google’s search engine).
In April 2002, Google reached out to its fan base by offering the Google API. The Google API gives developers a legal way to access the Google search results with automated queries (any other way of accessing Google’s search results with automated software is against Google’s Terms of Service).
Why Google Hacks?
“Hacks” are generally considered to be “quick-n-dirty” solutions to programming problems or interesting techniques for getting a task done. But what does this kind of hacking have to do with Google?
Considering the size of the Google index, there are many times when you might want to do a particular kind of search and you get too many results for the search to be useful. Or you may want to do a search that the current Google interface does not support.
The idea of Google Hacks is not to give you some exhaustive manual of how every command in the Google syntax works, but rather to show you some tricks for making the best use of a search and show applications of the Google API that perform searches that you can’t perform using the regular Google interface. In other words, hacks.
Dozens of programs and interfaces have sprung up from the Google API. Both games and serious applications using Google’s database of web pages are available from everybody from the serious programmer to the devoted fan (like me).
1.1 Hacks #1-28
Google’s front page is deceptively simple: a search form and a couple of buttons. Yet that basic interface—so alluring in its simplicity—belies the power of the Google engine underneath and the wealth of information at its disposal. And if you use Google’s search syntax to its fullest, the Web is your research oyster.
But first you need to understand what the Google index isn’t.
1.2 What Google Isn’t
The Internet is not a library. The library metaphor presupposes so many things—a central source for resource information, a paid staff dutifully indexing new material as it comes in, a well understood and rigorously adhered-to ontology—that trying to think of the Internet as a library can be misleading.
Let’s take a moment to dispel some of these myths right up front:
- Google’s index is a snapshot of all that there is online. No search engine, not even Google, knows everything. There’s simply too much, and it’s all flowing too fast to keep up. Then there’s the content Google notices but chooses not to index at all: movies, audio, Flash animations, and innumerable specialty data formats.
- Everything on the Web is credible. It’s not. There are things on the Internet that are biased, distorted, or just plain wrong, whether intentionally or not. Visit the Urban Legends Reference Pages (http://www.snopes.com/) for a taste of the kinds of urban legends and other misinformation making the rounds of the Internet.
- Content filtering will protect you from offensive material. While Google’s optional content filtering is good, it’s certainly not perfect. You may well come across an offending item among your search results.
- Google’s index is a static snapshot of the Web. It simply cannot be so. The index, as with the Web, is always in flux. A perpetual stream of spiders deliver newfound pages, note changes, and report pages now gone. And the Google methodology itself changes as its designers and maintainers learn. Don’t get into a rut of searching a particular way; to do so is to deprive yourself of the benefit of Google’s evolution.
1.3 What Google Is
The way most people use an Internet search engine is to drop in a couple of keywords and see what turns up. While in certain domains that can yield some decent results, it’s becoming less and less effective as the Internet gets larger and larger.
Google provides some special syntaxes to help guide its engine in understanding what you’re looking for. This section of the book takes a detailed look at Google’s syntax and how best to use it. Briefly:
Within the page
Google supports syntaxes that allow you to restrict your search to certain components of a page, such as the title or the URL.
Kinds of pages
Google allows you to restrict your search to certain kinds of pages, such as sites from the educational (EDU) domain or pages that were indexed within a particular period of time.
Kinds of content
With Google, you can find a variety of file types; for example, Microsoft Word documents, Excel spreadsheets, and PDF files. You can even find specialty web pages the likes of XML, SHTML, or RSS.
Special collections
Google has several different search properties, but some of them aren’t as removed from the web index as you might think. You may be aware of Google’s index of news stories and images, but did you know about Google’s university searches? Or how about the special searches that allow you to restrict your searches by topic, to BSD, Linux, Apple, Microsoft, or the U.S. government?
These special syntaxes are not mutually exclusive. On the contrary, it’s in the combination that the true magic of Google lies. Search for certain kinds of pages in special collections or different page elements on different types of pages. If you get one thing out of this book, get this: the possibilities are (almost) endless.
This book can teach you techniques, but if you just learn them by rote and then never apply them, they won’t do you any good. Experiment. Play. Keep your search requirements in mind and try to bend the resources provided in this book to your needs—build a toolbox of search techniques that works specifically for you.
1.4 Google Basics
Generally speaking, there are two types of search engines on the Internet. The first is called the searchable subject index. This kind of search engine searches only the titles and descriptions of sites, and doesn’t search individual pages. Yahoo! is a searchable subject index. Then there’s the full-text search engine, which uses computerized “spiders” to index millions, sometimes billions, of pages. These pages can be searched by title or content, allowing for much narrower searches than a searchable subject index. Google is a full-text search engine.
Whenever you search for more than one keyword at a time, a search engine has a default method for handling those keywords. Will the engine search for both keywords or for either keyword?
The answer is called a Boolean default; search engines can default to Boolean AND (it’ll search for both keywords) or Boolean OR (it’ll search for either keyword). Of course, even if a search engine defaults to searching for both keywords (AND) you can usually give it a special command to instruct it to search for either keyword (OR). But the engine has to know what to do if you don’t give it instructions.
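To make the difference between the two defaults concrete, here is a minimal sketch of my own (not from the book). The `matches` function and the two sample pages are hypothetical, and real engines do far more than split text on whitespace.

```python
def matches(page_text, keywords, default="AND"):
    """Return True if the page satisfies the keywords under the given Boolean default."""
    words = set(page_text.lower().split())
    hits = [kw.lower() in words for kw in keywords]
    # AND: every keyword must be present; OR: any one keyword is enough.
    return all(hits) if default == "AND" else any(hits)


# Two made-up pages, just for illustration.
pages = {
    "a": "honda snowblower for sale",
    "b": "snowmobile trails near town",
}

for name, text in pages.items():
    print(
        name,
        matches(text, ["snowblower", "honda"], default="AND"),
        matches(text, ["snowblower", "snowmobile"], default="OR"),
    )
# a True True
# b False True
```

Page "b" fails the AND query because it contains neither keyword, but passes the OR query on "snowmobile" alone.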
1.4.1 Basic Boolean
Google’s Boolean default is AND; that means if you enter query words without modifiers, Google will search for all of them. If you search for:
snowblower Honda “Green Bay”
Google will search for all the words. If you want to specify that either word is acceptable, you put an OR between each item:
snowblower OR snowmobile OR “Green Bay”
If you want to definitely have one term and have one of two or more other terms, you group them with parentheses, like this:
snowblower (snowmobile OR “Green Bay”)
This query searches for the word “snowmobile” or phrase “Green Bay” along with the word “snowblower.” A stand-in for OR borrowed from the computer programming realm is the | (pipe) character, as in:
snowblower (snowmobile | “Green Bay”)
If you want to specify that a query item must not appear in your results, use a - (minus sign or dash):
snowblower snowmobile -“Green Bay”
This will search for pages that contain both the words “snowblower” and “snowmobile,” but not the phrase “Green Bay.”
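If you generate queries from a script, the rules above are easy to put together mechanically. The following is a rough sketch under my own assumptions: `quote` and `build_query` are hypothetical helper names, and the output uses plain straight double quotes around phrases.

```python
def quote(term):
    """Wrap multi-word phrases in double quotes so they are searched as a unit."""
    return f'"{term}"' if " " in term else term


def build_query(all_of=(), any_of=(), none_of=()):
    """Assemble a query: required terms, one parenthesized OR group, excluded terms."""
    parts = [quote(t) for t in all_of]
    if any_of:
        parts.append("(" + " OR ".join(quote(t) for t in any_of) + ")")
    parts.extend("-" + quote(t) for t in none_of)
    return " ".join(parts)


print(build_query(all_of=["snowblower"], any_of=["snowmobile", "Green Bay"]))
# snowblower (snowmobile OR "Green Bay")
print(build_query(all_of=["snowblower", "snowmobile"], none_of=["Green Bay"]))
# snowblower snowmobile -"Green Bay"
```

Swapping `" OR "` for `" | "` would produce the pipe form of the same query.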
1.4.2 Simple Searching and Feeling Lucky
The I’m Feeling Lucky™ button is a thing of beauty. Rather than giving you a list of search results from which to choose, you’re whisked away to what Google believes is the most relevant page given your search, a.k.a. the first result in the list. Entering washington post and clicking the I’m Feeling Lucky button will take you directly to http://www.washingtonpost.com/. Trying president will land you at http://www.whitehouse.gov/.
1.4.3 Just in Case
Some search engines are “case sensitive”; that is, they search for queries based on how the queries are capitalized. A search for “GEORGE WASHINGTON” on such a search engine would not find “George Washington,” “George washington,” or any other case combination. Google is not case sensitive. If you search for Three, three, or THREE, you’re going to get the same results.
1.4.4 Other Considerations
There are a couple of other considerations you need to keep in mind when using Google.
First, Google does not accept more than 10 query words, special syntax included. If you try to use more than 10, the extra words will be summarily ignored. There are, however, workarounds [Hack #5].
Second, Google does not support “stemming,” the ability to use an asterisk (or other wildcard) in the place of letters in a query term. For example, moon* in a search engine that supported stemming would find “moonlight,” “moonshot,” “moonshadow,” etc.
Google does, however, support an asterisk as a full word wildcard [Hack #13]. Searching for “three * mice” in Google would find “three blind mice,” “three blue mice,” “three red mice,” and so forth. On the whole, basic search syntax along with forethought in keyword choice will get you pretty far. Add to that Google’s rich special syntaxes, described in the next section, and you’ve one powerful query language at your disposal.
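Both considerations can be imitated locally, which helps when you are checking a query before you send it. This is only a sketch with assumptions of my own: I count query words by splitting on whitespace (how Google actually counts is not spelled out here), and I treat each asterisk as standing for exactly one whole word. `truncate_query` and `wildcard_to_regex` are hypothetical names.

```python
import re

MAX_QUERY_WORDS = 10  # the limit stated in the text above


def truncate_query(query, limit=MAX_QUERY_WORDS):
    """Split a query into the words that fit the limit and the words that would be ignored.

    Assumption: words are counted by splitting on whitespace.
    """
    words = query.split()
    return " ".join(words[:limit]), words[limit:]


def wildcard_to_regex(phrase):
    """Turn a phrase such as 'three * mice' into a regex where * matches one whole word."""
    tokens = [r"\w+" if tok == "*" else re.escape(tok) for tok in phrase.split()]
    return re.compile(r"\b" + r"\s+".join(tokens) + r"\b", re.IGNORECASE)


kept, ignored = truncate_query("a b c d e f g h i j k l")
print(kept)     # a b c d e f g h i j
print(ignored)  # ['k', 'l']

pattern = wildcard_to_regex("three * mice")
print(bool(pattern.search("Three blind mice, see how they run")))  # True
print(bool(pattern.search("three mice")))                          # False
```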
1.5 The Special Syntaxes
In addition to the basic AND, OR, and quoted strings, Google offers some rather extensive special syntaxes for honing your searches.
Google being a full-text search engine, it indexes entire web pages instead of just titles and descriptions. Additional commands, called special syntaxes, let Google users search specific parts of web pages or specific types of information. This comes in handy when you’re dealing with 2 billion web pages and need every opportunity to narrow your search results.
Specifying that your query words must appear only in the title or URL of a returned web page is a great way to have your results get very specific without making your keywords themselves too specific.
Some of these syntaxes work well in combination. Others fare not quite as well. Still others do not work at all. For detailed discussion on what does and does not mix, see [Hack #8].
intitle: restricts your search to the titles of web pages. The variation, allintitle:, finds pages wherein all the words specified make up the title of the web page. It’s probably best to avoid the allintitle: variation, because it doesn’t mix well with some of the other syntaxes.
allintitle:“money supply” economics
inurl: restricts your search to the URLs of web pages. This syntax tends to work well for finding search and help pages, because they tend to be rather regular in composition.
An allinurl: variation finds all the words listed in a URL but doesn’t mix well with some other special syntaxes.
intext: searches only body text (i.e., ignores link text, URLs, and titles). There’s an allintext: variation, but again, this doesn’t play well with others. While its uses are limited, it’s perfect for finding query words that might be too common in URLs or link titles.
inanchor: searches for text in a page’s link anchors. A link anchor is the descriptive text of a link. For example, the link anchor in the HTML code <a href="http://www.oreilly.com">O’Reilly and Associates</a> is “O’Reilly and Associates.”
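To see exactly which part of a link counts as the anchor, here is a small sketch that pulls anchor text out of a snippet of HTML using Python’s standard html.parser module. The `AnchorCollector` class is my own hypothetical helper, and it says nothing about how Google itself extracts anchors.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from <a href="..."> elements."""

    def __init__(self):
        super().__init__()
        self.anchors = []   # list of (href, anchor text)
        self._href = None   # href of the link currently being read, if any
        self._text = []     # pieces of text seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


collector = AnchorCollector()
collector.feed(
    '<p>Visit <a href="http://www.oreilly.com">O\'Reilly and Associates</a> today.</p>'
)
print(collector.anchors)
# [('http://www.oreilly.com', "O'Reilly and Associates")]
```

The words "Visit" and "today" sit outside the link, so they are not part of the anchor.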
site: allows you to narrow your search by either a site or a top-level domain. AltaVista, for example, has two syntaxes for this function (host: and domain:), but Google has only the one.
link: returns a list of pages linking to the specified URL. Enter link:www.google.com and you’ll be returned a list of pages that link to Google. Don’t worry about including the http:// bit; you don’t need it, and, indeed, Google appears to ignore it even if you do put it in. link: works just as well with “deep” URLs (http://www.raelity.org/apps/blosxom/, for instance) as with top-level URLs such as raelity.org.
cache: finds a copy of the page that Google indexed even if that page is no longer available at its original URL or has since changed its content completely. This is particularly useful for pages that change often.
If Google returns a result that appears to have little to do with your query, you’re almost sure to find what you’re looking for in the latest cached version of the page at Google.
cache:www.yahoo.com
daterange: limits your search to a particular date or range of dates that a page was indexed. It’s important to note that the search is limited not by when a page was created, but by when it was indexed by Google. So a page created on February 2 and not indexed by Google until April 11 could be found with a daterange: search on April 11.
Remember also that Google reindexes pages. Whether the date range changes depends on whether the page content changed. For example, Google indexes a page on June 1. Google reindexes the page on August 13, but the page content hasn’t changed. The date for the purpose of searching with daterange: is still June 1.
Note that daterange: works with Julian [Hack #12], not Gregorian dates (the calendar we use every day). There are Gregorian/Julian converters online, but if you want to search Google without all that nonsense, use the FaganFinder Google interface (http://www.faganfinder.com/engines/google.shtml), offering daterange: searching via a Gregorian date pull-down menu. Some of the hacks deal with daterange: searching without headaches, so you’ll see this popping up again and again in the book.
“George Bush” daterange:2452389-2452389
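If you would rather do the conversion yourself than rely on an online converter, it comes down to one addition. The sketch below is mine, not the book’s: it assumes daterange: takes whole-number Julian day numbers (as in the example above), and `julian_day` and `daterange` are hypothetical helper names. The constant is the fixed offset between Python’s `date.toordinal()` and the Julian day number of the same calendar date.

```python
from datetime import date

# date(1, 1, 1).toordinal() is 1, and that day's Julian day number is 1721426,
# so the two counts always differ by 1721425.
JDN_OFFSET = 1721425


def julian_day(d):
    """Return the (integer) Julian day number for a Gregorian calendar date."""
    return d.toordinal() + JDN_OFFSET


def daterange(start, end=None):
    """Build a daterange: term; a single date is expressed as a one-day range."""
    end = end or start
    return f"daterange:{julian_day(start)}-{julian_day(end)}"


print(julian_day(date(2000, 1, 1)))   # 2451545
print(daterange(date(2002, 4, 24)))   # daterange:2452389-2452389
```

By this arithmetic, the Julian date 2452389 used in the example above corresponds to April 24, 2002.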
filetype: searches the suffixes or filename extensions. These are usually, but not necessarily, different file types. I like to make this distinction, because searching for filetype:htm and filetype:html will give you different result counts, even though they’re the same file type. You can even search for different page generators, such as ASP, PHP, CGI, and so forth, presuming the site isn’t hiding them behind redirection and proxying. Google indexes several different Microsoft formats, including: PowerPoint (PPT), Excel (XLS), and Word (DOC).
“leading economic indicators” filetype:ppt
related:, as you might expect, finds pages that are related to the specified page. Not all pages are related to other pages. This is a good way to find categories of pages; a search for related:google.com would return a variety of search engines, including HotBot, Yahoo!, and Northern Light.
info: provides a page of links to more information about a specified URL. Information includes a link to the URL’s cache, a list of pages that link to that URL, pages that are related to that URL, and pages that contain that URL. Note that this information is dependent on whether Google has indexed that URL or not. If Google hasn’t indexed that URL, information will obviously be more limited.
phonebook:, as you might expect, looks up phone numbers. For a deeper look, see the section [Hack #17].
phonebook:John Doe CA
As with anything else, the more you use Google’s special syntaxes, the more natural they’ll become to you. And Google is constantly adding more, much to the delight of regular webcombers. If, however, you want something more structured and visual than a single query line, Google’s Advanced Search should fit the bill.
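Since a query with special syntaxes is full of colons and quotes, it needs to be escaped before it goes into a link or bookmark. Here is a minimal sketch using Python’s standard urllib.parse module. Note my assumptions: the /search path and the q parameter are not mentioned in the text above, `search_url` is a hypothetical helper, and whether a given combination of syntaxes returns useful results is something to find out by experiment.

```python
from urllib.parse import urlencode

# Assumption: the results page lives at /search and reads the query from "q".
SEARCH_URL = "http://www.google.com/search"


def search_url(*parts):
    """Join query fragments with spaces and return an escaped search URL."""
    query = " ".join(parts)
    return SEARCH_URL + "?" + urlencode({"q": query})


print(search_url('intitle:"money supply"', "site:edu"))
# http://www.google.com/search?q=intitle%3A%22money+supply%22+site%3Aedu
```

urlencode turns spaces into plus signs and escapes the colon and the double quotes, so the query survives the trip intact.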
1.6 Advanced Search
The Google Advanced Search goes well beyond the capabilities of the default simple search, providing a powerful fill-in form for date searching, filtering, and more.
Google’s default simple search allows you to do quite a bit, but not all. The Google Advanced Search (http://www.google.com/advanced_search?hl=en) page provides more options such as date search and filtering, with “fill in the blank” searching options for those who don’t take naturally to memorizing special syntaxes.
Most of the options presented on this page are self-explanatory, but we’ll take a quick look at the kinds of searches that you really can’t do with any ease using the simple search’s single text-field interface.
1.6.1 Query Word Input
Because Google uses Boolean AND by default, it’s sometimes hard to logically build out the nuances of just the query you’re aiming for. Using the text boxes at the top of the Advanced Search page, you can specify words that must appear, exact phrases, lists of words, at least one of which must appear, and words to be excluded.
1.6.2 Language
Using the Language pull-down menu, you can specify what language all returned pages must be in, from Arabic to Turkish.
1.6.3 Filtering
Google’s Advanced Search further gives you the option to filter your results using SafeSearch. SafeSearch filters only explicit sexual content (as opposed to some filtering systems that filter pornography, hate material, gambling information, etc.). Please remember that machine filtering isn’t 100% perfect.
1.6.4 File Format
The file format option lets you include or exclude several different Microsoft file formats, including Word and Excel. There are a couple of Adobe formats (most notably PDF) and Rich Text Format as options here too. This is where the Advanced Search is at its most limited; there are literally dozens of file formats that Google can search for, and this set of options represents only a small subset.
1.6.5 Date
Date allows you to specify search results updated in the last three months, six months, or year. This date search is much more limited than the daterange: syntax [Hack #11], which can give you results as narrow as one day, but Google stands behind the results generated using the date option on the Advanced Search, while not officially supporting the use of the daterange: search.
The rest of the page provides individual search forms for other Google properties, including news search, page-specific search, and links to some of Google’s topic-specific searches. The news search and other topic-specific searches work independently of the main advanced search form at the top of the page.
The advanced search page is handy when you need to use its unique features or you need some help putting a complicated query together. Its “fill in the blank” interface will come in handy for the beginning searcher or someone who wants to get an advanced search exactly right.
That said, bear in mind it is limiting in other ways; it’s difficult to use mixed syntaxes or build a single syntax search using OR. For example, there’s no way to search for (site:edu OR site:org) using the Advanced Search.
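A query like that is easy enough to type into the simple search box by hand, or to generate. As a quick sketch (the `any_site` helper is a name I made up for illustration):

```python
def any_site(*domains):
    """Build a parenthesized OR group of site: restrictions."""
    return "(" + " OR ".join(f"site:{d}" for d in domains) + ")"


query = '"leading economic indicators" ' + any_site("edu", "org")
print(query)
# "leading economic indicators" (site:edu OR site:org)
```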
Of course, there’s another way you can alter the search results that Google gives you, and it doesn’t involve the basic search input or the advanced search page. It’s the preferences page.
MORE ON “GOOGLE HACKS” NEXT TIME…