aiic.net
 
December 2000 - Searching the Internet

Advanced Searching on the WWW

In a previous article, we indicated how useful the World Wide Web can be for conference preparation. This month, we will be describing some advanced search techniques to help you forage for relevant information on the Web.

There are two points you must keep in mind when using a search engine:

  • you are only searching through the engine's collection of WWW documents and not the World Wide Web itself;
  • you can only retrieve WWW documents based on the exact words they contain.

The one thing you need to remember is that search engines are not 'intelligent' agents that categorise websites and pages by publisher, target audience, theme or content. All they do is send out robots that spider the Web and grind pages to extract the exact words found on them. That information is then compiled in the search engine's database in the form of indices.

Thus, when you search with a search engine, your first step should be to try and guess what words are likely to be used in the documents you're looking for. Search engines work with words, not concepts. Unless you choose your words well, the needle in the haystack will most probably not come your way.
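To make the "words, not concepts" point concrete, here is a minimal sketch of the kind of word index described above. It is an illustration only: the three sample pages and the function names are invented for this example, and a real engine's index is far more elaborate.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map every word to the set of page ids on which it occurs."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index

# Hypothetical mini-collection standing in for an engine's database.
pages = {
    "p1": "Relocation services for families moving to Spain",
    "p2": "Zidane may move to Spain next season",
    "p3": "Moving house in Brussels",
}
index = build_index(pages)

print(sorted(index["spain"]))  # pages containing the exact word "spain"
print(sorted(index["move"]))   # "moving" is a different word, so p1 and p3 are missed
```

Note that a query for "move" misses the pages that say "moving": the index only knows the exact words it has seen.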

Let's give an example. You're an interpreter currently based in cold and rainy Brussels and you're thinking of moving to Spain. You would like to find some tips or help on the Internet. So you go to Alta Vista and type "move Spain". The next thing you know, your search has retrieved 5,301,410 pages and you learn that Zidane was also toying with the idea... Are you going to hit the Next button and wade through all the links until you find what you want? Clearly not. Your only hope is to find better keywords. Try "relocation services spain", for example (1,414,695 pages found, but some interesting links at the top).

However, given the sheer size of the Web, finding the right keywords is not enough. Another example will make that clear. You're slated for an automotive industry conference next week and the participants will be talking about body painting. So you go to Alta Vista and type "body painting". Don't be surprised if you then get heaps of links to sites telling you how the women of Papua and World Cup fans alike paint their bodies. Of course, you wanted your search to be restricted to the car industry. But the Internet is not organised semantically, far from it. Fortunately, keywords can be combined intelligently to narrow down your search to one particular field. Mastering such techniques will enable you to formulate a query like [ "body painting" AND ( industry AND (car OR automotive)) sort by "process" ], retrieving a limited number of well-focused hits, with relevant websites like www.precisionpainting.com.

So, how does that work?

Enter Boolean operators. George Boole (1815-1864) was an English mathematician who attempted to prove the existence of God through formal logic. His greatest legacy in this day and age is a system of logical operators that lies at the heart of computing. Proper use of those logical operators is essential to skim through today's massive WWW.

Boolean logic may sound forbidding but its basic principles are really straightforward. Any interpreter who can make neat piles of documents for the coming week's meetings, broken down by day or client, is actually using Boolean logic without knowing it.

These operators are very useful in most search engines because they allow you to expand or contract your search results.

There are basically three logical operators and each of them can be visually described by using Venn diagrams (remember Set Theory?), as shown below:

"AND"

 

The "AND" operator is used to look for search terms in combination. If we were doing a search for documents about INTERPRETER and TRANSLATOR and used the logical expression [interpreter AND translator], all of the resulting documents would have to contain both of the terms. The fact that they are all present in a document, though, does not mean that they're linked conceptually in any way. It just means that they all co-occur in the document somewhere. So don't be surprised if [venetian AND blind] retrieves information on Venetian blinds and blind Venetians.

"OR"

The "OR" operator is used to broaden your search. A query for documents about INTERPRETER or TRANSLATOR will expand the result set to contain links to pages with either of those terms or both. The OR operator is best used in combination with others to include synonyms.

"NOT", "AND NOT"

The "NOT" operator - invoked by "AND NOT" on Alta Vista - is used to narrow down the scope of your search and so will exclude records from your results. It is a very powerful operator that you should use carefully since the term you do want may be present in documents that also contain the word you wish to avoid.
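Since the three operators correspond to set intersection, union and difference, they can be sketched with Python sets. The page numbers below are made up for illustration; each set stands for the pages on which a word occurs.

```python
# Hypothetical result sets: ids of the pages containing each word.
interpreter = {1, 2, 3, 5}
translator = {2, 3, 4}

both = interpreter & translator              # [interpreter AND translator]
either = interpreter | translator            # [interpreter OR translator]
interpreter_only = interpreter - translator  # [interpreter AND NOT translator]

print(sorted(both))
print(sorted(either))
print(sorted(interpreter_only))
```

AND can only shrink a result set, OR can only grow it, and AND NOT removes every page that mentions the unwanted word, including pages you might have wanted to keep.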


Nesting

Very often, you will want to use two or more Boolean operators in the same query. You should then be careful to specify in what order you want the logic processed. Some search engines let you use parentheses to group expressions, a process called nesting.

In this example, the nested logic in ["body painting" AND ( industry AND (car OR automotive))] instructs the search engine to first shortlist all documents that contain either "car" and "industry" or "automotive" and "industry", and to extract from them those that also contain the phrase "body painting".

To search for pages with information on cold fusion theory, you could try [("cold fusion" AND theory) AND NOT (allaire OR middleware)]. In this case, we ask the search engine to select documents with both the phrase "cold fusion" and the word "theory", but also to filter out from the result set all documents that also contain the words "allaire" or "middleware". (Allaire is a company whose flagship product is a middleware package called Cold Fusion.)
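The way parentheses fix the order of evaluation can be sketched with a small recursive function. This is not how Alta Vista is implemented; the tuple notation, the function name and the sample pages are assumptions made for this example, and terms are matched as plain substrings for brevity (so "car" would also match "carpet").

```python
def matches(query, text):
    """Evaluate a nested query against the text of one page.

    A query is either a string (a word or phrase, matched here as a
    case-insensitive substring) or a tuple (operator, operand, ...).
    """
    if isinstance(query, str):
        return query.lower() in text.lower()
    op, *args = query
    if op == "AND":
        return all(matches(a, text) for a in args)
    if op == "OR":
        return any(matches(a, text) for a in args)
    if op == "NOT":
        return not matches(args[0], text)
    raise ValueError(f"unknown operator: {op}")

# ["body painting" AND (industry AND (car OR automotive))]
painting = ("AND", "body painting", ("AND", "industry", ("OR", "car", "automotive")))

# [("cold fusion" AND theory) AND NOT (allaire OR middleware)]
fusion = ("AND", ("AND", "cold fusion", "theory"), ("NOT", ("OR", "allaire", "middleware")))

# Hypothetical pages.
pages = {
    "a": "Body painting processes in the automotive industry",
    "b": "Body painting traditions of Papua",
    "c": "Car industry news",
}
hits = [page_id for page_id, text in pages.items() if matches(painting, text)]
print(hits)
```

The innermost expression is resolved first, exactly as in arithmetic with brackets.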


Proximity Operator

"NEAR"

In some cases, you will want to do more than just combine blind words or phrases using the three operators mentioned above, in the hope that they will land you an interesting catch. Combining search words with AND is all very well, but very often it turns out that it returns pages where the words occur haphazardly. This is where proximity operators come in handy.

To look for links to pages about mobile booths and ISO standards, you could try ["mobile booth*" NEAR iso] for instance. At the time of writing, Alta Vista returned 19 links whereas ["mobile booth*" AND iso] returned 40 links - 19 of which come from the AIIC website, as [("mobile booth*" AND iso) AND host:www.aiic.net] will tell you.

On Alta Vista, when two terms or phrases are linked with the NEAR operator, only documents where the terms appear within 10 words of one another will be retrieved, thereby indicating a higher probability of a conceptual link. This gives the NEAR operator considerably greater power in focusing in on a topic.
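A proximity test of this kind can be sketched as follows. The sketch is simplified to single words (no phrases or wildcards), and it reads "within 10 words" as a difference in word position of at most 10, which is an assumption about the exact rule.

```python
import re

def near(text, term_a, term_b, window=10):
    """True if the two words occur within `window` word positions of each other."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    positions_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    positions_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    return any(abs(i - j) <= window for i in positions_a for j in positions_b)

close = "ISO standards for mobile booths"
far = "iso " + "filler " * 20 + "booths"

print(near(close, "iso", "booths"))
print(near(far, "iso", "booths"))
```

Both sample texts would satisfy [iso AND booths]; only the first satisfies the NEAR test.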


Phrase Searching with double quotes

By enclosing multiple words in quotation marks, you can search for them as a phrase. ["conference interpretation"] will return only pages containing the character string "conference interpretation", and not any document where the words "conference" and "interpretation" occur separately.

Without the quotation marks, pages containing both or even either of these words – anywhere in the text and not necessarily together - will be returned. The first hit from Alta Vista is for the “Conference on the Optimisation of Interpretation”, which has nothing to do with conference interpretation.
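The difference between the two behaviours can be sketched in a few lines. The function names are invented for this example, and the second sample title is made up.

```python
import re

def words(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def contains_phrase(text, phrase):
    """True if the words of `phrase` occur consecutively and in order."""
    doc, target = words(text), words(phrase)
    n = len(target)
    return any(doc[i:i + n] == target for i in range(len(doc) - n + 1))

def contains_all_words(text, phrase):
    """True if every word of `phrase` occurs somewhere in the text."""
    return set(words(phrase)) <= set(words(text))

title = "Conference on the Optimisation of Interpretation"
print(contains_all_words(title, "conference interpretation"))
print(contains_phrase(title, "conference interpretation"))
print(contains_phrase("A guide to conference interpretation", "conference interpretation"))
```

The first title contains both words, so it passes the unquoted test, but it fails the phrase test.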


Truncation and stemming

Most search engines allow wildcards, i.e. symbols that can represent one character or a character string. The asterisk * is the most commonly accepted wildcard.

[consult* AND interpret* AND aiic] will look for documents with the words "aiic" and all words starting with "consult" and "interpret", such as consultant interpreter or consulting interpreters.

Some search engines take that logic even further and will try to determine stems in your query terms using a built-in dictionary.
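Truncation with the asterisk can be sketched with Python's standard `fnmatch` module, which understands the same `*` wildcard. The helper names and sample sentences are assumptions for this example; dictionary-based stemming is not covered.

```python
import re
from fnmatch import fnmatchcase

def words(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches_term(text, pattern):
    """True if any word of the text matches the wildcard pattern."""
    return any(fnmatchcase(word, pattern.lower()) for word in words(text))

def matches_all(text, patterns):
    """Equivalent of joining all the patterns with AND."""
    return all(matches_term(text, p) for p in patterns)

query = ["consult*", "interpret*", "aiic"]
print(matches_all("AIIC consultant interpreters", query))
print(matches_all("AIIC consulting interpreter", query))
print(matches_all("AIIC translators", query))
```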


Field searching

Most good search engines allow for field searching, which is an extremely powerful complement to the basic search features covered above. Field searching is especially useful for interpreters as it can be used to narrow a search to a country, language, or domain.

The syntax for field limiting is [field:keyword]. Be careful not to put a space before or after the colon. If you want to restrict your search to more than one word or phrase, you must repeat the field and the keyword to search for, such as [ field:keyword1 AND field:keyword2 ]

Let's review this technique using Alta Vista's field descriptors.

domain:

This command will restrict a search to only those pages located at a certain domain. A domain can be a country extension, like .fr for France or .ca for Canada, or a class descriptor, like .com for a commercial company, .gov for a governmental entity or .edu for a university.

Let’s look at an example. We know that what in British English are called “unit trusts” are referred to as “mutual funds” in American English. But how about in Hong Kong?

On Alta Vista, [domain:hk AND "unit trust*"] returns 1,135 hits whereas [domain:hk AND "mutual fund*"] returns 1,114 hits, which probably means that both terms are safe enough to use.

host:

This field allows you to search every indexed page in a website. For example, ["E. coli" AND host:unep.org] returns every indexed page on the UNEP domain (www.unep.org) that contains the term "E. coli".

On Google, you would achieve the same effect with the syntax [site:unep.org "E. coli"]. And since Google has indexed about all publicly available pages on the AIIC website (5,700 of them), you can effectively use that search engine to perform site-wide queries with the following syntax: [site:www.aiic.net your_query_terms].
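The domain: and host: restrictions both boil down to a test on the host name of each page's address. Here is a minimal sketch using Python's standard URL parser; the function names and the sample URLs are invented, and the exact matching rules used by Alta Vista or Google are an assumption.

```python
from urllib.parse import urlparse

def host_of(url):
    """Return the lowercase host name of a URL ('' if there is none)."""
    return (urlparse(url).hostname or "").lower()

def in_domain(url, domain):
    """domain: style test on the last label of the host name, e.g. 'hk'."""
    return host_of(url).split(".")[-1] == domain.lower().lstrip(".")

def on_host(url, host):
    """host: style test: the page is on `host` or on one of its subdomains."""
    name = host_of(url)
    host = host.lower()
    return name == host or name.endswith("." + host)

print(in_domain("http://www.example.com.hk/funds.html", "hk"))
print(on_host("http://www.unep.org/water/ecoli.html", "unep.org"))
print(on_host("http://www.notunep.org/", "unep.org"))
```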

link:

Using this technique, you can find all the indexed pages that link to a specified site. For example, [link:aiic.net] will return all the indexed sites that contain links to the AIIC website, including links from one page of the site to another. So you might want to try [standard* AND (link:aiic.net AND NOT host:aiic.net)] for a more precise list of the sites linking to us and including any term starting with "standard", for example.

like:

This operator finds pages similar to or related to the specified URL. For example, [like:www.aiic.net] finds websites that are related to conference interpretation.


Language searching

The language field is presented in most search engines in a menu near the query window. By selecting a language from the list of languages supported, you can limit search results to pages written in the specified language.

This is an extremely valuable strategy for conference interpreters. For example, let’s say we want to find Chinese terms for “civil society”.

In the Alta Vista query box, we type "civil society". Then, on the Languages menu, we select Chinese. This search returned 111 pages in Chinese that contained the English term “civil society”.

In reviewing the results, we noticed that there are several different Chinese terms used. In order to identify which of these are used in mainland China specifically, we can further refine the search:

"civil society" AND domain:cn      [with Chinese selected from the Languages menu]

This search returned 6 hits.


Date searching

This field is available in most good search engines, if not in basic search mode then in advanced search mode. It allows you to specify the earliest and/or latest date of the pages to be included. For example, in Alta Vista’s advanced search mode:

HIV AND AIDS         [From: 01/01/00]       [Language: German]

This search will return all indexed pages in German last modified after 1 January 2000 containing both "HIV" and "AIDS".
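Combining a Boolean query with language and date limits amounts to applying several filters at once. A minimal sketch, assuming hypothetical index records with `lang`, `modified` and `words` fields, and treating the From: date as inclusive:

```python
from datetime import date

# Hypothetical index records; the field names are assumptions.
pages = [
    {"url": "http://example.de/a", "lang": "de", "modified": date(2000, 3, 5),
     "words": {"hiv", "aids", "forschung"}},
    {"url": "http://example.de/b", "lang": "de", "modified": date(1999, 11, 2),
     "words": {"hiv", "aids"}},
    {"url": "http://example.org/c", "lang": "en", "modified": date(2000, 6, 1),
     "words": {"hiv", "aids"}},
    {"url": "http://example.de/d", "lang": "de", "modified": date(2000, 7, 9),
     "words": {"hiv"}},
]

def search(pages, required, lang, modified_from):
    """Return URLs of pages containing all required words, in the given
    language, last modified on or after `modified_from`."""
    required = {word.lower() for word in required}
    return [
        page["url"]
        for page in pages
        if required <= page["words"]
        and page["lang"] == lang
        and page["modified"] >= modified_from
    ]

hits = search(pages, ["HIV", "AIDS"], "de", date(2000, 1, 1))
print(hits)
```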


Document field limiting

Let's get even fancier. Any web page contains several elements that potentially hold searchable information. Most pages will have a title (shown in the browser's title bar) that will presumably give a fairly accurate description of its content. Likewise, pages may contain images or links with potentially interesting information.

On Alta Vista, you can further restrict your search to the following page elements:

[anchor:text] finds pages that contain the specified word or phrase in the text of a hyperlink.

[image:filename] finds pages with images with a specific filename. For instance, [("simultaneous interpret*") AND image:booth*] retrieves pages with pictures whose filename includes the string "booth".

[text:text] finds pages that contain the specified text in any part of the page other than an image tag, link, or URL. The search [host:www.aiic.net AND (text:european AND text:commission)] would find all pages with the terms "european" and "commission" on the AIIC website.

[title:text] finds pages that contain the specified word or phrase in the page title (which appears in the title bar of most browsers). The search [title:agreement AND host:www.aiic.net] would find pages on the AIIC server with the word "agreement" in the document title.

[url:text] finds pages with a specific word or phrase in the URL. Use [url:job* AND language* AND (interpret* OR translat*)] to find all documents on all servers that have the stem "job" anywhere in the host name, path, or filename and the words language* and interpret* or translat* on the page. Again, you must remember that the engine's hits are based on words and not concepts, so don't be surprised if the above retrieves URLs about the Book of Job and how to interpret its language.

The field parameters supported by each search engine and their syntax differ from engine to engine. I recommend that you print out the detailed search instructions provided for each engine (click on the “Help”, “Tutorial”, or other appropriate link from its main page).

Keeping these instructions handy while you are searching with the different engines will make your life a lot easier.


Relevance ranking

Search engines return results with confidence or relevancy rankings, which means that they rank the hits according to how closely they think the results match the query.

In most cases, the frequency with which your search terms occur in a document will determine whether the search engine 'thinks' the document is relevant. This very often becomes a problem when the search term is polysemous or very frequent. If you are flooded with irrelevant documents, try reformulating your query, using phrase searching and Boolean logic.

Some search engines also consider the positioning of keywords to determine relevancy: there is a greater likelihood that the document will be relevant if the keywords appear in the title of the page, or early in the document, or in the headers.
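A frequency-and-position score of this kind can be sketched as below. The weighting (a title hit counts three times as much as a body hit) and the sample pages are invented for illustration; no engine's real formula is implied.

```python
import re

def words(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score(page, query_terms, title_weight=3):
    """Count occurrences of the query terms, weighting title hits more heavily."""
    terms = {term.lower() for term in query_terms}
    body_hits = sum(1 for word in words(page["body"]) if word in terms)
    title_hits = sum(1 for word in words(page["title"]) if word in terms)
    return body_hits + title_weight * title_hits

# Hypothetical pages.
pages = [
    {"title": "Holiday notes", "body": "We attended a conference."},
    {"title": "Conference interpretation", "body": "Interpretation at a conference."},
]
query = ["conference", "interpretation"]
ranked = sorted(pages, key=lambda page: score(page, query), reverse=True)
print([page["title"] for page in ranked])
```

A scheme like this is easily fooled by a polysemous or very common term, since it counts words without regard to their meaning.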

On Alta Vista Advanced Search, you can use the Sort by: box to assign relevance weights to your query terms before conducting a search. Although the feature takes some getting used to, it offers the best relevancy results. But even sophisticated sorting features can be defeated by some extremely common words or phrases and call for yet other relevancy ranking techniques.

Try Alta Vista for "Bill Clinton" for instance, sorted on the words "bill" and "clinton". Now try the same search, "Bill Clinton", on Google. The results are reproduced below.

Google's first match is an official website, www.whitehouse.gov, whereas Alta Vista's is the '"Unofficial" Bill Clinton'. The reason is that Google uses a special ranking algorithm based on how many 'good' sites link to the matching page, along with other factors like the proximity of your search keywords or phrases in the documents. It claims not only to use the number of other links, but also the 'quality' of the other links.
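The idea that a link from a 'good' page counts for more than a link from an obscure one can be sketched with a simplified iterative link score. This is a toy illustration of the principle, not Google's actual algorithm; the damping value, the function name and the five-page link graph are all assumptions.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Score pages so that links from high-scoring pages count for more."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page shares its score equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: each page maps to the pages it links to.
links = {
    "official": [],
    "news": ["official"],
    "fanpage": ["official", "unofficial"],
    "directory": ["official", "news"],
    "unofficial": [],
}
rank = link_rank(links)
best = max(rank, key=rank.get)
print(best)
```

In this toy graph the page with the most incoming links comes out on top, whatever words it happens to contain.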


Sub-searching

Very often, you will need to refine your searches to improve relevancy. Alta Vista's Sort by is a powerful tool to search within results. Another engine, Northern Light, has developed an interesting sub-searching technique.

Northern Light's Custom Search Folders group your results by subject (e.g. translation companies, computers), type of document (e.g. press releases, product reviews), source (e.g. commercial sites, personal pages, magazines, encyclopaedias, databases) and language.

[Figure: results for ["conference interpreters"] on Northern Light. Drill down on one of Northern Light's automatically generated folders.]
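Grouping a result list into folders can be sketched as follows. How Northern Light actually classifies pages is not described here, so the labels in this example are simply assigned by hand, and the result titles are invented.

```python
from collections import defaultdict

# Hypothetical, hand-labelled results.
results = [
    {"title": "Acme Interpreting Ltd", "source": "commercial site", "language": "English"},
    {"title": "My life in the booth", "source": "personal page", "language": "English"},
    {"title": "Dolmetscher GmbH", "source": "commercial site", "language": "German"},
]

def group_by(results, field):
    """Group result titles into folders keyed by the value of `field`."""
    folders = defaultdict(list)
    for result in results:
        folders[result[field]].append(result["title"])
    return dict(folders)

by_source = group_by(results, "source")
by_language = group_by(results, "language")
print(by_source)
print(by_language)
```

Opening a folder is then just a matter of looking at one key, which is what makes drilling down so quick.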

 


With AND, NOT, OR, phrase and field searching under your belt, you are ready to do some serious foraging on the World Wide Web. The snag is that some of the techniques described above, and most of the syntax will only work on one or two search engines. Our reference tool for this paper was Alta Vista Advanced Search, which is possibly the most sophisticated - but not the easiest to use - search engine to date.

Does this mean that you should only use Alta Vista? By no means. Don't forget that no individual search engine can claim to have indexed all the WWW so using them in combination is often essential.

So, what general search engines can we recommend? Our pick includes Google, Alta Vista Advanced Search, Northern Light Power, and Fast Search Advanced because they all have big collections and useful distinguishing features. Google doesn't support Alta Vista's powerful Boolean logic but makes up for it by its no-frills approach and powerful ranking algorithm. Northern Light will help you narrow down your searches with its patented Folders technique. Fast Search has a huge database and a split-second response time.

To give you an overview, you will find a cheat sheet detailing the strengths and weaknesses of each of them here.

Before you go on, I suggest that you practise Boolean logic as described above in Alta Vista Advanced Search (be careful, Alta Vista simple search does not support Booleans) to find material that is useful and relevant to you.

After that, I would encourage you to go to the other engines listed at the top of this section, print out the help files you will find there, take note of the particular syntax for each engine, and then practise using these techniques to perform the same searches in the other engines. Be sure to compare the results you obtain from the same search in different search engines; don’t be surprised if you find very useful results in one engine that are not indexed in another.

Congratulations! You are now equipped with all the tools you will need to find targeted, multilingual information anywhere on the publicly indexable Web. A few hours invested in mastering these techniques will pay off in future by making your conference preparation and terminological research efforts much more focused and efficient.




Message Board

  Judith Jacobson   
Date: 28 Dec 2000 18:22
Subject: Excellent material

This article really gives a head start to those who have not ventured out into the search engine world. It is a very clear, to the point, informative article. Many thanks for an excellent job.

  J. Ana Johnson   
Date: 28 Dec 2000 20:28
Subject: effective use of search engines

Thank you very much indeed for doing this excellent research on how to use search engines more effectively! I have only just begun reading over your material and am already very excited about what I am learning, especially as I am struggling to create a web page for my 20-year-old company (based in Calgary). This is essential information for those of us who are, fortunately and/or unfortunately, relying more and more on the Internet for our bread and butter. Again, merci bien. Vielen Dank. ¡Muchísimas gracias! - Judith Ana Johnson.

  Manuel Sant'Iago Ribeiro   
Date: 30 Dec 2000 13:47
Subject: booleans and other creepy crawleys

brill stuff!
if only I had known all of this when first setting out to use search engines...but then of course, to have found this info elsewhere I probably would have needed a smattering of booleans to begin with...
thanks a bunch, Vincent et al...and excellent 2001!
:-))
m.

  B. Simons-Fischer   
Date: 2 Jan 2001 11:53
Subject: Thanks

Dear Vincent and other colleagues,

thank you for the trouble you take to instruct us "digibets" on how to use the www for our conference preparations (and private searches). Although I personally have been using the net for the last two years intensively for that purpose and think that I find my way about it pretty easily, I always am interested in your articles, just in case there is something that is new for me. And, indeed, I still find things which I either did not know or did more complicatedly.

To all of you a happy and healthy 2001!
Kind regards,
Bärbel Simons

  bettina palaschewski   
Date: 6 Jan 2001 12:59
Subject: Your article on search engines

great belated X-mas gift, thanks and all the best for 2001

  Susan Asselin   
Date: 19 Jan 2001 19:03
Subject: Research on the www.

Thank you for both articles. I am sure this information will be very useful in future research endeavors.

  Danielle Grée   
Date: 5 Feb 2001 17:10
Subject: complement and compliment

Thank you, Vincent! That was even more thorough than your course at the SMP in Nice. Glad to hear you are looking into moving to Spain ;-)
Danielle



 © 1998-2010, AIIC