Emerald Group Publishing Limited
Copyright © 2001, MCB UP Limited
Searching the Internet
The Internet is a major tool for accessing knowledge resources located worldwide. As these resources grow the task of searching for the appropriate resource for a particular piece of research becomes more and more difficult. Powerful search engines make the task possible but, if inappropriately used, result in tens of thousands of possible sites. Associate Professor Charlotte Wien, University of Southern Denmark, provides a practical guide to more effective searches and writes from the perspective of an academic teaching others to retrieve information via the Net.
Teaching information retrieval is not an easy task. I know because I have been teaching it for a number of years. It is not easy for several reasons, some related to student expectations and some related to the technological need for computers with an Internet connection while teaching. The latter problem can be overcome, although it is frustrating to have to cancel lessons because of a hacker attack on the university system, a server breakdown somewhere, or because the telecommunications company decided to pull the plug or take down the backbone machine for a couple of hours, and so on.
Student expectations are, in fact, what makes information retrieval difficult to teach. All students have tried searching the www using AltaVista, Google or a similar search engine, and they frequently complain about the irrelevance of what they find on the Internet. They always blame this on the www and not on their own searching skills, because they think that searching is a matter of typing some words in the search box of any search engine and then pressing the search button. If the result turns out not to be what they expected or wanted to find, then the www is to blame.
Whenever I offer a course in information retrieval, I have to start by proving that information retrieval can be done systematically, in such a way that (and I use this as a rule of thumb) you will retrieve 15 relevant documents, no more and no fewer. This is what information retrieval is all about. It is not about finding zero documents and it is not about finding 23,987 documents. Both the search that retrieves zero documents and the search that retrieves a large number of documents are bad searches. The point is that a search engine is a delicate tool, like a camera. You can have a pocket camera or a disposable camera, and sometimes such a camera will provide you with excellent pictures, but professional pictures are made with a Hasselblad or a Leica, and only if you know how to adjust the diaphragm, the shutter speed and the light meter. Knowing this is not enough either: you also have to add some creativity to the process of making a truly professional picture.
When this point has been made clear, and when the students have had a demonstration of how exactly and precisely the goal of 15 relevant documents can be attained when searching the www, Lexis-Nexis, Dialog, Social Science Citation Index, the local library database, etc., the time has come to teach them what the "diaphragm, the shutter speed and the light meter" of a search engine are.
The first thing to teach is never to start searching without a question for which an answer is needed. Why? Because if students are not looking for something specific, how can they know when they have found what they are looking for? The question is also extremely important because it can be used for the relevance judgements: the documents retrieved by the search engine can be considered relevant if they give part of the answer or the whole answer to the question. A further reason why the information retrieval task must start by asking a question is that students can use the question to derive their search terms. For example, a question might be "What has been the most recent development within the peace process or peace talks between Israel and the Palestinians?" When the question is written out as it is here, it is obvious that the search terms that can be derived from it are peace talks or peace process, Israel, and Palestinians. There is a large probability that any relevant documents will contain all three. Thus the search statement so far must be: peace process and Israel and Palestinians.
At this point one does not know whether the search will retrieve zero, 23,987 or 15 relevant documents. So the next step is to prepare some broader terms that can be used if the search retrieves zero documents and some narrower terms that can be used if the search retrieves 23,987 documents. The broader terms for the search question above could be peace and Middle East, while the narrower terms could be any aspect of the peace negotiations, like the water issue, (Ehud) Barak and (Yassir) Arafat. All of these search terms should be added into the schema presented in Figure 1.
Figure 1 Schema of search terms
This schema serves two purposes: it helps to find the relevant search terms, and it also helps the student to define what the issue is about. Furthermore, all these search terms can be combined in a variety of ways in order to ensure a better search result. The terms in each vertical column can be combined by using the Boolean "or", for example (Israel or Middle East or Barak), and the three resulting "or-searches" can then be combined with "and" in order to retrieve as much as possible. Afterwards the search can be limited by using other parameters.
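The combination rule described above ("or" within a column, "and" between columns) can be sketched in a few lines of Python. This is only an illustration: Figure 1 is not reproduced here, so the contents of the three columns below are an assumption pieced together from the terms mentioned in the text, and the function name `build_statement` is hypothetical.

```python
# Assumed schema: each inner list is one vertical column of Figure 1.
# Only the column (Israel, Middle East, Barak) is spelled out in the text;
# the other two columns are guesses built from the terms the article mentions.
schema = [
    ["peace process", "peace talks", "peace"],
    ["Israel", "Middle East", "Barak"],
    ["Palestinians", "Arafat"],
]


def build_statement(columns):
    """Join the terms of each column with 'or', then join the columns with 'and'."""
    groups = []
    for terms in columns:
        # Put multi-word terms in quotation marks so they read as exact phrases.
        quoted = ['"' + term + '"' if " " in term else term for term in terms]
        groups.append("(" + " or ".join(quoted) + ")")
    return " and ".join(groups)


statement = build_statement(schema)
print(statement)
```

The parentheses are typed explicitly so that the search engine does not have to guess how the "or" groups and the "and" between them belong together.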
Choosing the best search terms is, to use the photography metaphor, comparable to a photographer's choice of a motif. The next step is to get to know the search engine's "diaphragm, the shutter speed and the light meter" and to understand how they are used. They are called truncation, masking, the proximity operator, the exact phrase, the addressing of indexes, the time limitation and, last but not least, the Boolean operators and the parentheses.
Each of these features will help in finding either more documents or fewer documents. Let us start with the last ones: the Boolean operators and the parentheses.
Boolean logic is simple: it consists of "and", "or" and "not". The way it works is best shown in Figure 2.
What Figure 2 shows is that combining two search terms with "and" will retrieve fewer documents than using only one search term. On the other hand, combining two search terms with "or" will lead to the retrieval of more documents than using just one search term. However, one should only use "or" in order to handle problems concerning synonyms. This is the reason why the schema in Figure 1 has "or" between the search terms on the vertical level.
Figure 2 Operation of Boolean logic
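The effect shown in Figure 2 can be imitated with ordinary sets. The sketch below assumes a tiny, made-up index in which each search term points to the numbers of the documents that contain it; the document numbers are invented purely for illustration.

```python
# Hypothetical mini index: search term -> set of document numbers containing it.
index = {
    "israel": {1, 2, 3, 5},
    "palestinians": {2, 3, 4},
}

# "and": only documents that contain both terms (set intersection).
both = index["israel"] & index["palestinians"]

# "or": documents that contain at least one of the terms (set union).
either = index["israel"] | index["palestinians"]

# "not": documents that contain the first term but not the second (set difference).
israel_only = index["israel"] - index["palestinians"]

print(sorted(both), sorted(either), sorted(israel_only))
```

In this toy index "and" returns two documents, "or" returns five and "not" returns two, which mirrors the narrowing and widening effect described above.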
Also, it is important for students to understand that Boolean logic can only deal with combining two search terms at a time. For example, if a combination of three search terms is required, the search engine will have to put a set of parentheses into the search statement in order to be able to carry out the command. Thus a search like "Clinton and Gore or Bush" will be interpreted by the search engine either as "(Clinton and Gore) or Bush" or as "Clinton and (Gore or Bush)". In the first case, the search engine will find many documents because this search is in reality an "or" search, while it will find few documents in the second case because this search is in reality an "and" search.
If students want to remain in control of where the parentheses are put, then they will have to type them in themselves. Otherwise the search engine will put the parentheses in one of the two positions without informing the student where it has put them, and thus make the search result unpredictable.
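The difference between the two readings can be made concrete with sets of document numbers. The numbers below are invented for illustration; only the two ways of placing the parentheses come from the text.

```python
# Hypothetical sets of document numbers in which each name occurs.
clinton = {1, 2, 3}
gore = {2, 3, 4}
bush = {3, 5, 6, 7}

# Reading 1: "(Clinton and Gore) or Bush" behaves like an "or" search.
reading_one = (clinton & gore) | bush

# Reading 2: "Clinton and (Gore or Bush)" behaves like an "and" search.
reading_two = clinton & (gore | bush)

print(len(reading_one), sorted(reading_one))
print(len(reading_two), sorted(reading_two))
```

With these invented numbers the first reading returns five documents and the second only two, even though the three search terms are identical. Typing the parentheses explicitly is the only way to know which of the two results one is looking at.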
The next thing to teach is the effect of the truncation, the masking, the proximity operator, the exact phrase, and the addressing of indexes.
Truncation is also known to many students as the "wildcard". It takes the shape of a variety of signs, such as *, !, ?, @, #, $, etc. By now most search engine programmers seem to have agreed on using mainly the *. It instructs the search engine to ignore whatever characters occur in place of, after or before the truncation sign. Searching for "elephant" will not retrieve "elephants". However, searching for "elephant*" will retrieve both. Some search engines allow initial, medial and final truncation. Most commonly only final and medial truncation are available. Initial truncation is mainly used in languages where there are many compound words, such as Danish. Medial truncation is of great help in languages where there are several variations of spelling or many irregular nouns (colour/color and woman/women). One should be aware, though, that truncation, if not used with caution, will also result in a lot of so-called noise, meaning irrelevant documents. For example, searching for "ele*ant" will provide documents that contain the word "elephant" as well as documents that contain the word "elegant".
In order to avoid this, some search engines allow the user to use masking. Masking is a truncation with a limited effect: it allows students to define the maximum number of characters they want the search engine to ignore. This feature is most common in search engines that search in languages with many compound words.
Both truncation and masking will result in the retrieval of more documents than if the search term is just typed.
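Truncation and masking can be imitated with pattern matching from the Python standard library. This is a sketch under stated assumptions: the word list is invented, the helper names `truncate` and `mask` are hypothetical, and real search engines each have their own syntax for these features.

```python
import re
from fnmatch import fnmatchcase

# Invented word list standing in for the words found in a document collection.
vocabulary = ["elephant", "elephants", "elephantiasis", "elegant",
              "element", "color", "colour"]


def truncate(pattern, words):
    """Truncation: '*' stands for any number of characters, including none."""
    return [word for word in words if fnmatchcase(word, pattern)]


def mask(prefix, suffix, max_chars, words):
    """Masking: at most max_chars characters may occur between prefix and suffix."""
    regex = re.escape(prefix) + ".{0," + str(max_chars) + "}" + re.escape(suffix)
    return [word for word in words if re.fullmatch(regex, word)]


final = truncate("elephant*", vocabulary)       # final truncation
medial = truncate("colo*r", vocabulary)         # medial truncation
noisy = truncate("ele*ant", vocabulary)         # medial truncation with noise
masked = mask("elephant", "", 1, vocabulary)    # ignore at most one character

print(final, medial, noisy, masked)
```

Here the final truncation "elephant*" also picks up "elephantiasis", while the masked search, which ignores at most one character, keeps only "elephant" and "elephants".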
The proximity operator, the exact phrase and the addressing of indexes are all features that are designed to retrieve fewer documents. The proximity operator is perhaps best explained with a well-known example: both the name Clinton and the name Lewinsky are quite common names. If one wanted to retrieve documents that deal with Bill Clinton's relationship with Monica Lewinsky, then searching for "Clinton and Lewinsky" might provide documents where the word Clinton occurs at the beginning of the text and the word Lewinsky is to be found at the end of the text. Such documents do not necessarily deal with the relationship between the two. The proximity operator instructs the search engine to search for any occurrence of the two words near each other: in the same paragraph or sentence, or within five, ten or 15 words, depending on the search engine. This might be expressed as "Clinton near Lewinsky". The probability that a document in which the two words occur near each other is in fact about both of them is of course higher than when one just uses "and" when searching.
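A proximity search of the "Clinton near Lewinsky" kind can be sketched as a check on word positions. The function name `near`, the default window of ten words and the two sample texts are assumptions made for this illustration; actual search engines define "near" in their own ways.

```python
import re


def near(text, first, second, window=10):
    """Return True if the two words occur within `window` words of each other."""
    words = re.findall(r"\w+", text.lower())
    positions_first = [i for i, w in enumerate(words) if w == first.lower()]
    positions_second = [i for i, w in enumerate(words) if w == second.lower()]
    return any(abs(i - j) <= window
               for i in positions_first
               for j in positions_second)


# Invented sample texts: both contain both names, so a plain "and" finds both.
close_text = "Clinton denied the relationship with Lewinsky."
far_text = "Clinton " + "filler " * 30 + "Lewinsky"

print(near(close_text, "Clinton", "Lewinsky"))
print(near(far_text, "Clinton", "Lewinsky"))
```

Both sample texts would satisfy "Clinton and Lewinsky", but only the first one satisfies the proximity condition, because in the second the names are more than ten words apart.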
The exact phrase is trickier. If a student wants to retrieve documents in which John Smith occurs and he or she types John Smith in the search box, some search engines on the www will look for an exact match on both words, written exactly as printed here. Other search engines will treat the space between the two words as an "or", which means that they will look for any document where "John" occurs and any document where "Smith" occurs. For the latter type of search engine, the exact phrase is quite useful. The exact phrase is expressed by putting the search terms into quotation marks, i.e. "John Smith".
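The two interpretations of a query typed as John Smith can be compared on a handful of invented documents. The helper names and the sample texts below are hypothetical and serve only to show the difference in behaviour.

```python
# Invented sample documents, keyed by document number.
docs = {
    1: "John Smith teaches journalism",
    2: "John Brown met Adam Smith",
    3: "Smith and Wesson",
}


def tokens(text):
    """Split a text into lower-case words."""
    return text.lower().split()


def or_search(query, documents):
    """Treat the space as 'or': any document containing at least one query word."""
    wanted = set(tokens(query))
    return {number for number, text in documents.items()
            if wanted & set(tokens(text))}


def phrase_search(phrase, documents):
    """Exact phrase: the query words must occur next to each other, in order."""
    target = tokens(phrase)
    size = len(target)
    hits = set()
    for number, text in documents.items():
        words = tokens(text)
        if any(words[i:i + size] == target for i in range(len(words) - size + 1)):
            hits.add(number)
    return hits


print(sorted(or_search("John Smith", docs)))
print(sorted(phrase_search("John Smith", docs)))
```

The "or" interpretation returns all three documents, whereas the exact phrase returns only the one in which the two words stand side by side.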
The most difficult thing to explain to students is the addressing of the indexes of the search engines. I usually explain it by giving the following two examples. The first is this: this text is not in any way about "John Smith". However, his name occurs in this document, and if this document were searchable in, for example, AltaVista and someone was searching for documents about "John Smith", then this irrelevant document would be found. The point is that documents are not about all of the words that are in the text. Frequently, the title of a document describes more precisely what the document is about. This text is about searching and about the Internet, and especially about searching the Internet. Thus if a search for John Smith results in 2 million documents, then searching for documents where the title contains John Smith will limit the number of retrieved records and increase the probability that the retrieved documents are, in fact, about John Smith. For example, in AltaVista, addressing the title index when searching is simply done by typing "title:John Smith" in the search box.
The other example is found in the fact that the Internet contains a lot of valuable information alongside a lot of low-quality information. In the literature about evaluating search results from the Internet, many authors have stated that documents originating from governmental institutions and educational institutions are, generally speaking, the ones that have the highest validity. As all US Web sites have inherited the top-level domains of the ARPANET (.edu, .gov, .org, .com, etc.), addressing the domain index can help ensure that, for example, .com sites are not included in the search result. In AltaVista again, this is simply done by typing "domain:edu". The search engine will then only retrieve documents that were put on the Net by educational institutions. Finally, most search engines allow the user to limit the search in terms of a time frame, for example the last week, the last six months or the last year.
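Addressing an index amounts to looking in one field of a document instead of in the whole text. The sketch below assumes a made-up collection in which every document has a title, a host name and a body; the host names use reserved example domains, and the function names are hypothetical. It imitates the effect of "title:" and "domain:" restrictions, not the workings of AltaVista itself.

```python
# Invented documents; titles, hosts and bodies are for illustration only.
documents = [
    {"title": "Searching the Internet", "host": "www.example.edu",
     "body": "This text mentions John Smith only in passing."},
    {"title": "John Smith: a short biography", "host": "www.example.com",
     "body": "An account of his early years."},
    {"title": "The life of John Smith", "host": "history.example.edu",
     "body": "A lecture handout."},
]


def contains(text, phrase):
    """Case-insensitive check that the phrase occurs in the text."""
    return phrase.lower() in text.lower()


def search_anywhere(phrase, docs):
    """Search title and body together, as a simple search would."""
    return [d["title"] for d in docs
            if contains(d["title"] + " " + d["body"], phrase)]


def search_title(phrase, docs):
    """Search the title field only, comparable to a 'title:' restriction."""
    return [d["title"] for d in docs if contains(d["title"], phrase)]


def restrict_domain(domain, docs):
    """Keep documents whose host ends in the given top-level domain."""
    return [d for d in docs if d["host"].endswith("." + domain)]


everywhere = search_anywhere("John Smith", documents)
titles_only = search_title("John Smith", documents)
edu_titles = search_title("John Smith", restrict_domain("edu", documents))

print(len(everywhere), len(titles_only), edu_titles)
```

Each restriction shrinks the result: three documents mention the name somewhere, two carry it in the title, and only one of those comes from an .edu host.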
However, all of these features are mainly found in the so-called advanced search. And very few people bother to check out what features are available on the various search engines when they are searching the Net. Students remain at the simple search which is usually the opening page of the search engine and they remain unhappy with their search results because they do not take the necessary time to get to know all its features.
It is understandable why: the help function of the large search engines always looks very complicated, and there are always a lot of pages that one has to download or print separately. However, there is a short-cut: once students are aware of the different possibilities for adjusting the search engine (Boolean logic, truncation, masking, the proximity operator and the exact phrase) and once they understand the effect of these features, they can quite easily find their way around the help function by simply looking for any page called something like "advanced search syntax" or "search cheat sheet". These pages in the help function of every search engine will inform the student which character to use for truncation, masking, the proximity operator and the exact phrase, and they will also inform the student how to express, for example, the "and" (most frequently just "and" or +), the "or" (most frequently just "or", / or a space) and finally the "not" (most frequently "not", "and not" or –).
The final point about teaching students how to search the www is that they should choose one or two search engines and get to know those well rather than try a new search engine for every new search. The reason is that most of the search engines on the Net have different so-called ranking algorithms built into them. This means that some search engines will top rank a document if the search terms occur frequently in the document, while others will top rank a document if the search terms occur in the beginning or the title of the document, etc. Some will even try to interpret the search statement made by the user and try to find documents that contain synonyms for any of the search terms. Which of these algorithms are used and how they work are considered trade secrets by the companies that make a living from trying to provide search engines that give the best results. Therefore, I always instruct the students to experiment with various search engines by making, for example, ten search profiles, submitting these to, for example, ten different search engines on the www, and then carefully evaluating the search results in order to find the search engine(s) that best meet their demands. After all, searching is like making a photograph: one camera brand might work best in the hands of one photographer, while another camera brand might work best in the hands of another. And the photographer can only make real art by using a personal favourite.
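Why two search engines can order the same documents differently can be illustrated with two deliberately simple ranking rules. Both rules, the function names and the two sample documents are assumptions made for this sketch; since real ranking algorithms are trade secrets, nothing here should be read as a description of any actual search engine.

```python
# Invented sample documents.
docs = {
    "A": {"title": "Peace talks resume", "body": "negotiators met again today"},
    "B": {"title": "Regional news roundup",
          "body": "peace peace peace and more peace"},
}


def count_term(term, doc):
    """Count how often the term occurs in the title and body of a document."""
    words = (doc["title"] + " " + doc["body"]).lower().split()
    return words.count(term.lower())


def rank_by_frequency(term, documents):
    """Toy rule 1: the more often the term occurs, the higher the rank."""
    return sorted(documents, key=lambda name: -count_term(term, documents[name]))


def rank_by_title(term, documents):
    """Toy rule 2: documents with the term in the title come first."""
    return sorted(
        documents,
        key=lambda name: term.lower() not in documents[name]["title"].lower().split(),
    )


print(rank_by_frequency("peace", docs))
print(rank_by_title("peace", docs))
```

The frequency rule puts document B first, while the title rule puts document A first, so the same search statement gives a different top result depending on the rule in use.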
Dr Charlotte Wien is Associate Professor at the Department of Journalism, University of Southern Denmark, and can be contacted at email@example.com or +45-6550-1000 ext: 2159.
Casey Sweet has provided an informative review of the latest qualitative research as reported at the "Qualitative research in the twenty-first century" conference in Paris. Dr Charlotte Wien has developed a methodology for teaching students – academic or executive – how to make their Internet searches more effective. My thank you goes to them for their enlightening articles.
Have you pioneered a new qualitative research or market research methodology? Have you made an innovative use of the Internet in facilitating research? Are you a provider of cutting-edge technology or software that assists research? Do e-mail me if you would like to showcase your ideas in Internet News.
Rehan ul-Haq, Editor, Internet News, QMRIJ, The Birmingham Business School, The University of Birmingham. E-mail: firstname.lastname@example.org Tel: +44 (0)121-414-3456 Fax: +44 (0)121-414-2263.