Abstract
Purpose
Internet search is a $120bn business that answers lists of search terms or keywords with relevant links to Internet webpages. Only a few companies have sufficient scale to compete, so the economics of the process are paramount. This study aims to develop a detailed, industry-specific model of the economics of Internet search.
Design/methodology/approach
The current research develops a stochastic model of the process of Internet indexing, search and retrieval in order to predict expected costs and revenues of particular configurations and usages.
Findings
The models characterize the behavior and economics of parameters that are not directly observable and whose distributions and economics are therefore difficult to determine empirically.
Originality/value
The model may be used to guide the economics of large search engine operations, including the advertising platforms that depend on them and largely fund them.
Citation
Westland, J.C. and Mou, J. (2024), "A stochastic model of the economics of Internet search", Journal of Electronic Business & Digital Economics, Vol. 3 No. 3, pp. 203-221. https://doi.org/10.1108/JEBDE-10-2023-0023
Publisher
Emerald Publishing Limited
Copyright © 2024, James Christopher Westland and Jian Mou
License
Published in Journal of Electronic Business & Digital Economics. Published by Emerald Publishing Limited. This article is published under the Creative Commons Attribution (CC BY 4.0) licence. Anyone may reproduce, distribute, translate and create derivative works of this article (for both commercial and non-commercial purposes), subject to full attribution to the original publication and authors. The full terms of this licence may be seen at http://creativecommons.org/licences/by/4.0/legalcode
1. Introduction
Internet search is a $242bn business in the USA (IBIS, 2024) that connects lists of query keywords to 4.62 billion webpages, around 30 trillion of which are indexed by around 1 million search terms (Kunder, 2024). In 2024, Google dominated with 91.62% of the search market and 39% of the digital advertising market (followed by Facebook’s 18% and Amazon’s 7%).
Google’s primary source of revenue is keyword advertising which generated $238bn in 2023. The management of this revenue stream involves extensive modeling of internet consumer choice of keywords. Searchers' keywords are roughly divided into informational and transactional, where only the transactional keywords are typically tied to ad placement. Beyond these rough divisions, there is a complex and mostly proprietary set of internal models that Google uses to assess consumer behavior. Since Google Panda’s release in February 2011, Google has been very secretive about features, constructs and models used to determine consumer search behavior.
Google starts with human-generated natural language queries and links query keywords to indexed webpages. The fundamental challenge has remained constant throughout Google’s history: generate keyword-specific relevance statistics for its indexed webpages. There is a substantial need for detailed industry-specific modeling of the economics of internet search, if only because the creation and operation of successful business models are predicated on broader studies, like ours, of the factors in the architecture of search and how they affect costs and revenue generation. Our study contributes to the relatively thin existing literature on the economics of internet search.
In the current research, we develop a comprehensive economic model of Internet keyword search. Though there are other approaches to search, keyword algorithms are by far the most economically important, as they are adopted by the industry leaders Alphabet/Google and Microsoft. The model’s objective is to summarize the economics of search engine processes and to predict the expected costs and revenues of particular configurations and usages. By better characterizing the indexing and query-specific retrieval tasks using a stochastic model, we can provide a robust description of the producer’s (search engine’s) problem of optimizing the acquisition of information on webpages in order to satisfy consumers’ information needs with the information returned to them. The accurate characterization of both the producer’s and the consumer’s economic problems in this paper allows the construction of a complete economic model of keyword search.
An important advantage of stochastic modeling lies in the fact that most parameters involved in search are not directly observable; given the immense scope of the entire Internet, and the fact that indexing occurs for only around 3% of webpages, it is difficult to determine the distributions and economics empirically. By creating a global stochastic model of indexing, search and retrieval, it may still be possible to validate the model locally or for topic-specific subsets, while the global conclusions of the model will help guide the economics of large search engine operations such as those managed by Google and Microsoft (including the advertising platforms that depend on them and largely fund them).
2. Prior literature
Internet search has been extensively studied over the past two decades, and there have been substantial advances in Internet search algorithms during that period. The vast majority of this has either been network and demographic research or has been algorithmic computer science and engineering research with the objective of optimizing performance on specific metrics, e.g. sensitivity-specificity, Bayes risk, precision-recall, perceived user engagement time metrics and so forth. The mechanics of search have been studied extensively, often from a computer science standpoint of building models, running them and reporting on the performance of these implementations. In contrast, there have been few peer-reviewed journal articles on the economics of internet search. The majority of that work has been descriptive, similar to industry white papers, and has appeared as book chapters or languishes unpublished on arXiv. Many of the comprehensive models of Internet search economics are speculative and anecdotal, presented in non-peer-reviewed books, e.g. (Levin, 2011; Peitz & Reisinger, 2015; Varian, 2016), or they reflect particular agendas or industry-specific experiences, e.g. (Chen & He, 2011; Argenton & Prüfer, 2012; Jolivet & Turon, 2019; Al-Shabi, 2020; Dinerstein, Einav, Levin, & Sundaresan, 2018). Peer-reviewed articles are almost completely restricted to sector- or industry-specific models, partly because of limitations in data access. Where natural language processing (NLP) models are studied, e.g. (Khoo & Johnkhan, 2018; Liu et al., 2010; Harari, Parola, Hartwell, & Riegelman, 2020; Koto & Adriani, 2015; Khan et al., 2016), they are typically in pursuit of sentiment extraction from large text databases. None of these provide a generalized underlying NLP model of search processes and search economics.
Search query tasks are generally perceived as situation-specific and highly variable. But research over the past 30 years in linguistics and search applications has shown that some order can generally be imposed upon the queries generated by information “consumers” (Blair, Urland, & Ma, 2002; Anderson, 2005; Barkovich, 2019; Bruder, Düsterhöft, Becker, Bedersdorfer, & Neumann, 2000; Crystal, 2005; Gulla, Auran, & Magne Risvik, 2002; Sharoff, 2006) and (Hundt, Nesselhauf, & Biewer, 2007). User search from the consumer standpoint can be projected onto search engine algorithms and index bases, where the query content may be extended by a linguistic generative pre-trained transformer (e.g. ChatGPT).
Where there have been purportedly “economic” studies, these have typically not been comprehensive economic models; rather, they are anecdotal, case studies or limited empirical analyses through particular narrative perspectives, or they are proposed algorithms. These may be important, but they are entirely different from the problem we are addressing, which is to create a generalized, comprehensive economic model of Internet search that can be used as a revenue model or a business model incorporating costs and architectures. Many of the models presented in books (where the majority of such models are mooted) are extended white papers, without general scholarly research underlying their structure and proposals. Indeed, Internet search economic models are rare, and where they exist, the majority seem to be on arXiv or in books, e.g. (Levin, 2011; Peitz & Reisinger, 2015; Varian, 2016). These tend to share specific industry perspectives but have not appeared in peer-reviewed scholarly journal studies where they may be shared more widely with researchers around the world.
It is also important to note that indexing occurs on only a small subset of Internet sites, as most webpages are of low quality or redundant; they contain little additional information relevant to consumer information needs and queries. There is no “master” index, and indices are accumulated inefficiently through hyperlink crawling. The complexity and inefficiency of Internet search and retrieval of relevant information provide strong incentives for seeking ways to improve the cost-effectiveness of running a search engine.
3. Consumer choice in Internet search
Behavioral economics (BE) arose in the 1950s in response to the optimization and rational choice models being promoted in economics; behavioral economists argued that humans are subject to irrational thinking, overconfidence, anchoring and representativeness, and that real price series and economic behavior therefore veer from economic ideals. BE introduced important concepts that are now mainstream in economics but also embraced fringe theories that are highly controversial. The current research steers clear of the more controversial methods in BE and highlights the specific models, the context and the extant literature for the accepted and statistically valid methods of BE. In general, we note that methodologies in this area are fundamentally the same as in any other scientific research: (1) empirical feature extraction; (2) model building; and (3) model confirmation. Each of these has specific families of methods for statistical analysis and for empirical data collection and curation, and in order to link into empirical testing protocols, we have designed our models to be predicated on accepted methods in BE, taking into account that behavioral economics is economics in which a much wider, and more realistic, range of behaviors is attributed to humans and their economic decision making. The current research develops a set of axioms based on widely accepted behavioral assumptions. We start by articulating the consumer’s choice problem and then move on to the architecture of an axiomatic consumer search query model. The consumer choice problem is fundamental to economically important activities such as keyword advertising on search platforms and search engine optimization.
Information consumers articulate a specific need using only a set of keywords – i.e. a term list VS of random length, composed of selections from their own vocabulary V. Unique vocabulary terms ∈ V may be 1-, 2- or 3-grams, with longer key phrases being exceptionally rare. Google Ads recognizes about 400,000 vocabulary terms in English. The number probably varies significantly between languages, with relatively sparse languages like Chinese at the low end and Arabic at the higher end.
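As a small illustration of the kind of term list described above, the following Python sketch enumerates the candidate 1-, 2- and 3-gram terms contained in a raw query string; the naive whitespace tokenizer and the example query are purely illustrative assumptions, not part of the model.

```python
# Illustrative sketch: enumerate candidate 1-, 2- and 3-gram vocabulary terms
# from a raw search query. The whitespace tokenizer is deliberately naive.
def candidate_terms(query: str, max_n: int = 3) -> list[str]:
    tokens = query.lower().split()
    terms = []
    for n in range(1, max_n + 1):                 # 1-grams, 2-grams, 3-grams
        for i in range(len(tokens) - n + 1):
            terms.append(" ".join(tokens[i:i + n]))
    return terms

print(candidate_terms("cheap flights to chicago"))
# ['cheap', 'flights', 'to', 'chicago', 'cheap flights', 'flights to',
#  'to chicago', 'cheap flights to', 'flights to chicago']
```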
Search term lists are constructed through several concurrent user-initiated processes. The first process is the one that generates the information need. This process may range in importance from trivial curiosity to a serious investigation of complex and important issues (e.g. for a purchase transaction). The intensity of the user’s information need sets limits on the effort, utility or expense that the user is willing to incur in obtaining that information. This will affect the number of responses that the consumer is willing to demand, read and digest. If the cost of accessing website documents in terms of the user’s time, money and so forth is fixed, the major variable cost component of the user’s effort is the construction of the search query. This search query can be depicted as a set of keywords or terms summarizing the user’s information need. Term relationships can be depicted through models of the Boolean operations of conjunction ∩ and disjunction ∪. The consumer’s effort in constructing the query is directly proportional to the number of terms in the query ρ. Making only weak assumptions constraining the size of the search query (e.g. that there is an upper bound for the maximum expenditure of effort possible), the next section will derive a stochastic distribution of search query term list length.
A search query comprises language terms ∈ V linked by Boolean operators drawn from {∩, ∪}.
Preprocessing of search queries is increasingly common – through query completion, and more recently by machine learning algorithms like LaMDA, ChatGPT or Bing AI ChatGPT. Since information from this preprocessing does not come from the consumer but rather from the search algorithm, it is not assumed to change the nature of the transaction between the consumer (searcher) and the producer (search engine).
Search index databases are ideally designed with a high correlation between the vocabulary of the consumer and the vocabulary expressed in the indexing of the search data. In modern search engines such as Google and Bing, this concordance is enhanced dynamically based on location, language, prior use and other factors. Consumers also tend to self-select into being users of search engines that have historically provided them with the highest satisfaction from the information retrieved. Google and Bing know this and have active internal development teams seeking the best ways to maximize consumer satisfaction with their search engines. As a result, terms in the search query have isomorphic counterparts in the index set, and the index set has a direct linkage to the web documents that it indexes.
Index sets do not exist as intrinsic parts of the web pages they reference; rather, they are inferred by the web crawlers that collect this information and the algorithms that curate it into the index database. This production process parses web documents/pages and produces links to descriptive terms, usually dominated by nouns, adjectives and noun clauses. The keyword terms are drawn from a generalized vocabulary or may be artificially restricted to a context-specific keyword vocabulary in order to enforce consistency and better satisfy consumers. The net effect of the web crawling-curating process that updates the index database is that keyword query terms end up linking to specific web document pages.
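This crawl-and-curate production process can be pictured as the construction of an inverted index that maps each term to the set of pages it references. The Python sketch below is a minimal, assumed illustration; the parse_terms helper is a hypothetical stand-in for the production parsing step that would extract nouns, adjectives and noun clauses.

```python
from collections import defaultdict

# Minimal sketch of index curation: map each keyword term to the set of
# webpage identifiers it references. `parse_terms` is a hypothetical stand-in
# for the production parsing/curation step.
def parse_terms(page_text: str) -> set[str]:
    return set(page_text.lower().split())

def build_index(crawled_pages: dict[str, str]) -> dict[str, set[str]]:
    index = defaultdict(set)
    for url, text in crawled_pages.items():
        for term in parse_terms(text):
            index[term].add(url)          # term -> webpages it references
    return index

pages = {
    "example.com/a": "stochastic model of search economics",
    "example.com/b": "search engine advertising economics",
}
index = build_index(pages)
print(sorted(index["economics"]))         # both pages reference this term
```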
Given a specific query term, there is a specific probability that a randomly selected web page will include this specific query term in its summarization in the index set recorded on the search engine’s index database. The next section develops the stochastic models giving the probability distributions of these terms.
The central figure of merit in search assessment is relevance, which is directly related to consumer satisfaction with a search and, over the long run, to a particular search engine’s performance. Relevance can take as many forms as there are forms of intent for a search. Transactional searches look for products and services, informational searches look for authoritative information, and curiosity can impel some searches, as can boredom, where the figure of merit may be simple entertainment. Relevance here will measure the satisfaction the searcher has with the results of a particular search.
Information consumers are presumed to be less concerned with the number of links returned from a search and much more concerned about the relevance of those links to the consumer’s search intention. Search intention is not a priori observable, but the consumer’s behavior in reviewing responses (e.g. the number of links clicked, time on the response page) can be used to assess consumer satisfaction with the search. Formally, the search of a specified domain D of size n receives a random sequence of search queries arising from consumer information needs
This sort of partitioning of links into relevant and not-relevant suggests that we could adapt a Neyman–Pearson hypothesis-testing model, where the choice is supported by the Neyman–Pearson lemma. Indeed, search engines use modifications called Precision-Recall models or Sensitivity-Specificity models that are variations on Neyman–Pearson models.
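As a concrete illustration of the precision-recall variant mentioned above, the short Python sketch below computes precision and recall for a single query, treating the set of truly relevant pages as known; in practice that set is unobserved, and the link lists here are invented for illustration.

```python
# Sketch: precision and recall for one query, given the returned link list and
# a set of truly relevant pages (assumed known here purely for illustration).
def precision_recall(returned: list[str], relevant: set[str]) -> tuple[float, float]:
    hits = sum(1 for link in returned if link in relevant)
    precision = hits / len(returned) if returned else 0.0   # relevant share of returned
    recall = hits / len(relevant) if relevant else 0.0      # returned share of relevant
    return precision, recall

returned = ["d1", "d2", "d3", "d4"]      # links shown to the consumer
relevant = {"d2", "d4", "d7"}            # pages that would satisfy the need
print(precision_recall(returned, relevant))   # (0.5, 0.666...)
```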
3.1 Axiomatic search query model
This section characterizes the consumers' construction of a search term (keywords) list. The first task in this process enumerates the web pages to which a specific term refers. This is accomplished by equating the indexer’s and consumers' term vocabularies and then statistically characterizing the breadths of terms referencing the search engine index database. The second task characterizes the amount of effort which the user expends in articulating a search query, and the maximum amount of information which a query may contain. It then develops a probability distribution of the number of terms
We will start by making some basic modeling assumptions that are obvious but need to be articulated. The following assumptions are made in constructing the search model.
The intersection of all webpages indexed and the set of relevant webpages exactly captures the information required to satisfy the consumer’s information needs
This assumption tells us that the consumer will be satisfied with a search if it only contains information that exists in the indexed database. In Google’s case, this indexed database of webpages consists of 30 million high-information content webpages out of around 4.62 billion webpages. This is a very precise approximation of reality.
Consumers gauge the appropriateness of a search term for describing a particular set of webpages by the breadth of that term in indexing those webpages, where “appropriate” implies that they reference the largest number of webpages in the full set of relevant webpages DR.
This assumption tells us that the consumer, in making a choice among search keywords, will choose keywords that are broad, rather than specific, in the hope that a larger proportion of relevant webpages will be returned in the search. It does not suggest that consumers cannot make specific searches, merely that they will want more choices from the webpages returned rather than fewer.
Consumers want to maximize the amount of information retrieved without expending more than some maximum amount of effort κq in formulating a search query.
This assumption tells us that the consumer will minimize the energy and attention expended on any single search.
The search engine index database is organized by topics (e.g. derived through cluster analysis or unsupervised learning), though that set of topics may not be static over time. This assumption tells us that clustering is a part of the indexing of webpages, an assumption which has been widely confirmed by search firms. Clustering means that, where there is either uncertainty in understanding the consumer’s intention or the consumer is interested in serendipitous search responses, the responses will be organized by “closeness” to the main search responses.
Vocabularies used in the curation of the search engine index W and vocabularies of the consumers V are identical. In the notation in this research, V ≡ W and vi ≡ wi, ∀i.
This assumption tells us that the consumer and the search engine use the same language.
3.1.1 Search term list length and composition
3.1.1.1 Breadth of indexing
This section formulates the stochastic distribution for breadth, defined as the size of the webpage set in the index database for the search engine, referenced by a particular vocabulary term vi ∈ V. This material builds on models developed by Ijiri and Simon (1975), Mandelbrot (1983) and Hill (1974).
Average term breadth is the inverse of the average of the commonly used term specificity measure which is described in Lancaster (1971). The following notation is adopted. β(wi) is the set of webpages di ∈ D which are referenced by vi ≡ wi ∈ V ≡ W and the size of
The following three theorems provide the model for the breadth of referencing by the user’s query terms. Assumption #2 states that users choose particular wi comprising the query term lists so that they have greater breadth, individually, in referencing documents restricted to DR than any other term in W. As r/n → 0 (i.e. as the ratio of the size of the set of relevant webpages to the size of the entire search-indexed webpages tends to zero), the topic composition of DR becomes significantly different from the composition of D. Therefore, the “most appropriate” search terms for referencing DR will have uniform randomly selected breadths when referencing the entire search index database.
Theorem 1 provides a method to determine how many webpages are referenced by a specific search term, assuming that vocabularies are infinite and infinitely divisible. This is important because it lets us know how many webpages will be retrieved if a search query is formulated that includes only that search term. Theorems 2 and 3 will relax this assumption, while extending results to show how to transform and combine the consequences of Theorem 1 to obtain the webpage count for a more complex search query, such as a Boolean query or a “concept” extended Boolean query that might be generated by completion or machine learning extensions to the original search query.
Given an infinite search term vocabulary, the probability that a webpage uniform-randomly selected from D, or from any subset DR ⊆ D, is referenced by a search term wi ∈ W follows a Pareto distribution. If
Proof: Assume D consists of an aggregation of information chunks where each chunk contains a uniform measure of information and an information chunk is the smallest unit of text indexable by a search term-type; in other words, it is the limit of resolution of the index set W. For example, in a full-text system, this limit of resolution is the term-token in the text; at the other extreme of links and citation lists, the limit of resolution may be only term-tokens in both the webpage title and a limited vocabulary (e.g. a glossary) of terms. Assume a specific term-type or chunk reference relationship occurs only once in a webpage. Then the set of all information chunks in D can be rearranged by term-type. Figure 1 depicts term-tokens or chunks as small circles, while the term-types are determined by clustering these term-tokens between the delimiting bars for the term-type.
Assume that the left and right-most bars are terminators, and the other bars delimit the sets of tokens or chunks by term-type. Assumption 4 above states, in the topic-specific index database, that the current topic orientation of the index determines the selection of new acquisitions to that index. This is formalized by stating that the proportion of term-type or chunk references in any newly acquired webpage in the index database is proportional to the number of chunks of that term-type already residing in the index. This is a “preferential attachment process” in which new acquisitions are distributed according to how much already exists, and it has been widely documented on the Internet (e.g. see Newman, 2005; Barabási & Albert, 1999; Krapivsky, Redner, & Leyvraz, 2000; Falkenberg et al., 2020). Ijiri and Simon (1975) showed that cell occupancies in preferential attachment follow Bose-Einstein statistics and cell size distributions are asymptotically Paretian as the number of cells (i.e. term-type classifications delimited by the bars in Figure 1) grows large. In a large index database, p ∝ i^−(1+θ) (a Pareto distribution with “temperature of discourse” θ and rank i ∈ [1, u]) is the probability that a randomly selected term-type wi references
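A short simulation illustrates the preferential-attachment argument in the proof: when each newly acquired chunk reference attaches to a term-type in proportion to the chunks that term-type already holds, the occupancy counts become heavy-tailed (Paretian) as the index grows. This is a toy sketch under assumed parameters, not the production crawler or curation pipeline.

```python
import random
from collections import Counter

# Toy preferential-attachment simulation of index growth: each new chunk
# reference copies the term-type of a uniformly chosen existing chunk
# (attachment probability proportional to current occupancy), or starts a
# brand-new term-type with small probability p_new. Parameters are arbitrary.
def simulate_index(n_chunks: int = 200_000, p_new: float = 0.01, seed: int = 1):
    random.seed(seed)
    chunk_types = [0]                     # term-type id of each chunk so far
    next_type = 1
    for _ in range(n_chunks):
        if random.random() < p_new:
            chunk_types.append(next_type) # a brand-new term-type
            next_type += 1
        else:
            chunk_types.append(random.choice(chunk_types))
    occupancy = Counter(chunk_types)      # chunks held by each term-type
    return sorted(occupancy.values(), reverse=True)

sizes = simulate_index()
print(sizes[:10])   # a handful of term-types hold most chunks (heavy tail)
```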
A corollary provides a method to determine the number of webpages referenced by a specific term when the vocabulary is finite and of size u.
Corollary: When the vocabulary size u < ∞, the probability that an arbitrarily selected term wi ∈ W references a randomly selected indexed webpage in DR has the probability mass function
Proof:
The next theorem uses these results to determine the number of webpages that are indexed by a randomly selected index term. The rank-frequency distribution of the
Let
The corresponding mass function for
Proof:
That is, the probability that the breadth of a randomly selected term is less than
The formula for the mass function follows directly from the fact that
The following corollary uses these results to determine how many of the relevant webpages (i.e. that provide the consumer with high satisfaction) are indexed by a randomly selected index term. In formulating a search query keyword list, the consumer is assumed to maximize their future satisfaction by choosing search terms which reference a large number of relevant webpages and few irrelevant webpages. Indeed, this is the intended purpose of search support tools in the search engine, such as keyword completion, and ChatGPT style assistance for search.
Corollary: Let
Then the cumulative distribution function of
Proof: Follows the form of theorem 2.
3.1.2 How much information is contained in an index?
How much information is contained in the index database for DR? Assume that a particular η, a consumer search need realization (i.e. a search), induces a partitioning of the set of webpages D into DR and D − DR. The fixed set DR may be considered to contain a fixed amount of information which is the maximum information that may be communicated using a search term list.
A consumer’s information needs, with the intent of satisfying them through search, determine two factors of importance for search: (1) the set of webpages that are relevant and (2) the maximum effort that the consumer is willing to expend in executing the search and satisfying those needs. Consumers are presumably aware of this trade-off, but can only incompletely communicate it. Search engines are similarly aware of these trade-offs. The search engine itself can be viewed as a transformation or mapping of the consumer’s articulation of search needs projected on the search term list (keywords) and mapped into a set of indexed webpages at a particular point in time. The indexed database is always changing, because of updates by the web crawler and also by producer updates intended to improve the retrieval experience and customer satisfaction – i.e. the relevance of the returned links.
Although there are many definitions of information, the one that is most appropriate in a Boolean framework is the Shannon (2001) information metric. Adopting Shannon’s perspective, let there be a set of probabilities pi associated with a set of search terms wi that reference a randomly selected webpage d ∈ D. Within the context of this specific index database (i.e. one that is a snapshot at a given point in time), the amount of Shannon information in the set of search terms is H = −∑i pi log2 pi.
Khinchin (2013) interpreted Shannon’s metric as a measure of the reduction in uncertainty resulting from obtaining a particular piece of information – e.g. the content of a retrieved webpage. Following the Bayesian statistical method, prior information can be segregated from new information and there may be a quasi-absolute measure of information corresponding to the Bayesian likelihood function. In the current research, we make the weak assumption that this measure is monotonic increasing in the Shannon metric H. This information is (1) embedded in the indexing of webpages on the search engine’s index database, drawn from vocabulary W; and (2) embedded also in the search queries with keyword-terms drawn from the consumer’s vocabulary V.
Now consider the value-added service offered by the search engine, which adds information to the index set but is hugely costly; in most commercial Internet searches, these costs are typically covered by advertising revenues. Consider a hypothetical situation to describe the addition of information in terms of the Shannon metric. Assume that you are visiting a rather strange library where all of the books have indistinguishable bindings, devoid of titles, authors or other markings. Assume also that you are not allowed to browse through the pages of these books. You have a lot of information in front of you, but no way to differentiate one information receptacle from another. This is the problem faced at start-up by a search engine confronting an Internet full of webpage links, with a relatively limited budget with which to make sense of them. In this situation, a randomly selected book may be relevant to any search term in a consumer’s vocabulary with equal probability. Since there is no way of differentiating books from each other, indexing would provide no information, because no linkages can be established between a search query and “relevant” books. More formally, make the simplifying assumption that vi ≡ wi and W ≡ V. A link scheme with u terms would be
Theorem 3 formalizes the concept of the information contained in an indexing scheme. Theorem 3’s measure is an exponential of H, which is order-preserving and preference-preserving. It describes the maximum amount of information which may be embedded in the consumer’s search query or in the index database and sets an upper limit on the amount of information that a search may communicate. The amount is a maximum because search queries reflect information needs that can only partially be satisfied by the search engine – for most searches, the consumer is likely to want information that is not included anywhere on the Internet. Some examples of searches that are unlikely to be satisfied with Internet webpages might be anything in top secret documents, the elixir of life, the truth behind numerous conspiracy theories and so forth. Search engines can only deliver information contained in the indexable Internet.
Theorem 3 assumes that webpages may contain variable numbers of meaningful chunks of information, each chunk being information that might satisfy the consumer.
The amount of information contained in the indexing of DR with vocabulary W is
Proof: Assume that there exist a total of k information chunks in DR ⊆ D and let ki be the number of chunks represented by the index term wi. The total chunk count of relevant information on the index database is k, while the total webpage count of relevant information on the index database is r. The number of different subsets of DR characterized by the same set of term repetitions ki is
As DR becomes large
Take the base 2 logarithm of this expression
Divide through by k, placing that term in front of the entire expression, and note that as k increases the term involving 2π tends to zero and the 1/2’s in the former exponents become insignificant. Therefore, we can approximate this as
Raise 2 to the power of this exponent (since we originally took the log2) and you derive the asymptotic approximation
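For completeness, a compressed LaTeX rendering of the Stirling-approximation steps narrated in the proof (with the assumption, consistent with the definition of H above, that pi = ki/k) is:

```latex
% Number of distinguishable subsets of D_R with term repetitions (k_1,...,k_u):
\frac{k!}{k_1!\,k_2!\cdots k_u!}
% Applying Stirling's approximation and taking the base-2 logarithm,
% the (1/2)\log(2\pi m) terms vanish relative to k as k grows, leaving
\log_2 \frac{k!}{\prod_i k_i!}
  \;\approx\; k\log_2 k - \sum_i k_i\log_2 k_i
  \;=\; -k\sum_i \frac{k_i}{k}\log_2\frac{k_i}{k}
  \;=\; kH ,
% so raising 2 to this power gives the asymptotic count  2^{kH}.
```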
3.1.3 Consumer’s choice of search term list length
Given the time, energy and motivation to sift through an entire collection of retrieved webpages, a consumer could precisely identify the set DR of webpages relevant to need η. This assumption is unrealistic, and in fact consumers will hold some “reservation price” – i.e. an amount of effort that represents an upper limit to the amount of time, energy, etc. that they are willing to expend on a search – call this κq. The count of terms in the search term list may be expected to be a monotonic increasing function of the level of effort that the consumer expends, and that expenditure is represented by the cost (disutility) function ZQ(ρ), where ρ is the length of the search term list. Theorem 4 calculates how the consumer’s cost-benefit considerations translate to search term lists.
The probability that the user chooses a search term list of length ρ is
This is Pareto distributed in the cost ZQ, and κq is the maximum amount of effort that the user is willing to expend in formulating a query.
Proof: Assume that the user has a total of Y possible search query term lists with which to represent the relevant webpage set DR, composed of the u search terms in V. Construct a set of u + 1 “trunks” for the potential search term list “trees”, with one for each search term-type plus a null “terminator” that marks the end of a search term list. Then add u + 1 similar branches to each trunk except for the terminator trunk, and so forth onto the branches so constructed. Because the branches extending forward from any particular branch in this structure form a structure that is self-similar, i.e. it duplicates the entire structure, the distribution of
Define ρ(y) as the length of the search term list of rank y from the trunk to the terminator. The cost or disutility associated with a search term list of length ρ(y) is
Additionally, note that a search term list with one additional term must always be of rank greater than y
Let
(1) Effort expenditure
(2) Summation to 1
(3) Maximum total information is kH
We use Lagrange multipliers to optimize this objective function with constraints, starting with the Lagrangian. Here we assume that
The first-order conditions are computed by
Eliminating
Substitute into the previous equation and note that all terms in their probability density function depend on ρ(y)
The kernel of the density of
This is a Pareto distribution in the cost (disutility) function ZQ(⋅) (Q.E.D.)
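The exact closed form stated in the theorem is not reproduced here, but the shape of the Lagrange-multiplier step can be sketched as follows. This is a hedged reconstruction under two assumptions not spelled out in the text: the objective is taken to be the Shannon information of the choice distribution p(y), and only the normalization and effort constraints are written out.

```latex
% Hedged reconstruction (not the authors' exact Lagrangian): maximize the
% information of the choice distribution p(y) subject to normalization and
% an effort budget kappa_q.
\mathcal{L} \;=\; -\sum_y p(y)\log_2 p(y)
  \;-\;\lambda_1\Bigl(\sum_y p(y)-1\Bigr)
  \;-\;\lambda_2\Bigl(\sum_y p(y)\,Z_Q(\rho(y))-\kappa_q\Bigr)
% First-order condition in p(y):
-\log_2 p(y) - \tfrac{1}{\ln 2} - \lambda_1 - \lambda_2 Z_Q(\rho(y)) \;=\; 0
\;\;\Longrightarrow\;\;
p(y) \;\propto\; 2^{-\lambda_2 Z_Q(\rho(y))}.
% When the self-similar tree construction makes rho(y), and hence the cost
% Z_Q(rho(y)), grow logarithmically in the rank y, this exponential-family
% form becomes a power law, consistent with the Pareto form stated in Theorem 4.
```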
Theorem 4 provides a mathematical basis for the observation of Pareto distributions in the choice of search terms. Such distributions have been repeatedly verified in search term lists and in indexing, most extensively in Zipf (2016) which is a compendium of examples of indexing and word choice that follow Pareto distributions.
3.1.4 Consumer’s choice of search query terms
The previous arguments defined an economically optimal amount of information for the consumer to communicate to the search engine; they did not specify what vocabulary to use, i.e. which member of V should be chosen to communicate that information. Our second assumption provides a measure of the strength or appropriateness of a particular vocabulary search term in referencing the set of relevant webpages DR. It states that the most appropriate search terms to specify DR will be the ones that individually reference the largest number of webpages in DR. This is consistent with the assumption that the user knows his information needs precisely but fails to fully include that knowledge in the search query terms because of effort aversion. It is also consistent with the user’s desire to improve the hit rate in search retrieval by manipulating the choice of query terms. The following theorem formulates the distribution of the most appropriate terms that the consumer may select in constructing the query term list.
The breadth of the jth “most appropriate” term choice referencing the relevant webpages DR has the probability density function of the jth-order statistic of term breadth
Which in turn is equal to the incomplete Beta function:
Proof: The derivation of distributions for order statistics appears in David and Nagaraja (2004), chapter 1.
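For reference, the standard order-statistic forms from David and Nagaraja (2004, ch. 1), written here for the jth smallest of m independent draws of term breadth with CDF F and density f, are shown below; the sample size m (the number of candidate terms considered) is an assumption of this sketch, and the jth “most appropriate” (largest-breadth) term simply reverses the index j.

```latex
% Density and CDF of the j-th order statistic of m i.i.d. draws with CDF F
% and density f (David & Nagaraja, 2004, ch. 1):
f_{(j)}(x) \;=\; \frac{m!}{(j-1)!\,(m-j)!}\,
               F(x)^{\,j-1}\bigl(1-F(x)\bigr)^{\,m-j} f(x),
\qquad
\Pr\bigl(X_{(j)} \le x\bigr) \;=\; I_{F(x)}\bigl(j,\; m-j+1\bigr),
% where I_z(a,b) denotes the regularized incomplete Beta function.
```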
4. Conclusions and discussion
This paper has developed a stochastic model of search based on well-articulated mathematical models, as opposed to empirically optimized computer algorithms. Such models provide assessments of consumer search behavior that are much more generalizable and less sensitive to the local or data-specific situations that create problems for empirically tested computer algorithms. This model fills a gap in the current literature, and we hope it can be extended and applied to develop better industry business models for search.
Search query tasks are generally perceived as situation-specific and highly variable; we show that the application of linguistics can lend some order to understanding the way that information consumers craft queries and the way that these can be projected onto search engine algorithms and index bases, even where the query content may be extended by a linguistic generative pre-trained transformer (e.g. ChatGPT). In the past, where there have been purportedly “economic” studies, these have typically not been comprehensive economic models; rather, they are anecdotal, case studies or limited empirical analyses through particular narrative perspectives, or they are proposed algorithms. These may be important, but they are entirely different from the problem we are addressing, which is to create a generalized comprehensive economic model of internet search that can be used as a revenue model or a business model incorporating costs and architectures. The complexity and inefficiency of Internet search and retrieval of relevant information provide strong incentives for seeking ways to improve the cost-effectiveness of running a search engine.
In this study, we started with the assumption that the consumer’s search query arises from a process which is completely independent of the process of webpage crawling that acquires the indexed web page links that comprise the index database. As a result, the distribution of indices over DR is independent of the distribution of indices over D. As r/n → 0, the contents of DR become less and less representative of the contents of D with respect to the indexable chunks of information; otherwise, the user would always choose a search query term list consisting of the lower ρ unrestricted order statistics,
Any search term corresponding to the ith breadth order statistic
To give an idea of the scale of the Internet, Kunder (2024) reports that Google has discovered around 200 trillion web pages and indexes around 30 trillion of those. Around 30% of web pages are spam, and a much larger portion consists of redundant copies placed to bring their storage and use closer to consumers.
The size of the search term vocabulary that is used in indexing the index database affects performance through both the search query length and breadths of search terms (i.e. keywords). This collection of indexes is spanned by the set of index terms that are retrieved by the web crawler that is used to build the index database.
Consider the search consumer’s personal vocabulary. A native English speaker is likely to have a vocabulary of 20,000 to 100,000 words, and the effective vocabulary for search will be a subset of that vocabulary. Subsets may be restricted to nouns and adjectives or to topic-specific terms. The model developed in this paper assumes an indexing vocabulary that is the same as the searcher’s vocabulary. It is not clear how significantly this restriction impacts the results of the model. We believe the impact will be minimal, because on average the searchers of websites and the editors of websites are likely to share approximately the same vocabularies, particularly if we are generalizing over the entire Internet.
Should the congruence of the indexer’s and consumer’s vocabularies be low, this may result in lowered retrieval rates and potentially less consumer satisfaction. Our modeling could be extended using distinct indexer and consumer vocabularies, making assumptions about four sets of terms – the 2 × 2 matrix of consumer and indexer search terms. We argue here that this extension of the model is unlikely to improve the explanatory model, specifically because of the Pareto distribution (i.e. the long tail) of search terms.
As a result of the Pareto rank-frequency distribution of search term breadth, the vocabulary size acquires significance for several of the formulas in this analysis. The upper limit of the sum in the coefficient of normalization for the query length distribution has the value
The value of u most likely affects the breadths of relevant webpages indexed and referenced by a query
On the other hand, u affects the distribution of breadth of indexing of chosen search terms – i.e. the size of the subsets of di ∈ D indexed by a specific wi ∈ W. Because the size distribution is Pareto distributed and has very little probability weight in its rightmost (long) tail, these probabilities tend to be insensitive to changes in u after u exceeds a certain value. In practice, this occurs because large vocabularies consist mainly of rarely used words whose definitions reference only small subsets of D.
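The insensitivity to u is easy to check numerically: the normalizing sum of a rank-frequency law p ∝ i^−(1+θ) converges quickly, so adding further rare terms barely changes the probabilities of the commonly used terms. A quick Python check, with θ = 1.1 chosen purely for illustration:

```python
# Numerical check that the normalization of a truncated Zipf/Pareto law
# p_i ∝ i^-(1+theta) is insensitive to the vocabulary size u once u is
# moderately large. theta = 1.1 is an arbitrary illustrative value.
theta = 1.1
for u in (1_000, 10_000, 100_000, 1_000_000):
    z = sum(i ** -(1 + theta) for i in range(1, u + 1))
    print(f"u = {u:>9,d}   normalizing constant = {z:.6f}")
# The constant changes only in the far decimal places as u grows, so the
# probabilities assigned to commonly used terms are effectively unchanged.
```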
The distribution of search term breadth is Paretian. Therefore, if the search engine indexer selects vocabulary W large enough to index n webpages, the size ‖W‖ = u(n, θ) ≈ min_i{i : (nkθϵi)^−(1+θ) ≤ 1}, which implies ‖W‖ ≈ (nkθϵ)^−(1+θ). This may be restated more directly: when the expected number of webpages drops below a specific search term-type rank, there is no need for search terms at or above that rank. The formula is somewhat circular in that k depends on u, but u has very little effect on the value of k, and k quickly approaches 1 as u grows large. Similarly, an arbitrary randomly selected subset DR of size r will be referenced by a subset of the search term-type vocabulary of expected size u(r, θ) ≈ (rkθϵ)^−(1+θ).
This discussion highlighted each point at which vocabulary size enters into the model. At no point is vocabulary size a particularly significant parameter, nor does assuming an infinite vocabulary, ceteris paribus, affect the conclusions of the modeling. Thus, vocabulary size may be ignored except where it is explicitly entered into a function. In such cases, vocabulary referencing D can be approximated by u ≈ (nθϵ)^−(1+θ) and vocabulary referencing DR can be approximated by u ≈ (rθϵ)^−(1+θ).
One feature of search queries which was omitted in the analysis is the Boolean relationship between terms. Although the consumer specifies a set of operators between each search term drawn from
References
Al-Shabi, M. (2020). Evaluating the performance of the most important lexicons used to sentiment analysis and opinions mining. IJCSNS, 20(1), 1.
Anderson, D. (2005). Global linguistic diversity for the internet. Communications of the ACM, 48(1), 27–28. doi: 10.1145/1039539.1039562.
Argenton, C., & Prüfer, J. (2012). Search engine competition with network externalities. Journal of Competition Law and Economics, 8(1), 73–105. doi: 10.1093/joclec/nhr018.
Barabási, A.-L., & Albert, R. (1999). Emergence of scaling in random networks. Science, 286(5439), 509–12. doi: 10.1126/science.286.5439.509.
Barkovich, A. (2019). Informational linguistics: Computer, internet, artificial intelligence and language. In 2019 International Conference on Artificial Intelligence in Information and Communication (ICAIIC) (pp. 008–13). IEEE.
Blair, I. V., Urland, G. R., & Ma, J. E. (2002). Using internet search engines to estimate word frequency. Behavior Research Methods, Instruments, and Computers, 34(2), 286–90. doi: 10.3758/bf03195456.
Bruder, I., Düsterhöft, A., Becker, M., Bedersdorfer, J., & Neumann, G. (2000). Getess: Constructing a linguistic search index for an internet search engine. In International Conference on Application of Natural Language to Information Systems (pp. 227–38). Springer.
Chen, Y., & He, C. (2011). Paid placement: Advertising and search on the internet. The Economic Journal, 121(556), F309–28. doi: 10.1111/j.1468-0297.2011.02466.x.
Chervany, N. L., & Dickson, G. W. (1974). An experimental evaluation of information overload in a production environment. Management Science, 20(10), 1335–44. doi: 10.1287/mnsc.20.10.1335.
Crystal, D. (2005). The scope of internet linguistics. In Proceedings of American Association for the Advancement of Science Conference; American Association for the Advancement of Science Conference, Washington, DC (pp. 17–21).
David, H. A., & Nagaraja, H. N. (2004). Order statistics. Hoboken, NJ: John Wiley & Sons.
Dinerstein, M., Einav, L., Levin, J., & Sundaresan, N. (2018). Consumer price search and platform design in internet Commerce. American Economic Review, 108(7), 1820–59. doi: 10.1257/aer.20171218.
Falkenberg, M., Lee, J.-H., Amano, S., Ogawa, K., Yano, K., Miyake, Y., … Kim, C. (2020). Identifying time dependence in network growth. Physical Review Research, 2(2), 023352. doi: 10.1103/physrevresearch.2.023352.
Gulla, A.J., Auran, P. G., & Magne Risvik, K. (2002). Linguistics in large-scale web search. In International Conference on Application of Natural Language to Information Systems (pp. 218–22). Springer.
Harari, M. B., Parola, H. R., Hartwell, C. J., & Riegelman, A. (2020). Literature searches in systematic reviews and meta-analyses: A review, evaluation, and recommendations. Journal of Vocational Behavior, 118, 103377. doi: 10.1016/j.jvb.2020.103377.
Hill, B. M. (1974). The rank-frequency form of Zipf’s law. Journal of the American Statistical Association, 69(348), 1017–26. doi: 10.2307/2286182.
Hundt, M., Nesselhauf, N., & Biewer, C. (2007). Corpus linguistics and the web. In Corpus Linguistics and the Web (pp. 1–5). Brill.
IBIS (2024). Search engines in the US. Available from: Https://Www.ibisworld.com/United-States/Market-Research-Reports/Search-Engines-Industry/
Ijiri, Y., & Simon, H. A. (1975). Some distributions associated with Bose-Einstein statistics. Proceedings of the National Academy of Sciences, 72(5), 1654–57. doi: 10.1073/pnas.72.5.1654.
Jolivet, G., & Turon, H. (2019). Consumer search costs and preferences on the internet. The Review of Economic Studies, 86(3), 1258–1300. doi: 10.1093/restud/rdy023.
Khan, M. T., Durrani, M., Ali, A., Inayat, I., Khalid, S., & Khan, K. H. (2016). Sentiment analysis and the complex natural language. Complex Adaptive Systems Modeling, 4, 1–19. doi: 10.1186/s40294-016-0016-9.
Khinchin, A. Y. (2013). Mathematical foundations of information theory. New York: Courier Corporation.
Khoo, C. S. G., & Johnkhan, S. B. (2018). Lexicon-based sentiment analysis: Comparative evaluation of six sentiment lexicons. Journal of Information Science, 44(4), 491–511. doi: 10.1177/0165551517703514.
Koto, F., & Adriani, M. (2015). A comparative study on Twitter sentiment analysis: Which features are Good?. In Natural Language Processing and Information Systems: 20th International Conference on Applications of Natural Language to Information Systems, NLDB 2015, Passau, Germany, June 17-19, 2015 (Vol. 20, pp. 453–57). Springer, Proceedings. doi: 10.1007/978-3-319-19581-0_46.
Krapivsky, P. L., Redner, S., & Leyvraz, F. (2000). Connectivity of growing random networks. Physical Review Letters, 85(21), 4629–4632. doi: 10.1103/physrevlett.85.4629.
Kunder, M.de (2024). The size of the world wide web. Available from: https://www.worldwidewebsize.com/
Lancaster, F. W. (1971). The cost-effectiveness analysis of information retrieval and dissemination systems. Journal of the American Society for Information Science, 22(1), 12–27. doi: 10.1002/asi.4630220104.
Levin, J. D. (2011). The economics of internet markets. Cambridge, MA: National Bureau of Economic Research.
Liu, B. (2010). Sentiment analysis and subjectivity. In Handbook of Natural Language Processing (2, pp. 627–666). Boca Raton, FL: CRC Press.
Mandelbrot, B. B. (1983). On the quadratic mapping z → Z2-μ for complex μ and z: The fractal structure of its m set, and scaling. Physica D: Nonlinear Phenomena, 7(1-3), 224–39. doi: 10.1016/0167-2789(83)90128-8.
Newman, M. E. J. (2005). Power laws, Pareto distributions and Zipf’s law. Contemporary Physics, 46(5), 323–51. doi: 10.1080/00107510500052444.
Peitz, M., & Reisinger, M. (2015). The economics of internet media. In Handbook of Media Economics (Vol. 1, pp. 445–530). Elsevier. doi: 10.1016/b978-0-444-62721-6.00010-x.
Shannon, C. E. (2001). A mathematical theory of communication. ACM SIGMOBILE Mobile Computing and Communications Review, 5(1), 3–55. doi: 10.1145/584091.584093.
Sharoff, S. (2006). Open-source corpora: Using the net to fish for linguistic data. International Journal of Corpus Linguistics, 11(4), 435–62. doi: 10.1075/ijcl.11.4.05sha.
Varian, H. R. (2016). The economics of internet search. In Handbook on the Economics of the Internet (pp. 385–94). Edward Elgar Publishing.
Zipf, G. K. (2016). Human behavior and the principle of least effort: An introduction to human ecology. Ravenio Books, Addison-Wesley Press.
About the authors
James Christopher Westland is Full Professor in the Department of Information and Decision Sciences at the University of Illinois – Chicago. He has a BA in Statistics and an MBA in Accounting from Indiana University and received his PhD in Computers and Information Systems from the University of Michigan. He has professional experience in the USA as a certified public accountant and as a consultant in technology law in the USA, Europe, Latin America and Asia. He is the Editor-in-Chief of Electronic Commerce Research (Springer) and has served on editorial boards of several other information technology journals including Management Science, ISR, ECRA, IJEC and others. He has served on the faculties at the University of Michigan, University of Southern California, Hong Kong University of Science and Technology, Tsinghua University, University of Science and Technology of China, Harbin Institute of Technology and other academic institutions. In 2012, He received High-Level Foreign Expert status in China under the 1000-Talents Plan and is currently Overseas Chair Professor at Beihang University.
Jian Mou is Professor in the School of Business, Pusan National University, South Korea. His research interests include big data analysis, social media, trust and risk issues in e-service and information management. Dr Mou’s research has been published in Technological Forecasting and Social Change, Journal of the Association for Information Systems, Information and Management, Internet Research, International Journal of Information Management, Electronic Commerce Research, Information Processing and Management, Computers in Human Behavior, Information Technology and People, Behaviour and Information Technology, International Journal of Human-Computer Interaction, Journal of Retailing and Consumer Services, Information Development, International Journal of Mobile Communications and The Electronic Library.