Abstract
Purpose
Internet search is a $120bn business that answers lists of search terms or keywords with relevant links to Internet webpages. Only a few companies have sufficient scale to compete, so the economics of the process are paramount. This study aims to develop a detailed, industry-specific model of the economics of Internet search.
Design/methodology/approach
The current research develops a stochastic model of the process of Internet indexing, search and retrieval in order to predict expected costs and revenues of particular configurations and usages.
Findings
The models characterize the behavior and economics of parameters that are not directly observable and whose distributions and economics are therefore difficult to determine empirically.
Originality/value
The model may be used to guide the economics of large search engine operations, including the advertising platforms that depend on them and largely fund them.
Citation
Westland, J.C. and Mou, J. (2024), "A stochastic model of the economics of Internet search", Journal of Electronic Business & Digital Economics, Vol. 3 No. 3, pp. 203-221. https://doi.org/10.1108/JEBDE-10-2023-0023
Publisher
Emerald Publishing Limited
Copyright © 2024, James Christopher Westland and Jian Mou
License
Published in Journal of Electronic Business & Digital Economics. Published by Emerald Publishing Limited. This article is published under the Creative Commons Attribution (CC BY 4.0) licence. Anyone may reproduce, distribute, translate and create derivative works of this article (for both commercial and non-commercial purposes), subject to full attribution to the original publication and authors. The full terms of this licence may be seen at http://creativecommons.org/licences/by/4.0/legalcode
1. Introduction
Internet search is a $242bn business in the USA (IBIS, 2024) that connects lists of query keywords to 4.62 billion webpages, around 30 trillion of which are indexed by around 1 million search terms (Kunder, 2024). In 2024, Google dominated with 91.62% of the search market and 39% of the digital advertising market (followed by Facebook’s 18% and Amazon’s 7%).
Google’s primary source of revenue is keyword advertising which generated $238bn in 2023. The management of this revenue stream involves extensive modeling of internet consumer choice of keywords. Searchers' keywords are roughly divided into informational and transactional, where only the transactional keywords are typically tied to ad placement. Beyond these rough divisions, there is a complex and mostly proprietary set of internal models that Google uses to assess consumer behavior. Since Google Panda’s release in February 2011, Google has been very secretive about features, constructs and models used to determine consumer search behavior.
Google starts with human-generated natural language queries and links query keywords to indexed webpages. The fundamental challenge has remained constant throughout Google’s history: generate keyword-specific relevance statistics for its indexed webpages. There is a substantial need for detailed industry-specific modeling of the economics of internet search, if only because the creation and operation of successful business models are predicated on broader studies, like ours, of the factors in the architecture of search and how they affect costs and revenue generation. Our study contributes to the relatively thin existing literature on the economics of internet search.
In the current research, we develop a comprehensive economic model of Internet keyword search. Though there are other approaches to search, keyword algorithms are by far the most economically important, as they are adopted by the industry leaders Alphabet/Google and Microsoft. The model’s objective is to summarize the economics of search engine processes and to predict the expected costs and revenues of particular configurations and usages. By better characterizing the indexing and query-specific retrieval tasks using a stochastic model, we can provide a robust description of the producer’s (search engine’s) problem of optimizing the acquisition of information on webpages in order to satisfy consumers’ information needs with the information returned to them. The accurate characterization of both the producer’s and the consumer’s economic problems in this paper allows the construction of a complete economic model of keyword search.
An important advantage of stochastic modeling lies in the fact that most parameters involved in search are not directly observable; given the immense scope of the entire Internet, and the fact that indexing occurs for only around 3% of webpages, it is difficult to determine the distributions and economics empirically. By creating a global stochastic model of indexing, search and retrieval, it may still be possible to validate the model locally or for topic-specific subsets, while the global conclusions of the model will help guide the economics of large search engine operations such as those managed by Google and Microsoft (including the advertising platforms that depend on them and largely fund them).
2. Prior literature
Internet search has been extensively studied over the past two decades, and there have been substantial advances in Internet search algorithms during that period. The vast majority of this has either been network and demographic research or has been algorithmic computer science and engineering research with the objective of optimizing performance on specific metrics, e.g. sensitivity-specificity, Bayes risk, precision-recall, perceived user engagement time metrics and so forth. The mechanics of search have been studied extensively, often from a computer science standpoint of building models, running them and reporting on the performance of these implementations. In contrast, there have been few peer-reviewed journal articles on the economics of internet search. The majority of that work has been descriptive, similar to industry white papers, and has appeared as book chapters or languishes unpublished on arXiv. Many of the comprehensive models of Internet search economics are speculative and anecdotal, presented in non-peer-reviewed books, e.g. (Levin, 2011; Peitz & Reisinger, 2015; Varian, 2016), or they reflect particular agendas or industry-specific experiences, e.g. (Chen & He, 2011; Argenton & Prüfer, 2012; Jolivet & Turon, 2019; Al-Shabi, 2020; Dinerstein, Einav, Levin, & Sundaresan, 2018). Peer-reviewed articles are almost completely restricted to sector- or industry-specific models, partly because of limitations in data access. Where natural language processing (NLP) models are studied, e.g. (Khoo & Johnkhan, 2018; Liu et al., 2010; Harari, Parola, Hartwell, & Riegelman, 2020; Koto & Adriani, 2015; Khan et al., 2016), they are typically in pursuit of sentiment extraction from large text databases. None of these provide a generalized underlying NLP model of search processes and search economics.
Search query tasks are generally perceived as situation-specific and highly variable. But research over the past 30 years in linguistics and search applications has shown that some order can generally be imposed upon the queries generated by information “consumers” (Blair, Urland, & Ma, 2002; Anderson, 2005; Barkovich, 2019; Bruder, Düsterhöft, Becker, Bedersdorfer, & Neumann, 2000; Crystal, 2005; Gulla, Auran, & Magne Risvik, 2002; Sharoff, 2006) and (Hundt, Nesselhauf, & Biewer, 2007). User search from the consumer standpoint can be projected onto search engine algorithms and index bases, where the query content may be extended by a linguistic generative pre-trained transformer (e.g. ChatGPT).
Where there have been purportedly “economic” studies, these have typically not been comprehensive economic models; rather, they are anecdotal, case studies or limited empirical analyses through particular narrative perspectives, or they are proposed algorithms. These may be important, but they are entirely different from the problem we are addressing, which is to create a generalized, comprehensive economic model of Internet search that can be used as a revenue model or a business model incorporating costs and architectures. Many of the models presented in books (where the majority of such models are mooted) are extended white papers, without general scholarly research underlying their structure and proposals. Indeed, Internet search economic models are rare, and where they exist, the majority seem to be on arXiv or in books, e.g. (Levin, 2011; Peitz & Reisinger, 2015; Varian, 2016). These tend to share specific industry perspectives but have not appeared in peer-reviewed scholarly journal studies where they may be shared more widely with researchers around the world.
It is also important to note that indexing occurs on only a small subset of Internet sites, as most webpages are of low quality or redundant; they contain little additional information relevant to consumer information needs and queries. There is no “master” index, and indices are accumulated inefficiently through hyperlink crawling. The complexity and inefficiency of Internet search and retrieval of relevant information provide strong incentives for seeking ways to improve the cost-effectiveness of running a search engine.
3. Consumer choice in Internet search
Behavioral economics (BE) arose in the 1950s in response to the optimization and rational choice models being promoted in economics; behavioral economists argued that humans are subject to irrational thinking, overconfidence, anchoring and representativeness, and that real price series and economic behavior therefore veer from economic ideals. BE introduced important concepts that are now mainstream in economics but also embraced fringe theories that are highly controversial. The current research steers clear of the more controversial methods in BE and highlights the specific models, the context and the extant literature for the accepted and statistically valid methods of BE. In general, we note that methodologies in this area are fundamentally the same as in any other scientific research: (1) empirical feature extraction; (2) model building; and (3) model confirmation. Each of these has specific families of methods for statistical analysis and for empirical data collection and curation, and in order to link into empirical testing protocols, we have designed our models to be predicated on accepted methods in BE, taking into account that behavioral economics is economics in which a much wider, and more realistic, range of behaviors is attributed to humans and their economic decision making. The current research develops a set of axioms based on widely accepted behavioral assumptions. We start by articulating the consumer’s choice problem and then move on to the architecture of an axiomatic consumer search query model. The consumer choice problem is fundamental to economically important activities such as keyword advertising on search platforms and search engine optimization.
Information consumers articulate a specific need using only a set of keywords – i.e. a term list VS of random length, composed of selections from their own vocabulary V. Unique vocabulary terms ∈ V may be 1-, 2- or 3-grams, with longer key phrases being exceptionally rare. Google Ads recognizes about 400,000 vocabulary terms in English. The number probably varies significantly between languages, with relatively sparse languages like Chinese at the low end and Arabic at the higher end.
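As a small illustration of the kind of term list described above, the following Python sketch enumerates the candidate 1-, 2- and 3-gram terms contained in a raw query string; the naive whitespace tokenizer and the example query are purely illustrative assumptions, not part of the model.

```python
# Illustrative sketch: enumerate candidate 1-, 2- and 3-gram vocabulary terms
# from a raw search query. The whitespace tokenizer is deliberately naive.
def candidate_terms(query: str, max_n: int = 3) -> list[str]:
    tokens = query.lower().split()
    terms = []
    for n in range(1, max_n + 1):                 # 1-grams, 2-grams, 3-grams
        for i in range(len(tokens) - n + 1):
            terms.append(" ".join(tokens[i:i + n]))
    return terms

print(candidate_terms("cheap flights to chicago"))
# ['cheap', 'flights', 'to', 'chicago', 'cheap flights', 'flights to',
#  'to chicago', 'cheap flights to', 'flights to chicago']
```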
Search term lists are constructed through several concurrent user-initiated processes. The first process is the one that generates the information need. This process may range in importance from trivial curiosity to a serious investigation of complex and important issues (e.g. for a purchase transaction). The intensity of the user’s information need sets limits on the effort, utility or expense that the user is willing to incur in obtaining that information. This will affect the number of responses that the consumer is willing to demand, read and digest. If the cost of accessing website documents in terms of the user’s time, money and so forth is fixed, the major variable cost component of the user’s effort is the construction of the search query. This search query can be depicted as a set of keywords or terms summarizing the user’s information need. Term relationships can be depicted through models of the Boolean operations of conjunction ∩ and disjunction ∪. The consumer’s effort in constructing the query is directly proportional to the number of terms in the query ρ. Making only weak assumptions constraining the size of the search query (e.g. that there is an upper bound for the maximum expenditure of effort possible), the next section will derive a stochastic distribution of search query term list length.
A search query comprises language terms ∈ V linked by Boolean operators drawn from {∩, ∪}.
Preprocessing of search queries is increasingly common – through query completion, and more recently by machine learning algorithms like LaMDA, ChatGPT or Bing AI ChatGPT. Since information from this preprocessing does not come from the consumer but rather from the search algorithm, it is not assumed to change the nature of the transaction between the consumer (searcher) and the producer (search engine).
Search index databases are ideally designed with a high correlation between the vocabulary of the consumer and the vocabulary expressed in the indexing of the search data. In modern search engines such as Google and Bing, this concordance is enhanced dynamically based on location, language, prior use and other factors. Consumers also tend to self-select into being users of search engines that have historically provided them with the highest satisfaction from the information retrieved. Google and Bing know this and have active internal development teams seeking the best ways to maximize consumer satisfaction with their search engines. As a result, terms in the search query have isomorphic counterparts in the index set, and the index set has a direct linkage to the web documents that it indexes.
Index sets do not exist as intrinsic parts of the web pages they reference; rather, they are inferred by the web crawlers that collect this information and the algorithms that curate it into the index database. This production process parses web documents/pages and produces links to descriptive terms, usually dominated by nouns, adjectives and noun clauses. The keyword terms are drawn from a generalized vocabulary or may be artificially restricted to a context-specific keyword vocabulary in order to enforce consistency and better satisfy consumers. The net effect of the web crawling-curating process that updates the index database is that keyword query terms end up linking to specific web document pages.
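This crawl-and-curate production process can be pictured as the construction of an inverted index that maps each term to the set of pages it references. The Python sketch below is a minimal, assumed illustration; the parse_terms helper is a hypothetical stand-in for the production parsing step that would extract nouns, adjectives and noun clauses.

```python
from collections import defaultdict

# Minimal sketch of index curation: map each keyword term to the set of
# webpage identifiers it references. `parse_terms` is a hypothetical stand-in
# for the production parsing/curation step.
def parse_terms(page_text: str) -> set[str]:
    return set(page_text.lower().split())

def build_index(crawled_pages: dict[str, str]) -> dict[str, set[str]]:
    index = defaultdict(set)
    for url, text in crawled_pages.items():
        for term in parse_terms(text):
            index[term].add(url)          # term -> webpages it references
    return index

pages = {
    "example.com/a": "stochastic model of search economics",
    "example.com/b": "search engine advertising economics",
}
index = build_index(pages)
print(sorted(index["economics"]))         # both pages reference this term
```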
Given a specific query term, there is a specific probability that a randomly selected web page will include this specific query term in its summarization in the index set recorded on the search engine’s index database. The next section develops the stochastic models giving the probability distributions of these terms.
The central figure of merit in search assessment is relevance, which is directly related to consumer satisfaction with a search and, over the long run, to a particular search engine’s performance. Relevance can take as many forms as there are forms of intent for a search. Transactional searches look for products and services, informational searches look for authoritative information, and curiosity can impel some searches, as can boredom, where the figure of merit may be simple entertainment. Relevance here will measure the satisfaction the searcher has with the results of a particular search.
Information consumers are presumed to be less concerned with the number of links returned from a search and much more concerned about the relevance of those links to the consumer’s search intention. Search intention is not a priori observable, but the consumer’s behavior in reviewing responses (e.g. the number of links clicked, time on the response page) can be used to assess consumer satisfaction with the search. Formally, the search of a specified domain D of size n receives a random sequence of search queries arising from consumer information needs
This sort of partitioning of links into relevant and not-relevant suggests that we could adapt a Neyman–Pearson hypothesis-testing model, where the choice is supported by the Neyman–Pearson lemma. Indeed, search engines use modifications called Precision-Recall models or Sensitivity-Specificity models that are variations on Neyman–Pearson models.
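As a concrete illustration of the precision-recall variant mentioned above, the short Python sketch below computes precision and recall for a single query, treating the set of truly relevant pages as known; in practice that set is unobserved, and the link lists here are invented for illustration.

```python
# Sketch: precision and recall for one query, given the returned link list and
# a set of truly relevant pages (assumed known here purely for illustration).
def precision_recall(returned: list[str], relevant: set[str]) -> tuple[float, float]:
    hits = sum(1 for link in returned if link in relevant)
    precision = hits / len(returned) if returned else 0.0   # relevant share of returned
    recall = hits / len(relevant) if relevant else 0.0      # returned share of relevant
    return precision, recall

returned = ["d1", "d2", "d3", "d4"]      # links shown to the consumer
relevant = {"d2", "d4", "d7"}            # pages that would satisfy the need
print(precision_recall(returned, relevant))   # (0.5, 0.666...)
```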
3.1 Axiomatic search query model
This section characterizes the consumers' construction of a search term (keywords) list. The first task in this process enumerates the web pages to which a specific term refers. This is accomplished by equating the indexer’s and consumers' term vocabularies and then statistically characterizing the breadths of terms referencing the search engine index database. The second task characterizes the amount of effort which the user expends in articulating a search query, and the maximum amount of information which a query may contain. It then develops a probability distribution of the number of terms
We will start by making some basic modeling assumptions that are obvious but need to be articulated. The following assumptions are made in constructing the search model.
The intersection of all webpages indexed and the set of relevant webpages exactly captures the information required to satisfy the consumer’s information needs
This assumption tells us that the consumer will be satisfied with a search if it only contains information that exists in the indexed database. In Google’s case, this indexed database of webpages consists of 30 million high-information content webpages out of around 4.62 billion webpages. This is a very precise approximation of reality.
Consumers gauge the appropriateness of a search term for describing a particular set of webpages by the breadth of that term in indexing those webpages, where “appropriate” implies that they reference the largest number of webpages in the full set of relevant webpages DR.
This assumption tells us that the consumer, in making a choice among search keywords, will choose keywords that are broad, rather than specific, in the hope that a larger proportion of relevant webpages will be returned in the search. It does not suggest that consumers cannot make specific searches, merely that they will want more choices from the webpages returned rather than fewer.
Consumers want to maximize the amount of information retrieved without expending more than some maximum amount of effort κq in formulating a search query.
This assumption tells us that the consumer will minimize the energy and attention expended on any single search.
The search engine index database is organized by topics (e.g. derived through cluster analysis or unsupervised learning), though that set of topics may not be static over time. This assumption tells us that clustering is a part of the indexing of webpages, an assumption which has been widely confirmed by search firms. Clustering means that, where there is either uncertainty in understanding the consumer’s intention or the consumer is interested in serendipitous search responses, the responses will be organized by “closeness” to the main search responses.
Vocabularies used in the curation of the search engine index W and vocabularies of the consumers V are identical. In the notation in this research, V ≡ W and vi ≡ wi, ∀i.
This assumption tells us that the consumer and the search engine use the same language.
3.1.1 Search term list length and composition
3.1.1.1 Breadth of indexing
This section formulates the stochastic distribution for breadth, defined as the size of the webpage set in the index database for the search engine, referenced by a particular vocabulary term vi ∈ V. This material builds on models developed by Ijiri and Simon (1975), Mandelbrot (1983) and Hill (1974).
Average term breadth is the inverse of the average of the commonly used term specificity measure which is described in Lancaster (1971). The following notation is adopted. β(wi) is the set of webpages di ∈ D which are referenced by vi ≡ wi ∈ V ≡ W and the size of
The following three theorems provide the model for the breadth of referencing by the user’s query terms. Assumption #2 states that users choose particular wi comprising the query term lists so that they have greater breadth, individually, in referencing documents restricted to DR than any other term in W. As r/n → 0 (i.e. as the ratio of the size of the set of relevant webpages to the size of the entire search-indexed webpages tends to zero), the topic composition of DR becomes significantly different from the composition of D. Therefore, the “most appropriate” search terms for referencing DR will have uniform randomly selected breadths when referencing the entire search index database.
Theorem 1 provides a method to determine how many webpages are referenced by a specific search term, assuming that vocabularies are infinite and infinitely divisible. This is important because it lets us know how many webpages will be retrieved if a search query is formulated that includes only that search term. Theorems 2 and 3 will relax this assumption, while extending results to show how to transform and combine the consequences of Theorem 1 to obtain the webpage count for a more complex search query, such as a Boolean query or a “concept” extended Boolean query that might be generated by completion or machine learning extensions to the original search query.
Given an infinite search term vocabulary, the probability that a webpage uniform-randomly selected from D, or from any subset DR ⊆ D, is referenced by a search term wi ∈ W follows a Pareto distribution. If
Proof: Assume D consists of an aggregation of information chunks where each chunk contains a uniform measure of information and an information chunk is the smallest unit of text indexable by a search term-type; in other words, it is the limit of resolution of the index set W. For example, in a full-text system, this limit of resolution is the term-token in the text; at the other extreme of links and citation lists, the limit of resolution may be only term-tokens in both the webpage title and a limited vocabulary (e.g. a glossary) of terms. Assume a specific term-type or chunk reference relationship occurs only once in a webpage. Then the set of all information chunks in D can be rearranged by term-type. Figure 1 depicts term-tokens or chunks as small circles, while the term-types are determined by clustering these term-tokens between the delimiting bars for the term-type.
Assume that the left and right-most bars are terminators, and the other bars delimit the sets of tokens or chunks by term-type. Assumption 4 above states, in the topic-specific index database, that the current topic orientation of the index determines the selection of new acquisitions to that index. This is formalized by stating that the proportion of term-type or chunk references in any newly acquired webpage in the index database is proportional to the number of chunks of that term-type already residing in the index. This is a “preferential attachment process” in which new acquisitions are distributed according to how much already exists, and it has been widely documented on the Internet (e.g. see Newman, 2005; Barabási & Albert, 1999; Krapivsky, Redner, & Leyvraz, 2000; Falkenberg et al., 2020). Ijiri and Simon (1975) showed that cell occupancies in preferential attachment follow Bose-Einstein statistics and cell size distributions are asymptotically Paretian as the number of cells (i.e. term-type classifications delimited by the bars in Figure 1) grows large. In a large index database, p ∝ i^−(1+θ) (a Pareto distribution with “temperature of discourse” θ and rank i ∈ [1, u]) is the probability that a randomly selected term-type wi references
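A short simulation illustrates the preferential-attachment argument in the proof: when each newly acquired chunk reference attaches to a term-type in proportion to the chunks that term-type already holds, the occupancy counts become heavy-tailed (Paretian) as the index grows. This is a toy sketch under assumed parameters, not the production crawler or curation pipeline.

```python
import random
from collections import Counter

# Toy preferential-attachment simulation of index growth: each new chunk
# reference copies the term-type of a uniformly chosen existing chunk
# (attachment probability proportional to current occupancy), or starts a
# brand-new term-type with small probability p_new. Parameters are arbitrary.
def simulate_index(n_chunks: int = 200_000, p_new: float = 0.01, seed: int = 1):
    random.seed(seed)
    chunk_types = [0]                     # term-type id of each chunk so far
    next_type = 1
    for _ in range(n_chunks):
        if random.random() < p_new:
            chunk_types.append(next_type) # a brand-new term-type
            next_type += 1
        else:
            chunk_types.append(random.choice(chunk_types))
    occupancy = Counter(chunk_types)      # chunks held by each term-type
    return sorted(occupancy.values(), reverse=True)

sizes = simulate_index()
print(sizes[:10])   # a handful of term-types hold most chunks (heavy tail)
```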
A corollary provides a method to determine the number of webpages referenced by a specific term when the vocabulary is finite and of size u.
Corollary: When the vocabulary size u < ∞, the probability that an arbitrarily selected term wi ∈ W references a randomly selected indexed webpage in DR has the probability mass function
Proof:
The next theorem uses these results to determine the number of webpages that are indexed by a randomly selected index term. The rank-frequency distribution of the
Let
The corresponding mass function for
Proof:
That is, the probability that the breadth of a randomly selected term is less than
The formula for the mass function follows directly from the fact that
The following corollary uses these results to determine how many of the relevant webpages (i.e. that provide the consumer with high satisfaction) are indexed by a randomly selected index term. In formulating a search query keyword list, the consumer is assumed to maximize their future satisfaction by choosing search terms which reference a large number of relevant webpages and few irrelevant webpages. Indeed, this is the intended purpose of search support tools in the search engine, such as keyword completion, and ChatGPT style assistance for search.
Corollary: Let
Then the cumulative distribution function of
Proof: Follows the form of theorem 2.
3.1.2 How much information is contained in an index?
How much information is contained in the index database for DR? Assume that a particular η, a consumer search need realization (i.e. a search), induces a partitioning of the set of webpages D into DR and D − DR. The fixed set DR may be considered to contain a fixed amount of information which is the maximum information that may be communicated using a search term list.
A consumer’s information needs, with the intent of satisfying them through search, determine two factors of importance for search: (1) the set of webpages that are relevant and (2) the maximum effort that the consumer is willing to expend in executing the search and satisfying those needs. Consumers are presumably aware of this trade-off, but can only incompletely communicate it. Search engines are similarly aware of these trade-offs. The search engine itself can be viewed as a transformation or mapping of the consumer’s articulation of search needs projected on the search term list (keywords) and mapped into a set of indexed webpages at a particular point in time. The indexed database is always changing, because of updates by the web crawler and also by producer updates intended to improve the retrieval experience and customer satisfaction – i.e. the relevance of the returned links.
Although there are many definitions of information, the one that is most appropriate in a Boolean framework is the Shannon (2001) information metric. Adopting Shannon’s perspective, let there be a set of probabilities pi associated with a set of search terms wi that reference a randomly selected webpage d ∈ D. Within the context of this specific index database (i.e. one that is a snapshot at a given point in time), the amount of Shannon information in the set of search terms is H = −∑i pi log2 pi.
Khinchin (2013) interpreted Shannon’s metric as a measure of the reduction in uncertainty resulting from obtaining a particular piece of information – e.g. the content of a retrieved webpage. Following the Bayesian statistical method, prior information can be segregated from new information and there may be a quasi-absolute measure of information corresponding to the Bayesian likelihood function. In the current research, we make the weak assumption that this measure is monotonic increasing in the Shannon metric H. This information is (1) embedded in the indexing of webpages on the search engine’s index database, drawn from vocabulary W; and (2) embedded also in the search queries with keyword-terms drawn from the consumer’s vocabulary V.
Now consider the value-added service offered by the search engine, which adds information to the index set but is hugely costly; in most commercial Internet searches, these costs are typically covered by advertising revenues. Consider a hypothetical situation to describe the addition of information in terms of the Shannon metric. Assume that you are visiting a rather strange library where all of the books have indistinguishable bindings, devoid of titles, authors or other markings. Assume also that you are not allowed to browse through the pages of these books. You have a lot of information in front of you, but no way to differentiate one information receptacle from another. This is the problem faced at start-up by a search engine confronting an Internet full of webpage links, with a relatively limited budget with which to make sense of them. In this situation, a randomly selected book may be relevant to any search term in a consumer’s vocabulary with equal probability. Since there is no way of differentiating books from each other, indexing would provide no information, because no linkages can be established between a search query and “relevant” books. More formally, make the simplifying assumption that vi ≡ wi and W ≡ V. A link scheme with u terms would be
Theorem 3 formalizes the concept of the information contained in an indexing scheme. Theorem 3’s measure is an exponential of H, which is order-preserving and preference-preserving. It describes the maximum amount of information which may be embedded in the consumer’s search query or in the index database and sets an upper limit on the amount of information that a search may communicate. The amount is a maximum because search queries reflect information needs that can only partially be satisfied by the search engine – for most searches, the consumer is likely to want information that is not included anywhere on the Internet. Some examples of searches that are unlikely to be satisfied with Internet webpages might be anything in top secret documents, the elixir of life, the truth behind numerous conspiracy theories and so forth. Search engines can only deliver information contained in the indexable Internet.
Theorem 3 assumes that webpages may contain variable numbers of meaningful chunks of information, each chunk being information that might satisfy the consumer.
The amount of information contained in the indexing of DR with vocabulary W is
Proof: Assume that there exist a total of k information chunks in DR ⊆ D and let ki be the number of chunks represented by the index term wi. The total chunk count of relevant information on the index database is k, while the total webpage count of relevant information on the index database is r. The number of different subsets of DR characterized by the same set of term repetitions ki is
As DR becomes large
Take the base 2 logarithm of this expression
Divide through by k, placing that term in front of the entire expression, and note that as k increases the term involving 2π tends to zero and the 1/2’s in the former exponents become insignificant. Therefore, we can approximate this as
Raise 2 to the power of this exponent (since we originally took the log2) and you derive the asymptotic approximation
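For completeness, a compressed LaTeX rendering of the Stirling-approximation steps narrated in the proof (with the assumption, consistent with the definition of H above, that pi = ki/k) is:

```latex
% Number of distinguishable subsets of D_R with term repetitions (k_1,...,k_u):
\frac{k!}{k_1!\,k_2!\cdots k_u!}
% Applying Stirling's approximation and taking the base-2 logarithm,
% the (1/2)\log(2\pi m) terms vanish relative to k as k grows, leaving
\log_2 \frac{k!}{\prod_i k_i!}
  \;\approx\; k\log_2 k - \sum_i k_i\log_2 k_i
  \;=\; -k\sum_i \frac{k_i}{k}\log_2\frac{k_i}{k}
  \;=\; kH ,
% so raising 2 to this power gives the asymptotic count  2^{kH}.
```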
3.1.3 Consumer’s choice of search term list length
Given the time, energy and motivation to sift through an entire collection of retrieved webpages, a consumer could precisely identify the set DR of webpages relevant to need η. This assumption is unrealistic, and in fact consumers will hold some “reservation price” – i.e. an amount of effort that represents an upper limit to the amount of time, energy, etc. that they are willing to expend on a search – call this κq. The count of terms in the search term list may be expected to be a monotonic increasing function of the level of effort that the consumer expends, and that expenditure is represented by the cost (disutility) function ZQ(ρ), where ρ is the length of the search term list. Theorem 4 calculates how the consumer’s cost-benefit considerations translate to search term lists.
The probability that the user chooses a search term list of length ρ is
This is Pareto distributed in the cost ZQ, and κq is the maximum amount of effort that the user is willing to expend in formulating a query.
Proof: Assume that the user has a total of Y possible search query term lists with which to represent the relevant webpage set DR, composed of the u search terms in V. Construct a set of u + 1 “trunks” for the potential search term list “trees”, with one for each search term-type plus a null “terminator” that marks the end of a search term list. Then add u + 1 similar branches to each trunk except for the terminator trunk, and so forth onto the branches so constructed. Because the branches extending forward from any particular branch in this structure form a structure that is self-similar, i.e. it duplicates the entire structure, the distribution of
Define ρ(y) as the length of the search term list of rank y from the trunk to the terminator. The cost or disutility associated with a search term list of length ρ(y) is
Additionally, note that a search term list with one additional term must always be of rank greater than y
Let
(1) Effort expenditure
(2) Summation to 1
(3) Maximum total information is kH
We use Lagrange multipliers to optimize this objective function with constraints, starting with the Lagrangian. Here we assume that
The first-order conditions are computed by
Eliminating
Substitute into the previous equation and note that all terms in their probability density function depend on ρ(y)
The kernel of the density of
This is a Pareto distribution in the cost (disutility) function ZQ(⋅) (Q.E.D.)
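The exact closed form stated in the theorem is not reproduced here, but the shape of the Lagrange-multiplier step can be sketched as follows. This is a hedged reconstruction under two assumptions not spelled out in the text: the objective is taken to be the Shannon information of the choice distribution p(y), and only the normalization and effort constraints are written out.

```latex
% Hedged reconstruction (not the authors' exact Lagrangian): maximize the
% information of the choice distribution p(y) subject to normalization and
% an effort budget kappa_q.
\mathcal{L} \;=\; -\sum_y p(y)\log_2 p(y)
  \;-\;\lambda_1\Bigl(\sum_y p(y)-1\Bigr)
  \;-\;\lambda_2\Bigl(\sum_y p(y)\,Z_Q(\rho(y))-\kappa_q\Bigr)
% First-order condition in p(y):
-\log_2 p(y) - \tfrac{1}{\ln 2} - \lambda_1 - \lambda_2 Z_Q(\rho(y)) \;=\; 0
\;\;\Longrightarrow\;\;
p(y) \;\propto\; 2^{-\lambda_2 Z_Q(\rho(y))}.
% When the self-similar tree construction makes rho(y), and hence the cost
% Z_Q(rho(y)), grow logarithmically in the rank y, this exponential-family
% form becomes a power law, consistent with the Pareto form stated in Theorem 4.
```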
Theorem 4 provides a mathematical basis for the observation of Pareto distributions in the choice of search terms. Such distributions have been repeatedly verified in search term lists and in indexing, most extensively in Zipf (2016) which is a compendium of examples of indexing and word choice that follow Pareto distributions.
3.1.4 Consumer’s choice of search query terms
The previous arguments defined an economically optimal amount of information for the consumer to communicate to the search engine; they did not specify what vocabulary to use, i.e. which member of V should be chosen to communicate that information. Our second assumption provides a measure of the strength or appropriateness of a particular vocabulary search term in referencing the set of relevant webpages DR. It states that the most appropriate search terms to specify DR will be the ones that individually reference the largest number of webpages in DR. This is consistent with the assumption that the user knows his information needs precisely but fails to fully include that knowledge in the search query terms because of effort aversion. It is also consistent with the user’s desire to improve the hit rate in search retrieval by manipulating the choice of query terms. The following theorem formulates the distribution of the most appropriate terms that the consumer may select in constructing the query term list.
The breadth of the jth “most appropriate” term choice referencing the relevant webpages DR has the probability density function of the jth-order statistic of term breadth
Which in turn is equal to the incomplete Beta function:
Proof: The derivation of distributions for order statistics appears in David and Nagaraja (2004), chapter 1.
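For reference, the standard order-statistic forms from David and Nagaraja (2004, ch. 1), written here for the jth smallest of m independent draws of term breadth with CDF F and density f, are shown below; the sample size m (the number of candidate terms considered) is an assumption of this sketch, and the jth “most appropriate” (largest-breadth) term simply reverses the index j.

```latex
% Density and CDF of the j-th order statistic of m i.i.d. draws with CDF F
% and density f (David & Nagaraja, 2004, ch. 1):
f_{(j)}(x) \;=\; \frac{m!}{(j-1)!\,(m-j)!}\,
               F(x)^{\,j-1}\bigl(1-F(x)\bigr)^{\,m-j} f(x),
\qquad
\Pr\bigl(X_{(j)} \le x\bigr) \;=\; I_{F(x)}\bigl(j,\; m-j+1\bigr),
% where I_z(a,b) denotes the regularized incomplete Beta function.
```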
4. Conclusions and discussion
This paper has developed a stochastic model of search based on well-articulated mathematical models, as opposed to empirically optimized computer algorithms. Such models provide assessments of consumer search behavior that are much more generalizable and less sensitive to the local or data-specific situations that create problems for empirically tested computer algorithms. This model fills a gap in the current literature, and we hope it can be extended and applied to develop better industry business models for search.
Search query tasks are generally perceived as situation-specific and highly variable; we show that the application of linguistics can lend some order to understanding the way that information consumers craft queries and the way that these can be projected onto search engine algorithms and index bases, even where the query content may be extended by a linguistic generative pre-trained transformer (e.g. ChatGPT). In the past, where there have been purportedly “economic” studies, these have typically not been comprehensive economic models; rather, they are anecdotal, case studies or limited empirical analyses through particular narrative perspectives, or they are proposed algorithms. These may be important, but they are entirely different from the problem we are addressing, which is to create a generalized comprehensive economic model of internet search that can be used as a revenue model or a business model incorporating costs and architectures. The complexity and inefficiency of Internet search and retrieval of relevant information provide strong incentives for seeking ways to improve the cost-effectiveness of running a search engine.
In this study, we started with the assumption that the consumer’s search query arises from a process which is completely independent of the process of webpage crawling that acquires the indexed web page links that comprise the index database. As a result, the distribution of indices over DR is independent of the distribution of indices over D. As r/n → 0, the contents of DR become less and less representative of the contents of D with respect to the indexable chunks of information; otherwise, the user would always choose a search query term list consisting of the lower ρ unrestricted order statistics,
Any search term corresponding to the ith breadth order statistic
To give an idea of the scale of the Internet, Kunder (2024) reports that Google has discovered around 200 trillion web pages and indexes around 30 trillion of those. Around 30% of web pages are spam, and a much larger portion consists of redundant copies placed to bring their storage and use closer to consumers.
The size of the search term vocabulary that is used in indexing the index database affects performance through both the search query length and breadths of search terms (i.e. keywords). This collection of indexes is spanned by the set of index terms that are retrieved by the web crawler that is used to build the index database.
Consider the search consumer’s personal vocabulary. A native English speaker is likely to have a vocabulary of 20,000 to 100,000 words, and the effective vocabulary for search will be a subset of that vocabulary. Subsets may be restricted to nouns and adjectives or to topic-specific terms. The model developed in this paper assumes an indexing vocabulary that is the same as the searcher’s vocabulary. It is not clear how significantly this restriction impacts the results of the model. We believe the impact will be minimal, because on average the searchers of websites and the editors of websites are likely to share approximately the same vocabularies, particularly if we are generalizing over the entire Internet.
Should the congruence of the indexer’s and consumer’s vocabularies be low, this may result in lowered retrieval rates and potentially less consumer satisfaction. Our modeling could be extended using distinct indexer and consumer vocabularies, making assumptions about four sets of terms – the 2 × 2 matrix of consumer and indexer search terms. We argue here that this extension of the model is unlikely to improve the explanatory model, specifically because of the Pareto distribution (i.e. the long tail) of search terms.
As a result of the Pareto rank-frequency distribution of search term breadth, the vocabulary size acquires significance for several of the formulas in this analysis. The upper limit of the sum in the coefficient of normalization for the query length distribution has the value
The value of u most likely affects the breadths of relevant webpages indexed and referenced by a query
On the other hand, u affects the distribution of breadth of indexing of chosen search terms – i.e. the size of the subsets of di ∈ D indexed by a specific wi ∈ W. Because the size distribution is Pareto distributed and has very little probability weight in its rightmost (long) tail, these probabilities tend to be insensitive to changes in u after u exceeds a certain value. In practice, this occurs because large vocabularies consist mainly of rarely used words whose definitions reference only small subsets of D.
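The insensitivity to u is easy to check numerically: the normalizing sum of a rank-frequency law p ∝ i^−(1+θ) converges quickly, so adding further rare terms barely changes the probabilities of the commonly used terms. A quick Python check, with θ = 1.1 chosen purely for illustration:

```python
# Numerical check that the normalization of a truncated Zipf/Pareto law
# p_i ∝ i^-(1+theta) is insensitive to the vocabulary size u once u is
# moderately large. theta = 1.1 is an arbitrary illustrative value.
theta = 1.1
for u in (1_000, 10_000, 100_000, 1_000_000):
    z = sum(i ** -(1 + theta) for i in range(1, u + 1))
    print(f"u = {u:>9,d}   normalizing constant = {z:.6f}")
# The constant changes only in the far decimal places as u grows, so the
# probabilities assigned to commonly used terms are effectively unchanged.
```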
The distribution of search term breadth is Paretian. Therefore, if the search engine indexer selects vocabulary W large enough to index n webpages, the size ‖W‖ = u(n, θ) ≈ min_i{i : (nkθϵi)^−(1+θ) ≤ 1}, which implies ‖W‖ ≈ (nkθϵ)^−(1+θ). This may be restated more directly: when the expected number of webpages drops below a specific search term-type rank, there is no need for search terms at or above that rank. The formula is somewhat circular in that k depends on u, but u has very little effect on the value of k, and k quickly approaches 1 as u grows large. Similarly, an arbitrary randomly selected subset DR of size r will be referenced by a subset of the search term-type vocabulary of expected size u(r, θ) ≈ (rkθϵ)^−(1+θ).
This discussion highlighted each point at which vocabulary size enters into the model. At no point is vocabulary size a particularly significant parameter, nor does assuming an infinite vocabulary, ceteris paribus, affect the conclusions of the modeling. Thus, vocabulary size may be ignored except where it is explicitly entered into a function. In such cases, vocabulary referencing D can be approximated by u ≈ (nθϵ)^−(1+θ) and vocabulary referencing DR can be approximated by u ≈ (rθϵ)^−(1+θ).
One feature of search queries which was omitted in the analysis is the Boolean relationship between terms. Although the consumer specifies a set of operators between each search term drawn from
References
Al-Shabi, M. (2020). Evaluating the performance of the most important lexicons used to sentiment analysis and opinions mining. IJCSNS, 20(1), 1.
Anderson, D. (2005). Global linguistic diversity for the internet. Communications of the ACM, 48(1), 27–28. doi: 10.1145/1039539.1039562.
Argenton, C., & Prüfer, J. (2012). Search engine competition with network externalities. Journal of Competition Law and Economics, 8(1), 73–105. doi: 10.1093/joclec/nhr018.
Barabási, A.-L., & Albert, R. (1999). Emergence of scaling in random networks. Science, 286(5439), 509–12. doi: 10.1126/science.286.5439.509.
Barkovich, A. (2019). Informational linguistics: Computer, internet, artificial intelligence and language. In 2019 International Conference on Artificial Intelligence in Information and Communication (ICAIIC) (pp. 008–13). IEEE.
Blair, I. V., Urland, G. R., & Ma, J. E. (2002). Using internet search engines to estimate word frequency. Behavior Research Methods, Instruments, and Computers, 34(2), 286–90. doi: 10.3758/bf03195456.
Bruder, I., Düsterhöft, A., Becker, M., Bedersdorfer, J., & Neumann, G. (2000). Getess: Constructing a linguistic search index for an internet search engine. In International Conference on Application of Natural Language to Information Systems (pp. 227–38). Springer.
Chen, Y., & He, C. (2011). Paid placement: Advertising and search on the internet. The Economic Journal, 121(556), F309–28. doi: 10.1111/j.1468-0297.2011.02466.x.
Chervany, N. L., & Dickson, G. W. (1974). An experimental evaluation of information overload in a production environment. Management Science, 20(10), 1335–44. doi: 10.1287/mnsc.20.10.1335.
Crystal, D. (2005). The scope of internet linguistics. In Proceedings of American Association for the Advancement of Science Conference; American Association for the Advancement of Science Conference, Washington, DC (pp. 17–21).
David, H. A., & Nagaraja, H. N. (2004). Order statistics. Hoboken, NJ: John Wiley & Sons.
Dinerstein, M., Einav, L., Levin, J., & Sundaresan, N. (2018). Consumer price search and platform design in internet Commerce. American Economic Review, 108(7), 1820–59. doi: 10.1257/aer.20171218.
Falkenberg, M., Lee, J.-H., Amano, S., Ogawa, K., Yano, K., Miyake, Y., … Kim, C. (2020). Identifying time dependence in network growth. Physical Review Research, 2(2), 023352. doi: 10.1103/physrevresearch.2.023352.
Gulla, A.J., Auran, P. G., & Magne Risvik, K. (2002). Linguistics in large-scale web search. In International Conference on Application of Natural Language to Information Systems (pp. 218–22). Springer.
Harari, M. B., Parola, H. R., Hartwell, C. J., & Riegelman, A. (2020). Literature searches in systematic reviews and meta-analyses: A review, evaluation, and recommendations. Journal of Vocational Behavior, 118, 103377. doi: 10.1016/j.jvb.2020.103377.
Hill, B. M. (1974). The rank-frequency form of Zipf’s law. Journal of the American Statistical Association, 69(348), 1017–26. doi: 10.2307/2286182.
Hundt, M., Nesselhauf, N., & Biewer, C. (2007). Corpus linguistics and the web. In Corpus Linguistics and the Web (pp. 1–5). Brill.
IBIS (2024). Search engines in the US. Available from: Https://Www.ibisworld.com/United-States/Market-Research-Reports/Search-Engines-Industry/
Ijiri, Y., & Simon, H. A. (1975). Some distributions associated with Bose-Einstein statistics. Proceedings of the National Academy of Sciences, 72(5), 1654–57. doi: 10.1073/pnas.72.5.1654.
Jolivet, G., & Turon, H. (2019). Consumer search costs and preferences on the internet. The Review of Economic Studies, 86(3), 1258–1300. doi: 10.1093/restud/rdy023.
Khan, M. T., Durrani, M., Ali, A., Inayat, I., Khalid, S., & Khan, K. H. (2016). Sentiment analysis and the complex natural language. Complex Adaptive Systems Modeling, 4, 1–19. doi: 10.1186/s40294-016-0016-9.
Khinchin, A. Y. (2013). Mathematical foundations of information theory. New York: Courier Corporation.
Khoo, C. S. G., & Johnkhan, S. B. (2018). Lexicon-based sentiment analysis: Comparative evaluation of six sentiment lexicons. Journal of Information Science, 44(4), 491–511. doi: 10.1177/0165551517703514.
Koto, F., & Adriani, M. (2015). A comparative study on Twitter sentiment analysis: Which features are Good?. In Natural Language Processing and Information Systems: 20th International Conference on Applications of Natural Language to Information Systems, NLDB 2015, Passau, Germany, June 17-19, 2015 (Vol. 20, pp. 453–57). Springer, Proceedings. doi: 10.1007/978-3-319-19581-0_46.
Krapivsky, P. L., Redner, S., & Leyvraz, F. (2000). Connectivity of growing random networks. Physical Review Letters, 85(21), 4629–4632. doi: 10.1103/physrevlett.85.4629.
Kunder, M.de (2024). The size of the world wide web. Available from: https://www.worldwidewebsize.com/
Lancaster, F. W. (1971). The cost-effectiveness analysis of information retrieval and dissemination systems. Journal of the American Society for Information Science, 22(1), 12–27. doi: 10.1002/asi.4630220104.
Levin, J. D. (2011). The economics of internet markets. Cambridge, MA: National Bureau of Economic Research.
Liu, B. (2010). Sentiment analysis and subjectivity. In Handbook of Natural Language Processing (2, pp. 627–666). Boca Raton, FL: CRC Press.
Mandelbrot, B. B. (1983). On the quadratic mapping z → Z2-μ for complex μ and z: The fractal structure of its m set, and scaling. Physica D: Nonlinear Phenomena, 7(1-3), 224–39. doi: 10.1016/0167-2789(83)90128-8.
Newman, M. E. J. (2005). Power laws, Pareto distributions and Zipf’s law. Contemporary Physics, 46(5), 323–51. doi: 10.1080/00107510500052444.
Peitz, M., & Reisinger, M. (2015). The economics of internet media. In Handbook of Media Economics (Vol. 1, pp. 445–530). Elsevier. doi: 10.1016/b978-0-444-62721-6.00010-x.
Shannon, C. E. (2001). A mathematical theory of communication. ACM SIGMOBILE Mobile Computing and Communications Review, 5(1), 3–55. doi: 10.1145/584091.584093.
Sharoff, S. (2006). Open-source corpora: Using the net to fish for linguistic data. International Journal of Corpus Linguistics, 11(4), 435–62. doi: 10.1075/ijcl.11.4.05sha.
Varian, H. R. (2016). The economics of internet search. In Handbook on the Economics of the Internet (pp. 385–94). Edward Elgar Publishing.
Zipf, G. K. (2016). Human behavior and the principle of least effort: An introduction to human ecology. Ravenio Books, Addison-Wesley Press.
About the authors
James Christopher Westland is Full Professor in the Department of Information and Decision Sciences at the University of Illinois – Chicago. He has a BA in Statistics and an MBA in Accounting from Indiana University and received his PhD in Computers and Information Systems from the University of Michigan. He has professional experience in the USA as a certified public accountant and as a consultant in technology law in the USA, Europe, Latin America and Asia. He is the Editor-in-Chief of Electronic Commerce Research (Springer) and has served on editorial boards of several other information technology journals including Management Science, ISR, ECRA, IJEC and others. He has served on the faculties at the University of Michigan, University of Southern California, Hong Kong University of Science and Technology, Tsinghua University, University of Science and Technology of China, Harbin Institute of Technology and other academic institutions. In 2012, He received High-Level Foreign Expert status in China under the 1000-Talents Plan and is currently Overseas Chair Professor at Beihang University.
Jian Mou is Professor in the School of Business, Pusan National University, South Korea. His research interests include big data analysis, social media, trust and risk issues in e-service and information management. Dr Mou’s research has been published in Technological Forecasting and Social Change, Journal of the Association for Information Systems, Information and Management, Internet Research, International Journal of Information Management, Electronic Commerce Research, Information Processing and Management, Computers in Human Behavior, Information Technology and People, Behaviour and Information Technology, International Journal of Human-Computer Interaction, Journal of Retailing and Consumer Services, Information Development, International Journal of Mobile Communications and The Electronic Library.