Social choice using moral metrics

Kenneth Halpern (Independent, Cambridge, Massachusetts, USA)

Asian Journal of Economics and Banking

ISSN: 2615-9821

Article publication date: 13 July 2021

Abstract

Purpose

This paper aims to develop a geometry of moral systems. Existing social choice mechanisms predominantly employ simple structures, such as rankings. A mathematical metric among moral systems allows us to represent complex sets of views in a multidimensional geometry. Such a metric can serve to diagnose structural issues, test existing mechanisms of social choice or engender new mechanisms. It also may be used to replace active social choice mechanisms with information-based passive ones, shifting the operational burden.

Design/methodology/approach

Under reasonable assumptions, moral systems correspond to computational black boxes, which can be represented by conditional probability distributions of responses to situations. In the presence of a probability distribution over situations and a metric among responses, codifying our intuition, we can derive a sensible metric among moral systems.

Findings

Within the developed framework, the author offers a set of well-behaved candidate metrics that may be employed in real applications. The author also proposes a variety of practical applications to social choice, both diagnostic and generative.

Originality/value

The proffered framework, derived metrics and proposed applications to social choice represent a new paradigm and offer potential improvements and alternatives to existing social choice mechanisms. They also can serve as the staging point for research in a number of directions.

Citation

Halpern, K. (2021), "Social choice using moral metrics", Asian Journal of Economics and Banking, Vol. ahead-of-print No. ahead-of-print. https://doi.org/10.1108/AJEB-10-2020-0080

Publisher: Emerald Publishing Limited

Copyright © 2021, Kenneth Halpern

License

Published in Asian Journal of Economics and Banking. Published by Emerald Publishing Limited. This article is published under the Creative Commons Attribution (CC BY 4.0) licence. Anyone may reproduce, distribute, translate and create derivative works of this article (for both commercial and non-commercial purposes), subject to full attribution to the original publication and authors. The full terms of this licence may be seen at http://creativecommons.org/licences/by/4.0/legalcode.


1. Introduction

An ideal social choice mechanism is both fair and perceived as fair. Arrow famously demonstrated that it is impossible to accommodate even three basic tenets of fairness in ranked preference systems (Arrow, 1950), and similar results hold for other systems. Even if anomalies are unavoidable, we can seek to reduce unfairness by minimizing their prevalence and severity. Of equal importance, we can seek mechanisms that reduce the perception of unfairness.

We offer a tool to address both parts of the equation. By inferring the moral systems of individuals and constructing a suitable distance function between them, it is possible to construct a moral geometry with attendant notions of proximity, neighborhoods and clustering.

A metric is a multidimensional structure and far more versatile than the linear orders often employed for social choice. It opens the door to a variety of new approaches, but also can beget new linear orders for use with existing social choice mechanisms. A metric can embody the relationship between entire sets of views, and the very use of a precisely quantified moral geometry may help foster a sense of inclusion and fairness.

We begin by reifying the notion of a “moral system” (MS), equating it with a computational black box that issues responses when presented with situations. Under reasonable inference assumptions, such a black box can be represented by an asymptotic conditional probability distribution (CPD). We subsequently also consider inferred or estimated CPDs as representatives.

After formally defining our assumptions, we turn to the question of metric construction. For a metric among MSs to be useful, it must reflect our intuition in some fashion. Direct assertion of such a metric is untenable, and it must inherit meaning from simpler structures through which we plausibly can codify our intuition.

The natural semantic objects are situations and responses. In a given problem, we understand these and can characterize them in a sensible fashion. We argue that the appropriate a priori structures are a probability distribution (PD) over situations and a metric among responses. It is from our intuition for these that the metric among MSs must derive its behavior and meaning.

After discussing the specification of our a priori structures, that of the MSs themselves, and a few related issues, we propose several applications of this framework to social choice. We next introduce a number of related concepts, corresponding to notions of hypocrisy, judgment, worldview and moral trajectory, and consider some additional social choice applications involving these.

We also present several concrete, well-behaved derived metrics among CPDs and conclude with a discussion of the use of Euclidean embeddings for selection and specification of the a priori metric among responses.

We will refer to both metrics and pseudometrics as “metrics,” only drawing the distinction for emphasis or when necessary. Recall that a metric is a nonnegative function d: X × X → ℝ such that (1) d(x, y) = d(y, x), (2) d(x, y) = 0 iff x = y and (3) d(x, z) ≤ d(x, y) + d(y, z). A pseudometric relaxes this to allow d(x, y) = 0 when x ≠ y. For most purposes, the distinction is immaterial. Although metric-derivation procedures almost invariably produce pseudometrics among CPDs, these usually restrict to metrics among MSs.
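As an illustrative aside, the axioms above can be checked mechanically on a finite set. The following Python sketch classifies a distance matrix as a metric, a pseudometric or neither. The function name `classify_distance`, the list-of-lists representation and the tolerance are assumptions made for this example and do not come from the paper.

```python
from itertools import product

def classify_distance(d, tol=1e-12):
    """Classify a square distance matrix d (list of lists) over a finite set.

    Returns "metric", "pseudometric" or "neither".
    """
    n = len(d)
    idx = range(n)
    # Nonnegativity and symmetry.
    for i, j in product(idx, idx):
        if d[i][j] < -tol or abs(d[i][j] - d[j][i]) > tol:
            return "neither"
    # Every point must be at distance zero from itself.
    if any(abs(d[i][i]) > tol for i in idx):
        return "neither"
    # Triangle inequality over all ordered triples.
    for i, j, k in product(idx, idx, idx):
        if d[i][k] > d[i][j] + d[j][k] + tol:
            return "neither"
    # Distinct points at distance zero are allowed only for a pseudometric.
    if any(d[i][j] <= tol for i, j in product(idx, idx) if i != j):
        return "pseudometric"
    return "metric"

line_metric = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
merged_points = [[0, 0, 1], [0, 0, 1], [1, 1, 0]]
broken_triangle = [[0, 1, 5], [1, 0, 1], [5, 1, 0]]
```

In `merged_points`, the first two elements are indistinguishable, which is exactly the relaxation a pseudometric permits.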

We will not delve into questions of empirical measurement or experimental construction. Observation most likely would entail case histories, surveys or interviews, carefully curated and with due regard for unreliability.

Note that our use of the term “metric” is topological, and we speak of “geometry” in reference to distances, neighborhoods and clusters. This should not be confused with Riemannian metrics or notions of curvature.

2. Framework

2.1 Central premise

A “moral system” (MS) embodies how an individual, institution or group responds to situations. The central dogma of our approach is that MSs correspond to computational black boxes, which, under reasonable inference assumptions, have CPDs as mathematical proxies. Through these means, the otherwise ill-defined problem of constructing a metric among MSs becomes mathematically well defined.

2.2 Situations and responses

A “situation” is a stimulus provided to a subject, and a “response” is a reaction of a subject to a situation. When working with surveys, questions would be situations, and answers would be responses. When working with judicial sentencing, cases could be situations, and sentences could be responses.

We will denote by S the set of all possible situations, and by R a set containing all possible responses. An MS generates a response in R to any given situation in S. Situations and responses have meaning and are the primary sources of semantics in a problem.

S is not a theoretical universe of situations. It is finite and chosen to capture the behavioral aspects we care about. There is no concept of a basis that spans behaviors (our spaces are not linear), but S sometimes can serve in a similar capacity.

The set of “accessible responses” RA consists of every response that can arise from the MS under consideration with nonzero probability for some s ∈ S. It is not known to us a priori, though we can attempt to infer it.

We require that RA ⊆ R is finite, though R need not be. It is not unreasonable to assume a priori knowledge of R, and that RA ⊆ R, without knowing which subset it is. Consequently, we can expand R as needed to admit simple parametrization or other convenient properties.

2.3 Moral systems as black boxes

The only way to probe an MS is through its responses to situations. From our perspective, it is an opaque machine for determining responses to situations.

We do not assume that an MS is deterministic. A person may not always respond the same way to a given situation, perhaps reflecting a true stochastic element, imperfect information or stateful evolution. Because the decision-making process is hidden from us, we cannot attribute apparent randomness to any specific source.

We will refer to a “sample” as a single observed response by a given MS to a specific situation. We assume that MSs act independently of one another, each MS only responds and evolves based on the sequence of situations it encounters, and we are not privy to its initial state. There is no notion of synchronous sampling, and we cannot meaningfully compare the responses of two MSs to a given situation in a single trial. We only can compare statistical behaviors.

We have no notion of time, computational complexity or computability. As machines, MSs are assumed to always halt and to operate in constant time. We only consider discrete sequences of samples.

2.4 Inference assumption

Without inference assumptions, we can say nothing useful, even in the presence of unlimited data. We adopt a form of ergodicity.

Given an MS and some PD P(S) s.t. P(s) > 0 ∀s ∈ S, we assume that (1) regardless of the unknown initial state, the histogram of any sequence of samples with situations drawn from P(S) will asymptotically converge to a unique P(R|S) for the MS, and (2) P(R|S) encompasses everything relevant to us about the MS's behavior. We will refer to it as the “true CPD” of the MS.

This says nothing about the rate of convergence, and we implicitly also assume (3) we have (or can produce) adequate sample data for satisfactory inference in the given application. When studying moral trajectories in Section 4.4, we will weaken these assumptions to allow adiabatic variation of the CPD.

Under our inference assumption, P(R|S) is the natural mathematical proxy for an MS. We will denote by CS,R the space of all such CPDs, infinite even for finite S and R. We will denote by X the specific set of MSs under study. This need not be fixed a priori, and may expand. For example, new individuals could be surveyed.

Note that a CPD obtained from finite sample data is not the true CPD of an MS, and we will term it an “inferred CPD.” For reasons to be discussed, we often confine ourselves to a model subspace BS,R ⊆ CS,R. Rather than the true CPD or an inferred CPD, we estimate an element of BS,R. An estimated CPD obtained with unlimited data will be termed the “asymptotic estimate” for the MS, while any finite data estimate will be termed an “inferred estimate,” also in BS,R.
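To make the notion of an inferred CPD concrete, here is a minimal Python sketch that builds one from a finite list of samples by relative frequency. The representation of samples as (situation, response) pairs, the function name `infer_cpd` and the survey-style labels are assumptions for illustration only. Situations that never appear in the data simply receive no row.

```python
from collections import Counter, defaultdict

def infer_cpd(samples):
    """Estimate P(r|s) from (situation, response) pairs by relative frequency."""
    counts = defaultdict(Counter)
    for s, r in samples:
        counts[s][r] += 1
    cpd = {}
    for s, ctr in counts.items():
        total = sum(ctr.values())
        cpd[s] = {r: c / total for r, c in ctr.items()}
    return cpd

# Hypothetical samples: one MS answering two survey questions.
samples = [("q1", "yes"), ("q1", "yes"), ("q1", "no"), ("q2", "no")]
cpd = infer_cpd(samples)
cpd_reversed = infer_cpd(list(reversed(samples)))
```

Because only counts are used, the result does not depend on the order in which samples are processed.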

2.5 A priori structures

It is on X that we seek to derive a metric. We do so by first building a metric DM on CS,R, then restricting it to the model subspace BS,R ⊆ CS,R and finally pulling it back along the indexing map X → BS,R, which associates each MS with its inferred estimate. This is just a fancy way of saying the metric on CS,R induces one on X in the obvious manner.

For convenience, we will use DM interchangeably for the derived metric on X, BS,R, or CS,R. For example, DM(x, x′) on X implicitly means DM(Px, Px′) on CS,R or BS,R, where Px denotes whichever CPD we associate with x. Because X is discrete, we will sometimes write Dij ≡ DM(xi, xj).

Note that we are not simply trying to find some metric on CS,R. That could be accomplished using the Fisher–Rao metric (Rao, 1945) or a variety of other approaches, but the resulting geometry would be uninformative. The choice of DM determines the utility of our framework, and it must embody the behavioral aspects we care about.

Directly positing a metric among MSs or CPDs is very difficult. These are complicated objects, and we generally have no intuition for distances between them. We need something simpler and more intuitive. R and S are endowed with semantics, and it is to these we must turn. The sets themselves do not suffice, and we require some sort of structures on them.

Our approach is to require a PD P(S) over S and a metric (or pseudometric) dR on R. It is from these structures that DM must derive its behavior. We will now motivate these choices.

Note that we are not simply replacing the problem of defining a metric on CS,R with a comparably difficult one on a different space. R is much smaller than CS,R and is more likely to admit a simple, intuitive metric. We are building DM from tractable components.

2.6 P(S)

Without a PD over situations, we must confine ourselves to per-situation analysis, and this is inadequate for our purposes. We require some form of aggregation over S, and summation is the natural choice. P(S) provides the necessary measure. It may represent an estimated likelihood of occurrence, an importance weight or a bit of both. For certain purposes, the interpretation of results is easiest when P(S) is a likelihood.

Specifying P(S) usually is straightforward. For example, we could empirically measure observed frequencies of occurrence. We will assume that P(S) is strictly positive on all of S (it is easy to introduce nominal support if not). Note that P(S) may suppress the probability Σ_{s∈S} P(s)P(r|s) of an accessible response r to near zero. Any derived quantity (such as DM) effectively ignores such responses.
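The suppression effect can be seen in a small numerical sketch. The Python below computes the overall probability of each response, i.e. the sum over situations of P(s)P(r|s). The dictionaries, labels and numbers are invented for illustration.

```python
def response_marginal(p_s, cpd):
    """Return P(r) = sum over s of P(s) * P(r|s)."""
    marginal = {}
    for s, weight in p_s.items():
        for r, p in cpd[s].items():
            marginal[r] = marginal.get(r, 0.0) + weight * p
    return marginal

# Hypothetical example: response "b" is accessible, but only in a rare situation.
p_s = {"common": 0.999, "rare": 0.001}
cpd = {"common": {"a": 1.0}, "rare": {"a": 0.5, "b": 0.5}}
marginal = response_marginal(p_s, cpd)
```

Here "b" arises half the time in the rare situation, yet its overall probability is only 0.0005, so any quantity weighted by P(S) barely registers it.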

2.7 dR

We require an a priori choice of metric (or pseudometric) dR on R. There are reasons this is a natural structure to employ.

P(S) only tells us how to weigh each situation when computing distances, but offers no conduit to comparison of MSs. Any meaningful distance between MSs must derive from some comparator on R. The triangle inequality is very difficult to prove from scratch, but sometimes can be inherited. We are more likely to derive a metric on CS,R from a metric on R than from some other structure.

Another reason concerns the type of information present. Unlike MSs, responses often have independent, objective meaning. Distances between them are something people are more likely to agree on.

Section 6 offers a means of parlaying intuition for distances on R into an actual metric. As a rule, dR is less mutable than P(S). We may change P(S) to reflect new priorities or updated frequency statistics, but dR rarely would be modified once chosen (except perhaps to test robustness to such changes).

Note that dR is the core of DM, and its source of semantics. It must be chosen carefully and reflect our intuition. Although responses can be codified as finite strings, “edit distances” such as the Hamming distance (Hamming, 1950) or Levenshtein distance (Levenshtein, 1966) are devoid of semantics and not suitable for our purpose.

2.8 Specification of moral systems

MSs in X must be specified in some manner, either up front or as they arise. Often, they all naturally sit between S and R, but this need not be the case. Though S, R and X each carry semantics, their generative mechanisms may not be the same. Normalization may be required.

2.8.1 Normalization

If an MS has natural input space I and output space O, we include in its definition two maps: α : S → I tells us how to encode S for the MS, and β : O → R tells us how to decode O. They need not be injective or surjective. If no normalization is needed, I = S, O = R, α = idS and β = idR.

Requiring α and β for each MS is not pointless or pedantic. It would be impossible to compare MSs without a common meaning for inputs and outputs. α and β plug an MS into the semantics of our framework and attach this common meaning to I and O.

P(S) induces an effective PD P(i) ≡ Σ_{x∈α−1(i)} P(x) on each I, and we can compare outputs of different MSs via dR(β1(o1), β2(o2)). However, dR is not a metric in this capacity, because o1 and o2 are elements of distinct sets (we have pulled back dR : R × R → ℝ along β1 × β2 : O1 × O2 → R × R to d* : O1 × O2 → ℝ, which is a pseudometric only if O1 = O2 and β1 = β2, and a metric if β1 also is injective). Any CPD P(O|I) for an unnormalized MS induces an effective CPD P̂(R|S) via P̂(r|s) ≡ Σ_{o∈β−1(r)} P(o|α(s)).
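A minimal Python sketch of the induced CPD may help. It pushes a native CPD P(o|i) through the encoding α and decoding β, summing the probabilities of all native outputs that decode to the same response. The dictionary encodings of α and β, the function name `effective_cpd` and the sentencing labels are hypothetical.

```python
def effective_cpd(cpd_oi, alpha, beta, situations):
    """Push a native CPD P(o|i) through alpha: S -> I and beta: O -> R."""
    result = {}
    for s in situations:
        row = {}
        # Look up the native input for s, then merge outputs that decode alike.
        for o, p in cpd_oi[alpha[s]].items():
            r = beta[o]
            row[r] = row.get(r, 0.0) + p
        result[s] = row
    return result

# Hypothetical MS: alpha is not injective, beta is not injective.
alpha = {"s1": "i1", "s2": "i1"}
beta = {"fine": "lenient", "warning": "lenient", "jail": "harsh"}
cpd_oi = {"i1": {"fine": 0.25, "warning": 0.25, "jail": 0.5}}
cpd_rs = effective_cpd(cpd_oi, alpha, beta, ["s1", "s2"])
```

Since both situations encode to the same native input here, the MS cannot distinguish them and their effective rows coincide.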

When working with provided data, unnormalized samples may be unavoidable. The use of O rather than R is not an issue, and we just apply β(o). However, I may pose a problem. If α is not injective we may be unable to determine which s ∈ α−1(i) to adopt, and if α is not surjective, there may be no corresponding s at all. In the latter case, we could discard the sample, and in the former, we could randomly draw from a uniform distribution over α−1(i). However, this constitutes an additional assumption.

2.8.2 Modes of access

An MS is associated with something real: a person, an institution, a decision system. We require some form of access to it, a way to acquire knowledge of its behavior. We will consider three such modes: (1) full knowledge of the true CPD, (2) a fixed set of sample data and (3) the ability to actively acquire sample data.

Rarely do we have direct access to the true CPD for an MS. It is large and can be difficult to store or work with. Instead, we generally work with samples. We will not distinguish between access modes (2) and (3). Though (3) allows efficient sampling strategies, we remain limited to a relatively small data set.

We compute a distance between two MSs by first inferring the relevant CPDs, then plugging these into DM. Direct inference of the distance would be preferable from a statistical standpoint, avoiding the undesirable inference of large intermediates (as advised against in Vapnik, 1999). However, devising an algorithm for direct inference of distances is impractical in most cases. The use of estimation is a good compromise, reducing the size of intermediate objects while remaining conceptually simple.

2.9 Estimation

CS,R is very large, and attempting to infer the true CPD is inadvisable. Inference with limited sample data would lead to noisy results and huge hidden correlations. We typically model CPDs using a parametrized subspace BS,RCS,R, or perhaps a discrete set of representatives. Standard dimensional reduction techniques (such as regression) can be employed to estimate a point in BS,R from sample data. Estimation of a few model parameters is more tenable than inference of an entire CPD.

Practical considerations must govern the choices of BS,R and estimation procedure. We take both as part of a problem's a priori structure. Any sensible procedure will be agnostic to the order in which sample data are processed.

There are reasons other than sound inference to employ estimation. The elements of BS,R could represent idealized or canonical MSs, or we could use BS,R to isolate relevant behavioral factors.

Note that there are two types of approximation at play. The estimation procedure confines consideration to BS,R, but we also estimate with limited data. We obtain only an inferred estimate in BS,R, approximating the asymptotic estimate.

Statistical learning theory has much to say about the bounds of plausible inference (see Mitchell, 1997; Vapnik, 1999; Mohri, 2018), and we will not digress into such matters here.

2.10 Aggregation

It sometimes is useful to combine individual MSs into larger ones, either because the aggregates are of direct interest or to improve our statistics. There are two ways to accomplish this.

We could treat a set of MSs as a single MS and collate the underlying samples into a single sequence. For example, surveys from everyone in a town could contribute to a single town-wide aggregate. This is the cleanest approach, but rather inflexible. Even a simple weighting of the underlying MSs is difficult to efficiently implement.

Another approach is to aggregate the CPDs representing underlying MSs, and we can do so in many ways. This type of aggregation is more expensive, because inference/estimation must be performed on each underlying MS. However, it has advantages as well. Once those underlying calculations have been performed, there is little cost to adopting or modifying an aggregation scheme. For a model, we may directly aggregate parameters rather than CPDs.
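As one possible instance of CPD-level aggregation, the sketch below forms a weighted mixture of CPDs. The paper does not prescribe this particular scheme. The mixture rule, the function name `mix_cpds` and the assumption that all CPDs cover the same situations are choices made here for illustration.

```python
def mix_cpds(cpds, weights):
    """Weighted mixture of CPDs that all cover the same situations."""
    total = float(sum(weights))
    mixed = {}
    for s in cpds[0]:
        row = {}
        for cpd, w in zip(cpds, weights):
            for r, p in cpd[s].items():
                row[r] = row.get(r, 0.0) + (w / total) * p
        mixed[s] = row
    return mixed

# Two hypothetical individuals with opposite, deterministic answers.
a = {"q": {"yes": 1.0}}
b = {"q": {"no": 1.0}}
town = mix_cpds([a, b], weights=[3, 1])
```

Changing `weights` requires no new inference on the underlying MSs, which reflects the low cost of modifying an aggregation scheme once the individual CPDs are available.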

2.11 Units and scaling

It may be tempting to think of distances as taking units, much as Euclidean distances do. However, this need not be the case for a general metric.

For units to make sense, a distance of fixed numeric value must have the same meaning everywhere. A mile in Michigan is the same as a mile in Florida. This amounts to translation invariance, which derives from a linear structure. CS,R is not a vector space, and R need not be. Translation invariance on such spaces must be inherited, and this is accomplished through isometric embedding. If the metric on CS,R or R has a Euclidean embedding, we may define units on that space. These units only make sense in the specific embedding coordinates (or those related by Euclidean isometries), which may not be intuitive or natural for us.

We also may wish to consider the relationship between DM and dR. If dR is a metric, so is c·dR for any c > 0. Ratios of distances will be unchanged (though if dR is not translation invariant, a given numeric ratio value will not have the same meaning everywhere).

DM depends on dR via some derivation procedure, and we can ask whether it scales with dR. To do so, DM must be homogeneous to some fixed degree in dR. This need not be the case, but often is in practice. The DM candidates presented in Section 5 all are homogeneous in dR to degree 1. In that two-step procedure, the metric D among PDs over R is homogeneous in dR, and the metric DM is homogeneous in D. Many common operations such as integration preserve homogeneity.

When dR and DM both take units and DM is homogeneous in dR of degree n, units of [L] for dR induce units of [L]^n for DM. If both take units but DM is not homogeneous in dR, their units are unrelated. We must be cautious interpreting results in that case. The use of unrelated units can be quite counterintuitive, and a change of scale for dR could affect comparisons, ratios or induced linear orders on DM.

Our framework is of greatest utility when dR and DM both admit units and DM is homogeneous in dR, and we will assume this going forward. This constrains the admissible methods of deriving DM from P(S) and dR.

In Section 6, we will discuss the use of Euclidean embeddings for visualization and metric construction. The present requirement that dR and DM take units is different. All we need are isometric embeddings in some metric vector spaces. We do not require Euclidean embeddings or low-dimensional ones, though these may yield more intuitive coordinates. When an exact isometric embedding of DM is not possible, an approximate one may suffice. In that case, the approximate DM is translation invariant and should be used for calculation. We speak here only of embedding for units, not visualization. The latter is just a nicety and does not affect calculation with DM.

Note that we only require an embedding of DM on X, not of DM on all of CS,R or BS,R. Nonetheless, an embedding of BS,R is preferable when possible. A single element added to X could drastically alter its embedding, but would not affect that of BS,R.

2.12 Proximity, neighborhoods and clusters

DM endows X with meaningful notions of proximity and neighborhood. An ϵ-ball (or ϵ-neighborhood) of x ∈ X is {y ∈ X|DM(x, y) < ϵ}, and these form the basis for a nontrivial topology on X.

The notion of neighborhood brings a wide array of mathematical tools. We have a geometry of MSs, and it sometimes may be visualized using an approximate low-dimensional Euclidean embedding (Section 6).

We also may identify clusters, sets of MSs whose intra-cluster distances are small relative to inter-cluster ones. For example, we could use a cutoff ratio r ∈ (0, 1) and say that {x1, …, xn} form a cluster iff DM(xi, xj)/DM(xi, y) < r for all xi, xj in the putative cluster and all y outside it. r is dimensionless, and clusters embody a notion of nearness independent of units. Not all metric spaces exhibit useful (or any) clustering.
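The ratio criterion can be sketched in a few lines of Python. The function name `is_cluster`, the distance-matrix representation and the toy points on a line are assumptions for this example. The test is written as D[i][j] < r·D[i][y] to avoid dividing by zero, so a putative cluster is rejected if some outside point lies at distance zero from a member.

```python
def is_cluster(D, members, r):
    """Check D[i][j] / D[i][y] < r for all members i, j and all outside y."""
    members = set(members)
    outside = [y for y in range(len(D)) if y not in members]
    for i in members:
        for j in members:
            for y in outside:
                # Multiplied form of the ratio test; fails when D[i][y] is zero.
                if D[i][j] >= r * D[i][y]:
                    return False
    return True

# Hypothetical MSs placed on a line, with distance equal to absolute difference.
pos = [0, 1, 10, 11]
D = [[abs(a - b) for b in pos] for a in pos]
```

With these points, indices 0 and 1 form a cluster at r = 0.2, as do indices 2 and 3, while indices 1 and 2 do not.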

The presence of a cluster of MSs does not imply its members form a group in any non-statistical sense. They may be unaware of one another, unaffiliated or comprise many different social groups. In fact, statistical clusters may prove entirely incongruous with preconceived notions of political or ideological alignment.

Because X is finite, distances on it are bounded, and we can derive a number of natural length scales. Any of these may be chosen as the unit, or explicitly serve as the divisor in a dimensionless ratio. Examples are the mean and maximum distances between distinct points: Davg ≡ (1/(|X|(|X|−1))) Σ_{i∈X} Σ_{j∈X} Dij and Dmax ≡ max_{i,j∈X} Dij. Note that CS,R and BS,R need not be compact, and we generally cannot do something analogous on them.
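For a concrete sense of these length scales, the following sketch computes the mean and maximum distance between distinct points of a finite set. The function name `length_scales` and the toy distance matrix are assumptions for illustration.

```python
def length_scales(D):
    """Return (mean, max) of the distances between distinct points."""
    n = len(D)
    off_diag = [D[i][j] for i in range(n) for j in range(n) if i != j]
    d_avg = sum(off_diag) / (n * (n - 1))
    d_max = max(off_diag)
    return d_avg, d_max

# Hypothetical MSs placed on a line, with distance equal to absolute difference.
pos = [0, 1, 10, 11]
D = [[abs(a - b) for b in pos] for a in pos]
d_avg, d_max = length_scales(D)
```

Dividing any distance by `d_avg` or `d_max` yields a dimensionless quantity that is unaffected by rescaling the underlying metric.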

2.13 Participants and indexing

Most systems we care about have a notion of “participants”: individuals, judges, institutions, etc. We will denote the set of these Y, and it may grow if X does.

In the simplest case, each y ∈ Y is associated with a single MS via a labeling map Y → X. However, it sometimes makes sense to assign multiple labeled MSs to each participant. We will denote the labeling set J and the labeling map g: Y × J → X. We always assume g is bijective.

For example, suppose we have two surveys per person, the first asking what they believe and the second asking what they think other people believe. Then, J = {self, other}, and g would identify the “self” and “other” surveys for every person. This information must be available to us, perhaps as part of the survey label. The map g (and any procedure for adapting it if J or X expand) is part of the specification of a problem.

The notions of participants and indexing will find use when we define hypocrisies and related concepts in Section 4.

3. Social choice

Deferring the question of how to derive DM, let us consider how our framework could apply to questions of social choice. In this and Section 4, we will not concern ourselves with which CPDs are used to represent MSs. The discussion applies equally well to true CPDs, inferred CPDs, asymptotic estimates or inferred estimates.

There are two primary modes of application to social choice: (1) DM serves as a diagnostic tool for existing social choice mechanisms, ascertaining whether broad acceptance is attainable, the degree to which compromise is possible, which groups likely would be alienated and the anticipated extent of disaffection; and (2) DM spawns new social choice mechanisms, for use in lieu of or conjunction with existing ones. We offer a few examples below, and there are myriad others.

Our examples are illustrative but simplistic, and any real application must address nontrivial questions of measurement and feasibility. Let us suppose there is a representative set of major political issues, and that S, R, P(S) and dR have been chosen to sensibly model individuals' views on these issues, perhaps through surveys or interviews.

An MS embodies this behavioral information in some fashion, as does the CPD that represents it. We will assume the true and perceived social choice mechanisms are the same and fully visible to all participants. Without yet designating what constitutes a social choice in this context, we will use the terms “approval” to refer to an individual's degree of happiness with a particular outcome and “acceptance” to refer to an individual's perception of the fairness of the mechanism by which it was reached.

3.1 For social choice diagnostics

A geometry of MSs can offer a variety of insights. The clustering or diffuseness of points can signal whether broad approval is possible through any social choice mechanism.

If MSs are arranged in two distant clusters, any outcome either moderately displeases everyone or strongly displeases one cluster, while a diffuse cloud of MSs admits a greater range of compromises. Though the total disapproval may be similar in both cases, the degree of acceptance may differ. Almost any social choice mechanism risks disaffection in the presence of strong clusters, while almost any sensible mechanism is likely to find acceptance in the diffuse case.

This example may seem trite, but without a metric, we could at best speak in terms of a single issue. A well-constructed DM allows us to incorporate all issues into a unified geometry.

We implicitly treated social choice outcomes as MSs (or perhaps points in BS,R). This sometimes makes sense, but often does not. Let us consider an example of each.

Suppose we have a set V of candidates for office. These have MSs, and we will just treat them like a subset of voters V ⊆ X. We have a small set of points in BS,R representing candidates, and a much larger set representing voters. DM encapsulates all relevant issues, not just one. If certain issues are expected to be paramount in the election, P(S) could be adjusted so that DM reflects that emphasis.

A quantity like Dc ≡ max_{i,j∈V} Dij measures the dispersion of candidates, and we could calculate their diversity relative to voters via Dc/Davg. Candidates tightly clustered relative to voters do not offer much choice, and the election would feel pointless regardless of the voting mechanism. The absence of such clustering does not guarantee acceptance. Candidates still must be suitably distributed relative to the voting population.

Let f(r) denote the fraction of voters within radius r of any candidate, with inverse r(f) denoting the minimum radius r at which a fraction f of voters would be within range of some candidate. These could serve as measures of available choice, or to furnish threshold criteria. For example, we could demand that 80% of voters have a candidate within 0.2Davg of them (r(0.8) < 0.2Davg or f(0.2Davg) > 0.8), and no more than 5% of voters must pick a candidate 0.5Davg away (r(0.95) < 0.5Davg or f(0.5Davg) > 0.95). Note that such constraints only address the variety of candidates. Other facets, such as the social choice mechanism itself or aforementioned voter clustering, may play a major role in acceptance.
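The coverage measures f(r) and r(f) can be sketched as follows. The function names, the toy positions and the convention that "within radius" means less than or equal to the radius are assumptions made for this example.

```python
import math

def nearest_candidate_distances(D, voters, candidates):
    """Distance from each voter to the closest candidate."""
    return [min(D[v][c] for c in candidates) for v in voters]

def coverage_fraction(nearest, radius):
    """f(r): fraction of voters with some candidate within the given radius."""
    return sum(d <= radius for d in nearest) / len(nearest)

def coverage_radius(nearest, fraction):
    """r(f): smallest radius that covers at least the given fraction of voters."""
    ordered = sorted(nearest)
    k = max(1, math.ceil(fraction * len(ordered)))
    return ordered[k - 1]

# Hypothetical layout on a line: four voters, then two candidates.
pos = [0, 2, 5, 12, 1, 4]
D = [[abs(a - b) for b in pos] for a in pos]
nearest = nearest_candidate_distances(D, voters=[0, 1, 2, 3], candidates=[4, 5])
```

In this layout, three of the four voters have a candidate within distance 1, while covering everyone requires a radius of 8. A threshold criterion of the kind described above would compare such values against multiples of Davg.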

Let us now consider social choice involving a single issue, perhaps via referendum or legislation. Let O be the set of possible outcomes, and suppose that any element of BS,R favors a single outcome as reflected in some known f : BS,R → O. This could be a complicated function, or as simple as argmax_{o∈O} P(o|s) if O ⊆ R and some s ∈ S directly probes that issue.

Any well-behaved f partitions BS,R into subspaces, each representing adherents to a particular outcome. Let lo,i ≡ min_{P∈f−1(o)} DM(P, Pi) denote the geometric distance from a voter's MS (embodied in CPD Pi) to the surface representing outcome o. A quantity like the mean distance of voters from a given surface (lo ≡ (Σ_{i∈X} lo,i)/|X|) could furnish a quality measure for any given outcome o. This in turn could be used to rate the actual performance of various social choice mechanisms.
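A rough sketch of this quality measure follows, under a simplifying assumption that is not in the paper: the model subspace is replaced by a finite set of representative CPDs, each labeled with the outcome it favors, so that the minimum over a surface becomes a minimum over the representatives carrying that label. The function name and the numbers are hypothetical.

```python
def mean_distance_to_outcomes(dist, outcome_of):
    """Mean over voters of the distance to the nearest representative per outcome.

    dist[i][k] is the distance from voter i to representative k.
    outcome_of[k] is the outcome favored by representative k.
    """
    means = {}
    for o in set(outcome_of):
        reps = [k for k, label in enumerate(outcome_of) if label == o]
        per_voter = [min(row[k] for k in reps) for row in dist]
        means[o] = sum(per_voter) / len(per_voter)
    return means

# Hypothetical distances from two voters to three representatives.
dist = [[0.0, 2.0, 4.0],
        [1.0, 1.0, 3.0]]
outcome_of = ["yes", "yes", "no"]
quality = mean_distance_to_outcomes(dist, outcome_of)
```

A smaller value indicates an outcome whose adherents sit closer, on average, to the voters.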

3.2 As a mechanism for social choice

We also may use DM to build novel social choice mechanisms. Each diagnostic example above has a corresponding metric-based choice mechanism. In fact, we could use the departure of existing social choice mechanisms from these as a diagnostic tool in itself.

As before, choices involving candidates or schools of thought have outcomes represented by points in BS,R. We only can compute quantities such as centroids in the presence of a Euclidean embedding of BS,R (rather than just X), and we will not assume one.

One approach would be to define a utility function u(i ∈ V) ≡ Σ_{j∈X} f(Dij) representing displeasure, where f is some function that maps distance to displeasure (e.g. f(d) = d^2), and select the outcome that minimizes it via argmin_{i∈V} u(i).
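This selection rule is simple to sketch. The Python below picks the candidate with the least total displeasure for a given f. The function name `choose_candidate`, the toy layout and the capped alternative for f are assumptions for illustration.

```python
def choose_candidate(D, candidates, voters, f=lambda d: d ** 2):
    """Return the candidate i minimizing u(i) = sum over voters j of f(D[i][j])."""
    def displeasure(i):
        return sum(f(D[i][j]) for j in voters)
    return min(candidates, key=displeasure)

# Hypothetical layout on a line: four voters (indices 0-3), two candidates (4, 5).
pos = [0, 2, 5, 12, 1, 4]
D = [[abs(a - b) for b in pos] for a in pos]
voters, candidates = [0, 1, 2, 3], [4, 5]

winner_quadratic = choose_candidate(D, candidates, voters)
winner_capped = choose_candidate(D, candidates, voters, f=lambda d: min(d, 3))
```

With quadratic displeasure the distant voter dominates and candidate 5 is selected. Capping displeasure at 3 removes that influence and candidate 4 is selected instead, which shows how strongly the outcome can depend on the choice of f.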

We may want to impose constraints and could implement these in several ways. Hard constraints, such as Dij < t (i ∈ V, j ∈ X) for some distance t and fraction r of voters, can prevent undesirable scenarios, but risk excluding all outcomes. An alternative is to adjust u(i) via a penalty term. These are just two examples, and most of the playbook of general optimization theory can be brought to bear.

Returning to our single-issue example, the corresponding social choice mechanism would pick \(\arg\min_{o \in O} l_o\), the surface that minimizes the mean distance to voters.
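The single-issue mechanism can be sketched under strong simplifying assumptions. Each MS is reduced to one number (the probability of favoring "yes"), the region favoring each outcome is approximated by a finite sample of such numbers, and the absolute difference stands in for the metric. None of these choices comes from the framework itself, and all names are hypothetical.

```python
def distance_to_outcome(voter, members, metric):
    """l_{o,i}: smallest distance from a voter to any sampled CPD favoring o."""
    return min(metric(p, voter) for p in members)


def mean_distance(voters, members, metric):
    """l_o: mean over voters of l_{o,i}."""
    return sum(distance_to_outcome(v, members, metric) for v in voters) / len(voters)


def choose_outcome(voters, regions, metric):
    """Return the outcome o that minimizes l_o."""
    return min(regions, key=lambda o: mean_distance(voters, regions[o], metric))


def abs_gap(a, b):
    """Toy stand-in for the metric on one-number MSs."""
    return abs(a - b)


# Finite samples standing in for the regions f^{-1}(o).
regions = {
    "yes": [0.6, 0.7, 0.8, 0.9, 1.0],
    "no": [0.0, 0.1, 0.2, 0.3, 0.4],
}
voters = [0.9, 0.7, 0.3]
choice = choose_outcome(voters, regions, abs_gap)
```

A finite sample can only overestimate the true minimum distance, so the quality of the approximation depends on how densely each region is sampled.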

In principle, it may be possible to entirely replace voting with a metric-based social choice mechanism. The MSs encompass full sets of views on major issues. Once we know them (with possible adiabatic adjustment as needed), each election differs only in the choice of candidates and the relative prominence of issues. We once again could account for the latter via changes to P(S), if accomplished in a manner that defies objection.

Given a snapshot of the voters' and candidates' MSs, the selection process then becomes automatic. We have ignored obvious practical concerns (such as how to obtain the MSs and potential gaming of the system), and this approach would prove utterly impractical in real elections. However, it may have other uses. Comparison of derived outcomes with actual election results serves as an additional diagnostic tool and can help identify whether an existing social choice mechanism is fair or representative.

4. Related concepts

Many applications have some additional structure that allows us to define quantities reflecting notions of hypocrisy, judgment of others, worldview and moral trajectory. These can serve as aids to social choice or furnish additional mechanisms. Throughout this section, we will assume the concept of participants, as discussed in Section 2.13.

4.1 Hypocrisies

The purpose of our framework is not to judge MSs as better or worse than one another, but to measure distances between them. In this sense, it is agnostic to the MSs involved. Even within the confines of this moral relativism, an individual still may be judged against himself. Given a nontrivial indexing set J and map \(g : Y \times J \to X\), we can define a set of \(|J|(|J|-1)/2\) distances between the MSs of any given participant y ∈ Y. We will term these the “hypocrisies” of y, denoted \(h_{ij}(y) \equiv D_M(g(y, i), g(y, j))\) for i, j ∈ J.

For example, suppose each person has three associated MSs (J = {p, a, b}): (p) that they claim, (a) that they exhibit and (b) that they believe in or aspire to. We will ignore how one practically would ascertain (p) or (b).

Loosely speaking, hpa corresponds to a notion of true hypocrisy (“Do as I say, not as I do”), hpb could be termed superficial hypocrisy (“Do as I say, but what I say differs for you and me”) and hab relates to courage of one's convictions (“I do as I do, not as I should”). This is a vast oversimplification, but vast oversimplifications often prove useful.

The presence of hypocrisies allows us to define various ratios and linear orderings. We can (1) compare two hypocrisies for a given participant, \(h_{ij}(y) \lessgtr h_{kl}(y)\), (2) compare the same hypocrisy for two different participants, \(h_{ij}(y) \lessgtr h_{ij}(y')\), (3) induce a weak linear order on Y for each hypocrisy using (2), thus ranking participants despite the absence of a linear order on X, (4) compute the dimensionless ratio \(h_{ij}(y)/h_{kl}(y)\) (suitably controlled for zeros), (5) compute the dimensionless ratio \(h_{ij}(y)/h_{ij}(y')\) (suitably controlled for zeros) and (6) construct a pseudometric \(D_{J,y}\) on J for each y ∈ Y by pulling \(D_M\) back along \(g(y, \cdot) : J \to X\) (unsurprisingly, \(D_{J,y}(i, j) = h_{ij}(y)\)).

Note that it does not matter where we get Y, J and g. If we have those components, we may define a set of hypocrisies. The essential element of hypocrisies is that they are defined for each participant without regard to any other.
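As an illustration of the definition, the sketch below computes all pairwise hypocrisies for one participant with J = {p, a, b}. Each MS is represented as a hypothetical table giving the probability of a harsh response in each situation, and a mean absolute difference stands in for the metric. Both are assumptions for the example, as are the situation names and values.

```python
from itertools import combinations


def mean_abs_difference(ms_a, ms_b):
    """Stand-in for the metric: mean absolute gap in P(harsh | s) over situations."""
    return sum(abs(ms_a[s] - ms_b[s]) for s in ms_a) / len(ms_a)


def hypocrisies(systems, metric):
    """h_ij(y) = D(g(y, i), g(y, j)) for every unordered pair i, j in J."""
    return {(i, j): metric(systems[i], systems[j])
            for i, j in combinations(sorted(systems), 2)}


# Hypothetical participant: (p) claimed, (a) exhibited, (b) aspired to.
participant = {
    "p": {"theft": 0.9, "fraud": 0.1},
    "a": {"theft": 0.5, "fraud": 0.5},
    "b": {"theft": 0.8, "fraud": 0.2},
}
h = hypocrisies(participant, mean_abs_difference)
```

The result has |J|(|J| − 1)/2 = 3 entries, keyed by the pairs ("a", "b"), ("a", "p") and ("b", "p"). Applying the same function to every participant and sorting on one key gives the weak linear order on Y described above.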

4.2 Judgment

Hypocrisies constitute an inward-facing view of a person. We cannot judge those MSs in isolation, but can judge their constellation for a given participant. Let us now consider an outward-facing view. We will initially assume J is trivial (so \(g : Y \to X\)).

There is no preferred MS or participant in our framework, but we can ask how the world appears to any given MS or participant. MSs and participants are equivalent here, but will not be when we consider nontrivial J, so we will consider them both.

MS x sees x′ at distance \(D_M(x, x')\), and we have function \(K_x : X \to \mathbb{R}\) given by \(K_x(x') \equiv D_M(x, x')\). This is what the world looks like to x, and defines pseudometric \(\tilde{D}_{M,x}(x', x'') \equiv |K_x(x') - K_x(x'')|\) on X, the pull-back of the Euclidean metric along \(K_x\). There is nothing special about the Euclidean metric here, but with other metrics on \(\mathbb{R}\), the resulting \(\tilde{D}_{M,x}\) would not be as intuitive.

To see how D̃M,x differs from DM, consider level sets. The level sets of DM relative to x are (indexed by l ≥ 0) {x′ ∈ X|DM(x, x′) = l}. They partition X, and D̃M,x is a metric among them. D̃M,x does not care about the directions of x′ and x′′ relative to x, just their distances from it under DM.
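The loss of directional information can be seen in a toy sketch. Here MSs are hypothetical points in the plane and the ordinary Euclidean distance stands in for the metric among them. Both are assumptions chosen only to make the level sets easy to picture.

```python
import math


def judgment(x, metric):
    """K_x: maps any MS to its distance from x."""
    return lambda other: metric(x, other)


def pulled_back(x, metric):
    """Pull-back along K_x: |K_x(a) - K_x(b)|."""
    k = judgment(x, metric)
    return lambda a, b: abs(k(a) - k(b))


x = (0.0, 0.0)
x1 = (1.0, 0.0)   # distance 1 from x, to the east
x2 = (0.0, 1.0)   # distance 1 from x, to the north

direct = math.dist(x1, x2)                   # distance under the original metric
as_seen_by_x = pulled_back(x, math.dist)(x1, x2)
```

The two points are well separated under the original metric, yet they lie on the same level set relative to x, so x cannot tell them apart.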

Analogous definitions hold with respect to Y. We define \(\tilde{K}_y(y') \equiv K_{g(y)}(g(y'))\). For given y ∈ Y, we have an induced metric on Y given by \(\tilde{D}_{Y,y}(y', y'') \equiv |\tilde{K}_y(y') - \tilde{K}_y(y'')|\). \(\tilde{K}_y\) is how participant y sees the world.

Taking a cue from this, we define a “judgment” to be any non-negative map \(\tilde{K} : Y \to \mathbb{R}\). A judgment induces a linear order on Y via \(\tilde{K}(y) \lessgtr \tilde{K}(y')\) and a distance on Y via \(|\tilde{K}(y) - \tilde{K}(y')|\). A choice of \(\tilde{K}\) is a preferred standard of judgment and constitutes extra information. One of our \(\tilde{K}_y\)'s could serve, but that requires choosing a specific y. Note that we equally well could define a judgment as \(K : X \to \mathbb{R}\).

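Suppose we have a preferred judgment \(\tilde{K}\) and denote by \(F_Y\) the space of non-negative functions \(Y \to \mathbb{R}\). Every \(\tilde{K}_y \in F_Y\), as is \(\tilde{K}\). For any function \(A : F_Y \to \mathbb{R}\), we can define a metric on \(F_Y\) as \(D_{F_Y}(\tilde{K}, \tilde{K}') \equiv |A(\tilde{K}) - A(\tilde{K}')|\). We also could compare \(\tilde{K}_y\) and \(\tilde{K}_{y'}\) for two participants this way. As an example, we could define \(A(\tilde{K}) \equiv |Y|^{-1} \sum_{y \in Y} \tilde{K}(y)\). In this case, \(D_{F_Y}(\tilde{K}, \tilde{K}_y)\) would represent how closely aligned y's perception of the world is with that of \(\tilde{K}\), through the lens of the arithmetic mean. Note that we could do the same with functions \(A : F_Y \to \mathbb{R}^n\) for any n > 0.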
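The arithmetic-mean example can be sketched as follows. Participants are placed at hypothetical positions on a line, and each participant's judgment is taken to be the absolute gap between positions. That choice is a stand-in for the judgments induced by the metric and is an assumption made for illustration.

```python
def mean_judgment(k, participants):
    """A(K): the arithmetic mean of a judgment K over all participants."""
    return sum(k(y) for y in participants) / len(participants)


def judgment_distance(k1, k2, participants, a=mean_judgment):
    """Distance between two judgments: |A(K1) - A(K2)|."""
    return abs(a(k1, participants) - a(k2, participants))


# Hypothetical one-dimensional positions standing in for participants' MSs.
position = {"y1": 0.0, "y2": 1.0, "y3": 4.0}


def induced_judgment(y):
    """How participant y sees every other participant under the toy distance."""
    return lambda other: abs(position[y] - position[other])


participants = list(position)
gap = judgment_distance(induced_judgment("y1"), induced_judgment("y2"), participants)
```

Here y1 sees the others at mean distance 5/3 and y2 at mean distance 4/3, so their judgments differ by 1/3 through this lens. Two quite different judgments can share the same mean, so the distance can vanish for distinct judgments.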

Let us now generalize to nontrivial J. Judgments defined in terms of X are unchanged, and \(K : X \to \mathbb{R}\) remains the same under the new definition. In terms of participants, things are a little different. The equivalent map now is \(\tilde{K} : Y \times J \to \mathbb{R}\), and this defines a “judgment.” \(\tilde{K}_y\) is replaced with \(\tilde{K}_{y,i}\), and to prefer one requires a choice of both the participant and index. It also is possible to deal with nontrivial J by confining ourselves to a preferred choice of j ∈ J, but this is tantamount to a trivial J with restricted \(X_j \equiv \{g(y, j) | y \in Y\}\).

In the presence of nontrivial J, any choice of judgment \(\tilde{K}\) yields an alternate set of hypocrisies \(\tilde{h}_{ij}(y) \equiv |\tilde{K}(y, i) - \tilde{K}(y, j)|\), those seen through the lens of \(\tilde{K}\) rather than \(D_M\). If \(\tilde{K}\) is chosen to be \(\tilde{K}_{y,i}\) for some y and i, then \(\tilde{h}_{ij}(y) = h_{ij}(y)\) and \(\tilde{h}_{jk}(y) = |h_{ij}(y) - h_{ik}(y)|\).
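Both identities follow from the definitions. The check below assumes the natural extension of the induced judgment to nontrivial J, namely \(\tilde{K}_{y,i}(y', j) = D_M(g(y, i), g(y', j))\), which is not written out explicitly in the text.

```latex
\begin{aligned}
\tilde{h}_{ij}(y)
  &= \left| \tilde{K}_{y,i}(y, i) - \tilde{K}_{y,i}(y, j) \right|
   = \left| D_M(g(y,i), g(y,i)) - D_M(g(y,i), g(y,j)) \right|
   = \left| 0 - h_{ij}(y) \right|
   = h_{ij}(y), \\
\tilde{h}_{jk}(y)
  &= \left| \tilde{K}_{y,i}(y, j) - \tilde{K}_{y,i}(y, k) \right|
   = \left| D_M(g(y,i), g(y,j)) - D_M(g(y,i), g(y,k)) \right|
   = \left| h_{ij}(y) - h_{ik}(y) \right| .
\end{aligned}
```

The first line uses \(D_M(x, x) = 0\) and the non-negativity of \(h_{ij}\). The second shows that, seen from index i, the gap between j and k collapses to a difference of two distances, which by the triangle inequality never exceeds \(h_{jk}(y)\).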

4.3 Worldview

The \(\tilde{K}_y\) are views of the world implied by \(D_M\) and may not reflect participants' actual views. We do not know those actual views, and they would have to be supplied. A judgment \(\tilde{K}\) for each participant will be termed a “worldview.” It is a map \(\eta : Y \to F_Y\), assigning a judgment to each participant. Any map \(A : F_Y \to \mathbb{R}^n\) induces a map \(A \circ \eta : Y \to \mathbb{R}^n\). We will denote by \(\hat{\eta}\) the worldview induced by \(D_M\). Clearly, \(\hat{\eta}(y) = \tilde{K}_y\). For simplicity, we will once again assume J is trivial.

A choice of \(\eta\) allows us to speak not only of a participant's MS, but how they view other MSs. Though we are unlikely to be supplied with an explicit worldview, a distinct metric \(D'_M\) would generate a different worldview \(\hat{\eta}'\), and for purposes of comparison, it may be useful to treat \(\hat{\eta}'\) as externally imposed relative to \(D_M\).

\(\hat{\eta}\) is symmetric and represents the way two participants see one another through the lens of \(D_M\). But worldviews need not be symmetric in general, and participants' views of one another may not be reciprocal. In that case, it may make sense to define a “difference in mutual esteem” as something like \(\eta(y, y') - \eta(y', y)\).

4.4 Moral trajectory

It sometimes is useful to relax our inference assumptions and allow slow variation of an MS. In place of ergodicity, we require that (a) our sample data be divisible into cohorts and (b) within each cohort, we have adequate data to infer the relevant CPD to our satisfaction. The cohorts need not be disjoint. We will define a “moral trajectory” to be a sequence of MSs obtained from cohorts of data for a single participant.

Time has not played any role in our framework so far, but this does not matter. We only require a means of segmenting our data into cohorts. External time intervals can serve, but so could other criteria. As long as our relaxed inference assumption applies to the cohorts, we are fine.

As an example, consider judges issuing criminal sentences. Rather than simply comparing MSs of different judges, we may wish to study the evolution of a given judge's MS over time. We could break our case history into year-long intervals and treat each as an independent MS. This opens the door to a variety of time-series tools, and we could study correlations between moral trajectories of different judges, etc.

Suppose we have a timeframe [0, nΔ] broken into intervals of length Δ and have sufficient sample data in each interval to adequately infer a CPD. Denoting the sequence of inferred CPDs \((v_1, \ldots, v_n)\), there is a sequence of distances \((D_M(v_1, v_2), \ldots, D_M(v_{n-1}, v_n))\). From this, we could compute various moments, autocorrelations, etc. Given two such sequences, we also could compute correlations, etc.
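A small sketch of these trajectory statistics follows. Each yearly MS is reduced to a single hypothetical number (the probability of a harsh sentence) and the absolute difference stands in for the metric. The data and the particular autocorrelation estimator are assumptions for illustration.

```python
from statistics import mean


def step_distances(trajectory, metric):
    """Distances between consecutive inferred MSs along a trajectory."""
    return [metric(a, b) for a, b in zip(trajectory, trajectory[1:])]


def lag1_autocorrelation(xs):
    """Lag-1 sample autocorrelation of a sequence (0.0 if it is constant)."""
    m = mean(xs)
    variance = sum((x - m) ** 2 for x in xs)
    if variance == 0:
        return 0.0
    return sum((a - m) * (b - m) for a, b in zip(xs, xs[1:])) / variance


# Hypothetical yearly MSs for one judge, each reduced to P(harsh sentence).
trajectory = [0.2, 0.3, 0.5, 0.5, 0.9]
steps = step_distances(trajectory, lambda a, b: abs(a - b))
average_step = mean(steps)
persistence = lag1_autocorrelation(steps)
```

A trajectory of n MSs yields n − 1 step distances. Applying the same functions to a second judge gives a second sequence that can be correlated with the first.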

4.5 Applications to social choice

Let us briefly consider a few of the many ways these concepts could be applied to social choice theory.

Any single hypocrisy induces a linear order among participants, and this could power any order-based social choice mechanism (e.g. ranking candidates).

We also could use hypocrisy statistics from a limited subset of participants to adjust our global geometry. For example, consider two MSs per person: xc(y) is that claimed, and xa(y) is that observed. Let us assume access to xc(y) for everyone, perhaps through surveys or interviews, but access to xa(y) only for a small subset of public figures (those with voting records, judicial histories, etc.). We could construct a PD P(h) over hypocrisy from the known subset and apply it to the unknown subset. Note that we cannot infer or sample the unknown xa(y) itself, only its CPD. Denote by Pc(y) and Pa(y) the CPDs representing xc(y) and xa(y). In lieu of Pa(y), we have a PD over points in BS,R. To sample it, we first draw h according to P(h), then sample uniformly within the level set {DM(Pa(y), Pc(y)) = h}.

We could perform Monte Carlo analysis by sampling each Pa(y) in this fashion, either in service of our own social choice mechanism or to test the robustness of another mechanism to such fuzziness. If demographic or other labeling data l(y) are present, we could infer P(h|l) rather than P(h) from the known hypocrisies. Sampling each Pa(y) from P(h|l(y)) could mitigate some of the (substantial) selection bias in our example, but at the cost of noisier inference.
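The sampling procedure can be sketched in a deliberately simple setting. Each MS is reduced to one hypothetical number in [0, 1] and the absolute difference stands in for the metric, so the level set at distance h from a claimed MS contains at most two points. These simplifications, the data and the function names are all assumptions. In a realistic space, sampling uniformly within a level set is considerably harder.

```python
import random


def sample_exhibited(p_claimed, known_h, rng):
    """Draw h from the empirical P(h), then sample uniformly from the level set.

    In this one-dimensional toy, the level set {p : |p - p_claimed| = h}
    has at most two points inside [0, 1].
    """
    h = rng.choice(known_h)
    level_set = [p for p in (p_claimed - h, p_claimed + h) if 0.0 <= p <= 1.0]
    if not level_set:
        raise ValueError("hypocrisy h is too large for this claimed MS")
    return rng.choice(level_set)


def monte_carlo(claimed, known_h, draws, seed=0):
    """Return `draws` sampled exhibited MSs for every participant."""
    rng = random.Random(seed)
    return {y: [sample_exhibited(p, known_h, rng) for _ in range(draws)]
            for y, p in claimed.items()}


known_h = [0.05, 0.10, 0.20]          # hypothetical hypocrisies of public figures
claimed = {"y1": 0.30, "y2": 0.85}    # hypothetical claimed MSs
samples = monte_carlo(claimed, known_h, draws=1000)
```

Each set of draws can then be fed to a social choice mechanism to see how often its result changes. Conditioning on labels would amount to passing a different `known_h` list for each participant.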

Known hypocrisies of politicians also could be used to penalize candidates when a utility function is employed for social choice. More generally, hypocrisies offer a useful measure of fuzziness and may warn us of potentially unreliable results. Quantities such as judgment, worldview and mutual esteem can be employed to measure polarization and the likelihood of disaffection or to test the robustness of results to geometric fuzziness as in the hypocrisy scenario above.

They also can be used to emulate individual decision-making. Any given DM generates a worldview, a judgment for each participant. Though this incorporates information about the individual's MS, it may not reflect their true judgment. This is not a matter of inference or hypocrisy. A worldview contains far more information than a metric, and DM necessarily distills out certain aspects. Survey-based knowledge of MSs is plausible, but knowledge of true judgments is not. DM may be all we have to work with, and a noisy DM at that. However, this is not a dealbreaker. A metric-based emulation mechanism need not precisely reflect each individual's voting choice. It only must do so statistically and arrive at the correct overall election result. For example, we could emulate individual voting by ranking candidates according to each individual's K̃y. In principle, we could replace voting altogether with a mechanism for MS acquisition and maintenance. This likely would be an awful idea in practice, but suggests that automated mechanisms along these lines may be worth exploring.

Moral trajectories could be used to detect convergent or divergent judicial behaviors, the impact of structural changes to mechanisms informing or effecting social choice, or sudden changes in behavior. Large changes in the geometry of voters or candidates or (most alarmingly) the two relative to one another could signal tectonic social shifts, which merit careful examination and possibly even reconsideration of the mechanism of social choice.

5. Metrics among conditional probability distributions

We now offer several methods of deriving a metric (or pseudometric) on CS,R from P(S) and dR. In this section, we will be careful to distinguish metrics from pseudometrics. Our approach is to break the problem into two parts: (1) from dR derive a metric or pseudometric D on PR, the space of PDs over R, and (2) from P(S) and D, derive a pseudometric DM on CS,R.

5.1 Pseudometric vs metric

Note that almost any plausible methods, including our own, result in pseudometrics rather than metrics on CS,R. We almost always must sum, integrate or average over P(S) in some fashion, and this introduces degeneracy with near certainty. It turns out this is not a problem for two reasons.

A pseudometric suffices for most applications of our framework. Coincident points do not pose a problem for the techniques mentioned, and rarely do at all. It also turns out that we almost always end up with a metric on X, the set we actually care about. This is because X is small. DM on X is just the restriction of DM on CS,R to the particular set of CPDs representing X. Unless DM is enormously degenerate on CS,R, or some aspect of a given problem conspires to retain degeneracy, the probability that degeneracies will survive restriction to X is tiny. The same holds if dR is a pseudometric, and this is one reason why a pseudometric is sufficient in that role.

5.2 Some metrics on CS,R

The central obstruction to deriving metrics or pseudometrics ab initio is the triangle inequality. This is part of what motivated dR as an a priori structure. Though we will not include proofs here, we observe that they depend heavily on two principles: (1) the pull-back of a metric is a pseudometric or metric and (2) the weighted average of a family of metrics is a pseudometric or metric.

In addition to being metrics or pseudometrics, our candidates pass certain sanity tests. For strongly peaked distributions, we require that D resembles dR, and DM resembles D. In what follows, w can be any strictly positive weight on R.

The following two candidates for D are pseudometrics. Here, P, Q are PDs over R (i.e. elements of PR), and the ± are in tandem.

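\[ D(P, Q) \equiv \int_R dx \int_R dy \, w(x) \, w(y) \, d_R(x, y) \left| (P(x) \pm P(y))^2 - (Q(x) \pm Q(y))^2 \right| \]

Given P(S) and a choice of D, there are two straightforward choices for \(D_M\). Here, f, g are the function forms of CPDs P(R|S). For example, \(f : S \to P_R\) yields a PD over R for each s ∈ S.

\[ D_M^{(1)}(f, g) \equiv \sum_{s \in S} P(s) \, D(f(s), g(s)) \]
\[ D_M^{(2)}(f, g) \equiv D\left( \sum_{s \in S} P(s) f(s), \; \sum_{s \in S} P(s) g(s) \right) \]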

When D is a metric, DM(1) is a metric, and DM(2) is a pseudometric. Note that the dependence on dR is implicit in D.
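The difference between the two constructions can be seen in a small sketch: the first averages distances over situations, the second takes the distance between situation-averaged response distributions. Total variation is used as a stand-in for D purely because it is a familiar metric on distributions. It ignores dR and is not one of the candidates proposed here. The situations, responses and probabilities are hypothetical.

```python
def total_variation(p, q):
    """Stand-in for D: half the L1 distance between two PDs over R."""
    return 0.5 * sum(abs(p[r] - q[r]) for r in p)


def dm1(f, g, p_s, d):
    """First construction: average over situations s of D(f(s), g(s))."""
    return sum(p_s[s] * d(f[s], g[s]) for s in p_s)


def marginal(f, p_s):
    """Response distribution averaged over situations: sum over s of P(s) f(s)."""
    responses = next(iter(f.values())).keys()
    return {r: sum(p_s[s] * f[s][r] for s in p_s) for r in responses}


def dm2(f, g, p_s, d):
    """Second construction: D applied to the two situation-averaged PDs."""
    return d(marginal(f, p_s), marginal(g, p_s))


p_s = {"s1": 0.5, "s2": 0.5}
# Two CPDs that respond in exactly opposite ways to the two situations.
f = {"s1": {"lenient": 1.0, "harsh": 0.0}, "s2": {"lenient": 0.0, "harsh": 1.0}}
g = {"s1": {"lenient": 0.0, "harsh": 1.0}, "s2": {"lenient": 1.0, "harsh": 0.0}}

first = dm1(f, g, p_s, total_variation)
second = dm2(f, g, p_s, total_variation)
```

The two CPDs are maximally different situation by situation, yet their averaged response distributions coincide, so the second construction assigns them distance zero. This is the kind of degeneracy that makes it a pseudometric rather than a metric.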

6. Euclidean embeddings of dR and DM

An isometric embedding of metric space (Z, d) in metric space (Z′, d′) is an injection \(i : Z \to Z'\) that is metric-preserving (\(d(z_1, z_2) = d'(i(z_1), i(z_2))\)), and a Euclidean embedding is an isometric embedding in \(\mathbb{R}^n\) (endowed with the Euclidean metric). We will refer to a Euclidean space or embedding as “low-dimensional” if n = 1, 2, or 3. Our ability to visualize is limited to low-dimensional Euclidean spaces, and it is easiest to work with these.

Our framework features two metrics: dR and DM. A Euclidean embedding can assist in the a priori choice and specification of dR and in the visualization of a derived DM. The assumption that both metrics take units implies they have isometric embeddings in metric vector spaces, but these need not be low-dimensional or Euclidean.

6.1 Euclidean embeddings

Young and Householder identified the criterion for a Euclidean embedding to exist (Young and Householder, 1938). Let \(Z = \{z_1, \ldots, z_n\}\) and \(d_{ij}\) be the distance matrix for \((z_1, \ldots, z_{n-1})\) relative to \(z_n\). A Euclidean embedding of d exists iff the (n − 1) × (n − 1) matrix \(B_{ij} \equiv \frac{1}{2}(d_{in}^2 + d_{jn}^2 - d_{ij}^2)\) has only nonnegative eigenvalues, in which case rank B is the minimal embedding dimension. More efficient methods exist for actual calculation (see Crippen, 1978).
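As a small worked check of the criterion, take three hypothetical points with \(d_{12} = 3\), \(d_{13} = 4\) and \(d_{23} = 5\), and form the 2 × 2 matrix with entries \(B_{ij} = \frac{1}{2}(d_{i3}^2 + d_{j3}^2 - d_{ij}^2)\) relative to \(z_3\).

```latex
B =
\begin{pmatrix}
d_{13}^2 & \tfrac{1}{2}\left(d_{13}^2 + d_{23}^2 - d_{12}^2\right) \\
\tfrac{1}{2}\left(d_{13}^2 + d_{23}^2 - d_{12}^2\right) & d_{23}^2
\end{pmatrix}
=
\begin{pmatrix}
16 & 16 \\
16 & 25
\end{pmatrix},
\qquad
\operatorname{tr} B = 41 > 0,
\qquad
\det B = 400 - 256 = 144 > 0 .
```

Positive trace and determinant mean both eigenvalues are positive, so an exact Euclidean embedding exists and its minimal dimension is rank B = 2. One such embedding is \(z_3 = (0, 0)\), \(z_1 = (4, 0)\), \(z_2 = (4, 3)\), which reproduces all three distances.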

Exact Euclidean embeddings are rare, and low-dimensional ones are rarer. In most cases, an approximate embedding must suffice. Metric multidimensional scaling (MDS) is a method that replaces Young and Householder's B matrix with a lower-rank surrogate in a manner closely resembling principal component analysis (PCA). Details can be found in Eckart and Young (1936), and an alternate approach is offered in Matousek (2002, 2013). We can measure the quality of an approximate embedding in a variety of ways, such as the fraction of absolute eigenmass captured.

6.2 Visualization of DM

DM is derived rather than chosen, and we cannot expect it to have an exact low-dimensional Euclidean embedding. An approximate low-dimensional Euclidean embedding is possible, but may be of low quality. If the top three eigenvalues do not comprise most of the eigenmass, then too much information may have been lost. Since our purpose is visualization, this determination is subjective. We can produce a picture, but it may not be representative or useful.
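One plausible reading of the eigenmass criterion is sketched below: the share of the total absolute eigenvalue mass carried by the k largest non-negative eigenvalues. The exact definition is an assumption, as are the function name and the sample spectrum. The eigenvalues themselves would come from whatever linear algebra routine is used to decompose the B matrix.

```python
def eigenmass_fraction(eigenvalues, k=3):
    """Fraction of absolute eigenmass captured by the k largest eigenvalues.

    Negative eigenvalues count toward the total mass but cannot be
    represented by Euclidean coordinates, so they never count as captured.
    """
    total = sum(abs(v) for v in eigenvalues)
    if total == 0:
        raise ValueError("all eigenvalues are zero")
    top = sorted(eigenvalues, reverse=True)[:k]
    return sum(max(v, 0.0) for v in top) / total


# Hypothetical spectrum of a B matrix with one small negative eigenvalue.
spectrum = [5.0, 3.0, 1.5, 0.4, -0.1]
quality = eigenmass_fraction(spectrum, k=3)
```

Whether a given fraction is acceptable remains a subjective call, as noted above.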

6.3 Specification of dR

The selection of dR often is the most difficult aspect of our setup. It is less mutable and more critical than the choice of P(S), and there is no obvious way to go about it, except in the simplest cases.

dR is not just a pretty face. It is the core structure from which DM derives, and the utility of the framework relies on dR embodying a sensible intuition for distances on R.

Rarely does a natural dR present itself, and R may be a complicated space. We often must leverage piecemeal intuition for distances into a precise metric, and a Euclidean embedding can help.

Were there a single correct dR, this would not be the case. A general dR is unlikely to have a reasonable-dimensional exact embedding or a sufficiently high-quality approximate one, and the crucial role of dR will not brook lower quality.

However, our imperfect intuition bestows a degree of flexibility. We are doing something akin to heuristic embedding, much as an artist may render vague visual concepts into a cogent scene. Rather than a single correct dR, there usually is a set of plausible candidates that fit our intuition. Practical or other considerations may further restrict this set, but it generally remains well populated. We will denote it Sd. Absent other criteria, any element of Sd may be selected as dR. Robustness to that choice is a good test of the framework.

The larger Sd, the more likely there exists a reasonable-dimensional exact (or high-quality approximate) Euclidean embedding of at least one metric d ∈ Sd. This may not be low-dimensional, but sometimes can be constructed from low-dimensional component embeddings in a fashion we now describe.

To avoid excess verbiage, we will define a “Euclidean proxy,” to be either an exact Euclidean embedding or a sufficiently high-quality approximate Euclidean embedding (one that does not lose relevant information). We do not assume Euclidean proxies are low-dimensional, but do require them to be of manageable dimension (i.e. not intractably large).

6.3.1 Subdivisible Euclidean embedding of dR: example

It sometimes is possible to construct a higher-dimensional Euclidean proxy from easily visualized pieces. Certain systems, including many that arise in practice, have an R that naturally decomposes into semantically distinct components. We can try to construct a low-dimensional Euclidean proxy for each component, and then glue these together.

Consider a judicial sentencing framework where MSs arise from judges, S is a set of crimes and R is a set of punishments. Judges are presented with crimes, and they issue fines and/or jail terms. A point in R has natural coordinates (f, p), where f is in dollars and p is in years. Note that dollars and years are not units. R is not a vector space, and we have not posited translation invariance. f and p happen to be numeric labels, but have no more structure than lexical labels would.

We may not have direct intuition for the distance between ($5000, 4y) and ($30000, 1y), but we do have a sense of distances between two fines or two jail terms. Among other things, fines and jail terms each have a meaningful linear ordering.

Let us assume that fines are translation invariant in coordinates of dollars, corresponding to an exact embedding in \(\mathbb{R}\) via \(h_1(f) = f\) (i.e. assigning the numeric label its numeric value), with corresponding metric \(d_1(f_1, f_2) = |f_1 - f_2|\). It now makes sense to refer to dollars as the “unit” for fines.

For jail terms, let us suppose this is not the case. Perhaps a one-year difference in jail term does not have the same marginal impact on a one-year sentence as on a ten-year sentence. Instead, we will assume a doubling of sentence has uniform significance (an unlikely perspective but suitable for illustration). This corresponds to an embedding in \(\mathbb{R}\) via \(h_2(p) = c_1 + c_2 \ln p\) for constants \(c_1\), \(c_2\) (in practice, we would probably employ something like \(\ln(p + 1)\) to avoid singularities near the origin). Choosing \(c_1 = 0\) and \(c_2 = 1/\ln 2\), we get \(d_2(p_1, p_2) = |\log_2(p_1/p_2)|\).

Translation invariance only holds in the embedding coordinates, and \(|p_1 - p_2|\) has no universal meaning. Only \(|\log_2(p_1/p_2)|\) has the same meaning everywhere, allowing the use of units. We will define unit 1T (for “term-doubling”) to be \(\Delta \log_2 p = 1\). Taking f = 0 and \(\log_2 p = 0\) as the coordinate origins, ($5000, 4y) becomes ($5000, 2T).

To obtain a Euclidean proxy for dR, we must relate the scales of the two coordinates. If we deem $20000 equivalent to one term-doubling, we can write our point as ($5000, $40000) in unified units of dollars.

We now have an embedding in \(\mathbb{R}^2\). In terms of our original coordinates, it is \(d_R([f, p], [f', p']) = \sqrt{(f - f')^2 + 20000^2 (\log_2(p/p'))^2}\).
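The sentencing example can be computed directly. The sketch below embeds a (fine, jail term) pair in the plane using the $20000 per term-doubling scale from the text and takes the Euclidean distance. The function and constant names are hypothetical, and jail terms must be strictly positive because of the logarithm.

```python
import math

DOLLARS_PER_DOUBLING = 20000.0  # scale chosen in the example: $20000 per term-doubling


def embed(fine, term_years):
    """Map (fine in dollars, jail term in years) to unified dollar coordinates."""
    return (fine, DOLLARS_PER_DOUBLING * math.log2(term_years))


def d_r(a, b):
    """Euclidean distance between two punishments in the embedded plane."""
    return math.dist(embed(*a), embed(*b))


point = embed(5000, 4)                 # the point ($5000, 4y) from the text
gap = d_r((5000, 4), (30000, 1))       # the pair we had no direct intuition for
```

The first result reproduces the point ($5000, $40000) from the text. The second gives a distance of roughly $47170 between the two punishments.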

6.3.2 Subdivisible Euclidean embedding of dR: general case

Suppose in a problem, (1) every response can be decomposed into n distinct conceptual components: \(R \equiv \prod_{i=1}^{n} R_i\), (2) we have clear intuition for distances within each \(R_i\) and (3) we have some sense of how much each \(R_i\) should contribute to \(d_R\). Note that R need only be separable semantically, not statistically or structurally.

For each \(R_i\), we attempt to build a low-dimensional Euclidean proxy \(h_i : R_i \to \mathbb{R}^{n_i}\) (with \(n_i \le 3\)). If the \(R_i\) are small, simple spaces, such proxies are quite plausible. The corresponding metrics then are \(d_i(x, x') \equiv |h_i(x) - h_i(x')|\).

To combine the \(d_i\) into \(d_R\), we require a set of distance conversion factors. Let \(c_{ij} > 0\) denote the distance in \(R_j\) corresponding to unit distance in \(R_i\). These must satisfy \(c_{ik} = c_{ij} c_{jk}\), \(c_{ii} = 1\) and \(c_{ij} = 1/c_{ji}\), and they comprise n − 1 independent values. Though their effect is simply to scale the embeddings \(h_i \to c_{i1} h_i\) for i = 2…n (with \(d_i\) adjusted accordingly), they are not superfluous. The Euclidean proxies for the \(R_i\) are built in isolation, and their scales are arbitrary. We must adjust them to reflect our intuition for relative contributions, and the \(c_{ij}\) provide the necessary lever.

The resulting metric is \(d_R([r_1, \ldots, r_n], [r'_1, \ldots, r'_n]) \equiv \sqrt{\sum_{i=1}^{n} c_{i1}^2 (h_i(r_i) - h_i(r'_i))^2}\), where \(r_i \in R_i\) and \(r'_i \in R_i\) are expressed in unembedded coordinates.
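The gluing step can be sketched generically. Each component supplies a hypothetical embedding function returning a tuple of coordinates, together with a scale giving the distance in the first component that corresponds to unit distance in that component. The helper names are illustrative, and the fines and jail terms example is reused only to check the result against the earlier formula.

```python
import math


def combined_metric(embeddings, scales):
    """Build the glued metric from per-component embeddings and scales to component 1."""
    def d_r(r, r_prime):
        total = 0.0
        for h, c, x, x_prime in zip(embeddings, scales, r, r_prime):
            total += (c * math.dist(h(x), h(x_prime))) ** 2
        return math.sqrt(total)
    return d_r


def conversion_matrix(scales):
    """Full table of conversion factors implied by the n - 1 independent scales."""
    return [[ci / cj for cj in scales] for ci in scales]


# Fines embed as themselves; jail terms embed as log2 of the term in years.
embeddings = [lambda f: (f,), lambda p: (math.log2(p),)]
scales = [1.0, 20000.0]   # the first entry is always 1; $20000 per term-doubling

d_r = combined_metric(embeddings, scales)
gap = d_r((5000, 4), (30000, 1))
c = conversion_matrix(scales)
```

The full table is built from the scales as ratios, so the consistency conditions on the conversion factors hold automatically. That is one way to see why only n − 1 of them are independent.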

This approach still is very restrictive, and we only can represent a small fraction of metrics this way. Aside from the need for a semantic decomposition of R, and low-dimensional Euclidean proxies for all the Ri, the conversion factors also impose a big constraint. They require that the relative meaning of distances in Ri and Rj be the same everywhere. Otherwise, we could not glue them with a simple, global scale factor. If such a constraint is unacceptable, this method cannot be used.

Fortunately, conceptual decomposition is organic to many problems. In applications where we have the flexibility to choose R, this may motivate our choice. Also, we always can try to expand an existing R into a suitable space.

Although we could try something similar with non-Euclidean embeddings, we made implicit use of a special property of Euclidean spaces: Euclidean metrics can be combined using a Euclidean metric. All p-norm metrics have this property, but most other families of metrics do not.

7. Conclusion

The framework described has broad applicability. Although our discussion centered on MSs and social choice applications, any decision system that can be framed in suitable terms may be analyzed using our methods. Examples could include customer satisfaction, political intelligence, judicial analysis and business planning.

There are many possible directions of future research. Our exploration of derived metrics (distilled to the selection presented in Section 5) was by no means exhaustive. Each metric captures certain facets of behavior, and additional candidates would mean greater flexibility.

Questions of stability relative to changes in underlying assumptions and components are important and deserve attention in any real application. Lack of robustness of DM to small changes in S, R, P(S) and dR can impair its utility. It also should be stable in the face of minor changes to BS,R, the estimation method or aggregation procedure. Conceptually small changes in the framing of a problem should not drastically alter results.

We have said little about practical issues of data acquisition, cleaning or curation. These are of critical importance in any application, as is the relevance of those data. In addition to standard empirical issues, there may be specific ones surrounding our particular combinations of inference, estimation and Euclidean embedding. Our earlier comments notwithstanding, direct inference of distances also may be worth exploring.

The potential applications to social choice are myriad. We mentioned a few, briefly and imprecisely. Each of these could prove beneficial or interesting. The idea of using static or adiabatic knowledge of MSs to automate decisions may have applications in diverse fields, replacing frequent, burdensome social choices with upfront data acquisition and some periodic maintenance. A great deal more can be said about optimization of utility functions, constraints and cluster analysis as well. All these topics may provide fruitful avenues of inquiry.

References

Arrow, K.J. (1950), “A difficulty in the concept of social welfare”, Journal of Political Economy, Vol. 58 No. 4, pp. 328-346.

Crippen, G. (1978), “Rapid calculation of coordinates from distance matrices”, Journal of Computational Physics, Vol. 26, pp. 449-452.

Eckart, C. and Young, G. (1936), “The approximation of one matrix by another of lower rank”, Psychometrika, Vol. 1 No. 3, pp. 211-218.

Hamming, R.W. (1950), “Error detecting and error correcting codes”, The Bell System Technical Journal, Vol. 29 No. 2, pp. 147-160.

Levenshtein, V.I. (1966), “Binary codes capable of correcting deletions, insertions and reversals”, Soviet Physics–Doklady, Vol. 10 No. 8, pp. 707-710.

Matousek, J. (2002), Lectures on Discrete Geometry, Springer-Verlag, New York, NY, available at: https://link.springer.com/book/10.1007/978-1-4613-0039-7#about.

Matousek, J. (2013), “Lecture notes on metric embeddings”, available at: https://kam.mff.cuni.cz/~matousek/ba-a4.pdf.

Mitchell, T.M. (1997), Machine Learning, McGraw-Hill, New York, NY, available at: https://www.worldcat.org/title/machine-learning/oclc/36417892.

Mohri, M. (2018), Foundations of Machine Learning, 2nd ed., MIT Press, Cambridge, MA, available at: https://dl.acm.org/doi/10.5555/2371238.

Rao, C.R. (1945), “Information and accuracy attainable in the estimation of statistical parameters”, Bulletin of the Calcutta Mathematical Society, Vol. 37 No. 3, pp. 81-91.

Vapnik, V.N. (1999), The Nature of Statistical Learning Theory, 2nd ed., Springer-Verlag, New York, NY, available at: https://link.springer.com/book/10.1007/978-1-4757-3264-1#about.

Young, G. and Householder, A. (1938), “Discussion of a set of points in terms of their mutual distances”, Psychometrika, Vol. 3, pp. 19-22.

Acknowledgements

The author would like to thank Don Bamber for proposing social choice as an application.

Corresponding author

Kenneth Halpern can be contacted at: khalpern@alum.mit.edu