Quality Assessment in Crowdsourced Classification Tasks

Purpose: Ensuring quality is one of the most significant challenges in microtask crowdsourcing. Aggregating the data collected from the crowd is an important step in inferring the correct answer, but existing studies appear to be limited to single-step tasks. This study looks at multiple-step classification tasks and examines aggregation in such cases, and is therefore useful for assessing classification quality.
Design/methodology/approach: We present a model that captures the workflow, questions and answers for both single-question and multiple-question classification tasks. We propose an adapted approach on top of the classic approach so that our model can handle tasks with several multiple-choice questions in general, rather than a specific domain or a specific hierarchical classification. We evaluate our approach with three representative tasks from existing citizen science projects for which we have a gold standard created by experts.
Findings: The results show that our approach can provide significant improvements to the overall classification accuracy. Our analysis also demonstrates that all algorithms achieve higher accuracy on the volunteer-generated than on the paid-generated data sets for the same task. Furthermore, we observed interesting patterns in the relationship between the performance of different algorithms and workflow-specific factors, including the number of steps and the number of available options in each step.
Originality/value: Due to the nature of crowdsourcing, aggregating the collected data is an important process for understanding the quality of crowdsourcing results. Different inference algorithms have been studied for simple microtasks consisting of single questions with two or more answers. However, as classification tasks typically contain many questions, our proposed method can be applied to a wide range of tasks, including both single-question and multiple-question classification tasks.


Introduction
Microtask crowdsourcing has attracted interest from researchers, businesses and government as a means to leverage human computation in their activities in a fast, accurate and affordable way. In the last ten years, we have seen it applied to anything from spotting sarcasm on social media to discovering new galaxies and helping digitise large cultural heritage collections. The underlying model is relatively straightforward: a problem is decomposed into smaller chunks that can be tackled independently by several people. Their individual outputs are then compared and consolidated into a final solution (Shahaf and Horvitz, 2010). However, none of these steps is actually easy: some problems are less amenable to microtasking and need to be turned into bespoke microtask workflows (Bernstein et al., 2010; Kulkarni et al., 2011; Kittur et al., 2011); the performance of the crowd varies across tasks (Mao et al., 2013; Redi and Povoa, 2014); and determining which answers are the most useful ones can be both complex and computationally expensive (Kittur et al., 2008; Snow et al., 2008; Vickrey et al., 2008; Demartini et al., 2012; Wiggins et al., 2011). It is this last aspect, determining the correct answers, that we focus on in this paper. The aggregation method proposed in this paper is able to infer the correct answer for a range of tasks involving either single-step or multiple-step classifications when gold answers are not available. It also serves as a proxy to help task requesters assess the quality of the crowdsourced results when they already have some gold answers, for example when piloting a specific multiple-step task design before putting it online at a larger scale.
Quality assessment in microtask crowdsourcing refers to the evaluation of the quality of the workers' work. First, quality can be assessed based on different criteria, as it has many dimensions (Kahn et al., 2002; Batini et al., 2009). In the crowdsourcing context, it depends on the type of the data, which is determined by the task type (Malone et al., 2010; Gadiraju et al., 2014, 2015). The most common quality metric we have seen is accuracy (Bernstein et al., 2010; Gelas et al., 2011; Hung et al., 2013; Zhang et al., 2017a, 2017b), calculated against available gold standards. However, in many cases the gold standard is not available. This is where inference algorithms come into the picture, helping to infer or predict the correct (gold) answer. Second, quality assessment can be done either on the fly while the task is running (Ipeirotis et al., 2014), which can be used to optimise task assignment and hence reduce cost, or in post-task aggregation (Whitehill et al., 2009; Ipeirotis et al., 2010; Bachrach et al., 2012; Difallah et al., 2015a) to assess the overall quality of the classification. This work focuses on aggregating the results after the crowdsourcing task has been completed, so that accuracy can be calculated based on the gold standards we have.
There are many different types of tasks to which microtask crowdsourcing is applied (Eickhoff and de Vries, 2011; Difallah et al., 2015b; Yang et al., 2016; Zheng et al., 2017a). We focus on inferring the correct answer for classification tasks, one of the most popular types of crowdsourcing task. We are by no means the first to do so; previous research has proposed a range of methods to infer and predict the quality of crowd answers (Bachrach et al., 2012; Dawid and Skene, 1979; Difallah et al., 2015a; Hare et al., 2013; Ipeirotis et al., 2010; Karger et al., 2011; Loni et al., 2014; Paulheim and Bizer, 2014; Hung et al., 2013; Rosenthal and Dey, 2010; Simpson et al., 2013; Whitehill et al., 2009). Whilst all methods have their benefits, they work on relatively simple task models that consist of single questions with one or more answers (Sheshadri and Lease, 2013; Hung et al., 2013; Zhang et al., 2017a; Zheng et al., 2017b). The scenario we are targeting is different. We take a close look at existing classification tasks from Zooniverse and notice that a large percentage of these tasks are multiple-step tasks, as shown in Figure 1. In fact, in a random sample of 20 tasks, only 20 per cent have a single question. Consider the example in Figure 2, which is taken from a labelled citizen science project in which pictures taken in the Serengeti national park in Tanzania are analysed online by thousands of volunteers [1]. The crowd is asked to answer a series of related, independent questions about what they see in the image, including the types and number of animals.
Our work is motivated by a range of online crowd science classification projects. Each of them uses a slightly different type of task to classify an object, for example, an image, according to a number of criteria. A relatively complex task is split into several steps, typically in the form of multiple-choice questions.

Crowdsourced classification tasks
Sometimes there are dependencies between steps, as the answer chosen for one question prompts other questions to be displayed. For instance, in the Cities at Night project, which uses microtask crowdsourcing to analyse night-time photographs taken by astronauts onboard the ISS[2], seven different Options are provided for the first question to identify what the given image contains (a city, stars, aurora, astronaut, black image, no photo or none of these), and only when "city" is identified are two more independent questions asked, to classify cloudiness (three Options: cloudy, someclouds, clear) and sharpness (two Options: sharp, blurry). In the GalaxyZoo[3] project, several different questions are asked in sequence depending on the answers to previous questions, and questions and answers are arranged in a decision tree. It has a more complex workflow in which more questions are involved, and questions vary based on what has been chosen in the previous classification step. For instance, the first question is "Is the galaxy simply smooth and rounded, with no sign of a disk?" and three options are provided: "Smooth", "Features or disk", and "Star or artifact". When "Smooth" is chosen, a new question is asked, "How rounded is it?", with the options "Completely round", "In between" and "Cigar shaped". If "Features or disk" is chosen as the answer to the first question, a different set of subsequent questions is asked. Other times, workflows are rather sequences of independent, though related, questions, such as what we see in Snapshot[1] (Figure 2). Determining the correct answer for such complex classification tasks can be tricky and has not been fully studied yet. Existing research also does not investigate how inference methods could affect the classification accuracy when using different crowd types for complex classification tasks. As a result, there is a need to understand whether different algorithms and aggregation strategies are required for different crowd contexts.
To tackle the issue of determining the correct answer from crowd-produced annotations for classification tasks with multiple questions, we model complex classification tasks that span multiple, related questions as a graph. To the best of our knowledge, we are the first to propose using the structure of a microtask crowdsourcing workflow as an additional feature to support inference algorithms in making decisions about correct labels, using output data produced by the crowd. We look at three inference algorithms (majority voting [MV] [Paulheim and Bizer, 2014; Hung et al., 2013], message passing [MP] [Karger et al., 2011] and expectation maximisation [EM] [Dawid and Skene, 1979; Whitehill et al., 2009]), which have been commonly used for answer inference in microtask crowdsourcing. We adapt these algorithms to work on the graph modelled from crowdsourcing tasks with multiple steps. We perform a large-scale evaluation of the performance of these algorithms on six data sets across two crowd contexts from three image classification tasks: Darkskies[2], GalaxyZoo[3] and Snapshot Serengeti[1]. The rationale behind choosing data sets from both volunteer and paid crowd contexts is that algorithms may perform differently in these contexts. The experiments show that our aggregation strategy achieves significantly better performance than the current approach of naively applying individual algorithms at each node level. The results also indicate that MV, despite its simplicity, compares well with more sophisticated approaches that consider additional factors such as user performance and hence need more computation time. Sophisticated algorithms such as expectation maximisation, however, can complement MV for relatively complex tasks. We also find that each algorithm obtains better inference accuracy in the volunteer context than in the paid crowdsourcing context.
The rest of this paper is structured as follows: Section 2 provides the foundations of the existing algorithms which we have adapted to handle answer inference in classification tasks with multiple questions, and illustrates how this aggregation fits in the quality assessment process. In Section 3, we explain our graph model and the notations used in the graph, formalise the classification problem, and elaborate our aggregation approach. In Section 4, we perform a large-scale evaluation and demonstrate the performance of different algorithms. Section 5 discusses our findings. Section 6 reviews existing work which has inspired our research, and Section 7 summarises our results and future work.

Foundations
A classification task generally has one single question and a few options to choose from, such as the one shown in Figure 3. It looks like a simple tree structure, where the classification starts with a root node, which refers to the object to be classified, and has a few branches, which represent the available options. In this section, we present three existing algorithms, MV, MP and EM, that have been used to infer the true label for a single-step multiple-choice classification task.

These are the foundations for understanding our proposed adapted approach. The notations used in explaining the individual algorithms and our method are defined in Table I and are used throughout this paper.

Majority voting
Due to its simplicity, MV has been used in many microtask projects (Hung et al., 2015; Liu et al., 2012) and is the standard aggregation method in some existing crowdsourcing platforms [4]. Given the list of options for a labelling task and an object, the MV algorithm chooses those options with the highest number of votes from the crowd. Formally, it takes as input an object $o$ and the crowd labels $L_o$, and outputs the resulting candidate label $\hat{l}_o$ that received the most votes from the users.

Table I. Notations
Symbol | Definition
$L_o$ | The set of all labels from the crowd for object $o$
$L_u$ | The set of all labels from user $u$
$l_o^u$ | The label for object $o$ from user $u$
$\hat{l}_o$ | The inferred label for object $o$

IJCS 3,3
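The MV rule described above can be sketched in a few lines. This is a minimal illustration rather than the implementation used in the study; the function name `majority_vote`, the example labels and the tie-breaking rule (first label seen wins) are assumptions made for the sketch.

```python
from collections import Counter


def majority_vote(labels_for_object):
    """Return the label that received the most votes for one object.

    `labels_for_object` plays the role of L_o: every crowd label for object o.
    Ties are resolved in favour of the label seen first, which is an
    arbitrary choice made for this sketch.
    """
    if not labels_for_object:
        raise ValueError("at least one crowd label is required")
    counts = Counter(labels_for_object)
    return counts.most_common(1)[0][0]


# Hypothetical crowd labels for a single image.
crowd_labels = ["city", "stars", "city", "aurora", "city"]
inferred = majority_vote(crowd_labels)
```

Because every vote counts equally, MV needs no model of user reliability, which is what keeps it cheap compared with EM and MP.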

Expectation maximisation
EM is another algorithm that has been widely used, and it involves two steps to infer the true label for a given object. In the first step, the true label for the current object is estimated using simple MV, where the input of all users is considered equally. In the second step, the error rate of each user is estimated against the current label estimates; the two steps are then repeated until convergence.
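The two alternating steps can be illustrated with a small Dawid and Skene style sketch. It is written under several assumptions that are not taken from the paper: labels arrive as `(object, user, label)` triples, a fixed number of iterations replaces a convergence test, and a small `smoothing` constant keeps every confusion-matrix entry positive. All names are hypothetical.

```python
def dawid_skene(labels, classes, n_iter=20, smoothing=0.01):
    """labels: list of (object, user, label) triples. Returns {object: label}."""
    objects = sorted({o for o, _, _ in labels})
    users = sorted({u for _, u, _ in labels})

    # Initial estimate: per-object vote shares (soft majority voting).
    posterior = {o: {c: 0.0 for c in classes} for o in objects}
    for o, _, l in labels:
        posterior[o][l] += 1.0
    for o in objects:
        total = sum(posterior[o].values())
        posterior[o] = {c: v / total for c, v in posterior[o].items()}

    for _ in range(n_iter):
        # Estimate class priors and one confusion matrix per user.
        prior = {c: sum(posterior[o][c] for o in objects) / len(objects)
                 for c in classes}
        confusion = {u: {c: {k: smoothing for k in classes} for c in classes}
                     for u in users}
        for o, u, l in labels:
            for c in classes:
                confusion[u][c][l] += posterior[o][c]
        for u in users:
            for c in classes:
                row_total = sum(confusion[u][c].values())
                for k in classes:
                    confusion[u][c][k] /= row_total

        # Re-estimate the true-label distribution of every object.
        updated = {o: dict(prior) for o in objects}
        for o, u, l in labels:
            for c in classes:
                updated[o][c] *= confusion[u][c][l]
        for o in objects:
            total = sum(updated[o].values())
            posterior[o] = {c: v / total for c, v in updated[o].items()}

    return {o: max(classes, key=lambda c: posterior[o][c]) for o in objects}


# Toy data: three users label three objects; u3 disagrees once.
toy = [("o1", "u1", "cat"), ("o1", "u2", "cat"), ("o1", "u3", "dog"),
       ("o2", "u1", "dog"), ("o2", "u2", "dog"), ("o2", "u3", "dog"),
       ("o3", "u1", "cat"), ("o3", "u2", "cat"), ("o3", "u3", "cat")]
result = dawid_skene(toy, classes=["cat", "dog"])
```

Unlike MV, the confusion matrices let a user's vote count for more or less depending on how that user answered elsewhere, at the price of repeated passes over all labels.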

Message passing
MP is an algorithm that takes into account both the labels and the performance of the users. MP constructs object- and user-specific messages to represent the reliability of a particular user, and iteratively updates the object and user messages. More specifically, at each object update, it gives more weight to labels that come from more trustworthy parts of the crowd, and at each user update, it adds more trust (a confidence value) to the user if the labels they give for other objects are in line with the current estimates of the object labels. The iterative updates continue until the algorithm converges or a specified threshold is hit. The threshold for the stopping condition is a parameter that has to be determined empirically. The algorithm takes as input an object $o$, a label $a \in A$, all labels received from the crowd $L$ and a threshold $k_{max}$. MP computes the object message by first iterating over all previous labels from the users who have been assigned the object $o$ and checking whether each label is the same as the given one. In the next step, it uses the object message $x_{o \to u}$ (for $(o, u) \in L$) to update the user message $y_{u \to o}$, which is computed by iterating over the labels the user has submitted. Until convergence, the object message for object $o$ is aggregated by weighting the user messages (confidence) for that object, and the computed sign is stored in $E_{ou}$. MP outputs the candidate label $l$ for $o$ and the sign indicating whether the label applies or not. A detailed description of the algorithm can be found in Karger et al.'s (2011) study. Whilst providing accurate estimations, MP is also known for its high computational cost as the number of labels and users increases.
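The message updates can be sketched for the binary case, where each answer is +1 if the user's label matches the candidate label $a$ and -1 otherwise. This is a simplified illustration in the spirit of Karger et al. (2011), not the listing used in the study: the function name, the deterministic initialisation of user messages to 1.0, the rescaling step and the fixed `k_max` iterations are assumptions of the sketch.

```python
def message_passing(answers, k_max=10):
    """answers: {(object, user): +1 or -1}. Returns {object: +1 or -1}."""
    edges = list(answers)
    by_object, by_user = {}, {}
    for o, u in edges:
        by_object.setdefault(o, []).append(u)
        by_user.setdefault(u, []).append(o)

    # User messages y[u -> o]; starting at 1.0 makes the first object
    # update a plain vote count.
    y = {edge: 1.0 for edge in edges}
    for _ in range(k_max):
        # Object update: x[o -> u] sums the other users' answers on o,
        # weighted by how much those users are currently trusted.
        x = {(o, u): sum(answers[o, v] * y[o, v]
                         for v in by_object[o] if v != u)
             for o, u in edges}
        # User update: y[u -> o] grows when u's answers on other objects
        # agree with the current object messages.
        y = {(o, u): sum(answers[p, u] * x[p, u]
                         for p in by_user[u] if p != o)
             for o, u in edges}
        # Rescale by a common positive factor; signs are unaffected.
        scale = max(abs(v) for v in y.values()) or 1.0
        y = {edge: v / scale for edge, v in y.items()}

    decision = {}
    for o, assigned in by_object.items():
        total = sum(answers[o, u] * y[o, u] for u in assigned)
        decision[o] = 1 if total >= 0 else -1
    return decision


# Toy data: u1-u3 answer consistently, u4 always answers the opposite.
truth = {"o1": 1, "o2": -1, "o3": 1}
toy_answers = {}
for obj, sign in truth.items():
    for user in ("u1", "u2", "u3"):
        toy_answers[obj, user] = sign
    toy_answers[obj, "u4"] = -sign
estimates = message_passing(toy_answers)
```

In the toy data the contrarian user ends up with a negative message, so their answers are effectively flipped rather than merely ignored.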

Quality assessment
In the microtask crowdsourcing context, achieving a good-quality result is one of the major goals, and quality generally means the quality of the data collected from the crowd. For classification microtasks, existing work on quality assessment mostly uses the accuracy metric (Khattak and Salleb-Aouissi, 2011; Hung et al., 2013; Zhang et al., 2017a). Some research also uses precision/recall (Hung et al., 2015; Zhang et al., 2017) or F1 score (Zheng et al., 2017a), while other work uses ROC (Zheng et al., 2017b) or RMSE (Bachrach et al., 2012). For classification, the quality of the result refers to how good the overall collected classifications are, which is a data-value-centric dimension reflecting how accurate the classifications are. In this work, unless otherwise specified, the quality of the input/answer/data/result means Accuracy: "The degree to which data values correctly represent the real-world facts" (Zaveri et al., 2013), defined in science (JCGM, 2008) as "closeness of agreement between a measured quantity value and a true quantity value of a measurand". We can look at an individual crowd worker's work to evaluate whether it is of good quality, or we can look at the overall result from all the workers to see how accurately they classify the given objects. The latter, which involves aggregating the input from different crowd workers in a multiple-step classification task, is the focus of this paper.

In the crowdsourcing context, the ground truth is not usually available. To assess the quality of the result, we need to understand what algorithms or mechanisms can be used to infer or predict the correct answer based on all the input from the crowd workers. Correspondingly, researchers have studied different algorithms and evaluated their performance in various contexts (Section 6.2). This work looks mainly at the three popular existing algorithms elaborated above and investigates how adaptations of these algorithms can be used to aggregate the crowdsourced data and help assess the quality of the classification result. The whole process, in a nutshell, includes three major phases: data collection from the crowd (microtask design and task execution), the output of which is available to this study; aggregation to infer the correct answer/label; and evaluation of the quality (in this work, the Accuracy metric) by comparing the inferred result with the gold standards we have. This research focuses on the aggregation and evaluates the accuracy accordingly.

Our approach
In this section, we first illustrate the range of classification tasks we address via a set of examples: classification tasks with a single question and with multiple questions. We then introduce a set of notations and formalise the classification problem as a path search problem in a graph. Following that, we present our aggregation method by illustrating how existing established algorithms can be adapted to handle more complex cases.

Multi-level workflow model and problem formalisation
A classification task, as shown in Figure 3, is generally considered a simple task, as it contains only one question. A relatively complex task normally involves more than one question and hence more options. It is more like a tree whose branches have further branches and leaves.
If we draw such a 'tree' for the three tasks we are exploring in this paper, we can see that each of them uses a different type of workflow consisting of several independent or interdependent steps. Each step in the workflow is associated with a Question to classify an object according to a criterion. To answer the question, the crowd needs to choose among a set of Options. Figure 4 involves a minimum of one step and a maximum of three steps for the classification task. Figure 5 has a fixed two steps to complete a classification task, and each step has more than ten options. The GalaxyZoo[5] task can involve a minimum of one step and a maximum of nine steps to complete a classification, as shown in Figure 6. These different tasks do present a tree-like structure, each with a number of questions and a varying number of available options; however, there are cases where some nodes have more than one parent node, which means the workflow cannot be considered a tree. As a result, the workflow can be modelled as a directed acyclic graph (DAG), where the root node is the object under consideration and all other nodes are classification options. Each node can be reached via multiple paths from the root, which prompts the first question of the workflow [6].
For a given object $o$, the crowd is asked to carry out a labelling task, which implies answering a series of (independent or dependent) classification questions with a set of labels that identify the outstanding features of the object being classified. We define this task as a path search problem in a workflow $W_f$ modelled as a DAG with a root entry point and levels (similar to tree levels, representing the number of questions in the task), each corresponding to a set of options, as depicted in Figure 7. Each node in such a graph represents a particular labelling option. The labelling finishes when a leaf in the graph is reached, that is, a label that does not lead to any further questions. In our definition, the level corresponds to classification question(s), and the level of a node is serialised and counted at the lowest level. We use level interchangeably with the depth of a node, which is indicated by the number of edges from the node to the root node. A directed edge represents a label chosen for the corresponding question related to that node level. Table II summarises the definitions we use.
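The workflow graph and its valid label paths can be made concrete with a short sketch. It encodes the Cities at Night workflow as described in the introduction (one first question with seven options, then cloudiness and sharpness only after "city"); the dictionary encoding, the ordering of cloudiness before sharpness and the names `workflow` and `valid_paths` are assumptions made for illustration.

```python
# Each key is a node (the root object or an option); its value lists the
# options offered next. Options without an entry are leaves.
ROOT = "object"
FIRST = ["city", "stars", "aurora", "astronaut",
         "black image", "no photo", "none of these"]
CLOUDINESS = ["cloudy", "someclouds", "clear"]
SHARPNESS = ["sharp", "blurry"]

workflow = {ROOT: FIRST, "city": CLOUDINESS}
for option in CLOUDINESS:
    # "sharp" and "blurry" each have three parents, so this is a DAG
    # rather than a tree.
    workflow[option] = SHARPNESS


def valid_paths(graph, node=ROOT):
    """Enumerate every label path from `node` down to a leaf."""
    children = graph.get(node, [])
    if not children:
        return [()]
    paths = []
    for child in children:
        for tail in valid_paths(graph, child):
            paths.append((child,) + tail)
    return paths


paths = valid_paths(workflow)
```

Here six first-level options are leaves and "city" expands into 3 x 2 combinations, so the path search space has 12 valid label paths of length one or three.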
On top of the notations defined in Section 2, we also define notations specific to our workflow graph model in Table III. The problem we are solving in this paper can be defined as follows:

Adapted aggregation
The classic approaches do not look at the dependency between node levels; hence, naively putting together the inferred results from each node level does not guarantee a valid result. Intuitively, producing a valid path from the possible choices should improve accuracy. As such, a basic adaptation of the classic algorithms should show some improvement on multiple-level workflows. We show such a basic adaptation in Algorithm 4.

Table II. Definitions
Term | Definition
Task | A general term referring to an action or a series of actions that need to be executed
Classification task | Task classifying objects into given categories; it could be a simple task (one question) or a relatively complex task (more than one question)
Microtask | A task is decomposed into smaller units, making it easier for the crowd. One microtask is equivalent to one question in a classification task
Workflow | Microtasks are arranged/chained in a way to automatically complete the task
Question | Classification task asked of the user to elicit/assign a label to an attribute of the object to be classified
Option | The set of possible labels
Chosen option | An option the user chooses per question
Correct label | The correct label for a question
Chosen path | The set of labels a user chooses for the entire workflow
Correct path | The correct set of labels for the entire workflow
Workflow graph | The workflow can be modelled as a directed acyclic graph (DAG), in which the root node represents the object under consideration and all other nodes are classification options
Node | A representation of an option in our model
Node level | The sequence in which the question is presented to the user within a workflow

Table III. Notations for the workflow graph model
Symbol | Definition
… | Represents the available options at node level $n$
$a_n(j)$ | Represents the individual option at node level $n$, where $j \in \{1, \ldots$
… | Represents the label chosen by user $u$ at node level $n$ for object $o$. Thus, the labelling result …
$\hat{L}_o$ | Represents the inferred label path for object $o$

Algorithm listing (fragment):
$p_l \leftarrow count(l) \div |L_o|$ ▷ percentage of $l$ being the true label for object $o$ ($l \in A$)
while not converged do
    Estimate error rate for user $u$
    Estimate confusion matrix
    for $(o, u) \in L$ do
    …

Our adapted approach assumes that labels at different levels in the workflow are independent, and then assembles the label path from each node level based on the workflow graph. In the adapted approach, we not only reward partially correct answers from the crowd, by applying each of the algorithms at each node level in the graph and computing scores for the individual labels, but also consider path validity when inferring the correct path. We also specifically choose two algorithms that take into account the performance of the crowd in their computations, EM and MP. The EM algorithm sums up all node probabilities along each path to determine the ranking score. The MP algorithm returns true if a particular label at the node level is relevant and false otherwise. This means that we assign the score for the candidate paths correspondingly, either 1.0 or 0.0. In doing so, we want to allow MP and EM to better identify those users who, while not doing so well overall, are very skilled at a particular sub-task (question) in the workflow.
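The difference between putting node-level results together naively and assembling a valid path can be sketched for the MV case. The sketch scores a path by multiplying the vote shares at each node level, as the results section describes for mv_adapted; the small workflow, the way shares are computed (only among users who reached a level) and all function names are assumptions, and Algorithm 4 itself is not reproduced here.

```python
from collections import Counter
from math import prod

# Hypothetical three-level workflow: only "city" leads to further questions.
VALID_PATHS = [("stars",), ("aurora",)] + [
    ("city", cloud, sharp)
    for cloud in ("cloudy", "clear")
    for sharp in ("sharp", "blurry")
]


def level_shares(user_paths):
    """Vote share of every label at every node level (level 0 comes first)."""
    depth = max(len(p) for p in user_paths)
    shares = []
    for n in range(depth):
        votes = Counter(p[n] for p in user_paths if len(p) > n)
        total = sum(votes.values())
        shares.append({label: c / total for label, c in votes.items()})
    return shares


def naive_path(user_paths):
    """Top-voted label per level, joined without consulting the workflow."""
    return tuple(max(level, key=level.get) for level in level_shares(user_paths))


def adapted_path(user_paths, valid_paths=VALID_PATHS):
    """Valid path with the highest product of per-level vote shares."""
    shares = level_shares(user_paths)

    def score(path):
        return prod(
            shares[n].get(label, 0.0) if n < len(shares) else 0.0
            for n, label in enumerate(path)
        )

    return max(valid_paths, key=score)


# Three users see stars; two see a clear, sharp city.
crowd = [("stars",)] * 3 + [("city", "clear", "sharp")] * 2
naive = naive_path(crowd)
adapted = adapted_path(crowd)
```

On this toy input the naive result is ("stars", "clear", "sharp"), which no user could have produced because "stars" ends the workflow, whereas the adapted result is the valid path ("stars",).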

Evaluation
To evaluate the three algorithms and our adapted approach, we compare the classic approach, where algorithms are applied at each node level and the results are simply put together (we call it the "naive approach" here), with our "adapted approach", which uses the classic approach while striving to infer a valid correct path by considering the workflow graph. Thus, we have six different approaches: mv_adapted, mv_naive, mp_adapted, mp_naive, em_adapted and em_naive. Each inference algorithm was applied to six data sets with different microtask crowdsourcing workflows. We start with the evaluation setup of the data in Section 4.1 and the evaluation metrics in Section 4.2. Then we present the evaluation of the inferred results in Section 4.3.

Data
First, we used three existing data sets. The first one is from the Snapshot Serengeti[1] project and consists of all crowd classifications within the time span from 10 December 2012 until 17 July 2013. It contains 7,800,896 labels from 890,280 volunteers for a total of 66,892 objects. For our evaluation, we used a gold standard with curated labels for 4,149 objects, which was created by professional scientists working on the Snapshot Serengeti project. To evaluate our approach we took all labels received from the crowd for the … Figures 4, 5 and 6, respectively. To explore the effects of the volunteer/paid context on the results, the tasks were also set up on a paid crowdsourcing platform to mimic the tasks done by volunteers.

Metric
To measure the performance of our aggregation approach, we employ the Accuracy metric, which has been commonly used in classification evaluation in previous work (Khattak and Salleb-Aouissi, 2011; Kamar et al., 2012; Sheshadri and Lease, 2013; Hung et al., 2013; Zhang et al., 2017a; Zheng et al., 2017b). Accuracy is a measure that allows us to understand the percentage of correct answers (inferred by algorithms). It is defined as the percentage of objects that have been correctly inferred. Higher accuracy indicates better performance.
$$\mathrm{Accuracy} = \frac{\sum_{o} \mathrm{Bernoulli}(L_o^{gold} == \hat{L}_o)}{\text{number of objects}}$$
The above equation is by default for calculating the accuracy for the inferred label path. $\mathrm{Bernoulli}(L_o^{gold} == \hat{L}_o)$ indicates the outcome (either 0 or 1) of comparing the gold category with the category predicted by the different predictors. As we use the adapted node-level based implementation, it makes sense to also evaluate how accurate the inferred label is at each node level. In that context, $L_o^{gold}[n]$ represents the ground truth for object $o$ at node level $n$ and $\hat{L}_o[n]$ represents the inferred true label at node level $n$. Hence, the accuracy at node level $n$ for the top answer can be calculated by:
$$\mathrm{Accuracy}_n = \frac{\sum_{o} \mathrm{Bernoulli}(L_o^{gold}[n] == \hat{L}_o[n])}{\text{number of objects}}$$
To understand whether our adapted approach is significantly better, we also run significance testing for all algorithms chosen. We use the standard 5 per cent significance level.
For each data set, we randomly select 100 objects and repeat the selection 50 times. The accuracy for each selection is calculated for MV, MP and EM for both the naive and the adapted approach. We use the function scipy.stats.ttest_ind from Python[8] to perform the two-sided test on the naive and adapted samples in all six cases (three workflows, each with two contexts: volunteer and paid).
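The accuracy metric and the resampling procedure described above can be sketched with the standard library alone. The data layout (dictionaries mapping an object to its label path), the function names, the fixed random seed and the convention that a level missing from both paths counts as a match are assumptions of this sketch, not details given in the text.

```python
import random


def path_accuracy(inferred, gold, objects):
    """Fraction of `objects` whose inferred label path equals the gold path."""
    return sum(inferred[o] == gold[o] for o in objects) / len(objects)


def level_accuracy(inferred, gold, objects, n):
    """Same comparison restricted to node level n (0-based in this sketch)."""
    def at(path):
        # A path that stops before level n is treated as having no label.
        return path[n] if len(path) > n else None
    return sum(at(inferred[o]) == at(gold[o]) for o in objects) / len(objects)


def sampled_accuracies(inferred, gold, sample_size=100, repeats=50, seed=0):
    """Path accuracy on `repeats` random subsets of `sample_size` objects."""
    rng = random.Random(seed)
    objects = sorted(gold)
    return [
        path_accuracy(inferred, gold, rng.sample(objects, sample_size))
        for _ in range(repeats)
    ]


# Toy data: 200 objects, an "adapted" result with 20 errors and a "naive"
# result with 60 errors (both invented for illustration).
gold = {i: ("city", "clear") for i in range(200)}
adapted = {i: ("stars",) if i < 20 else gold[i] for i in gold}
naive = {i: ("stars",) if i < 60 else gold[i] for i in gold}
adapted_scores = sampled_accuracies(adapted, gold)
naive_scores = sampled_accuracies(naive, gold)
```

The two resulting lists of 50 accuracies are the kind of samples that would then be passed to `scipy.stats.ttest_ind(naive_scores, adapted_scores)` for the two-sided test.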

Results
Table IV shows the accuracy of each algorithm on each data set for the inferred answer.
Considering the overall classification accuracy (by path), our adapted methods perform better than the naive approach in both the volunteer and the paid crowd context; at the same time, each algorithm generally has higher accuracy in the volunteer context than in the paid crowd context. Note that the best accuracy achieved increases as the depth of the workflow increases for the paid crowd context, where Serengeti with two questions achieves 45.9 per cent, darkskies with three questions achieves 53.0 per cent and galaxyzoo with a maximum of nine questions achieves 57.9 per cent. A similar pattern is not observed for the volunteer context. Looking at the accuracy breakdown by node level (Figures 8, 9 and 10), it is notable that for multiple-question tasks with more steps, the adapted versions of MP and EM generally show better accuracy at most of the node levels. For the data sets from a task with fewer steps in its workflow (fewer levels in the graph), such as the Serengeti task in Figure 8, MV performs better. Meanwhile, from Table IV we can see that MV shows acceptable accuracy for most of the volunteer data sets (mostly over 75 per cent, except for the GalaxyZoo data set), but has poor accuracy (less than 60 per cent) in the paid crowd context, even though it performs better than the other individual algorithms we tested. This suggests that it needs to be complemented by other methods that might be good at specific objects where MV cannot perform well. The accuracy-by-level results do not suggest that accuracy tends to consistently increase or decrease as the depth of the task (number of levels) increases. The accuracy at each level is more related to its intrinsic character (e.g. the number of options at that level, and the ambiguity or subjectivity of the corresponding object). For instance, the darkskies task asks the user to evaluate the sharpness and cloudiness of the image, which can be subjective to some degree. This is also why the results by node level show an interesting picture: at different node levels of different workflows, sometimes em has the best result (such as levels 4 and 5 of GalaxyZoo), sometimes mp has the best result (such as level 1 of Serengeti in the volunteer case), and other times mv has the best result (levels 1, 2 and 3 of Darkskies in both the volunteer and the paid context).
Notice that MP in the darkskies paid crowd context is the only case we observe in which the naive approach has higher overall accuracy (by path) than the adapted one. This is due to the fact that levels 2 and 3 (determining cloudiness and sharpness of the image) of the darkskies workflow are in essence independent of the first node level (whether it is a city, stars or anything else), although the task workflow presents them as subsequent questions only when "city" is chosen as the label for the first node level. Similarly, the accuracy by level of mp_adapted is lower than that of mp_naive on a few other occasions at different node levels, but on those occasions there is always one node level at which mp_naive has considerably poor accuracy, such as GalaxyZoo node level 2, which subsequently leads to very low overall accuracy for the whole path. The reason that the mp_adapted approach can have lower accuracy at a certain level is that the mp approach only returns 1.0 or 0.0 to indicate whether a label is the predicted one, whereas our adapted approach tries to assemble/infer the most probable valid label path (as shown in Algorithm 4) from the candidate predicted labels at the individual node levels. Therefore, for the mp case, the randomness of ranking the combinations might not do well at the corresponding node level; however, the overall accuracy is shown to be better than that of the naive approach, which completely neglects the validity of a label path.
Notice that, although our adapted approaches achieve higher accuracy at the first node level in most cases, mv_adapted has slightly lower accuracy than mv_naive for the GalaxyZoo workflow in the volunteer context. This is because we assemble the result based on the overall probability of a path (the vote percentages at each node level multiplied together) instead of assuming that the top-voted label at node level 1 is correct (and then traversing subsequent nodes based on that assumption). Our main purpose is to obtain the most probable valid label path, which is shown to be effective in Table IV. We have run significance testing for all algorithms chosen. The result is statistically significant for all our adapted approaches, as the p-value is smaller than the pre-defined significance level (5 per cent) in all cases.

Discussion
In this section, we expand on the key findings of the evaluation results introduced earlier.

Crowd context matters
We deliberately chose three representative tasks, each with two data sets, one produced by volunteers and one by a paid crowd. Based on our results, there is a distinct difference in performance when the same algorithm is applied in these two contexts. For all algorithms, without exception, the accuracy achieved in the volunteer context is higher than in the paid crowd context. For the same workflow, the overall accuracy (by path) achieved in the volunteer context is normally around 30 per cent higher than in the paid crowd context for workflows with two to three questions. However, this does not seem to be the case when the workflow involves more questions, such as in the GalaxyZoo case, where the best accuracy the algorithms can achieve is only around 5 per cent higher in the volunteer context than in the paid crowd context.

Workflow counts
From the representative tasks we have shown so far, there are two main factors that need to be taken into account when designing a classification crowdsourcing workflow, especially when classification steps are interdependent: the number of questions (determining the depth of the graph) and the number of answer options for each question (the width of the corresponding node level, which affects the cognitive effort required to pass that node level with the correct option chosen). In our evaluation, we found evidence that both depth and width have an impact on the overall performance of the inference algorithms. One visible pattern appears in the paid crowd data sets. In this setting, overall accuracy (by path) increases as the depth of the graph increases (for both mv_adapted and mp_adapted), which suggests that it might be a good idea to have more classification questions, each with fewer options, rather than fewer questions with many options to choose from, particularly where the crowd's skill level is uncertain. The other notable aspect is that, in the volunteer context, the MP algorithm has comparable performance to MV in the Serengeti workflow, but not in the other two workflows with more levels.
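As a small illustration of the two factors, the sketch below computes the depth and the per-level widths of two hypothetical workflow designs that distinguish the same number of outcomes. The function name workflow_shape and both designs are assumptions made for illustration. The path count also assumes that every combination of options is a valid path, which does not hold in general for workflows with dependent questions.

```python
from math import prod

def workflow_shape(options_per_level):
    """Summarise a workflow given one list of answer options per node level."""
    widths = [len(options) for options in options_per_level]
    return {
        "depth": len(options_per_level),  # number of questions
        "widths": widths,                 # number of options at each node level
        "paths": prod(widths),            # assumes all option combinations are valid
    }

# Hypothetical design 1: a single question with eight options.
flat = [["o1", "o2", "o3", "o4", "o5", "o6", "o7", "o8"]]

# Hypothetical design 2: three chained questions with two options each.
deep = [["yes", "no"], ["yes", "no"], ["yes", "no"]]

flat_shape = workflow_shape(flat)
deep_shape = workflow_shape(deep)
print(flat_shape)
print(deep_shape)
```

Both designs separate eight outcomes, but the second trades width for depth, which is the kind of design choice discussed above.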

Heuristics-based aggregation as an addition
Based on the results in Section 4.3, combining the output of these algorithms using a heuristic strategy seems a promising way to perform better inference. We want to use the results from mv_adapted, em_adapted and mp_adapted in combination to exploit their strengths and compensate for their weaknesses in complex classification tasks. To do so, we could have an aggregator based on the following intuitions. First, the number of unique classifications of an object (denoted by u) shows the degree to which the crowd workers agree or disagree on the classification: a higher number indicates a higher degree of disagreement and normally implies that the object is either difficult or ambiguous to classify. Second, the ratio (denoted by r) between the number of unique classifications/answers collected from the crowd and the total number of classifications/judgments also demonstrates how diverse the answers are for the corresponding object, and hence similarly indicates how difficult or ambiguous the object is. Third, the three-sigma rule (Pukelsheim, 1994) in the empirical sciences suggests that almost all values lie within three standard deviations of the mean in a normal distribution; theoretically, the intervals within one, two and three standard deviations of the mean cover 68, 95 and 99.7 per cent of the data, respectively. In the cases where MV might potentially fail (where workers tend to disagree), the number of unique classifications, or the ratio of the number of unique classifications to the total number of classifications for an object, falls within the higher range of the distribution. Thus, a heuristic aggregation strategy we could consider is to look at the intrinsic characteristics of the classifications collected for each object, such as the number of unique classifications and the ratio of that number to the total number of classifications. Then, based on the third intuition above, we can use the skewness (denoted by $\sigma$ below) of the distribution of the number of unique classifications ($U \sim N(u_\mu, u_\sigma)$) and of the ratio ($R \sim N(r_\mu, r_\sigma)$), respectively, to heuristically choose a bound beyond which MV can potentially be complemented by other approaches. However, choosing an optimal threshold is not straightforward and needs to be explored in future work.
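The intuitions above can be sketched in a few lines. This is only an illustration under stated assumptions, not an evaluated method: the bound is taken as the mean plus k standard deviations with k = 1, an arbitrary choice since the text leaves the threshold open, and the function names and the toy classifications are hypothetical.

```python
from statistics import mean, stdev

def disagreement_stats(paths):
    """Return (u, r): the number of unique classifications and its ratio to the total."""
    u = len(set(paths))
    return u, u / len(paths)

def flag_uncertain(classifications, k=1.0):
    """Flag objects whose u or r exceeds mean + k * standard deviation.

    k is an assumed, untuned parameter; flagged objects are those for which
    MV might be complemented by another algorithm.
    """
    stats = {obj: disagreement_stats(paths) for obj, paths in classifications.items()}
    us = [u for u, _ in stats.values()]
    rs = [r for _, r in stats.values()]
    u_bound = mean(us) + k * stdev(us)
    r_bound = mean(rs) + k * stdev(rs)
    return {obj for obj, (u, r) in stats.items() if u > u_bound or r > r_bound}

# Hypothetical data: 10 label paths per object.
classifications = {
    "obj1": [("A", "a1")] * 10,
    "obj2": [("B", "b1")] * 10,
    "obj3": [("A", "a1")] * 7 + [("A", "a2")] * 3,
    "obj4": [("B", "b1")] * 8 + [("B", "b2")] * 2,
    "obj5": [("A", "a1"), ("A", "a2"), ("B", "b1"), ("B", "b2"),
             ("C", "c1"), ("C", "c2")] + [("A", "a1")] * 4,
}

flagged = flag_uncertain(classifications)
print(flagged)
```

Only the object with six distinct label paths among its ten classifications exceeds the bound here. A combined aggregator could keep the MV result for the remaining objects and consult the EM or MP result for the flagged ones.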

Related work
Our approach is informed by existing work on microtask crowdsourcing and quality assurance in crowdsourcing, which we review in this section.

Microtask crowdsourcing and workflows
In crowdsourcing, a problem sometimes needs to be decomposed into smaller, fine-grained microtasks, which are then arranged in a workflow for more effective processing. In general, a workflow consists of a set of microtasks; the microtasks are sometimes of different types and can be dependent on or independent of each other. For instance, the find-fix-verify workflow proposed by Bernstein et al. (2010) uses microtask crowdsourcing to proofread and shorten text in three steps: finding areas of improvement in the text; fixing or improving them; and verifying the quality of the changes. In each step, the crowd is asked to carry out the same type of microtask, sometimes iteratively. In the studies by Kittur et al. (2008, 2013) and Acosta et al. (2013), researchers proposed grouping the same or similar microtasks into batches as a means to facilitate learning effects. Previous studies have also shown that task performance can be improved as a function of several factors, including the design of tasks and workflows, motivation and incentives, and training (Bernstein et al., 2010; Demartini et al., 2012; Kittur et al., 2008; Wiggins et al., 2011).
On citizen science platforms such as Zooniverse[9], most classification projects are not simple one-question tasks; instead, they consist of multiple questions chained together. Zooniverse uses a workflow to "group a collection of tasks into a logic unit"[10], which in essence refers to a multiple-question task that needs to be finished in several steps. In Snapshot Serengeti[1], classifying an image means answering a set of independent questions, sometimes several times when more than one animal is present in the image. In Cities at Night[2] and Galaxy Zoo[3], questions are inter-related and the answers given in one step determine the questions in the subsequent steps. In the context of such classification tasks, a workflow refers to the logical organisation of the classification questions and their corresponding options.
Most previous studies on crowdsourcing workflows have focussed on the design of the workflows and have shown that a particular type of workflow can be crowdsourced effectively (in terms of the accuracy of outputs, budget, time etc.) (Little et al., 2009; Bernstein et al., 2010; Tran-Thanh et al., 2015). In some cases, researchers have proposed bespoke quality assurance methods for their workflows (Lintott et al., 2011; Willett et al., 2013). Our work proposes a strategy that can be applied to determine the correct label path for a whole range of classification tasks spanning several steps with independent or dependent multiple-choice questions. This differs from existing research, which mainly focuses on the result of the final step (no matter how many previous steps exist in the workflow).

Inference algorithms
Researchers have proposed inference algorithms, mathematical models that can automatically infer the correct solution to a given problem from a solution space defined by the crowd. For example, Ipeirotis et al. presented an algorithm that assesses the performance of crowd workers and exploits this information to estimate the quality of answers on Mechanical Turk (Ipeirotis et al., 2010). Karger et al. proposed using MP to infer correct answers from workers' answers (Karger et al., 2011). Bachrach et al. (2012) used a Bayesian graphical model to grade test answers in scenarios where the ground truth cannot be made available. Whitehill et al. (2009) followed an expectation maximisation approach to identify correct classifications, depending on the expertise of the workers and the level of difficulty of the task. In the citizen science project Galaxy Zoo Supernovae, crowd answers were analysed using a Bayesian generalisation of the same expectation maximisation idea (Simpson et al., 2011). More recently, Difallah et al. (2015b) compiled a set of features that can be used to predict answer quality, based on an analysis of Mechanical Turk logs. Several studies have shown that it is possible to combine automatic prediction methods (such as Bayesian or generative probabilistic models) with additional input from the crowd to further improve the accuracy of the predictions (dos Reis et al., 2015; Hare et al., 2013; Ipeirotis et al., 2010; Loni et al., 2014; Simpson et al., 2013). Other studies have analysed and compared different algorithms (Zheng et al., 2017a; et al., 2015; Sheshadri and Lease, 2013), emphasising the need for more research to understand the interplay among different sets of design parameters and its effect on overall performance.
All these existing methods have considerably advanced the state of the art. However, they cannot be applied to every type of microtask crowdsourcing workflow without restrictions. Moreover, most of the research carried out so far in this space has looked at rather simple binary or multiple-choice classification tasks with the aim of identifying a single, correct answer. This class of microtasks, albeit important and widely used, is not always the norm. As we have seen in the examples from the previous section, there are cases where a problem cannot be easily decomposed into independent microtasks, or where different, related microtasks should be grouped into more complex workflows for efficiency reasons. Although a few recent works look into relatively complex multiple-step classification tasks, each of them has a domain-specific or problem-specific focus (Parameswaran et al., 2011; Kim et al., 2002; Wu et al., 2012; Bragg et al., 2013; Kamar and Horvitz, 2015; Otani et al., 2016). Bragg et al. (2013) and Otani et al. (2016) both studied entity classification, which normally involves categorising a given entity into parent-child classes in different steps, but from very different perspectives. Bragg et al. (2013) focus on improving the workflow for generating a taxonomy, as well as on inference methods to induce the parent-child relationship, while Otani et al. (2016) focus on tasks where a parent-child relationship exists between two adjacent classification steps, and propose label aggregation methods adapted from the existing GLAD method (Whitehill et al., 2009) by considering the hierarchical class-subclass structure.
In addition, Wu et al. (2012) investigate the sequential data labelling scenario and present Sembler, which ensembles crowd sequential labellings by leveraging the statistical correlation and dependency among multiple instances/sentences; this is domain specific and not applicable to other multiple-step classification tasks where no such statistics can be exploited. Parameswaran et al. (2011) and Kamar and Horvitz (2015) look specifically at multiple-step image classification tasks, but both take approaches that are not easily generalised to other multiple-step classification tasks. Parameswaran et al. (2011) explicitly formulate the classification task as a human-assisted graph search problem, presenting the dimensions that characterise the different types of classification and developing algorithms to optimise the questions to be asked (at the different nodes), which are evaluated by simulation. On the other hand, Kamar and Horvitz (2015) focus on optimising worker allocation in the hierarchical classification task (HCT) and develop answer models and evidence models for HCT consensus; both models are constructed with supervised learning, assisted by the Sloan Digital Sky Survey (SDSS) features identified by machine vision that are available for the GalaxyZoo data set.
There are also a few studies dedicated to automatic hierarchical classification where a taxonomy is given and a parent-child relationship among classes exists, but all are bound to a certain domain. For instance, Dumais (2000) investigates automatic hierarchical classification using a Support Vector Machine (SVM), with existing web pages whose categories are known as training data. Su et al. (2006) present an automatic method to classify structured web databases by leveraging probing queries, the returned counts of query results and the SVM classifier. Such automatic hierarchical classification not only needs existing labelled data as training data but also focuses on classification where the answers to later classification steps (child classes) are always a sufficient condition to confirm the answer to the previous classification step (parent classes).
Our approach differs from existing work mainly in that it is not restricted to a specific type of multiple-step classification and does not need additional information such as machine-identified features of the image or the frequency of, or correlation among, word usage; nor does it rely on parent-child relationships between classification steps. Our method is general and intuitively easy to apply to any multi-step classification. We discussed the three main individual algorithms in Section 2 and noted that whilst all three can be used to infer the correct answer to a multiple-choice question, they differ in terms of their inputs and outputs. In our approach, we devised a new strategy that uses existing algorithms to achieve higher classification accuracy.

Conclusion
Ensuring quality is one of the grand challenges of microtask crowdsourcing. While previous research has looked at inferring correct answers for microtasks consisting of single binary or multiple-choice questions, our research proposes a model that can be applied to both single-question and multiple-question scenarios, filling the gap in understanding how to aggregate in multiple-question scenarios. We propose a graph model and an "adapted" aggregation method that can improve the accuracy of inferring the true label path in complex workflows with several interdependent questions. Although a few previous works tried to address similar multiple-step classification, they either limit it to hierarchical classification scenarios where a parent-child relationship exists between classification steps or restrict the method by requiring additional information. We propose using a graph to model a microtask crowdsourcing workflow and to support inference algorithms in making decisions about correct labels for classification tasks with multiple questions, where the answer to one question does not have to be a sufficient condition for, or imply, the correctness of the answer to the previous question. We believe this is the first work that investigates aggregation in a multiple-step classification task with interdependent questions to infer the correct label path and assess the classification accuracy accordingly.
To this end, we explored three inference algorithms, MV, MP and EM, each with proven benefits for quality assurance in crowdsourcing. We compared the performance of our adapted approach with that of the existing naive approach, using six representative data sets. We evaluated the performance of the individual algorithms in terms of overall accuracy, where a full label path is considered as an atomic, correct answer, and in terms of a more refined measure that looks at accuracy at the individual node levels of the workflow graph. The results show that our adapted approach significantly improves accuracy compared with the naive approach. The results also demonstrate that while MV does well in overall accuracy, a deeper analysis of the accuracy at each node level reveals a more interesting picture. Hence, a heuristic-based aggregation approach that combines results from multiple algorithms and leverages the strengths of each might be a better solution. This suggests the need for more dynamic inference approaches that can adapt to the complexity of the crowdsourcing workflow.
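The two measures can be contrasted with a short sketch. The function names, the gold standard and the predictions below are hypothetical, and the handling of paths that do not reach a given level (objects are skipped when the gold path is shorter) is an assumption rather than the definition used in the evaluation.

```python
def accuracy_by_path(predicted, gold):
    """Share of objects whose whole predicted label path equals the gold path."""
    return sum(predicted[obj] == gold[obj] for obj in gold) / len(gold)

def accuracy_by_level(predicted, gold, level):
    """Share of objects whose predicted label at one node level matches the gold label.

    Objects whose gold path does not reach this level are skipped (an assumption).
    """
    objs = [obj for obj in gold if len(gold[obj]) > level]
    hits = sum(
        len(predicted[obj]) > level and predicted[obj][level] == gold[obj][level]
        for obj in objs
    )
    return hits / len(objs)

# Hypothetical gold standard and inferred label paths for four objects.
gold = {"o1": ("A", "a1"), "o2": ("A", "a2"), "o3": ("B", "b1"), "o4": ("B", "b2")}
predicted = {"o1": ("A", "a1"), "o2": ("A", "a1"), "o3": ("B", "b1"), "o4": ("A", "a2")}

by_path = accuracy_by_path(predicted, gold)
by_level = [accuracy_by_level(predicted, gold, level) for level in range(2)]
print(by_path, by_level)
```

In this toy case the first node level is right for three of the four objects while only two full paths match, which shows why the per-level view can tell a different story from overall accuracy.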
In future work, we plan to devise inference methods that take other, more workflow-specific factors into account. Our current method assumes independence between labels from different levels when inferring the answer for each level. It could potentially be improved by considering the possible correlation between labels at different node levels. For instance, it could give different weights to labels based on the inferred result from the previous level. Such a method requires a top-down traversal process, which might bring side effects, since it relies heavily on the inferred result from the previous level and carries the effect (weight) over to subsequent levels even when the choice at the previous level is incorrect. As the correlation between labels at different node levels is complicated, the feasibility of incorporating such correlation information into the aggregation process needs further investigation. Meanwhile, the number of options and the length of possible paths in a workflow deserve more in-depth experiments. One promising direction is to employ other machine learning approaches for truth inference, for instance, using the workflow properties along with the crowdsourcing-generated data to learn and explore features automatically (Huynh et al., 2013) and to produce a decision tree that helps choose the proper inference algorithm. Alternatively, certain properties of crowd-collected data could be further exploited to train machine learning algorithm(s) with selective labels to directly infer the true label path.

Figure 2. Example classification paths collected from 20 workers for a given photo
Figure 3. Representation of a task with a single question
Figure 4. Representation of dark skies workflow from cities at night
Figure 6. Representation of GalaxyZoo workflow from Zooniverse
Figure 7. Graph representation of an example classification workflow $W_f$ vs the corresponding classic way of looking at the classification with multiple questions
Figure 8. Accuracy by node level (Serengeti)

Notation:
$L^o_u$: the label path chosen by user u for object o, that is, the ordered list of nodes (the traversal path) visited by the user when classifying o
$L^o(n)$: all labels for object o at node level n
Workflow graph: represents the graph based on the workflow of classifying object o; it has node levels to indicate the questions used to classify the corresponding attributes of the given object, and nodes to represent the options available for each attribute
$A(n)$