Ontology enhancement using crowdsourcing: a conceptual architecture

Purpose – This paper aims to investigate the use of crowdsourcing in the enhancement of an ontology of taxonomic knowledge. The paper proposes a conceptual architecture for the incorporation of crowdsourcing into the creation of ontologies.
Design/methodology/approach – The research adopted the design science research approach, characterised by cycles of "build" and "evaluate" until a refined artefact was established.
Findings – Data from a case of a fruit fly platform demonstrates that online crowds can contribute to ontology enhancement if engaged in a structured manner that feeds into a defined ontology model.
Research limitations/implications – The research contributes an architecture to the crowdsourcing body of knowledge. The research also makes a methodological contribution for the development of ontologies using crowdsourcing.
Practical implications – Creating ontologies is a demanding task, and most ontologies are not exhaustive on the targeted domain knowledge. The proposed architecture provides a guiding structure for the engagement of online crowds in the creation and enhancement of domain ontologies. The research uses a case of a taxonomic knowledge ontology.
Originality/value – Crowdsourcing for the creation and enhancement of ontologies by non-experts is novel and presents an opportunity to build and refine ontologies for different domains by engaging online crowds. The process of ontology creation is also prone to errors, and engaging crowds presents an opportunity for corrections and enhancements.


Introduction
The creation of domain ontologies is a daunting task, and one of the major reasons ontologies are not widely adopted in knowledge representation is the difficulty of creating them. This difficulty is partly due to the complexity of the domain knowledge being modelled and the limitations of the tools available for ontological modelling. More recently, tools have emerged that have made the creation of ontologies easier by providing friendlier interfaces and checking for knowledge consistency as axioms are added into the ontology.
An example of a widely used editor for building ontologies in the OWL syntax is Protégé ontology editor (Protege, 2000) together with its built-in reasoner (Horridge et al., 2011).
Ontology creation using these tools requires the ontology developer to capture all facts in the ontology. In the taxonomic knowledge domain, the axioms on every taxonomic grouping are closely related, and the complexity increases as refinements are made down to the species level. The ontology creation exercise is, therefore, very involving and is prone to errors of associating a feature with the wrong taxonomic group, as well as the omission of important features from a group. As stated, the ontology development tools mainly check for logical consistency in the knowledge. It is thus possible to have logically correct knowledge that is not accurate: "inaccurate logically correct knowledge". The current tools, therefore, do not guarantee accuracy and require intense expert involvement in the checking and verification of the represented knowledge.
In this research, we seek to explore how crowdsourcing techniques can be used to enhance and hasten ontology development. In this paper, we present an architecture of how crowdsourcing could be used to enhance the creation of ontologies, and a case of taxonomic knowledge ontology is used.
The rest of the paper is organised as follows. In Section 2, background and related work is presented, covering what taxonomic knowledge is and ontologies of biological knowledge. Relevant background on ontology development and crowdsourcing is also presented. Section 3 presents the research approach adopted, which is design science research (DSR). In Section 4, the case study used in the development of the architecture is presented, together with an analysis of the crowdsourced data. Section 5 presents the architecture itself, consisting of the components and the ontology enhancement steps. Section 6 presents a discussion of the results, and Section 7 presents the conclusion.
2. Background and related work

2.1 Taxonomic knowledge
Taxonomy, in biological sciences, is the sub-discipline that deals with the identification, naming and classification of living things according to common properties. Taxonomic knowledge, therefore, constitutes the properties of the different taxonomic groupings. The properties used are mainly the apparent morphological features of the organisms. Taxonomy is, thus, a discipline that enables other sub-disciplines of biological sciences, as other forms of biological knowledge are organised around the taxonomic groupings described in the taxonomic knowledge (Clark et al., 2009; Godfray, 2002).
Although different weaknesses have been identified regarding the use of taxonomy as a premise for the organisation of biological knowledge, it remains a popular approach, and much organism and biodiversity knowledge is organised around taxonomic groupings. Taxonomic knowledge, therefore, needs to be represented in appropriate formats to support the identification, naming and classification needs of biological sciences, as well as of the general public.
Creation of such knowledge bases would best be done by taxonomists, who are experts in the classification and identification knowledge of different organisms. Experts in taxonomy are on the decline, and the few taxonomists in different institutions are already overstretched attending to their core functions of identification and classification of organisms (Dar et al., 2012; Kim and Byrne, 2006; Walter and Winterton, 2007). With advances in web technologies, the possibility of involving a broader community online to create biological knowledge bases has been recognised as a potential solution (Canhos et al., 2004; Hardisty and Roberts, 2013; Parr et al., 2014), and this research is anchored on this proposition.

Ontologies of taxonomic knowledge
Exploration of ontologies in the representation of knowledge in the biological domain has resulted in creation of multiple ontologies. The ontologies represent different categories of biological knowledge and some of these ontologies are hosted at the OBO Foundry ontologies website (www.obofoundry.org/). Examples of ontologies that focus on organism knowledge include the vertebrate trait (VT) ontology, which contains biological traits of vertebrates (Park et al., 2013); flora phenotype ontology (FLOPO), which contains knowledge on traits and phenotypes of flowering plants (Thessen et al., 2015); Afrotropical bee ontology, which contains taxonomic knowledge on different species of Afrotropical bees (Gerber et al., 2014) and the ontology of fruit flies of economic importance in Africa (Kiptoo et al., 2016a) among others.

Ontology development
An ontology is "a formal, explicit specification of a shared conceptualization" (Gruber, 1993). Formal means the specifications are encoded in a logic-based language; explicit specification means concepts are given explicit names and definitions; shared means the knowledge in the ontology can be shared and re-used by different users who subscribe to it; conceptualization means the way people think in a given domain (Corcho et al., 2006; Guarino et al., 2009; Uschold and Gruninger, 2004). An ontology has to meet these properties, and the development of ontologies is, therefore, a demanding task requiring input from logicians, as well as the target domain experts.
Different methodologies for the development of ontologies have been developed over the years (Grüninger and Fox, 1995; Uschold and King, 1995; Fernández-López, 1999; Noy and McGuinness, 2001; Horridge et al., 2009; Suárez-Figueroa et al., 2012). Until the late 1990s, ontology development was seen as more of a craft, and researchers mostly adapted an existing ontology or blended methodological components from various ontologies. In recent years, software engineering-related methodologies have emerged, providing detailed guidelines on how ontologies should be conceived, specified, developed, adapted, deployed, used and maintained (Baclawski et al., 2013).
In this research, the ontology development methodology by Horridge et al. (2011) was adopted. The process of creating an ontology can be summarised into a number of steps, namely: define the classes and subclasses; define the object properties; and define the individuals. These steps are repeated until the ontology is sufficient for its purpose.
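As a rough illustration, the three steps can be sketched with plain Python structures standing in for OWL constructs; the class, property and individual names below are hypothetical, not drawn from the fruit fly ontology itself:

```python
# A minimal stand-in for an OWL ontology: classes with subclass links,
# object properties, and individuals typed by a class.
ontology = {
    "classes": {},        # class name -> parent class name (or None)
    "properties": set(),  # object property names
    "individuals": {},    # individual name -> class name
}

def define_class(name, parent=None):
    """Step 1: define classes and subclasses."""
    ontology["classes"][name] = parent

def define_property(name):
    """Step 2: define object properties."""
    ontology["properties"].add(name)

def define_individual(name, cls):
    """Step 3: define individuals as members of a class."""
    ontology["individuals"][name] = cls

# The steps are applied repeatedly until the ontology is sufficient.
define_class("BodyPart")
define_class("Wing", parent="BodyPart")
define_property("hasColour")
define_individual("specimen1", "Wing")
```

In a real OWL editor such as Protégé, each call above corresponds to adding an axiom that the built-in reasoner can then check for consistency.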
Development of the fruit fly ontology used in this research followed this methodology and a description using an example is provided in Section 4.1.

Crowdsourcing
The term crowdsourcing has been defined by several researchers, and the commonality in the definitions is the use of the open call format to enlist a multitude of humans to participate in solving a problem. In this research, we adopt the definition of crowdsourcing as "tapping into the collective intelligence of the public to complete a task" (Yuen et al., 2011). The crowdsourced tasks can be categorised into two, namely, "microtasks" where a task is split into small puzzles whose results are aggregated to solve the main puzzle and "megatasks" where a task is solved in its main form (Good and Su, 2013).
Crowdsourcing research challenges are currently in four dimensions, namely, contributor recruitment, task design, data aggregation and management of abuse (Doan et al., 2011;Hosseini et al., 2014). This research looked at the aggregation of data to generate interpretations. The task design used in the crowdsourcing for the data is reported in Kiptoo (2017).

Ontology enhancement using crowdsourcing
Task design in crowdsourcing often requires the design of custom workflows that break down complex tasks into pieces that can be done independently and then combined (Chilton et al., 2013). Crowdsourcing tasks come in different forms, including playing games, solving puzzles, transcribing, image tagging, translation and writing. In this research, the crowdsourcing is done using image tagging tasks. Image tagging using crowdsourcing has been explored by many researchers (von Ahn and Dabbish, 2004; Chen et al., 2013; Mavandadi et al., 2012; Qin et al., 2011). The tagging in this research is done using axioms from an ontology, rather than the generic tags used in those studies. The participants are guided to make keen observations, as the features to be tagged lie in the fine details of the image.
Techniques for the aggregation of crowdsourced data have been studied by different researchers. Hung et al. (2013) classified the aggregation techniques into non-iterative and iterative. Non-iterative techniques compute a single result; examples include majority decision (MD) (Kuncheva et al., 2003), honeypot (HP) (Lee et al., 2010) and expert label injected crowd estimation (ELICE) (Khattak and Salleb-Aouissi, 2011). Iterative techniques perform a series of iterations over the input given by each worker and also adjust the expertise level of the worker based on performance. Examples of iterative techniques include the expectation maximization (EM) algorithm (Ipeirotis et al., 2010) and ITER (Karger et al., 2011).
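As an illustration, the non-iterative majority decision (MD) technique reduces to counting the votes for each label and keeping the label with the most votes; a minimal sketch, with a hypothetical consensus threshold and hypothetical tag names:

```python
from collections import Counter

def majority_decision(labels, threshold=0.5):
    """Aggregate one task's worker labels by majority decision (MD).

    Returns the winning label if it exceeds the given fraction of
    votes, otherwise None (no consensus).
    """
    if not labels:
        return None
    winner, votes = Counter(labels).most_common(1)[0]
    return winner if votes / len(labels) > threshold else None

# Five workers tag the same image region; three agree.
print(majority_decision(["wing_banded", "wing_banded", "wing_clear",
                         "wing_banded", "wing_clear"]))  # wing_banded
```

Iterative techniques such as EM would instead weight each worker's vote by an estimated expertise level and recompute both until convergence.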

Approach
The DSR approach was adopted. DSR is premised on the pragmatic philosophical view, which holds that the main reason for doing research is to improve human life through the creation of interventions. The DSR approach takes a problem in a selected domain and solves it through the introduction of an artefact that alters the way things are done, thereby solving the problem (Benbasat and Zmud, 1999).
Different processes for conducting research in the DSR approach are documented in the literature (March and Smith, 1995; Purao, Rossi and Bush, 2002; Peffers et al., 2006; Hevner, 2007; Niederman and March, 2012). The general principle behind the processes is that of creating knowledge through a cyclic process of doing something (build) and assessing the results (evaluate), as documented in Owen (1998).
In this research, the research framework presented in Hevner et al. (2004) is adopted. The framework presents DSR as located in the middle of two categories of activities of Rigor and Relevance as shown in Figure 1.
The rigor activities are concerned with ensuring reference to appropriate knowledge bases and that the generated knowledge is grounded on relevant theories. The relevance activities are aimed at ensuring that the resulting theories are usable in practical environments.

Case study
Theory creation can be done at different stages during the build and evaluate cycles of design science research (Kuechler and Vaishnavi, 2008, 2012). In this research, the architecture was developed from the evaluation of a case of ontology-driven fruit fly identification through crowdsourcing.

Creation of the case ontology
The ontology of fruit fly knowledge used in this study is presented in Kiptoo et al. (2016b). The ontology is based on a model whose overarching principle is to represent the morphological features of any taxonomic grouping, as summarised in Box 1. To represent knowledge on "a beetle with a red head", a number of assertions must be made, as shown in Box 2. Building an ontology with identification knowledge down to the species level requires the taxonomic groupings to be specified; all body parts to be defined as subclasses of body part (BodyPart); properties to be modelled as object properties; generic features (hasColour, hasTexture, hasSmell, etc.) to be modelled as features; morphological features to be modelled by associating a BodyPart with a Feature; and finally the morphological feature to be associated with the taxonomic grouping. This process may seem straightforward, but when the taxonomic groupings and features are many, one can commit multiple errors, resulting in an incomplete or erroneous ontology.
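The modelling chain just described can be sketched as data: a morphological feature pairs a body part with a generic property and a feature value, and is then attached to a taxonomic grouping. The names below follow the model rather than the actual ontology files and are hypothetical:

```python
# Sketch of the assertions behind "a beetle with a red head".
body_parts = {"Head": "BodyPart"}   # body part -> its superclass
object_properties = {"hasColour"}   # generic feature properties
features = {"Red": "Colour"}        # feature value -> feature class

# A morphological feature: BodyPart + object property + Feature value.
morph_feature = ("Head", "hasColour", "Red")

# Finally, the morphological feature is associated with a grouping.
taxon_features = {"Beetle": [morph_feature]}

print(taxon_features["Beetle"])
```

With many groupings and hundreds of such triples, a misplaced body part or omitted feature is exactly the kind of error the crowdsourced checks later in the paper are designed to catch.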

Crowdsourcing using the fruit fly ontology
Using the fruit fly ontology outlined above, a crowdsourcing platform for the identification of fruit flies was developed. The objective of the platform was to enable non-expert online crowds to participate in the provision of species identification services with the support of the ontology (Kiptoo et al., 2016a, 2016b).
The crowdsourcing task was for the crowds to tag images with features from the ontology as shown in Figure 2. A sketch image showing the different body parts was provided to guide participants on the names of the fruit fly body parts used in the identification. Taxonomic knowledge is often described using anatomical body parts that may not be obvious to the participating online crowds. In this task, the participant is expected to tick the features they can see in the sample and submit. Upon submission, another sample is loaded with the same task.
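A submission in this task is essentially the subset of ontology features a participant ticks for a sample image. A minimal sketch of how each submission could be recorded, with only features actually offered from the ontology accepted (all field and feature names are hypothetical):

```python
submissions = []  # one record per participant per sample

def submit_tags(user_id, sample_id, ticked_features, offered_features):
    """Record the features a participant ticked for one sample image.

    Only features drawn from the ontology (the offered set) are
    accepted; anything else is silently dropped.
    """
    tags = [f for f in ticked_features if f in offered_features]
    submissions.append({"user": user_id, "sample": sample_id, "tags": tags})
    return tags

offered = {"fore_femur_yellow", "wing_banded", "scutum_black"}
submit_tags("user_01", "sample_1",
            ["fore_femur_yellow", "not_in_ontology"], offered)
```

Restricting tags to the offered set is what makes the resulting data set directly comparable with the ontology's axioms during aggregation.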
The crowdsourcing activity generated a data set that was then used in the assessment of the crowds' capability to provide the identification services. In this research, the same data set was analysed, and the results demonstrate that ontology enhancement can be performed from the ontology-driven crowdsourcing activities. The architecture for this enhancement is presented in Section 5.

Analysis of data
The crowds generated a data set of features that they observed on the images presented to them. The research used seven samples of fruit fly images already identified by experts. In total, 30 online users participated in tagging the images with features from the ontology and close to nine thousand feature tags were made on the images. The MD aggregation technique was used through a simple count of the number of times a tag was made on a sample.
Analysis of the data together with the ontology led to two key findings that are useful in the enhancement of the ontology, namely: (1) there are identification features that the crowds identified but that are missing in the ontology; and (2) there are features that are described in the ontology but are not used by the crowd.
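A minimal sketch of how these two checks could be computed from the aggregated tag counts and the ontology's feature set; the counts, the 20-tag threshold and the feature names are hypothetical stand-ins for the study's data:

```python
def enhancement_candidates(tag_counts, ontology_features, min_tags=20):
    """Compare crowd tag counts for one sample with the ontology.

    Returns (missing, unused): features the crowd tagged at least
    `min_tags` times but absent from the ontology, and features in
    the ontology that the crowd never tagged.
    """
    missing = {f for f, n in tag_counts.items()
               if n >= min_tags and f not in ontology_features}
    unused = {f for f in ontology_features if tag_counts.get(f, 0) == 0}
    return missing, unused

counts = {"fore_femur_yellow": 24, "wing_banded": 12}
in_ontology = {"wing_banded", "face_transverse_band_dark_yellow"}
missing, unused = enhancement_candidates(counts, in_ontology)
# missing -> {"fore_femur_yellow"}
# unused  -> {"face_transverse_band_dark_yellow"}
```

The `missing` set flags candidate axioms for addition, while the `unused` set flags axioms needing expert review before any deletion.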
We describe below what these two findings mean for ontology enhancement.
4.3.1 Features identified by the crowd yet missing in the ontology. The crowd associated several features with some samples, and yet these features were not in the ontology. For instance, in Sample 1, the crowd tagged the feature "fore femur yellow" 24 times. A check of this feature in the taxonomic key found it present, which means that the feature was omitted in the creation of the ontology. The validity status of features that are missing from the ontology yet were tagged over 20 times by the crowd is summarised in Table 1.
From these results, it can be seen that the crowd can tag features that were omitted from the ontology. The important thing is to have clearly stated observable features and high-resolution images, especially when dealing with taxa that are small in size, like the fruit fly case used in this research.
4.3.2 Features in the ontology not observed by the crowd. The crowdsourced data showed a number of features present in the ontology yet minimally tagged or not tagged at all by the crowd. The ontology had "Ceratitis Bremii" with the feature "Face Transverse Band Dark Yellow" associated with it, yet no one tagged this feature. A sample of other features not tagged is summarised in Table 2.
A check of the features not tagged by the crowd revealed two major possibilities: either the image is not clear enough, so the feature is not easily visible, or the crowd participants are unable to interpret the feature. This brings out an important factor when creating a crowdsourcing platform. The task should be very clear, and the imagery used should be clear so that the features are readily visible. The names of the anatomical body parts are not obvious to the crowd, and therefore a detailed key of body parts must be provided to support the crowd in spotting the features.

5. Proposed conceptual architecture
As stated in Section 2.1, the creation of ontologies involves the description of the desired domain knowledge through the declaration of axioms. In OWL ontologies, the axioms consist of classes, individuals and properties that are present in the domain of interest (Horridge et al., 2011). Depending on the domain knowledge and the questions the ontology is expected to answer, the ontology designer decides which items are categorised as classes, individuals and properties.
Depending on the source of knowledge, crowdsourcing platforms can be used to engage online crowds to register their observations, which upon aggregation could result in axioms that are added to the ontology. The knowledge sources include knowledge held by domain experts, knowledge recorded in different documentation and the knowledge observable in our environments. Online crowds can be engaged in creating the classes, individuals and properties based on the model of the ontology. It is up to the developer of the platform to decide how to design the crowd tasks to capture data that can be synthesised for the generation of ontology axioms. The proposed architecture for crowdsourcing for ontology enhancement is as shown in Figure 3.

Architecture components
The conceptual architecture has three components, namely, ontology creation, algorithms and crowdsourcing platforms, described in the sections below.
5.1.1 Ontology creation. As illustrated, the different activities of ontology creation (the description of classes, properties and individuals) can be achieved through the synthesis of data generated using crowdsourcing. The lines joining the algorithms and the ontology creation activities are dotted, meaning that the implementer of this architecture can target any of the ontology development activities, depending on the levels at which the developer wants to engage online crowds. If the developer wants the crowd to identify classes, then crowdsourcing activities that support this are designed, and once the classes are generated by the crowd, they can be added to the ontology. For example, in taxonomic ontologies, if the designer wants to use crowdsourcing to identify the colours present in different body parts of a sample, a crowdsourcing task can be designed to ask for the colours the crowd can see. The common colours can then be added as subclasses of the colour class.
5.1.2 Algorithms. Algorithms consist mainly of the tools used to analyse the crowdsourced data for the purposes of adding axioms to the ontology or scheduling more crowd activities when needed. As shown in Figure 3, the inputs to the algorithms component are the stable version of the ontology and the crowdsourced data.
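The colour example can be sketched end to end: free crowd colour reports are aggregated, and the common ones become subclasses of the colour class. The reports, the share threshold and the class names below are hypothetical:

```python
from collections import Counter

def common_colours(colour_reports, min_share=0.2):
    """Keep colours named by at least `min_share` of the reports
    (the threshold is a hypothetical design choice)."""
    counts = Counter(colour_reports)
    total = len(colour_reports)
    return {c for c, n in counts.items() if n / total >= min_share}

colour_class = {"Colour": []}  # class -> list of subclasses

def add_colour_subclasses(colours):
    """Add crowd-identified colours as subclasses of the Colour class."""
    for c in sorted(colours):
        if c not in colour_class["Colour"]:
            colour_class["Colour"].append(c)

reports = ["yellow", "yellow", "brown", "yellow", "black", "brown"]
add_colour_subclasses(common_colours(reports))
```

Here the algorithms component is the `common_colours` filter, and the ontology creation component is the subclass update; in a real deployment both would operate on the stable OWL ontology rather than a dictionary.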
5.1.3 Crowdsourcing platform. The crowdsourcing platform component contains the tools designed to enable the engagement of online crowds through the performance of clearly defined tasks. The platforms must be designed in such a way that the data generated can be analysed to produce axioms for the ontology. A key input to the crowdsourcing platform is the stable version of the ontology. Axioms from this ontology are used to solicit input from online crowds. This input can then be aggregated to generate more axioms for the enhancement of the ontology. In the fruit fly example, using the stable version of the ontology led to the identification of omitted and incorrect axioms.

Ontology enhancement steps
The ontology enhancement steps can be summarised as follows:
(1) Use the stable version of the ontology.
(2) Create a crowdsourcing platform that can generate a data set for ontology enhancement.
(3) Analyse the data set for omissions and generate potential axioms.
(4) Analyse the data set for committed errors and generate potential axioms for deletion.
(5) Confirm the validity of the potential axioms.
(6) Enhance the ontology accordingly.

6. Discussion of results
From the analysis above, we see that online crowds can be engaged to verify the axioms in the ontology. Creation of ontologies is a repetitive exercise and, therefore, prone to errors. Engaging online crowds can aid in the correction of assertions made in the ontology. It is also possible to structure the ontology creation problem in a way that the ontology expert creates the high-level structures based on a model, and then the task of populating the ontology with repetitive features is delegated to online crowds. In this case of taxonomic knowledge, the experts can focus on defining the features, and the role of associating those features with samples can be delegated to online crowds. It is, however, important to note that the quality of the crowdsourcing environment and the samples used are key to getting high-quality results. The crowdsourcing platform must ensure the crowd is adequately guided so that they can effectively tag the features. The samples used must also be high-quality images with features clearly visible.
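The enhancement steps above can be strung together into a simple pipeline. The sketch below wires hypothetical stand-ins for steps 3 to 6, with expert confirmation (step 5) represented as a callback; all names, counts and thresholds are illustrative:

```python
def enhance_ontology(ontology_features, tag_counts, confirm, min_tags=20):
    """Run the enhancement steps over one crowdsourced data set.

    `ontology_features` is the stable version's feature set (step 1),
    `tag_counts` the aggregated crowd data (step 2), and `confirm` an
    expert-validation callback for candidate axioms (step 5).
    """
    # Steps 3-4: candidates for addition (omissions) and deletion (errors).
    to_add = {f for f, n in tag_counts.items()
              if n >= min_tags and f not in ontology_features}
    to_delete = {f for f in ontology_features if tag_counts.get(f, 0) == 0}
    # Step 5: confirm the validity of the potential axioms.
    to_add = {f for f in to_add if confirm(f)}
    to_delete = {f for f in to_delete if confirm(f)}
    # Step 6: enhance the ontology accordingly.
    return (ontology_features | to_add) - to_delete

enhanced = enhance_ontology(
    {"wing_banded", "stale_feature"},
    {"fore_femur_yellow": 24, "wing_banded": 10},
    confirm=lambda f: True)
# enhanced -> {"wing_banded", "fore_femur_yellow"}
```

The output becomes the next stable version of the ontology, closing the build and evaluate cycle that the DSR approach prescribes.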

Conclusion and future work
In this paper, an architecture for ontology enhancement using crowdsourcing is presented. The results of the case used in the study demonstrate that this architecture can be adopted for online crowds' participation in ontology enhancement. The architecture is an addition to the existing crowdsourcing models in the literature and forms a basis for interrogation and enhancement by other researchers. The architecture is also an artefact that can be adopted by application developers when building applications for ontology enhancement.
In future, evaluation of the architecture can be conducted with respect to the aggregation algorithms used and the design of the crowd tasks. This research used a simple aggregation algorithm (MD) in the analysis of the crowdsourced data. More sophisticated aggregation algorithms could be employed so that different weights can be assigned to the contributors, thus attaining trustable thresholds even with few participants. The crowdsourcing platform can also be improved to support the use of multiple images for the same sample, thereby allowing the observation of features from different dimensions.