Ontology-based approach to enhance medical web information extraction

Nassim Abdeldjallal Otmani (Mouloud Mammeri University of Tizi Ouzou, Tizi Ouzou, Algeria)
Malik Si-Mohammed (Mouloud Mammeri University of Tizi Ouzou, Tizi Ouzou, Algeria)
Catherine Comparot (University Toulouse – Jean Jaurès, Toulouse, France)
Pierre-Jean Charrel (University Toulouse – Jean Jaurès, Toulouse, France)

International Journal of Web Information Systems

ISSN: 1744-0084

Article publication date: 11 December 2018

Issue publication date: 8 August 2019



The purpose of this study is to propose a framework for extracting medical information from the Web using domain ontologies. Patient–Doctor conversations have become prevalent on the Web. For instance, solutions like HealthTap or AskTheDoctors allow patients to ask doctors health-related questions. However, most online health-care consumers still struggle to express their questions efficiently due mainly to the expert/layman language and knowledge discrepancy. Extracting information from these layman descriptions, which typically lack expert terminology, is challenging. This hinders the efficiency of the underlying applications such as information retrieval. Herein, an ontology-driven approach is proposed, which aims at extracting information from such sparse descriptions using a meta-model.


A meta-model is designed to bridge the gap between the vocabulary of the medical experts and the consumers of the health services. The meta-model is mapped with SNOMED-CT to access the comprehensive medical vocabulary, as well as with WordNet to improve the coverage of layman terms during information extraction. To assess the potential of the approach, an information extraction prototype based on syntactical patterns is implemented.


The evaluation of the approach on the gold standard corpus defined in Task 1 of ShARe/CLEF 2013 showed promising results: an F-score of 0.79 for recognizing medical concepts in real-life medical documents.


The originality of the proposed approach lies in the way information is extracted. The context defined through a meta-model proved to be efficient for the task of information extraction, especially from layman descriptions.



Otmani, N.A., Si-Mohammed, M., Comparot, C. and Charrel, P.-J. (2019), "Ontology-based approach to enhance medical web information extraction", International Journal of Web Information Systems, Vol. 15 No. 3, pp. 359-382. https://doi.org/10.1108/IJWIS-03-2018-0017



Emerald Publishing Limited

Copyright © 2018, Emerald Publishing Limited

1. Introduction

Effective Patient–Doctor communication is crucial for the successful delivery of medical care. The patients’ ability to express themselves and the doctors’ willingness to explain medical concepts in layman terms ensure a fluent exchange of information and hence an improved overall quality of healthcare (Ha and Longnecker, 2010). Apart from the traditional health care system, online medical resources, such as forums, question answering systems and symptom checkers, are proliferating, and Internet users are increasingly referring to them. As reported in the survey by Fox and Duggan (2013), almost 60 per cent of American adults use them, because they are easy to access and, when well used, provide a useful source of knowledge. Some of these resources, such as WebMD.com or MedlinePlus.gov, provide a wide range of medical articles written mostly in layman terms, designed to educate consumers and improve their access to medical knowledge. However, because of the vocabulary and knowledge discrepancy, navigating and retrieving medical information can be rather complicated and time consuming for laypersons (Nie et al., 2013). On the other hand, solutions such as HealthTap.com or AskTheDoctor.com allow patients to overcome the barrier of expert knowledge by asking doctors free-text questions directly on the Web, which makes online Patient/Doctor conversation the new challenge in this setting. In fact, one of the major elements that determine one’s ability to construct and understand messages during a conversation is the mastery of the knowledge and its respective vocabulary (Schneider, 2009). Regular patients commonly have less mastery of medical knowledge than health care professionals. Therefore, they tend to formulate questions which contain sparse, incomplete or sometimes even misleading descriptions, reducing their chances of promptly getting an appropriate answer from a potential doctor (Nie et al., 2015a).

This study proposes a structured approach to improve the quality of acquisition of the exchanged messages by bringing the consumers’ common sense knowledge and basic vocabulary closer to the doctors’ knowledge. The designed meta-model provides an abstract high-level view of what doctors look for as primary information to establish a diagnosis before ordering any medical exam. It is aligned with both the domain ontology SNOMED-CT[1] to access the medical expert vocabulary and WordNet[2] to ensure a wider coverage of the consumers’ terms. In addition, this paper presents the implementation steps of the approach. The first step is the natural language processing of the text for the information extraction using StanfordParser[3] and syntactical extraction patterns. The second step is the mapping of the meta-model to SNOMED-CT and WordNet. The third step is the ontology-driven modeling, which allows reasoning processes to be performed to evaluate the coherence and improve the overall quality of information extraction. The final step is the presentation of the resulting interpretation models. Furthermore, to evaluate the prototype, two experimental studies were performed. In the first one, the gold corpus provided by ShARe/CLEF eHealth (2013) was used to examine the quality of the information extraction in medical documents. The standardized measures (Precision, Recall and F1-score) were calculated and compared with the results of other works. In the second one, the prototype was tested with ontologies of different sizes to assess its performance and scalability.

This paper is organized as follows. Section 2 presents the studies related to online patient/doctor conversations and the attempts to link the consumer’s vocabulary to medical concepts. Section 3 gives a detailed view of the ontology-driven approach to enhance the information extraction process. Section 4 delineates all the steps of the implementation. Section 5 presents an experimental study followed by a discussion of the results. Finally, the authors conclude with a general overview of the work in Section 6.

2. Related work

2.1 Online patient/doctor conversations

As mentioned previously, effective patient/doctor communication is crucial for the successful delivery of healthcare. Online health resources provide a springboard for successful patient/doctor communication (Nie et al., 2015a). They give consumers easier access to medical knowledge, and consumers who know more about their medical condition tend to communicate better and to better understand the advice or the prescribed treatment (Teutsch, 2003).

According to Makovsky (2013), “In a year, the average American visits the doctor 3 times, but spends more than 52 hours on the Internet looking for health information”. Online healthcare resources are no substitute for traditional healthcare. Nevertheless, they have their benefits: they provide instantaneous and affordable access to medical knowledge and specialized medical doctors. Being able to communicate efficiently in this new setting is essential to using them to advantage.

HealthTap is an example of a community-based health resource. It is an online platform where users can log in to post health-related questions and receive answers from general practitioners or specialists for free. In 2017, HealthTap counted more than 100,000 doctors available to answer questions voluntarily. It is beneficial both for doctors, who can gain in popularity, attract potential patients and provide help to a wider range of people, and for patients, who can get answers and more diverse recommendations from specialists.

Unfortunately, it is not easy for doctors to cope with the ever-growing number of questions, and the waiting time for an answer can be up to three days (Nie et al., 2015a). In addition, most of the time, patients are not well acquainted with healthcare jargon and lack the medical knowledge, which impedes their ability to construct concise descriptions of their medical conditions.

Here is an example of a question posted on HealthTap[4]: “I haven’t been able to eat in 6 days. I am getting very tired and dizzy. I’m very nervous about it because every time I try to eat I get very sick.” This question lacks so many details that it is difficult to give a precise answer. The best recommendation the doctors could give was to direct the patient to the emergency room.

When consumers are left alone to ask questions online, they lack the assistance of doctors and they do not necessarily know what information is relevant to construct a meaningful question. There are many pieces of information that can be added to improve its quality such as age range, gender, daily activities, previous problems and conditions, social context, symptoms, duration, affected body areas and others.

2.2 Consumer health vocabularies

As an attempt to solve the wide layman/expert vocabulary discrepancy, resources were designed to establish bidirectional links between expert terms and layman terms (Grabar and Hamon, 2014; Kandula et al., 2010; Smith and Fellbaum, 2004; Vydiswaran et al., 2014; Zeng and Tse, 2006; Zheng and Yu, 2016). For instance, a resource, called consumer health vocabulary (CHV), was created during a collaborative work (Consumer Health Vocabulary Initiative, 2013). It links common terms (or expressions) such as “heart attack” or “tumour” to their equivalent medical terms, in this case “Myocardial Infarction” and “Metastasis” respectively.

CHV can be used to improve information retrieval by converting the layman terms into medical jargon queries and also to translate specialized medical concepts in healthcare documents into layman terms. However, understanding consumers’ questions relying only on CHV is not enough, as stated by Vydiswaran et al. (2014), “Vocabularies providing consumer-oriented health terms are relatively less mature. This fact diminishes the performance of named-entity recognition tools for processing community-generated text as well as the potential for building applications that could translate professional language into layperson terms to improve readability and facilitate comprehension.”

Instead of using a CHV, our approach relies on the domain ontology SNOMED-CT to access most of the medical jargon, and on manually defined lexico-syntactical patterns for a personalized information extraction from layman expressions. This combination is intended to overcome the annotation problem that most tools face with sentences written in layman terms.

2.3 Questions’ analysis

Understanding medical questions first requires a natural language analysis of the text to extract the relevant information necessary to provide a suitable answer. Annotation (Suominen et al., 2013), concept extraction (Tseytlin et al., 2016), information extraction (Antolík, 2005; Deleger et al., 2014; Roberts et al., 2007) and named entity recognition (Fleuren and Alkema, 2015; Zhang and Elhadad, 2013) all aim at extracting information; however, they perform better when the input contains expert vocabulary. Because of the scarcity of expert medical terms in patients’ descriptions, it is difficult for such solutions to extract the real meanings behind the layman terms accurately. This study shows the benefit of an ontology-driven approach that alternates between text analysis and language interpretation to progressively build an accurate interpretation of the question.

2.4 Medical Q/A systems

Automatic medical question-answering systems attempt to provide answers to medical questions by extracting the relevant concepts and retrieving the right documents or information accordingly (Jacquemart and Zweigenbaum, 2003). Medical Q/A systems are designed to answer questions like: What are the symptoms of disease X? What treatments are available for disease Y? What is the cause of symptom X? What is the dose of drug X? (Ely et al., 1999; Yu and Cao, 2008). They use NLP (Natural Language Processing) tools to analyze the question lexically and syntactically, then extract the concepts and the relationships to generate a SPARQL query to run on a medical RDF repository (Ben Abacha and Zweigenbaum, 2015). Other solutions extract concepts from the question to retrieve medical resources (Electronic Health Records, Health Articles, Dictionary Entries, etc.) (Beez et al., 2015; Guo and Zhang, 2008; Nie et al., 2015b).

Interactive solutions play a major role in online healthcare. As highlighted in a study pertaining to classifying medical relationships and confining the inequalities between online health seekers and providers, “The principal online health services suggest an interactive standard, where health seekers can ask health oriented questions while clinicians provide the knowledgeable and trustworthy answers from forums such as Medline, WebMD, and HealthTap.” (Roberts and Demner-Fushman, 2015).

However, as mentioned previously, in this setting, patients’ and doctors’ knowledge and vocabulary need to be balanced. According to Roberts and Demner-Fushman (2015), “Textual replies are felt as ideal ones by the online health seeker and the only limitation is the medical jargons gap among health seeker and doctor”.

2.5 Context and intelligent systems

Intelligent systems may model the context to use it as an additional layer of support and intelligence to effectively reason about and understand the user’s needs. The definition of context may vary slightly depending on the type of system: for instance, cookies for search engines, location for map and geolocation systems, or the user’s history for recommendation systems. In this study, however, we adopt the definition given by Bauer and Dey (2016), where the context is any information that can be used to characterize the situation of an entity. The studies by Kayes et al. (2015a, 2015b, 2017) showed that ontologies and ontological inference rules can be leveraged to reason about and enrich the context. They successfully applied this to systematically grant doctors access to resources in health services. This study proposes an intelligent system that models the users’ input and treats it as a context to better acquire the user’s questions.

The following section presents an approach that relies on domain ontologies as a way to reduce the vocabulary and knowledge discrepancy between health seekers and providers.

3. Approach

The first step doctors rely on to get an insight about the patient’s state is the physical examination, checking the vital signs, and an interactive conversation called anamnesis (van Tellingen, 2007). Anamnesis represents a description of the patient’s state in his or her own words (“Anamnesis,” n.d.). The goal of this step is to obtain enough useful information necessary to establish a diagnosis and provide the required medical care.

The nonverbal communication cues (eye contact, posture, voice tone, body language, mannerisms) are crucial to the success of an office visit (Teutsch, 2003). However, in this online asynchronous setting, where the conversation is not in real time, doctors can rely solely on the words expressed by the health seeker. This requires an additional effort from the health seeker to include beforehand as much important and relevant information as possible, to reduce the delay caused by follow-up questions when the provided information is inadequate.

This section presents a framework for targeted information acquisition based on the notion of meta-modeling, and the ensuing approach to improve the extraction of medical concepts from consumers’ questions or descriptions. The medical concepts are retrieved from SNOMED-CT, which is briefly presented in this section.

3.1 Architecture of the approach

The ultimate objective of computational knowledge is to make computer systems understand natural language questions and provide concise answers (Lim et al., 2011). In line with this objective, this paper proposes an ontology-driven approach to interpret medical questions or descriptions using Natural Language Processing (NLP) tools combined with medical expertise accessed through domain ontologies.

The main steps are summarized as follows:

  • Analysis of the prerequisite information: During a task-oriented information acquisition process there is usually a specific set of information to seek. In this study, the authors analyzed a set of questions posted online by regular health seekers. Most questions conformed to a common structure, which was formalized into a meta-model.

  • Design a meta-model and submit it for expert validation: This task is accomplished either by the domain expert(s) or by the knowledge engineer(s) with the support of the domain expert(s) who can review and check the compatibility with the expertise and propose readjustment when needed to cater to the required information accurately.

  • Align the domain ontologies with the underlying meta-model: At this stage, the knowledge engineer creates the associations between the entities of the meta-model and the concepts of the ontologies to offer access to the expert knowledge necessary during the interpretation process.

  • Define the syntactical patterns: One of the advantages of having a meta-model is the ability to aim the information extraction task at a specific set of patterns to match against the syntactical trees generated from the questions by NLP tools. This allows the potentially ambiguous and specious information intrinsic to layman descriptions to be acquired.

  • Incorporate the reasoning modules (Induction, Abduction, and Deduction): While the meta-model structures the information acquisition process, its alignment with the domain ontologies enables the reasoning mechanisms to be performed to check the consistency of the extracted information and improve the interpretation.

  • Presentation of the result: the last step is the presentation of the constructed interpretation model enriched with ontological concepts. In this study, the graphical representation is chosen.

3.2 Medical domain ontology SNOMED-CT

Most disciplines, including medicine, have their specific vocabulary used to denominate the domain concepts unambiguously. This vocabulary can be captured and modeled through domain ontologies. As defined by Gruber (1993), an ontology represents explicitly, in a machine-understandable format, the set of all possible concepts, including their terminology, and the relationships binding them. Ontologies are used to publish, share and reuse knowledge.

As stated on the website of SNOMED-CT, “SNOMED CT is the most comprehensive and precise clinical health terminology product in the world.” (SNOMED International, 2018). In other words, the domain ontology SNOMED-CT (systematized nomenclature of medicine – clinical terms) provides a thorough list of the concepts that a medical doctor may use. In addition, all those concepts are linked in a coherent and structured manner, which allows reasoning processes to be performed.

However, most of these concepts are unknown to a layman patient. For instance, a search for the expert term “Scirrhous Carcinoma” in SNOMED-CT returns the concept Id_254839007 with the Fully Specified Name (FSN) Scirrhous carcinoma of breast (disorder), which is basically a malignant breast tumor. The latter denomination makes more sense to a layperson. The hierarchy path below shows the list of concepts obtained by following the is-a relations from the concept 254839007 to the root. The deeper down the hierarchy, the more specific the concepts tend to be and the less known they are to laymen. In this study, SNOMED-CT was chosen to represent the medical knowledge vocabulary as an intermediate resource to bridge the language gap.

Hierarchy path of the concept Id 254839007 (Scirrhous carcinoma of breast) in SNOMED-CT:

  1. FSN: Scirrhous carcinoma of breast (disorder), is a:

  2. FSN: Primary malignant neoplasm of breast (disorder), is a:

  3. FSN: Malignant tumor of breast (disorder), is a:

  4. FSN: Malignant neoplasm of thorax (disorder), is a:

  5. FSN: Malignant neoplastic disease (disorder), is a:

  6. FSN: Neoplastic disease (disorder), is a:

  7. FSN: Neoplasm and/or hamartoma (disorder), is a:

  8. FSN: Disease (disorder), is a:

  9. FSN: Clinical finding (finding), is a:

  10. FSN: SNOMED CT Concept (SNOMED RT + CTV3).

SNOMED Clinical Terms version: 20150731 [R] (July 2015 Release).
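The path above can be reproduced by repeatedly following is-a links. The following is a minimal sketch, assuming a toy table keyed by concept names and a single parent per concept; SNOMED-CT itself allows several parents, and the function name `path_to_root` is illustrative rather than part of the prototype.

```python
# Toy is-a table built from the hierarchy path listed above (names only).
# Assumption: one parent per concept; real SNOMED-CT is a polyhierarchy.
IS_A = {
    "Scirrhous carcinoma of breast": "Primary malignant neoplasm of breast",
    "Primary malignant neoplasm of breast": "Malignant tumor of breast",
    "Malignant tumor of breast": "Malignant neoplasm of thorax",
    "Malignant neoplasm of thorax": "Malignant neoplastic disease",
    "Malignant neoplastic disease": "Neoplastic disease",
    "Neoplastic disease": "Neoplasm and/or hamartoma",
    "Neoplasm and/or hamartoma": "Disease",
    "Disease": "Clinical finding",
    "Clinical finding": "SNOMED CT Concept",
}


def path_to_root(concept, is_a=IS_A):
    """Follow is-a links until a concept has no recorded parent."""
    path = [concept]
    while path[-1] in is_a:
        path.append(is_a[path[-1]])
    return path


path = path_to_root("Scirrhous carcinoma of breast")
print(len(path))  # 10
print(path[2])    # Malignant tumor of breast
```

A layman-friendly label such as "Malignant tumor of breast" sits only two steps above the expert concept, which is the kind of generalization the approach relies on.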

3.3 Electronic health records

Upon admission of a patient to a hospital, practitioners access the patient’s electronic health record (EHR) to record relevant medical information. An EHR is the digital version of a patient’s paper files that resides in a system designed to support users through availability of complete, secure, and accurate data (Carter, 2008). An EHR contains valuable and pertinent information including: patient’s vital signs, medical histories, diagnoses, treatment plans, immunization status, allergies, radiology images, laboratory and test results, demographic information such as age, gender, ethnicity, height and weight, medication and prescriptions, admissions notes, discharge summaries and others (MIT Critical Data, 2016).

The structure and content of EHRs vary depending on the system’s implementation (Häyrinen et al., 2008). Most EHR systems record almost the same kind of information, but encode and structure it differently (Hoerbst and Ammenwerth, 2010). However, some initiatives, such as the openehr.org project, have proposed solutions to standardize EHRs to allow better interoperability. The information contained in the EHR is used by many actors, hence the use of a common and standardized codification such as SNOMED-CT to denote medical concepts is encouraged. This may reduce the problem of misinterpretation due to the naming of concepts. Patients are encouraged to include all these kinds of information in their questions to help doctors with their decision-making.

3.4 Components of the approach

A text goes through different stages to be interpreted (Figure 1). The first one is the natural language analysis followed by the information extraction task. The second one is the projection into the meta-model and the execution of the reasoning modules for the interpretation process. The final stage is the presentation of the result.

3.4.1 Input/output of the system.

The system takes as input textual health documents (questions, descriptions, reports, or any other medical document). As output, the system returns an interpretation model containing the relevant medical concepts found within the input text.

3.4.2 Information extraction.

The first component of the approach is the information extraction. It performs an automatic extraction of the concepts and relations expressed within the text. Each text is divided into sentences, which are analyzed by a parser to build their syntactical trees. Then, the information is extracted by matching manually defined syntactical patterns against the syntactical trees. The result is represented as RDF-like triplets, written either <subject, predicate, object> or predicate (subject, object).
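As an illustration of the two notations, the sketch below stores one triplet and renders it both ways. The class name `Triple`, its method names and the predicate `suffers_from` are hypothetical and are not taken from the prototype.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One extracted statement: a relation between two concepts."""
    subject: str
    predicate: str
    obj: str

    def as_tuple_form(self):
        # <subject, predicate, object>
        return f"<{self.subject}, {self.predicate}, {self.obj}>"

    def as_predicate_form(self):
        # predicate(subject, object)
        return f"{self.predicate}({self.subject}, {self.obj})"


t = Triple("Person", "suffers_from", "heart palpitations")
print(t.as_tuple_form())      # <Person, suffers_from, heart palpitations>
print(t.as_predicate_form())  # suffers_from(Person, heart palpitations)
```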

3.4.3 Meta-Model for health-care description.

We collected a set of questions posted on medical forums specialized in cardiology. Each question was manually analyzed to extract the relevant information. The information was categorized into concepts, and relationships were derived from them to build a common structure. This meta-analysis of the questions resulted in a meta-model, which was reviewed by a cardiologist to judge the relevance of the requested information (Figure 2). Most of the analyzed questions ranged from requests for online diagnosis of a given set of manifested symptoms and for prevention recommendations to simple requests for explanation of medical concepts. The well-constructed ones had captured the attention of doctors and received an answer.

During the conceptualization process of the solution, Ontology Design Patterns (ODPs) can be searched and selected for reuse whenever possible to accelerate the modeling process (Hammar, 2017). An ontology design pattern is a reusable successful solution to a recurrent modeling problem[5]. ODPs have already been used in some biomedical ontologies (Mortensen et al., 2012). They have also been used to improve interoperability in EHRs (Martínez-Costa et al., 2014). However, in our study we did not use ODPs for the design of our meta-model.

The meta-model will be used as a basis to define lexico-syntactical patterns for the information extraction. Lexico-syntactic ODPs are recurrent patterns used in knowledge extraction. Patterns for common relationships (is-a, part-of, […]) can be reused; for instance, the pattern [(NP<part>,)* and] NP<part> PART NP<whole>[5] extracts the part-of relationship. Such patterns are recommended in ontology engineering (Hammar, 2012). However, since users have various ways to express their statements, we needed some adaptability in the definition of the patterns, which are not always common. In our case, as the relations defined by the meta-model are quite specific, it is difficult to find reusable patterns. Therefore, we define them ourselves.

The meta-model, represented in Figure 3, predefines the set of information necessary to construct cohesive medical descriptions. It can be grasped by a regular health seeker and, at the same time, it meets the doctors’ requirements.

The meta-model can be designed either by the knowledge engineers then reviewed and refined by the experts or by the experts themselves. Afterwards the knowledge engineers align it with the other resources and define the extraction patterns.

3.4.4 Mapping the Meta-Model with SNOMED-CT and WORDNET.

The purpose of the mapping module is to establish bidirectional links between the meta-model and all the other resources needed during the acquisition process (Table I). In this study, SNOMED-CT was chosen as the source of medical expertise, along with WordNet as the source of layman terminology, to improve the quality of the information extraction. The meta-model defines the structure of the prerequisite information to acquire, which is afterwards converted into extraction patterns. The ontological inference occurs inside SNOMED-CT, and the meta-model is used to evaluate the coherence of the extracted information.

The advantage of the mapping is that it spares health seekers the intrinsic complexity of domain ontologies while still hinting at expert concepts by applying reasoning mechanisms to the lay words. In addition, the alignment guides the information acquisition process. For example, in “I feel a pain in <X>”, X is more likely to be a body part or an organ. Hence, <X> is aligned with the higher concepts of body parts or organs in the domain ontology. This helps to check the consistency of the input and select accurate interpretation concepts.
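The consistency check described for the slot <X> can be sketched as a test on is-a ancestors. The ontology fragment, the slot name `pain_location` and both function names below are hypothetical; they stand in for the SNOMED-CT graph and the mapping file, whose actual contents are not reproduced here.

```python
# Hypothetical toy ontology: child -> parent (is-a). Not SNOMED-CT content.
IS_A = {
    "chest": "body part",
    "heart": "organ",
    "organ": "body structure",
    "body part": "body structure",
    "insulin": "drug",
}

# Hypothetical alignment: pattern slot -> allowed high-level concepts.
SLOT_ALIGNMENT = {"pain_location": {"body part", "organ"}}


def ancestors(term):
    """Return the term followed by all of its is-a ancestors."""
    chain = []
    while term is not None:
        chain.append(term)
        term = IS_A.get(term)
    return chain


def fits_slot(term, slot):
    """True if the term or one of its ancestors is allowed for the slot."""
    return any(a in SLOT_ALIGNMENT[slot] for a in ancestors(term))


print(fits_slot("heart", "pain_location"))    # True
print(fits_slot("insulin", "pain_location"))  # False
```

Under these assumptions, "I feel a pain in my heart" yields a coherent filler for the slot, whereas a drug name in the same position would be flagged as inconsistent.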

3.4.5 Instantiated model.

The construction of the instantiated model goes through a cyclic process composed of two sub-tasks: Projection of the extracted information and Interpretation using the reasoning processes. The extracted information will be projected into the meta-model to check its coherence using SNOMED-CT and WordNet. The final result is an ontological model of the input text composed of a set of relationships between medical concepts (Figure 4).

3.4.6 Reasoning: Induction, deduction, abduction.

Three main reasoning mechanisms are used to improve the information extraction. The ontological modeling allows us to perform reasoning such as deduction. In addition, two other methods of reasoning, induction and abduction, operate through the context. Together they make it possible to formulate and evaluate new hypotheses, disambiguate terms, specialize generic concepts and check coherence.

Deduction: Given a set of premises, deduce possible concepts.

I’ve been prescribed insulin.

Insulin is prescribed for diabetic people.

Since “Insulin” is “Drug” and “Diabetes” is “Metabolic Disease”.

Then deduce: the person has diabetes.

Induction: Given a conclusion, induce the list of possible concepts.

He is diabetic.

Thirst, Hunger, Fatigue are symptoms associated with diabetes.

Then induce: He is more likely to experience the symptoms of thirst, hunger, and fatigue.

Abduction: Given the context, generate plausible hypotheses.

He has a broken leg.

Then abduce: he has either suffered a trauma to the leg, or he suffers from brittle bone disease (Osteogenesis Imperfecta, SNOMED-CT ID: 78314001).
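The three examples above can be mimicked with simple lookups. This is only a sketch under strong assumptions: the three tables are hand-written from the examples in the text, whereas the approach derives such links from SNOMED-CT and the meta-model, and the function names are illustrative.

```python
# Toy knowledge base written from the examples above (illustrative only).
TREATS = {"insulin": "diabetes"}                          # drug -> condition
SYMPTOMS = {"diabetes": ["thirst", "hunger", "fatigue"]}  # condition -> symptoms
CAUSES = {"broken leg": ["leg trauma", "osteogenesis imperfecta"]}


def deduce(prescribed_drug):
    """Deduction: the drug treats a condition, so the person has it."""
    return TREATS.get(prescribed_drug)


def induce(condition):
    """Induction, as used in the text: condition -> likely symptoms."""
    return SYMPTOMS.get(condition, [])


def abduce(observation):
    """Abduction: observation -> plausible explanations to be evaluated."""
    return CAUSES.get(observation, [])


print(deduce("insulin"))     # diabetes
print(induce("diabetes"))    # ['thirst', 'hunger', 'fatigue']
print(abduce("broken leg"))  # ['leg trauma', 'osteogenesis imperfecta']
```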

4. Experimentation and evaluation

4.1 Presentation of the implemented prototype

To validate the effectiveness of the approach, the authors designed and coded a prototype in Java. The prototype takes as input a textual health document and produces as output an interpretation model. Each model is composed of a set of triplets Predicate (Subject, Object) extracted automatically from the input text following the meta-model’s structure. Subject and Object represent medical concepts and Predicate is the relation between them.

The prototype is composed of four main modules: a module for the lexico-syntactical analysis of the input text and information extraction; a module for loading SNOMED-CT and WordNet as external resources and mapping them to the meta-model; a module for the reasoning and evaluation of the extracted triplets; and, finally, a module for the presentation of the interpretation models.

Figure 5 shows all the components and classes composing the prototype, along with the resulting model after an automatic analysis of the sentence “I suffer from severe heart palpitations”. This type of sentence is typical among consumers of online health services, and the resulting model accurately reflects the input text.

The following sections thoroughly describe the implementation of the prototype. Afterwards another section provides a more exhaustive evaluation of the system using the standardized real-life corpus proposed by ShARe/CLEF (Suominen et al., 2013).

4.2 Solution architecture

The architecture of the prototype has to satisfy two key features: modularity and efficiency. The implementation of this complex system was divided into different modules to manage its production. In addition, since ontologies are considerably large, a fast and efficient treatment of the input text at different stages is crucial.

The diagram (Figure 6) illustrates the solution’s structure and its different components. The solution uses, at different stages, the following resources:

  • a syntactical parser (StanfordParser) to analyze the input text;

  • SNOMED-CT to access the medical vocabulary;

  • WordNet 2.1 for a better coverage of the layman terminology;

  • manually defined syntactical patterns for information extraction; and

  • a meta-model which, once instantiated, yields an interpretation model representing the input text.

4.3 Information extraction

Before extracting information, each input text is divided into sentences. Then, each sentence is analyzed using StanfordParser to build its syntactical tree. The syntactical trees were encoded as nested lists, where a leaf (a two-element list [Word:POS]) represents a word and its part of speech, and every other inner element (a multi-element list) represents a node in the syntactical tree (VP, NP, etc.).

Afterwards, to extract the information, a set of syntactical patterns was manually defined. These patterns are guided by the designed meta-model and are similar to the patterns proposed by Hearst (1992). Each pattern aims at extracting a Subject (concept), an Object (concept) and a Predicate (relationship). Table II illustrates the various kinds of concepts and relationships sought in the input text.
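To make the idea of a tree pattern concrete, the sketch below matches one hypothetical pattern against a nested-list tree and collects a Subject, a Predicate and an Object. The pattern syntax (a tree whose slots are Var placeholders), the node layout and all function names are assumptions introduced for illustration; they are not the format of the prototype's pattern file.

```python
# Hypothetical tree-pattern matcher; pattern syntax and names are assumptions.
class Var:
    """Placeholder that captures the words of whatever subtree it meets."""
    def __init__(self, role):
        self.role = role


def is_leaf(node):
    return len(node) == 2 and all(isinstance(x, str) for x in node)


def words(node):
    if is_leaf(node):
        return [node[0]]
    return [w for child in node[1:] for w in words(child)]


def match(pattern, node, bindings):
    """Try to match pattern at this node, filling bindings on success."""
    if isinstance(pattern, Var):
        bindings[pattern.role] = " ".join(words(node))
        return True
    if is_leaf(node):
        # A leaf pattern is [word or Var, POS].
        if len(pattern) != 2 or pattern[1] != node[1]:
            return False
        if isinstance(pattern[0], Var):
            bindings[pattern[0].role] = node[0]
            return True
        return pattern[0] == node[0]
    if pattern[0] != node[0] or len(pattern) != len(node):
        return False
    return all(match(p, c, bindings) for p, c in zip(pattern[1:], node[1:]))


def find_matches(pattern, node):
    """Search the whole tree recursively and return one dict per match."""
    found = []
    bindings = {}
    if match(pattern, node, bindings):
        found.append(bindings)
    if not is_leaf(node):
        for child in node[1:]:
            found.extend(find_matches(pattern, child))
    return found


TREE = ["S", ["NP", ["I", "PRP"]],
        ["VP", ["suffer", "VBP"],
               ["PP", ["from", "IN"],
                      ["NP", ["severe", "JJ"], ["heart", "NN"],
                             ["palpitations", "NNS"]]]]]
PATTERN = ["S", ["NP", Var("subject")],
           ["VP", [Var("predicate"), "VBP"], Var("object")]]

print(find_matches(PATTERN, TREE))
```

A matcher of this kind only proposes candidate triplets; whether the captured words actually denote a person, a symptom or a relationship still has to be checked against the mapped resources.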

Excerpt from the file containing the syntactical patterns:


(Tree Pattern)

4.4 Mapping resources (Meta-Model, SNOMED-CT, WordNet)

Each entity (concept or relationship) of the meta-model was manually mapped to its corresponding SNOMED-CT high-level concept or WordNet synset, and the mappings were stored in the file mappings.txt (Table III).

The system has a graph-based implementation. It loads SNOMED-CT (snapshot version) into a graph data structure from the files:

  • sct2_Relationship: which contains the relationships between the medical concepts. There are 58 different types of relationships in SNOMED-CT.

  • sct2_Description: which contains the names of the medical concepts and relationships.

  • sct2_Concept: which contains a list of 317,057 different concepts.

It uses SNOMED-CT to recognize and evaluate medical concepts in the text.
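A minimal sketch of such a loader is given below. It assumes that the release files are tab-separated with the usual RF2 column headers (active, sourceId, destinationId, typeId, conceptId, term); the dictionary-based graph, the function names and the two sample rows are illustrative assumptions rather than the prototype's data structures.

```python
import csv
import io
from collections import defaultdict

IS_A = "116680003"  # SNOMED CT identifier of the "Is a" relationship type


def load_relationships(handle):
    """Build {source_id: {type_id: {destination_ids}}} from a relationship file."""
    graph = defaultdict(lambda: defaultdict(set))
    for row in csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if row["active"] == "1":  # skip retired relationships
            graph[row["sourceId"]][row["typeId"]].add(row["destinationId"])
    return graph


def load_descriptions(handle):
    """Build {concept_id: {lower-cased names}} from a description file."""
    names = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if row["active"] == "1":
            names[row["conceptId"]].add(row["term"].lower())
    return names


# Made-up sample rows; a real snapshot links Pain to Clinical finding
# through intermediate concepts rather than directly.
REL_SAMPLE = (
    "id\teffectiveTime\tactive\tmoduleId\tsourceId\tdestinationId\t"
    "relationshipGroup\ttypeId\tcharacteristicTypeId\tmodifierId\n"
    "1\t20130131\t1\t0\t22253000\t404684003\t0\t116680003\t0\t0\n"
)
DESC_SAMPLE = (
    "id\teffectiveTime\tactive\tmoduleId\tconceptId\tlanguageCode\t"
    "typeId\tterm\tcaseSignificanceId\n"
    "10\t20130131\t1\t0\t22253000\ten\t0\tPain\t0\n"
)

graph = load_relationships(io.StringIO(REL_SAMPLE))
names = load_descriptions(io.StringIO(DESC_SAMPLE))
print(graph["22253000"][IS_A], names["22253000"])
```

With real files, the io.StringIO objects would be replaced by file handles opened on the snapshot files.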

The system also loads WordNet 2.1 as a graph from the files data.noun, data.adj, data.adv and data.verb. Defining all the extraction patterns manually, with every possible synonymous word, would unnecessarily increase the total number of patterns. With WordNet, synonymous words can be queried to enrich the base of patterns automatically. For instance, instead of writing a pattern for each of the words dwell, live and inhabit, one pattern with LIVE_RELATIONSHIP can be defined and associated with the synset Live (02624510). Consequently, all synonyms of live are covered by a single syntactical pattern.
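The effect of attaching a synset to a relationship can be sketched as follows. The in-memory SYNSETS table stands in for the graph built from the WordNet data files, and the names ENTITY_TO_SYNSET and word_matches_entity are hypothetical; no lemmatization is attempted.

```python
# Toy stand-in for the WordNet graph: synset id -> member words (assumed content).
SYNSETS = {"02624510": {"live", "dwell", "inhabit"}}

# One mapping entry replaces one hand-written pattern per synonym.
ENTITY_TO_SYNSET = {"LIVE_RELATIONSHIP": "02624510"}


def word_matches_entity(word, entity):
    """True if the word belongs to the synset associated with the entity."""
    synset_id = ENTITY_TO_SYNSET.get(entity)
    if synset_id is None:
        return False
    return word.lower() in SYNSETS.get(synset_id, set())


for candidate in ("live", "dwell", "inhabit", "suffer"):
    print(candidate, word_matches_entity(candidate, "LIVE_RELATIONSHIP"))
```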

Manually defined words are used to handle cases where a word has no direct match in WordNet or SNOMED-CT, yet can be associated with a specific entity of the meta-model. For instance, pronouns such as “she” or “he” can be associated with the concept Person of the meta-model.

4.5 Reasoning and evaluation of the extracted triplets

The system searches recursively for matches of the patterns in the syntactical tree of each sentence (see Figure 7). One sentence may yield many matching triplets; the system evaluates them and keeps only the accurate ones. This is performed by comparing the words at the leaves of the syntactical tree with the subject, object and predicate entities of the pattern. For example, the system evaluates the concordance between SUFFER_REL and the word “suffer”, between PERSON_CON and “I”, and between SYMPTOM_CON and “from severe heart palpitations”. The system accomplishes this using the mapping file and the following matching algorithm:

# once the mapping is loaded from the file mapping.txt

– if the entity can be null (nl_true) and the word is null, then return true

– else if ∃ special-case match (ow_word), then return the word

– else if ∃ SNOMED match (sn_snomedCpt), then return snomedCpt

– else if ∃ WordNet match (wn_synsetID), then return synsetID

– else the word does not match the entity.
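The cascade above can be sketched in Python as follows. The mapping entries, the lookup tables and every identifier in them are illustrative assumptions (the synset id in particular is a placeholder); only the order of the tests follows the algorithm as stated.

```python
# Illustrative mapping: entity -> resource ids using the nl_/ow_/sn_/wn_ prefixes.
MAPPING = {
    "PERSON_CON": ["ow_i", "ow_she", "ow_he"],
    "SYMPTOM_CON": ["sn_404684003"],   # high-level concept: Clinical finding
    "SUFFER_REL": ["wn_00000000"],     # placeholder synset id
    "SEVERITY_REL": ["nl_true"],       # this predicate may be empty
}
# Names found under each mapped SNOMED-CT concept: {top_id: {name: concept_id}}.
SNOMED_NAMES = {"404684003": {"palpitations": "80313002", "pain": "22253000"}}
# Words of each mapped WordNet synset: {synset_id: {words}}.
WORDNET_WORDS = {"00000000": {"suffer"}}


def match_entity(word, entity):
    """Return True, the word, a concept id or a synset id; None if no match."""
    resources = MAPPING.get(entity, [])
    if word is None:
        return True if "nl_true" in resources else None
    token = word.lower()
    if "ow_" + token in resources:            # special-case words
        return token
    for res in resources:                     # SNOMED-CT lookup
        if res.startswith("sn_") and token in SNOMED_NAMES.get(res[3:], {}):
            return SNOMED_NAMES[res[3:]][token]
    for res in resources:                     # WordNet lookup
        if res.startswith("wn_") and token in WORDNET_WORDS.get(res[3:], set()):
            return res[3:]
    return None


print(match_entity("I", "PERSON_CON"),
      match_entity("palpitations", "SYMPTOM_CON"),
      match_entity("suffer", "SUFFER_REL"))
```

Testing the special-case words first means that a hand-written entry overrides whatever the larger resources would return for the same word.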

When there is no direct match, the reasoning algorithm evaluates the triplet and decides whether to keep it according to a defined threshold value. A lower threshold yields a sparser interpretation, whereas a higher threshold gives a more concise yet more confined interpretation. Here are two evaluation examples of extracted triplets:

O[SYMPTOM:palpitations], S[PERSON:I], P[SUFFER:suffer]

suffer(I,palpitations) Accuracy: 100%

O[SYMPTOM:palpitations], S[SEVERITY:severe], P[SEVERITY:null]

severity(severe,palpitations) Accuracy: 100%

To handle misspellings, which are common in lay descriptions, the system uses the normalized Levenshtein distance (Yujian and Bo, 2007) to measure the resemblance of two strings. To check whether a word entered by a patient is a medical concept, the word is compared with the nomenclature of each concept in SNOMED-CT. The nomenclature of a concept is composed of a fully specified name and a list of synonymous names. If there is a perfect match, the distance equals 0; otherwise, it tends towards 1 as the dissimilarity of the two strings increases. The threshold then determines the tolerable margin of error.
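A sketch of this comparison is shown below. It assumes unit edit costs, for which the normalization of Yujian and Bo (2007) takes the form 2d/(|x| + |y| + d), with d the plain Levenshtein distance; whether the prototype uses exactly this variant, and the threshold value of 0.2, are assumptions made for the example.

```python
def levenshtein(a, b):
    """Plain edit distance with unit costs, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_levenshtein(a, b):
    """Distance in [0, 1]: 0 for identical strings, larger when they differ more."""
    if not a and not b:
        return 0.0
    d = levenshtein(a, b)
    return 2 * d / (len(a) + len(b) + d)


def is_close(word, names, threshold=0.2):
    """True if the word is within the threshold of any name of a concept."""
    return any(normalized_levenshtein(word.lower(), name.lower()) <= threshold
               for name in names)


print(normalized_levenshtein("palpitatons", "palpitations"))
print(is_close("palpitatons", {"palpitations"}))
```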

Each extracted triplet is evaluated (Figure 8). If there is a direct match, the system adds the triplet to the model being constructed. If no match is found, the system first evaluates the coherence of the triplet using the deduction process. This is performed by exploring the hierarchy of the concepts provided in SNOMED-CT following the IS-A relationships. For instance, the concept “Pain” in SNOMED-CT is linked by IS-A relationships to the concept “Clinical Finding” and is therefore a symptom by deduction. The second step is the induction process, which is performed using WordNet. For instance, the sentences “I feel a pain.” and “I suffer from pain.” both refer to the same symptom “Pain”, expressed differently. WordNet is used to evaluate the similarity or closeness of the variant words. The third kind of reasoning is abduction. For instance, the pronoun “I” can refer to a patient or, in general, to a person. By abduction, since a patient suffers from a symptom, it can be inferred that “I” refers to the entity “patient” of the meta-model. If no match is found after all three reasoning processes have been performed, the triplet is judged incoherent and is discarded.
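The deduction step can be illustrated with a breadth-first walk up the IS-A links. The small hierarchy below is a toy example (the intermediate concept is invented for the illustration), and the names PARENTS, TOP_LEVEL_TO_ENTITY and deduce_entity are hypothetical.

```python
from collections import deque

# Toy IS-A hierarchy: concept -> direct parents (illustrative, not real SNOMED-CT data).
PARENTS = {
    "Pain": {"Sensation finding"},           # invented intermediate concept
    "Sensation finding": {"Clinical finding"},
    "Clinical finding": set(),
    "Aspirin": {"Drug"},
    "Drug": set(),
}
# Alignment of top-level concepts with meta-model entities.
TOP_LEVEL_TO_ENTITY = {"Clinical finding": "Symptom", "Drug": "Drug"}


def deduce_entity(concept):
    """Follow IS-A links upwards until a mapped top-level concept is reached."""
    seen = {concept}
    queue = deque([concept])
    while queue:
        current = queue.popleft()
        if current in TOP_LEVEL_TO_ENTITY:
            return TOP_LEVEL_TO_ENTITY[current]
        for parent in PARENTS.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return None  # no mapped ancestor: deduction fails


print(deduce_entity("Pain"), deduce_entity("Aspirin"))
```

The seen set guards against visiting a concept twice, which matters because a SNOMED-CT concept may have several parents.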

4.6 Evaluation

4.6.1 Dataset and task.

To evaluate the system, the corpus of the “ShARe/CLEF eHealth Evaluation Lab 2013” (ShARe/CLEF eHealth, 2013) was used as experimental data. According to Suominen et al. (2013), the main goal of ShARe/CLEF eHealth 2013 is to advance approaches for making clinical text easier to understand and for targeting patients’ information needs in Web search. The organizers defined a set of tasks and presented a corpus, along with standardized evaluation metrics, to evaluate and compare the quality of the proposed solutions.

Among these tasks, Task 1 targets the identification and normalization of Disease/Disorders and Sign/Symptoms in textual medical resources, mapping them to the standardized SNOMED-CT terminology using UMLS CUIs (Concept Unique Identifiers). For Task 1, the organizers provided a corpus of 199 de-identified real-world medical files from US intensive care, comprising Discharge Summaries, Radiology Reports, Echo Reports and ECG Reports (Table IV). The files were annotated manually by trained annotators and provided as a gold corpus for the future evaluation of systems that extract information from medical documents.

4.6.2 Evaluation procedure.

We evaluated the effectiveness of the system in recognizing medical concepts in textual documents using the standardized corpus pertaining to Task 1. The goal is to measure the ability of the system to identify all the medical concepts in the corpus.

First, we defined a set of syntactical extraction patterns. Then, we ran the system on the 199 files composing the corpus. After that, for each file, we collected all the SNOMED-CT concept-ids recognized by the system to compare them with the gold standard. However, as the gold standard uses UMLS-CUI codification, the SNOMED-CT concept-ids extracted by the system from the documents were first converted to their corresponding UMLS-CUI using the file MRCONSO.RRF[6] available in UMLS.
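The conversion step might look like the sketch below. It assumes the pipe-delimited MRCONSO.RRF layout in which the CUI is the first field, the source abbreviation (SAB) the twelfth and the source code (CODE) the fourteenth; the value of the sab parameter depends on the UMLS release and is an assumption here, as are the function name and the two sample rows, which are written for illustration only.

```python
import io


def load_snomed_to_cui(handle, sab="SNOMEDCT_US"):
    """Build {snomed_concept_id: {CUIs}} from MRCONSO-style lines."""
    mapping = {}
    for line in handle:
        fields = line.rstrip("\n").split("|")
        # fields[0] = CUI, fields[11] = SAB, fields[13] = CODE
        if len(fields) > 13 and fields[11] == sab:
            mapping.setdefault(fields[13], set()).add(fields[0])
    return mapping


# Illustrative rows only: one SNOMED-CT row and one row from another source.
SAMPLE = (
    "C0030193|ENG|P|L0030193|PF|S0070467|Y|A0000001||22253000||"
    "SNOMEDCT_US|PT|22253000|Pain|0|N||\n"
    "C0030193|ENG|P|L0030193|PF|S0070467|Y|A0000002||M0015793|D010146|"
    "MSH|MH|D010146|Pain|0|N||\n"
)

snomed_to_cui = load_snomed_to_cui(io.StringIO(SAMPLE))
print(snomed_to_cui)
```

Each concept-id is kept as a set of CUIs so that a possible one-to-many correspondence is not silently collapsed.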

Finally, we computed the standardized evaluation metrics Recall, Precision and F1-Score as follows:

Precision = TruePositives/(TruePositives + FalsePositives)
Recall = TruePositives/(TruePositives + FalseNegatives)
F1Score = 2 * Recall * Precision/(Recall + Precision)

TruePositives = number of concepts identified by the system that match the gold standard;

FalsePositives = number of spurious concepts returned by the system; and

FalseNegatives = number of gold standard concepts missed by the system.
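As a worked illustration of these definitions, the sketch below computes the three metrics for one document from two sets of concept identifiers. The identifiers are invented, and treating the system output and the gold standard as plain sets is a simplifying assumption.

```python
def evaluate(predicted, gold):
    """Return (precision, recall, f1) for two sets of concept identifiers."""
    true_positives = len(predicted & gold)
    false_positives = len(predicted - gold)
    false_negatives = len(gold - predicted)
    found = true_positives + false_positives
    expected = true_positives + false_negatives
    precision = true_positives / found if found else 0.0
    recall = true_positives / expected if expected else 0.0
    total = precision + recall
    f1 = 2 * recall * precision / total if total else 0.0
    return precision, recall, f1


# Invented identifiers: two correct concepts, one spurious, two missed.
predicted = {"C1", "C2", "C3"}
gold = {"C1", "C2", "C4", "C5"}
print(evaluate(predicted, gold))
```

Here Precision is 2/3, Recall is 1/2 and the F1-Score is 4/7, which shows how a system that rarely returns spurious concepts can still be held back by the concepts it misses.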

To compare our findings with the results of the CLEF 2013 campaign, we also computed the Exact Accuracy and the Relaxed Accuracy as follows:

Exact Accuracy = (the number of concepts with correctly generated code)/(the total number of gold standard concepts).

In this case, the system was penalized for incorrect code assignment for annotations that were not detected by the system.

Relaxed Accuracy = (the number of concepts with correctly generated code)/(total number of concepts with strictly correct codes generated by the system).

In this case, the system was only evaluated on annotations that were detected by the system.

4.6.3 Experimental results.

For each input file, the system returned a list of concepts which were compared with the list given in the gold standard. Then, we measured the rate of the correctly identified concepts, the wrongly identified concepts and the missing concepts using the measures defined previously.

The system obtained an F1-Score of 0.799, with a Precision of 0.933 and a Recall of 0.700, hence a Relaxed Accuracy of 0.933 and an Exact Accuracy of 0.700. These results were then compared with those obtained by the systems presented at ShARe/CLEF 2013 for Task 1b (Suominen et al., 2013) (see Table V).

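Overall, the system compared favorably with the systems presented at the campaign: in Table V, it ranks first on strict accuracy (0.700, against 0.589 for the next best system, NCBI.2) and second on relaxed accuracy (0.933, against 0.939 for the best AEHRC run). The meta-model constrained the cases considered during information extraction by contextualizing the medical concepts, which improved the Precision.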

However, concepts are sometimes described using multiple words positioned apart from each other, and hence on different branches of the syntactical tree. For instance, for the sentence “Abdomen is soft, nontender, nondistended, negative bruits.”, taken from the corpus, the system returned the general concept “Bruits” instead of “Abdominal bruits”, which reduced the Recall. This problem can be addressed by defining more complex patterns in future work.

Another impediment is that the medical reports in the corpus were not always written in full sentences. The syntactical trees generated by the parser were therefore not always accurate, which hindered the subsequent extraction process. This leaves room for further improvement of the system, for example by training specific parsers for the intended corpus.

4.6.4 Performance evaluation.

As a second experiment, the prototype was evaluated with ontologies of different sizes to assess its efficiency. SNOMED-CT contains 317,057 concepts in total. We successively loaded 300, 3,000, 30,000, 100,000, 200,000 and finally all 317,057 concepts, and in each case measured the average response time in milliseconds for the analysis of a single document of normal size. Figure 9 plots the size of the ontology (number of concepts) on the x-axis against the average response time (ms) on the y-axis. The response time of the prototype grows linearly with the size of the ontology, that is, with a time complexity of O(N), where N is the number of concepts in the ontology.

5. Results discussion

5.1 Strengths and limitations

Acquiring information from a layman description is a complex task, mainly because of the ambiguous use of vocabulary and the lack of specialized jargon. This paper proposes a structured approach for analyzing and extracting information from such sparse descriptions. The approach uses a meta-model to delimit the information to be acquired and enriches it with expert concepts through different reasoning mechanisms. These mechanisms help to acquire, interpret and check the consistency of the information by referring to expert concepts retrieved from SNOMED-CT.

The adaptability of the approach makes it easier to orient and target the information acquisition process. The meta-model defines the specific information extraction patterns used to build models that are relevant to the current task, in this case answering the doctors’ needs. Instead of imposing a rigid form, the approach lets users express questions or descriptions freely in natural text and then extracts the information from it.

Even though only SNOMED-CT was used in this study, additional ontologies or resources, such as an anatomy ontology or a drug ontology, can be added to increase the scope of the acquired knowledge. In addition, the reasoning mechanisms are separated from the mapping itself, which makes the approach reusable with additional ontologies, as long as the knowledge engineer establishes the mapping of the meta-model to the new resources and manually defines the appropriate syntactical patterns for each new case.

In the proposed prototype, the syntactical extraction patterns were manually defined for different cases. This made the extraction of information accurate, but whenever new cases were encountered in new texts, the corresponding patterns had to be added to the pattern base before they could be handled.

Ontology alignment can be used for better interoperability between ontologies. Mapping the meta-model manually may seem time-consuming, but it is not, because the meta-model is small enough to be mapped to the higher-level concepts of the ontologies. Nevertheless, during the reasoning process, jumping from one ontology to another is crucial for better performance, so matching between those ontologies is important. Doing this matching by hand is certainly time-consuming; using existing pre-matched resources such as UMLS alleviates the task.

5.2 Generalizability and perspectives

While this study aimed at an application in the medical field to demonstrate the effectiveness of the approach, it can be applied to other cases of information acquisition, such as car diagnosis or software specification (Mukhopadhyay and Ameri, 2016). The goal is to enable the system to automatically understand human language through domain ontologies, which supply the necessary expertise, combined with an accurately engineered meta-model, which defines the minimum information to acquire.

6. Conclusion and future research

This study presented the challenging problem of the expert/layman knowledge discrepancy in online communication and proposed a framework that addresses it using domain ontologies. To demonstrate the practical applicability of our approach, we developed and implemented a prototype of the framework for an application in health care. The prototype automates the extraction of medical information from the sparse and potentially ambiguous descriptions formulated by most online health consumers. The novelty of the approach lies in the notion of meta-modelling, which effectively reduces the gap between the vocabulary of the medical experts (SNOMED-CT) and that of the consumers of health and medical services. Most modern information extraction tools perform better on technically inclined documents than on documents written in lay terms. Our approach differs significantly from existing ones in that the meta-model establishes the context that defines the information to extract. In addition, the use of handcrafted syntactical extraction patterns allows a wide margin of adaptability in the task of information extraction.

To evaluate the effectiveness and efficiency of the prototype, two experiments were performed. The first evaluated the prototype’s ability to recognize medical information in textual documents using the standardized gold corpus provided in CLEF 2013. The analysis of the results showed the effectiveness and adaptability of the approach for extracting information from medical documents. The second measured the responsiveness of the system with ontologies of different sizes. The prototype offered satisfactory performance, which indicates that it can scale to additional domain ontologies.

In this work, we focused on the extraction of the main concepts in medical texts. As future work, we plan to extend it towards a complete analysis of patients’ questions. To handle more complex sentences, additional tasks will be included in the prototype, such as recognizing negation and dealing with modalities. We also need to add lexical patterns to handle measures such as medication doses or laboratory test results. The aim is to propose a system that can automatically understand the questions of online health consumers and propose appropriate answers accordingly.


Figure 1. Solution’s pipeline

Figure 2. Bridging the knowledge discrepancy through meta-modelling

Figure 3. A meta-model of medical questions acquisition

Figure 4. Instantiation model of the sentence: “I feel a pain in my heart”

Figure 5. The prototype after analysis of ‘I suffer from severe heart palpitations’

Figure 6. A diagram of the different components of the prototype

Figure 7. A syntactical tree of a sentence and two syntactical matching patterns

Figure 8. The reasoning steps with an illustrative example

Figure 9. Evaluation of the average response time of the system with different sizes of the ontology

Table I. Top-level concepts of SNOMED-CT hierarchy and their respective meta-model alignment

SNOMED concepts (Meta-model entity)
Physical force
Social context (Context)
Qualifier value
Physical object
Body structure (Part of body, Organ)
Clinical finding (Symptom)
Special concept
Linkage concept
Observable entity
Staging and scales
Situation with explicit context (Context)
Environment or geographical location (Context)
Drug (Drug)

Table II. Excerpt of the list of relationships to extract

Relation | Description | Concepts involved
SUFFER | The symptom a person might have | Person suffer symptom
TAKE | A drug a person might be prescribed | Person take drug
HAVE | An antecedent disease a person might have | Person have disease
RELATE | A context in which the person is | Person relate context
TAKE | A medical exam a person might have had | Person take medical exam

Table III. Excerpt of the manually defined mapping between the meta-model entities, SNOMED-CT concepts and WordNet synsets

Meta-model entity Resource ID
ow_she ow_he …
wn_39364 …
sn_404684003 …

mm_: Meta-Model; wn_: WordNet; sn_: SNOMED-CT; nl_: the entity can be null or empty (nothing associated with it); ow_: other words

Table IV. 199 files composing the gold corpus for Task 1

File type Amount
Total: 199

Table V. Comparing our result with the results obtained at CLEF 2013

(a) Sorted by strict accuracy

System ID | Strict accuracy | Relaxed accuracy
Our System | 0.700 | 0.933
NCBI.2 | 0.589 | 0.895
NCBI.1 | 0.587 | 0.897
(Mayo.A).2 | 0.546 | 0.860
(UTHealthCCB.A).1 | 0.514 | 0.728
(UTHealthCCB.A).2 | 0.506 | 0.717
(Mayo.A).1 | 0.502 | 0.870
KPSCMI.1 | 0.443 | 0.865
CLEAR.2 | 0.440 | 0.704
CORAL.2 | 0.439 | 0.902
CORAL.1 | 0.410 | 0.921
CLEAR.1 | 0.409 | 0.713
NIL-UCM.2 | 0.362 | 0.850
NIL-UCM.1 | 0.362 | 0.871
(AEHRC.A).2 | 0.313 | 0.552
(WVU.SS&VJ).1 | 0.309 | 0.622
(UCDCSI.B).1 | 0.299 | 0.509
(WVU.DG&VJ).1 | 0.241 | 0.477
(AEHRC.A).1 | 0.199 | 0.939
(WVU.AJ&VJ).1 | 0.142 | 0.448
(WVU.FP&VJ).1 | 0.112 | 0.252
(UCDCSI.B.2) | 0.006 | 0.035

(b) Sorted by relaxed accuracy

System ID | Strict accuracy | Relaxed accuracy
(AEHRC.A).1 | 0.199 | 0.939
Our System | 0.700 | 0.933
CORAL.1 | 0.410 | 0.921
CORAL.2 | 0.439 | 0.902
NCBI.1 | 0.587 | 0.897
NCBI.2 | 0.589 | 0.895
NIL-UCM.1 | 0.362 | 0.871
(Mayo.A).1 | 0.502 | 0.870
KPSCMI.1 | 0.443 | 0.865
(Mayo.A).2 | 0.546 | 0.860
NIL-UCM.2 | 0.362 | 0.850
(UTHealthCCB.A).1 | 0.514 | 0.728
(UTHealthCCB.A).2 | 0.506 | 0.717
CLEAR.1 | 0.409 | 0.713
CLEAR.2 | 0.440 | 0.704
(WVU.SS&VJ).1 | 0.309 | 0.622
(AEHRC.A).2 | 0.313 | 0.552
(UCDCSI.B).1 | 0.299 | 0.509
(WVU.DG&VJ).1 | 0.241 | 0.477
(WVU.AJ&VJ).1 | 0.142 | 0.448
(WVU.FP&VJ).1 | 0.112 | 0.252
(UCDCSI.B.2) | 0.006 | 0.035



Antolík, J. (2005), “Automatic annotation of medical records”, Studies in Health Technology and Informatics, Vol. 116, p. 817.

Bauer, C. and Dey, A.K. (2016), “Considering context in the design of intelligent systems: current practices and suggestions for improvement”, Journal of Systems and Software, Vol. 112, pp. 26-47, available at: https://doi.org/10.1016/j.jss.2015.10.041

Beez, U., Humm, B.G. and Walsh, P. (2015), “Semantic AutoSuggest for electronic health records”, International Conference on Computational Science and Computational Intelligence (CSCI), Presented at the 2015 International Conference on Computational Science and Computational Intelligence (CSCI), pp. 760-765, available at: https://doi.org/10.1109/CSCI.2015.85

Ben Abacha, A. and Zweigenbaum, P. (2015), “MEANS: a medical question-answering system combining NLP techniques and semantic web technologies”, Information Processing and Management, Vol. 51, pp. 570-594, available at: https://doi.org/10.1016/j.ipm.2015.04.006

Carter, J.H. (2008), Electronic Health Records: A Guide for Clinicians and Administrators, ACP Press, Minneapolis.

Consumer Health Vocabulary Initiative (2013), [WWW Document], available at: http://consumerhealthvocab.org/ (accessed 13 March 2017).

Deleger, L., Ligozat, A.-L., Grouin, C., Zweigenbaum, P. and Neveol, A. (2014), “Annotation of specialized corpora using a comprehensive entity and relation scheme”, Proceedings of the Ninth International Conference on Language Resources and Evaluation. Presented at the LREC’14, European Language Resources Association (ELRA), Reykjavik.

Ely, J.W., Osheroff, J.A., Ebell, M.H., Bergus, G.R., Levy, B.T., Chambliss, M.L. and Evans, E.R. (1999), “Analysis of questions asked by family doctors regarding patient care”, BMJ, Vol. 319 No. 7206, pp. 358-361.

Fleuren, W.W.M. and Alkema, W. (2015), “Application of text mining in the biomedical domain”, Methods, Vol. 74, pp. 97-106, available at: https://doi.org/10.1016/j.ymeth.2015.01.015

Fox, S. and Duggan, M. (2013), Health Online, Pew Res. Cent. Internet Sci. Tech.

Grabar, N. and Hamon, T. (2014), “Automatic extraction of layman names for technical medical terms”, IEEE International Conference on Healthcare Informatics, Presented at the 2014 IEEE International Conference on Healthcare Informatics, pp. 310-319, available at: https://doi.org/10.1109/ICHI.2014.49

Gruber, T.R. (1993), “A translation approach to portable ontology specifications”, Knowledge Acquisition, Vol. 5 No. 2, pp. 199-220, available at: https://doi.org/10.1006/knac.1993.1008

Guo, Q. and Zhang, M. (2008), “Question answering system based on ontology and semantic web”, Rough Sets and Knowledge Technology. Presented at the International Conference on Rough Sets and Knowledge Technology, Springer, Berlin, Heidelberg, pp. 652-659, available at: https://doi.org/10.1007/978-3-540-79721-0_87

Ha, J.F. and Longnecker, N. (2010), “Doctor-patient communication: a review”, The Ochsner Journal, Vol. 10 No. 1, pp. 38-43.

Hammar, K. (2012), Ontology Design Patterns in Use – Lessons Learnt from an Ontology Engineering Case, WOP, Jönköping.

Hammar, K. (2017), Content Ontology Design Patterns: Qualities, Methods, and Tools, Linköping University Electronic Press, Linköping.

Häyrinen, K., Saranto, K. and Nykänen, P. (2008), “Definition, structure, content, use and impacts of electronic health records: a review of the research literature”, International Journal of Medical Informatics, Vol. 77, pp. 291-304, available at: https://doi.org/10.1016/j.ijmedinf.2007.09.001

Hearst, M.A. (1992), “Automatic acquisition of hyponyms from large text corpora”, Proceedings of the 14th Conference on Computational Linguistics – Volume 2, COLING '92, Association for Computational Linguistics, Stroudsburg, PA, pp. 539-545, available at: https://doi.org/10.3115/992133.992154

Hoerbst, A. and Ammenwerth, E. (2010), “Electronic health records. A systematic review on quality requirements”, Methods of Information in Medicine, Vol. 49, pp. 320-336, available at: https://doi.org/10.3414/ME10-01-0038

Jacquemart, P. and Zweigenbaum, P. (2003), “Towards a medical question-answering system: a feasibility study”, Studies in Health Technology and Informatics, Vol. 95, pp. 463-468.

Kandula, S., Curtis, D. and Zeng-Treitler, Q. (2010), “A semantic and syntactic text simplification tool for health content”, AMIA Annual Symposium Proceedings, pp. 366-370.

Kayes, A.S.M., Han, J. and Colman, A. (2015a), “OntCAAC: an ontology-based approach to context-aware access control for software services”, Computer Journal, Vol. 58 No. 11, pp. 3000-3034, available at: https://doi.org/10.1093/comjnl/bxv034

Kayes, A.S.M., Han, J. and Colman, A. (2015b), “An ontological framework for situation-aware access control of software services”, Information Systems, Vol. 53, pp. 253-277, available at: https://doi.org/10.1016/j.is.2015.03.011

Kayes, A.S.M., Rahayu, W., Dillon, T., Chang, E. and Han, J. (2017), “Context-aware access control with imprecise context characterization through a combined fuzzy logic and ontology-based approach”, pp. 132-153, available at: https://doi.org/10.1007/978-3-319-69462-7_10

Lim, E.H.Y., Liu, J.N.K. and Lee, R.S.T. (2011), “Computational knowledge and ontology”, Knowledge Seeker – Ontology Modelling for Information Search and Management, Intelligent Systems Reference Library, Springer, Berlin Heidelberg, pp. 3-12, available at: https://doi.org/10.1007/978-3-642-17916-7_1

Makovsky, I.C. (2013), “Online health research eclipsing patient-doctor conversations”, [WWW Document]. DiD Story, available at: http://thestory.didagency.com/post/62915309979/online-health-research-eclipsing-patient-doctor (accessed 13 March 2017).

Martínez-Costa, C., Karlsson, D. and Schulz, S. (2014), “Ontology patterns for clinical information modelling”, Proceedings of the 5th International Conference on Ontology and Semantic Web Patterns – Volume 1302, WOP’14. Aachen, CEUR-WS.org, pp. 61-72.

MIT Critical Data (2016), Secondary Analysis of Electronic Health Records, Springer International Publishing, New York, NY.

Mortensen, J.M., Horridge, M., Musen, M.A. and Noy, N.F. (2012), “Modest use of ontology design patterns in a repository of biomedical ontologies”, Proceedings of the 3rd International Conference on Ontology Patterns – Volume 929, WOP’12, Aachen, CEUR-WS.org, pp. 37-48.

Mukhopadhyay, A. and Ameri, F. (2016), “An ontological approach to engineering requirement representation and analysis”, Artificial Intelligence for Engineering Design, Analysis and Manufacturing, Vol. 30, pp. 337-352, available at: https://doi.org/10.1017/S0890060416000330

Nie, L., Wang, M., Gao, Y., Zha, Z.J. and Chua, T.S. (2013), “Beyond text QA: multimedia answer generation by harvesting web information”, IEEE Transactions on Multimedia, Vol. 15 No. 2, pp. 426-441, available at: https://doi.org/10.1109/TMM.2012.2229971

Nie, L., Zhao, Y.L., Akbari, M., Shen, J. and Chua, T.S. (2015b), “Bridging the vocabulary gap between health seekers and healthcare knowledge”, IEEE Transactions on Knowledge and Data Engineering, Vol. 27 No. 2, pp. 396-409, available at: https://doi.org/10.1109/TKDE.2014.2330813

Nie, L., Wang, M., Zhang, L., Yan, S., Zhang, B. and Chua, T.S. (2015a), “Disease inference from health-related questions via sparse deep learning”, IEEE Transactions on Knowledge and Data Engineering, Vol. 27 No. 8, pp. 2107-2119, available at: https://doi.org/10.1109/TKDE.2015.2399298

Roberts, K. and Demner-Fushman, D. (2015), “Toward a natural language interface for EHR questions”, AMIA Summits on Translational Science Proceedings, pp. 157-161.

Roberts, A., Gaizauskas, R., Hepple, M., Davis, N., Demetriou, G., Guo, Y., Kola, J.S., Roberts, I., Setzer, A., Tapuria, A. and Wheeldin, B. (2007), “The CLEF corpus: semantic annotation of clinical text”, AMIA Annual Symposium Proceedings, pp. 625-629.

Schneider, K. (2009), Experience and Knowledge Management in Software Engineering, Springer, Berlin, Heidelberg, available at: https://doi.org/10.1007/978-3-540-95880-2

ShARe/CLEF eHealth (2013), [WWW Document], available at: https://sites.google.com/site/shareclefehealth/home (accessed 13 March 2017).

Smith, B. and Fellbaum, C. (2004), “Medical WordNet: a new methodology for the construction and validation of information resources for consumer health”, Proceedings of the 20th International Conference on Computational Linguistics, COLING '04. Association for Computational Linguistics, Stroudsburg, PA, available at: https://doi.org/10.3115/1220355.1220409

SNOMED International (2018), [WWW Document], available at: www.snomed.org/snomed-ct (accessed 1 August 2018).

Suominen, H., Salanterä, S., Velupillai, S., Chapman, W.W., Savova, G., Elhadad, N., Pradhan, S., South, B.R., Mowery, D.L., Jones, G.J.F., Leveling, J., Kelly, L., Goeuriot, L., Martinez, D. and Zuccon, G. (2013), “Overview of the ShARe/CLEF eHealth evaluation lab 2013”, Information Access Evaluation. Multilinguality, Multimodality, and Visualization, Lecture Notes in Computer Science. Presented at the International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, Berlin, Heidelberg, pp. 212-231, available at: https://doi.org/10.1007/978-3-642-40802-1_24

Teutsch, C. (2003), “Patient-doctor communication”, Medical Clinics of North America, Vol. 87 No. 5, pp. 1115-1145, available at: https://doi.org/10.1016/S0025-7125(03)00066-X

Tseytlin, E., Mitchell, K., Legowski, E., Corrigan, J., Chavan, G. and Jacobson, R.S. (2016), “NOBLE – flexible concept recognition for large-scale biomedical natural language processing”, BMC Bioinformatics, Vol. 17, available at: https://doi.org/10.1186/s12859-015-0871-y

van Tellingen, C. (2007), “About hearsay – or reappraisal of the role of the anamnesis as an instrument of meaningful communication”, Netherlands Heart Journal: Monthly Journal of the Netherlands Society of Cardiology and the Netherlands Heart Foundation, Vol. 15 No. 10, pp. 359-362.

Vydiswaran, V.G.V., Mei, Q., Hanauer, D.A. and Zheng, K. (2014), “Mining consumer health vocabulary from community-generated text”, AMIA Annual Symposium Proceedings, pp. 1150-1159.

Yu, H. and Cao, Y. (2008), “Automatically extracting information needs from ad hoc clinical questions”, AMIA Annual Symposium Proceedings, Vol. 2008, pp. 96-100.

Yujian, L. and Bo, L. (2007), “A normalized levenshtein distance metric”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 29 No. 6, pp. 1091-1095, available at: https://doi.org/10.1109/TPAMI.2007.1078

Zeng, Q.T. and Tse, T. (2006), “Exploring and developing consumer health vocabularies”, Journal of the American Medical Informatics Association, Vol. 13 No. 1, pp. 24-29, available at: https://doi.org/10.1197/jamia.M1761

Zhang, S. and Elhadad, N. (2013), “Unsupervised biomedical named entity recognition”, Journal of Biomedical Informatics, Vol. 46 No. 6, pp. 1088-1098, available at: https://doi.org/10.1016/j.jbi.2013.08.004

Zheng, J. and Yu, H. (2016), “Methods for linking EHR notes to education materials”, Information Retrieval Journal, Vol. 19 No. 1-2, pp. 174-188, available at: https://doi.org/10.1007/s10791-015-9263-1

Further reading

Anamnesis (2013), Miller-Keane Encyclopedia and Dictionary of Medicine, Nursing and Allied Health, Seventh Ed.

Corresponding author

Nassim Abdeldjallal Otmani can be contacted at: Nassim.Otmani@irit.fr