Search results
1 – 10 of over 8,000

Fattane Zarrinkalam and Mohsen Kahani
Abstract
Purpose
The purpose of this paper is to propose a novel citation recommendation system that takes a text as input and recommends publications that it should cite. Its goal is to help researchers find related work. Further, this paper explores the effect of using relational features, in addition to textual features, on the quality of recommended citations.
Design/methodology/approach
In order to propose a novel citation recommendation system, first a new relational similarity measure is proposed for calculating the relatedness of two publications. Then, a recommendation algorithm is presented that uses both relational and textual features to compute the semantic distances of publications of a bibliographic dataset from the input text.
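The blended-score idea can be pictured as follows. This is a minimal illustration only: Jaccard overlap stands in for both the paper's textual measure and its relational similarity measure, and the `seed` anchoring heuristic and all names are assumptions, not the authors' algorithm.

```python
def textual_similarity(a_tokens, b_tokens):
    # Jaccard overlap of token sets -- a stand-in for the paper's textual measure.
    sa, sb = set(a_tokens), set(b_tokens)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

def relational_similarity(a_refs, b_refs):
    # Reference overlap (bibliographic coupling) as a toy relational feature.
    sa, sb = set(a_refs), set(b_refs)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

def recommend(query_tokens, corpus, alpha=0.5, k=3):
    # Use the textually closest publication as a relational anchor for the query,
    # then rank every publication by a weighted blend of both similarities.
    seed = max(corpus, key=lambda p: textual_similarity(query_tokens, p["tokens"]))
    ranked = sorted(
        corpus,
        key=lambda p: alpha * textual_similarity(query_tokens, p["tokens"])
        + (1 - alpha) * relational_similarity(seed["refs"], p["refs"]),
        reverse=True,
    )
    return [p["id"] for p in ranked[:k]]

# Tiny demo corpus (hypothetical publications).
CORPUS = [
    {"id": "A", "tokens": ["citation", "recommendation"], "refs": ["r1", "r2"]},
    {"id": "B", "tokens": ["citation", "network"], "refs": ["r1", "r3"]},
    {"id": "C", "tokens": ["image", "retrieval"], "refs": ["r9"]},
]
```

With `alpha` balancing the two signals, a publication that shares both vocabulary and references with the query's textual neighbourhood rises to the top.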
Findings
The evaluation of the proposed system shows that combining relational features with textual features leads to better recommendations, in comparison with relying only on the textual features. It also demonstrates that citation context plays an important role among textual features. In addition, it is concluded that different relational features have different contributions to the proposed similarity measure.
Originality/value
A new citation recommendation system is proposed which uses a novel semantic distance measure based on textual similarities and a new relational similarity concept. Another contribution of this paper is that it sheds more light on the importance of citation context in citation recommendation by providing further evidence through analysis of the results. In addition, a genetic algorithm is developed for assigning weights to the relational features in the similarity measure.
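The weight-assignment step can be sketched with a minimal genetic algorithm that evolves a weight vector against a fitness function. Everything below (population size, operators, the demo fitness recovering a known target weighting) is an assumption for illustration; the abstract does not specify the paper's encoding or operators.

```python
import random

def evolve_weights(fitness, n_weights, pop_size=20, generations=30, seed=0):
    """Evolve a weight vector in [0, 1]^n that maximises `fitness`.
    A sketch only; assumes n_weights >= 2."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(n_weights)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]           # truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_weights)      # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(n_weights)           # point mutation, clamped to [0, 1]
            child[i] = min(1.0, max(0.0, child[i] + rng.uniform(-0.1, 0.1)))
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

# Demo: recover a known target weighting of three relational features.
TARGET = [0.7, 0.2, 0.1]

def demo_fitness(w):
    return -sum((wi - ti) ** 2 for wi, ti in zip(w, TARGET))
```

In the paper's setting the fitness would instead score how well the weighted similarity measure reproduces known citation links.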
Didem Ölçer and Tuğba Taşkaya Temizel
Abstract
Purpose
This paper proposes a framework that automatically assesses content coverage and information quality of health websites for end-users.
Design/methodology/approach
The study investigates the impact of textual and content-based features in predicting the quality of health-related texts. Content-based features were acquired using an evidence-based practice guideline in diabetes. A set of textual features inspired by professional health literacy guidelines and the features commonly used for assessing information quality in other domains were also used. In this study, 60 websites about type 2 diabetes were methodically selected for inclusion. Two general practitioners used DISCERN to assess each website in terms of its content coverage and quality.
Findings
The proposed framework outputs were compared with the experts' evaluation scores. For coverage assessment, the best accuracies were 88% with textual features and 92% with content-based features; when both types of features were used, the proposed framework achieved 90% accuracy. For information quality assessment, the content-based features again resulted in a higher accuracy of 92%, against 88% obtained using the textual features.
Research limitations/implications
The experiments were conducted for websites about type 2 diabetes. As the whole process is costly and requires extensive expert human labelling, the study was carried out in a single domain. However, the methodology is generalizable to other health domains for which evidence-based practice guidelines are available.
Practical implications
Finding high-quality online health information is becoming increasingly difficult due to the high volume of information generated by non-experts in the area. Search engines fail to rank objective health websites higher within the search results. The proposed framework can aid search engine and information platform developers in implementing better retrieval techniques, in turn facilitating end-users' access to high-quality health information.
Social implications
Erroneous, biased or partial health information is a serious problem for end-users who need access to objective information on their health problems. Such information may cause patients to stop their treatments provided by professionals. It might also have adverse financial implications by causing unnecessary expenditures on ineffective treatments. The ability to access high-quality health information has a positive effect on the health of both individuals and the whole society.
Originality/value
The paper demonstrates that automatic assessment of health websites is a domain-specific problem, which cannot be addressed with the general information quality assessment methodologies in the literature. Content coverage of health websites has also been studied in the health domain for the first time in the literature.
Mohamed Hammami, Youssef Chahir and Liming Chen
Abstract
Along with the ever-growing Web comes the proliferation of objectionable content, such as sex, violence and racism, so efficient tools are needed for classifying and filtering undesirable web content. In this paper, we investigate this problem through WebGuard, our automatic machine-learning-based pornographic website classification and filtering system. As the Internet becomes increasingly visual and multimedia-rich, as exemplified by pornographic websites, we focus our attention on skin-color-related visual content-based analysis, alongside textual and structural content-based analysis, to improve pornographic website filtering. While most commercial filtering products on the marketplace rely mainly on textual content-based analysis, such as detecting indicative keywords or checking manually collected black lists, the originality of our work resides in the addition of structural and visual content-based analysis to classical textual content-based analysis, together with several major data-mining techniques for learning and classifying. Tested on a testbed of 400 websites, comprising 200 adult sites and 200 non-pornographic ones, WebGuard, our web filtering engine, scored a 96.1% classification accuracy rate when only textual and structural content-based analysis was used, and 97.4% when skin-color-related visual content-based analysis was added. Further experiments on a black list of 12,311 adult websites manually collected and classified by the French Ministry of Education showed that WebGuard scored an 87.82% classification accuracy rate using only textual and structural content-based analysis, and 95.62% when visual content-based analysis was added. The basic framework of WebGuard can be applied to other website categorization problems which combine, as most do today, textual and visual content.
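To make the visual feature concrete, the sketch below computes a skin-color pixel ratio using a classic explicit RGB rule (Peer et al.). This is a common heuristic of the kind such systems can build on, offered as an assumption, not WebGuard's actual skin model.

```python
def is_skin(r, g, b):
    # Explicit RGB skin rule: bright-enough red-dominant pixels with
    # sufficient channel spread are treated as skin.
    return (r > 95 and g > 40 and b > 20
            and max(r, g, b) - min(r, g, b) > 15
            and abs(r - g) > 15 and r > g and r > b)

def skin_ratio(pixels):
    # Fraction of (r, g, b) pixels classified as skin -- a candidate visual
    # feature to sit alongside textual and structural ones.
    if not pixels:
        return 0.0
    return sum(is_skin(*p) for p in pixels) / len(pixels)
```

A classifier could then threshold or learn on this ratio per page image, combined with the textual and structural features.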
Abstract
Purpose
Micro-video platforms have gained attention in recent years and have also become an important new channel for merchants to advertise their products. Since little research has studied micro-video advertising, this paper aims to fill the research gap by exploring the determinants of micro-video advertising clicks. We form a micro-video advertising click prediction model and demonstrate the effectiveness of the multimodal information extracted from the advertisement producers, commodities being sold and micro-video contents in the prediction task.
Design/methodology/approach
A multimodal analysis framework was developed based on real-world micro-video advertisement datasets. To better capture the relations between different modalities, we adopt a cooperative learning model to predict advertising clicks.
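At a high level, a multimodal click predictor combines per-modality scores. The sketch below shows only a simple weighted late fusion under assumed modality names; the paper's cooperative learning model additionally learns the relations between modalities, which this sketch does not attempt.

```python
MODALITIES = ("visual", "acoustic", "textual", "numerical")

def fuse(scores, weights=None):
    # scores: {modality: click probability from that modality's model}.
    # Returns the weighted-average fused probability (uniform weights by default).
    weights = weights or {m: 1.0 for m in MODALITIES}
    total = sum(weights[m] for m in scores)
    return sum(weights[m] * scores[m] for m in scores) / total

def predict_click(scores, threshold=0.5):
    # Binary click/no-click decision from the fused probability.
    return fuse(scores) >= threshold
```

Learned, non-uniform weights (or a joint model over all modality features) would be the natural next step beyond this baseline.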
Findings
The experimental results show that the features extracted from different data sources improve the prediction performance. Furthermore, the combination of different modal features (visual, acoustic, textual and numerical) is also worth studying. Compared to classical baseline models, the proposed cooperative learning model achieves significantly better prediction results, which demonstrates that the relations between modalities are also important in advertising micro-video generation.
Originality/value
To the best of our knowledge, this is the first study analysing micro-video advertising effects. With the help of our advertising click prediction model, advertisement producers (merchants or their partners) can benefit from generating more effective micro-video advertisements. Furthermore, micro-video platforms can apply our prediction results to optimise their advertisement allocation algorithm and better manage network traffic. This research can be of great help for more effective development of the micro-video advertisement industry.
Yanti Idaya Aspura M.K. and Shahrul Azman Mohd Noah
Abstract
Purpose
The purpose of this study is to reduce the semantic distance by proposing a model for integrating indexes of textual and visual features via a multi-modality ontology and the use of DBpedia to improve the comprehensiveness of the ontology to enhance semantic retrieval.
Design/methodology/approach
A multi-modality ontology-based approach was developed to integrate high-level concepts and low-level features, as well as integrate the ontology base with DBpedia to enrich the knowledge resource. A complete ontology model was also developed to represent the domain of sport news, with image caption keywords and image features. Precision and recall were used as metrics to evaluate the effectiveness of the multi-modality approach, and the outputs were compared with those obtained using a single-modality approach (i.e. textual ontology and visual ontology).
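The per-query evaluation reduces to standard set-based precision and recall over retrieved versus relevant images, which can be stated directly:

```python
def precision_recall(retrieved, relevant):
    # Standard set-based precision and recall for one query:
    # precision = hits / retrieved, recall = hits / relevant.
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Averaging these over the ten test queries gives the system-level figures reported in the findings.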
Findings
The results based on ten queries show a superior performance of the multi-modality ontology-based IMR system integrated with DBpedia in retrieving correct images in accordance with user queries. The system achieved 100 per cent precision for six of the queries and greater than 80 per cent precision for the other four queries. The text-based system only achieved 100 per cent precision for one query; all other queries yielded precision below 50 per cent.
Research limitations/implications
This study only focused on BBC Sport News collection in the year 2009.
Practical implications
The paper includes implications for the development of ontology-based retrieval on image collections.
Originality/value
This study demonstrates the strength of using a multi-modality ontology integrated with DBpedia for image retrieval to overcome the deficiencies of text-based and ontology-based systems. The results validate the combination of semantic text-based retrieval with a multi-modality ontology and DBpedia as a useful model for reducing the semantic distance.
Osamah M. Al-Qershi, Junbum Kwon, Shuning Zhao and Zhaokun Li
Abstract
Purpose
Given the large number of content features, this paper aims to investigate which content features in video and text ads contribute more to accurately predicting the success of crowdfunding, by comparing prediction models.
Design/methodology/approach
With 1,368 features extracted from 15,195 Kickstarter campaigns in the USA, the authors compare base models such as logistic regression (LR) with tree-based homogeneous ensembles such as eXtreme gradient boosting (XGBoost) and heterogeneous ensembles such as XGBoost + LR.
Findings
XGBoost shows higher prediction accuracy than LR (82% vs 69%), in contrast to the findings of a previous relevant study. Regarding important content features, humans (e.g. founders) are more important than visual objects (e.g. products). In both spoken and written language, words related to experience (e.g. eat) or perception (e.g. hear) are more important than cognitive words (e.g. causation). In addition, a focus on the future is more important than a present or past time orientation. Speech aids (e.g. "see" and "compare") that complement visual content are also effective, and a positive tone in speech matters.
Research limitations/implications
This research makes theoretical contributions by identifying the more important visual (human) and language (experience, perception and future time orientation) features. Also, in a multimodal context, complementary cues across different modalities (e.g. speech aids) help. Furthermore, noncontent aspects of speech, such as a positive tone or the pace of speech, are important.
Practical implications
Founders are encouraged to assess and revise the content of their video or text ads as well as their basic campaign features (e.g. goal, duration and reward) before they launch their campaigns. Next, overly complex ensembles may suffer from overfitting problems. In practice, model validation using unseen data is recommended.
Originality/value
Rather than reducing the number of content feature dimensions (Kaminski and Hopp, 2020), this study enables advanced prediction models to accommodate many content features, which raises prediction accuracy substantially.
Shasha Deng, Xuan Cheng and Rong Hu
Abstract
Purpose
Owing to the convenience and anonymity of social media platforms, people with mental illness are increasingly willing to use them to communicate, share information and receive emotional and spiritual support. The purpose of this paper is to identify the degree of depression based on people's behavioral patterns and discussion content on the Internet.
Design/methodology/approach
Based on previous studies on depression, the severity of depression is divided into four defined categories: no significant depressive symptoms, mild MDD, moderate MDD and severe MDD. Next, in order to identify the severity automatically, the authors propose social media digital cues comprising textual lexical features, depressive language features and social behavioral features. Finally, the authors evaluate a system developed on the basis of these digital cues in an experiment using social media data.
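The three cue groups can be illustrated with toy extractors: the mini-lexicon, the feature names and the night-posting heuristic below are all hypothetical placeholders for the curated resources a real system would use.

```python
# Hypothetical mini-lexicon; real systems use curated depressive-language resources.
DEPRESSIVE_TERMS = {"hopeless", "worthless", "tired", "alone", "empty"}

def digital_cues(posts):
    # Toy versions of the three cue groups over posts of the form
    # {"text": ..., "hour": posting hour 0-23}:
    #   F1 textual lexical, F2 depressive language, F3 social behavioral.
    tokens = [w for p in posts for w in p["text"].lower().split()]
    n_tokens = len(tokens) or 1
    n_posts = max(len(posts), 1)
    return {
        "avg_post_len": len(tokens) / n_posts,                                 # F1
        "depressive_ratio": sum(t in DEPRESSIVE_TERMS for t in tokens) / n_tokens,  # F2
        "night_post_ratio": sum(0 <= p["hour"] < 6 for p in posts) / n_posts,  # F3
    }
```

A classifier over such a feature vector would then assign one of the four severity categories.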
Findings
The combination of social media digital cues (textual lexical features, depressive language features and social behavioral features: F1, F2 and F3) performs best in classifying the four levels of depression.
Originality/value
This paper innovatively proposes a social media data-based framework (SMDF) to identify and predict different degrees of depression through social media digital cues and evaluates the accuracy of the detection through social media data, providing useful attempts for the identification and intervention of depression.
Juan Yang, Xu Du, Jui-Long Hung and Chih-hsiung Tu
Abstract
Purpose
Critical thinking is considered important in psychological science because it enables students to make effective decisions and optimizes their performance. Aiming at the challenges of understanding students' critical thinking, the objective of this study is to analyze online discussion data through an advanced multi-feature fusion modeling (MFFM) approach for automatically and accurately understanding students' critical thinking levels.
Design/methodology/approach
An advanced MFFM approach is proposed in this study. Specifically, considering the time-series characteristics of discussion content and the high correlation between adjacent words, a long short-term memory–convolutional neural network (LSTM-CNN) architecture is proposed to extract deep semantic features. These semantic features are then combined with the linguistic and psychological features generated by the LIWC2015 tool as inputs to fully connected layers, which automatically and accurately predict the students' critical thinking levels hidden in online discussion data.
Findings
A series of experiments with 94 students' 7,691 posts were conducted to verify the effectiveness of the proposed approach. The experimental results show that the proposed MFFM approach that combines two types of textual features outperforms baseline methods, and the semantic-based padding can further improve the prediction performance of MFFM. It can achieve 0.8205 overall accuracy and 0.6172 F1 score for the “high” category on the validation dataset. Furthermore, it is found that the semantic features extracted by LSTM-CNN are more powerful for identifying self-introduction or off-topic discussions, while the linguistic, as well as psychological features, can better distinguish the discussion posts with the highest critical thinking level.
Originality/value
With the support of the proposed MFFM approach, online teachers can conveniently and effectively understand the interaction quality of online discussions, which can support instructional decision-making to better promote the student's knowledge construction process and improve learning performance.
Xiaobo Tang, Heshen Zhou and Shixuan Li
Abstract
Purpose
Predicting highly cited papers can enable an evaluation of the potential of papers and the early detection and determination of academic achievement value. However, most highly cited paper prediction studies rely on early citation information, so predicting highly cited papers at the time of publication is challenging. Therefore, the authors propose a method for predicting early highly cited papers based on the papers' own features.
Design/methodology/approach
This research analyzed academic papers published in the Journal of the Association for Computing Machinery (ACM) from 2000 to 2013. Five types of features were extracted: paper features, journal features, author features, reference features and semantic features. Subsequently, the authors applied a deep neural network (DNN), support vector machine (SVM), decision tree (DT) and logistic regression (LGR), and they predicted highly cited papers 1–3 years after publication.
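Of the classifiers compared, logistic regression is the simplest to sketch. Below is a from-scratch version trained by plain gradient descent over two illustrative, assumed features (say, normalised reference count and author reputation); it is a stand-in for the LGR baseline, not the authors' implementation.

```python
import math

def train_logreg(X, y, lr=0.5, epochs=200):
    # Minimal stochastic gradient descent on the log-loss.
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            g = p - yi                        # gradient of log-loss w.r.t. z
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    # 1 = predicted highly cited, 0 = not.
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if z >= 0 else 0
```

In the study, richer models (DNN, SVM, DT) are compared against this kind of linear baseline over the five feature groups.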
Findings
Experimental results showed that early highly cited academic papers are predictable when they are first published, and the authors' prediction models achieved considerable performance. This study further confirmed that the features of references and authors play an important role in predicting early highly cited papers. In addition, the proportion of high-quality journal references has a more significant impact on prediction.
Originality/value
Based on the available information at the time of publication, this study proposed an effective early highly cited paper prediction model. This study facilitates the early discovery and realization of the value of scientific and technological achievements.
Srishti Sharma, Mala Saraswat and Anil Kumar Dubey
Abstract
Purpose
Owing to the increased accessibility of the internet and related technologies, more and more individuals across the globe now turn to social media for their daily dose of news rather than traditional news outlets. With the global nature of social media and hardly any checks on posting content, the spread of fake news can increase exponentially. Businesses propagate fake news to improve their economic standing and to influence consumers and demand, and individuals spread fake news for personal gains such as popularity and life goals. The content of fake news is diverse in terms of topics, styles and media platforms, and fake news attempts to distort the truth with diverse linguistic styles while simultaneously mocking true news. All these factors together make fake news detection an arduous task. This work attempts to check the spread of disinformation on Twitter.
Design/methodology/approach
This study carries out fake news detection using user characteristics and tweet textual content as features. For categorizing user characteristics, this study uses the XGBoost algorithm. To classify the tweet text, this study applies various natural language processing techniques to pre-process the tweets and then a hybrid convolutional neural network–recurrent neural network (CNN-RNN) model and a state-of-the-art Bidirectional Encoder Representations from Transformers (BERT) model.
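A typical tweet clean-up step ahead of a CNN-RNN or BERT pipeline looks like the sketch below: lowercase, strip URLs and @mentions, keep hashtag words but drop the "#", remove punctuation, collapse whitespace. The exact pre-processing used in the study may differ.

```python
import re

def preprocess_tweet(text):
    # Normalise a raw tweet into a clean token stream for downstream models.
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"@\w+", " ", text)           # @mentions
    text = re.sub(r"#", "", text)               # keep hashtag word, drop '#'
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # punctuation
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace
```

Note that for BERT-style models, lighter pre-processing is often preferable, since the tokenizer handles casing and punctuation itself; this aggressive clean-up fits the CNN-RNN branch better.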
Findings
This study uses a combination of machine learning and deep learning approaches for fake news detection, namely, XGBoost, hybrid CNN-RNN and BERT. The models have also been evaluated and compared with various baseline models to show that this approach effectively tackles this problem.
Originality/value
This study proposes a novel framework that exploits news content and social contexts to learn useful representations for predicting fake news. This model is based on a transformer architecture, which facilitates representation learning from fake news data and helps detect fake news easily. This study also carries out an investigative study on the relative importance of content and social context features for the task of detecting false news and whether absence of one of these categories of features hampers the effectiveness of the resultant system. This investigation can go a long way in aiding further research on the subject and for fake news detection in the presence of extremely noisy or unusable data.