Search results

1 – 10 of 52
Article
Publication date: 21 November 2016

Jing Chen, Dan Wang, Quan Lu and Zeyuan Xu

With a mass of electronic multi-topic documents available, there is an increasing need for evaluating emerging analysis tools to help users and digital libraries analyze these…

Abstract

Purpose

With a mass of electronic multi-topic documents available, there is an increasing need for evaluating emerging analysis tools to help users and digital libraries analyze these documents better. The purpose of this paper is to evaluate the effectiveness, efficiency and user satisfaction of THC-DAT, a within-document analysis tool, in reading a multi-topic document.

Design/methodology/approach

The authors first reviewed the related literature, then performed a user-centered, comparative evaluation of two within-document analysis tools, THC-DAT and BOOKMARK. THC-DAT extracts a topic hierarchy tree using the hierarchical latent Dirichlet allocation (hLDA) method and takes context information into account. BOOKMARK provides functionality similar to the Table of Contents bookmarks in Adobe Reader. Three novel kinds of tasks were devised for participants to complete with the two tools, with objective results used to assess reading effectiveness and efficiency. Post-system questionnaires were then employed to obtain participants’ subjective judgments about the tools.

Findings

The results confirm that THC-DAT is significantly more effective than BOOKMARK, while not inferior in efficiency. There is some evidence that THC-DAT can slow the onset of cognitive overload and improve users’ willingness to undertake difficult tasks. Based on qualitative data from the questionnaires, the results indicate that users were more satisfied with THC-DAT than with BOOKMARK.

Practical implications

Adopting THC-DAT in digital libraries or electronic document reading systems helps promote users’ reading performance, willingness to undertake difficult tasks and general satisfaction. Moreover, THC-DAT is of great value in addressing the cognitive overload problem in the information retrieval field.

Originality/value

This paper evaluates a novel within-document analysis tool for analyzing a multi-topic document, and shows that this tool is superior to the benchmark in effectiveness and user satisfaction, and not inferior in efficiency.

Details

Library Hi Tech, vol. 34 no. 4
Type: Research Article
ISSN: 0737-8831

Article
Publication date: 21 March 2016

Jing Chen, Tian Tian Wang and Quan Lu

The purpose of this paper is to propose a novel within-document analysis tool (DAT) topic hierarchy and context-based document analysis tool (THC-DAT) which enables users to…

Abstract

Purpose

The purpose of this paper is to propose a novel within-document analysis tool (DAT), the topic hierarchy and context-based document analysis tool (THC-DAT), which enables users to interactively analyze any multi-topic document based on fine-grained, hierarchical topics automatically extracted from it. THC-DAT uses the hierarchical latent Dirichlet allocation (hLDA) method and takes context information into account, so it can reveal the relationships between latent topics and the related texts in a document.

Design/methodology/approach

The methodology is a case study. The authors reviewed the related literature first, then utilized a general “build and test” research model. After explaining the model, interface and functions of THC-DAT, a case study was presented in which a scholarly paper was analyzed with the tool.

Findings

THC-DAT can organize and serve document topics and texts hierarchically and in context, which overcomes the drawbacks of traditional DATs. The navigation, browse, search and comparison functions of THC-DAT enable users to read, search and analyze multi-topic documents efficiently and effectively.

Practical implications

It can improve document organization and services in digital libraries or e-readers by helping users interactively read, search and analyze documents efficiently and effectively, exploratively learn about unfamiliar topics with little cognitive burden, or deepen their understanding of a document.

Originality/value

This paper designs a tool, THC-DAT, that analyzes documents in a topic hierarchy and context-based (THC) way. It contributes to overcoming the coarse-grained analysis drawbacks of existing within-document DATs.

Details

Library Hi Tech, vol. 34 no. 1
Type: Research Article
ISSN: 0737-8831

Article
Publication date: 23 November 2010

Yongzheng Zhang, Evangelos Milios and Nur Zincir‐Heywood

Summarization of an entire web site with diverse content may lead to a summary heavily biased towards the site's dominant topics. The purpose of this paper is to present a novel…

Abstract

Purpose

Summarization of an entire web site with diverse content may lead to a summary heavily biased towards the site's dominant topics. The purpose of this paper is to present a novel topic‐based framework to address this problem.

Design/methodology/approach

A two‐stage framework is proposed. The first stage identifies the main topics covered in a web site via clustering and the second stage summarizes each topic separately. The proposed system is evaluated by a user study and compared with the single‐topic summarization approach.
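The two-stage idea can be sketched briefly. The paper's system is richer (link analysis, key-phrase and key-sentence extraction); in this minimal illustration, pages are clustered by TF-IDF similarity and each topic's "summary" is simply the page nearest its cluster centroid. The page texts and function name are hypothetical.

```python
# Sketch of the two-stage framework: (1) cluster pages into topics,
# (2) summarize each topic separately, here by picking the page closest
# to its cluster centroid as the topic representative.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def topic_summaries(pages, n_topics=2):
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(pages)
    km = KMeans(n_clusters=n_topics, n_init=10, random_state=0).fit(X)
    summaries = []
    for k in range(n_topics):
        idx = np.where(km.labels_ == k)[0]
        # representative = page nearest the topic centroid
        dists = np.linalg.norm(X[idx].toarray() - km.cluster_centers_[k], axis=1)
        summaries.append(pages[idx[dists.argmin()]])
    return summaries
```

Summarizing per cluster rather than over the whole site is what prevents the dominant topic from swamping the summary.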

Findings

The user study demonstrates that the clustering‐summarization approach statistically significantly outperforms the plain summarization approach in the multi‐topic web site summarization task. Text‐based clustering based on selecting features with high variance over web pages is reliable; outgoing links are useful if a rich set of cross links is available.

Research limitations/implications

More sophisticated clustering methods than those used in this study are worth investigating. The proposed method should be tested on web content that is less structured than organizational web sites, for example blogs.

Practical implications

The proposed summarization framework can be applied to the effective organization of search engine results and faceted or topical browsing of large web sites.

Originality/value

Several key components are integrated for web site summarization for the first time, including feature selection, link analysis, and key phrase and key sentence extraction. Insight was gained into the contributions of links and content to topic-based summarization. A classification approach is used to minimize the number of parameters.

Details

International Journal of Web Information Systems, vol. 6 no. 4
Type: Research Article
ISSN: 1744-0084

Article
Publication date: 5 September 2017

Azadeh Mohebi, Mehri Sedighi and Zahra Zargaran

The purpose of this paper is to introduce an approach for retrieving a set of scientific articles in the field of Information Technology (IT) from a scientific database such as…

Abstract

Purpose

The purpose of this paper is to introduce an approach for retrieving a set of scientific articles in the field of Information Technology (IT) from a scientific database such as Web of Science (WoS), to apply scientometrics indices and compare them with other fields.

Design/methodology/approach

The authors propose to apply a statistical classification-based approach for extracting IT-related articles. In this approach, a probabilistic model of the subject IT is first built using keyphrase extraction techniques. They then retrieve IT-related articles from all Iranian papers in WoS based on a Bayesian classification scheme: using the probabilistic IT model, they assign an IT membership probability to each article in the database and retrieve the articles with the highest probabilities.
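The Bayesian membership scoring can be sketched as follows. Articles are scored by how likely their keyphrases are under an "IT" model versus a background model, then ranked. The phrase probabilities, the prior, and the smoothing constant below are all illustrative, not the paper's actual estimates.

```python
# Sketch: naive Bayes membership score P(IT | keyphrases of an article).
# Phrase probabilities and the class prior are made-up illustrations.
import math

p_it = {"cloud computing": 0.08, "data mining": 0.06, "soil erosion": 0.001}
p_bg = {"cloud computing": 0.01, "data mining": 0.01, "soil erosion": 0.02}
PRIOR_IT = 0.3       # assumed prior probability of the IT class
SMOOTH = 1e-4        # fallback probability for unseen phrases

def it_membership(phrases):
    """Posterior P(IT | phrases), combining per-phrase likelihoods in log space."""
    log_it = math.log(PRIOR_IT)
    log_bg = math.log(1 - PRIOR_IT)
    for ph in phrases:
        log_it += math.log(p_it.get(ph, SMOOTH))
        log_bg += math.log(p_bg.get(ph, SMOOTH))
    # convert the two log scores back to a normalized posterior
    return 1 / (1 + math.exp(log_bg - log_it))
```

Ranking the whole collection by this score and keeping the top articles is the retrieval step the abstract describes.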

Findings

The authors have extracted a set of IT keyphrases, with 1,497 terms through the keyphrase extraction process, for the probabilistic model. They have evaluated the proposed retrieval approach with two approaches: the query-based approach in which the articles are retrieved from WoS using a set of queries composed of limited IT keywords, and the research area-based approach which is based on retrieving the articles using WoS categorizations and research areas. The evaluation and comparison results show that the proposed approach is able to generate more accurate results while retrieving more articles related to IT.

Research limitations/implications

Although this research is limited to the IT subject, it can be generalized to any subject. However, for multidisciplinary topics such as IT, special attention should be given to the keyphrase extraction phase. In this research, a bigram model is used; however, it can be extended to trigrams as well.

Originality/value

This paper introduces an integrated approach for retrieving IT-related documents from a collection of scientific documents. The approach has two main phases: building a model to represent the topic IT, and retrieving documents based on that model. The model is based on a set of keyphrases extracted from a collection of IT articles. However, the extraction technique does not rely on Term Frequency-Inverse Document Frequency, since almost all of the articles in the collection share the same keyphrases. In addition, a probabilistic membership score is defined to retrieve IT articles from a collection of scientific articles.

Article
Publication date: 1 January 1973

E. MICHAEL KEEN

Reports a laboratory comparison of the effectiveness and efficiency of five index languages in the subject area of library and information science; three post‐co‐ordinate…

Abstract

Reports a laboratory comparison of the effectiveness and efficiency of five index languages in the subject area of library and information science; three post‐co‐ordinate languages, Compressed Term, Uncontrolled, and Hierarchically Structured, and two pre‐co‐ordinate ones, Hierarchically Structured and Relational Indexing. Eight test comparisons were made, and factors studied were index language specificity and linkage, indexing specificity and exhaustivity, method of co‐ordination, the precision devices of partitioning and relational operators, and the provision of context in the search file. Full details of the test and retrieval results are presented.

Details

Journal of Documentation, vol. 29 no. 1
Type: Research Article
ISSN: 0022-0418

Article
Publication date: 9 December 2019

Noor Arshad, Abu Bakar, Saira Hanif Soroya, Iqra Safder, Sajjad Haider, Saeed-Ul Hassan, Naif Radi Aljohani, Salem Alelyani and Raheel Nawaz

The purpose of this paper is to present a novel approach for mining scientific trends using topics from Call for Papers (CFP). The work contributes a valuable input for…

Abstract

Purpose

The purpose of this paper is to present a novel approach for mining scientific trends using topics from Calls for Papers (CFPs). The work contributes valuable input for researchers, academics, funding institutes and research administration departments by sharing trends that help set the direction of research paths.

Design/methodology/approach

The authors procure an innovative CFP data set to analyse the scientific evolution and prestige of conferences that set scientific trends, using scientific publications indexed in DBLP. Using Field of Research code 804 from the Australian Research Council, the authors classify 146 conferences (from 2006 to 2015) into different thematic areas by matching terms extracted from publication titles against the Association for Computing Machinery Computing Classification System. Furthermore, the authors enrich the vocabulary of terms from the WordNet dictionary and the Growbag data set. To measure the significance of terms, the authors adopt the following weighting schemas: probabilistic, gram, relative, accumulative and hierarchical.
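The trend-measurement step can be sketched in a few lines: tally how often a thematic term occurs in CFP text per year and report its relative frequency over time. The CFP snippets and term below are invented, and the paper's five weighting schemas are considerably richer than this single relative count.

```python
# Sketch: per-year relative frequency of a thematic term across CFPs.
from collections import Counter

cfps = [  # (year, CFP text) pairs; purely illustrative
    (2013, "call for papers semantic web and linked data"),
    (2014, "big data analytics and cloud platforms"),
    (2015, "privacy and security in big data analytics"),
    (2015, "deep learning for big data analytics"),
]

def term_trend(term):
    """Fraction of CFPs per year that mention `term`."""
    hits = Counter(year for year, text in cfps if term in text)
    totals = Counter(year for year, _ in cfps)
    return {y: hits[y] / totals[y] for y in sorted(totals)}

trend = term_trend("big data analytics")
```

A rising curve in such a trend dictionary is what the abstract means by the "rise" of a topic across CFP years.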

Findings

The results indicate the rise of “big data analytics” in CFP topics over the last few years. Whereas topics related to “privacy and security” show an exponential increase, topics related to “semantic web” show a decline in recent years. While analysing publication output in DBLP that matches CFPs indexed in ERA Core A* to C rank conferences, the authors find that A* and A tier conferences do not solely set publication trends, since B and C tier conferences target similar CFPs.

Originality/value

Overall, the analyses presented in this research are valuable for the scientific community and research administrators in studying research trends and improving the data management of digital libraries pertaining to the scientific literature.

Details

Library Hi Tech, vol. 40 no. 1
Type: Research Article
ISSN: 0737-8831

Article
Publication date: 20 November 2017

Quan Lu, Qingjun Liu, Jing Chen and Ji Li

Since researchers have utilized text signals to develop a mass of within-document visualization analysis tools for reading aid in a long document, there is an increasing need to…

Abstract

Purpose

Since researchers have utilized text signals to develop a mass of within-document visualization analysis tools that aid reading of long documents, there is an increasing need to study the relationship between readers’ behavior in using text signals for navigation and their reading performance with these tools. The purpose of this paper is to combine text-signal usage behavior and reading performance in two kinds of analysis tools to verify their relationship, and to discover whether there are efficient reading strategies for using text signals to navigate a long document.

Design/methodology/approach

The methodology is a case study. The authors first reviewed the related literature. After explaining the design ideas, interfaces and functions of THC-DAT and BOOKMARK, two reading tools utilizing the two main kinds of text signals (one using topics and the other using headings as reading aids), a case study was presented to collect participants’ click data on the text signals along with their reading effectiveness (score) and efficiency (time).

Findings

The results confirm that text-signal usage behavior for navigation has a significant impact on reading efficiency and no impact on reading effectiveness in both BOOKMARK and THC-DAT. The degree of dispersion of click behavior on text signals has an impact on reading efficiency, and usage behavior for different types of text signals has different impacts on reading efficiency.

Research limitations/implications

Distributing navigation time across text signals evenly can help improve reading efficiency. A basic strategy suggested to readers is to focus on reducing the time spent finding answers when using text signals to navigate a long document. For the two different kinds of text signals, readers can adopt different strategies. Accordingly, personalized recommendation based on the interval between adjacent clicks will help improve computer-aided reading tools.
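One plausible way to quantify how evenly a reader distributes navigation time is the coefficient of variation of the intervals between adjacent clicks (lower means more even). The statistic and the timestamps below are illustrative assumptions; the paper does not prescribe this exact measure.

```python
# Sketch: measure evenness of navigation via adjacent-click intervals.
import statistics

def click_evenness(timestamps):
    """Coefficient of variation of intervals between adjacent clicks (seconds).
    0 means perfectly even clicking; larger values mean burstier behavior."""
    intervals = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return statistics.stdev(intervals) / statistics.mean(intervals)

even_reader = [0, 30, 60, 90, 120]    # clicks spread evenly through the session
bursty_reader = [0, 2, 4, 110, 120]   # clicks clustered at the start and end
```

A reading tool could compute this online from click logs and nudge bursty readers toward more even navigation.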

Originality/value

This paper combines text-signal usage behavior for navigation with reading performance in two kinds of visual analysis tools, studies the relationship between them, and discovers some efficient reading strategies for using text signals to navigate a long document.

Details

Library Hi Tech, vol. 35 no. 4
Type: Research Article
ISSN: 0737-8831

Article
Publication date: 11 November 2014

Mihaela Dinsoreanu and Rodica Potolea

The purpose of this paper is to address the challenge of opinion mining in text documents to perform further analysis such as community detection and consistency control. More…

Abstract

Purpose

The purpose of this paper is to address the challenge of opinion mining in text documents to perform further analysis such as community detection and consistency control. More specifically, we aim to identify and extract opinions from natural language documents and to represent them in a structured manner to identify communities of opinion holders based on their common opinions. Another goal is to rapidly identify similar or contradictory opinions on a target issued by different holders.

Design/methodology/approach

For the opinion extraction problem we opted for a supervised approach focusing on the feature selection problem to improve our classification results. On the community detection problem, we rely on the Infomap community detection algorithm and the multi-scale community detection framework used on a graph representation based on the available opinions and social data.
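The community-detection stage can be sketched on a small holder graph. The paper uses the Infomap algorithm, which requires a separate package; as a stand-in, the sketch below uses NetworkX's modularity-based greedy method on the same kind of graph, where nodes are opinion holders and edge weights count shared opinions. The holders and weights are hypothetical.

```python
# Sketch: detect communities of opinion holders from shared-opinion edges.
# Modularity-based greedy detection stands in for the paper's Infomap.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.Graph()
# edge weight = number of opinions two holders share
G.add_weighted_edges_from([
    ("ann", "bob", 3), ("bob", "cat", 2), ("ann", "cat", 2),  # group 1
    ("dan", "eve", 3), ("eve", "fay", 2), ("dan", "fay", 2),  # group 2
    ("cat", "dan", 1),                                        # weak bridge
])
communities = [set(c) for c in greedy_modularity_communities(G, weight="weight")]
```

The weak bridge between the two tightly knit triangles is correctly cut, splitting the holders into two opinion communities.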

Findings

Classification performance in terms of precision and recall was significantly improved by adding a set of “meta-features” based on grouping rules for certain parts of speech (POS) instead of the actual words. To evaluate the community detection feature, we used two quality metrics: network modularity and normalized mutual information (NMI). We evaluated seven one-target similarity functions and ten multi-target aggregation functions, and concluded that linear functions perform poorly for data sets with multiple targets, while functions that calculate the average similarity have greater resilience to noise.

Originality/value

Although our solution relies on existing approaches, we adapted and integrated them in an efficient manner. Based on the initial experimental results, we integrated original enhancements that improve performance.

Details

International Journal of Web Information Systems, vol. 10 no. 4
Type: Research Article
ISSN: 1744-0084

Article
Publication date: 2 February 2015

Jiunn-Liang Guo, Hei-Chia Wang and Ming-Way Lai

The purpose of this paper is to develop a novel feature selection approach for automatic text classification of large digital documents – e-books of online library system. The…

Abstract

Purpose

The purpose of this paper is to develop a novel feature selection approach for automatic text classification of large digital documents, namely the e-books of online library systems. The main idea is to automatically identify discourse features in order to improve the feature selection process, rather than focusing on the size of the corpus.

Design/methodology/approach

The proposed framework automatically identifies the discourse segments within e-books and captures discourse subtopics that are cohesively expressed in those segments, treating these subtopics as informative and prominent features. The selected set of features is then used to train and perform the e-book classification task based on the support vector machine (SVM) technique.
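The final classification step can be sketched with a standard scikit-learn pipeline. Note that the paper's actual contribution, discourse segmentation and subtopic feature selection, is not shown here; plain TF-IDF features stand in for the selected subtopic features, and the training snippets and labels are invented.

```python
# Sketch: SVM text classification over (stand-in) selected features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

texts = [  # invented e-book snippets standing in for subtopic features
    "the dragon guarded the castle while knights planned their quest",
    "the wizard cast a spell over the enchanted forest",
    "compound interest grows savings through quarterly reinvestment",
    "diversified portfolios reduce risk in volatile stock markets",
]
labels = ["fantasy", "fantasy", "finance", "finance"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)
pred = clf.predict(["the knight rode into the enchanted castle"])[0]
```

In the paper's framework, the vectorizer stage would be replaced by the discourse-subtopic feature selector; the SVM stage stays the same.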

Findings

The evaluation of the proposed framework shows that identifying discourse segments and capturing subtopic features leads to better performance in comparison with two conventional feature selection techniques: TF-IDF and mutual information. It also demonstrates that discourse features play important roles among textual features, especially for large documents such as e-books.

Research limitations/implications

Automatically extracted subtopic features cannot be entered directly into the feature selection (FS) process but require control of a threshold.

Practical implications

The proposed technique demonstrates the promising application of discourse analysis to enhance the classification of large digital documents such as e-books, as compared with conventional techniques.

Originality/value

A new FS technique is proposed that can inspect the narrative structure of large documents; it is new to the text classification domain. Another contribution is that it encourages the consideration of discourse information in future text analysis by providing more evidence through evaluation of the results. The proposed system can be integrated into other library management systems.

Details

Program, vol. 49 no. 1
Type: Research Article
ISSN: 0033-0337

Article
Publication date: 2 February 2022

Deepak Suresh Asudani, Naresh Kumar Nagwani and Pradeep Singh

Classifying emails as ham or spam based on their content is essential. Determining the semantic and syntactic meaning of words and putting them in a high-dimensional feature…

Abstract

Purpose

Classifying emails as ham or spam based on their content is essential. Determining the semantic and syntactic meaning of words and putting them in a high-dimensional feature vector form for processing is the most difficult challenge in email categorization. The purpose of this paper is to examine the effectiveness of the pre-trained embedding model for the classification of emails using deep learning classifiers such as the long short-term memory (LSTM) model and convolutional neural network (CNN) model.

Design/methodology/approach

In this paper, Global Vectors (GloVe) and Bidirectional Encoder Representations from Transformers (BERT) pre-trained word embeddings are used to identify relationships between words, which helps classify emails into their relevant categories using machine learning and deep learning models. Two benchmark datasets, SpamAssassin and Enron, are used in the experimentation.
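The key preprocessing step behind this approach, building an embedding matrix that maps each vocabulary word to its pre-trained vector so it can initialize an LSTM or CNN embedding layer, can be sketched as follows. The tiny 3-dimensional "GloVe" vectors and vocabulary below are fake; real GloVe files ship 50- to 300-dimensional vectors parsed from text.

```python
# Sketch: build an embedding matrix from (stand-in) pre-trained GloVe vectors.
import numpy as np

glove = {  # stand-in for vectors parsed from a real glove.*.txt file
    "free": np.array([0.1, 0.9, 0.3]),
    "money": np.array([0.2, 0.8, 0.4]),
    "meeting": np.array([0.7, 0.1, 0.5]),
}
vocab = {"<pad>": 0, "free": 1, "money": 2, "meeting": 3, "xyzzy": 4}
dim = 3

emb = np.zeros((len(vocab), dim))   # row i = vector for word with id i
for word, i in vocab.items():
    if word in glove:
        emb[i] = glove[word]        # known word: copy its pre-trained vector
    # unknown words (<pad>, out-of-vocabulary) stay at the zero vector

# a document becomes a sequence of row indices into `emb`
doc = [vocab["free"], vocab["money"], vocab["xyzzy"]]
```

This matrix is exactly what gets loaded as the (frozen or fine-tuned) weights of the embedding layer feeding the LSTM or CNN classifier.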

Findings

In the first set of experiments, among the machine learning classifiers, the support vector machine (SVM) model performs better than the other machine learning methodologies. The second set of experiments compares deep learning model performance without embedding, with GloVe embedding and with BERT embedding. The experiments show that GloVe embedding can be helpful for faster execution with better performance on large-sized datasets.

Originality/value

The experiments reveal that the CNN model with GloVe embedding gives slightly better accuracy than the model with BERT embedding and than traditional machine learning algorithms for classifying an email as ham or spam. It is concluded that word embedding models improve email classifiers’ accuracy.

Details

Data Technologies and Applications, vol. 56 no. 4
Type: Research Article
ISSN: 2514-9288
