Search results

1 – 6 of 6
Article
Publication date: 3 June 2019

Hongqi Han, Yongsheng Yu, Lijun Wang, Xiaorui Zhai, Yaxin Ran and Jingpeng Han

Abstract

Purpose

The aim of this study is to present a novel approach based on semantic fingerprinting, which converts inventor records into 128-bit semantic fingerprints, and a clustering algorithm, density-based spatial clustering of applications with noise (DBSCAN). Inventor disambiguation is the task of discovering the unique set of underlying inventors and mapping a set of patents to their corresponding inventors. Resolving ambiguities between inventors is necessary to improve the quality of patent databases and to ensure accurate entity-level analysis. Most existing methods are based on machine learning and, while they often show good performance, this comes at the cost of time, computational power and storage space.

Design/methodology/approach

Using semantic fingerprinting, the meta and textual data in inventor records are converted into 128-bit semantic fingerprints. Rather than a string comparison or cosine similarity, a binary number comparison function is used in DBSCAN to calculate the distance between pair-wise fingerprint records. DBSCAN then clusters the inventor records based on this distance to disambiguate inventor names.
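To make the distance-based clustering concrete, here is a minimal sketch, assuming scikit-learn and random stand-in fingerprints; the eps and min_samples values are illustrative, not taken from the paper.

```python
# Minimal sketch: DBSCAN over 128-bit fingerprints with Hamming distance.
# The fingerprints here are random stand-ins for semantic fingerprints
# derived from inventor meta and text data.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
fingerprints = rng.integers(0, 2, size=(10, 128))  # 10 records, 128 bits each

# 'hamming' is the *fraction* of differing bits, so eps=0.15 means two
# records may differ in at most ~19 of 128 bits to count as neighbours.
labels = DBSCAN(eps=0.15, min_samples=2, metric="hamming").fit_predict(fingerprints)

# Records sharing a label are resolved to the same inventor;
# label -1 marks records DBSCAN treats as noise.
print(labels)
```

In practice the same comparison can be done even more cheaply on packed integers (XOR plus popcount), which fits the abstract's emphasis on reduced running time and storage.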

Findings

Experiments conducted on the PatentsView campaign database of the United States Patent and Trademark Office show that this method disambiguates inventor names with recall greater than 99 per cent, in less time and with substantially smaller storage requirements.

Research limitations/implications

A better semantic fingerprint algorithm and a better distance function may improve precision. Setting different clustering parameters for each block, or using other clustering algorithms, will be considered to improve the accuracy of the disambiguation results even further.

Originality/value

Compared with the existing methods, the proposed method does not rely on feature selection and complex feature comparison computation. Most importantly, running time and storage requirements are drastically reduced.

Details

The Electronic Library, vol. 37 no. 2
Type: Research Article
ISSN: 0264-0473

Open Access
Article
Publication date: 9 April 2020

Xiaodong Zhang, Ping Li, Xiaoning Ma and Yanjun Liu

Abstract

Purpose

Operating wagon records are produced by distinct railway information systems, so the wagon routing records for the same origin–destination (OD) pair can differ. This phenomenon has brought considerable difficulties to railway wagon flow forecasting. Some differences are caused by poor data quality, which misleads the prediction, while others reflect the existence of other actual wagon routings. This paper aims at finding all the wagon routing locus patterns in the historical records and puts forward an intelligent recognition method for the actual routing locus patterns of railway wagon flow, based on the SST algorithm.

Design/methodology/approach

Based on the big data of railway wagon flow records, a routing metadata model is constructed, and historical and real-time data are fused to improve the reliability of the railway wagon flow forecast. Based on the division of spatial characteristics and dimension reduction at the distributary stations, an improved Simhash algorithm is used to calculate the routing fingerprint. Combined with the Squared Error Adjacency Matrix Clustering algorithm and the Tarjan algorithm, fingerprint similarities are calculated, the spatial characteristics are clustered and identified, the routing locus patterns are formed and the actual wagon flow routing locus is thus recognized intelligently.
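As a rough illustration of the fingerprinting idea (classic Simhash rather than the paper's improved variant), the sketch below hashes hypothetical station features into a fingerprint so that routings sharing most of their stations end up a small Hamming distance apart.

```python
import hashlib

def simhash(features, bits=64):
    # Classic Simhash: each (feature, weight) votes +w on hash bits
    # equal to 1 and -w on bits equal to 0; the sign of each tally
    # becomes one bit of the fingerprint.
    tally = [0] * bits
    for feat, weight in features:
        h = int(hashlib.md5(feat.encode()).hexdigest(), 16)
        for i in range(bits):
            tally[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(bits) if tally[i] > 0)

def hamming(a, b):
    # Number of differing bits between two fingerprints.
    return bin(a ^ b).count("1")

# Hypothetical routings sharing two of three distributary stations:
r1 = simhash([("station_A", 1), ("station_B", 1), ("station_C", 1)])
r2 = simhash([("station_A", 1), ("station_B", 1), ("station_D", 1)])
print(hamming(r1, r2))  # small relative to 64 bits
```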

Findings

This paper puts forward a realistic railway wagon routing pattern recognition algorithm. The traditional railway wagon routing planning problem is converted into a routing locus pattern recognition problem, and the wagon routing patterns of all OD streams are mined from the historical data. The analysis is carried out from three aspects: routing metadata, routing locus fingerprints and routing locus patterns. An SST-based algorithm for the intelligent recognition of railway wagon routing locus patterns is then proposed, which combines historical and real-time data to improve the reliability of the routing selection results. Finally, the railway wagon routing loci can be identified accurately, and a case study tests the validity of the algorithm.

Practical implications

Before railway wagon flow can be forecast, it is necessary to know how many kinds of wagon routing locus exist for a given OD pair. Mining all the OD routing locus patterns from railway wagon operating records helps forecast future routings in combination with wagon characteristics. The work in this paper is the basis of the railway wagon routing forecast.

Originality/value

As the basis of the railway wagon routing forecast, this research not only improves the accuracy and efficiency of the forecast but also provides further decision-making support for the organization of railway freight transportation.

Details

Smart and Resilient Transportation, vol. 2 no. 1
Type: Research Article
ISSN: 2632-0487

Article
Publication date: 2 March 2023

Hajar Fatemi, Erica Kao, R. Sandra Schillo, Wanyu Li, Pan Du, Nie Jian-Yun and Laurette Dube

Abstract

Purpose

This paper examines user-generated social media content bearing on consumers’ attitude and belief systems, taking the domain of natural food products as an illustrative case. The research sheds light on how consumers think and talk about natural food within the context of food well-being and health.

Design/methodology/approach

The authors used a keyword-based approach to extract user-generated content from Twitter and used both the food-as-well-being and food-as-health frameworks in the analysis of more than two million tweets.
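As a minimal illustration of such keyword-based extraction, the sketch below filters a toy tweet list; the keyword set is a hypothetical stand-in for the authors' actual query terms.

```python
# Minimal sketch: keep only tweets whose text matches a keyword list.
# KEYWORDS is a hypothetical stand-in for the authors' actual terms.
KEYWORDS = {"natural", "organic", "additive-free"}

def is_relevant(text: str) -> bool:
    lowered = text.lower()
    return any(keyword in lowered for keyword in KEYWORDS)

tweets = [
    "Love this all-natural granola!",
    "New phone launch scheduled for Friday.",
]
print([t for t in tweets if is_relevant(t)])  # keeps only the first tweet
```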

Findings

The authors found that consumers mostly discuss food marketing and less frequently discuss food policy. Their results show that tweets regarding naturalness were significantly less frequent in food categories that inherently feature naturalness, e.g. fruits and vegetables, than in food categories dominated by technology, processing and man-made innovation, such as proteins, seasonings and snacks.

Research limitations/implications

This paper provides numerous implications and contributions to the literature on consumer behavior, marketing and public policy in the domain of natural food.

Practical implications

The authors’ exploratory findings can be used to guide food system stakeholders, farmers and food processors to obtain insights into consumers' mindset on food products, novel concepts, systems and diets through social media analytics.

Originality/value

The authors’ results contribute to the literature on the use of social media in food marketing, to the understanding of consumers' attitudes and beliefs toward natural food, and to the food-as-well-being and food-as-health literatures by examining the way consumers think about natural (versus man-made) food using user-generated Twitter content, which has not previously been used for this purpose.

Details

British Food Journal, vol. 125 no. 9
Type: Research Article
ISSN: 0007-070X

Article
Publication date: 4 April 2016

Ilija Subasic, Nebojsa Gvozdenovic and Kris Jack

Abstract

Purpose

The purpose of this paper is to describe a large-scale algorithm for generating a catalogue of scientific publication records (citations) from crowd-sourced data, to demonstrate how to learn an optimal combination of distance metrics for duplicate detection and to introduce a parallel duplicate clustering algorithm.

Design/methodology/approach

The authors developed the algorithm and compared it with state-of-the-art systems tackling the same problem. They used benchmark data sets (3k data points) to test the effectiveness of the algorithm and real-life data (>90 million data points) to test its efficiency and scalability.

Findings

The authors show that duplicate detection can be improved by an additional step they call duplicate clustering. They also show how to improve the efficiency of the map/reduce similarity calculation algorithm by introducing a sampling step. Finally, they find that the system is comparable to state-of-the-art systems for duplicate detection and that it can scale to hundreds of millions of data points.

Research limitations/implications

Academic researchers can use this paper to understand some of the issues of transitivity in duplicate detection and its effects on digital catalogue generation.

Practical implications

Industry practitioners can use this paper as a case study of a large-scale, real-life catalogue generation system that deals with millions of records in a scalable and efficient way.

Originality/value

In contrast to other similarity calculation algorithms developed for m/r frameworks, the authors present a variant of similarity calculation that is optimized for duplicate detection of bibliographic records, extending the previously proposed e-algorithm based on inverted index creation. In addition, the authors are concerned with more than duplicate detection and investigate how to group the detected duplicates. They develop distinct algorithms for duplicate detection and duplicate clustering and use the canopy clustering idea for multi-pass clustering. The work extends the current state-of-the-art by including the duplicate clustering step and demonstrates new strategies for speeding up m/r similarity calculations.
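As a rough sketch of the canopy idea mentioned above, a cheap token-overlap distance first groups records into loose, possibly overlapping canopies, inside which the expensive duplicate comparison would then run; the distance function and thresholds here are illustrative, not the paper's learned metric combination.

```python
# Minimal canopy clustering sketch over citation titles. A cheap
# Jaccard distance builds loose canopies; canopies may overlap, and
# only records inside a canopy would face the expensive comparison.
def cheap_distance(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(ta & tb) / max(len(ta | tb), 1)

def canopies(records, t_loose=0.7, t_tight=0.3):
    remaining = list(records)
    groups = []
    while remaining:
        center = remaining.pop(0)
        canopy = [center]
        survivors = []
        for record in remaining:
            d = cheap_distance(center, record)
            if d < t_loose:
                canopy.append(record)     # candidate duplicate of center
            if d >= t_tight:
                survivors.append(record)  # may still seed/join other canopies
        remaining = survivors
        groups.append(canopy)
    return groups

titles = [
    "Deep learning for NLP",
    "Deep learning for NLP.",
    "Graph databases in practice",
]
print(canopies(titles))
```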

Details

Program, vol. 50 no. 2
Type: Research Article
ISSN: 0033-0337

Article
Publication date: 21 November 2018

Ahmed Amir Tazibt and Farida Aoughlis

Abstract

Purpose

During crises such as accidents or disasters, an enormous volume of information is generated on the Web. Both the public and decision-makers often need to identify relevant and timely content that can help them understand what is happening and take the right decisions as soon as it appears online. However, relevant content can be scattered across document streams, and the available information can contain redundant content published by different sources. The need for the automatic construction of summaries that aggregate important, non-redundant and non-outdated pieces of information is therefore becoming critical.

Design/methodology/approach

The aim of this paper is to present a new temporal summarization approach based on a popular topic model from the information retrieval field, Latent Dirichlet Allocation (LDA). The approach consists of filtering documents over streams, extracting relevant pieces of information and then using topic modeling to reveal their underlying aspects, so that the most relevant and novel pieces of information are added to the summary.
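As a minimal sketch of the topic-space novelty test this implies, assuming scikit-learn (the documents, component count and threshold are illustrative, not the paper's configuration):

```python
# Minimal sketch: map stream documents into LDA topic space and admit
# a sentence to the summary only if its topic mixture is sufficiently
# far (L1 distance) from everything already in the summary.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "flood waters rise in the city center",
    "rescue teams evacuate residents from flooded homes",
    "stock markets close higher on tech gains",
]
X = CountVectorizer(stop_words="english").fit_transform(docs)
topic_mix = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(X)

def is_novel(candidate, summary, threshold=0.2):
    return all(np.abs(candidate - kept).sum() > threshold for kept in summary)

summary = [topic_mix[0]]                # summary seeded with the first doc
print(is_novel(topic_mix[2], summary))  # off-topic doc is likely novel
print(is_novel(topic_mix[0], summary))  # duplicate of the seed: False
```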

Findings

The performance evaluation of the proposed LDA-based temporal summarization approach, carried out on the TREC Temporal Summarization 2014 framework, clearly demonstrates its effectiveness in providing short and precise summaries of events.

Originality/value

Unlike most state-of-the-art approaches, the proposed method determines the importance of the pieces of information to be added to the summaries by relying solely on their representation in the topic space provided by Latent Dirichlet Allocation, without the use of any external source of evidence.

Details

International Journal of Web Information Systems, vol. 15 no. 1
Type: Research Article
ISSN: 1744-0084

Article
Publication date: 9 February 2018

Arshad Ahmad, Chong Feng, Shi Ge and Abdallah Yousif

Abstract

Purpose

Software developers extensively use Stack Overflow (SO) for knowledge sharing on software development. Software engineering researchers have thus started mining the structured and unstructured data present in software repositories, including the Q&A software developer community SO, with the aim of improving software development. The purpose of this paper is to show how academics and practitioners can benefit from the valuable user-generated content shared on various online social networks, specifically the Q&A community SO, for software development.

Design/methodology/approach

A comprehensive literature review was conducted, categorizing 166 research papers on SO relating to software development published between the inception of SO and June 2016.

Findings

Most of the studies revolve around a limited number of software development tasks; approximately 70 percent of the papers used data from millions of posts, applied basic machine learning methods and conducted semi-automatic, quantitative investigations. Future research should therefore focus on overcoming the identified challenges and gaps.

Practical implications

The work on SO is classified into two main categories: “SO design and usage” and “SO content applications.” These categories not only give Q&A forum providers insights into the shortcomings in the design and usage of such forums but also suggest ways to overcome them in the future. They also enable software developers to exploit such forums for the identified under-utilized software development tasks.

Originality/value

The study is the first of its kind to explore work on SO relating to software development, and it makes an original contribution by presenting a comprehensive review, identifying design and usage shortcomings of Q&A sites and outlining future research challenges.

Details

Data Technologies and Applications, vol. 52 no. 2
Type: Research Article
ISSN: 2514-9288

1 – 6 of 6