Search results

1 – 10 of over 4000
Article
Publication date: 23 November 2010

Hao Han and Takehiro Tokuda

Abstract

Purpose

The purpose of this paper is to present a method to realize the flexible and lightweight integration of general web applications.

Design/methodology/approach

Information extraction and functionality emulation methods are proposed to realize web information integration for general web applications. All web information searching, submitting and extraction processes run at the client side through end‐user programming, as if invoking a real web service.

Findings

The implementation shows that the required programming techniques are within the abilities of general web users and do not demand writing a large amount of code.

Originality/value

A Java‐based class package was developed for web information searching, submitting and extraction, which users can easily integrate with general web applications.
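
The package described above is Java‐based; purely as an illustration, the following Python sketch shows the general idea of functionality emulation, i.e. submitting a search request the way a browser would and extracting results from the returned HTML. The URL, parameter name and CSS selector are hypothetical placeholders, not part of the authors' package.

```python
# Minimal sketch of "functionality emulation": submit a web form the way a
# browser would and extract results from the returned HTML.
# The URL, parameter name and CSS selector below are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

def emulate_search(query: str) -> list[str]:
    # Emulate the search form of a hypothetical web application.
    response = requests.get("https://example.org/search",
                            params={"q": query}, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Information extraction step: pull the result titles out of the page.
    return [node.get_text(strip=True) for node in soup.select("div.result h2")]

if __name__ == "__main__":
    for title in emulate_search("web integration"):
        print(title)
```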

Details

International Journal of Web Information Systems, vol. 6 no. 4
Type: Research Article
ISSN: 1744-0084

Keywords

Article
Publication date: 1 February 2002

A.C.M. Fong, S.C. Hui and H.L. Vu

Research organisations and individual researchers increasingly choose to share their research findings by providing lists of their published works on the World Wide Web. To…

Abstract

Research organisations and individual researchers increasingly choose to share their research findings by providing lists of their published works on the World Wide Web. To facilitate the exchange of ideas, the lists often include links to published papers in portable document format (PDF) or Postscript (PS) format. Generally, these publication Web sites are updated regularly to include new works. While manual monitoring of relevant Web sites is tedious, commercial search engines and information monitoring systems are ineffective in finding and tracking scholarly publications. This paper analyses the characteristics of publication index pages and describes effective automatic extraction techniques that the authors have developed. The authors’ techniques combine lexical and syntactic analyses with heuristics. The proposed techniques have been implemented and tested on more than 14,000 Web pages and achieved consistently high success rates of around 90 percent.
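
As an illustration of the kind of heuristic the abstract describes (not the authors' implementation), the sketch below treats links to PDF/PS files on an index page as publication entries and takes the surrounding text as the candidate citation:

```python
# Sketch of a simple heuristic for publication index pages: links to .pdf/.ps
# files are taken as publication entries and the anchor's enclosing list item
# or paragraph as the candidate citation string.
import re
from bs4 import BeautifulSoup

PUB_LINK = re.compile(r"\.(pdf|ps|ps\.gz)$", re.IGNORECASE)

def extract_publications(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")
    entries = []
    for anchor in soup.find_all("a", href=True):
        if PUB_LINK.search(anchor["href"]):
            # Lexical heuristic: the enclosing <li> or <p> usually holds the
            # title, authors and venue of the publication.
            container = anchor.find_parent(["li", "p"]) or anchor
            entries.append({"href": anchor["href"],
                            "citation": container.get_text(" ", strip=True)})
    return entries
```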

Details

Online Information Review, vol. 26 no. 1
Type: Research Article
ISSN: 1468-4527

Keywords

Article
Publication date: 21 September 2012

Jorge Martinez‐Gil and José F. Aldana‐Montes

Semantic similarity measures are very important in many computer‐related fields. Previous works on applications such as data integration, query expansion, tag refactoring or text…

Abstract

Purpose

Semantic similarity measures are very important in many computer‐related fields. Applications such as data integration, query expansion, tag refactoring and text clustering have relied on semantic similarity measures. Despite their usefulness in these applications, the problem of measuring the similarity between two text expressions remains a key challenge. This paper aims to address this issue.

Design/methodology/approach

In this article, the authors propose an optimization environment to improve existing techniques that use the notion of co‐occurrence and the information available on the web to measure similarity between terms.

Findings

The experimental results using the Miller and Charles and Gracia and Mena benchmark datasets show that the proposed approach is able to outperform classic probabilistic web‐based algorithms by a wide margin.

Originality/value

This paper presents two main contributions. The authors propose a novel technique that beats classic probabilistic techniques for measuring semantic similarity between terms. This new technique consists of using not only a search engine for computing web page counts, but a smart combination of several popular web search engines. The approach is evaluated on the Miller and Charles and Gracia and Mena benchmark datasets and compared with existing probabilistic web extraction techniques.
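
As a rough illustration of the co‐occurrence idea (not the authors' optimization environment), the sketch below computes a Jaccard‐style score from web page counts and averages it over several hit‐count sources; the `engines` callables are hypothetical stand‐ins for real search engine APIs:

```python
# Sketch of a web-based co-occurrence similarity: a Jaccard-style score
# computed from page counts and averaged over several search engines.
def web_jaccard(count_a: int, count_b: int, count_ab: int) -> float:
    # hits(a AND b) / (hits(a) + hits(b) - hits(a AND b))
    denominator = count_a + count_b - count_ab
    return count_ab / denominator if denominator > 0 else 0.0

def combined_similarity(term_a: str, term_b: str, engines) -> float:
    # `engines` is a list of callables returning (hits_a, hits_b, hits_ab);
    # combining several engines is the key idea highlighted above.
    scores = [web_jaccard(*engine(term_a, term_b)) for engine in engines]
    return sum(scores) / len(scores) if scores else 0.0
```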

Details

Online Information Review, vol. 36 no. 5
Type: Research Article
ISSN: 1468-4527

Keywords

Article
Publication date: 31 July 2007

Alesia Zuccala, Mike Thelwall, Charles Oppenheim and Rajveen Dhiensa

The purpose of this paper is to explore the use of LexiURL as a Web intelligence tool for collecting and analysing links to digital libraries, focusing specifically on the…

Abstract

Purpose

The purpose of this paper is to explore the use of LexiURL as a Web intelligence tool for collecting and analysing links to digital libraries, focusing specifically on the National electronic Library for Health (NeLH).

Design/methodology/approach

The Web intelligence techniques in this study are a combination of link analysis (web structure mining), web server log file analysis (web usage mining), and text analysis (web content mining), utilizing the power of commercial search engines and drawing upon the information science fields of bibliometrics and webometrics. LexiURL is a computer program designed to calculate summary statistics for lists of links or URLs. Its output is a series of standard reports, for example listing and counting all of the different domain names in the data.

Findings

Link data, when analysed together with user transaction log files (i.e. Web referring domains), can provide insights into who is using a digital library and when, and who could be using the digital library if they are “surfing” a particular part of the Web; in this case any site that is linked to or colinked with the NeLH. This study found that the NeLH was embedded in a multifaceted Web context, including many governmental, educational, commercial and organisational sites, with the most interesting being sites from the .edu domain, representing American universities. Not many links directed to the NeLH were followed on September 25, 2005 (the date of the log file analysis and link extraction analysis), which means that users who access the digital library have been arriving at the site via only a few select links, bookmarks, search engine searches, or non‐electronic sources.

Originality/value

A number of studies concerning digital library users have been carried out using log file analysis as a research tool. Log files focus on real‐time user transactions, while LexiURL can be used to extract links and colinks associated with a digital library's growing Web network. This Web network is not recognized often enough, and can be a useful indication of where potential users are surfing, even if they have not yet specifically visited the NeLH site.
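
LexiURL itself is a standalone program; the sketch below only illustrates the kind of summary report described above, counting the hosts and top‐level domains in a list of inlink URLs (the example URLs are hypothetical):

```python
# Sketch of a link-list summary report: given a list of link URLs, count how
# often each host and top-level domain appears.
from collections import Counter
from urllib.parse import urlparse

def domain_report(urls: list[str]) -> tuple[Counter, Counter]:
    hosts = Counter(urlparse(u).hostname or "" for u in urls)
    tlds = Counter((urlparse(u).hostname or "").rsplit(".", 1)[-1] for u in urls)
    return hosts, tlds

hosts, tlds = domain_report([
    "http://www.example.edu/library/links.html",  # hypothetical inlink
    "http://health.example.gov/resources",        # hypothetical inlink
])
print(hosts.most_common(10), tlds.most_common(10))
```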

Article
Publication date: 3 August 2021

Irvin Dongo, Yudith Cardinale, Ana Aguilera, Fabiola Martinez, Yuni Quintero, German Robayo and David Cabeza

This paper aims to perform an exhaustive review of relevant and recent related studies, which reveals that both extraction methods are currently used to analyze credibility on…

Abstract

Purpose

This paper aims to perform an exhaustive review of relevant and recent related studies, which reveals that both extraction methods are currently used to analyze credibility on Twitter. Thus, there is clear evidence of the need to have different options to extract different data for this purpose. Nevertheless, none of these studies performs a comparative evaluation of both extraction techniques. Moreover, the authors extend a previous comparison, which uses a recently developed framework that offers both alternatives of data extraction and implements a previously proposed credibility model, by adding a qualitative evaluation and a Twitter Application Programming Interface (API) performance analysis from different locations.

Design/methodology/approach

As one of the most popular social platforms, Twitter has been the focus of recent research aimed at analyzing the credibility of the shared information. To do so, several proposals use either Twitter API or Web scraping to extract the data to perform the analysis. Qualitative and quantitative evaluations are performed to discover the advantages and disadvantages of both extraction methods.

Findings

The study demonstrates the differences in accuracy and efficiency between the two extraction methods and highlights further problems in this area that must be addressed to pursue true transparency and legitimacy of information on the Web.

Originality/value

Results report that some Twitter attributes cannot be retrieved by Web scraping. Both methods produce identical credibility values when a robust normalization process is applied to the text (i.e. the tweet). Moreover, concerning time performance, Web scraping is faster than the Twitter API and is more flexible in terms of obtaining data; however, Web scraping is very sensitive to website changes. Additionally, the response time of the Twitter API is proportional to the distance from the central server in San Francisco.
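
As a minimal illustration of the timing comparison reported above (not the study's actual setup), the sketch below times an API‐style request against a scraped page fetch; both URLs and the bearer token are placeholders:

```python
# Toy timing comparison between an API-style request and a scraped page fetch.
# The two URLs and the token are placeholders, not the study's endpoints.
import time
import requests

def timed_fetch(url: str, **kwargs) -> tuple[float, int]:
    start = time.perf_counter()
    response = requests.get(url, timeout=15, **kwargs)
    return time.perf_counter() - start, response.status_code

api_time, _ = timed_fetch("https://api.example.com/tweets",      # stands in for the API call
                          headers={"Authorization": "Bearer <token>"})
scrape_time, _ = timed_fetch("https://example.com/user/status")  # stands in for the scraped page
print(f"API: {api_time:.2f}s  scraping: {scrape_time:.2f}s")
```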

Details

International Journal of Web Information Systems, vol. 17 no. 6
Type: Research Article
ISSN: 1744-0084

Keywords

Article
Publication date: 7 August 2009

F. Canan Pembe and Tunga Güngör

The purpose of this paper is to develop a new summarisation approach, namely structure‐preserving and query‐biased summarisation, to improve the effectiveness of web searching…

Abstract

Purpose

The purpose of this paper is to develop a new summarisation approach, namely structure‐preserving and query‐biased summarisation, to improve the effectiveness of web searching. During web searching, one aid for users is the document summaries provided in the search results. However, the summaries provided by current search engines have limitations in directing users to relevant documents.

Design/methodology/approach

The proposed system consists of two stages: document structure analysis and summarisation. In the first stage, a rule‐based approach is used to identify the sectional hierarchies of web documents. In the second stage, query‐biased summaries are created, making use of document structure both in the summarisation process and in the output summaries.
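
As a toy illustration of query‐biased, structure‐preserving summarisation (not the authors' rule‐based system), the sketch below scores sentences by query‐term overlap while keeping them grouped under their section headings:

```python
# Score sentences by query-term overlap, but keep the output grouped under the
# document's section headings so the sectional structure is preserved.
import re

def summarise(sections: dict[str, list[str]], query: str,
              per_section: int = 1) -> dict[str, list[str]]:
    terms = set(re.findall(r"\w+", query.lower()))
    summary = {}
    for heading, sentences in sections.items():
        scored = sorted(sentences,
                        key=lambda s: len(terms & set(re.findall(r"\w+", s.lower()))),
                        reverse=True)
        summary[heading] = scored[:per_section]  # best-matching sentence(s) per section
    return summary
```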

Findings

In structural processing, about 70 per cent accuracy in identifying document sectional hierarchies is obtained. The summarisation method is evaluated in a task‐based setting using English and Turkish document collections. The results show that the proposed method is a significant improvement over both unstructured query‐biased summaries and Google snippets in terms of f‐measure.

Practical implications

The proposed summarisation system can be incorporated into search engines. The structural processing technique also has applications in other information systems, such as browsing, outlining and indexing documents.

Originality/value

In the literature on summarisation, the effects of query‐biased techniques and document structure are considered in only a few works and are researched separately. The research reported here differs from traditional approaches by combining these two aspects in a coherent framework. The work is also the first automatic summarisation study for Turkish targeting web search.

Details

Online Information Review, vol. 33 no. 4
Type: Research Article
ISSN: 1468-4527

Keywords

Article
Publication date: 1 June 2005

Yanbo Ru and Ellis Horowitz

The existence and continued growth of the invisible web creates a major challenge for search engines that are attempting to organize all of the material on the web into a form…

Abstract

Purpose

The existence and continued growth of the invisible web creates a major challenge for search engines that are attempting to organize all of the material on the web into a form that is easily retrieved by all users. The purpose of this paper is to identify the challenges and problems underlying existing work in this area.

Design/methodology/approach

A discussion based on a short survey of prior work, including automated discovery of invisible web site search interfaces, automated classification of invisible web sites, label assignment and form filling, information extraction from the resulting pages, learning the query language of the search interface, building content summaries for an invisible web site, selecting proper databases, integrating invisible web‐search interfaces, and assessing the performance of an invisible web site.

Findings

Existing technologies and tools for indexing the invisible web follow one of two strategies: indexing the web site interface or examining a portion of the contents of an invisible web site and indexing the results.

Originality/value

The paper is of value to those involved with information management.

Details

Online Information Review, vol. 29 no. 3
Type: Research Article
ISSN: 1468-4527

Keywords

Article
Publication date: 11 April 2008

Yingzi Jin, Mitsuru Ishizuka and Yutaka Matsuo

Social relations play an important role in a real community. Interaction patterns reveal relations among actors (such as persons, groups, firms), which can be merged to…

Abstract

Purpose

Social relations play an important role in a real community. Interaction patterns reveal relations among actors (such as persons, groups, firms), which can be merged to produce valuable information such as a network structure. This paper aims to present a new approach to extract inter‐firm networks from the web for further analysis.

Design/methodology/approach

In this study, extraction of relations between a pair of firms is obtained by using a search engine and text processing. Because names of firms co‐appear coincidentally on the web, an advanced algorithm is proposed, which is characterised by the addition of keywords (“relation keywords”) to a query. The relation keywords are obtained from the web using a Jaccard coefficient.

Findings

As an application, a network of 60 firms in Japan, including IT, communication, broadcasting and electronics firms, is extracted from the web, and comprehensive evaluations of this approach are shown. The alliance and lawsuit relations are easily obtainable from the web using the algorithm. By adding relation keywords to named pairs of firms as a query, it is possible to collect target pages from the top of web pages more precisely than by using the named pairs alone as a query.

Practical implications

This study proposes a new approach for extracting inter‐firm networks from the web. The obtained network is useful in several ways. It is possible to find a cluster of firms and characterise a firm by its cluster. Business experts often make such inferences based on firm relations and firm groups, so the firm network might enhance inferential abilities in the business domain. Obtained networks might also be used to recommend business partners based on structural advantages. The authors' intuition is that extracting a social network might provide information that is only recognisable from the network point of view. For example, the centrality of each firm is identified only after generating a social network.

Originality/value

This study is a first attempt to extract inter‐firm networks from the web using a search engine. The approach is also applicable to other actors, such as famous persons, organisations or other multiple relational entities.
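
As a rough illustration of the approach (the search engine API is a hypothetical stand‐in), the sketch below forms a query from two firm names plus a relation keyword and scores the pair with a Jaccard coefficient computed from hit counts:

```python
# Query a search engine with two firm names plus a relation keyword (e.g.
# "alliance") and score the pair with a Jaccard coefficient from hit counts.
# `hits` stands in for a real search engine API and returns a page count.
def relation_jaccard(hits, firm_a: str, firm_b: str, relation: str = "") -> float:
    n_ab = hits(f'"{firm_a}" "{firm_b}" {relation}'.strip())
    n_a = hits(f'"{firm_a}" {relation}'.strip())
    n_b = hits(f'"{firm_b}" {relation}'.strip())
    union = n_a + n_b - n_ab
    return n_ab / union if union > 0 else 0.0

# An edge is added to the inter-firm network when the score exceeds a threshold.
```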

Details

Online Information Review, vol. 32 no. 2
Type: Research Article
ISSN: 1468-4527

Keywords

Article
Publication date: 12 November 2018

Aleksandra Tomašević, Ranka Stanković, Miloš Utvić, Ivan Obradović and Božo Kolonja

This paper aims to develop a system, which would enable efficient management and exploitation of documentation in electronic form, related to mining projects, with information…

Abstract

Purpose

This paper aims to develop a system that enables efficient management and exploitation of electronic documentation related to mining projects, with information retrieval and information extraction (IE) features, using various language resources and natural language processing.

Design/methodology/approach

The system is designed to integrate textual, lexical, semantic and terminological resources, enabling advanced document search and extraction of information. These resources are integrated with a set of Web services and applications, for different user profiles and use-cases.

Findings

The use of the system is illustrated by examples demonstrating keyword search supported by Web query expansion services, search based on regular expressions, corpus search based on local grammars, followed by extraction of information based on this search and finally, search with lexical masks using domain and semantic markers.
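
As a toy illustration only: the real system relies on morphological dictionaries and local grammars, whereas the sketch below expands a keyword with a hand‐written list of variants and searches a small document collection with a regular expression:

```python
# Toy keyword search with query expansion over a small document collection.
# The "expansion" here is a hand-written variant list (hypothetical example),
# standing in for morphological dictionaries and local grammars.
import re

EXPANSIONS = {"excavation": ["excavation", "excavations", "excavator"]}

def search(documents: dict[str, str], keyword: str) -> dict[str, list[str]]:
    variants = EXPANSIONS.get(keyword, [keyword])
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, variants)) + r")\b",
                         re.IGNORECASE)
    return {doc_id: pattern.findall(text)
            for doc_id, text in documents.items() if pattern.search(text)}
```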

Originality/value

The presented system is the first software solution for the implementation of human language technology in the management of documentation from the mining engineering domain, but it is also applicable to other engineering and non-engineering domains. The system is independent of the type of alphabet (Cyrillic or Latin), which makes it applicable to other languages of the Balkan region related to Serbian, and its support for morphological dictionaries can be applied in most morphologically complex languages, such as Slavic languages. Significant search improvements and the efficiency of IE are based on semantic networks and terminology dictionaries, with the support of local grammars.

Details

The Electronic Library, vol. 36 no. 6
Type: Research Article
ISSN: 0264-0473

Keywords

Article
Publication date: 23 November 2010

Yongzheng Zhang, Evangelos Milios and Nur Zincir‐Heywood

Summarization of an entire web site with diverse content may lead to a summary heavily biased towards the site's dominant topics. The purpose of this paper is to present a novel…

Abstract

Purpose

Summarization of an entire web site with diverse content may lead to a summary heavily biased towards the site's dominant topics. The purpose of this paper is to present a novel topic‐based framework to address this problem.

Design/methodology/approach

A two‐stage framework is proposed. The first stage identifies the main topics covered in a web site via clustering and the second stage summarizes each topic separately. The proposed system is evaluated by a user study and compared with the single‐topic summarization approach.
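
As a rough sketch of the two‐stage shape described above (using scikit‐learn, not the authors' feature‐selection and classification pipeline), pages are clustered into topics with TF‐IDF and k‐means and a crude representative text is taken per topic:

```python
# Stage one: cluster the site's pages into topics with TF-IDF + k-means.
# Stage two (naive stand-in): take a representative page per topic.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def topic_summaries(pages: list[str], n_topics: int = 3) -> dict[int, str]:
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(pages)
    model = KMeans(n_clusters=n_topics, n_init=10, random_state=0).fit(matrix)
    distances = model.transform(matrix)  # distance of each page to each centroid
    summaries = {}
    for topic in range(n_topics):
        members = [i for i, label in enumerate(model.labels_) if label == topic]
        best = min(members, key=lambda i: distances[i, topic])
        summaries[topic] = pages[best][:200]  # first 200 chars as a crude "summary"
    return summaries
```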

Findings

The user study demonstrates that the clustering‐summarization approach statistically significantly outperforms the plain summarization approach in the multi‐topic web site summarization task. Text‐based clustering based on selecting features with high variance over web pages is reliable; outgoing links are useful if a rich set of cross links is available.

Research limitations/implications

More sophisticated clustering methods than those used in this study are worth investigating. The proposed method should be tested on web content that is less structured than organizational web sites, for example blogs.

Practical implications

The proposed summarization framework can be applied to the effective organization of search engine results and faceted or topical browsing of large web sites.

Originality/value

Several key components are integrated for web site summarization for the first time, including feature selection, link analysis, and key phrase and key sentence extraction. Insight into the contributions of links and content to topic‐based summarization was gained. A classification approach is used to minimize the number of parameters.

Details

International Journal of Web Information Systems, vol. 6 no. 4
Type: Research Article
ISSN: 1744-0084

Keywords