Search results

1–10 of over 25,000
Article
Publication date: 19 March 2018

Hyo-Jung Oh, Dong-Hyun Won, Chonghyuck Kim, Sung-Hee Park and Yong Kim

Abstract

Purpose

The purpose of this paper is to describe the development of an algorithm for realizing web crawlers that automatically collect dynamically generated webpages from the deep web.

Design/methodology/approach

This study proposes and develops an algorithm that collects web information as if the crawler were gathering static webpages, by managing script commands as links. The proposed web crawler was then tested by using the algorithm to collect actual deep webpages.
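The "script command as link" idea in this abstract can be sketched as follows. This is an illustrative outline, not the authors' implementation: the regexes, the queue handling and the render_script() stub (standing in for the browser object the paper uses to launch scripts) are all assumptions.

```python
# Sketch: treat both ordinary hrefs and script invocations as crawlable
# "links", so dynamically generated (deep web) pages enter the frontier
# just like static ones. Standard library only.
import re
from collections import deque

SCRIPT_RE = re.compile(r"""onclick\s*=\s*["']([^"']+)["']""", re.IGNORECASE)
HREF_RE = re.compile(r"""href\s*=\s*["'](http[^"']+)["']""", re.IGNORECASE)

def render_script(page_url, script_call):
    """Placeholder: a real crawler would hand the call to a browser engine
    (the paper uses the Visual Studio web browser object) and return the
    HTML that the script generates."""
    return ""

def crawl(seed_html, seed_url, max_pages=100):
    # Each queue entry is (kind, pseudo-url, html); script calls get a
    # synthetic "url" so they can be deduplicated and revisited like links.
    queue = deque([("html", seed_url, seed_html)])
    collected = []
    while queue and len(collected) < max_pages:
        kind, url, html = queue.popleft()
        collected.append(url)
        for href in HREF_RE.findall(html):
            queue.append(("html", href, ""))       # fetching omitted in sketch
        for call in SCRIPT_RE.findall(html):
            generated = render_script(url, call)   # dynamically built content
            queue.append(("script", f"{url}#{call}", generated))
    return collected
```

The point of the design is that the frontier never distinguishes static from script-generated pages, which is what lets the crawler walk past the first page of script-returned search results.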

Findings

Among the findings is that when a site returns search results as script-generated pages, a conventional crawl collects only the first page; the proposed algorithm, however, can collect the deep webpages in this case.

Research limitations/implications

To use a script as a link, a human must first analyze the web document. This study uses the web browser object provided by Microsoft Visual Studio as a script launcher, so it cannot collect deep webpages if the web browser object cannot launch the script, or if the web document contains script errors.

Practical implications

The deep web is estimated to hold 450 to 550 times more information than the surface web, and its documents are difficult to collect. This algorithm enables deep web collection by running scripts.

Originality/value

This study presents a new method that uses script links instead of the keyword-based approaches adopted previously, making a script usable like an ordinary URL. The experiment shows that the scripts on individual websites must first be analyzed before they can be employed as links.

Details

Data Technologies and Applications, vol. 52 no. 2
Type: Research Article
ISSN: 2514-9288

Article
Publication date: 1 February 2016

Mhamed Zineddine

Downloads: 1099

Abstract

Purpose

The purpose of this paper is to decrease the traffic created by search engines’ crawlers and solve the deep web problem using an innovative approach.

Design/methodology/approach

A new algorithm was formulated, building on the best existing algorithms, to optimize the traffic caused by web crawlers, which accounts for approximately 40 percent of all network traffic. The crux of this approach is that web servers monitor and log changes and communicate them as an XML file to search engines. The XML file includes the information necessary to generate refreshed pages from existing ones and to reference new pages that need to be crawled. Furthermore, the XML file is compressed to the minimum required size.
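The change-notification idea described above can be sketched roughly as follows: the server serializes its logged page changes as XML and gzip-compresses the result before exposing it to crawlers. The element and attribute names here are illustrative assumptions, not the paper's schema.

```python
# Sketch: build a compressed XML "change feed" from a server's change log,
# listing modified pages (so crawlers can regenerate them) and new pages
# (so crawlers know what still needs fetching).
import gzip
import xml.etree.ElementTree as ET

def build_change_feed(changes):
    """changes: list of dicts like
    {"url": ..., "status": "modified" | "new", "lastmod": ...}.
    Returns gzip-compressed XML bytes."""
    root = ET.Element("changefeed")
    for c in changes:
        page = ET.SubElement(root, "page", status=c["status"])
        ET.SubElement(page, "url").text = c["url"]
        ET.SubElement(page, "lastmod").text = c["lastmod"]
    xml_bytes = ET.tostring(root, encoding="utf-8")
    return gzip.compress(xml_bytes)  # shrink the file before serving it

feed = build_change_feed([
    {"url": "http://example.org/a", "status": "modified", "lastmod": "2016-01-31"},
    {"url": "http://example.org/b", "status": "new", "lastmod": "2016-01-30"},
])
```

A crawler that polls this one small file instead of re-fetching every page is what yields the traffic reduction the abstract reports.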

Findings

The results of this study have shown that the traffic caused by search engines’ crawlers might be reduced on average by 84 percent for text content. However, binary content faces many challenges, and new algorithms have to be developed to overcome these issues. The proposed approach will certainly mitigate the deep web issue. The XML files for each domain used by search engines might also be used by web browsers to refresh their caches, helping to reduce the traffic generated by normal users. This reduces users’ perceived latency and improves response time to HTTP requests.

Research limitations/implications

The study sheds light on the deficiencies and weaknesses of the algorithms monitoring changes and generating binary files. However, a substantial decrease of traffic is achieved for text-based web content.

Practical implications

The findings of this research can be adopted by web server software and browsers’ developers and search engine companies to reduce the internet traffic caused by crawlers and cut costs.

Originality/value

The exponential growth of web content and other internet-based services such as cloud computing, and social networks has been causing contention on available bandwidth of the internet network. This research provides a much needed approach to keeping traffic in check.

Details

Internet Research, vol. 26 no. 1
Type: Research Article
ISSN: 1066-2243

Article
Publication date: 1 February 1999

Dan Schiller

Downloads: 1348

Abstract

Espouses the Web with regard to the media and all its areas of relevance. Encourages and supports multinational forms of production as new but admits they may be no more sympathetic to social need and democratic practice than previous commercial media. Charts the market and the Web’s changes for commercial business.

Details

info, vol. 1 no. 1
Type: Research Article
ISSN: 1463-6697

Article
Publication date: 1 October 2006

Dirk Lewandowski and Philipp Mayr

Downloads: 2620

Abstract

Purpose

The purpose of this article is to provide a critical review of Bergman's study on the deep web. In addition, this study brings a new concept into the discussion, the academic invisible web (AIW). The paper defines the academic invisible web as consisting of all databases and collections relevant to academia but not searchable by the general‐purpose internet search engines. Indexing this part of the invisible web is central to scientific search engines. This paper provides an overview of approaches followed thus far.

Design/methodology/approach

Provides a discussion of measures and calculations, with estimates based on informetric laws, and gives a literature review of approaches for uncovering information from the invisible web.

Findings

Bergman's size estimate of the invisible web is highly questionable. This paper demonstrates some major errors in the conceptual design of the Bergman paper. A new (raw) size estimate is given.

Research limitations/implications

The precision of this estimate is limited due to a small sample size and lack of reliable data.

Practical implications

This study shows that no single library alone will be able to index the academic invisible web and suggests a collaboration to accomplish this task.

Originality/value

Provides library managers and those interested in developing academic search engines with data on the size and attributes of the academic invisible web.

Details

Library Hi Tech, vol. 24 no. 4
Type: Research Article
ISSN: 0737-8831

Article
Publication date: 1 June 2010

Walter Warnick

Abstract

Purpose

The purpose of this paper is to describe the work of the Office of Scientific and Technical Information (OSTI) in the US Department of Energy Office of Science and OSTI's development of the powerful search engine, WorldWideScience.org. With tools such as Science.gov and WorldWideScience.org, the patron gains access to multiple, geographically dispersed deep web databases and can search all of the constituent sources with a single query.

Design/methodology/approach

The paper is both historical and descriptive.

Findings

WorldWideScience.org fills a unique niche in discovering scientific material in an information landscape that includes search engines such as Google and Google Scholar.

Originality/value

This is one of the few papers to describe in depth the important work being done by the US Office of Scientific and Technical Information in the field of search and discovery.

Details

Interlending & Document Supply, vol. 38 no. 2
Type: Research Article
ISSN: 0264-1615

Article
Publication date: 7 November 2016

Devis Bianchini, Valeria De Antonellis and Michele Melchiori

Abstract

Purpose

Modern Enterprise Web Application development can exploit third-party software components, both internal and external to the enterprise, that provide access to huge and valuable data sets, tested by millions of users and often available as Web application programming interfaces (APIs). In this context, developers have to select the right data services and may rely, for this purpose, on advanced techniques based on functional and non-functional data service descriptive features. This paper focuses on this selection task, which may be difficult because the developer has no control over the services and source reputation may be only partially known.

Design/methodology/approach

The proposed framework and methodology provide advanced search and ranking techniques by considering: lightweight data service descriptions, in terms of (semantic) tags and technical aspects; previously developed aggregations of data services, so that past experience with a service in similar applications can inform its selection; and social relationships between developers (a social network) together with their credibility evaluations. The paper also discusses some experimental results and plans for further experiments to assess how developers feel when using the approach.

Findings

This paper presents a data service selection framework that extends and specializes an existing framework for Web API selection. The revised multi-layered model for data services is discussed, and metrics that rely on it, meant to support the selection of data services in the context of Web application design, are introduced. The model and metrics take into account the network of social relationships between developers, exploiting it to estimate the importance that a developer assigns to other developers’ experience.
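A minimal sketch of the kind of credibility-weighted scoring these findings describe might look like this; the function names and the weighted-average formula are assumptions for illustration, not the paper's actual metrics.

```python
# Sketch: score a data service by other developers' past ratings of it,
# weighting each rating by that developer's credibility in the social
# network, then rank candidate services by this score.
def service_score(ratings, credibility):
    """ratings: {developer: rating in [0, 1]} for one data service;
    credibility: {developer: credibility weight in [0, 1]}."""
    weighted = sum(ratings[d] * credibility.get(d, 0.0) for d in ratings)
    total = sum(credibility.get(d, 0.0) for d in ratings)
    return weighted / total if total else 0.0

def rank_services(candidates, credibility):
    # candidates: {service_name: ratings dict}; highest score first.
    return sorted(candidates,
                  key=lambda s: service_score(candidates[s], credibility),
                  reverse=True)
```

The design choice mirrored here is that a rating from a highly credible developer moves the score more than one from an unknown developer, which is how the social network enters the selection process.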

Originality/value

This research, with respect to the state of the art, focuses attention on developers’ social networks in an enterprise context, integrating the developers’ credibility assessment and implementing the social network-based data service selection on top of a rich framework based on a multi-perspective model for data services.

Details

International Journal of Web Information Systems, vol. 12 no. 4
Type: Research Article
ISSN: 1744-0084

Article
Publication date: 1 March 2002

Norm Medeiros

Downloads: 826

Abstract

This article examines the growth of the invisible Web, and efforts underway to make its contents more accessible. Dynamic Web publishing is described. The Open Archives Metadata Harvesting Protocol is reviewed, as are projects related to OCLC’s implementation of the Open Archives Initiative. Recent Dublin Core activities are reported.
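The Open Archives Metadata Harvesting Protocol reviewed above is request-based: a harvester issues verbs such as ListRecords against a repository's base URL. A minimal sketch of building such a request, assuming an illustrative endpoint, might look like:

```python
# Sketch: construct OAI-PMH ListRecords request URLs. Per the protocol,
# a resumptionToken request carries the verb and the token only, not the
# other arguments from the initial request.
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    if resumption_token:
        # Follow-up page of a partial result list.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        # Initial request: ask for records in the given metadata format.
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)
```

A harvester would fetch the first URL, parse the XML response for records and a resumptionToken, and loop with the token until none is returned.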

Details

OCLC Systems & Services: International digital library perspectives, vol. 18 no. 1
Type: Research Article
ISSN: 1065-075X

Book part
Publication date: 17 May 2018

Monica Maceli

Abstract

Purpose – As the role of technology in libraries has broadened and expanded, tech-savvy librarians and non-librarian technologists are increasingly working side by side in complex digital environments. Little research has explored the key differences between these roles and the implications for the future of the Master of Library Science (MLS) and its variant degrees, particularly as technologists from various backgrounds increasingly enter the information field. This chapter contrasts the technological responsibilities of the two groups to build an understanding of the necessity of the MLS in library-oriented technology work.

Design/Methodology/Approach – Qualitative coding and text mining techniques were used to analyze technology-oriented librarian and non-librarian job advertisements, technology curriculum changes, and surveyed technology interests of current information professionals.

Findings – Findings indicate a clear distinction between librarian and non-librarian technology responsibilities. Librarian positions emphasize web design, data and metadata, technology troubleshooting, and usage of library-oriented software. Non-librarian technologists require programming, database development, and systems administration, with deeper software and systems knowledge. Overlap was noted in the areas of user experience, linked data, and metadata. Several newer trends that information professionals expressed a desire to learn – such as makerspace technologies – were observed to be poorly covered in the technology curriculum, though the MLS curriculum generally covered the tech-savvy librarians’ responsibilities.

Originality/Value – This chapter builds understanding of the current necessity of the MLS in library-oriented technology work, as contrasted against the role of non-librarian technologists, through analysis of a triangulated set of data sources covering employment opportunities, technology curriculum, and librarians’ technology interests.

Details

Re-envisioning the MLS: Perspectives on the Future of Library and Information Science Education
Type: Book
ISBN: 978-1-78754-884-8

Article
Publication date: 2 November 2015

Khaled A. Mohamed and Ahmed Hassan

Downloads: 1026

Abstract

Purpose

This study aims to explore a framework for evaluating and comparing two federated search tools (FSTs) using two different retrieval protocols: XML gateways and Z39.50. FSTs are meta-information retrieval systems developed to facilitate the searching of multiple resources through a single search box. FSTs allow searching of heterogeneous platforms, such as bibliographic and full-text databases, online public access catalogues, web search engines and open-access resources.

Design/methodology/approach

The proposed framework consists of three phases: usability testing, retrievability performance assessment and an overall comparison. The think-aloud protocol was implemented for usability testing, and retrieval consistency and precision tests were carried out to assess the retrievability performance of the FSTs on 20 real user queries.

Findings

Participants were directed to assign weights for the interface usability and system retrievability importance as indicators for FST evaluation. Results indicated that FSTs retrievability performance was of more importance than the interface usability. Participants assigned an average weight of 62 per cent for the system retrievability and 38 per cent for interface usability. In terms of the usability test, there was no significant difference between the two FSTs, while minor differences were found regarding retrieval consistency and precision at 11-point cut-off recall. The overall evaluation showed that the FST based on the XML gateway rated slightly higher than the FST based on the Z39.50 protocol.
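The "precision at 11-point cut-off recall" measure referred to in these findings is the standard interpolated precision at the recall levels 0.0, 0.1, …, 1.0. A minimal sketch, assuming binary relevance judgments, is:

```python
# Sketch: 11-point interpolated precision for one ranked result list.
# Interpolated precision at recall level r is the maximum precision
# observed at any point with recall >= r.
def eleven_point_precision(relevance, total_relevant):
    """relevance: ranked list of 0/1 relevance judgments;
    total_relevant: number of relevant documents for the query."""
    precisions, recalls = [], []
    hits = 0
    for i, rel in enumerate(relevance, start=1):
        hits += rel
        precisions.append(hits / i)          # precision after rank i
        recalls.append(hits / total_relevant)  # recall after rank i
    points = []
    for level in (x / 10 for x in range(11)):
        eligible = [p for p, r in zip(precisions, recalls) if r >= level]
        points.append(max(eligible) if eligible else 0.0)
    return points
```

Averaging these 11 points over the 20 test queries gives a single comparable curve (or mean) per FST, which is how such precision results are typically summarized.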

Research limitations/implications

This empirical study faced several limitations. First, participants’ lack of familiarity with usability testing created the need for deep awareness and rigorous supervision. Second, the difficulty of empirically assessing participants’ perspectives and future attitudes called for combining a formal task with the think-aloud protocol in a real environment; collecting the usability data, including user behaviour, expectations and other empirical data, was therefore a challenge. Third, the differences between the two FSTs in the number of connectors and advanced search techniques required rigorous procedures for testing retrieval consistency and precision.

Practical implications

This paper has practical implications in two dimensions. First, its results could be utilized by FST developers to enhance their product’s performance. Second, the framework could be used by librarians to evaluate FSTs performance and capabilities. The framework enables them to compare between library systems in general and FSTs in particular. In addition to these practical implications, the authors encourage researchers to use and enhance the proposed framework.

Social implications

Librarians can use the proposed framework to empirically select an FST, involving users in the selection procedures of these information retrieval systems, so that it accords with users’ perspectives and attitudes and serves the community better.

Originality/value

The proposed framework could be considered a benchmark for FST evaluation.

Details

The Electronic Library, vol. 33 no. 6
Type: Research Article
ISSN: 0264-0473

Article
Publication date: 1 June 2003

Glenn McGuigan

Downloads: 660

Abstract

Addressing the selection of invisible Web sites for business subject pages as part of collection development, this discussion begins by defining the invisible Web and examining why certain Web pages are “invisible.” It then acknowledges problems concerning the use of terminology, considers implications for the process of collection development, and briefly examines the criteria for selecting items for subject Web pages. After strategies for locating these materials are applied to create useful subject pages, selection criteria for the Web sites include determining the credibility of the source and examining the scope and quality of content.

Details

Collection Building, vol. 22 no. 2
Type: Research Article
ISSN: 0160-4953
