Search results
1 – 10 of 513
Christian Girardi, Filippo Ricca and Paolo Tonella
Abstract
Tools for the assessment of the quality and reliability of Web applications depend on the ability to download the target of the analysis. This is achieved through Web crawlers, which can automatically navigate within a Web site and perform appropriate actions (such as downloading) during the visit. The most important performance indicators for a Web crawler are its completeness and robustness, which measure, respectively, its ability to visit the Web site entirely and without errors. The variety of implementation languages and technologies used in Web site development makes these two indicators hard to maximize. We conducted an evaluation study in which we tested several of the available Web crawlers.
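The two indicators can be made concrete with a minimal sketch. The following is not the authors' evaluation harness; it is an illustrative breadth-first crawler over an in-memory link graph (a stand-in for real HTTP fetching), where pages mapping to `None` simulate download failures:

```python
from collections import deque

def crawl(site, start):
    """Breadth-first crawl of a site given as a dict page -> list of links.
    Pages mapping to None simulate download errors.
    Returns (pages_visited_ok, pages_that_errored)."""
    seen, ok, errors = {start}, [], []
    queue = deque([start])
    while queue:
        page = queue.popleft()
        links = site.get(page)
        if links is None:          # simulated download failure
            errors.append(page)
            continue
        ok.append(page)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return ok, errors

# Hypothetical four-page site; "/b" fails to download.
site = {"/": ["/a", "/b"], "/a": ["/c"], "/b": None, "/c": []}
ok, errors = crawl(site, "/")
completeness = len(ok) / len(site)              # fraction of the site retrieved
robustness = len(ok) / (len(ok) + len(errors))  # fraction of visits without error
```

On this toy site both metrics come out to 0.75: one of the four pages is reachable but fails, so the crawl is neither fully complete nor fully robust.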
Hyo-Jung Oh, Dong-Hyun Won, Chonghyuck Kim, Sung-Hee Park and Yong Kim
Abstract
Purpose
The purpose of this paper is to describe the development of an algorithm for realizing web crawlers that automatically collect dynamically generated webpages from the deep web.
Design/methodology/approach
This study proposes and develops an algorithm to collect web information as if the web crawler gathers static webpages by managing script commands as links. The proposed web crawler actually experiments with the algorithm by collecting deep webpages.
Findings
One finding of this study is that when search results are delivered as script-generated pages, a conventional crawler collects only the first page. The proposed algorithm, by contrast, can collect the deep webpages in this case.
Research limitations/implications
To use a script as a link, a human must first analyze the web document. This study uses the web browser object provided by Microsoft Visual Studio as a script launcher, so it cannot collect deep webpages if the web browser object cannot launch the script, or if the web document contains script errors.
Practical implications
The research results show that the deep web is estimated to hold 450 to 550 times more information than the surface web, yet its documents are difficult to collect. This algorithm enables deep web collection by executing scripts.
Originality/value
This study presents a new method that uses script links rather than the keyword-based approaches of previous work; the proposed algorithm makes a script usable as an ordinary URL. The conducted experiment shows that the scripts on individual websites must first be analyzed before they can be employed as links.
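The core idea of treating script commands as links can be sketched as follows. This is an illustration, not the paper's implementation (which uses the Microsoft Visual Studio web browser object); the `goPage` function name and the result-page URL pattern are invented for the example:

```python
import re

# Anchors like <a href="javascript:goPage(2)"> generate pages dynamically and
# are invisible to a static crawler. Mapping each script call to a synthetic
# URL lets the crawler enumerate those deep pages as if they were static links.
SCRIPT_LINK = re.compile(r"""href=["']javascript:goPage\((\d+)\)["']""")

def script_links(html, base="http://example.com/results?page="):
    """Convert each goPage(n) script call found in html into a crawlable URL."""
    return [base + n for n in SCRIPT_LINK.findall(html)]

html = '<a href="javascript:goPage(2)">2</a> <a href="javascript:goPage(3)">3</a>'
links = script_links(html)
```

A real implementation would need per-site analysis of the script patterns, which is exactly the limitation the abstract notes.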
Ryan Scrivens, Tiana Gaudette, Garth Davies and Richard Frank
Abstract
Purpose – This chapter examines how sentiment analysis and web-crawling technology can be used to conduct large-scale data analyses of extremist content online.
Methods/approach – The authors describe a customized web-crawler that was developed for the purpose of collecting, classifying, and interpreting extremist content online and on a large scale, followed by an overview of a relatively novel machine learning tool, sentiment analysis, which has sparked the interest of some researchers in the field of terrorism and extremism studies. The authors conclude with a discussion of what they believe is the future applicability of sentiment analysis within the online political violence research domain.
Findings – In order to gain a broader understanding of online extremism, or to improve the means by which researchers and practitioners “search for a needle in a haystack,” the authors recommend that social scientists continue to collaborate with computer scientists, combining sentiment analysis software with other classification tools and research methods, as well as validate sentiment analysis programs and adapt sentiment analysis software to new and evolving radical online spaces.
Originality/value – This chapter provides researchers and practitioners who are faced with new challenges in detecting extremist content online with insights regarding the applicability of a specific set of machine learning techniques and research methods to conduct large-scale data analyses in the field of terrorism and extremism studies.
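For readers unfamiliar with sentiment analysis, the simplest family of techniques scores text against word lists. The sketch below is a minimal lexicon-based scorer, a stand-in for illustration only, not the authors' classifier, and the two word lists are invented examples:

```python
# Toy sentiment lexicons (illustrative only).
POSITIVE = {"good", "support", "peace"}
NEGATIVE = {"hate", "attack", "enemy"}

def sentiment(text):
    """Score text as (# positive words) - (# negative words).
    Positive scores suggest positive sentiment, negative scores the opposite."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

sentiment("they attack and hate")  # -2
```

Research-grade tools replace the hand-built lexicons with trained machine learning models, but the input/output shape (text in, polarity score out) is the same, which is what makes such tools easy to combine with a web crawler's output.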
Michael P. Evans and Andrew Walker
Abstract
The Web's link structure (termed the Web Graph) is a richly connected set of Web pages. Current applications use this graph for indexing and information retrieval purposes. In contrast, this work reverses the relationship between the Web Graph and the application by letting the structure of the Web Graph influence the application's behaviour. Presents a novel Web crawling agent, AlienBot, the output of which is orthogonally coupled to the enemy generation strategy of a computer game. The Web Graph guides AlienBot, causing it to generate a stochastic process. Shows the effectiveness of such unorthodox coupling for both the playability of the game and the heuristics of the Web crawler. In addition, presents the results of the sample of Web pages collected by the crawling process. In particular, shows: how AlienBot was able to identify the power law inherent in the link structure of the Web; that 61.74 per cent of Web pages use some form of scripting technology; that the size of the Web can be estimated at just over 5.2 billion pages; and that less than 7 per cent of Web pages fully comply with some variant of (X)HTML.
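The power-law claim refers to the in-degree distribution of Web pages, P(k) ∝ k^(−α). A standard way to identify such a law from crawl data is a least-squares fit on a log-log scale; the sketch below (with synthetic data, not the paper's sample) shows the idea:

```python
import math

def power_law_exponent(degree_counts):
    """Estimate alpha in P(k) ~ k^-alpha by linear regression of
    log(count) against log(degree). degree_counts maps an in-degree
    to the number of pages observed with that in-degree."""
    xs = [math.log(k) for k in degree_counts]
    ys = [math.log(c) for c in degree_counts.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return -slope  # a power law appears as a straight line of slope -alpha

# Synthetic counts drawn from P(k) ~ k^-2.1
counts = {k: round(1e6 * k ** -2.1) for k in range(1, 50)}
alpha = power_law_exponent(counts)  # ~2.1
```

On real crawl data the fit is noisier, but a roughly straight line on the log-log plot is the signature AlienBot's sample exhibited.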
Abstract
There have been many attempts to study the content of the Web, either through human or automatic agents. Describes five different previously used Web survey methodologies, each justifiable in its own right, but presents a simple experiment that demonstrates concrete differences between them. The concept of crawling the Web also bears further inspection, including the scope of the pages to crawl, the method used to access and index each page, and the algorithm for the identification of duplicate pages. The issues involved here will be well‐known to many computer scientists but, with the increasing use of crawlers and search engines in other disciplines, they now require a public discussion in the wider research community. Concludes that any scientific attempt to crawl the Web must make available the parameters under which it is operating so that researchers can, in principle, replicate experiments or be aware of and take into account differences between methodologies. Also introduces a new hybrid random page selection methodology.
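One of the methodological parameters the abstract flags, the algorithm for identifying duplicate pages, is worth making concrete, because different choices yield different survey results. A common baseline (an illustration, not the paper's method) is to hash each page's whitespace-normalized content and discard repeats:

```python
import hashlib

def dedupe(pages):
    """Keep only the first URL for each distinct page body.
    pages: iterable of (url, body) pairs. Bodies are normalized by
    collapsing whitespace before hashing, so trivial variants collapse."""
    seen, unique = set(), []
    for url, body in pages:
        digest = hashlib.sha256(" ".join(body.split()).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(url)
    return unique

pages = [("/a", "Hello  world"), ("/b", "Hello world"), ("/c", "Other")]
dedupe(pages)  # ["/a", "/c"] -- "/b" is a whitespace-variant duplicate
```

Whether such normalization counts "/b" as a duplicate of "/a" is exactly the kind of parameter the author argues crawl-based studies must publish for replicability.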
Abstract
Web impact factors, the proposed web equivalent of impact factors for journals, can be calculated by using search engines. It has been found that the results are problematic because of the variable coverage of search engines as well as their ability to give significantly different results over short periods of time. The fundamental problem is that although some search engines provide a functionality that is capable of being used for impact calculations, this is not their primary task and therefore they do not give guarantees as to performance in this respect. In this paper, a bespoke web crawler designed specifically for the calculation of reliable WIFs is presented. This crawler was used to calculate WIFs for a number of UK universities, and the results of these calculations are discussed. The principal findings were that with certain restrictions, WIFs can be calculated reliably, but do not correlate with accepted research rankings owing to the variety of material hosted on university servers. Changes to the calculations to improve the fit of the results to research rankings are proposed, but there are still inherent problems undermining the reliability of the calculation. These problems still apply if the WIF scores are taken on their own as indicators of the general impact of any area of the Internet, but with care would not apply to online journals.
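The calculation itself is simple once a crawler has gathered reliable link data. In its common form, the web impact factor is the number of pages linking to a site divided by the number of pages at that site; the figures below are invented for illustration:

```python
def wif(inlinking_pages, site_pages):
    """Web impact factor: pages linking to a site per page at the site,
    mirroring the citations-per-article form of a journal impact factor."""
    return inlinking_pages / site_pages

# Hypothetical university site: 12,000 inlinking pages, 4,000 site pages.
wif(12_000, 4_000)  # 3.0
```

The paper's point is that the hard part is not this division but obtaining stable, well-defined counts for both quantities, which is why a bespoke crawler outperforms general-purpose search engines for the task.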
Abstract
Purpose
This paper aims to describe several methods to expose website information to Web crawlers for providing value-added services to patrons.
Design/methodology/approach
This is a conceptual paper exploring the areas of search engine optimization (SEO) and usability in the context of search engines.
Findings
Not applicable
Originality/value
This paper explains several methods that can be used to appropriately expose website content and library services to the Web crawlers in such a way that services and content can be syndicated via those search engines.
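One widely used exposure method of the kind the paper describes is publishing an XML sitemap that enumerates each record's URL for crawlers. As a hedged sketch (the library URL is invented, and this is a generic illustration rather than the paper's specific recommendations), a sitemap can be generated like this:

```python
import xml.etree.ElementTree as ET

def sitemap(urls):
    """Build a minimal sitemap.xml body listing each URL in a <loc> entry,
    per the sitemaps.org 0.9 schema."""
    root = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for u in urls:
        loc = ET.SubElement(ET.SubElement(root, "url"), "loc")
        loc.text = u
    return ET.tostring(root, encoding="unicode")

xml_body = sitemap(["https://library.example.org/record/1"])
```

Served at the site root and referenced from robots.txt, such a file lets crawlers discover catalog records that are otherwise reachable only through search forms.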
Abstract
Purpose
This paper aims to evaluate initial user perceptions and use of Amazon's Kindle DX e‐book reader.
Design/methodology/approach
The web crawler software LocoySpider was first used to extract the Kindle DX customer reviews from the Amazon website; then 100 records randomly selected from the 1,358 customer reviews were analyzed with QSR NVivo 8 software to code the pros and cons of the Kindle DX.
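The sampling step described above amounts to drawing 100 reviews uniformly at random from the 1,358 collected. A minimal sketch (not the authors' procedure; the seed and placeholder review labels are invented):

```python
import random

random.seed(42)  # fixed seed so the illustrative draw is reproducible
reviews = [f"review-{i}" for i in range(1, 1359)]  # 1,358 collected reviews
sample = random.sample(reviews, 100)               # 100 drawn without replacement
```

Sampling without replacement guarantees the 100 coded records are distinct reviews.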
Findings
Data analysis indicates that the Kindle DX is well received as an e-book device. However, customers also expressed requirements for the next-generation Kindle e-book reader.
Originality/value
This is one of the first research papers of its kind to collect customer reviews of the Amazon Kindle DX with web crawler software and use NVivo 8 qualitative data analysis software to explore user perceptions of the Kindle DX e‐book reader.