Christian Girardi, Filippo Ricca and Paolo Tonella
Abstract
Tools for the assessment of the quality and reliability of Web applications rely on the ability to download the target of the analysis. This is achieved through Web crawlers, which can automatically navigate within a Web site and perform proper actions (such as download) during the visit. The most important performance indicators for a Web crawler are its completeness and robustness, which measure, respectively, its ability to visit the Web site entirely and without errors. The variety of implementation languages and technologies used for Web site development makes these two indicators hard to maximize. We conducted an evaluation study in which we tested several of the available Web crawlers.
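The two indicators named in the abstract can be computed from a crawl log. A minimal sketch follows; the metric definitions here are an assumption based on the abstract's description, not the authors' exact formulas:

```python
def crawl_metrics(total_pages, visited_pages, failed_visits):
    """Completeness: fraction of the site's pages the crawler reached.
    Robustness: fraction of attempted visits that completed without error.
    Both definitions are illustrative assumptions."""
    completeness = visited_pages / total_pages if total_pages else 0.0
    attempts = visited_pages + failed_visits
    robustness = visited_pages / attempts if attempts else 0.0
    return completeness, robustness

# Example: a site with 200 pages; 180 were visited, 20 visits failed.
completeness, robustness = crawl_metrics(200, 180, 20)
print(f"completeness={completeness:.2f} robustness={robustness:.2f}")
```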
Saed ALQARALEH, Omar RAMADAN and Muhammed SALAMAH
Abstract
Purpose
The purpose of this paper is to design a watcher-based crawler (WBC) that has the ability of crawling static and dynamic web sites, and can download only the updated and newly added web pages.
Design/methodology/approach
In the proposed WBC, a watcher file, which can be uploaded to the web sites' servers, prepares a report that contains the addresses of the updated and the newly added web pages. In addition, the WBC is split into five units, each responsible for performing a specific crawling process.
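The watcher-file mechanism can be sketched as follows. The report schema, field names, and JSON encoding are illustrative assumptions; the paper does not publish the file format:

```python
import json

# Hypothetical watcher report, as a server-side watcher file might emit it.
watcher_report = json.dumps({
    "site": "https://example.com",
    "updated": ["/news/article-17.html"],
    "added": ["/news/article-42.html"],
})

def pages_to_crawl(report_text):
    """Return only the updated and newly added pages named in the watcher
    report, so the crawler need not revisit the entire site."""
    report = json.loads(report_text)
    return [report["site"] + path for path in report["updated"] + report["added"]]

print(pages_to_crawl(watcher_report))
# ['https://example.com/news/article-17.html', 'https://example.com/news/article-42.html']
```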
Findings
Several experiments have been conducted, and it has been observed that the proposed WBC increases the number of uniquely visited static and dynamic web sites as compared with existing crawling techniques. In addition, the proposed watcher file not only allows crawlers to visit the updated and newly added web pages, but also solves the crawlers' overlapping and communication problems.
Originality/value
The proposed WBC performs all crawling processes itself: it detects all updated and newly added pages automatically, without explicit human intervention and without downloading entire web sites.
Hyo-Jung Oh, Dong-Hyun Won, Chonghyuck Kim, Sung-Hee Park and Yong Kim
Abstract
Purpose
The purpose of this paper is to describe the development of an algorithm for realizing web crawlers that automatically collect dynamically generated webpages from the deep web.
Design/methodology/approach
This study proposes and develops an algorithm to collect web information as if the web crawler gathers static webpages by managing script commands as links. The proposed web crawler actually experiments with the algorithm by collecting deep webpages.
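The idea of managing script commands as links can be sketched with a parser that queues script-triggered navigation alongside ordinary hrefs. This is a simplified static sketch; the paper drives the scripts through a browser object rather than extracting them from markup, and the attribute handling here is an assumption:

```python
from html.parser import HTMLParser

class ScriptLinkCollector(HTMLParser):
    """Collects ordinary hrefs and script commands (javascript: URLs,
    onclick handlers) alike, so script-generated pages can be queued for
    crawling the same way static links are."""
    def __init__(self):
        super().__init__()
        self.links = []     # static links, crawled directly
        self.scripts = []   # script commands, to be launched by a browser object

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href") or ""
        if href.startswith("javascript:"):
            self.scripts.append(href[len("javascript:"):])
        elif href:
            self.links.append(href)
        if "onclick" in attrs:
            self.scripts.append(attrs["onclick"])

html = '<a href="page2.html">next</a><a href="javascript:goPage(3)">3</a>'
p = ScriptLinkCollector()
p.feed(html)
print(p.links, p.scripts)  # ['page2.html'] ['goPage(3)']
```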
Findings
Among the findings of this study is that if the actual crawling process provides search results as script pages, the crawler collects only the first page. The proposed algorithm, however, can collect deep webpages in this case.
Research limitations/implications
To use a script as a link, a human must first analyze the web document. This study uses the web browser object provided by Microsoft Visual Studio as a script launcher, so it cannot collect deep webpages if the web browser object cannot launch the script, or if the web document contains script errors.
Practical implications
The deep web is estimated to hold 450 to 550 times more information than surface webpages, and its documents are difficult to collect. This algorithm helps to enable deep web collection through script runs.
Originality/value
This study presents a new method that uses script links rather than the keyword-based approaches of previous work. The proposed algorithm makes a script available as an ordinary URL. The conducted experiment shows that the scripts on individual websites must be analyzed before they can be employed as links.
Abstract
Purpose
The purpose of this paper is to decrease the traffic created by search engines’ crawlers and solve the deep web problem using an innovative approach.
Design/methodology/approach
A new algorithm was formulated based on best existing algorithms to optimize the existing traffic caused by web crawlers, which is approximately 40 percent of all networking traffic. The crux of this approach is that web servers monitor and log changes and communicate them as an XML file to search engines. The XML file includes the information necessary to generate refreshed pages from existing ones and reference new pages that need to be crawled. Furthermore, the XML file is compressed to decrease its size to the minimum required.
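The server-side change report described above can be sketched with the standard library. The element names and attributes are illustrative assumptions; the paper's XML schema is not given in the abstract:

```python
import gzip
import xml.etree.ElementTree as ET

def build_change_report(changes):
    """Describe changed pages in XML and gzip-compress the result, so a
    crawler can fetch one small file instead of recrawling the site.
    `changes` is a list of (url, kind) pairs, kind being 'updated' or 'added'."""
    root = ET.Element("changes")
    for url, kind in changes:
        ET.SubElement(root, "page", {"url": url, "type": kind})
    xml_bytes = ET.tostring(root, encoding="utf-8")
    return gzip.compress(xml_bytes)

report = build_change_report([
    ("/index.html", "updated"),
    ("/new-post.html", "added"),
])
print(len(report), "bytes compressed")
```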
Findings
The results of this study have shown that the traffic caused by search engines' crawlers might be reduced on average by 84 percent for text content. However, binary content faces many challenges, and new algorithms have to be developed to overcome these issues. The proposed approach will certainly mitigate the deep web issue. The XML files for each domain used by search engines might also be used by web browsers to refresh their caches, and therefore help reduce the traffic generated by normal users. This reduces users' perceived latency and improves response time to HTTP requests.
Research limitations/implications
The study sheds light on the deficiencies and weaknesses of the algorithms monitoring changes and generating binary files. However, a substantial decrease of traffic is achieved for text-based web content.
Practical implications
The findings of this research can be adopted by web server software and browsers’ developers and search engine companies to reduce the internet traffic caused by crawlers and cut costs.
Originality/value
The exponential growth of web content and of other internet-based services such as cloud computing and social networks has been causing contention on the available bandwidth of the internet. This research provides a much-needed approach to keeping traffic in check.
Abstract
There have been many attempts to study the content of the Web, either through human or automatic agents. Describes five different previously used Web survey methodologies, each justifiable in its own right, but presents a simple experiment that demonstrates concrete differences between them. The concept of crawling the Web also bears further inspection, including the scope of the pages to crawl, the method used to access and index each page, and the algorithm for the identification of duplicate pages. The issues involved here will be well‐known to many computer scientists but, with the increasing use of crawlers and search engines in other disciplines, they now require a public discussion in the wider research community. Concludes that any scientific attempt to crawl the Web must make available the parameters under which it is operating so that researchers can, in principle, replicate experiments or be aware of and take into account differences between methodologies. Also introduces a new hybrid random page selection methodology.
Abstract
Web impact factors, the proposed web equivalent of impact factors for journals, can be calculated by using search engines. It has been found that the results are problematic because of the variable coverage of search engines as well as their ability to give significantly different results over short periods of time. The fundamental problem is that although some search engines provide a functionality that is capable of being used for impact calculations, this is not their primary task and therefore they do not give guarantees as to performance in this respect. In this paper, a bespoke web crawler designed specifically for the calculation of reliable WIFs is presented. This crawler was used to calculate WIFs for a number of UK universities, and the results of these calculations are discussed. The principal findings were that with certain restrictions, WIFs can be calculated reliably, but do not correlate with accepted research rankings owing to the variety of material hosted on university servers. Changes to the calculations to improve the fit of the results to research rankings are proposed, but there are still inherent problems undermining the reliability of the calculation. These problems still apply if the WIF scores are taken on their own as indicators of the general impact of any area of the Internet, but with care would not apply to online journals.
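A WIF calculation can be sketched in a few lines. WIF formulations vary; this sketch divides external inlinks by the number of pages the site hosts, and treating self-links as excluded is an assumption, not the paper's stated definition:

```python
def web_impact_factor(inlink_count, page_count):
    """One common WIF formulation: links pointing at a site divided by
    the number of pages the site hosts. Variants differ on whether
    self-links are counted; this sketch assumes external inlinks only."""
    return inlink_count / page_count if page_count else 0.0

# e.g. a hypothetical university site with 5,000 pages and 12,500 inlinks
print(web_impact_factor(12500, 5000))  # 2.5
```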
Ryan Scrivens, Tiana Gaudette, Garth Davies and Richard Frank
Abstract
Purpose – This chapter examines how sentiment analysis and web-crawling technology can be used to conduct large-scale data analyses of extremist content online.
Methods/approach – The authors describe a customized web crawler developed to collect, classify, and interpret extremist content online on a large scale, followed by an overview of a relatively novel machine learning tool, sentiment analysis, which has sparked the interest of some researchers in the field of terrorism and extremism studies. The authors conclude with a discussion of what they believe is the future applicability of sentiment analysis within the online political violence research domain.
Findings – In order to gain a broader understanding of online extremism, or to improve the means by which researchers and practitioners “search for a needle in a haystack,” the authors recommend that social scientists continue to collaborate with computer scientists, combining sentiment analysis software with other classification tools and research methods, as well as validate sentiment analysis programs and adapt sentiment analysis software to new and evolving radical online spaces.
Originality/value – This chapter provides researchers and practitioners who are faced with new challenges in detecting extremist content online with insights regarding the applicability of a specific set of machine learning techniques and research methods to conduct large-scale data analyses in the field of terrorism and extremism studies.
Herbert Zuze and Melius Weideman
Abstract
Purpose
The purpose of this research project was to determine how the three biggest search engines interpret keyword stuffing as a negative design element.
Design/methodology/approach
This research was based on triangulation between scholarly reporting, search engine claims, SEO practitioners and empirical evidence on the interpretation of keyword stuffing. Five websites with varying keyword densities were designed and submitted to Google, Yahoo! and Bing. The experiment was conducted in two phases, and the responses of the search engines were recorded.
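The keyword density measure that varied across the five test websites can be sketched as follows; the exact counting rules (whole-word matching on whitespace-split body text) are an assumption:

```python
def keyword_density(text, keyword):
    """Keyword occurrences as a percentage of all words in the body text.
    Whole-word, case-insensitive matching is assumed here."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w == keyword.lower())
    return 100.0 * hits / len(words)

body = "seo seo seo tips for seo beginners"
print(round(keyword_density(body, "seo"), 1))  # 57.1
```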
Findings
Scholars hold differing views on spamdexing, characterised by different keyword density measurements in the body text of a webpage. During both phases, almost all the test webpages, including the one with a 97.3 per cent keyword density, were indexed.
Research limitations/implications
Only the three biggest search engines were considered, and monitoring was done for a set time only. The claims that high keyword densities will lead to blacklisting have been refuted.
Originality/value
Websites should be designed with high quality, well‐written content. Even though keyword stuffing is unlikely to lead to search engine penalties, it could deter human visitors and reduce website value.
Abstract
Since 1996, hyperlinks have been studied extensively by applying existing bibliometric methods. The Web impact factor (WIF), for example, is the online counterpart of the journal impact factor. This paper reviews how this link‐based metric has been developed, enhanced and applied. Not only has the metric itself undergone improvement but also the relevant data collection techniques have been enhanced. WIFs have also been validated by significant correlations with traditional research measures. Bibliometric techniques have been further applied to the Web and patterns that might have otherwise been ignored have been found from hyperlinks. This paper concludes with some suggestions for future research.