Search results
1 – 10 of 513
Christian Girardi, Filippo Ricca and Paolo Tonella
Abstract
Tools for the assessment of the quality and reliability of Web applications depend on the ability to download the target of the analysis. This is achieved through Web crawlers, which can automatically navigate within a Web site and perform appropriate actions (such as downloading) during the visit. The most important performance indicators for a Web crawler are its completeness and robustness, which measure, respectively, its ability to visit the Web site entirely and without errors. The variety of implementation languages and technologies used in Web site development makes these two indicators hard to maximize. We conducted an evaluation study in which we tested several of the available Web crawlers.
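The two indicators can be made concrete with a minimal sketch. The following is not the authors' evaluation harness; it is an illustrative breadth-first crawler over an in-memory link graph (a stand-in for real HTTP fetching), where pages mapping to `None` simulate download failures:

```python
from collections import deque

def crawl(site, start):
    """Breadth-first crawl of a site given as a dict page -> list of links.
    Pages mapping to None simulate download errors.
    Returns (pages_visited_ok, pages_that_errored)."""
    seen, ok, errors = {start}, [], []
    queue = deque([start])
    while queue:
        page = queue.popleft()
        links = site.get(page)
        if links is None:          # simulated download failure
            errors.append(page)
            continue
        ok.append(page)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return ok, errors

# Hypothetical four-page site; "/b" fails to download.
site = {"/": ["/a", "/b"], "/a": ["/c"], "/b": None, "/c": []}
ok, errors = crawl(site, "/")
completeness = len(ok) / len(site)              # fraction of the site retrieved
robustness = len(ok) / (len(ok) + len(errors))  # fraction of visits without error
```

On this toy site both metrics come out to 0.75: one of the four pages is reachable but fails, so the crawl is neither fully complete nor fully robust.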
Hyo-Jung Oh, Dong-Hyun Won, Chonghyuck Kim, Sung-Hee Park and Yong Kim
Abstract
Purpose
The purpose of this paper is to describe the development of an algorithm for realizing web crawlers that automatically collect dynamically generated webpages from the deep web.
Design/methodology/approach
This study proposes and develops an algorithm to collect web information as if the web crawler gathers static webpages by managing script commands as links. The proposed web crawler actually experiments with the algorithm by collecting deep webpages.
Findings
One finding of this study is that when search results are delivered as script-generated pages, a conventional crawler collects only the first page. The proposed algorithm, by contrast, can collect the deep webpages in this case.
Research limitations/implications
To use a script as a link, a human must first analyze the web document. This study uses the web browser object provided by Microsoft Visual Studio as a script launcher, so it cannot collect deep webpages if the web browser object cannot launch the script, or if the web document contains script errors.
Practical implications
The research results show that the deep web is estimated to hold 450 to 550 times more information than the surface web, yet its documents are difficult to collect. This algorithm enables deep web collection by executing scripts.
Originality/value
This study presents a new method that uses script links rather than the keyword-based approaches of previous work; the proposed algorithm makes a script usable as an ordinary URL. The conducted experiment shows that the scripts on individual websites must first be analyzed before they can be employed as links.
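The core idea of treating script commands as links can be sketched as follows. This is an illustration, not the paper's implementation (which uses the Microsoft Visual Studio web browser object); the `goPage` function name and the result-page URL pattern are invented for the example:

```python
import re

# Anchors like <a href="javascript:goPage(2)"> generate pages dynamically and
# are invisible to a static crawler. Mapping each script call to a synthetic
# URL lets the crawler enumerate those deep pages as if they were static links.
SCRIPT_LINK = re.compile(r"""href=["']javascript:goPage\((\d+)\)["']""")

def script_links(html, base="http://example.com/results?page="):
    """Convert each goPage(n) script call found in html into a crawlable URL."""
    return [base + n for n in SCRIPT_LINK.findall(html)]

html = '<a href="javascript:goPage(2)">2</a> <a href="javascript:goPage(3)">3</a>'
links = script_links(html)
```

A real implementation would need per-site analysis of the script patterns, which is exactly the limitation the abstract notes.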
Ryan Scrivens, Tiana Gaudette, Garth Davies and Richard Frank
Abstract
Purpose – This chapter examines how sentiment analysis and web-crawling technology can be used to conduct large-scale data analyses of extremist content online.
Methods/approach – The authors describe a customized web-crawler that was developed for the purpose of collecting, classifying, and interpreting extremist content online and on a large scale, followed by an overview of a relatively novel machine learning tool, sentiment analysis, which has sparked the interest of some researchers in the field of terrorism and extremism studies. The authors conclude with a discussion of what they believe is the future applicability of sentiment analysis within the online political violence research domain.
Findings – In order to gain a broader understanding of online extremism, or to improve the means by which researchers and practitioners “search for a needle in a haystack,” the authors recommend that social scientists continue to collaborate with computer scientists, combining sentiment analysis software with other classification tools and research methods, as well as validate sentiment analysis programs and adapt sentiment analysis software to new and evolving radical online spaces.
Originality/value – This chapter provides researchers and practitioners who are faced with new challenges in detecting extremist content online with insights regarding the applicability of a specific set of machine learning techniques and research methods to conduct large-scale data analyses in the field of terrorism and extremism studies.
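For readers unfamiliar with sentiment analysis, the simplest family of techniques scores text against word lists. The sketch below is a minimal lexicon-based scorer, a stand-in for illustration only, not the authors' classifier, and the two word lists are invented examples:

```python
# Toy sentiment lexicons (illustrative only).
POSITIVE = {"good", "support", "peace"}
NEGATIVE = {"hate", "attack", "enemy"}

def sentiment(text):
    """Score text as (# positive words) - (# negative words).
    Positive scores suggest positive sentiment, negative scores the opposite."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

sentiment("they attack and hate")  # -2
```

Research-grade tools replace the hand-built lexicons with trained machine learning models, but the input/output shape (text in, polarity score out) is the same, which is what makes such tools easy to combine with a web crawler's output.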
Michael P. Evans and Andrew Walker
Abstract
The Web's link structure (termed the Web Graph) is a richly connected set of Web pages. Current applications use this graph for indexing and information retrieval purposes. In contrast, this work reverses the relationship between the Web Graph and the application by letting the structure of the Web Graph influence the application's behaviour. Presents a novel Web crawling agent, AlienBot, the output of which is orthogonally coupled to the enemy generation strategy of a computer game. The Web Graph guides AlienBot, causing it to generate a stochastic process. Shows the effectiveness of such unorthodox coupling for both the playability of the game and the heuristics of the Web crawler. In addition, presents the results of the sample of Web pages collected by the crawling process. In particular, shows: how AlienBot was able to identify the power law inherent in the link structure of the Web; that 61.74 per cent of Web pages use some form of scripting technology; that the size of the Web can be estimated at just over 5.2 billion pages; and that less than 7 per cent of Web pages fully comply with some variant of (X)HTML.
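The power-law claim refers to the in-degree distribution of Web pages, P(k) ∝ k^(−α). A standard way to identify such a law from crawl data is a least-squares fit on a log-log scale; the sketch below (with synthetic data, not the paper's sample) shows the idea:

```python
import math

def power_law_exponent(degree_counts):
    """Estimate alpha in P(k) ~ k^-alpha by linear regression of
    log(count) against log(degree). degree_counts maps an in-degree
    to the number of pages observed with that in-degree."""
    xs = [math.log(k) for k in degree_counts]
    ys = [math.log(c) for c in degree_counts.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return -slope  # a power law appears as a straight line of slope -alpha

# Synthetic counts drawn from P(k) ~ k^-2.1
counts = {k: round(1e6 * k ** -2.1) for k in range(1, 50)}
alpha = power_law_exponent(counts)  # ~2.1
```

On real crawl data the fit is noisier, but a roughly straight line on the log-log plot is the signature AlienBot's sample exhibited.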
Abstract
There have been many attempts to study the content of the Web, either through human or automatic agents. Describes five different previously used Web survey methodologies, each justifiable in its own right, but presents a simple experiment that demonstrates concrete differences between them. The concept of crawling the Web also bears further inspection, including the scope of the pages to crawl, the method used to access and index each page, and the algorithm for the identification of duplicate pages. The issues involved here will be well‐known to many computer scientists but, with the increasing use of crawlers and search engines in other disciplines, they now require a public discussion in the wider research community. Concludes that any scientific attempt to crawl the Web must make available the parameters under which it is operating so that researchers can, in principle, replicate experiments or be aware of and take into account differences between methodologies. Also introduces a new hybrid random page selection methodology.
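One of the methodological parameters the abstract flags, the algorithm for identifying duplicate pages, is worth making concrete, because different choices yield different survey results. A common baseline (an illustration, not the paper's method) is to hash each page's whitespace-normalized content and discard repeats:

```python
import hashlib

def dedupe(pages):
    """Keep only the first URL for each distinct page body.
    pages: iterable of (url, body) pairs. Bodies are normalized by
    collapsing whitespace before hashing, so trivial variants collapse."""
    seen, unique = set(), []
    for url, body in pages:
        digest = hashlib.sha256(" ".join(body.split()).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(url)
    return unique

pages = [("/a", "Hello  world"), ("/b", "Hello world"), ("/c", "Other")]
dedupe(pages)  # ["/a", "/c"] -- "/b" is a whitespace-variant duplicate
```

Whether such normalization counts "/b" as a duplicate of "/a" is exactly the kind of parameter the author argues crawl-based studies must publish for replicability.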
Abstract
Web impact factors, the proposed web equivalent of impact factors for journals, can be calculated by using search engines. It has been found that the results are problematic because of the variable coverage of search engines as well as their ability to give significantly different results over short periods of time. The fundamental problem is that although some search engines provide a functionality that is capable of being used for impact calculations, this is not their primary task and therefore they do not give guarantees as to performance in this respect. In this paper, a bespoke web crawler designed specifically for the calculation of reliable WIFs is presented. This crawler was used to calculate WIFs for a number of UK universities, and the results of these calculations are discussed. The principal findings were that with certain restrictions, WIFs can be calculated reliably, but do not correlate with accepted research rankings owing to the variety of material hosted on university servers. Changes to the calculations to improve the fit of the results to research rankings are proposed, but there are still inherent problems undermining the reliability of the calculation. These problems still apply if the WIF scores are taken on their own as indicators of the general impact of any area of the Internet, but with care would not apply to online journals.
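The calculation itself is simple once a crawler has gathered reliable link data. In its common form, the web impact factor is the number of pages linking to a site divided by the number of pages at that site; the figures below are invented for illustration:

```python
def wif(inlinking_pages, site_pages):
    """Web impact factor: pages linking to a site per page at the site,
    mirroring the citations-per-article form of a journal impact factor."""
    return inlinking_pages / site_pages

# Hypothetical university site: 12,000 inlinking pages, 4,000 site pages.
wif(12_000, 4_000)  # 3.0
```

The paper's point is that the hard part is not this division but obtaining stable, well-defined counts for both quantities, which is why a bespoke crawler outperforms general-purpose search engines for the task.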
Abstract
Purpose
This paper aims to describe several methods to expose website information to Web crawlers for providing value-added services to patrons.
Design/methodology/approach
This is a conceptual paper exploring the areas of search engine optimization (SEO) and usability in the context of search engines.
Findings
Not applicable
Originality/value
This paper explains several methods that can be used to appropriately expose website content and library services to the Web crawlers in such a way that services and content can be syndicated via those search engines.
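One widely used exposure method of the kind the paper describes is publishing an XML sitemap that enumerates each record's URL for crawlers. As a hedged sketch (the library URL is invented, and this is a generic illustration rather than the paper's specific recommendations), a sitemap can be generated like this:

```python
import xml.etree.ElementTree as ET

def sitemap(urls):
    """Build a minimal sitemap.xml body listing each URL in a <loc> entry,
    per the sitemaps.org 0.9 schema."""
    root = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for u in urls:
        loc = ET.SubElement(ET.SubElement(root, "url"), "loc")
        loc.text = u
    return ET.tostring(root, encoding="unicode")

xml_body = sitemap(["https://library.example.org/record/1"])
```

Served at the site root and referenced from robots.txt, such a file lets crawlers discover catalog records that are otherwise reachable only through search forms.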
Abstract
Purpose
This paper aims to evaluate initial user perceptions and use of Amazon's Kindle DX e‐book reader.
Design/methodology/approach
The web crawler software LocoySpider was first used to extract the Kindle DX customer reviews from the Amazon website; then 100 records randomly selected from the 1,358 customer reviews were analyzed with QSR NVivo 8 software to code the pros and cons of the Kindle DX.
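The sampling step described above amounts to drawing 100 reviews uniformly at random from the 1,358 collected. A minimal sketch (not the authors' procedure; the seed and placeholder review labels are invented):

```python
import random

random.seed(42)  # fixed seed so the illustrative draw is reproducible
reviews = [f"review-{i}" for i in range(1, 1359)]  # 1,358 collected reviews
sample = random.sample(reviews, 100)               # 100 drawn without replacement
```

Sampling without replacement guarantees the 100 coded records are distinct reviews.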
Findings
Data analysis indicates that the Kindle DX is well received as an e-book device. However, customers also expressed requirements for the next-generation Kindle e-book reader.
Originality/value
This is one of the first research papers of its kind to collect customer reviews of the Amazon Kindle DX with web crawler software and use NVivo 8 qualitative data analysis software to explore user perceptions of the Kindle DX e‐book reader.