Christian Girardi, Filippo Ricca and Paolo Tonella
Abstract
Tools for the assessment of the quality and reliability of Web applications rely on the ability to download the target of the analysis. This is achieved through Web crawlers, which can automatically navigate within a Web site and perform proper actions (such as download) during the visit. The most important performance indicators for a Web crawler are its completeness and robustness, which measure, respectively, its ability to visit the Web site entirely and without errors. The variety of implementation languages and technologies used for Web site development makes these two indicators hard to maximize. We conducted an evaluation study in which we tested several of the available Web crawlers.
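The two indicators named in the abstract can be computed from a crawl log. A minimal sketch follows; the metric definitions here are an assumption based on the abstract's description, not the authors' exact formulas:

```python
def crawl_metrics(total_pages, visited_pages, failed_visits):
    """Completeness: fraction of the site's pages the crawler reached.
    Robustness: fraction of attempted visits that completed without error.
    Both definitions are illustrative assumptions."""
    completeness = visited_pages / total_pages if total_pages else 0.0
    attempts = visited_pages + failed_visits
    robustness = visited_pages / attempts if attempts else 0.0
    return completeness, robustness

# Example: a site with 200 pages; 180 were visited, 20 visits failed.
completeness, robustness = crawl_metrics(200, 180, 20)
print(f"completeness={completeness:.2f} robustness={robustness:.2f}")
```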
Saed ALQARALEH, Omar RAMADAN and Muhammed SALAMAH
Abstract
Purpose
The purpose of this paper is to design a watcher-based crawler (WBC) that has the ability of crawling static and dynamic web sites, and can download only the updated and newly added web pages.
Design/methodology/approach
In the proposed WBC, a watcher file, which can be uploaded to the web sites' servers, prepares a report that contains the addresses of the updated and the newly added web pages. In addition, the WBC is split into five units, each responsible for performing a specific crawling process.
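The watcher-file mechanism can be sketched as follows. The report schema, field names, and JSON encoding are illustrative assumptions; the paper does not publish the file format:

```python
import json

# Hypothetical watcher report, as a server-side watcher file might emit it.
watcher_report = json.dumps({
    "site": "https://example.com",
    "updated": ["/news/article-17.html"],
    "added": ["/news/article-42.html"],
})

def pages_to_crawl(report_text):
    """Return only the updated and newly added pages named in the watcher
    report, so the crawler need not revisit the entire site."""
    report = json.loads(report_text)
    return [report["site"] + path for path in report["updated"] + report["added"]]

print(pages_to_crawl(watcher_report))
# ['https://example.com/news/article-17.html', 'https://example.com/news/article-42.html']
```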
Findings
Several experiments have been conducted, and it has been observed that the proposed WBC increases the number of uniquely visited static and dynamic web sites as compared with existing crawling techniques. In addition, the proposed watcher file not only allows crawlers to visit the updated and newly added web pages, but also solves the crawlers' overlapping and communication problems.
Originality/value
The proposed WBC performs all crawling processes itself: it detects all updated and newly added pages automatically, without explicit human intervention and without downloading entire web sites.
Hyo-Jung Oh, Dong-Hyun Won, Chonghyuck Kim, Sung-Hee Park and Yong Kim
Abstract
Purpose
The purpose of this paper is to describe the development of an algorithm for realizing web crawlers that automatically collect dynamically generated webpages from the deep web.
Design/methodology/approach
This study proposes and develops an algorithm to collect web information as if the web crawler gathers static webpages by managing script commands as links. The proposed web crawler actually experiments with the algorithm by collecting deep webpages.
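The idea of managing script commands as links can be sketched with a parser that queues script-triggered navigation alongside ordinary hrefs. This is a simplified static sketch; the paper drives the scripts through a browser object rather than extracting them from markup, and the attribute handling here is an assumption:

```python
from html.parser import HTMLParser

class ScriptLinkCollector(HTMLParser):
    """Collects ordinary hrefs and script commands (javascript: URLs,
    onclick handlers) alike, so script-generated pages can be queued for
    crawling the same way static links are."""
    def __init__(self):
        super().__init__()
        self.links = []     # static links, crawled directly
        self.scripts = []   # script commands, to be launched by a browser object

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href") or ""
        if href.startswith("javascript:"):
            self.scripts.append(href[len("javascript:"):])
        elif href:
            self.links.append(href)
        if "onclick" in attrs:
            self.scripts.append(attrs["onclick"])

html = '<a href="page2.html">next</a><a href="javascript:goPage(3)">3</a>'
p = ScriptLinkCollector()
p.feed(html)
print(p.links, p.scripts)  # ['page2.html'] ['goPage(3)']
```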
Findings
Among the findings of this study is that if the actual crawling process provides search results as script pages, the crawler collects only the first page. The proposed algorithm, however, can collect deep webpages in this case.
Research limitations/implications
To use a script as a link, a human must first analyze the web document. This study uses the web browser object provided by Microsoft Visual Studio as a script launcher, so it cannot collect deep webpages if the web browser object cannot launch the script, or if the web document contains script errors.
Practical implications
The deep web is estimated to hold 450 to 550 times more information than surface webpages, and its documents are difficult to collect. This algorithm helps to enable deep web collection through script runs.
Originality/value
This study presents a new method that uses script links rather than the keyword-based approaches of previous work. The proposed algorithm makes a script available as an ordinary URL. The conducted experiment shows that the scripts on individual websites must be analyzed before they can be employed as links.
Abstract
Purpose
The purpose of this paper is to decrease the traffic created by search engines’ crawlers and solve the deep web problem using an innovative approach.
Design/methodology/approach
A new algorithm was formulated based on best existing algorithms to optimize the existing traffic caused by web crawlers, which is approximately 40 percent of all networking traffic. The crux of this approach is that web servers monitor and log changes and communicate them as an XML file to search engines. The XML file includes the information necessary to generate refreshed pages from existing ones and reference new pages that need to be crawled. Furthermore, the XML file is compressed to decrease its size to the minimum required.
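The server-side change report described above can be sketched with the standard library. The element names and attributes are illustrative assumptions; the paper's XML schema is not given in the abstract:

```python
import gzip
import xml.etree.ElementTree as ET

def build_change_report(changes):
    """Describe changed pages in XML and gzip-compress the result, so a
    crawler can fetch one small file instead of recrawling the site.
    `changes` is a list of (url, kind) pairs, kind being 'updated' or 'added'."""
    root = ET.Element("changes")
    for url, kind in changes:
        ET.SubElement(root, "page", {"url": url, "type": kind})
    xml_bytes = ET.tostring(root, encoding="utf-8")
    return gzip.compress(xml_bytes)

report = build_change_report([
    ("/index.html", "updated"),
    ("/new-post.html", "added"),
])
print(len(report), "bytes compressed")
```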
Findings
The results of this study have shown that the traffic caused by search engines' crawlers might be reduced on average by 84 percent for text content. However, binary content faces many challenges, and new algorithms have to be developed to overcome these issues. The proposed approach will certainly mitigate the deep web issue. The XML files for each domain used by search engines might also be used by web browsers to refresh their caches, and therefore help reduce the traffic generated by normal users. This reduces users' perceived latency and improves response time to HTTP requests.
Research limitations/implications
The study sheds light on the deficiencies and weaknesses of the algorithms monitoring changes and generating binary files. However, a substantial decrease of traffic is achieved for text-based web content.
Practical implications
The findings of this research can be adopted by web server software and browsers’ developers and search engine companies to reduce the internet traffic caused by crawlers and cut costs.
Originality/value
The exponential growth of web content and of other internet-based services such as cloud computing and social networks has been causing contention on the available bandwidth of the internet. This research provides a much-needed approach to keeping traffic in check.
Abstract
There have been many attempts to study the content of the Web, either through human or automatic agents. Describes five different previously used Web survey methodologies, each justifiable in its own right, but presents a simple experiment that demonstrates concrete differences between them. The concept of crawling the Web also bears further inspection, including the scope of the pages to crawl, the method used to access and index each page, and the algorithm for the identification of duplicate pages. The issues involved here will be well‐known to many computer scientists but, with the increasing use of crawlers and search engines in other disciplines, they now require a public discussion in the wider research community. Concludes that any scientific attempt to crawl the Web must make available the parameters under which it is operating so that researchers can, in principle, replicate experiments or be aware of and take into account differences between methodologies. Also introduces a new hybrid random page selection methodology.
Abstract
Web impact factors, the proposed web equivalent of impact factors for journals, can be calculated by using search engines. It has been found that the results are problematic because of the variable coverage of search engines as well as their ability to give significantly different results over short periods of time. The fundamental problem is that although some search engines provide a functionality that is capable of being used for impact calculations, this is not their primary task and therefore they do not give guarantees as to performance in this respect. In this paper, a bespoke web crawler designed specifically for the calculation of reliable WIFs is presented. This crawler was used to calculate WIFs for a number of UK universities, and the results of these calculations are discussed. The principal findings were that with certain restrictions, WIFs can be calculated reliably, but do not correlate with accepted research rankings owing to the variety of material hosted on university servers. Changes to the calculations to improve the fit of the results to research rankings are proposed, but there are still inherent problems undermining the reliability of the calculation. These problems still apply if the WIF scores are taken on their own as indicators of the general impact of any area of the Internet, but with care would not apply to online journals.
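A WIF calculation can be sketched in a few lines. WIF formulations vary; this sketch divides external inlinks by the number of pages the site hosts, and treating self-links as excluded is an assumption, not the paper's stated definition:

```python
def web_impact_factor(inlink_count, page_count):
    """One common WIF formulation: links pointing at a site divided by
    the number of pages the site hosts. Variants differ on whether
    self-links are counted; this sketch assumes external inlinks only."""
    return inlink_count / page_count if page_count else 0.0

# e.g. a hypothetical university site with 5,000 pages and 12,500 inlinks
print(web_impact_factor(12500, 5000))  # 2.5
```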
Ryan Scrivens, Tiana Gaudette, Garth Davies and Richard Frank
Abstract
Purpose – This chapter examines how sentiment analysis and web-crawling technology can be used to conduct large-scale data analyses of extremist content online.
Methods/approach – The authors describe a customized web crawler developed to collect, classify, and interpret extremist content online on a large scale, followed by an overview of a relatively novel machine learning tool, sentiment analysis, which has sparked the interest of some researchers in the field of terrorism and extremism studies. The authors conclude with a discussion of what they believe is the future applicability of sentiment analysis within the online political violence research domain.
Findings – In order to gain a broader understanding of online extremism, or to improve the means by which researchers and practitioners “search for a needle in a haystack,” the authors recommend that social scientists continue to collaborate with computer scientists, combining sentiment analysis software with other classification tools and research methods, as well as validate sentiment analysis programs and adapt sentiment analysis software to new and evolving radical online spaces.
Originality/value – This chapter provides researchers and practitioners who are faced with new challenges in detecting extremist content online with insights regarding the applicability of a specific set of machine learning techniques and research methods to conduct large-scale data analyses in the field of terrorism and extremism studies.
Herbert Zuze and Melius Weideman
Abstract
Purpose
The purpose of this research project was to determine how the three biggest search engines interpret keyword stuffing as a negative design element.
Design/methodology/approach
This research was based on triangulation between scholarly reporting, search engine claims, SEO practitioners and empirical evidence on the interpretation of keyword stuffing. Five websites with varying keyword densities were designed and submitted to Google, Yahoo! and Bing. The experiment was conducted in two phases, and the responses of the search engines were recorded.
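The keyword density measure that varied across the five test websites can be sketched as follows; the exact counting rules (whole-word matching on whitespace-split body text) are an assumption:

```python
def keyword_density(text, keyword):
    """Keyword occurrences as a percentage of all words in the body text.
    Whole-word, case-insensitive matching is assumed here."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w == keyword.lower())
    return 100.0 * hits / len(words)

body = "seo seo seo tips for seo beginners"
print(round(keyword_density(body, "seo"), 1))  # 57.1
```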
Findings
Scholars hold differing views on spamdexing, characterised by different keyword density measurements in the body text of a webpage. During both phases, almost all the test webpages, including the one with a 97.3 per cent keyword density, were indexed.
Research limitations/implications
Only the three biggest search engines were considered, and monitoring was done for a set time only. The claims that high keyword densities will lead to blacklisting have been refuted.
Originality/value
Websites should be designed with high quality, well‐written content. Even though keyword stuffing is unlikely to lead to search engine penalties, it could deter human visitors and reduce website value.
Abstract
Since 1996, hyperlinks have been studied extensively by applying existing bibliometric methods. The Web impact factor (WIF), for example, is the online counterpart of the journal impact factor. This paper reviews how this link‐based metric has been developed, enhanced and applied. Not only has the metric itself undergone improvement but also the relevant data collection techniques have been enhanced. WIFs have also been validated by significant correlations with traditional research measures. Bibliometric techniques have been further applied to the Web and patterns that might have otherwise been ignored have been found from hyperlinks. This paper concludes with some suggestions for future research.