Search results
1 – 10 of over 13,000
Yingjie Yang, Sifeng Liu and Naiming Xie
Abstract
Purpose
The purpose of this paper is to propose a framework for data analytics where everything is grey in nature and the associated uncertainty is considered as an essential part in data collection, profiling, imputation, analysis and decision making.
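The central object of grey systems can be made concrete with interval grey numbers. The following is a minimal sketch, not from the paper: a grey number whose true value is known only to lie within bounds, with uncertainty propagated through arithmetic and an equal-weight "whitenisation" step producing a crisp representative value.

```python
class GreyNumber:
    """Interval grey number: the true value lies somewhere in [lo, hi]."""

    def __init__(self, lo, hi):
        assert lo <= hi
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        # Interval arithmetic: uncertainty propagates through operations.
        return GreyNumber(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        prods = [self.lo * other.lo, self.lo * other.hi,
                 self.hi * other.lo, self.hi * other.hi]
        return GreyNumber(min(prods), max(prods))

    def whitened(self, w=0.5):
        # Whitenisation: pick a representative crisp value from the interval.
        return self.lo + w * (self.hi - self.lo)


# A hypothetical "grey" sensor reading combined with a grey correction factor
reading = GreyNumber(9.5, 10.5)
factor = GreyNumber(0.9, 1.1)
result = reading * factor  # the product interval widens accordingly
```

The point of the sketch is that uncertainty is carried through the whole computation rather than discarded at data-collection time, which is the framework's premise.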
Design/methodology/approach
A comparative study is conducted between the available uncertainty models and the feasibility of grey systems is highlighted. Furthermore, a general framework for the integration of grey systems and grey sets into data analytics is proposed.
Findings
Grey systems and grey sets are useful not only for small data but also for big data. They are complementary to other models and can play a significant role in data analytics.
Research limitations/implications
The proposed framework represents a radical change in data analytics and may fundamentally change the way we deal with uncertainties.
Practical implications
The proposed model has the potential to avoid mistakes arising from misleading data imputation.
Social implications
The proposed model adopts the grey systems philosophy of recognising the limitations of our knowledge, which has significant implications for the way we deal with our social life and relations.
Originality/value
This is the first time that the whole data analytics is considered from the point of view of grey systems.
Héctor Rubén Morales, Marcela Porporato and Nicolas Epelbaum
Abstract
Purpose
The technical feasibility of using Benford's law to assist internal auditors in reviewing the integrity of high-volume data sets is analysed. This study explores whether Benford's distribution applies to the set of numbers represented by the quantity of records (size) that comprise the different tables that make up a state-owned enterprise's (SOE) enterprise resource planning (ERP) relational database. The use of Benford's law streamlines the search for possible abnormalities within the ERP system's data set, increasing the ability of the internal audit functions (IAFs) to detect anomalies within the database. In the SOEs of emerging economies, where groups compete for power and resources, internal auditors are better off employing analytical tests to discharge their duties without getting involved in power struggles.
Design/methodology/approach
Records of eight databases of an SOE in Argentina are used to analyse the number of records of each table over periods of three to 12 years. The case develops a step-by-step application of Benford's law to test each ERP module's records using Chi-squared (χ²) and mean absolute deviation (MAD) goodness-of-fit tests.
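The MAD test against Benford's first-digit distribution can be sketched as follows. This is illustrative code, not the authors'; the synthetic `sizes` data is hypothetical, and the conformity cut-offs in the final comment are Nigrini's commonly cited thresholds.

```python
import math
from collections import Counter


def benford_expected():
    # Benford's law: P(d) = log10(1 + 1/d) for leading digit d = 1..9
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(n):
    return int(str(abs(int(n)))[0])


def mad_statistic(values):
    # Mean absolute deviation between observed and expected
    # leading-digit proportions, averaged over the nine digits.
    digits = [leading_digit(v) for v in values if int(v) != 0]
    total = len(digits)
    counts = Counter(digits)
    expected = benford_expected()
    return sum(abs(counts.get(d, 0) / total - expected[d])
               for d in range(1, 10)) / 9


# Hypothetical "table sizes" spanning many orders of magnitude,
# a process whose leading digits tend to follow Benford's law
sizes = [int(10 ** (0.137 * i)) for i in range(1, 200)]
mad = mad_statistic(sizes)
# Nigrini's first-digit guideline: MAD < 0.006 close conformity,
# MAD > 0.015 nonconformity
```

In an audit setting, the same statistic would be computed over the record counts of the tables in each ERP module and compared against the thresholds.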
Findings
Benford's law is an adequate tool for performing integrity tests of high-volume databases. A minimum of 350 tables within each database is required for the MAD test to be effective; this threshold is higher than the 67 reported by earlier research. Robust results are obtained for the complete ERP system and for large modules; modules with fewer than 350 tables show low conformity with Benford's law.
Research limitations/implications
This study is not about detecting fraud; it aims to help internal auditors red flag databases that will need further attention, making the most out of available limited resources in SOEs. The contribution is a simple, cheap and useful quantitative tool that can be employed by internal auditors in emerging economies to perform the first scan of the data contained in relational databases.
Practical implications
This paper provides a tool to test whether large amounts of data behave as expected, and if not, they can be pinpointed for future investigation. It offers tests and explanations on the tool's application so that internal auditors of SOEs in emerging economies can use it, particularly those that face divergent expectations from antagonist powerful interest groups.
Originality/value
This study demonstrates that even in the context of limited information technology tools available for internal auditors, there are simple and inexpensive tests to review the integrity of high-volume databases. It also extends the literature on high-volume database integrity tests and our knowledge of the IAF in Civil law countries, particularly emerging economies in Latin America.
Abstract
This article provides an account of how databases can be effectively used in entrepreneurship research. Improved quality and access to large secondary databases offer paths to answer questions of great theoretical value. I present an overview of theoretical, methodological, and practical difficulties in working with database data, together with advice on how such difficulties can be overcome. Conclusions are given, together with suggestions of areas where databases might provide real and important contributions to entrepreneurship research.
Hafiz A. Alaka, Lukumon O. Oyedele, Hakeem A. Owolabi, Muhammad Bilal, Saheed O. Ajayi and Olugbenga O. Akinade
Abstract
This study explored the use of big data analytics (BDA) to analyse data from a large number of construction firms in order to develop a construction business failure prediction model (CB-FPM). Careful analysis of the literature revealed financial ratios as the best form of variable for this problem. Because of MapReduce's unsuitability for the iteration problems involved in developing CB-FPMs, various BDA initiatives for iteration problems were identified. A BDA framework for developing a CB-FPM was proposed and validated using 150,000 data cells from 30,000 construction firms, an artificial neural network, Amazon Elastic Compute Cloud, Apache Spark and the R software. The BDA CB-FPM was developed in eight seconds, while the same process without BDA was aborted after nine hours without success. This shows that the reluctance to use large data sets for developing CB-FPMs, owing to the tedious duration involved, can be overcome by applying BDA techniques. The BDA CB-FPM largely outperformed an ordinary CB-FPM developed with a data set of 200 construction firms, showing that the use of a larger sample size, with the aid of BDA, leads to better-performing CB-FPMs. The high financial and social cost associated with misclassifications (i.e. model error) thus makes the adoption of BDA CB-FPMs very important for, among others, financiers, clients and policy makers.
Abstract
Purpose
The purpose of this study is to evaluate Department of Defense (DoD)-backed innovation programs as a means of enhancing the adoption of new technology throughout the armed forces.
Design/methodology/approach
The distribution of 1.29 million defense contract awards over seven years was analyzed across a data set of more than 8,000 DoD-backed innovation program award recipients. Surveys and interviews of key stakeholder groups were conducted to contextualize the quantitative results and garner additional insights.
Findings
Nearly half of DoD innovation program participants achieve no meaningful growth in direct defense business after program completion, and most small, innovative companies that win follow-on defense contracts solely support their initial sponsor branch. Causes for these program failures include the fact that programs do not market participants’ capabilities to the defense community and do not track participant companies after program completion.
Practical implications
Because the DoD does not market the capabilities of its innovation program participants internally, prospective DoD customers conduct redundant market research or fail to modernize. Program participants become increasingly unwilling to invest in the DoD market long term after the programs fail to deliver their expected benefits.
Originality/value
Limited scholarship evaluates the efficacy of DoD-backed innovation programs as a means of enhancing force readiness. This research not only uses a vast data set to demonstrate the failures of these programs but also presents concrete recommendations for improving them – including establishing an “Innovators Database” to track program participants and an incentive to encourage contracting entities and contractors to engage with them.
Zachary Hornberger, Bruce Cox and Raymond R. Hill
Abstract
Purpose
Large, stochastic spatiotemporal demand data sets can prove intractable for location optimization problems, motivating the need for aggregation. However, demand aggregation induces errors. Significant theoretical research has been performed on the modifiable areal unit problem and the zone definition problem, but minimal research has addressed the specific issues inherent to spatiotemporal demand data, such as search and rescue (SAR) data. This study provides a quantitative comparison of various aggregation methodologies and their relation to distance- and volume-based aggregation errors.
Design/methodology/approach
This paper introduces and applies a framework for comparing both deterministic and stochastic aggregation methods using distance- and volume-based aggregation error metrics. This paper additionally applies weighted versions of these metrics to account for the reality that demand events are nonhomogeneous. These metrics are applied to a large, highly variable, spatiotemporal demand data set of SAR events in the Pacific Ocean. Comparisons using these metrics are conducted between six quadrat aggregations of varying scales and two zonal distribution models using hierarchical clustering.
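A distance-based aggregation error of the kind compared here can be sketched in a few lines, assuming each quadrat's demand is represented by its centroid. The function name and toy coordinates are illustrative, not from the study.

```python
import math


def quadrat_centroid_error(points, cell):
    # Distance-based aggregation error: total distance between each
    # demand point and the centroid of the quadrat (grid cell) it
    # falls in, i.e. the displacement introduced by aggregation.
    cells = {}
    for (x, y) in points:
        key = (int(x // cell), int(y // cell))
        cells.setdefault(key, []).append((x, y))
    err = 0.0
    for pts in cells.values():
        cx = sum(p[0] for p in pts) / len(pts)
        cy = sum(p[1] for p in pts) / len(pts)
        err += sum(math.hypot(x - cx, y - cy) for x, y in pts)
    return err


# Hypothetical demand events in a 4x4 region
pts = [(0.1, 0.2), (0.9, 0.8), (3.5, 3.6), (3.6, 3.4)]
coarse = quadrat_centroid_error(pts, 4.0)  # one large quadrat
fine = quadrat_centroid_error(pts, 1.0)    # finer quadrat grid
```

The sketch illustrates the paper's trade-off: finer quadrats reduce the distance-based error, while (as the Findings note) they can worsen volume-based error because each cell contains fewer events.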
Findings
As quadrat fidelity increases the distance-based aggregation error decreases, while the two deliberate zonal approaches further reduce this error while using fewer zones. However, the higher fidelity aggregations detrimentally affect volume error. Additionally, by splitting the SAR data set into training and test sets this paper shows the stochastic zonal distribution aggregation method is effective at simulating actual future demands.
Originality/value
This study indicates that no single best aggregation method exists; by quantifying the trade-offs in aggregation-induced errors, practitioners can choose the method that minimizes the errors most relevant to their study. The study also quantifies the ability of a stochastic zonal distribution method to effectively simulate future demand data.
Vania Vidal, Valéria Magalhães Pequeno, Narciso Moura Arruda Júnior and Marco Antonio Casanova
Abstract
Purpose
Enterprise knowledge graphs (EKG) in resource description framework (RDF) consolidate and semantically integrate heterogeneous data sources into a comprehensive dataspace. However, to make an external relational data source accessible through an EKG, an RDF view of the underlying relational database, called an RDB2RDF view, must be created. The RDB2RDF view should be materialized in situations where live access to the data source is not possible, or the data source imposes restrictions on the type of query forms and the number of results. In this case, a mechanism for maintaining the materialized view data up-to-date is also required. The purpose of this paper is to address the problem of the efficient maintenance of externally materialized RDB2RDF views.
Design/methodology/approach
This paper proposes a formal framework for the incremental maintenance of externally materialized RDB2RDF views, in which the server computes and publishes changesets, indicating the difference between the two states of the view. The EKG system can then download the changesets and synchronize the externally materialized view. The changesets are computed based solely on the update and the source database state and require no access to the content of the view.
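The changeset idea can be sketched with a toy RDB2RDF mapping. The subject IRI template and the `Employee` schema below are invented for illustration, and the paper's framework is more general; the sketch does show the key property that the changeset is computed from source states alone, without reading the materialized view.

```python
def rdf_view(rows):
    # Toy RDB2RDF mapping: each relational row becomes triples
    # (subject, predicate, object) under a fixed IRI template.
    triples = set()
    for r in rows:
        s = f"ex:Employee/{r['id']}"
        triples.add((s, "rdf:type", "ex:Employee"))
        triples.add((s, "ex:name", r["name"]))
    return triples


def changeset(old_rows, new_rows):
    # The changeset is the pair (inserted, deleted) of triples that
    # moves the view from the old state to the new state. It is
    # derived from the two source states only.
    old_t, new_t = rdf_view(old_rows), rdf_view(new_rows)
    return new_t - old_t, old_t - new_t


def apply_changeset(view, inserted, deleted):
    # The EKG side downloads the changeset and synchronizes its copy.
    return (view - deleted) | inserted


old_rows = [{"id": 1, "name": "Ann"}]
new_rows = [{"id": 1, "name": "Anne"}, {"id": 2, "name": "Bo"}]
inserted, deleted = changeset(old_rows, new_rows)
synced = apply_changeset(rdf_view(old_rows), inserted, deleted)
```

Here `synced` equals the view of the new source state, which is the correctness property the paper proves for its (far richer) class of views.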
Findings
The central result of this paper shows that changesets computed according to the formal framework correctly maintain the externally materialized RDB2RDF view. The experiments indicate that the proposed strategy supports live synchronization of large RDB2RDF views and that the time taken to compute the changesets with the proposed approach was almost three orders of magnitude smaller than partial rematerialization and three orders of magnitude smaller than full rematerialization.
Originality/value
The main idea that differentiates the proposed approach from previous work on incremental view maintenance is to explore the object-preserving property of typical RDB2RDF views so that the solution can deal with views with duplicates. The algorithms for the incremental maintenance of relational views with duplicates published in the literature require querying the materialized view data to precisely compute the changesets. By contrast, the approach proposed in this paper requires no access to view data. This is important when the view is maintained externally, because accessing a remote data source may be too slow.
Abstract
Purpose: The study elaborates the contextual conditions of the academic workplace in which gender, age, and nationality considerably influence the likelihood of self-categorization as being affected by workplace bullying. Furthermore, the intersectionality of these sociodemographic characteristics is examined.
Basic Design: The hypotheses underlying the study were mainly derived from social role, social identity, and cultural distance theory, as well as from role congruity and relative deprivation theory. A survey data set of a large German research organization, the Max Planck Society, was used. A total of 3,272 cases of researchers and 2,995 cases of non-scientific employees were included in the analyses. For both groups of employees, binary logistic regression equations were constructed. The outcome of each equation is the estimated percentage of individuals who reported having experienced bullying at work occasionally or more frequently in the 12 months prior to the survey. The predictors are the demographic and organization-specific characteristics (hierarchical position, scientific field, administrative unit) of the respondents and selected interaction terms. Using the regression equations, hypothesis-relevant conditional marginal means and differences in regression parameters were calculated and compared by means of t-tests.
Results: In particular, the gender-related hypotheses of the study could be completely or conditionally verified. Accordingly, female scientific and non-scientific employees showed a higher bullying vulnerability in (almost) all contexts of the academic workplace. An increased bullying vulnerability was also found for foreign researchers. However, the patterns found here contradicted those that were hypothesized. Concerning the effect of age analyzed for non-scientific personnel, especially the age group 45–59 years showed a higher bullying probability, with the gender gap in bullying vulnerability being greatest for the youngest and oldest age groups in the sample.
Interpretation and Relevance: The results of the study especially support the social identity theory regarding gender. In the sample studied, women in minority positions have a higher vulnerability to bullying in their work fields, which is not the case for men. However, the influence of nationality on bullying vulnerability is more complex. The study points to the further development of cultural distance theory, whose hypotheses are only partly able to explain the results. The evidence for social role theory is primarily seen in the interaction of gender with age and hierarchical level. Accordingly, female early career researchers and young women (and women in the oldest age group) on the non-scientific staff presumably experience a masculine workplace. Thus, the results of the study contradict the role congruity theory.
Kai Zheng, Xianjun Yang, Yilei Wang, Yingjie Wu and Xianghan Zheng
Abstract
Purpose
The purpose of this paper is to alleviate the problem of poor robustness and over-fitting caused by large-scale data in collaborative filtering recommendation algorithms.
Design/methodology/approach
Interpreting user behavior from the probabilistic perspective of hidden variables is helpful for improving robustness and over-fitting problems. Constructing a recommendation network by variational inference can effectively solve the complex distribution calculation in the probabilistic recommendation model. Based on this analysis, this paper uses a variational auto-encoder to construct a generating network, which can restore user-rating data to address the poor robustness and over-fitting caused by large-scale data. Meanwhile, to address the KL-vanishing problem in variational inference deep learning models, this paper optimizes the model using the KL annealing and Free Bits methods.
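The two remedies for KL vanishing can be sketched in a few lines. This is an illustrative fragment, assuming the per-dimension KL divergences have already been computed elsewhere in the training loop; it is not the paper's implementation.

```python
def kl_annealing_weight(step, warmup_steps):
    # KL annealing: ramp the KL weight linearly from 0 to 1 so the
    # decoder learns to reconstruct before the KL term dominates.
    return min(1.0, step / warmup_steps)


def free_bits_kl(kl_per_dim, free_bits):
    # Free Bits: each latent dimension may "spend" up to `free_bits`
    # nats without penalty, so the optimizer cannot collapse the
    # posterior to the prior (the cause of KL vanishing).
    return sum(max(kl, free_bits) for kl in kl_per_dim)


# In a training loop, the total loss would combine the two, e.g.:
# loss = reconstruction + kl_annealing_weight(step, 10_000) * free_bits_kl(kls, 0.25)
```

The annealing schedule and the free-bits budget (10,000 steps, 0.25 nats above) are hypothetical hyperparameters; in practice they are tuned per data set.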
Findings
The effect of the basic model is considerably improved after using the KL annealing or Free Bits method to solve KL vanishing. The proposed models evidently perform worse than competitors on small data sets, such as MovieLens 1 M. By contrast, they have better effects on large data sets such as MovieLens 10 M and MovieLens 20 M.
Originality/value
This paper presents the use of a variational inference model for collaborative filtering recommendation and introduces the KL annealing and Free Bits methods to improve on the basic model. Because variational inference training models the probability distribution of the hidden vector, the problems of poor robustness and overfitting are alleviated. When the amount of data is relatively large in the actual application scenario, the probability distribution fitted to the actual data can better represent the users and items. Therefore, using variational inference for collaborative filtering recommendation is of practical value.
Daniel Šandor and Marina Bagić Babac
Abstract
Purpose
Sarcasm is a linguistic expression that usually carries the opposite meaning of what is being said by words, thus making it difficult for machines to discover the actual meaning. It is mainly distinguished by the inflection with which it is spoken, with an undercurrent of irony, and is largely dependent on context, which makes it a difficult task for computational analysis. Moreover, sarcasm expresses negative sentiments using positive words, allowing it to easily confuse sentiment analysis models. This paper aims to demonstrate the task of sarcasm detection using the approach of machine and deep learning.
Design/methodology/approach
For the purpose of sarcasm detection, machine and deep learning models were used on a data set consisting of 1.3 million social media comments, including both sarcastic and non-sarcastic comments. The data set was pre-processed using natural language processing methods, and additional features were extracted and analysed. Several machine learning models, including logistic regression, ridge regression, linear support vector and support vector machines, along with two deep learning models based on bidirectional long short-term memory and one bidirectional encoder representations from transformers (BERT)-based model, were implemented, evaluated and compared.
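As a minimal stand-in for the simplest of these models, a bag-of-words logistic regression trained by stochastic gradient descent can be sketched as follows. The toy corpus and all names are illustrative, not the study's 1.3 million-comment data set, and the study's stronger models (BiLSTM, BERT) are not represented here.

```python
import math
from collections import defaultdict


def featurize(text):
    # Bag-of-words counts; real pipelines would use TF-IDF, n-grams
    # or learned embeddings as additional features.
    feats = defaultdict(float)
    for tok in text.lower().split():
        feats[tok] += 1.0
    return feats


def train_logreg(samples, epochs=200, lr=0.5):
    # Plain logistic regression fitted by stochastic gradient descent.
    w = defaultdict(float)
    b = 0.0
    for _ in range(epochs):
        for feats, label in samples:
            z = b + sum(w[t] * v for t, v in feats.items())
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - label          # gradient of the log loss w.r.t. z
            b -= lr * g
            for t, v in feats.items():
                w[t] -= lr * g * v
    return w, b


def predict(model, text):
    w, b = model
    z = b + sum(w[t] * v for t, v in featurize(text).items())
    return 1 if z > 0 else 0


# Tiny hypothetical corpus (1 = sarcastic, 0 = not sarcastic)
data = [
    ("oh great another monday", 1),
    ("wow what a surprise it failed again", 1),
    ("i really enjoyed the concert", 0),
    ("the weather is nice today", 0),
]
model = train_logreg([(featurize(t), y) for t, y in data])
```

The sketch makes the paper's difficulty concrete: a bag-of-words model can only memorize surface cues ("oh great"), whereas sarcasm often requires the contextual signals that the deep models in the study capture better.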
Findings
The performance of machine and deep learning models was compared in the task of sarcasm detection, and possible ways of improvement were discussed. Deep learning models showed more promise, performance-wise, for this type of task. Specifically, a state-of-the-art model in natural language processing, namely, BERT-based model, outperformed other machine and deep learning models.
Originality/value
This study compared the performance of the various machine and deep learning models in the task of sarcasm detection using the data set of 1.3 million comments from social media.