Search results
1 – 10 of 465
Alexander Döschl, Max-Emanuel Keller and Peter Mandl
Abstract
Purpose
This paper aims to evaluate different approaches for the parallelization of compute-intensive tasks. The study compares a Java multi-threaded algorithm, distributed computing solutions with MapReduce (Apache Hadoop) and resilient distributed data set (RDD) (Apache Spark) paradigms and a graphics processing unit (GPU) approach with Numba for compute unified device architecture (CUDA).
Design/methodology/approach
The paper uses a simple but computationally intensive puzzle as a case study for experiments. To find all solutions using brute force search, 15! permutations had to be computed and tested against the solution rules. The experimental application comprises a Java multi-threaded algorithm, distributed computing solutions with MapReduce (Apache Hadoop) and RDD (Apache Spark) paradigms and a GPU approach with Numba for CUDA. The implementations were benchmarked on Amazon-EC2 instances for performance and scalability measurements.
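The brute-force strategy described above, enumerating all permutations and testing each against the solution rules, can be sketched in miniature. The rule below is purely hypothetical and stands in for the puzzle's actual rules, which the abstract does not specify; a real run over 15! orderings would of course require the parallel approaches the paper compares.

```python
from itertools import permutations

# Miniature analogue of the paper's brute-force search: enumerate all
# permutations of a small set and keep those satisfying a solution rule.
def solve(items, rule):
    """Return all permutations of `items` for which `rule` holds."""
    return [p for p in permutations(items) if rule(p)]

# Hypothetical example rule: adjacent elements must differ by more than 1.
def no_adjacent_neighbours(p):
    return all(abs(a - b) > 1 for a, b in zip(p, p[1:]))

solutions = solve(range(4), no_adjacent_neighbours)
```

Because each permutation is checked independently, the search partitions naturally across threads, MapReduce tasks, RDD partitions or CUDA threads, which is what makes it a clean benchmark for the compared technologies.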
Findings
The comparison of the solutions with Apache Hadoop and Apache Spark under Amazon EMR showed that the processing time measured in CPU minutes with Spark was up to 30% lower, while the performance of Spark especially benefits from an increasing number of tasks. With the CUDA implementation, more than 16 times faster execution is achievable for the same price compared to the Spark solution. Apart from the multi-threaded implementation, the processing times of all solutions scale approximately linearly. Finally, several application suggestions for the different parallelization approaches are derived from the insights of this study.
Originality/value
There are numerous studies that have examined the performance of parallelization approaches. Most of these studies deal with processing large amounts of data or mathematical problems. This work, in contrast, compares these technologies on their ability to implement computationally intensive distributed algorithms.
Shashi Kant Ratnakar, Utpal Kiran and Deepak Sharma
Abstract
Purpose
Structural topology optimization is computationally expensive due to the involvement of high-resolution mesh and repetitive use of finite element analysis (FEA) for computing the structural response. Since FEA consumes most of the computational time in each optimization iteration, a novel GPU-based parallel strategy for FEA is presented and applied to the large-scale structural topology optimization of 3D continuum structures.
Design/methodology/approach
A matrix-free solver based on preconditioned conjugate gradient (PCG) method is proposed to minimize the computational time associated with solution of linear system of equations in FEA. The proposed solver uses an innovative strategy to utilize only symmetric half of elemental stiffness matrices for implementation of the element-by-element matrix-free solver on GPU.
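The matrix-free PCG idea can be illustrated in miniature: the operator is supplied only as a function that applies A to a vector, so the global stiffness matrix is never assembled. The sketch below uses a Jacobi (diagonal) preconditioner and a tridiagonal test operator; the paper's element-by-element GPU kernels and symmetric-half storage format are not reproduced here.

```python
# Minimal matrix-free preconditioned conjugate gradient (PCG) sketch.
# `apply_A` computes A @ x without an assembled matrix; `diag` holds the
# diagonal of A for a simple Jacobi preconditioner (illustrative only).
def pcg(apply_A, b, diag, tol=1e-10, max_iter=100):
    n = len(b)
    x = [0.0] * n
    r = b[:]                                   # residual r = b - A x (x = 0)
    z = [ri / di for ri, di in zip(r, diag)]   # z = M^{-1} r
    p = z[:]
    rz = sum(ri * zi for ri, zi in zip(r, z))
    for _ in range(max_iter):
        Ap = apply_A(p)
        alpha = rz / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if sum(ri * ri for ri in r) ** 0.5 < tol:
            break
        z = [ri / di for ri, di in zip(r, diag)]
        rz_new = sum(ri * zi for ri, zi in zip(r, z))
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x

# Example: SPD tridiagonal operator (2 on the diagonal, -1 off-diagonal).
def apply_A(v):
    n = len(v)
    return [2 * v[i] - (v[i - 1] if i else 0) - (v[i + 1] if i < n - 1 else 0)
            for i in range(n)]

x = pcg(apply_A, [1.0] * 5, [2.0] * 5)
```

On a GPU, the `apply_A` step becomes the element-by-element sparse matrix-vector product that the paper restructures around symmetric-half storage.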
Findings
Using the solid isotropic material with penalization (SIMP) method, the proposed matrix-free solver is tested on three 3D structural optimization problems that are discretized using all-hexahedral structured and unstructured meshes. Results show that the proposed strategy achieves a 3.1×–3.3× speedup for the FEA solver stage and an overall speedup of 2.9×–3.3× over the standard element-by-element strategy on the GPU. Moreover, the proposed strategy requires almost 1.8× less GPU memory than the standard element-by-element strategy.
Originality/value
The proposed GPU-based matrix-free element-by-element solver takes a more general approach to the symmetry concept than previous works. It stores only symmetric half of the elemental matrices in memory and performs matrix-free sparse matrix-vector multiplication (SpMV) without any inter-thread communication. A customized data storage format is also proposed to store and access only symmetric half of elemental stiffness matrices for coalesced read and write operations on GPU over the unstructured mesh.
Mandeep Kaur, Rajinder Sandhu and Rajni Mohana
Abstract
Purpose
The purpose of this study is to verify how effective scheduling can be when application categories are segmented and resources are allocated based on their specific category.
Design/methodology/approach
This paper proposes a scheduling framework for IoT application jobs based upon Quality of Service (QoS) parameters, which works at a coarse-grained level to select a fog environment and at a fine-grained level to select a fog node. The fog environment is chosen considering availability, physical distance, latency and throughput. At the fine-grained (node selection) level, a probability triad (C, M, G) is computed using the Naïve Bayes algorithm, giving the probability that a newly submitted application job falls into one of the categories: compute (C) intensive, memory (M) intensive or GPU (G) intensive.
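A minimal sketch of such a probability triad, assuming a tiny categorical Naïve Bayes with an invented job feature and invented training data (the paper's actual features are not given in the abstract):

```python
from collections import Counter, defaultdict

# Train a tiny categorical Naive Bayes over (features, label) pairs.
def train(jobs):
    """jobs: list of (features_dict, label). Returns priors and value counts."""
    priors = Counter(label for _, label in jobs)
    counts = defaultdict(Counter)   # (label, feature) -> Counter of values
    for feats, label in jobs:
        for f, v in feats.items():
            counts[(label, f)][v] += 1
    return priors, counts

# Probability triad over the C / M / G categories, with Laplace smoothing
# (the "+2" assumes two possible values per feature in this toy example).
def triad(priors, counts, feats, labels=("C", "M", "G")):
    scores = {}
    for lab in labels:
        p = priors[lab] / sum(priors.values())
        for f, v in feats.items():
            c = counts[(lab, f)]
            p *= (c[v] + 1) / (sum(c.values()) + 2)
        scores[lab] = p
    total = sum(scores.values())
    return {lab: s / total for lab, s in scores.items()}

# Hypothetical training data: one binary feature per job.
jobs = [({"heavy_math": "yes"}, "C"), ({"heavy_math": "no"}, "M"),
        ({"heavy_math": "yes"}, "G"), ({"heavy_math": "no"}, "M")]
priors, counts = train(jobs)
p = triad(priors, counts, {"heavy_math": "yes"})
```

The resulting triad sums to one, and the node-selection step can then route the job to the fog node best provisioned for the most probable category.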
Findings
Experimental results showed that the proposed framework performed better than traditional cloud and fog computing paradigms.
Originality/value
The proposed framework combines application types with the computation capabilities of the fog computing environment, which, to the best of the authors' knowledge, has not been done before.
Andre Luis Cavalcanti Bueno, Noemi de La Rocque Rodriguez and Elisa Dominguez Sotelino
Abstract
Purpose
The purpose of this work is to present a methodology that harnesses the computational power of multiple graphics processing units (GPUs) and hides the complexities of tuning GPU parameters from the users.
Design/methodology/approach
A methodology for auto-tuning OpenCL configuration parameters has been developed.
Findings
The described process simplifies coding and yields a significant time gain for each method execution.
Originality/value
Most authors develop their GPU applications for specific hardware configurations. In this work, a solution is offered to make the developed code portable to any GPU hardware.
Victor U. Karthik, Sivamayam Sivasuthan, Arunasalam Rahunanthan, Ravi S. Thyagarajan, Paramsothy Jayakumar, Lalita Udpa and S. Ratnajeevan H. Hoole
Abstract
Purpose
Inverting electroheat problems involves synthesizing the electromagnetic arrangement of coils and geometries to realize a desired heat distribution. To this end, two finite element problems need to be solved: first for the magnetic fields and the joule heat that the associated eddy currents generate and then, based on these heat sources, the second problem for heat distribution. This two-part problem needs to be iterated on to obtain the desired thermal distribution by optimization. Because this is a time-consuming process, the purpose of this paper is to parallelize it using the graphics processing unit (GPU) and the real-coded genetic algorithm, each chosen for both speed and accuracy.
Design/methodology/approach
This coupled problem represents a heavy computational load with long wait-times for results. The GPU has recently been demonstrated to enhance the efficiency and accuracy of finite element computations and cut down solution times. It has also been used to speed up the naturally parallel genetic algorithm. The authors use the GPU to perform coupled electroheat finite element optimization by the genetic algorithm to achieve computational efficiencies far better than those reported for a single finite element problem. In the genetic algorithm, coding objective functions in real numbers rather than binary arithmetic gives added speed and accuracy.
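A real-coded genetic algorithm, in which chromosomes are vectors of real numbers rather than binary strings, can be sketched as follows. The objective is a simple test function standing in for the paper's electroheat objective, and all parameter values (population size, mutation rate, bounds) are illustrative.

```python
import random

# Minimal real-coded genetic algorithm: truncation selection, arithmetic
# (blend) crossover on real vectors, and Gaussian mutation. Minimising the
# sphere function here; the paper evaluates an electroheat FE objective.
def evolve(fitness, dim, pop_size=40, gens=60, seed=1):
    rng = random.Random(seed)
    pop = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fitness)
        elite = pop[: pop_size // 2]           # keep the better half
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            w = rng.random()                   # arithmetic crossover weight
            child = [w * x + (1 - w) * y for x, y in zip(a, b)]
            if rng.random() < 0.3:             # Gaussian mutation
                i = rng.randrange(dim)
                child[i] += rng.gauss(0, 0.1)
            children.append(child)
        pop = elite + children
    return min(pop, key=fitness)

# Example: minimise the sphere function, whose optimum is the zero vector.
best = evolve(lambda x: sum(v * v for v in x), dim=3)
```

Each generation's fitness evaluations are independent, which is what makes the algorithm naturally parallel: on a GPU, each thread can evaluate one individual, here one coupled FE solve.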
Findings
The feasibility of the method proposed to reduce computational time and increase accuracy is established through the simple problem of shaping a current-carrying conductor so as to yield a constant temperature along a line. The authors obtained a speedup (the ratio of CPU time to GPU time) saturating at about 28 for a population size of 500, because of increasing communication between threads. But this is still far better than what is possible on a workstation.
Research limitations/implications
By using the intrinsically parallel genetic algorithm on a GPU, large complex coupled problems may be solved very quickly. Since the primary purpose here is to establish methodology and feasibility, the thermal problem is simplified by neglecting convection and radiation; while that introduces some error, the computational procedure is still validated. The method demonstrated here may be trivially extended to more completely modeled electroheat systems.
Practical implications
The methodology established has direct applications in electrical machine design, metallurgical mixing processes and hyperthermia treatment in oncology. In these three practical application areas, the authors need to compute the exciting coil (or antenna) arrangement (current magnitude and phase) and device geometry that would accomplish a desired heat distribution to achieve mixing, reduce machine heat or burn cancerous tissue. The process presented does this more accurately and quickly.
Social implications
Particularly the above-mentioned application in oncology will alleviate human suffering through use in hyperthermia treatment planning in cancer treatment. The method presented provides scope for new commercial software development and employment.
Originality/value
Previous finite element shape optimization of coupled electroheat problems by this group used gradient methods, whose difficulties are explained. Others have used analytical and circuit models in place of finite elements. This paper applies the massive parallelization possible with GPUs to the inherently parallel genetic algorithm, and extends it from single-field system problems to coupled problems, thereby realizing practicable solution times for such a computationally complex problem. Further, by using GPU computations rather than CPU, accuracy is enhanced; and by using real-number rather than binary coding for objective functions, further accuracy and speed gains are realized.
Guoli Ji, Yong Zeng, Zijiang Yang, Congting Ye and Jingci Yao
Abstract
Purpose
The time complexity of most multiple sequence alignment algorithms is O(N^2) or O(N^3), where N is the number of sequences. In addition, with the development of biotechnology, the amount of biological sequence data grows significantly, and traditional methods have difficulty handling large-scale sequences. The proposed LemK_MSA method aims to reduce the time complexity, especially for large-scale sequences, while keeping an accuracy level similar to that of traditional methods.
Design/methodology/approach
LemK_MSA converts multiple sequence alignment into a corresponding 10D vector alignment using ten types of copy modes based on Lempel-Ziv. Then, it uses the k-means algorithm and the NJ algorithm to divide the sequences into several groups and calculate a guide tree for each group. A complete guide tree for multiple sequence alignment can then be constructed by merging the guide trees of all groups. Moreover, for large-scale multiple sequences, LemK_MSA uses a GPU-based parallel method for distance matrix calculation.
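The Lempel-Ziv idea behind the vectorization can be illustrated with a standard LZ78-style phrase count and a complexity-based distance between sequences. This is a generic sketch, not the paper's ten copy modes or its 10D vectors; it only shows why LZ parsing yields an alignment-free similarity signal usable in a distance matrix.

```python
# LZ78-style parse: greedily split the sequence into distinct phrases and
# count them; more compressible (more self-similar) sequences score lower.
def lz_complexity(s):
    phrases, current = set(), ""
    for ch in s:
        current += ch
        if current not in phrases:
            phrases.add(current)
            current = ""
    return len(phrases) + (1 if current else 0)

# A normalised LZ-based dissimilarity: concatenating similar sequences adds
# few new phrases, so the distance stays low.
def lz_distance(a, b):
    ca, cb, cab = lz_complexity(a), lz_complexity(b), lz_complexity(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)

d_same = lz_distance("ACGTACGT", "ACGTACGT")
d_diff = lz_distance("ACGTACGT", "TTTTGGGG")
```

Pairwise distances of this kind are independent of one another, which is why the distance matrix stage parallelizes well on a GPU for large sequence sets.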
Findings
Under this approach, the time efficiency of multiple sequence alignment can be improved. High-throughput mouse antibody sequences are used to validate the proposed method. Compared to ClustalW, MAFFT and Mbed, LemK_MSA is more than ten times more efficient while ensuring alignment accuracy at the same time.
Originality/value
This paper proposes a novel method with sequence vectorization for multiple sequence alignment based on Lempel-Ziv. A GPU-based parallel method has been designed for large-scale distance matrix calculation. It provides a new way for multiple sequence alignment research.
Shengquan Wang, Chao Wang, Yong Cai and Guangyao Li
Abstract
Purpose
The purpose of this paper is to improve the computational speed of solving nonlinear dynamics by using parallel methods and a mixed-precision algorithm on graphics processing units (GPUs). The computational efficiency of traditional central processing unit (CPU)-based computer-aided engineering software has had difficulty satisfying the needs of scientific research and practical engineering, especially for nonlinear dynamic problems. Besides, when calculations are performed on GPUs, double-precision operations are slower than single-precision operations. Therefore, this paper implements mixed precision for nonlinear dynamic problem simulation using the Belytschko-Tsay (BT) shell element on a GPU.
Design/methodology/approach
To minimize data transfer between heterogeneous architectures, the parallel computation of the fully explicit finite element (FE) calculation is realized using a vectorized thread-level parallelism algorithm. An asynchronous data transmission strategy and a novel dependency relationship link-based method, for efficiently solving parallel explicit shell element equations, are used to improve the GPU utilization ratio. Finally, this paper implements mixed precision for nonlinear dynamic problems simulation using the BT shell element on a GPU and compare it to the CPU-based serially executed program and a GPU-based double-precision parallel computing program.
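The single- versus double-precision trade-off at the heart of mixed-precision computing can be demonstrated in miniature. Python floats are doubles, so single precision is emulated below by rounding every intermediate result through a 32-bit float; the example is illustrative and unrelated to the paper's shell-element kernels.

```python
import struct

# Round a Python float (double) to the nearest IEEE 754 single-precision
# value by packing and unpacking it as a 32-bit float.
def f32(x):
    return struct.unpack("f", struct.pack("f", x))[0]

# Sum a list with every intermediate result rounded to 32 bits.
def sum_single(values):
    total = 0.0
    for v in values:
        total = f32(total + f32(v))
    return total

values = [0.1] * 10000               # exact sum is 1000.0
err_single = abs(sum_single(values) - 1000.0)
err_double = abs(sum(values) - 1000.0)
```

Single precision accumulates noticeably more rounding error than double, which is why mixed-precision schemes keep error-sensitive accumulations in double precision while running the bulk of the arithmetic in faster single precision.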
Findings
For a car body model containing approximately 5.3 million degrees of freedom, the computational speed is improved 25 times over CPU sequential computation and by approximately 10% over the double-precision parallel computing method. The accuracy error of the mixed-precision computation is small and satisfies the requirements of practical engineering problems.
Originality/value
This paper realizes a novel FE parallel computing procedure for nonlinear dynamic problems using a mixed-precision algorithm on a CPU-GPU platform. Compared with the CPU serial program, the program implemented in this paper obtains a 25 times acceleration ratio when calculating a model of 883,168 elements, which greatly improves the calculation speed for solving nonlinear dynamic problems.
Hongbin Liu, Xinrong Su and Xin Yuan
Abstract
Purpose
Adopting large eddy simulation (LES) to simulate the complex flow in turbomachinery is appropriate to overcome the limitation of current Reynolds-Averaged Navier–Stokes modelling and it provides a deeper understanding of the complicated transitional and turbulent flow mechanism; however, the large computational cost limits its application in high Reynolds number flow. This study aims to develop a three-dimensional GPU-enabled parallel-unstructured solver to speed up the high-fidelity LES simulation.
Design/methodology/approach
Compared to central processing units (CPUs), graphics processing units (GPUs) can provide higher computational speed. A set of low-dissipation schemes designed for unstructured meshes is implemented with the compute unified device architecture (CUDA) programming model. Several key parameters affecting the performance of the GPU code are discussed, and further speed-up is obtained by analysing the underlying finite volume-based numerical scheme.
Findings
The results show that an acceleration ratio of approximately 84 (on a single GPU) for the double-precision algorithm can be achieved with this unstructured GPU code. The transitional flow inside a compressor is simulated and the computational efficiency is improved greatly. The transition process is discussed and the role that the Kelvin-Helmholtz (K-H) instability plays in the transition mechanism is verified.
Practical implications
The speed-up gained from the GPU-enabled solver reaches 84 compared to the original code running on a CPU, and this vast speed-up enables fast-turnaround, high-fidelity LES simulation.
Originality/value
The GPU-enabled flow solver is implemented and optimized according to the feature of finite volume scheme. The solving time is reduced remarkably and the detail structures including vortices are captured.
Garland Durham and John Geweke
Abstract
Massively parallel desktop computing capabilities now well within the reach of individual academics modify the environment for posterior simulation in fundamental and potentially quite advantageous ways. But to fully exploit these benefits algorithms that conform to parallel computing environments are needed. This paper presents a sequential posterior simulator designed to operate efficiently in this context. The simulator makes fewer analytical and programming demands on investigators, and is faster, more reliable, and more complete than conventional posterior simulators. The paper extends existing sequential Monte Carlo methods and theory to provide a thorough and practical foundation for sequential posterior simulation that is well suited to massively parallel computing environments. It provides detailed recommendations on implementation, yielding an algorithm that requires only code for simulation from the prior and evaluation of prior and data densities and works well in a variety of applications representative of serious empirical work in economics and finance. The algorithm facilitates Bayesian model comparison by producing marginal likelihood approximations of unprecedented accuracy as an incidental by-product, is robust to pathological posterior distributions, and provides estimates of numerical standard error and relative numerical efficiency intrinsically. The paper concludes with an application that illustrates the potential of these simulators for applied Bayesian inference.
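The core of a sequential posterior simulator, reweighting draws from the prior one observation at a time and resampling when the particle weights degenerate, can be sketched on a toy Beta-Bernoulli model. This is a generic sequential Monte Carlo sketch, not the authors' full method, which adds further refinements and intrinsic numerical-error estimates; it does, however, show the marginal likelihood emerging as an incidental by-product.

```python
import math
import random

# Toy sequential posterior simulator: particles from a uniform prior on a
# Bernoulli success probability are reweighted one observation at a time,
# with multinomial resampling when the effective sample size (ESS) drops.
# The running sum of log average incremental weights estimates the log
# marginal likelihood as a by-product.
def smc_bernoulli(data, n_particles=20000, seed=7):
    rng = random.Random(seed)
    theta = [rng.random() for _ in range(n_particles)]   # prior draws
    weights = [1.0 / n_particles] * n_particles
    log_ml = 0.0
    for y in data:
        lik = [t if y == 1 else 1.0 - t for t in theta]
        step = sum(w * l for w, l in zip(weights, lik))  # approx. p(y | past)
        log_ml += math.log(step)
        weights = [w * l / step for w, l in zip(weights, lik)]
        ess = 1.0 / sum(w * w for w in weights)
        if ess < n_particles / 2:                        # resample on degeneracy
            theta = rng.choices(theta, weights=weights, k=n_particles)
            weights = [1.0 / n_particles] * n_particles
    post_mean = sum(w * t for w, t in zip(weights, theta))
    return post_mean, log_ml

# 6 successes in 8 trials: the analytic posterior is Beta(7, 3), mean 0.7.
mean, log_ml = smc_bernoulli([1, 1, 0, 1, 1, 0, 1, 1])
```

Note the parallel-friendliness the paper exploits: the likelihood evaluations and weight updates are embarrassingly parallel across particles, and only the resampling step requires coordination.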
Vaclav Snasel, Tran Khanh Dang, Josef Kueng and Lingping Kong
Abstract
Purpose
This paper aims to review in-memory computing (IMC) for machine learning (ML) applications from the history, architectures and options aspects. In this review, the authors investigate different architectural aspects and provide comparative evaluations.
Design/methodology/approach
The authors collected over 40 recent IMC papers related to hardware design and optimization techniques and classified them into three optimization option categories: optimization through graphics processing units (GPUs), optimization through reduced precision and optimization through hardware accelerators. The authors then summarize these techniques in terms of the data sets to which they were applied, how each design works and what each design contributes.
Findings
ML algorithms are potent tools accommodated on IMC architectures. Although general-purpose hardware (central processing units and GPUs) can supply explicit solutions, its energy efficiency is limited by the excessive flexibility it must support. On the other hand, hardware accelerators (field-programmable gate arrays and application-specific integrated circuits) win on energy efficiency, but an individual accelerator often adapts exclusively to a single ML approach (family). From a long-term hardware evolution perspective, hardware/software collaborative heterogeneous design on hybrid platforms is an option for researchers.
Originality/value
IMC optimization enables high-speed processing, increases performance and analyzes massive volumes of data in real time. This work reviews IMC and its evolution, and then categorizes three optimization paths for the IMC architecture to improve performance metrics.