Search results
1 – 10 of 21
Hongbin Liu, Hu Ren, Hanfeng Gu, Fei Gao and Guangwen Yang
Abstract
Purpose
The purpose of this paper is to provide an automatic parallelization toolkit for unstructured mesh-based computation. Among all mesh types, unstructured meshes are dominant in engineering simulation scenarios and play an essential role in scientific computations because of their geometrical flexibility. However, high-fidelity applications based on unstructured grids remain time-consuming both to program and to run.
Design/methodology/approach
This study develops an efficient UNstructured Acceleration Toolkit (UNAT), which provides friendly high-level programming interfaces and carefully tuned lower-level implementations on the target hardware to achieve nearly hand-optimized performance. At present, two efficient strategies, a multi-level blocks method and a row-subsections method, are designed and implemented on the Sunway architecture. Random memory access and write–write conflict issues of unstructured meshes are handled by partitioning, coloring and other hardware-specific techniques. Moreover, a data-reuse mechanism is developed to increase the computational intensity and alleviate the memory bandwidth bottleneck.
Findings
The authors select sparse matrix-vector multiplication as a performance benchmark of UNAT across different data layouts and different matrix formats. Experimental results show speed-ups of up to 26× compared to a single management processing element, and the utilization ratio tests indicate that UNAT can achieve nearly hand-optimized performance. Finally, the authors adopt UNAT to accelerate a well-tuned unstructured solver and obtain average speed-ups of 19× for the main kernels and 10× for the overall solver.
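For readers unfamiliar with the benchmark, sparse matrix-vector multiplication can be sketched as follows. This is a minimal serial illustration that assumes the compressed sparse row (CSR) format; the abstract does not say which formats were tested, and the function name `spmv_csr` is hypothetical and is not part of UNAT.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form (assumed format)."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        # Non-zeros of this row live in values[row_ptr[row]:row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # Indirect read of x through col_idx: the kind of irregular
            # memory access that unstructured meshes produce.
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Small 3x3 example: [[4, 0, 1], [0, 3, 0], [2, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
x = [1.0, 2.0, 3.0]
y = spmv_csr(row_ptr, col_idx, values, x)
print(y)  # [7.0, 6.0, 17.0]
```

Each row writes only its own entry of `y`, so a row-wise loop like this has no write–write conflicts; those arise in other loop orderings, for example when contributions are scattered edge by edge.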
Originality/value
The authors design an unstructured mesh toolkit, UNAT, to link the hardware and the numerical algorithm so that engineers can focus on the algorithms and solvers rather than the parallel implementation. For the many-core processor SW26010 of the fastest supercomputer in China, UNAT yields up to 26× speed-ups and achieves nearly hand-optimized performance.
Yuji Sato and Mikiko Sato
Abstract
Purpose
The purpose of this paper is to propose a fault-tolerant technology for increasing the durability of application programs when evolutionary computation is performed by fast parallel processing on many-core processors such as graphics processing units (GPUs) and multi-core processors (MCPs).
Design/methodology/approach
For distributed genetic algorithm (GA) models, the paper proposes a method where an island's ID number is added to the header of data transferred by this island for use in fault detection.
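The header idea can be illustrated with a short sketch. The abstract does not describe how the ID is checked, so this assumes that each receiving island knows which island it expects data from; `MigrationPacket`, `send` and `receive` are hypothetical names, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MigrationPacket:
    island_id: int                 # header: ID of the sending island
    individuals: List[List[int]]   # payload: migrating individuals


def send(island_id: int, individuals: List[List[int]]) -> MigrationPacket:
    """Wrap migrants with the sender's island ID in the header."""
    return MigrationPacket(island_id=island_id, individuals=individuals)


def receive(packet: Optional[MigrationPacket],
            expected_sender: int) -> Optional[List[List[int]]]:
    """Return the migrants, or None when the header reveals a fault.

    Assumption: a missing packet or an unexpected island ID is treated
    as a fault and the payload is discarded.
    """
    if packet is None or packet.island_id != expected_sender:
        return None
    return packet.individuals


good = send(2, [[1, 0, 1]])
corrupted = MigrationPacket(island_id=7, individuals=[[0, 0, 0]])
print(receive(good, expected_sender=2))       # [[1, 0, 1]]
print(receive(corrupted, expected_sender=2))  # None
```

A receiver that gets `None` can simply continue evolving its own population for that migration interval, which is one way a distributed GA might tolerate a faulty neighbour.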
Findings
The paper shows that the processing time added by the proposed method is practically negligible in applications, that an optimal solution can be obtained even with a single stuck-at fault or a transient fault, and that increasing the number of parallel threads makes the system less susceptible to faults.
Originality/value
The study described in this paper is a new approach to increasing the sustainability of application programs that use distributed GAs on GPUs and MCPs.
Satyadhyan Chickerur and Aswatha Kumar M
Abstract
In this decade, educators in engineering higher education are at a crossroads. On one side are people who argue that the traditional courses and teaching methods are still appropriate, while others believe that the vast technological advancement in information and computing technologies could be harnessed for effective teaching and learning. This chapter presents an approach to developing industry-relevant curricula in engineering higher education that involves project-based learning. It is also shown that the effectiveness of the course can be improved by designing the curriculum using modified Bloom’s taxonomy and using various online tools and technologies. The various tools introduced and the rationale for using them are also discussed. The impact of each tool on student learning is also summarized.
Hongbin Liu, Xinrong Su and Xin Yuan
Abstract
Purpose
Adopting large eddy simulation (LES) to simulate the complex flow in turbomachinery is appropriate to overcome the limitation of current Reynolds-Averaged Navier–Stokes modelling and it provides a deeper understanding of the complicated transitional and turbulent flow mechanism; however, the large computational cost limits its application in high Reynolds number flow. This study aims to develop a three-dimensional GPU-enabled parallel-unstructured solver to speed up the high-fidelity LES simulation.
Design/methodology/approach
Compared to central processing units (CPUs), graphics processing units (GPUs) can provide higher computational speed. This work aims to develop a three-dimensional GPU-enabled parallel-unstructured solver to speed up the high-fidelity LES simulation. A set of low-dissipation schemes designed for unstructured meshes is implemented with the Compute Unified Device Architecture (CUDA) programming model. Several key parameters affecting the performance of the GPU code are discussed, and further speed-up can be obtained by analysing the underlying finite volume-based numerical scheme.
Findings
The results show that an acceleration ratio of approximately 84 (on a single GPU) for the double precision algorithm can be achieved with this unstructured GPU code. The transitional flow inside a compressor is simulated and the computational efficiency has been improved greatly. The transition process is discussed and the role that Kelvin-Helmholtz (K-H) instability plays in the transition mechanism is verified.
Practical implications
The speed-up gained from the GPU-enabled solver reaches 84 compared to the original code running on a CPU, and this large speed-up enables fast-turnaround, high-fidelity LES simulation.
Originality/value
The GPU-enabled flow solver is implemented and optimized according to the features of the finite volume scheme. The solving time is reduced remarkably and detailed structures, including vortices, are captured.
Rainald Löhner and Joseph D. Baum
Abstract
Purpose
Prompted by the empirical evidence that achievable flow solver speeds for large problems are limited by what appears to be a time of the order of O(0.1) sec/timestep regardless of the number of cores used, the purpose of this paper is to identify why this phenomenon occurs.
Design/methodology/approach
A series of timing studies, as well as an in-depth analysis of memory and inter-processor transfer requirements, was carried out for a typical field solver. The results were analyzed and compared to the expected performance.
Findings
The analysis shows that at present flow speeds per core are already limited by the achievable transfer rate to RAM. For smaller domains/larger number of processors, the limiting speed of CFD solvers is given by the MPI communication network.
Research limitations/implications
This implies that at present, there is a “limiting useful size” for domains, and that there is a lower limit for the time it takes to update a flowfield.
Practical implications
For practical calculations this implies that the time required for running large-scale problems will not decrease markedly once these applications migrate to machines with hundreds of thousands of cores.
Originality/value
This is the first time such a finding has been reported in this context.
Tanvir Habib Sardar and Ahmed Rimaz Faizabadi
Abstract
Purpose
In recent years, there has been a gradual shift from sequential computing to parallel computing. Nowadays, nearly all computers have multicore processors. To exploit the available cores, parallel computing becomes necessary. It increases speed by processing huge amounts of data in real time. The purpose of this paper is to parallelize a set of well-known programs using different techniques to determine the best way to parallelize each program tested.
Design/methodology/approach
A set of numeric algorithms is parallelized by hand using OpenMP and automatically using the Pluto tool.
Findings
The work finds that few of the algorithms are well suited to auto parallelization using the Pluto tool, but many of the algorithms execute more efficiently with OpenMP hand parallelization.
Originality/value
The work provides an original study of parallelization using the OpenMP programming paradigm and the Pluto tool.
Mehdi Darbandi, Amir Reza Ramtin and Omid Khold Sharafi
Abstract
Purpose
A set of routers that are connected over communication channels can form a network-on-chip (NoC). High performance, scalability, modularity and the ability to parallelize the communication structure are some of its advantages. Because of the growing number of cores in NoCs, their arrangement has become more important. Mapping assigns different functional units to different nodes on the NoC, and the way it is done has a significant effect on implementation and network power utilization. The NoC mapping problem is NP-hard. Therefore, meta-heuristic algorithms are a natural choice for achieving optimal or near-optimal answers. The purpose of this paper is to design a novel procedure for mapping processing cores that reduces communication delay and cost. A multi-objective particle swarm optimization algorithm based on crowding distance (MOPSO-CD) has been used for this purpose.
Design/methodology/approach
In the proposed approach, which uses the two-dimensional mesh topology as its base structure, the mapping operation is divided into two stages: allocating the tasks to suitable intellectual property (IP) cores; and mapping each of these cores to a specific tile on the NoC platform.
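To make the second stage concrete, the sketch below scores a core-to-tile mapping on a 2D mesh. The abstract does not give the objective functions, so this assumes a commonly used communication cost: traffic volume multiplied by the Manhattan hop distance between tiles. All names and the traffic figures are hypothetical.

```python
def tile_coords(tile, mesh_width):
    """Convert a row-major tile index into (row, col) on the mesh."""
    return divmod(tile, mesh_width)


def hop_distance(tile_a, tile_b, mesh_width):
    """Manhattan distance between two tiles (assumes XY-style routing)."""
    row_a, col_a = tile_coords(tile_a, mesh_width)
    row_b, col_b = tile_coords(tile_b, mesh_width)
    return abs(row_a - row_b) + abs(col_a - col_b)


def communication_cost(mapping, traffic, mesh_width):
    """Sum of volume * hops over all communicating core pairs.

    mapping: core name -> tile index
    traffic: (source core, destination core) -> traffic volume
    """
    return sum(
        volume * hop_distance(mapping[src], mapping[dst], mesh_width)
        for (src, dst), volume in traffic.items()
    )


# Illustrative 2x2 mesh with four cores; the volumes are made up.
traffic = {("A", "B"): 10, ("B", "C"): 5, ("A", "D"): 1}
near = {"A": 0, "B": 1, "C": 3, "D": 2}   # heavy pair A-B on adjacent tiles
far = {"A": 0, "B": 3, "C": 2, "D": 1}    # heavy pair A-B on opposite corners
print(communication_cost(near, traffic, mesh_width=2))  # 16
print(communication_cost(far, traffic, mesh_width=2))   # 26
```

A meta-heuristic such as MOPSO-CD would search over mappings like `near` and `far`, trading a cost of this kind off against delay; the search itself is not shown here.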
Findings
The proposed method has dramatically improved the related problems and limitations of meta-heuristic algorithms. This algorithm performs better than particle swarm optimization (PSO) and the genetic algorithm in convergence to the Pareto front, in producing a well-distributed set of solutions and in computational time. The simulation results also show that the delay of the proposed method is 1.1 per cent better than that of the genetic algorithm and 0.5 per cent better than that of the PSO algorithm. In communication cost, the proposed method is 2.7 per cent better than the genetic algorithm and 0.16 per cent better than the PSO algorithm.
Originality/value
As yet, the MOPSO-CD algorithm has not been used for solving the task mapping issue in the NoC.
Sura Nawfal and Fakhrulddin Ali
Abstract
Purpose
The purpose of this paper is to achieve the acceleration of 3D object transformation using parallel techniques such as multi-core central processing unit (MC CPU) or graphic processing unit (GPU) or even both. Generating 3D animation scenes in computer graphics requires applying a 3D transformation on the vertices of the objects. These transformations consume most of the execution time. Hence, for high-speed graphic systems, acceleration of the vertex transform is much sought after because it requires many matrix operations to be performed in real time, so the execution time is essential for such processing.
Design/methodology/approach
In this paper, the acceleration of 3D object transformation is achieved using parallel techniques such as MC CPU or GPU or even both. Multiple geometric transformations are concatenated together at a time in any order in an interactive manner.
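Concatenating transformations means multiplying their matrices once and then applying the single combined matrix to every vertex. The sketch below shows this with 4×4 homogeneous matrices in plain, serial Python; it is an illustration under that assumption, not the authors' LabVIEW, Visual Studio or GPU code, and all function names are hypothetical.

```python
def identity():
    return [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]


def translation(tx, ty, tz):
    m = identity()
    m[0][3], m[1][3], m[2][3] = tx, ty, tz
    return m


def scaling(sx, sy, sz):
    m = identity()
    m[0][0], m[1][1], m[2][2] = sx, sy, sz
    return m


def matmul(a, b):
    """4x4 matrix product a @ b."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def concatenate(transforms):
    """Combine transforms so that the first one in the list is applied first."""
    combined = identity()
    for t in transforms:
        combined = matmul(t, combined)
    return combined


def apply(m, vertex):
    """Apply a 4x4 affine matrix to an (x, y, z) vertex."""
    x, y, z = vertex
    column = (x, y, z, 1.0)
    return tuple(sum(m[i][k] * column[k] for k in range(4)) for i in range(3))


scale_then_move = concatenate([scaling(2, 2, 2), translation(1, 0, 0)])
move_then_scale = concatenate([translation(1, 0, 0), scaling(2, 2, 2)])
print(apply(scale_then_move, (1.0, 1.0, 1.0)))  # (3.0, 2.0, 2.0)
print(apply(move_then_scale, (1.0, 1.0, 1.0)))  # (4.0, 2.0, 2.0)
```

The two results differ because the order of concatenation matters. Once the combined matrix is built, each vertex can be transformed independently of the others, which is what makes the per-vertex step a natural fit for multi-core CPUs and GPUs.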
Findings
The performance results are presented for a number of 3D objects with parallel implementations of the affine transform on the NVIDIA GPU series. The maximum execution time was about 0.508 s to transform 100 million vertices using LabVIEW and 0.096 s using Visual Studio. Other results also showed significant speed-up compared to CPU, MC CPU and previously published computations for the same object complexity.
Originality/value
The high-speed execution of 3D models is essential in many applications such as medical imaging, 3D games and robotics.
Vaclav Snasel, Tran Khanh Dang, Josef Kueng and Lingping Kong
Abstract
Purpose
This paper aims to review in-memory computing (IMC) for machine learning (ML) applications from the aspects of history, architectures and options. In this review, the authors investigate different architectural aspects and collect and provide their comparative evaluations.
Design/methodology/approach
The authors collect over 40 IMC papers from recent years related to hardware design and optimization techniques, then classify them into three optimization option categories: optimization through graphic processing unit (GPU), optimization through reduced precision and optimization through hardware accelerator. Then, the authors summarize those techniques in aspects such as what kind of data set each was applied to, how it is designed and what the design contributes.
Findings
ML algorithms are potent tools accommodated on IMC architecture. Although general-purpose hardware (central processing units and GPUs) can supply explicit solutions, their energy efficiencies have limitations because of their excessive flexibility support. On the other hand, hardware accelerators (field programmable gate arrays and application-specific integrated circuits) win on the energy efficiency aspect, but an individual accelerator often adapts exclusively to a single ML approach (family). From the perspective of long-term hardware evolution, hardware/software collaborative heterogeneous design on hybrid platforms is an option for researchers.
Originality/value
IMC’s optimization enables high-speed processing, increases performance and analyzes massive volumes of data in real-time. This work reviews IMC and its evolution. Then, the authors categorize three optimization paths for the IMC architecture to improve performance metrics.
Jose M. Chaves-Gonzalez and Miguel A. Vega-Rodríguez
Abstract
Purpose
The purpose of this paper is to study the use of a heterogeneous and evolutionary team approach based on different sources of knowledge to address a real-world problem within the telecommunication domain: the frequency assignment problem (FAP). Evolutionary algorithms have proved to be very suitable strategies for solving NP-hard optimization problems. However, these algorithms can run into difficulties when they fall into local minima, and generating high-quality solutions when tackling real-world instances of the problem is computationally very expensive. In this scenario, the use of a heterogeneous parallel team represents a very interesting approach.
Design/methodology/approach
The results have been validated by using two real-world telecommunication instances which contain real information about two GSM networks. Contrary to most related publications, this paper is focussed on aspects which are relevant for real communication networks. Moreover, due to the stochastic nature of metaheuristics, the results are validated through a formal statistical analysis. This analysis is divided into two stages: first, a complete statistical study, and after that, a full comparative study against results previously published.
Findings
The comparative study shows that a heterogeneous evolutionary proposal obtains better results than proposals which are based on a unique source of knowledge. In fact, the final results provided in the work surpass the results of other relevant studies previously published in the literature.
Originality/value
The paper provides a complete study of the contribution provided by the different metaheuristics included in the team and the impact of using different sources of evolutionary knowledge when the system is applied to solve a real-world FAP. The conclusions obtained in this study represent an original contribution never reached before for FAP.