Search results

1 – 10 of 21
Article
Publication date: 30 April 2020

Hongbin Liu, Hu Ren, Hanfeng Gu, Fei Gao and Guangwen Yang


Abstract

Purpose

The purpose of this paper is to provide an automatic parallelization toolkit for unstructured mesh-based computation. Among all mesh types, unstructured meshes dominate engineering simulation scenarios and play an essential role in scientific computation because of their geometrical flexibility. However, high-fidelity applications based on unstructured grids remain time-consuming, both to program and to run.

Design/methodology/approach

This study develops an efficient UNstructured Acceleration Toolkit (UNAT), which provides friendly high-level programming interfaces and elaborates the lower-level implementation on the target hardware to achieve nearly hand-optimized performance. At present, two efficient strategies, a multi-level blocks method and a row-subsections method, are designed and implemented on the Sunway architecture. The random memory access and write–write conflict issues of unstructured meshes are handled by partitioning, coloring and other hardware-specific techniques. Moreover, a data-reuse mechanism is developed to increase the computational intensity and alleviate the memory bandwidth bottleneck.
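The abstract gives no code, but the coloring idea it names can be sketched briefly. In the C++ fragment below (names such as `Face` and `colorFaces` are illustrative, not from UNAT), mesh faces are greedily colored so that no two faces sharing a cell receive the same color; all faces of one color can then be updated in parallel without write–write conflicts:

```cpp
// Illustrative greedy face coloring: no two faces sharing a cell get the
// same color, so all faces of one color can be updated concurrently.
#include <algorithm>
#include <cstddef>
#include <vector>

struct Face { int owner, neighbour; };  // face-based finite-volume layout

std::vector<int> colorFaces(const std::vector<Face>& faces, int nCells) {
    std::vector<int> color(faces.size(), -1);
    std::vector<std::vector<int>> used(nCells);  // colors touching each cell
    for (std::size_t f = 0; f < faces.size(); ++f) {
        const auto& u1 = used[faces[f].owner];
        const auto& u2 = used[faces[f].neighbour];
        int c = 0;  // smallest color free at both adjacent cells
        while (std::find(u1.begin(), u1.end(), c) != u1.end() ||
               std::find(u2.begin(), u2.end(), c) != u2.end())
            ++c;
        color[f] = c;
        used[faces[f].owner].push_back(c);
        used[faces[f].neighbour].push_back(c);
    }
    return color;
}

// Within one color no two faces touch the same cell, so the scatter
// update below is free of write-write conflicts.
void applyFluxes(const std::vector<Face>& faces, const std::vector<int>& color,
                 int nColors, const std::vector<double>& faceFlux,
                 std::vector<double>& cellValue) {
    for (int c = 0; c < nColors; ++c) {
        #pragma omp parallel for
        for (long f = 0; f < (long)faces.size(); ++f) {
            if (color[f] != c) continue;  // real codes pre-sort faces by color
            cellValue[faces[f].owner]     += faceFlux[f];
            cellValue[faces[f].neighbour] -= faceFlux[f];
        }
    }
}
```

Processing one color at a time trades a little parallelism for race-free scatter updates, which is the essence of the coloring technique mentioned above.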

Findings

The authors select sparse matrix–vector multiplication (SpMV) as a performance benchmark of UNAT across different data layouts and matrix formats. Experimental results show speed-ups of up to 26× compared with a single management processing element, and utilization-ratio tests indicate that nearly hand-optimized performance is achievable. Finally, the authors adopt UNAT to accelerate a well-tuned unstructured solver and obtain average speed-ups of 19× for the main kernels and 10× for the overall solver.
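For reference, the benchmark kernel is the standard sparse matrix–vector multiply. A minimal sketch in compressed sparse row (CSR) format, one common matrix format such tests cover (a generic implementation, not UNAT's):

```cpp
// Generic CSR sparse matrix-vector multiply, y = A * x.
#include <cstddef>
#include <vector>

struct CsrMatrix {
    std::vector<std::size_t> rowPtr;  // size nRows + 1
    std::vector<int>         colIdx;  // column of each nonzero
    std::vector<double>      val;     // nonzero values
};

void spmv(const CsrMatrix& A, const std::vector<double>& x,
          std::vector<double>& y) {
    const long nRows = (long)A.rowPtr.size() - 1;
    #pragma omp parallel for          // rows are independent
    for (long i = 0; i < nRows; ++i) {
        double sum = 0.0;
        for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k)
            sum += A.val[k] * x[A.colIdx[k]];
        y[i] = sum;
    }
}
```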

Originality/value

The authors design an unstructured mesh toolkit, UNAT, to bridge the hardware and the numerical algorithms, so that engineers can focus on algorithms and solvers rather than on the parallel implementation. On the many-core SW26010 processor of the fastest supercomputer in China, UNAT yields speed-ups of up to 26× and achieves nearly hand-optimized performance.

Details

Engineering Computations, vol. 37 no. 9
Type: Research Article
ISSN: 0264-4401


Article
Publication date: 4 March 2014

Yuji Sato and Mikiko Sato


Abstract

Purpose

The purpose of this paper is to propose a fault-tolerant technology for increasing the durability of application programs when evolutionary computation is performed by fast parallel processing on many-core processors such as graphics processing units (GPUs) and multi-core processors (MCPs).

Design/methodology/approach

For distributed genetic algorithm (GA) models, the paper proposes a method in which an island's ID number is added to the header of the data transferred by that island, for use in fault detection.
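A minimal sketch of how such an ID-in-header check might look; the message layout and checking policy here are assumptions for illustration, not the paper's exact protocol:

```cpp
// Hypothetical migration message: the sending island's ID rides in the
// header so the receiver can detect a faulty or misbehaving sender.
#include <cstdint>
#include <vector>

struct MigrationMessage {
    std::uint32_t islandId;       // header: ID of the sending island
    std::vector<double> genome;   // payload: migrated individual
};

// Accept only migrants whose header matches the expected sender; a wrong
// ID signals a fault and the migrant is discarded.
void receiveMigrants(const std::vector<MigrationMessage>& inbox,
                     std::uint32_t expectedIsland,
                     std::vector<std::vector<double>>& population) {
    for (const auto& msg : inbox) {
        if (msg.islandId != expectedIsland) continue;  // fault detected: drop
        population.push_back(msg.genome);
    }
}
```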

Findings

The paper shows that the processing overhead of the proposed method is practically negligible in applications. It also shows that an optimal solution can be obtained even in the presence of a single stuck-at fault or a transient fault, and that increasing the number of parallel threads makes the system less susceptible to faults.

Originality/value

The study described in this paper is a new approach to increasing the sustainability of application programs that use distributed GAs on GPUs and MCPs.

Details

International Journal of Intelligent Computing and Cybernetics, vol. 7 no. 1
Type: Research Article
ISSN: 1756-378X


Book part
Publication date: 1 November 2012

Satyadhyan Chickerur and Aswatha Kumar M


Abstract

In this decade, educators in engineering higher education are at a crossroads. On one side are those who argue that traditional courses and teaching methods are still appropriate, while others believe that the vast advances in information and computing technologies could be harnessed for effective teaching and learning. This chapter presents an approach to developing industry-relevant curricula in engineering higher education that involves project-based learning. It also shows that the effectiveness of a course can be improved by designing the curriculum using modified Bloom’s taxonomy and by using various online tools and technologies. The tools introduced and the rationale for using them are discussed, and the impact of each tool on student learning is summarized.

Details

Increasing Student Engagement and Retention Using Social Technologies
Type: Book
ISBN: 978-1-78190-239-4


Article
Publication date: 24 August 2018

Hongbin Liu, Xinrong Su and Xin Yuan


Abstract

Purpose

Adopting large eddy simulation (LES) to simulate the complex flow in turbomachinery is an appropriate way to overcome the limitations of current Reynolds-averaged Navier–Stokes modelling, and it provides a deeper understanding of the complicated transitional and turbulent flow mechanisms; however, the large computational cost limits its application to high Reynolds number flows. This study aims to develop a three-dimensional GPU-enabled parallel unstructured solver to speed up high-fidelity LES simulation.

Design/methodology/approach

Compared with central processing units (CPUs), graphics processing units (GPUs) can provide higher computational speed, so this work develops a three-dimensional GPU-enabled parallel unstructured solver to speed up high-fidelity LES simulation. A set of low-dissipation schemes designed for unstructured meshes is implemented with the compute unified device architecture (CUDA) programming model. Several key parameters affecting the performance of the GPU code are discussed, and further speed-up is obtained by analysing the underlying finite volume-based numerical scheme.

Findings

The results show that an acceleration ratio of approximately 84× (on a single GPU) can be achieved with this unstructured GPU code for the double-precision algorithm. The transitional flow inside a compressor is simulated, and the computational efficiency is improved greatly. The transition process is discussed, and the role that Kelvin–Helmholtz (K–H) instability plays in the transition mechanism is verified.

Practical implications

The speed-up gained from the GPU-enabled solver reaches 84× compared with the original code running on a CPU, and this speed-up enables fast-turnaround, high-fidelity LES simulation.

Originality/value

The GPU-enabled flow solver is implemented and optimized according to the features of the finite volume scheme. The solving time is reduced remarkably, and detailed flow structures, including vortices, are captured.

Details

Engineering Computations, vol. 35 no. 5
Type: Research Article
ISSN: 0264-4401


Article
Publication date: 26 August 2014

Rainald Löhner and Joseph D. Baum



Abstract

Purpose

Prompted by the empirical evidence that achievable flow solver speeds for large problems are limited by a time of the order of 0.1 s per timestep regardless of the number of cores used, the purpose of this paper is to identify why this phenomenon occurs.

Design/methodology/approach

A series of timing studies, as well as an in-depth analysis of memory and inter-processor transfer requirements, was carried out for a typical field solver. The results were analyzed and compared with the expected performance.

Findings

The analysis shows that, at present, per-core flow solver speeds are already limited by the achievable transfer rate to RAM. For smaller domains or larger numbers of processors, the limiting speed of CFD solvers is set by the MPI communication network.
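An illustrative back-of-the-envelope bound makes the RAM limit concrete: if each core holds N grid points, the solver must move B bytes per point per timestep, and a core sustains R bytes/s from RAM, then (with the numbers below assumed purely for illustration, not taken from the paper)

```latex
t_{\mathrm{step}} \;\ge\; \frac{N\,B}{R},
\qquad\text{e.g.}\qquad
t_{\mathrm{step}} \;\ge\; \frac{10^{6}\times 10^{3}\ \text{bytes}}{10^{10}\ \text{bytes/s}} = 0.1\ \text{s},
```

which is consistent with the plateau of the order of 0.1 s per timestep noted above.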

Research limitations/implications

This implies that, at present, there is a “limiting useful size” for domains and a lower limit on the time it takes to update a flowfield.

Practical implications

For practical calculations, this implies that the time required to run large-scale problems will not decrease markedly once these applications migrate to machines with hundreds of thousands of cores.

Originality/value

This is the first time such a finding has been reported in this context.

Details

International Journal of Numerical Methods for Heat & Fluid Flow, vol. 24 no. 7
Type: Research Article
ISSN: 0961-5539


Article
Publication date: 7 February 2019

Tanvir Habib Sardar and Ahmed Rimaz Faizabadi



Abstract

Purpose

In recent years, there has been a gradual shift from sequential computing to parallel computing. Nowadays, nearly all computers have multicore processors, and parallel computing is necessary to exploit the available cores; it increases speed by processing huge amounts of data in real time. The purpose of this paper is to parallelize a set of well-known programs using different techniques to determine the best way to parallelize each program examined.

Design/methodology/approach

A set of numerical algorithms is parallelized in two ways: by hand, using OpenMP, and automatically, using the Pluto tool.
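As an illustration of the kind of hand parallelization compared here, a numeric kernel annotated with an OpenMP work-sharing directive (a generic example, not one of the paper's benchmark programs; Pluto would instead restructure such an affine loop nest automatically at source level):

```cpp
// Hand parallelization with OpenMP: a dense matrix-vector product whose
// outer loop is shared across threads.
#include <cstddef>
#include <vector>

void matvec(const std::vector<std::vector<double>>& A,
            const std::vector<double>& x, std::vector<double>& y) {
    #pragma omp parallel for      // one row per iteration, rows independent
    for (long i = 0; i < (long)A.size(); ++i) {
        double sum = 0.0;
        for (std::size_t j = 0; j < x.size(); ++j)
            sum += A[i][j] * x[j];
        y[i] = sum;
    }
}
```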

Findings

The work finds that a few of the algorithms are well suited to auto parallelization with the Pluto tool, but many of them execute more efficiently with OpenMP hand parallelization.

Originality/value

The paper provides an original study of parallelization using the OpenMP programming paradigm and the Pluto tool.

Details

Data Technologies and Applications, vol. 53 no. 1
Type: Research Article
ISSN: 2514-9288


Article
Publication date: 30 April 2020

Mehdi Darbandi, Amir Reza Ramtin and Omid Khold Sharafi


Abstract

Purpose

A set of routers connected over communication channels forms a network-on-chip (NoC). High performance, scalability, modularity and the ability to parallelize communication are some of its advantages. As the number of cores in an NoC grows, their arrangement becomes more important. Mapping assigns the different functional units to nodes on the NoC, and the way it is done has a significant effect on implementation and network power utilization. The NoC mapping problem is NP-hard; therefore, meta-heuristic algorithms are well suited to finding optimal or near-optimal answers. The purpose of this paper is to design a novel procedure for mapping process cores that reduces communication delay and cost. A multi-objective particle swarm optimization algorithm based on crowding distance (MOPSO-CD) is used for this purpose.
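For context, the crowding-distance measure that MOPSO-CD relies on is the one defined in the NSGA-II literature: for each objective, solutions are sorted, each interior solution is credited with the normalized gap between its neighbours, and boundary solutions get infinite distance. A compact sketch with illustrative structure names:

```cpp
// Crowding distance as defined in the NSGA-II literature; objectives 0 and 1
// could be delay and communication cost here.
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

struct Solution {
    double objectives[2];
    double crowding = 0.0;
};

void computeCrowdingDistance(std::vector<Solution>& front) {
    const std::size_t n = front.size();
    const double inf = std::numeric_limits<double>::infinity();
    if (n < 3) {                        // too few points: keep them all
        for (auto& s : front) s.crowding = inf;
        return;
    }
    for (auto& s : front) s.crowding = 0.0;
    for (int m = 0; m < 2; ++m) {
        std::sort(front.begin(), front.end(),
                  [m](const Solution& a, const Solution& b) {
                      return a.objectives[m] < b.objectives[m];
                  });
        front.front().crowding = front.back().crowding = inf;  // keep extremes
        const double range = front.back().objectives[m] -
                             front.front().objectives[m];
        if (range == 0.0) continue;
        for (std::size_t i = 1; i + 1 < n; ++i)   // credit the gap around i
            front[i].crowding += (front[i + 1].objectives[m] -
                                  front[i - 1].objectives[m]) / range;
    }
}
```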

Design/methodology/approach

In the proposed approach, which uses a two-dimensional mesh topology as the base construction, the mapping operation is divided into two stages: allocating tasks to suitable intellectual property (IP) cores; and mapping these cores to specific tiles on the NoC platform.

Findings

The proposed method substantially alleviates the known problems and limitations of meta-heuristic algorithms. It outperforms particle swarm optimization (PSO) and the genetic algorithm in convergence to the Pareto front, in producing a well-distributed set of solutions and in computational time. The simulation results also show that the delay of the proposed method is 1.1 per cent better than the genetic algorithm's and 0.5 per cent better than the PSO algorithm's; for communication cost, the proposed method is 2.7 per cent better than the genetic algorithm and 0.16 per cent better than the PSO algorithm.

Originality/value

To date, the MOPSO-CD algorithm had not been used to solve the task-mapping problem in NoCs.

Details

International Journal of Pervasive Computing and Communications, vol. 16 no. 2
Type: Research Article
ISSN: 1742-7371


Article
Publication date: 17 October 2018

Sura Nawfal and Fakhrulddin Ali


Abstract

Purpose

The purpose of this paper is to accelerate 3D object transformation using parallel techniques such as a multi-core central processing unit (MC CPU), a graphics processing unit (GPU), or both. Generating 3D animation scenes in computer graphics requires applying 3D transformations to the vertices of the objects, and these transformations consume most of the execution time. Hence, high-speed graphic systems greatly need accelerated vertex transforms: many matrix operations must be performed in real time, so execution time is critical for such processing.

Design/methodology/approach

In this paper, the acceleration of 3D object transformation is achieved using parallel techniques such as an MC CPU or a GPU, or both. Multiple geometric transformations are concatenated in any order in an interactive manner.
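A minimal sketch of this pattern, with hypothetical names: transforms are concatenated into a single 4×4 matrix, which is then applied to every vertex in a parallel loop (the same one-thread-per-vertex structure maps naturally onto a GPU):

```cpp
// Hypothetical sketch: concatenate affine transforms into one 4x4 matrix,
// then apply it to every vertex in a parallel loop.
#include <array>
#include <vector>

using Mat4 = std::array<std::array<double, 4>, 4>;
using Vec4 = std::array<double, 4>;     // homogeneous coordinates, w = 1

// Applying concat(a, b) is equivalent to applying b first, then a.
Mat4 concat(const Mat4& a, const Mat4& b) {
    Mat4 r{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                r[i][j] += a[i][k] * b[k][j];
    return r;
}

// Vertices are independent, so the loop parallelizes cleanly on an MC CPU.
void transformVertices(const Mat4& m, std::vector<Vec4>& verts) {
    #pragma omp parallel for
    for (long v = 0; v < (long)verts.size(); ++v) {
        Vec4 p{};
        for (int i = 0; i < 4; ++i)
            for (int k = 0; k < 4; ++k)
                p[i] += m[i][k] * verts[v][k];
        verts[v] = p;
    }
}
```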

Findings

Performance results are presented for a number of 3D objects with parallel implementations of the affine transform on the NVIDIA GPU series. The maximum execution time was about 0.508 s to transform 100 million vertices using LabVIEW and 0.096 s using Visual Studio. Other results also showed a significant speed-up compared with CPU, MC CPU and previously published computations for the same object complexity.

Originality/value

The high-speed execution of 3D models is essential in many applications such as medical imaging, 3D games and robotics.

Details

Journal of Engineering, Design and Technology, vol. 16 no. 6
Type: Research Article
ISSN: 1726-0531


Article
Publication date: 22 December 2023

Vaclav Snasel, Tran Khanh Dang, Josef Kueng and Lingping Kong



Abstract

Purpose

This paper aims to review in-memory computing (IMC) for machine learning (ML) applications from the perspectives of history, architectures and optimization options. In this review, the authors investigate different architectural aspects and provide their comparative evaluations.

Design/methodology/approach

The authors collect over 40 recent IMC papers related to hardware design and optimization techniques and classify them into three optimization categories: optimization through graphics processing units (GPUs), optimization through reduced precision and optimization through hardware accelerators. They then summarize each technique in terms of the data sets it was applied to, how it was designed and what the design contributes.
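As a concrete illustration of the "optimization through reduced precision" category, the sketch below quantizes a floating-point dot product to 8-bit integers with per-vector scales; this is a generic example of the technique, not code from any of the reviewed papers:

```cpp
// Generic reduced-precision sketch: quantize to int8 with per-vector scales,
// accumulate in 32-bit integers, rescale the result to float.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Symmetric quantization: x[i] is approximated by s * q[i], |q[i]| <= 127.
std::vector<std::int8_t> quantize(const std::vector<float>& x, float& s) {
    float maxAbs = 0.0f;
    for (float v : x) maxAbs = std::max(maxAbs, std::fabs(v));
    s = (maxAbs == 0.0f) ? 1.0f : maxAbs / 127.0f;
    std::vector<std::int8_t> q(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        q[i] = (std::int8_t)std::lround(x[i] / s);
    return q;
}

// Dot product in int8 arithmetic with an int32 accumulator.
float dotInt8(const std::vector<float>& a, const std::vector<float>& b) {
    float sa = 1.0f, sb = 1.0f;
    std::vector<std::int8_t> qa = quantize(a, sa);
    std::vector<std::int8_t> qb = quantize(b, sb);
    std::int32_t acc = 0;
    for (std::size_t i = 0; i < qa.size(); ++i)
        acc += (std::int32_t)qa[i] * qb[i];
    return sa * sb * (float)acc;
}
```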

Findings

ML algorithms are potent tools when accommodated on IMC architectures. Although general-purpose hardware (central processing units and GPUs) can supply explicit solutions, its energy efficiency is limited by the excessive flexibility it must support. On the other hand, hardware accelerators (field-programmable gate arrays and application-specific integrated circuits) win on energy efficiency, but an individual accelerator is often tailored exclusively to a single ML approach (family). From a long-term hardware evolution perspective, heterogeneous hardware/software co-design on hybrid platforms is an option for researchers.

Originality/value

IMC optimization enables high-speed processing, increases performance and allows massive volumes of data to be analyzed in real time. This work reviews IMC and its evolution and then categorizes three optimization paths for IMC architectures to improve performance metrics.

Details

International Journal of Web Information Systems, vol. 20 no. 1
Type: Research Article
ISSN: 1744-0084


Article
Publication date: 30 September 2014

Jose M. Chaves-Gonzalez and Miguel A. Vega-Rodríguez


Abstract

Purpose

The purpose of this paper is to study the use of a heterogeneous, evolutionary team approach based on different sources of knowledge to address a real-world problem in the telecommunication domain: the frequency assignment problem (FAP). Evolutionary algorithms have proved to be very suitable strategies for solving NP-hard optimization problems. However, these algorithms can run into difficulties when they fall into local minima, and generating high-quality solutions for real-world instances of the problem is computationally very expensive. In this scenario, a heterogeneous parallel team represents a very interesting approach.

Design/methodology/approach

The results have been validated using two real-world telecommunication instances containing real information about two GSM networks. Unlike most related publications, this paper focuses on aspects that are relevant to real communication networks. Moreover, because of the stochastic nature of metaheuristics, the results are validated through a formal statistical analysis divided into two stages: first, a complete statistical study and, second, a full comparative study against previously published results.

Findings

The comparative study shows that the heterogeneous evolutionary proposal obtains better results than proposals based on a single source of knowledge. In fact, the final results surpass those of other relevant studies previously published in the literature.

Originality/value

The paper provides a complete study of the contribution made by each metaheuristic in the team and of the impact of using different sources of evolutionary knowledge when the system is applied to a real-world FAP. The conclusions of this study represent an original contribution not previously achieved for FAP.

Details

Engineering Computations, vol. 31 no. 7
Type: Research Article
ISSN: 0264-4401

