Search results

1 – 2 of 2
Article
Publication date: 17 October 2022

Santosh Kumar B. and Krishna Kumar E.

Deep learning techniques are unavoidable in a variety of domains such as health care, computer vision, cyber-security and so on. These algorithms demand high data transfers but…

50

Abstract

Purpose

Deep learning techniques are unavoidable in a variety of domains such as health care, computer vision, cyber-security and so on. These algorithms demand high data transfers but require bottlenecks in achieving the high speed and low latency synchronization while being implemented in the real hardware architectures. Though direct memory access controller (DMAC) has gained a brighter light of research for achieving bulk data transfers, existing direct memory access (DMA) systems continue to face the challenges of achieving high-speed communication. The purpose of this study is to develop an adaptive-configured DMA architecture for bulk data transfer with high throughput and less time-delayed computation.

Design/methodology/approach

The proposed methodology consists of a heterogeneous computing system integrated with specialized hardware and software. For the hardware, the authors propose an field programmable gate array (FPGA)-based DMAC, which transfers the data to the graphics processing unit (GPU) using PCI-Express. The workload characterization technique is designed using Python software and is implementable for the advanced risk machine Cortex architecture with a suitable communication interface. This module offloads the input streams of data to the FPGA and initiates the FPGA for the control flow of data to the GPU that can achieve efficient processing.

Findings

This paper presents an evaluation of a configurable workload-based DMA controller for collecting the data from the input devices and concurrently applying it to the GPU architecture, bypassing the hardware and software extraneous copies and bottlenecks via PCI Express. It also investigates the usage of adaptive DMA memory buffer allocation and workload characterization techniques. The proposed DMA architecture is compared with the other existing DMA architectures in which the performance of the proposed DMAC outperforms traditional DMA by achieving 96% throughput and 50% less latency synchronization.

Originality/value

The proposed gated recurrent unit has produced 95.6% accuracy in characterization of the workloads into heavy, medium and normal. The proposed model has outperformed the other algorithms and proves its strength for workload characterization.

Details

International Journal of Pervasive Computing and Communications, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 1742-7371

Keywords

Open Access
Article
Publication date: 7 July 2022

Sirilak Ketchaya and Apisit Rattanatranurak

Sorting is a very important algorithm to solve problems in computer science. The most well-known divide and conquer sorting algorithm is quicksort. It starts with dividing the…

1240

Abstract

Purpose

Sorting is a very important algorithm to solve problems in computer science. The most well-known divide and conquer sorting algorithm is quicksort. It starts with dividing the data into subarrays and finally sorting them.

Design/methodology/approach

In this paper, the algorithm named Dual Parallel Partition Sorting (DPPSort) is analyzed and optimized. It consists of a partitioning algorithm named Dual Parallel Partition (DPPartition). The DPPartition is analyzed and optimized in this paper and sorted with standard sorting functions named qsort and STLSort which are quicksort, and introsort algorithms, respectively. This algorithm is run on any shared memory/multicore systems. OpenMP library which supports multiprocessing programming is developed to be compatible with C/C++ standard library function. The authors’ algorithm recursively divides an unsorted array into two halves equally in parallel with Lomuto's partitioning and merge without compare-and-swap instructions. Then, qsort/STLSort is executed in parallel while the subarray is smaller than the sorting cutoff.

Findings

In the authors’ experiments, the 4-core Intel i7-6770 with Ubuntu Linux system is implemented. DPPSort is faster than qsort and STLSort up to 6.82× and 5.88× on Uint64 random distributions, respectively.

Originality/value

The authors can improve the performance of the parallel sorting algorithm by reducing the compare-and-swap instructions in the algorithm. This concept can be used to develop related problems to increase speedup of algorithms.

Details

Applied Computing and Informatics, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 2634-1964

Keywords

Access

Year

Content type

Earlycite article (2)
1 – 2 of 2