Analysis and optimization of Dual Parallel Partition Sorting with OpenMP

Purpose – Sorting is a fundamental algorithm for solving problems in computer science. The best-known divide and conquer sorting algorithm is quicksort, which divides the data into subarrays and then sorts them.
Design/methodology/approach – In this paper, the algorithm named Dual Parallel Partition Sorting (DPPSort) is analyzed and optimized. It consists of a partitioning algorithm named Dual Parallel Partition (DPPartition). DPPartition is analyzed and optimized in this paper and combined with the standard sorting functions qsort and STLSort, which implement the quicksort and introsort algorithms, respectively. The algorithm runs on any shared-memory/multicore system. The OpenMP library, which supports multiprocessing programming, is compatible with the C/C++ standard library. The authors' algorithm recursively divides an unsorted array into two equal halves in parallel with Lomuto's partitioning and merges them without compare-and-swap instructions. Then, qsort/STLSort is executed in parallel once each subarray is smaller than the sorting cutoff.
Findings – In the authors' experiments, a 4-core Intel i7-6770 with an Ubuntu Linux system is used. DPPSort is faster than qsort and STLSort by up to 6.82× and 5.88× on Uint64 random distributions, respectively.
Originality/value – The performance of the parallel sorting algorithm is improved by reducing the compare-and-swap instructions in the algorithm. This concept can be applied to related problems to increase the speedup of algorithms.


Introduction
Sorting is a well-known algorithm for solving biological and scientific applications, including big data. Quicksort [1,2] is an important sorting algorithm based on the divide and conquer technique. The unsorted array is divided into smaller subarrays that are sorted independently. The technique consists of partitioning and sorting steps. Initially, the partitioning step divides the unsorted array recursively using pivot(s) into subarrays (divide). It runs until each subarray is shorter than the cutoff size. Note that the most popular partitioning algorithm is Hoare's partitioning algorithm [3]. Then, the sorter executes the sorting step independently (conquer).
The partitioning step is very important for sorting the data in parallel. A parallel partitioning algorithm which divides the unsorted array into two subarrays is proposed in this paper. Then, the partitioned subarrays on each side are merged to their positions. Note that the data are only swapped to their correct positions, without compare-and-swap instructions, in our merge algorithm. Finally, the pivot is moved to its correct position and the algorithm is executed recursively. The focus is on the partitioning step, which is part of the divide and conquer concept. The OpenMP Task construct in the OpenMP library [4] is used to implement this algorithm. Run time, Speedup and Speedup per core/thread results compared with the original algorithms are presented. We optimize this algorithm using the sorting cutoff size, which affects run time on each data distribution. The Perf profiling tool [5] is used to measure and analyze cache misses, branch mispredictions and other metrics. Finally, we compare our proposed partitioning algorithm with Hoare's partitioning algorithm.
The contributions of this paper are summarized as follows:
(1) We propose the parallel sorting algorithm named Dual Parallel Partition Sorting (DPPSort), which consists of partitioning and sorting steps using OpenMP.
(2) We propose the parallel partitioning algorithm called the Dual Parallel Partition phase, which divides the array into two subarrays and partitions them independently using Lomuto's partitioning algorithm. The two subarrays are then merged without compare-and-swap instructions using the Multi-Swap phase.
(3) The performance metrics such as run time, Speedup, Speedup/core, Speedup/thread, cache misses and branch load misses of DPPSort and the other algorithms are compared and analyzed.
This paper is organized as follows: Section 2 presents background and related work. Section 3 proposes our Dual Parallel Partition for sorting. In Section 4, the results are shown and compared across several data distributions. Finally, Section 5 presents the conclusion and future work.

Background and related work
In this section, OpenMP [4], a parallel application programming interface, is described. Moreover, the sequential standard sorting algorithms named qsort and STLSort are reviewed. Finally, several parallel sorting algorithms are surveyed and compared with the standard sorting algorithms.

OpenMP library
OpenMP [4] is an application programming interface (API) which supports parallel programming on shared memory systems. It consists of compiler directives, environment variables and functions that support C/C++ and Fortran. The execution model of OpenMP is the fork-join model. Execution starts with the master thread in a sequential part. Then, worker threads are forked in parallel. Finally, all threads are joined when their work is finished. The overhead between CPU cores of this API is very low compared with other libraries. The constructs of OpenMP consist of single program multiple data (SPMD) constructs, tasking constructs, device constructs, work-sharing constructs and synchronization constructs. The tasking construct is used in the recursive function in this paper. A task unit is executed by a thread independently.
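The fork-join tasking model described above can be sketched as follows. This is an illustrative example only (a task-parallel recursive sum, not code from the paper), and the names `task_sum` and `parallel_sum` are introduced here for illustration; the pragmas degrade gracefully to sequential execution when compiled without OpenMP.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Recursive task-parallel sum: each call forks two child tasks and
// joins them with taskwait, mirroring OpenMP's fork-join model.
long long task_sum(const int* a, std::size_t n) {
    if (n < 1024) {                      // sequential cutoff
        long long s = 0;
        for (std::size_t i = 0; i < n; ++i) s += a[i];
        return s;
    }
    long long left = 0, right = 0;
    #pragma omp task shared(left)        // forked to a worker thread
    left = task_sum(a, n / 2);
    #pragma omp task shared(right)
    right = task_sum(a + n / 2, n - n / 2);
    #pragma omp taskwait                 // join both child tasks
    return left + right;
}

long long parallel_sum(const std::vector<int>& v) {
    long long s = 0;
    #pragma omp parallel                 // master thread forks the team
    #pragma omp single                   // one thread seeds the task tree
    s = task_sum(v.data(), v.size());
    return s;
}
```

The `single` construct ensures only one thread seeds the recursion, while the forked tasks are picked up by the rest of the team; this is the same pattern the paper uses for its recursive partitioning.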

Standard sorting algorithm library
The well-known standard sorting libraries called qsort and STLSort are used in this paper. qsort is a standard library function for sorting data. It implements the well-known quicksort algorithm, which consists of partitioning and sorting steps. The <stdlib.h> header must be included in C to use this function.
STLSort [6] is a standard library sorting function. It combines three algorithms: the introsort algorithm, which mixes quicksort and heapsort, is performed first; then insertion sort is executed to sort small subarrays. To use this function in C++, the <algorithm> header must be included.
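As a concrete illustration (not code from the paper; the wrapper names `sort_with_qsort` and `sort_with_stlsort` are introduced here), the two library sorters can be invoked as follows. Note that qsort requires a user-supplied comparator, while std::sort uses operator< by default.

```cpp
#include <algorithm>  // std::sort (introsort)
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>    // qsort

// Comparator for C's qsort: must return <0, 0 or >0.
static int cmp_u64(const void* a, const void* b) {
    const uint64_t x = *static_cast<const uint64_t*>(a);
    const uint64_t y = *static_cast<const uint64_t*>(b);
    return (x > y) - (x < y);
}

// Sort n Uint64 elements with the C standard library quicksort.
void sort_with_qsort(uint64_t* a, size_t n) {
    qsort(a, n, sizeof(uint64_t), cmp_u64);
}

// Sort the same data with the C++ standard library introsort.
void sort_with_stlsort(uint64_t* a, size_t n) {
    std::sort(a, a + n);
}
```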
In the quicksort algorithm, there are two well-known partitioning algorithms. The first is Hoare's partitioning algorithm [3], the most popular one, in which indices traverse from left to right and from right to left to compare and swap data. The second is Lomuto's partitioning algorithm, in which both indices traverse in the same direction. The first index scans the array and the second index marks the boundary between data less than the pivot and data greater than the pivot. These indices compare and swap data until the first index reaches the last element of the array.
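A minimal sketch of Lomuto's scheme follows; placing the pivot at the right end is an illustrative assumption (the paper's LPar/RPar variants instead partition around a pre-selected pivot position), and `lomuto_partition` is a name introduced here.

```cpp
#include <cassert>
#include <utility>  // std::swap

// Lomuto's partition: j scans left to right, i marks the boundary of
// the "< pivot" region. Returns the pivot's final position p so that
// arr[l..p-1] < pivot and arr[p+1..r] >= pivot.
int lomuto_partition(int* arr, int l, int r) {
    const int pivot = arr[r];        // pivot kept at the right end
    int i = l;
    for (int j = l; j < r; ++j)
        if (arr[j] < pivot)
            std::swap(arr[i++], arr[j]);
    std::swap(arr[i], arr[r]);       // drop the pivot into place
    return i;
}
```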

Related works
There are several parallel quicksort algorithms which can run on shared memory systems. Many of them start by dividing the data into blocks and partitioning the data in parallel. Then, the data in each block are merged to their correct positions. We classify the related work into four categories as follows:
2.3.1 Parallel quicksort using fetch-and-add instruction and block-based techniques. Heidelberger et al. [7] proposed parallel quicksort on an ideal Parallel Random Access Machine using the fetch-and-add instruction. A Speedup of 400× with 500 processors can be obtained when sorting 2^20 data. PQuicksort, a fine-grain parallel quicksort algorithm, was proposed by Tsigas and Zhang [8]. Their algorithm uses a neutralized-blocks technique in parallel. A Speedup of 11× can be obtained with 32 processor cores. Süß and Leopold [9] presented Pthreads and OpenMP 2.0 implementations of their parallel quicksort. Its Speedup is 3.24× on a 4-core AMD Opteron processor. Traoré et al. [10] showed a work-stealing, deque-free parallel introspective sorting algorithm. A Speedup of 8.1× on a 16-core processor can be achieved. Ayguade et al. [9] presented MultiSort, which divides the input and sorts the parts with quicksort. After that, the sorted data are merged in parallel. The best Speedup is 13.6× on a 32-core CPU. Kim et al. [11] developed an introspective quicksort executed on the embedded dual-core OMAP-4430. The Speedup of their parallel introspective quicksort is 1.47×. Saleem et al. [12] estimated the Speedup of both quicksort and merge sort algorithms using Intel Cilk Plus. Ranokpanuwat and Kittitornkun [13] developed Parallel Partition and Merge Quicksort (PPMQSort). A Speedup of 12.29× is achieved on an 8-core Hyperthreaded Xeon E5520 when sorting 200 million random integers.
Recently, MultiStack Parallel Partition (MSP) which is a block-based partitioning algorithm was proposed [14]. Threads are forked to compare and swap the data in parallel using stacks. MSPSort is better than balanced quicksort and multiway merge sort while sorting Uint32 data on i7-2600, R7-1700 and R9-2920 machines.
2.3.2 Parallel sorting algorithms using multi-pivot technique. Man et al. [15,16] developed psort, which splits the unsorted array into groups of data and sorts them in parallel. After that, those groups are merged and sorted again sequentially. This algorithm runs on a 24-core CPU and a Speedup of 11× is achieved. Mahafzah [17] developed a multi-pivot sorting algorithm that divides the unsorted array into partitions in parallel with up to 8 threads. A Speedup of 3.8× can be obtained on a 2-core HyperThread machine. In 2017, parallel Hybrid Dual Pivot Sort (HDPSort) was presented by Taotiamton and Kittitornkun [18]. Both Lomuto's and Hoare's partitioning algorithms are implemented with two pivots in parallel using OpenMP. Speedups of 2.49× and 3.02× are achieved on Intel Core i7-2600 and AMD FX-8320 systems, respectively.

2.3.3 Parallel partitioning. Chen et al. [19,20] proposed a performance-aware model for sparse matrix multiplication on the Sunway TaihuLight supercomputer. A multi-level parallelism design for SpGEMM was developed to optimize load balance, coalesced DMA transmission, data reuse, vectorized computation and parallel pipeline processing. Later, they presented an adaptive and efficient framework for the sparse tensor-vector product kernel on the Sunway TaihuLight supercomputer. An auto-tuner that selects the best tensor partitioning method to improve load balance was proposed. Its maximum performance is up to 195.69 GFLOPS on 128 processors.
2.3.4 GPU sorting algorithms. Parallel quicksort on the GPU was implemented in 2010 [21]. It requires more memory to sort the data because it is not an in-place algorithm. The algorithm contains two phases. The first phase divides the data into GPU local memory and partitions them. The second phase runs a partitioning algorithm recursively using a stack and sorts the subarrays.
Kozakai et al. [22] developed an integer sorting algorithm based on histograms and prefix sums which runs on the GPU. Their algorithm is faster than the well-known Thrust sort and CUB sort on an Intel Xeon E5-2620 v3 with an NVIDIA Tesla K40c.

Dual Parallel Partition Sorting algorithm
This section describes the divide and conquer sorting algorithm named Dual Parallel Partition Sorting (DPPSort). Five algorithms are implemented in this work. First, the Dual Parallel Partition function DPPartition (Algorithm 1) is the partitioning function. The median-of-five function MO5 (Algorithm 2) is the pivot selection function executed before partitioning. The LPar and RPar functions (Algorithms 3 and 4), which are Lomuto's partitioning algorithms, are used inside DPPartition. Note that LPar and RPar partition from left to right and from right to left, respectively. Finally, MSwap (Algorithm 5) is a merging algorithm which swaps the elements greater than the pivot in the left subarray with the elements less than the pivot in the right subarray. We use the following notation in this paper: arr is the array of data, l is the left position index, r is the right position index, c is the sorting cutoff size and p is the pivot position index.
DPPartition begins by comparing the subarray size with the sorting cutoff size c. While the subarray is larger than c, the MO5 function is executed to select a pivot. This function selects the data at the left, first quarter, middle, third quarter and right positions of the subarray, sorts these five elements and chooses the middle one as the pivot. Next, the LPar and RPar functions are executed using the pivot position p. The tasking construct (omp task) is applied to both functions, and the new pivot boundaries (new_midL and new_midR) are returned. Note that new_midL and new_midR are declared as shared variables so they can be accessed after the tasks return, and omp taskwait synchronizes the left and right tasks. After that, the MSwap function swaps the data greater than the pivot in the left partition with the data less than or equal to the pivot in the right partition. It returns the new pivot position (new_midC) at this level. Finally, DPPartition is run in parallel on the left and right subarrays using omp task.
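The steps above can be condensed into the following sketch. It is a simplification under stated assumptions: the pivot is taken directly from the middle position rather than by MO5, int data stand in for the paper's Uint32/Uint64, and the function names (`dppsort`, `LPar`, `RPar`) only mirror the paper's algorithm labels rather than reproducing them exactly. The pragmas are no-ops when compiled without OpenMP, leaving a correct sequential algorithm.

```cpp
#include <algorithm>  // std::sort, std::min
#include <cassert>
#include <vector>

// Left-half Lomuto pass: a[l..i-1] < v and a[i..m-1] >= v afterwards.
static int LPar(int* a, int l, int m, int v) {
    int i = l;                            // boundary of the "< pivot" region
    for (int j = l; j < m; ++j)
        if (a[j] < v) std::swap(a[i++], a[j]);
    return i;
}

// Right-half mirror pass: a[m+1..i] < v and a[i+1..r] >= v afterwards.
static int RPar(int* a, int m, int r, int v) {
    int i = r;                            // boundary of the ">= pivot" region
    for (int j = r; j > m; --j)
        if (a[j] >= v) std::swap(a[i--], a[j]);
    return i;
}

void dppsort(int* a, int l, int r, int cutoff) {
    if (r - l + 1 <= cutoff) {            // conquer: standard sorter
        std::sort(a + l, a + r + 1);
        return;
    }
    const int m = l + (r - l) / 2;
    const int v = a[m];                   // pivot (the paper uses MO5 here)
    int newL = l, newR = r;
    #pragma omp task shared(newL)         // partition both halves in parallel
    newL = LPar(a, l, m, v);
    #pragma omp task shared(newR)
    newR = RPar(a, m, r, v);
    #pragma omp taskwait
    // MSwap: exchange the ">= v" block of the left half with the "< v"
    // block of the right half by plain pairwise swaps (no comparisons on
    // the data), then drop the pivot into its final position.
    const int nL = m - newL, nR = newR - m;
    for (int t = 0; t < std::min(nL, nR); ++t)
        std::swap(a[newL + t], a[newR - t]);
    const int mid = newL + nR;            // final pivot position (new_midC)
    std::swap(a[m], a[mid]);
    #pragma omp task
    dppsort(a, l, mid - 1, cutoff);
    #pragma omp task
    dppsort(a, mid + 1, r, cutoff);
    #pragma omp taskwait
}

// Entry point: fork the thread team, then seed the task tree once.
void dppsort(std::vector<int>& v, int cutoff) {
    if (v.empty()) return;
    #pragma omp parallel
    #pragma omp single
    dppsort(v.data(), 0, static_cast<int>(v.size()) - 1, cutoff);
}
```

Only the element count `min(nL, nR)` is compared in the merge; the data themselves are moved with unconditional swaps, which is the source of the branch-miss reduction the paper reports.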
In addition, when a subarray is smaller than c, a sequential STLSort/qsort call is forked as an independent parallel task.

Dual Parallel Partitioning phase
Two pointers are used in the Dual Parallel Partitioning phase to partition the data: one in LPar for the left partition and one in RPar for the right partition. The pointers in each function traverse in the same direction. In the LPar function, the pointer traverses from the leftmost position (l) to one before the middle (p − 1) of the subarray. On the other hand, the pointer of RPar traverses from the rightmost position (r) to one after the middle (p + 1). The indices indexl and indexr separate the data less than and greater than the pivot. Moreover, i and j are used to separate the partitioned and unpartitioned data.

Sorting phase
The data which have been partitioned by the Dual Parallel Partitioning and Multi-Swap phases and are smaller than the sorting cutoff size are sorted by a sorting function (qsort or STLSort) in parallel. The data are sorted using OpenMP parallel tasks by forking threads without blocking. Each worker thread joins its master thread automatically after the data are sorted.

Lomuto's vs Hoare's algorithm in Dual Parallel Partitioning phase
In this algorithm, the array is divided into two halves. We then use Lomuto's partitioning algorithm on the left half, whose index runs from the left to the middle (Algorithm 3), and on the right half, whose index runs from the right to the middle (Algorithm 4). Moreover, we substitute Hoare's partitioning algorithm into the Dual Parallel Partitioning phase and compare it with our proposed algorithm.
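For reference, a common formulation of Hoare's scheme is sketched below (middle-element pivot and the names `hoare_partition`/`hoare_quicksort` are illustrative choices, not taken from the paper). Unlike Lomuto's scheme, the two indices move toward each other, and the returned index only splits the range rather than fixing the pivot's final position.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>  // std::swap

// Hoare's partition: i and j start at the ends and move toward each
// other, swapping out-of-place pairs. Returns j such that every
// element of a[l..j] is <= every element of a[j+1..r].
int hoare_partition(int* a, int l, int r) {
    const int pivot = a[l + (r - l) / 2];
    int i = l - 1, j = r + 1;
    for (;;) {
        do { ++i; } while (a[i] < pivot);
        do { --j; } while (a[j] > pivot);
        if (i >= j) return j;
        std::swap(a[i], a[j]);
    }
}

// Quicksort driver using Hoare's split; note the left recursion
// includes index j because the pivot is not fixed in place.
void hoare_quicksort(int* a, int l, int r) {
    if (l >= r) return;
    const int j = hoare_partition(a, l, r);
    hoare_quicksort(a, l, j);
    hoare_quicksort(a, j + 1, r);
}
```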

Results
Three metrics are used to measure the performance of DPPSort_qsort and DPPSort_STL: run time, Speedup and Speedup per core/thread.
DPPSort is faster than the qsort and STLSort algorithms; the fastest variant is DPPSort_STL. The run times of DPPSort are only 3.97 and 2.87 seconds to sort 200 million Uint64 data with qsort and STLSort as the sorting cutoff algorithm, respectively, while the run times of the qsort and STLSort functions alone are 26.06 and 16.06 seconds. The sorting cutoff size is proportional to run time and affects the run time complexity. The run times of DPPSort_qsort and DPPSort_STL on random data are the fastest at c = N/32 and c = N/8, respectively, and are illustrated in Figures 2a and b. DPPSort run time falls slightly as the sorting cutoff size gets smaller for either sorting cutoff algorithm. We can notice that the best sorting cutoff size is c = N/32 for DPPSort_qsort: the Dual Partitioning phase should run until its partitions are small enough, and the partitions are then sorted with the sorting cutoff algorithm in parallel. On the other hand, the best sorting cutoff size of DPPSort_STL is c = N/8; its run time depends on the sorting cutoff algorithm. STLSort is significantly faster than qsort and can sort medium data sizes efficiently; therefore, the data do not need to be split into small pieces. Table 1 shows the average run time of each distribution at c = N/8. We can notice that the reversed-distribution run times of DPPSort_qsort, STLSort and qsort are the fastest compared with the other distributions. This can be because each of these algorithms uses Hoare's partitioning as the partitioning algorithm, which swaps the leftmost and rightmost data (the greatest and lowest, respectively), runs its indices to the middle position and finally sorts the data.
The DPPSort_STL run times of the reversed, nearly sorted and few-unique distributions are almost the same. This can be because T_DPPartition is lower than for the other distributions, which affects T_DPPSort. The run time of DPPSort_qsort on the reversed distribution is slightly faster than the others. This can be due to the DPPartition algorithm, which reduces T_DPPartition and thereby T_DPPSort.

Speedup.
Speedup is a metric used to measure the performance of the DPPSort algorithm. It is the ratio of the original run time to the DPPSort run time, as shown in equation (2).
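Equation (2) itself is not reproduced legibly in this excerpt; from the surrounding definition (original run time divided by DPPSort run time), it presumably has the form:

```latex
\mathrm{Speedup} \;=\; \frac{T_{\mathrm{qsort/STLSort}}}{T_{\mathrm{DPPSort}}} \qquad (2)
```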
Our experiments evaluate DPPSort with qsort and STLSort as the sorting cutoff algorithm. The average Speedups of DPPSort_qsort and DPPSort_STL (Uint64 random) are shown in Table 2.

The Speedup of DPPSort_qsort is significantly greater than that of DPPSort_STL. This can be because the partitioning run times of DPPSort_qsort and DPPSort_STL are similar, while the sorting run time of qsort is significantly greater than that of STLSort; the relative gain of DPPSort_qsort is therefore larger. Figure 3a shows the Speedup of DPPSort_qsort for various sorting cutoff sizes. It can be noticed that Speedup increases until c = N/32. However, the Speedup of DPPSort_STL reaches its highest value at c = N/8, as shown in Figure 3b. This can be due to the ratio between partitioning and sorting. qsort() is a quicksort algorithm which divides the data using Hoare's algorithm and then sorts with insertion sort when the subarrays are small. STLSort is the introsort algorithm, which combines quicksort and heapsort in the partitioning step: it divides the data into subarrays using quicksort and switches to heapsort when the recursion becomes too deep. When a subarray is small enough, insertion sort is called to sort it. STLSort() can sort large data better than qsort() because of its partitioning algorithm.
The best Speedups of DPPSort_qsort and DPPSort_STL on Uint32 data are at c = N/64 and c = N/32, respectively. Moreover, the best Speedup of DPPSort_qsort on Uint64 data is between c = N/32 and c = N/64. However, the best sorting cutoff of DPPSort_STL on Uint64 data differs from the other configurations: it is between c = N/4 and c = N/8. We can notice that the significant parameters are the sorting cutoff algorithm, the sorting cutoff size and the data type. The best sorting cutoff size depends on the sorting cutoff algorithm: when qsort is the sorting cutoff algorithm, the best sorting cutoff size is smaller than with STLSort. The data type is also important: the best Speedup on Uint64 data is larger than on Uint32 data.
4.2.3 Speedup per core and thread. Speedup per core and Speedup per thread are metrics used to evaluate a parallel algorithm: the higher they are, the more efficiently the algorithm uses the processor cores. Speedup per core is the ratio of the Speedup of a sorting algorithm to the number of CPU cores. Note that Speedup per thread is the ratio of the Speedup to the number of hardware threads.
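Written out, with N_cores and N_threads (notation introduced here for clarity) denoting the number of physical cores and hardware threads:

```latex
\mathrm{Speedup/core} = \frac{\mathrm{Speedup}}{N_{\mathrm{cores}}}, \qquad
\mathrm{Speedup/thread} = \frac{\mathrm{Speedup}}{N_{\mathrm{threads}}}
```

For the machine used in these experiments, N_cores = 4 and N_threads = 8.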
The Speedup per core and per thread results of the related algorithms are calculated and available at https://github.com/DPPSort/AnalyzeOptDPPSort/blob/main/Tables/t3.pdf. It can be noticed that the Speedup per core of our algorithm is higher than the others. The Speedups per core of DPPSort_qsort and DPPSort_STL are up to 1.71 and 1.47, respectively, both greater than 1.00. This means the algorithm uses the processor cores efficiently.
However, the 4-core, 8-thread i7-6770 is used to run DPPSort in our experiments, so Speedup per thread is also calculated because of Intel's Hyper-Threading Technology.
DPPSort_qsort, DPPSort_STL, STLSort and qsort are profiled with the Perf tools on Uint32 and Uint64 data with N = 200 × 10^6 and c = N/8. The cache misses on Uint32 data are slightly fewer than on Uint64 data. This can be because the Uint32 data type is smaller than Uint64. On the other hand, the data type does not affect branch load misses; therefore, the branch load misses of Uint32 and Uint64 are very similar.
It can be noticed that the run time of the reversed distribution is the lowest compared with the others; its branch load misses metric is also the lowest. Moreover, the run time of DPPSort_qsort on the reversed distribution is faster than DPPSort_STL, and the branch load misses of DPPSort_qsort are lower than those of DPPSort_STL. This can be due to Hoare's algorithm inside qsort, which is the sorting cutoff algorithm.
We can notice that every algorithm is slowest on the random distribution, where both the cache misses and branch load misses metrics are the greatest.
Two metrics chiefly affect the run time of a sorting algorithm. The first-priority metric is branch load misses: an algorithm with more branch load misses has a longer run time. If the branch load misses of two algorithms are about the same, the second-priority metric is cache misses.

Our proposed vs Hoare's partitioning algorithm in Dual Parallel Partitioning phase results
In this experiment, we substitute Hoare's partitioning algorithm for our proposed one and compare them. The results of DPPSort_STL using our proposed and Hoare's algorithms are available at https://github.com/DPPSort/AnalyzeOptDPPSort/blob/main/Tables/t6.pdf. The 200 million Uint32 data are run with different sorting cutoffs of c = N/2, N/4, N/8, N/16, N/32 and N/64.
It can be noticed that DPPSort_STL with Hoare's partitioning algorithm is faster than our proposed partitioning algorithm at c = N/2 and N/4, and its standard deviation is lower than ours. Our proposed partitioning algorithm is faster than or equal to Hoare's algorithm at c = N/8, N/16, N/32 and N/64. In addition, the standard deviation of our proposed partitioning algorithm is lower than Hoare's at N/16, N/32 and N/64. This means that as the sorting cutoff gets smaller, our proposed partitioning algorithm becomes faster and more stable than Hoare's. This can be because our proposed partitioning algorithm uses Lomuto's partitioning inside while using a Hoare-style split outside; therefore, its locality is better than Hoare's partitioning algorithm.


Conclusion and future work
This paper proposes an optimized Dual Parallel Partition Sorting (DPPSort) algorithm. The concept of DPPSort is to divide the data into two parts, run the partitioning algorithm on them in parallel and merge them with the Multi-Swap algorithm. This process runs recursively until each subarray is smaller than the sorting cutoff size. Each partition is then sorted using a standard sorting function in parallel.
DPPSort is implemented and run on an Intel Core i7-6770 with a Linux system. It is faster than the standard sorting algorithms qsort and STLSort: the Speedup on the random distribution is up to 6.82× and 5.88×, respectively. Note that a Speedup per thread of 0.85 can be obtained. Its performance depends on the sorting cutoff algorithm, the sorting cutoff size, the data type and the data distribution.
The first priority metric that affects run time is branch load misses. The second one is cache misses. It affects run time significantly while branch load misses of compared algorithms are the same.
DPPSort can be improved in future work. We can apply this algorithm to larger machines and implement it on heterogeneous systems to achieve further Speedup.