Data Structure Using C By Udit Agarwal Pdf Free
We also describe a method for producing compile-time optimizations for data-parallel applications and for extracting scheduling information at runtime (i.e., inspector-executor parallelization). We show that amorphous data-parallelism is ubiquitous in DDP applications and that, for a fixed number of cores, the parallelism of individual partitions can be increased through partition-wise compiler optimization. In addition, for a given data structure, we can produce a set of parallel partitions from which one can select the parallelization, if any, that is most appropriate.
To show the usefulness of this method, we show that compile-time optimization of a representative implementation of the FFT gives significant performance gains, and we explain how the optimization can be used to extract parallelism information at compile time. The behavior of the optimization is surprising for two reasons. First, the data structure turns out to have a higher constant factor and therefore requires a different degree of data-parallelism to achieve a given speedup; this indicates that the optimization must use a different scheduling algorithm from the FFT itself. Second, the performance is independent of the number of cores in the machine and of the number of partitions. Thus, the optimization is independent of a core’s share of the memory bandwidth, which is typically not the case for standard data-parallel optimizations. This suggests that the FFT parallelization is an excellent candidate for inspector-executor parallelization using optimizations of this kind. We then use this method to implement inspector-executor parallelization using the FFT, and we show how this is implemented by viper, an OpenMP-based compiler for the SIMD programming model. We show that the FFT parallelized by viper exhibits many of the expected characteristics of inspector-executor parallelization, supporting our claim.
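The inspector-executor idea mentioned above can be illustrated with a minimal sketch. It is not the method of the text and not how viper works; it only shows the general pattern under stated assumptions. The loop being parallelized is assumed to be `x[write_idx[i]] += x[read_idx[i]]`, and the function names `inspect` and `execute` are hypothetical. The inspector scans the index arrays at runtime and groups iterations into wavefronts with no conflicts inside a wavefront; the executor then runs the wavefronts one after another.

```python
def inspect(write_idx, read_idx, size):
    """Inspector: group loop iterations into conflict-free wavefronts.

    Assumed loop body (hypothetical): x[write_idx[i]] += x[read_idx[i]].
    Iteration i must run after every earlier iteration that wrote a
    location it reads or writes, or that read the location it writes.
    """
    last_write = [-1] * size  # highest wavefront that wrote each location
    last_read = [-1] * size   # highest wavefront that read each location
    wavefronts = []
    for i, (w, r) in enumerate(zip(write_idx, read_idx)):
        level = 1 + max(last_write[r], last_write[w], last_read[w])
        if level == len(wavefronts):
            wavefronts.append([])
        wavefronts[level].append(i)
        last_write[w] = level
        last_read[r] = max(last_read[r], level)
        last_read[w] = max(last_read[w], level)  # += also reads x[w]
    return wavefronts


def execute(x, write_idx, read_idx, wavefronts):
    """Executor: run wavefronts in order.

    Iterations inside one wavefront are independent, so they could be
    handed to separate threads. Here they run sequentially in reversed
    order to show that the order within a wavefront does not matter.
    """
    x = list(x)
    for wave in wavefronts:
        for i in reversed(wave):
            x[write_idx[i]] += x[read_idx[i]]
    return x


# Illustrative data (made up for this sketch).
write_idx = [0, 1, 0, 2, 1]
read_idx = [3, 0, 1, 3, 2]
x0 = [1, 2, 3, 4]

schedule = inspect(write_idx, read_idx, len(x0))
result = execute(x0, write_idx, read_idx, schedule)
```

For this input the inspector places iterations 0 and 3 in the same wavefront, since they touch disjoint data apart from a shared read, while the remaining iterations form a dependence chain. The cost of the inspector is paid once and can be amortized if the same index arrays are reused across many executions.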
The goal of this paper is to analyze and exploit the performance characteristics of hierarchical matrix-based data structures. Most of the literature treats only the single-precision case, while this paper also provides comparisons and implementations for the double-precision case. The paper also discusses certain aspects that are specific to each type of matrix. We proceed in four steps. Step 1 illustrates a structured data representation of the hierarchical matrix; this representation is used in Step 2 to describe the performance of the hierarchical matrix representation. The last two steps describe techniques for improving the performance of these data structures. The performance of these data structures is compared to one of the best-known hierarchical matrix implementations, UMFPACK, and to three hierarchical matrix data structures proposed in the literature.
Irregular algorithms are algorithms whose main data structures are complex, such as directed and undirected graphs, trees, etc. A useful abstraction for many irregular algorithms is the operator formulation, in which the algorithm is viewed as the iterated application of an operator to certain nodes, called active nodes, in the graph. Each operator application, called an activity, usually touches only a small part of the overall graph, so non-overlapping activities can be performed in parallel. In topology-driven implementations, all nodes are assumed to be active, so the operator is applied everywhere in the graph even if there is no work to do at some nodes. In contrast, in data-driven implementations, the operator is applied only to nodes at which there might be work to do. Multicore implementations of irregular algorithms are usually data-driven because current multicores support only small numbers of threads and work-efficiency is important. Conversely, many irregular GPU implementations use a topology-driven approach because work inefficiency can be counterbalanced by the large number of GPU threads. In this paper, we study data-driven and topology-driven implementations of six important graph algorithms on GPUs. Our goal is to understand the tradeoffs between these implementations and how to optimize them. We find that data-driven versions are generally faster and scale better despite the cost of maintaining a worklist. However, topology-driven versions can be superior when certain algorithmic properties are exploited to optimize the implementation. These results led us to devise hybrid approaches that combine the two techniques and outperform both of them.
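The difference between the two styles can be sketched on a single example. The code below is a sequential illustration only, not the GPU implementations studied in the text; breadth-first search is chosen here as an assumed example, and the function names and the small graph are hypothetical. Both functions apply the same operator (relax the outgoing edges of a node) and count how many times it is applied.

```python
from collections import deque

INF = float("inf")


def bfs_topology_driven(adj, source):
    """Apply the operator at every node in each sweep until nothing changes."""
    dist = [INF] * len(adj)
    dist[source] = 0
    applications = 0
    changed = True
    while changed:
        changed = False
        for u in range(len(adj)):  # every node is treated as active
            applications += 1
            for v in adj[u]:
                if dist[u] + 1 < dist[v]:
                    dist[v] = dist[u] + 1
                    changed = True
    return dist, applications


def bfs_data_driven(adj, source):
    """Apply the operator only at nodes taken from a worklist."""
    dist = [INF] * len(adj)
    dist[source] = 0
    applications = 0
    worklist = deque([source])
    while worklist:
        u = worklist.popleft()
        applications += 1
        for v in adj[u]:
            if dist[u] + 1 < dist[v]:
                dist[v] = dist[u] + 1
                worklist.append(v)  # v may now have work to do
    return dist, applications


# Small directed graph as adjacency lists (made up for this sketch).
adj = [[1, 2], [3], [3], [4], []]

topo_dist, topo_apps = bfs_topology_driven(adj, 0)
data_dist, data_apps = bfs_data_driven(adj, 0)
```

On this graph both versions compute the same distances, but the topology-driven version performs a second full sweep just to detect that nothing changed, while the data-driven version visits each reachable node once at the cost of maintaining the worklist. On a GPU the sweep over all nodes maps naturally onto many threads, which is the tradeoff the text describes.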