In this work, we evaluate an analytical GPU performance model based on Little's law, which expresses kernel execution time in terms of the latency bound, the throughput bound, and the achieved occupancy.
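The intuition behind such a model can be sketched as follows. By Little's law, sustained throughput equals concurrency divided by latency; concurrency in turn scales with achieved occupancy, and hardware peak throughput caps the result. The function and parameter names below are illustrative assumptions, not the paper's actual formulation:

```python
def kernel_time_cycles(instructions, peak_ipc, mean_latency, occupancy, max_warps):
    """Hypothetical Little's-law estimate of kernel execution time (in cycles).

    instructions : total instructions the kernel issues
    peak_ipc     : hardware peak throughput (instructions per cycle)
    mean_latency : mean instruction latency in cycles
    occupancy    : achieved occupancy, in [0, 1]
    max_warps    : maximum warps that can be resident on the device
    """
    concurrency = occupancy * max_warps            # warps in flight
    latency_hiding_ipc = concurrency / mean_latency  # throughput sustainable by latency hiding
    achieved_ipc = min(latency_hiding_ipc, peak_ipc)  # capped by the throughput bound
    return instructions / achieved_ipc             # total cycles to drain the work
```

With high occupancy the estimate lands on the throughput bound (`peak_ipc`); with low occupancy or long latencies it degrades into the latency-bound regime.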
We then combine it with findings from several research papers, introduce equations for estimating data transfer time, and finally incorporate the model into the MERPSYS framework, a general-purpose simulator for parallel and distributed systems.
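A common form for such transfer-time equations is a linear (Hockney-style) model: a fixed startup latency plus the payload size divided by sustained bandwidth. The sketch below illustrates that general shape under assumed parameter names; it is not the paper's exact equations:

```python
def transfer_time_s(bytes_moved, startup_latency_s, bandwidth_bps):
    """Linear transfer-time model (illustrative, not the paper's equations).

    bytes_moved      : payload size in bytes
    startup_latency_s: fixed per-transfer startup cost in seconds
    bandwidth_bps    : sustained bandwidth in bytes per second
    """
    return startup_latency_s + bytes_moved / bandwidth_bps
```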
The resulting solution enables the user to express a CUDA application in the MERPSYS editor using an extended Java language and then conveniently evaluate its performance for various launch configurations on different hardware units.
We also provide a systematic methodology for extracting the kernel characteristics that serve as input parameters of the model.
The model was evaluated using kernels with different characteristics and across a wide variety of launch configurations.
We found it to be highly accurate for compute-bound kernels and realistic workloads, while for memory-throughput-bound kernels and uncommon scenarios the results remained within acceptable limits.
We have also demonstrated its portability between two devices of the same hardware architecture but with different processing power.
Consequently, MERPSYS with the embedded theoretical models can be used to evaluate application performance on various GPUs, supporting performance prediction and, for example, purchase decisions.