2018-01-03: VC5 GL/Vulkan, VC4 perf counters

For VC5 features:

  • Fixed bugs in shader storage buffer objects / atomic counters.
  • Added partial shader_image_load_store support.
  • Started investigating how to support compute shaders.
  • More progress on Vulkan image layouts.
  • Updated to a current version of the SW simulator, and fixed the bugs that it revealed.
  • Reworked TMU output size/channel count setup.
  • Fixed flat shading for >24 varying components.
  • Fixed conditional discards within control flow (dEQP testcase).
  • Fixed CPU-side tiling for miplevels > 1 (dEQP testcase).
  • Fixed render target setup for RGB10_A2UI (GPU hangs).
  • Started on V3D 4.1 support.

For shader_image_load_store, we (like some cases of Intel) have to do manual tiling address decode and format conversion on top of SSBO accesses, and (like freedreno and some cases of Intel) want to use normal texture accesses for plain loads. I’ve started on a NIR pass that will split apart the tiling math from the format conversion math so that we can all hopefully share some code here. Some tests are already passing.
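To make the split between tiling math and format conversion concrete, here is a minimal CPU-side sketch in C. It is not the NIR pass and not the real V3D tiling: the 4x4 tile layout, the RGBA8 UNORM format, and the function names tiled_offset, pack_rgba8_unorm and image_store_rgba8 are all assumptions made up for illustration. The point is only that the address computation and the format packing are independent steps, with the final write being a plain buffer store.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout for illustration only (NOT the real V3D tiling):
 * 4x4-pixel tiles stored in row-major order, with the pixels inside each
 * tile also stored row-major, 4 bytes per pixel. */
#define TILE_W 4u
#define TILE_H 4u
#define BYTES_PER_PIXEL 4u

/* Step 1: tiling math. Maps pixel coordinates to a byte offset in the buffer. */
static uint32_t
tiled_offset(uint32_t x, uint32_t y, uint32_t width_px)
{
   assert(width_px % TILE_W == 0);
   assert(x < width_px);

   uint32_t tiles_per_row = width_px / TILE_W;
   uint32_t tile_index = (y / TILE_H) * tiles_per_row + (x / TILE_W);
   uint32_t pixel_in_tile = (y % TILE_H) * TILE_W + (x % TILE_W);

   return (tile_index * TILE_W * TILE_H + pixel_in_tile) * BYTES_PER_PIXEL;
}

/* Step 2: format conversion. Packs float RGBA into RGBA8 UNORM with R in the
 * low byte. This knows nothing about where the texel will be stored. */
static uint32_t
pack_rgba8_unorm(const float rgba[4])
{
   uint32_t packed = 0;

   for (int i = 0; i < 4; i++) {
      float c = rgba[i];
      if (!(c > 0.0f))
         c = 0.0f; /* clamps negatives (and NaN) to 0 */
      if (c > 1.0f)
         c = 1.0f;
      packed |= (uint32_t)(c * 255.0f + 0.5f) << (8 * i);
   }

   return packed;
}

/* The image store is then an ordinary buffer write at the computed offset,
 * much like an SSBO store. Byte order in memory follows the host CPU. */
static void
image_store_rgba8(uint8_t *buf, uint32_t width_px,
                  uint32_t x, uint32_t y, const float rgba[4])
{
   uint32_t texel = pack_rgba8_unorm(rgba);
   memcpy(buf + tiled_offset(x, y, width_px), &texel, sizeof(texel));
}
```

Because the two steps do not depend on each other, a driver with a different tiling layout could in principle swap out only the first function, and one with a different set of formats only the second.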

The compute shaders are interesting on this hardware. There’s no custom hardware support for them. Instead, you can emit a certain series of line primitives to the fragment shader such that you get shader instances spawned on all the QPUs in the right groups. It’s not pretty, but it means that the infrastructure is totally shared.

On the VC4 front, I got a chance to try out Boris’s performance counter work, using 3DMMES as a testcase. He found frameretrace really hard to work on, and so we don’t have a port of it yet (the fact that porting is necessary seems like a serious architectural problem). However, I was able to use things like “apitrace --pdrawcalls=GL_AMD_performance_monitor:QPU-total-clk-cycles-waiting-varyings” to poke at the workload.

It is easy to be led astray by the performance counter support on a tiling GPU: since we flush the frame at each draw call, the GPU spends all its time loading/storing the frame buffer instead of running shaders, so idle clock cycles are silly to look at at the draw call level. However, being able to look at things like cycles spent in the shaders of each draw call lets us approximate the total time spent in each shader, to direct optimization work.
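The per-shader approximation amounts to summing a per-draw-call counter over all draws that used the same shader program. A rough sketch of that bookkeeping in C follows. The struct draw_sample, its fields, the MAX_PROGRAMS limit and both function names are hypothetical, and it assumes the per-draw counter values have already been extracted from the profiling output along with the program bound at each draw.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One sample per draw call. Both fields are hypothetical names: the program
 * bound for the draw, and the shader cycle counter reported for that draw. */
struct draw_sample {
   uint32_t program;
   uint64_t qpu_cycles;
};

/* Illustrative upper bound on program ids, so totals can be a flat array. */
#define MAX_PROGRAMS 64

/* Sums the counter per program into totals[] (indexed by program id) and
 * returns the grand total over all draws. */
static uint64_t
accumulate_cycles(const struct draw_sample *samples, size_t count,
                  uint64_t totals[MAX_PROGRAMS])
{
   uint64_t all = 0;

   for (size_t i = 0; i < MAX_PROGRAMS; i++)
      totals[i] = 0;

   for (size_t i = 0; i < count; i++) {
      assert(samples[i].program < MAX_PROGRAMS);
      totals[samples[i].program] += samples[i].qpu_cycles;
      all += samples[i].qpu_cycles;
   }

   return all;
}

/* Returns the program id with the largest accumulated total, or -1 if every
 * total is zero. This is the first candidate for optimization work. */
static int
hottest_program(const uint64_t totals[MAX_PROGRAMS])
{
   int best = -1;
   uint64_t best_cycles = 0;

   for (int i = 0; i < MAX_PROGRAMS; i++) {
      if (totals[i] > best_cycles) {
         best_cycles = totals[i];
         best = i;
      }
   }

   return best;
}
```

The totals are only as trustworthy as the counter chosen: shader cycle counts aggregate sensibly this way, while idle-cycle counters would mostly reflect the per-draw flush described above.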