2017-08-01: copy performance

This week I worked on building an implementation of the overlapping blit extension for VC4. My first extension spec effectively said “unscaled overlapping BlitFramebuffer now gives you the answer you obviously want.” I had plumbed it all the way through to the backend, written spec text, and was getting ready to submit it for review, when I realized it wasn’t the spec I wanted.

The problem for X11, and specifically Raspbian’s desktop, is that we’re doing a CopyArea with a non-rectangular region. If we implement overlapping blits using BlitFramebuffer, we have a blit per rectangle in the region, each of which dispatches a rendering job to the kernel. Since rendering jobs only operate on tiles, those single-scanline rectangles at the top and bottom of your rounded window have 64x the cost they should, not even counting job dispatch overhead.

So, I went back to the drawing board and wrote a new spec. What I’m offering now is control over the basic HW functionality: You can flip a bit that says “Instead of tile raster order being HW-defined, you have two bits for left-to-right or right-to-left, and top-to-bottom or bottom-to-top.” Combined with GL_ARB_texture_barrier (where I provide the definition for what guarantees the tile raster order provides you), we can submit drawing for all the rectangles in the copy in a single job.

So far, review in Mesa is going well, though the kernel has silence so far. Jason Ekstrand at Intel noted that while they can’t do tile raster order, they could do overlapping 1:1 blits in most cases, and could break things down in the others to a series of valid operations with cache flushes in between. However, his description of the breakdown process is already something that Glamor could do using GL_ARB_texture_barrier:

while (dest_region is nonempty) {
    src_region = dest_region + src copy offset
    this_region = dest_region - src_region
    copy this region;
    glTextureBarrier();
    dest_region -= this_region
}

While I don’t need to do this for VC4, since we have a better option with my extension, it still seems like a good thing to implement.

For VC4, I have x11perf -copywinwin500 performing the same as the previous SW rasterization that Raspbian had. Next week: fixing the interactivity now that the raw performance is in place.

Other news tidbits this week:

  • Landed the CLIF-style debug dumping for VC4
  • Landed the bank-aware register selection optimization for VC4, at last.
  • Debugged an assertion failure regression on VC4 from Marek’s CPU optimization work
  • Tested and landed Hans Verkuil’s HDMI CEC driver for VC4.
  • Landed kernel-side BO labeling (debugfs memory allocation descriptions) support for VC4
  • Did some review on the DRI3 plane modifier patch series from Collabora (a great start, but underspecified in some critical ways).