2017-06-02: cec, transposer, memory consumption

It’s been a busy month. Part of that is that I took a week off to go to our Burning Man regional event, which threw a serious wrench into my status updates.

Hans Verkuil now has HDMI CEC working, and is polishing the patches up for inclusion in the kernel. Unfortunatelly we set his work back a bit with Boris’s new HDMI power management code, and that’s getting resolved now. He also tripped over a bug in my new dma-buf reservation code that I helped track down.

Boris has the transposer working and has reviewed the core writeback code, so hopefully we can land it soon. I’m really excited to use this for testing tiled display support (see below).

My work for the last couple of weeks has been in figuring out what to do about the fact that the new Raspbian display stack (vc4 firmwarekms, glamor, compton compositor) takes way more memory than the old display stack (fbdev on the firmware, no glamor, no compositing). It’s been pretty easy for users to hit the 256MB CMA limit, and we aren’t even using 256MB of CMA on the Pi0/1 due to its only having 512MB of system memory.

One of the ideas would be to page GPU allocations out to system memory. That only gets us from 256MB to <1GB of memory available. It’s not like you want to be swapping to SD cards. If we’re hitting 256MB this easily, we need to figure out how to fix that.

The first step was to figure out where our memory was going. I wrote a hack that had the kernel give names to the BOs it allocated, and added a new UAPI that let userspace attach more descriptive names to BOs. With that, I added a little hack to Mesa to attach some descriptions. I then made /debug/dri/0/bo_stats print out the stats on count/size per name, rather than just total. Here’s the output of LXDE with compton as I was dragging a terminal around:

 scanout resource 1920x1080@32/0:  48600kb BOs (6)
    tiling shadow 1920x1080:  32640kb BOs (4)
		   overflow:  16384kb BOs (1)
    resource 1920x1080@32/0:  16320kb BOs (2)
    resource 1024x1024@32/0:   4096kb BOs (1)
      tiling shadow 661x411:   2184kb BOs (2)
      resource 657x386@32/0:   1092kb BOs (1)
  scanout resource 661x411@32/0:   1068kb BOs (1)
     resource 1048576x1@8/0:   1024kb BOs (1)
     resource 1024x1024@8/0:   1024kb BOs (1)
		   resource:    888kb BOs (20)
      tiling shadow 1920x26:    240kb BOs (1)
      resource 1920x26@32/0:    240kb BOs (1)
		     shader:    236kb BOs (56)
  scanout resource 1920x26@32/0:    196kb BOs (1)
		 mesa cache:    152kb BOs (5)
       resource 659x20@32/0:     84kb BOs (1)
       resource 65536x1@8/0:     64kb BOs (1)
      resource 128x128@32/0:     64kb BOs (1)
	  resource 1x1@32/0:     56kb BOs (14)
	      dumb 64x64@32:     48kb BOs (3)
	resource 659x1@32/0:     36kb BOs (3)
		      cache:     28kb BOs (4)
			RCL:     20kb BOs (2)
	 resource 26x1@32/0:     16kb BOs (4)
	resource 1x386@32/0:     16kb BOs (2)
	resource 1x361@32/0:     16kb BOs (2)
	 resource 1x25@32/0:     16kb BOs (4)
	resource 609x1@32/0:     12kb BOs (1)
	resource 607x1@32/0:     12kb BOs (1)
	resource 12x18@32/0:      4kb BOs (1)
			BCL:      4kb BOs (1)

This horrifying. First, this is already after a fix I’ve landed that kept us from leaking an extra overflow BO, that cost 16MB at all times. Next, 6 full-screen scanout resources? I expected 3, certainly, due to compton’s pageflipping (though for a compositor, keeping 3 is bad: I should have GPU idle time built into my pipeline anyway, as presumably GPU work is being done by other clients to generate the frames I’m compositing). Our non-pageflipping scanout buffer is probably another one (can we free the screen pixmap’s contents during pageflipping? That would sure be nice). Maybe the file manager is drawing into a window rather than the root window, to make the another one. There’s still one more I haven’t accounted for, though.

More embarassing is the tiling shadow allocations. We do this in vc4 because it can’t texture from linear BOs, so we allocate a shadow resource that we blit a tiled copy into. The cost of those tile blits are actually why we’re running a compositor in Raspbian in the first place, because window movement is too expensive otherwise.

My crazy idea was to make a new mode for glamor that just keeps all pixmaps in system memory until they need to be on the GPU (because they’re either scanned out by the GPU or they have an exposed reference to them for DRI). My previous reworks for linear vs tiled buffers in glamor already made this look easy. You end up with software fallbacks all the time (The old fbturbo driver was all software fallbacks, though), and with the new gbm_bo_map() interface we should be able to keep the cost of software fallbacks down to just the cost of the rendering performed, not a 1920x1080 readback per drawing operation.

However, 20 patches in and without having gotten to the point of gbm_bo_map() yet, this is looking like a mess. On Friday I had breakfast with keithp and talked over what I was doing, and he convinced me to go back to an older, simpler plan: Just make all of our pixmaps tiled, so we don’t have the shadow copies at all. If we don’t have the shadow copies, then we don’t need the compositor to get window-dragging performance. At least 64MB of the allocations above would go away.

The new plan means we eat the memory bandwidth cost of scanning out from tiled, but it will be easy enough to register a timer to go make a linear copy and flip to that. We’re not the only platform that would like that, I’m sure.

I’ve now written the kernel interface and debugging it is the plan for this week.

In other news, my panel-bridge v3 is now in drm-misc-next, along with most of the associated code deletion. I also helped the team trying to integrate my PL111 code with a bug in panning support, which their current 3D stack relies on. In the process of debugging the pl111+vc4 code (a full screen X11 fallback would reallocate the BO when it shouldn’t), I added debugfs register dumping for pl111, which should be useful for other developers using the code.

I have the X Server building in Travis CI, at long last. The meson build system was critical for this, as autotools was just too slow. The other key part was switching to Docker for bundling our dependencies, since 20 autotools invocations was unreasonable, and I won’t have the time to convert those 20 things to meson. Next steps on CI are to get the X Test suite running in the meson build system (the docker image already has it present).

Finally, earlier this month I submitted my Mesa register allocation series (let the driver choose the color during graph coloring), which should be good for about a 2% performance improvement to vc4 3D.