computing, costs and caching, oh my

Via coding horror, I stumbled upon a simply wonderful talk by Herb Sutter about various performance issues like how much operations cost. It also discusses how memory, latency and machine architecture can affect that cost how this has changed over the years. You can find the slides and a video of the presentation at http://nwcpp.org/Meetings/2007/09.html.

Be prepared for a total geek-out. This is highly technical (and awesome, but that's bordering on a tautology) stuff and probably not for the faint of heart. Slides 6 and 7, for example, around the 23m mark) show the value of cache compared to getting something from RAM, and just how bad retrieval from disk is. Later (slides 13 and on; around 55m in the video), when it comes to threads and how a compiler or even hardware may screw you over not do what you want to do, or even what you tell it to do, people how still have them are allowed to run to their moms for safety. By Patina, that is just nasty.

Near the end Sutter discusses the differences between using vectors, lists and sets and what the penalties for the latter are for something as simple add adding all the values in them. This starts at around slide 22, or 1h40m. Even if the rest is gobblyjook, this part is easy to understand. Basically, low footprint and sequential accesses are Good Things, even if you have cache and stuff. Especially when you have cache and stuff.

memcpy and memset replacements for GBA/NDS

The standard C functions for copying and filling are memcpy() and memset(). They're part of the standard library, are easy to use and are often implemented with some optimizations so that they're usually faster than manual looping. The DKA version, for example will fill as words if the alignments and sizes allow for it. This can be much faster than doing the loops yourself.

There is, however, one small annoying fact about these two: they're not VRAM-safe. If the alignment and size aren't right for the word transfers, they will transfer bytes. Not only will this be slow, of course, but because you can't write to VRAM in bytes, the data will be corrupted.

The solutions for this have mostly come down to “so don't do that then”. Often, this can be sufficient: tiles in VRAM are word-aligned by definition, and source graphics data can and should be word-aligned anyway. However, now that I'm finally working on a bitmap blitter for 8bpp and 16bpp, I find that it's simply not enough. So I wrote the following set of functions to serve as replacements.

The code

My main goal here was to create smallish and portable replacements, not to have the greatest and fastestest code around because that's rather platform dependent. Yes, even the difference between GBA and NDS should matter, because of the differences in ldr/str times and caching.

There are 5 functions here. The main functions here are tonccpy and __toncset for copying and filling words, respectively. The other 3 are interfaces for __toncset for filling 8-bit, 16-bit and 32-bit data; you need these for, say, filling with a color instead of 8-bit data. For the rest of the discussion, I will use the name “toncset” for the internal routine for convenience.

Continue reading