OptimizingTorqueOnMacOSX

Some basic notes on optimizing Torque for MacOS X...

TODO: Intro goes here.

This document is written against the Torque 1.3 release, and the 1.3 demo, as it stands out of the box (or out of the .dmg, if you will).

I am a command line whore and therefore a total XCode n00b; everything I did that involved the UI was feeling around to make it spit out the gcc command lines I wanted. There is almost certainly a better way to get from Point A to Point B for anything XCode-specific described here. Comments and corrections are welcome.

Basic ideas (to be fleshed out!)

Build the right thing!

First things first, make sure you're compiling a release build and not a debug build! In XCode, you want to select "Show Detailed Build Results" from the "Build" menu (or just hit Shift-Apple-B), and make sure "Active Target" is "Torque-MacCarb-Release" and "Active Build Style" is "Deployment", and then click the "Build" button to rebuild with the new optimizations. Obviously, you want to dial these back to debug settings if you need to step through the engine in the debugger, but if you are making a public build or are mostly fighting with scripts, then you might as well get the speed boost.

Use better compiler options.

Add some extra compiler optimizations to the build. The two you'll definitely want to add to your compile command lines are -falign-loops=16 and -faltivec. These are safe for debug builds, too. -faltivec doesn't actually change performance at all; it just gives us access to the vector-related keywords and Altivec intrinsics. It is needed on all build targets because we'll be adding some Altivec code later. The G4 takes a performance hit if a loop isn't aligned to 16 bytes (hence the -falign-loops=16). G5 systems take a performance hit if a loop isn't aligned to 32 bytes, but -falign-loops=16 makes every loop G4 friendly and gives each loop a 50% chance of being G5 friendly, since half of all 16-byte boundaries are also 32-byte boundaries. Overall, this isn't an unreasonable tradeoff at this point. You can just align all loops to 32 bytes, but you might start to see performance loss from cache thrashing, and you'll have a larger executable to ship. Experiment if you like, but "16" seems to be safe for games shipped in 2005.

Most of the other Torque 1.3 optimization command lines seem okay (-O3, -mdynamic-no-pic, etc). If you are willing to jettison G3 support, you can add some G4-specific optimizations, but I don't consider this a good practice at this point in time unless you're doing a high-end game. There are still a lot of G3 iBooks out there. All the Altivec patches we'll be making will fall back to G3-friendly behavior at runtime, if necessary, too. Also worth adding are -ffast-math, which gives less accurate but faster results for some expensive math routines, and -fno-math-errno, which makes most of the functions in the C runtime's math.h neglect to set the variable "errno" on error. Generally this lets the compiler replace some of these functions with inline code (notably sqrt() on G5-specific builds, but other things as well).

On XCode's "Project" menu, pick "Edit Active Target 'Torque-MacCarb-Release'", then "GCC Compiler Settings", and put "-falign-loops=16 -faltivec -ffast-math -fno-math-errno" at the end of the "Other C Compiler Flags" junk. Add just "-faltivec" to the "Torque-MacCarb-Debug" settings.

While we're discussing debug settings, Torque 1.3 comes with the debug build set to optimize for binary size (-Os)...this makes debugging harder. Dial this back to -O0 (no optimizations at all) to make debugging easier. Also, release builds can have the "generate debug symbols" box checked if you will remember to strip the binary before shipping. Hey, sometimes you need to use the debugger with a release build! Stripping a binary removes debug symbols after the fact and is perfectly safe, but that's up to your discretion and understanding of the "strip" command line tool.

Also, you're going to need the vecLib framework. It ships on every Mac; you just need to tell the project to use it, since we'll be adding some patches later that rely on it. Expand "torque_pb_2_1" in the tree on the left pane in XCode's main window, find "Frameworks" and right-click/ctrl-click it, and choose "Add" -> "Existing Frameworks..." from the popup menu. Then go find "Macintosh HD/System/Library/Frameworks/vecLib.framework" and click the "Add" button. Add it to all targets. That's all.

You can't beat Apple's C runtime.

There are several places in the engine that have for-loops that are meant to be basic memory block copies, or string manipulation. These will almost certainly be slower than the standard C runtime on MacOS...which isn't necessarily true on, say, Linux. Apple has spent a lot of effort on not only hand-optimizing memcpy(), but hand-optimizing it for each processor they ship and making sure the C runtime does the right thing for the system that your app is running on. You will not beat this with a trivial loop in C (even if you are clever and copy 32 bits at a time).

Since Torque stores all script variables as strings, regardless of their intended type, it spends an inordinate amount of time in string manipulation routines, so you can pick up some speed by swapping these string routines out for the standard C runtime versions.
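As a rough illustration of the kind of swap meant here, the sketch below puts a hand-rolled byte copy loop next to the memcpy() call that would replace it. Both function names are made up for the example and are not taken from the Torque source.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical "before": a hand-rolled copy loop of the kind described above. */
void copy_block_loop(void *dst, const void *src, size_t len)
{
    unsigned char *d = (unsigned char *) dst;
    const unsigned char *s = (const unsigned char *) src;
    size_t i;

    for (i = 0; i < len; i++)
        d[i] = s[i];
}

/* Hypothetical "after": same contract (buffers must not overlap), but the
   C runtime gets to pick whatever copy routine it has for this CPU. */
void copy_block_libc(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
}
```

The same idea applies to string loops: a loop that walks to the terminating null is a strlen(), and a loop that copies up to it is a strcpy().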

Altivec blender.

On any terrain-based scene, you will spend an insane amount of time in engine/terrain/blender.cc...15 to 35% of your CPU time is not uncommon! Mostly this is doing work that is fairly well-suited for the Altivec unit, and Kyle Goodwin has written an Altivec version of this code which makes it an order of magnitude faster.
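Any Altivec code path has to be chosen at runtime so that G3 systems keep working. The sketch below shows one way that selection might look; it is not Kyle's code. It assumes the classic sysctl() query for HW_VECTORUNIT on PowerPC MacOS X and reports "no Altivec" everywhere else. The blend routine is a stand-in (a plain average of two byte buffers), and choose_blender() and its parameter are hypothetical names.

```c
#include <stddef.h>
#if defined(__APPLE__) && (defined(__ppc__) || defined(__ppc64__))
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/* Returns nonzero if the CPU reports a vector unit. On anything that is
   not PowerPC MacOS X this sketch just answers "no". */
int has_altivec(void)
{
#if defined(__APPLE__) && (defined(__ppc__) || defined(__ppc64__))
    int sels[2] = { CTL_HW, HW_VECTORUNIT };
    int has_unit = 0;
    size_t len = sizeof(has_unit);

    if (sysctl(sels, 2, &has_unit, &len, NULL, 0) != 0)
        return 0;
    return has_unit != 0;
#else
    return 0;
#endif
}

typedef void (*blend_fn)(unsigned char *dst, const unsigned char *a,
                         const unsigned char *b, size_t count);

/* G3-friendly fallback: rounded average of two 8-bit buffers. This is a
   stand-in for the real blending math, which is more involved. */
static void blend_scalar(unsigned char *dst, const unsigned char *a,
                         const unsigned char *b, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++)
        dst[i] = (unsigned char) ((a[i] + b[i] + 1) >> 1);
}

/* Pick the implementation once at startup. vector_version is whatever
   Altivec routine you have (or NULL if none was built). */
blend_fn choose_blender(blend_fn vector_version)
{
    if (vector_version != NULL && has_altivec())
        return vector_version;
    return blend_scalar;
}
```

Calling choose_blender() once and keeping the returned pointer avoids repeating the CPU check inside the hot loop.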

Other Altivec-able hotspots.

In the first view of Orc Town in the Torque Feature Demo, 3.8% of the CPU time is spent in processTriFan(). An Altivec version brings this down to 2.2%.

Unnecessary type conversions.

Code such as this...

 float add_point_5(float x) {
    return x += 0.5;
 }

...doesn't just add 0.5 to x, it converts x from a 32-bit float to a 64-bit double, adds 0.5 to it, and then converts it back to a 32-bit float for returning. While this is arguably a problem on all platforms, the conversion is significantly more costly on PowerPC chips.
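A sketch of the fix, using a hypothetical function name: give the constant an "f" suffix so it is a float literal and the whole expression stays in single precision.

```c
/* Same operation, but 0.5f is a float, so x is never promoted to double. */
float add_point_5_float(float x)
{
    return x + 0.5f;
}
```

The same reasoning applies to math calls: passing a float to a double function such as sqrt() forces the same round trip, while the float variants (sqrtf() and friends) avoid it.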

A simple grep for ".0" in the .cc files turns up over 2000 instances, and the .h files list up to 1800 more. Each of these is good for a few free cycles.

Altivec and register-heavy math functions.

There are some math functions that benefit from being Mac-ified.

Replace sqrt() with frsqrte opcode and Newton-Raphson.

The C runtime's sqrt() is slow on MacOS. Very slow. Unlike x86, the PowerPC doesn't necessarily have a full-blown opcode for square root calculation, so to give sqrt() full precision, it has to do a lot of generic work on the CPU to make the calculations. It's not uncommon for Torque games to spend 2% or more of their CPU time in sqrt().

You can avoid this by using the "frsqrte" opcode on G3 and G4 systems. This gives a "reciprocal estimate" of the square root (an approximation of 1/sqrt(x)), which is low precision, but fast, fast, fast compared to sqrt().

It's worth noting that G5 systems do have a real square root opcode, which is faster than the reciprocal estimate method and gives full precision to boot. sqrt() in the C runtime will use this on G5 systems. You can also coerce GCC to use this opcode directly when you call sqrt() and not call into the C runtime for ultimate performance, but that loses you all G3 and G4 systems. For now, it's probably worth doing the reciprocal estimate version on all platforms.
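Here is a minimal sketch of the estimate-plus-refinement idea, not the exact patch. On PowerPC it uses GCC inline assembly for the reciprocal square root estimate instruction (spelled frsqrte in the instruction set). On any other compiler target it substitutes a well-known integer bit trick as the seed, purely so the sketch builds and runs elsewhere; that fallback assumes IEEE-754 floats. The function names and the choice of two refinement steps are assumptions.

```c
#include <stdint.h>
#include <string.h>

/* Low-precision estimate of 1/sqrt(x) for x > 0. */
static float rsqrt_estimate(float x)
{
#if defined(__ppc__) || defined(__powerpc__) || defined(__POWERPC__)
    double est;
    __asm__ ("frsqrte %0, %1" : "=f" (est) : "f" ((double) x));
    return (float) est;
#else
    /* Stand-in seed for non-PowerPC builds (not what the opcode does). */
    uint32_t bits;
    memcpy(&bits, &x, sizeof(bits));
    bits = 0x5f3759dfu - (bits >> 1);
    memcpy(&x, &bits, sizeof(x));
    return x;
#endif
}

/* Refine the estimate with Newton-Raphson: y' = y * (1.5 - 0.5 * x * y * y).
   Each step roughly doubles the number of correct bits. */
float fast_rsqrt(float x)
{
    float y = rsqrt_estimate(x);
    y = y * (1.5f - 0.5f * x * y * y);
    y = y * (1.5f - 0.5f * x * y * y);
    return y;
}

/* sqrt(x) == x * (1/sqrt(x)); zero and negative inputs just return 0 here. */
float fast_sqrt(float x)
{
    return (x > 0.0f) ? x * fast_rsqrt(x) : 0.0f;
}
```

How many refinement steps you need depends on how much precision the caller can tolerate; fewer steps are faster and less exact.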

Division is expensive

Division is a guaranteed pipeline stall on PowerPC chips. When you can avoid it, you should. Multiply instead when you can. Also, if you need a reciprocal, don't divide 1.0f by the value; use the "fres" opcode for a reciprocal estimate (possibly with some Newton-Raphson). There are even some places where we do "1.0/sqrt()", which breaks the division advice, the sqrt() advice, and the float-to-double overhead advice. :)
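The "multiply instead" advice usually comes down to hoisting one divide out of a loop. A small sketch with made-up function names:

```c
#include <stddef.h>

/* Before: one divide per element. */
void scale_div(float *v, size_t n, float d)
{
    size_t i;

    for (i = 0; i < n; i++)
        v[i] /= d;
}

/* After: one divide total, then a multiply per element. */
void scale_mul(float *v, size_t n, float d)
{
    const float inv = 1.0f / d;
    size_t i;

    for (i = 0; i < n; i++)
        v[i] *= inv;
}
```

The two versions can differ in the last bit of the result, since multiplying by a rounded reciprocal is not exactly the same as dividing. That is normally fine for game math, but worth knowing.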

Altivec Ogg Vorbis.

Frame-by-frame, I didn't see Ogg Vorbis as a big bottleneck, but since it _IS_ a bottleneck when it is doing its work, it's worth making it faster. I have written an Altivec version of Ogg Vorbis 1.0, which is a drop-in replacement for the existing libs. It can make Ogg decoding up to 50% faster on G4/G5 systems, which doesn't suck.

Use AL_EXT_float32 with internal Ogg Vorbis.

Some OpenAL implementations can be fed PCM data in 32-bit floating point format, which is notable because CoreAudio, the MacOS audio API that OpenAL tends to be layered on, only eats float32 data. Ogg Vorbis, by default, decodes to float32, and then, in a highly inefficient for-loop, converts to 16-bit int before returning the data to the app. As it stands now, you're converting from float to int in Ogg Vorbis, converting back to float32 inside OpenAL, and probably converting back to int inside CoreAudio to feed the physical audio device. All of that adds up in terms of processing load and cache thrashing. You can change how you use the Vorbis libraries and OpenAL, though, so that you pass audio in float32 from start to finish down the audio pipeline.

If the AL implementation supports AL_EXT_float32, you can go straight down the audio pipeline with no conversions.
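For a sense of the work being skipped, this is roughly what a per-sample float-to-int16 conversion looks like. It is a simplified sketch written for this page, not the code from the Vorbis libraries, and the function name is made up.

```c
#include <stddef.h>

/* Convert float samples (nominally -1.0 to 1.0) to signed 16-bit PCM.
   Every sample costs a multiply, two compares and a float-to-int convert. */
void float_to_int16(short *dst, const float *src, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++) {
        float scaled = src[i] * 32768.0f;

        /* Clamp before converting so out-of-range samples don't wrap. */
        if (scaled > 32767.0f)
            scaled = 32767.0f;
        else if (scaled < -32768.0f)
            scaled = -32768.0f;

        dst[i] = (short) scaled;  /* truncates; a real decoder may round */
    }
}
```

With a float32 pipeline, this loop and its mirror image on the other side of the AL both go away.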

Use AL_EXT_vorbis and dodge all this stuff.

If the AL implementation supports AL_EXT_vorbis, you can use it to feed the compressed Ogg data to the AL and skip the decoding and buffer queueing (and, if the AL is really smart, it's using my Altivec decoder and a pure float32 rendering pipeline internally, too). For small sounds, it might be faster to decode the audio completely and feed it to alBufferData() as PCM data once, but for streaming audio that you are feeding via AL's buffer queueing mechanism, this is a total win.

Don't use glFinish() or glFlush()

Apple's OpenGL implementation is highly parallelized and suffers much more than other platforms when you force it to synchronize unnecessarily. Translation: never use glFinish() or glFlush() if you can avoid doing so. Torque does so, mostly to fix an issue with the Direct3D wrapper. It is safe to comment out all of these calls on the Mac for an immediate framerate boost.

Strategic inlining.

To be written.

TorqueScript variables are always strings internally

All TorqueScript variables are converted to and from strings all the time. Even points are converted to "x y" format. Ideally, someone should change this behaviour inside the engine to something more efficient.
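To make the cost concrete, here is a sketch of what a round trip through "x y" format involves for a single point. The struct, the function names and the "%g %g" formatting are all assumptions for illustration; the engine's own code will differ in the details.

```c
#include <stdio.h>

/* Hypothetical 2D point type for the example. */
typedef struct { float x, y; } Point2;

/* Native value -> "x y" string: a full printf-style format pass. */
void point_to_string(const Point2 *p, char *buf, size_t buflen)
{
    snprintf(buf, buflen, "%g %g", p->x, p->y);
}

/* "x y" string -> native value: a full parse. Returns 1 on success. */
int string_to_point(const char *str, Point2 *p)
{
    return sscanf(str, "%f %f", &p->x, &p->y) == 2;
}
```

Every script-side read or write of a point pays for one of these passes, where a typed variable would be a couple of loads or stores.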

Cache terrain as static data.

...via vertex_buffer_object or vertex_array_range. To be written.

Other stuff.

To be written.

--ryan.