UnrealEngine3: Work officially begins. Time to get this building under gcc again. Since we changed up the source tree layout, the Makefiles from UE2 are basically useless. Rather than rewrite them, I'm going to spend some time exploring SCons, which came highly recommended by TTimo, the Doom3/linux guy. First Unreal steals their colored lighting, then their build system! :) Obviously, there's a lot to be done at this point, but best to start now so I'm not scrambling to port a whole engine when UE3 games get closer to shipping. Updates as I have them. Spider-Man 2: Gone gold, baby. MojoPatch: Now open source! http://icculus.org/news/news.php?id=2040 Shrek 2: Now shipping! OpenAL: For those that weren't at WWDC, Apple gave out preview discs of MacOS X Tiger. One of the things Tiger installs by default? OpenAL.framework. No kidding. That's basically awesome. It's Apple's version, which is open-sourced and residing in Creative.com's CVS repository. Likely I'll move my implementation over to Linux and stop further Mac development, so that there is a clear technology path on the Mac. I'll devote further Mac development and debugging to the Apple implementation. After all, it was partially a stop-gap solution (remember when ut2003 took 25% of the CPU mixing audio? It was a needed fix, no doubt!), and partially a technology proof-of-concept to show Apple what works well in terms of game development. No doubt it has served me well. In the short term, I'll have to decide what we ship on the disc with Unreal-based games. For ut2004, my implementation is the only one with ALC_EXT_capture support for VoIP...this could be added to the other implementation, but hasn't been as of yet. My version is apparently a little faster, but it's stereo only (but the subversion repository doesn't crash on M-Audio 5.1 and 7.1 cards anymore), so Apple's tech is probably more attractive for further development by default. All of the missing functionality in Apple's implementation could be fixed with some elbow grease, which I'm sure will show up one way or another in the near future. Overall, this is a very good step forward, and I applaud Apple for giving game developers something they really need. Duke3D: Latest CVS builds and runs on Solaris/x86 (and presumably Solaris/sparc, too). I get a lot of questions about MacOS X: the game _does_ run on OSX, there just isn't a nice installer or anything at the moment, so you have to compile it yourself. When there's time, I'll put together a shareware-based installer, and, if I can find a copy, one that works with the Mac retail disc. UTPG: (Yes, this is still being worked on.) Unreal Tournament 2003: There's an exploit in the ut2003 network code, so here's a new build. Linux: http://0day.icculus.org/ut2003/ut2003lnx_patch2225-3-BETA.tar.bz2 MacOSX: http://0day.icculus.org/ut2003/ut2003-mac-patch-2225-3.dmg.bz2 The Linux one has about a million changes over the stock 2225, since it's got all the MacOSX work on top of it. Consider it beta. The Mac version has one or two fixes, so it's worth updating. Unreal Tournament 2004: If Mac retail installer crashes on you, use this: http://icculus.org/~icculus/tmp/UT2004-mac-updated-installer.tar.bz2 Linux (x86 and amd64) official 3236 patch (new build with load times fixed): http://icculus.org/news/news.php?id=2064 MacOS X (un?)official 3236 patch (YES, this is newer than 3229): http://icculus.org/news/news.php?id=2064 Call of Duty: 1.4 is out, now with PunkBuster support: http://www.callofduty.com/patch/ This is a 1.4 server with an exploit closed. Admins should all upgrade: http://icculus.org/betas/cod/COD-lnxded-1.4-07252004.tar.bz2 Postal 2 Share the Pain: Linux demo: http://icculus.org/news/news.php?id=1816 Linux retail: In beta testing (apply at http://www.linuxgamepublishing.com/) Mac retail: In beta testing (no more applications, please!) America's Army: 2.1.0 is out for Linux and Mac: http://icculus.org/news/news.php?id=2046 Other stuff: (Updated: my solution is at the bottom.) So here's your programming test. You have a program that reads a bunch of data from disk...several hundred megabytes. Most of this data is comprised of callstacks...variable-length lists of pointers that represent a program's program flow. Ever hit a breakpoint in gdb and hit "bt"? That backtrace. That's what I'm talking about. Let's say you are writing a profiler that is examining a run of a program. Let's say you're writing Apple's Shark tool or whatnot. You might have hundreds of thousands (or millions) of backtraces that represent a small snapshot of a program at any given time. So, how do you efficiently store this information as it comes flooding in? Here are your interfaces: "callstackid" can be anything that fits in the platform's wordsize...a pointer or an integer or whatnot, but it has to be an intrinsic datatype of some sort, not a structure. You can have structures behind the scenes. callstackid callstack_add(void **ptrs, size_t count); ...this will get called repeatedly as the callstacks are read from disk. (ptrs) is an array of void-pointers, which represent the callstack. (count) are the number of elements pointed to by (ptrs), which is not null-terminated. This is called once per callstack. We'll be nice and let you have: void callstack_doneadding(void); ...which is called when we're done reading from disk, in case you want notification that there won't be more calls to callstack_add(). Now we'll be calling... size_t callstack_framecount(callstackid id); ..where you return the number of pointers in that specific callstack. We'll use this to allocate enough space before calling: void callstack_get(callstackid id, void **ptrs); ...where you copy the pointers into the array we supply in (ptrs). In Real Life, this would pass the size of the array, but you can assume we allocated it based on what you fed us in callstack_framecount(), and that id is a valid value you fed us from callstack_add(). Etc. That's it. The goal is to implement this interface so that it is: 1) as fast as possible, even with a potentially huge data set, in all parts of this API (i.e. - if adding is fast, but getting is SLOW, then it's no good). 2) uses as little memory as possible. Remember that we might feed you gigabytes of data over time in callstacks_add(). We'll be nice and say you won't get more than your address space can hold, but even then, you can't use the whole thing (ever use 4 gigs of swap?). 3) as simple as possible (har!). You can use C or C++ at your discretion. Nothing else. If it's not in the standard C runtime, you can't use any external functions, either (yes, that means no STL or Boost, etc). There isn't a 100% Good answer; With large datasets, it's always a collection of tradeoffs and content assumptions. Welcome to the industry. Those that complete this assignment have the satisfaction of knowing that you probably came up with an acceptable answer faster than I did. :/ ...So here was my solution. I used C++, which was pretty daring for me, but in this case, it helped simplify the expression of my solution sufficiently. I haven't debugged this yet, but it does compile. Ok, here are some basic assumptions about our data set. All of this is about tradeoffs and optimizing for the Most Likely Case, after all: - Most of your callstacks are going to be the same. Lots of your program time is generally spent in a few places. Lots of callstacks are going to be complete duplicates in almost any practical sampling of a program, so storing them multiple times is wasteful. Obviously we're going to want to figure out if a callstack has been seen before and feed back a previous ID instead of building a new one. - Most of your callstacks are going to share a common ancestry...i.e. - they'll probably all have main() at the top (or one of a few thread entry points). After that, they probably have other functions in common. For example: void x(void) { y(); z(); } void main(void) { x(); } There you might have two callstacks: y -> x -> main and z -> x -> main The only difference is the call to y() or z(). The rest is duplication. Storing the duplicate parts is wasted space if we can share this data between callstack IDs. So we'll make each element of a callstack into a node in a tree: typedef void *PTR; // "PTR *" is less confusing than "void **". class CallstackNode { public: CallstackNode(PTR _ptr=0, CallstackNode *p=NULL, size_t d=0) : ptr(_ptr), depth(d), parent(p), children(NULL), sibling(NULL) {} ~CallstackNode(); PTR ptr; size_t depth; CallstackNode *parent; CallstackNode *children; CallstackNode *sibling; }; // CallstackNode ... Some of these elements eat a little more memory, but save us some CPU time later. For example, "depth" is how far down the tree a node is, which we could get from walking back up the tree, but since this information is convenient at the time of creation, we save it so framecount() is fast. Again, tradeoffs. Now, my required API is wrapped in a C++ class. This is mainly so I can use the destructor to automatically delete the tree when we're done, but that was just a question of convenience; this could be done with a global variable just as easily. class CallstackManager { public: typedef CallstackNode *callstackid; // Consider this opaque. callstackid add(PTR *ptrs, size_t framecount); void done_adding(); size_t framecount(callstackid id); void get(callstackid id, PTR *ptrs); protected: CallstackNode root; }; // CallstackManager You'll notice that "callstackid", our opaque descriptor for a given callstack, is just a CallstackNode. This will be the deepest leaf of the tree (the "y" in "y -> x -> main"). The implementation... CallstackManager::callstackid CallstackManager::add(PTR *ptrs, size_t framecount) { CallstackNode *parent = &this->root; // top of tree. CallstackNode *node = parent->children; // root node is placeholder. size_t origframecount = framecount; // assume everything is coming from main(), so start from the back so // we put it at the top of the tree. This will result in less dupes, // as nodes that have common ancestry will share common nodes. ptrs += framecount; while ((node != NULL) && (framecount)) { if (node->ptr != *ptrs) // non-matching node; check siblings. node = node->sibling; else // matches, check next level... { ptrs--; framecount--; parent = node; node = node->children; } // else } // while // (framecount == 0) here means a complete match with existing branch. while (framecount) // build any missing nodes... { node = new CallstackNode(*ptrs, parent, origframecount - framecount); node->sibling = parent->children; parent->children = node; parent = node; ptrs--; framecount--; } // if return((callstackid) node); // bottom of the callstack ("main", etc). } // CallstackManager::add ...So we start at the top of the tree, see if (say) "main" is in there. Yes? Move to main's children nodes and look at the next element of the callstack. Repeat while each element matches a node. If you hit one that doesn't, this is a new callstack, so create all the missing child nodes. When done, return a pointer to the bottom node. void CallstackManager::done_adding() { // no-op in this implementation } // CallstackManager::done_adding ...This implementation doesn't need any post-processing. size_t CallstackManager::framecount(callstackid id) { return(((CallstackNode *) id)->depth); } // CallstackManager::framecount ...We store the depth of a given node, so framecount() runs in O(1) time. void CallstackManager::get(callstackid id, PTR *ptrs) { CallstackNode *node = (CallstackNode *) id; while (node) { *ptrs = node->ptr; ptrs++; node = node->parent; } // while } // CallstackManager::get ...get() runs in O(framesize) time; it just walks back up the tree from the node returned by add(), filling in the caller's array. Most stacks in a real-world program are going to be small...100 frames would be considered large in many cases, but this will still work with recursive functions that go thousands of frames deep. ...finally, the cleanup is simple, too: // CallstackManager automatically deletes the root node in its default // destructor, which causes the whole tree to collapse. CallstackNode::~CallstackNode() { CallstackNode *kid = this->children; while (kid) { CallstackNode *next = kid->sibling; delete kid; kid = next; } // while } // CallstackNode destructor That's it. The drawbacks of this method: - Each node takes some amount of memory. There may be other approaches that can keep a callstack as an array of pointers and use less memory overall. You could eat more CPU and beat up on the cache more by removing the "depth" field, and maybe "sibling". - All these node creations cause a lot of initial malloc pressure, and possibly a lot of memory fragmentation. After the initial flurry, the memory allocations should drop off as previous-seen nodes start showing up. Possible improvements: - Might be able to get a win by encoding recursive function calls into a single node. Not sure if the added complexity is worth the effort in a Real World program's flow. - We could take the time to sort the siblings into a sane order, and then implement a faster search. This might require keeping siblings as an array that is realloc()'d. Alternately, we could just move the most-recently matching node in a given set of children to the front of the list, since it's likely to hit multiple times in a row. The least-used nodes will float to the back of the list. - Could possibly doubly-link nodes, so if you have two largely similiar branches with different parents, they can exist in multiple places in the tree, and point to each other. This makes destruction harder, and initial adding would need a means to find duplicates. For extremely diverse data sets, the added effort could be a considerable win, but again, I suspect that the average usage patterns render this effort basically worthless. Tradeoffs, tradeoffs, tradeoffs... --ryan.