Finger info for

 Work officially begins. Time to get this building under gcc again. Since
 we changed up the source tree layout, the Makefiles from UE2 are basically
 useless. Rather than rewrite them, I'm going to spend some time exploring
 SCons, which came highly recommended by TTimo, the Doom3/linux guy. First
 Unreal steals their colored lighting, then their build system! :)
 Obviously, there's a lot to be done at this point, but best to start now
 so I'm not scrambling to port a whole engine when UE3 games get closer to
 shipping. Updates as I have them.

Spider-Man 2:
 Gone gold, baby.

 Now open source!

Shrek 2:
 Now shipping!

 For those that weren't at WWDC, Apple gave out preview discs of MacOS X Tiger.
 One of the things Tiger installs by default? OpenAL.framework. No kidding.
 That's basically awesome. It's Apple's version, which is open-sourced and
 residing in the OpenAL project's CVS repository.

 Likely I'll move my implementation over to Linux and stop further Mac
 development of it, so that there is a clear technology path on the Mac;
 future Mac development and debugging effort will go toward the Apple
 implementation.
 After all, it was partially a stop-gap solution (remember when ut2003 took
 25% of the CPU mixing audio? It was a needed fix, no doubt!), and partially
 a technology proof-of-concept to show Apple what works well in terms of game
 development. No doubt it has served me well.

 In the short term, I'll have to decide what we ship on the disc with
 Unreal-based games. For ut2004, my implementation is the only one with
 ALC_EXT_capture support for VoIP...this could be added to the other
 implementation, but hasn't been yet. My version is apparently a little
 faster, but it's stereo only (though the Subversion repository version no
 longer crashes on M-Audio 5.1 and 7.1 cards), so Apple's tech is probably
 more attractive for further development by default.

 All of the missing functionality in Apple's implementation could be fixed
 with some elbow grease, which I'm sure will show up one way or another in
 the near future. Overall, this is a very good step forward, and I applaud
 Apple for giving game developers something they really need.

 Latest CVS builds and runs on Solaris/x86 (and presumably Solaris/sparc, too).
 I get a lot of questions about MacOS X: the game _does_ run on OSX, there
 just isn't a nice installer or anything at the moment, so you have to
 compile it yourself. When there's time, I'll put together a shareware-based
 installer, and, if I can find a copy, one that works with the Mac retail disc.

 (Yes, this is still being worked on.)

Unreal Tournament 2003:
 There's an exploit in the ut2003 network code, so here's a new build.


 The Linux one has about a million changes over the stock 2225, since it's got
 all the MacOSX work on top of it. Consider it beta. The Mac version has one
 or two fixes, so it's worth updating.

Unreal Tournament 2004:
 If the Mac retail installer crashes on you, use this:

 Linux (x86 and amd64) official 3236 patch (new build with load times fixed):

 MacOS X (un?)official 3236 patch (YES, this is newer than 3229):

Call of Duty:
 1.4 is out, now with PunkBuster support:

 This is a 1.4 server with an exploit closed. Admins should all upgrade:

Postal 2 Share the Pain:
 Linux demo:
 Linux retail: In beta testing (apply at
 Mac retail: In beta testing (no more applications, please!)

America's Army:
 2.1.0 is out for Linux and Mac:

Other stuff:

 (Updated: my solution is at the bottom.)

 So here's your programming test.

 You have a program that reads a bunch of data from disk...several hundred
 megabytes. Most of this data is composed of callstacks...variable-length
 lists of pointers that represent a program's flow of execution.

 Ever hit a breakpoint in gdb and typed "bt"? That's a backtrace. That's what I'm
 talking about. Let's say you are writing a profiler that is examining a
 run of a program. Let's say you're writing Apple's Shark tool or whatnot.
 You might have hundreds of thousands (or millions) of backtraces that
 represent a small snapshot of a program at any given time.

 So, how do you efficiently store this information as it comes flooding in?
 Here are your interfaces:

 "callstackid" can be anything that fits in the platform's wordsize...a
 pointer or an integer or whatnot, but it has to be an intrinsic datatype
 of some sort, not a structure. You can have structures behind the scenes.

  callstackid callstack_add(void **ptrs, size_t count);

 ...this will get called repeatedly as the callstacks are read from disk.
 (ptrs) is an array of void-pointers, which represents the callstack. (count)
 is the number of elements pointed to by (ptrs), which is not null-terminated.
 This is called once per callstack.

 We'll be nice and let you have:

  void callstack_doneadding(void);

 ...which is called when we're done reading from disk, in case you want
 notification that there won't be more calls to callstack_add().

 Now we'll be calling...

  size_t callstack_framecount(callstackid id);

 ...where you return the number of pointers in that specific callstack. We'll
 use this to allocate enough space before calling:

  void callstack_get(callstackid id, void **ptrs);

 ...where you copy the pointers into the array we supply in (ptrs). In Real
 Life, this would pass the size of the array, but you can assume we allocated
 it based on what you fed us in callstack_framecount(), and that id is a valid
 value you fed us from callstack_add(). Etc.

 That's it. The goal is to implement this interface so that it is:
  1) as fast as possible, even with a potentially huge data set, in all
  parts of this API (i.e. - if adding is fast, but getting is SLOW, then
  it's no good).
  2) uses as little memory as possible. Remember that we might feed you
  gigabytes of data over time in callstack_add(). We'll be nice and
  say you won't get more than your address space can hold, but
  even then, you can't use the whole thing (ever use 4 gigs of swap?).
  3) as simple as possible (har!).

 You can use C or C++ at your discretion. Nothing else. If it's not in the
 standard C runtime, you can't use any external functions, either (yes, that
 means no STL or Boost, etc).

 There isn't a 100% good answer; with large datasets, it's always a collection
 of tradeoffs and content assumptions. Welcome to the industry.

 Those that complete this assignment have the satisfaction of knowing that
 they probably came up with an acceptable answer faster than I did. :/

 ...So here was my solution. I used C++, which was pretty daring for me, but
 in this case, it helped simplify the expression of my solution sufficiently.
 I haven't debugged this yet, but it does compile.

 Ok, here are some basic assumptions about our data set. All of this is about
 tradeoffs and optimizing for the Most Likely Case, after all:

 - Most of your callstacks are going to be the same. Lots of your program
  time is generally spent in a few places. Lots of callstacks are going to be
  complete duplicates in almost any practical sampling of a program, so
  storing them multiple times is wasteful. Obviously we're going to want to
  figure out if a callstack has been seen before and feed back a previous
  ID instead of building a new one.

 - Most of your callstacks are going to share a common ancestry...i.e. -
  they'll probably all have main() at the top (or one of a few thread entry
  points). After that, they probably have other functions in common.

  For example:
  void x(void) { y(); z(); }
  void main(void) { x(); }

  There you might have two callstacks:
  y -> x -> main
  z -> x -> main

  The only difference is the call to y() or z(). The rest is duplication.
  Storing the duplicate parts is wasted space if we can share this data
  between callstack IDs.

  So we'll make each element of a callstack into a node in a tree:

  typedef void *PTR; // "PTR *" is less confusing than "void **".

  class CallstackNode
  {
  public:
      CallstackNode(PTR _ptr=0, CallstackNode *p=NULL, size_t d=0)
          : ptr(_ptr), depth(d), parent(p), children(NULL), sibling(NULL) {}
      ~CallstackNode();
      PTR ptr;
      size_t depth;
      CallstackNode *parent;
      CallstackNode *children;
      CallstackNode *sibling;
  }; // CallstackNode

  ... Some of these elements eat a little more memory, but save us some CPU
  time later. For example, "depth" is how far down the tree a node is, which
  we could get from walking back up the tree, but since this information is
  convenient at the time of creation, we save it so framecount() is fast.
  Again, tradeoffs.

  Now, my required API is wrapped in a C++ class. This is mainly so I can use
  the destructor to automatically delete the tree when we're done, but that was
  just a question of convenience; this could be done with a global variable
  just as easily.

  class CallstackManager
  {
  public:
      typedef CallstackNode *callstackid; // Consider this opaque.

      callstackid add(PTR *ptrs, size_t framecount);
      void done_adding();
      size_t framecount(callstackid id);
      void get(callstackid id, PTR *ptrs);

  protected:
      CallstackNode root;
  }; // CallstackManager

  You'll notice that "callstackid", our opaque descriptor for a given callstack,
  is just a pointer to a CallstackNode. This will be the deepest node of its
  branch (the "y" in "y -> x -> main").

  The implementation...

  CallstackManager::callstackid CallstackManager::add(PTR *ptrs, size_t framecount)
  {
      CallstackNode *parent = &this->root; // top of tree.
      CallstackNode *node = parent->children; // root node is placeholder.
      size_t origframecount = framecount;

      // assume everything is coming from main(), so start from the back so
      //  we put it at the top of the tree. This will result in fewer dupes,
      //  as nodes that have common ancestry will share common nodes.
      ptrs += (framecount - 1);

      while ((node != NULL) && (framecount))
      {
          if (node->ptr != *ptrs) // non-matching node; check siblings.
              node = node->sibling;
          else // matches, check next level...
          {
              parent = node;
              node = node->children;
              ptrs--;
              framecount--;
          } // else
      } // while

      if (framecount == 0) // complete match with an existing branch.
          return((callstackid) parent);

      while (framecount) // build any missing nodes...
      {
          node = new CallstackNode(*ptrs, parent, (origframecount - framecount) + 1);
          node->sibling = parent->children;
          parent->children = node;
          parent = node;
          ptrs--;
          framecount--;
      } // while

      return((callstackid) node); // deepest node of the callstack ("y", etc).
  } // CallstackManager::add

  ...So we start at the top of the tree, see if (say) "main" is in there. Yes?
  Move to main's children nodes and look at the next element of the callstack.
  Repeat while each element matches a node. If the whole stack matches an
  existing branch, we hand back the same ID as before; if you hit an element
  that doesn't match, this is a new callstack, so create all the missing
  child nodes.

  When done, return a pointer to the bottom node.

  void CallstackManager::done_adding()
  {
      // no-op in this implementation
  } // CallstackManager::done_adding

  ...This implementation doesn't need any post-processing.

  size_t CallstackManager::framecount(callstackid id)
  {
      return(((CallstackNode *) id)->depth);
  } // CallstackManager::framecount

  ...We store the depth of a given node, so framecount() runs in O(1) time.

  void CallstackManager::get(callstackid id, PTR *ptrs)
  {
      CallstackNode *node = (CallstackNode *) id;
      while (node->parent) // stop before the placeholder root node.
      {
          *ptrs = node->ptr;
          ptrs++;
          node = node->parent;
      } // while
  } // CallstackManager::get

  ...get() runs in O(framesize) time; it just walks back up the tree from the
  node returned by add(), filling in the caller's array. Most stacks in a
  real-world program are going to be small...100 frames would be considered
  large in many cases, but this will still work with recursive functions that
  go thousands of frames deep.

  ...finally, the cleanup is simple, too:

  // CallstackManager automatically deletes the root node in its default
  //  destructor, which causes the whole tree to collapse.
  CallstackNode::~CallstackNode()
  {
      CallstackNode *kid = this->children;
      while (kid)
      {
          CallstackNode *next = kid->sibling;
          delete kid;
          kid = next;
      } // while
  } // CallstackNode destructor

  That's it.

  The drawbacks of this method:

  - Each node takes some amount of memory. There may be other approaches that
  can keep a callstack as an array of pointers and use less memory overall.
  You could eat more CPU and beat up on the cache more by removing the "depth"
  field, and maybe "sibling".
  - All these node creations cause a lot of initial malloc pressure, and
  possibly a lot of memory fragmentation. After the initial flurry, the memory
  allocations should drop off as previously-seen nodes start showing up.

  Possible improvements:

  - Might be able to get a win by encoding recursive function calls into a
  single node. Not sure if the added complexity is worth the effort in a
  Real World program's flow.
  - We could take the time to sort the siblings into a sane order, and
  then implement a faster search. This might require keeping siblings as an
  array that is realloc()'d. Alternately, we could just move the
  most-recently matching node in a given set of children to the front of the
  list, since it's likely to hit multiple times in a row. The least-used
  nodes will float to the back of the list.
  - Could possibly doubly-link nodes, so if you have two largely similar
  branches with different parents, they can exist in multiple places in the
  tree, and point to each other. This makes destruction harder, and initial
  adding would need a means to find duplicates. For extremely diverse data
  sets, the added effort could be a considerable win, but again, I suspect
  that the average usage patterns render this effort basically worthless.
  Tradeoffs, tradeoffs, tradeoffs...


When this .plan was written: 2004-08-06 11:43:01
.plan archives for this user are here (RSS here).
Powered by IcculusFinger v2.1.27
Stick it in the camel and go.