So here was my solution. I used C++, which was pretty daring for me, but in this case, it helped simplify the expression of my solution sufficiently. I haven't debugged this yet, but it does compile. Ok, here are some basic assumptions about our data set. All of this is about tradeoffs and optimizing for the Most Likely Case, after all: - Most of your callstacks are going to be the same. Lots of your program time is generally spent in a few places. Lots of callstacks are going to be complete duplicates in almost any practical sampling of a program, so storing them multiple times is wasteful. Obviously we're going to want to figure out if a callstack has been seen before and feed back a previous ID instead of building a new one. - Most of your callstacks are going to share a common ancestry...i.e. - they'll probably all have main() at the top (or one of a few thread entry points). After that, they probably have other functions in common. For example: void x(void) { y(); z(); } void main(void) { x(); } There you might have two callstacks: y -> x -> main and z -> x -> main The only difference is the call to y() or z(). The rest is duplication. Storing the duplicate parts is wasted space if we can share this data between callstack IDs. So we'll make each element of a callstack into a node in a tree: typedef void *PTR; // "PTR *" is less confusing than "void **". class CallstackNode { public: CallstackNode(PTR _ptr=0, CallstackNode *p=NULL, size_t d=0) : ptr(_ptr), depth(d), parent(p), children(NULL), sibling(NULL) {} ~CallstackNode(); PTR ptr; size_t depth; CallstackNode *parent; CallstackNode *children; CallstackNode *sibling; }; // CallstackNode ... Some of these elements eat a little more memory, but save us some CPU time later. For example, "depth" is how far down the tree a node is, which we could get from walking back up the tree, but since this information is convenient at the time of creation, we save it so framecount() is fast. Again, tradeoffs. Now, my required API is wrapped in a C++ class. This is mainly so I can use the destructor to automatically delete the tree when we're done, but that was just a question of convenience; this could be done with a global variable just as easily. class CallstackManager { public: typedef CallstackNode *callstackid; // Consider this opaque. callstackid add(PTR *ptrs, size_t framecount); void done_adding(); size_t framecount(callstackid id); void get(callstackid id, PTR *ptrs); protected: CallstackNode root; }; // CallstackManager You'll notice that "callstackid", our opaque descriptor for a given callstack, is just a CallstackNode. This will be the deepest leaf of the tree (the "y" in "y -> x -> main"). The implementation... CallstackManager::callstackid CallstackManager::add(PTR *ptrs, size_t framecount) { CallstackNode *parent = &this->root; // top of tree. CallstackNode *node = parent->children; // root node is placeholder. size_t origframecount = framecount; // assume everything is coming from main(), so start from the back so // we put it at the top of the tree. This will result in less dupes, // as nodes that have common ancestry will share common nodes. ptrs += framecount; while ((node != NULL) && (framecount)) { if (node->ptr != *ptrs) // non-matching node; check siblings. node = node->sibling; else // matches, check next level... { ptrs--; framecount--; parent = node; node = node->children; } // else } // while // (framecount == 0) here means a complete match with existing branch. while (framecount) // build any missing nodes... { node = new CallstackNode(*ptrs, parent, origframecount - framecount); node->sibling = parent->children; parent->children = node; parent = node; ptrs--; framecount--; } // if return((callstackid) node); // bottom of the callstack ("main", etc). } // CallstackManager::add ...So we start at the top of the tree, see if (say) "main" is in there. Yes? Move to main's children nodes and look at the next element of the callstack. Repeat while each element matches a node. If you hit one that doesn't, this is a new callstack, so create all the missing child nodes. When done, return a pointer to the bottom node. void CallstackManager::done_adding() { // no-op in this implementation } // CallstackManager::done_adding ...This implementation doesn't need any post-processing. size_t CallstackManager::framecount(callstackid id) { return(((CallstackNode *) id)->depth); } // CallstackManager::framecount ...We store the depth of a given node, so framecount() runs in O(1) time. void CallstackManager::get(callstackid id, PTR *ptrs) { CallstackNode *node = (CallstackNode *) id; while (node) { *ptrs = node->ptr; ptrs++; node = node->parent; } // while } // CallstackManager::get ...get() runs in O(framesize) time; it just walks back up the tree from the node returned by add(), filling in the caller's array. Most stacks in a real-world program are going to be small...100 frames would be considered large in many cases, but this will still work with recursive functions that go thousands of frames deep. ...finally, the cleanup is simple, too: // CallstackManager automatically deletes the root node in its default // destructor, which causes the whole tree to collapse. CallstackNode::~CallstackNode() { CallstackNode *kid = this->children; while (kid) { CallstackNode *next = kid->sibling; delete kid; kid = next; } // while } // CallstackNode destructor That's it. The drawbacks of this method: - Each node takes some amount of memory. There may be other approaches that can keep a callstack as an array of pointers and use less memory overall. You could eat more CPU and beat up on the cache more by removing the "depth" field, and maybe "sibling". - All these node creations cause a lot of initial malloc pressure, and possibly a lot of memory fragmentation. After the initial flurry, the memory allocations should drop off as previous-seen nodes start showing up. Possible improvements: - Might be able to get a win by encoding recursive function calls into a single node. Not sure if the added complexity is worth the effort in a Real World program's flow. - We could take the time to sort the siblings into a sane order, and then implement a faster search. This might require keeping siblings as an array that is realloc()'d. Alternately, we could just move the most-recently matching node in a given set of children to the front of the list, since it's likely to hit multiple times in a row. The least-used nodes will float to the back of the list. - Could possibly doubly-link nodes, so if you have two largely similiar branches with different parents, they can exist in multiple places in the tree, and point to each other. This makes destruction harder, and initial adding would need a means to find duplicates. For extremely diverse data sets, the added effort could be a considerable win, but again, I suspect that the average usage patterns render this effort basically worthless. Tradeoffs, tradeoffs, tradeoffs... --ryan.