So here was my solution. I used C++, which was pretty daring for me, but in
 this case, it helped simplify the expression of my solution sufficiently.
 I haven't debugged this yet, but it does compile.

Ok, here are some basic assumptions about our data set. All of this is about
 tradeoffs and optimizing for the Most Likely Case, after all:

 - Most of your callstacks are going to be the same. Lots of your program
   time is generally spent in a few places. Lots of callstacks are going to be
   complete duplicates in almost any practical sampling of a program, so
   storing them multiple times is wasteful. Obviously we're going to want to
   figure out if a callstack has been seen before and feed back a previous
   ID instead of building a new one.

 - Most of your callstacks are going to share a common ancestry...i.e. -
   they'll probably all have main() at the top (or one of a few thread entry
   points). After that, they probably have other functions in common.

   For example:
     void x(void) { y(); z(); }
     void main(void) { x(); }

   There you might have two callstacks:
      y -> x -> main
         and
      z -> x -> main

   The only difference is the call to y() or z(). The rest is duplication.
   Storing the duplicate parts is wasted space if we can share this data
   between callstack IDs.


So we'll make each element of a callstack into a node in a tree:

typedef void *PTR;  // "PTR *" is less confusing than "void **".

class CallstackNode
{
public:
    CallstackNode(PTR _ptr=0, CallstackNode *p=NULL, size_t d=0)
        : ptr(_ptr), depth(d), parent(p), children(NULL), sibling(NULL) {}
    ~CallstackNode();
    PTR ptr;
    size_t depth;
    CallstackNode *parent;
    CallstackNode *children;
    CallstackNode *sibling;
}; // CallstackNode


... Some of these elements eat a little more memory, but save us some CPU
time later. For example, "depth" is how far down the tree a node is, which
we could get from walking back up the tree, but since this information is
convenient at the time of creation, we save it so framecount() is fast. Again,
tradeoffs.

Now, my required API is wrapped in a C++ class. This is mainly so I can use
the destructor to automatically delete the tree when we're done, but that was
just a question of convenience; this could be done with a global variable
just as easily.

class CallstackManager
{
public:
    typedef CallstackNode *callstackid;  // Consider this opaque.

    callstackid add(PTR *ptrs, size_t framecount);
    void done_adding();
    size_t framecount(callstackid id);
    void get(callstackid id, PTR *ptrs);

protected:
    CallstackNode root;
}; // CallstackManager


You'll notice that "callstackid", our opaque descriptor for a given callstack,
is just a CallstackNode. This will be the deepest leaf of the tree (the
"y" in "y -> x -> main").


The implementation...

CallstackManager::callstackid
CallstackManager::add(PTR *ptrs, size_t framecount)
{
    CallstackNode *parent = &this->root;  // top of tree.
    CallstackNode *node = parent->children;  // root node is placeholder.
    size_t origframecount = framecount;

    // assume everything is coming from main(), so start from the back so
    //  we put it at the top of the tree. This will result in less dupes,
    //  as nodes that have common ancestry will share common nodes.
    ptrs += framecount;

    while ((node != NULL) && (framecount))
    {
        if (node->ptr != *ptrs)  // non-matching node; check siblings.
            node = node->sibling;
        else  // matches, check next level...
        {
            ptrs--;
            framecount--;
            parent = node;
            node = node->children;
        } // else
    } // while

    // (framecount == 0) here means a complete match with existing branch.

    while (framecount)  // build any missing nodes...
    {
        node = new CallstackNode(*ptrs, parent, origframecount - framecount);
        node->sibling = parent->children;
        parent->children = node;
        parent = node;
        ptrs--;
        framecount--;
    } // if

    return((callstackid) node);   // bottom of the callstack ("main", etc).
} // CallstackManager::add


...So we start at the top of the tree, see if (say) "main" is in there. Yes?
Move to main's children nodes and look at the next element of the callstack.
Repeat while each element matches a node. If you hit one that doesn't, this
is a new callstack, so create all the missing child nodes.

When done, return a pointer to the bottom node.


void CallstackManager::done_adding()
{
    // no-op in this implementation
} // CallstackManager::done_adding


...This implementation doesn't need any post-processing.


size_t CallstackManager::framecount(callstackid id)
{
    return(((CallstackNode *) id)->depth);
} // CallstackManager::framecount


...We store the depth of a given node, so framecount() runs in O(1) time.


void CallstackManager::get(callstackid id, PTR *ptrs)
{
    CallstackNode *node = (CallstackNode *) id;
    while (node)
    {
        *ptrs = node->ptr;
        ptrs++;
        node = node->parent;
    } // while
} // CallstackManager::get


...get() runs in O(framesize) time; it just walks back up the tree from the
node returned by add(), filling in the caller's array. Most stacks in a
real-world program are going to be small...100 frames would be considered
large in many cases, but this will still work with recursive functions that
go thousands of frames deep.

...finally, the cleanup is simple, too:

// CallstackManager automatically deletes the root node in its default
//  destructor, which causes the whole tree to collapse.
CallstackNode::~CallstackNode()
{
    CallstackNode *kid = this->children;
    while (kid)
    {
        CallstackNode *next = kid->sibling;
        delete kid;
        kid = next;
    } // while
} // CallstackNode destructor


That's it.

The drawbacks of this method:

- Each node takes some amount of memory. There may be other approaches that
  can keep a callstack as an array of pointers and use less memory overall.
  You could eat more CPU and beat up on the cache more by removing the "depth"
  field, and maybe "sibling".
- All these node creations cause a lot of initial malloc pressure, and
  possibly a lot of memory fragmentation. After the initial flurry, the memory
  allocations should drop off as previous-seen nodes start showing up.

Possible improvements:

- Might be able to get a win by encoding recursive function calls into a
  single node. Not sure if the added complexity is worth the effort in a
  Real World program's flow.
- We could take the time to sort the siblings into a sane order, and
  then implement a faster search. This might require keeping siblings as an
  array that is realloc()'d. Alternately, we could just move the most-recently
  matching node in a given set of children to the front of the list, since
  it's likely to hit multiple times in a row. The least-used nodes will
  float to the back of the list.
- Could possibly doubly-link nodes, so if you have two largely similiar
  branches with different parents, they can exist in multiple places in the
  tree, and point to each other. This makes destruction harder, and initial
  adding would need a means to find duplicates. For extremely diverse data
  sets, the added effort could be a considerable win, but again, I suspect
  that the average usage patterns render this effort basically worthless.
  Tradeoffs, tradeoffs, tradeoffs...

--ryan.