[manticore] 3d graphics multiprocessor

Jason Watkins jason_watkins at pobox.com
Wed Jun 19 02:41:23 EDT 2002


Yes, adding a conventional embedded processor for vertex workloads has been
done many times, and it works. I think the first one I heard of was Intergraph
putting Alphas on their high-end board set. Pretty funny that the embedded
CPU in such systems was faster than the host machine's x86. I definitely
think this could be a good way for Manticore to get functionality and decent
performance quickly.

What I'm thinking of goes a bit farther than that. What I'm talking about is
doing away with the conventional pipeline and reimplementing it in parallel
code on a multiprocessor (multiple cores on one die).
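
To make that concrete, here is a minimal sketch in C of the kind of
fixed-function stages that would become plain software. The names and the
stages I've picked are purely illustrative, not anything from Manticore;
the point is just that "transform" and "setup" become code that can be
spread across cores instead of being etched into a fixed-function unit:

    #include <stdio.h>

    typedef struct { float x, y, z, w; } Vec4;

    /* Transform stage: 4x4 matrix (column-major) times a vertex position. */
    static Vec4 transform(const float m[16], Vec4 p)
    {
        Vec4 r;
        r.x = m[0]*p.x + m[4]*p.y + m[8]*p.z  + m[12]*p.w;
        r.y = m[1]*p.x + m[5]*p.y + m[9]*p.z  + m[13]*p.w;
        r.z = m[2]*p.x + m[6]*p.y + m[10]*p.z + m[14]*p.w;
        r.w = m[3]*p.x + m[7]*p.y + m[11]*p.z + m[15]*p.w;
        return r;
    }

    /* Setup stage: perspective divide and viewport mapping for one vertex.
     * Scan conversion and shading would be further stages of the same kind. */
    static Vec4 to_screen(Vec4 p, int width, int height)
    {
        Vec4 s = p;
        s.x = (p.x / p.w * 0.5f + 0.5f) * width;
        s.y = (p.y / p.w * 0.5f + 0.5f) * height;
        return s;
    }

    int main(void)
    {
        const float identity[16] = { 1,0,0,0, 0,1,0,0, 0,0,1,0, 0,0,0,1 };
        Vec4 v = { 0.5f, -0.5f, 0.0f, 1.0f };
        Vec4 s = to_screen(transform(identity, v), 640, 480);
        printf("screen position: %.1f, %.1f\n", s.x, s.y);
        return 0;
    }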

http://citeseer.nj.nec.com/owens00polygon.html is probably the closest thing
I've found in the literature, and for vertex workloads it is shown to have
great performance and staggering aggregate register bandwidth. However, its
stream-based memory model forces a large separation between operand
registers and memory, so fill rate for z-buffering or texture mapping is less
than impressive. I also think their particular software implementation could
be substantially improved.

What I'm thinking of is more a set of cores with a collection of common FIFOs,
a reasonably sized cache, and a programmable prefetch unit. The simplest
programming model is a single computational kernel that is vectored
across the FPUs. Temporary results like transformed vertices, set-up
triangles, and scan-converted spans or blocks would be stored in and consumed
from FIFOs. I'm envisioning blocks guarded by FIFO status, so that a
full FIFO stalls the section of the kernel that fills it until another block
of the kernel consumes an element, rather than attempting to page streams to
main memory. I imagine a partitioned register file to support the single-kernel
programming model while relaxing the number of ports required, and a
TTL instruction set for simplicity. Finally, there would be a similarly
partitioned cache, suitable for holding blocks of texture, z, and other such
buffers. Each core need not be particularly superscalar, though 2-way would
probably be an advantage. An IEEE 32-bit FPU design would suffice, ideally
with some vector extensions for operations on wide and byte-sized values to
substantially speed up shading operations and packing/unpacking.
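
As a rough software model of the FIFO-guarded blocks, here is how I picture
the guarding working. Everything in this sketch is invented for illustration
(the FIFO depth, the stand-in producer and consumer, the names); on the
imagined hardware the stall would be an interlock on FIFO status rather than
a branch, but the behavior is the same: a section of the kernel only advances
while its output FIFO has room, so a full FIFO stalls the producer instead of
spilling the stream to main memory.

    #include <stdio.h>

    #define FIFO_DEPTH 8

    typedef struct {
        float data[FIFO_DEPTH];
        int   head, tail, count;
    } Fifo;

    static int fifo_full(const Fifo *f)  { return f->count == FIFO_DEPTH; }
    static int fifo_empty(const Fifo *f) { return f->count == 0; }

    static void fifo_push(Fifo *f, float v)
    {
        f->data[f->tail] = v;
        f->tail = (f->tail + 1) % FIFO_DEPTH;
        f->count++;
    }

    static float fifo_pop(Fifo *f)
    {
        float v = f->data[f->head];
        f->head = (f->head + 1) % FIFO_DEPTH;
        f->count--;
        return v;
    }

    int main(void)
    {
        Fifo xformed = {0};          /* "transformed vertices" stream */
        int produced = 0, consumed = 0;

        while (consumed < 32) {
            /* Producer block: stand-in for the transform section of the
             * kernel.  It only runs while its output FIFO has room; a
             * full FIFO stalls it rather than paging the stream out. */
            if (produced < 32 && !fifo_full(&xformed))
                fifo_push(&xformed, (float)produced++);

            /* Consumer block: stand-in for setup/rasterization.  It runs
             * whenever its input FIFO has work waiting. */
            if (!fifo_empty(&xformed)) {
                float v = fifo_pop(&xformed);
                consumed++;
                printf("consumed vertex %g\n", v);
            }
        }
        return 0;
    }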

What's the point of all this? Although current hardware is amazingly fast,
even with current shader standards the functionality is very fixed. Current
polygon rates result in densities nearing 8-pixel triangles. Given another
cycle or two, the very idea of a polygon will start to lose its advantages.
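
To put rough numbers on that (all three inputs are guesses on my part, not
measurements):

    1024 x 768 pixels              ~  0.8M pixels per frame
    x ~3 average depth complexity  ~  2.4M rasterized pixels per frame
    / ~300k triangles per frame    ~  8 pixels per triangle

Push triangle counts up another generation or two without pushing resolution
and you're shading triangles that barely cover a pixel.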

Everyone seems to think the way to address these trends is to reinvent RIB as
a realtime format. While I don't disagree that a RenderMan-like format for
realtime input would be a step forward, I still think it falls short. Why
not truly free the pipeline by adopting a fully programmable system, rather
than programmable islands in a fixed data flow? Remember the days when all
games rendered in software? There was a huge variety of algorithms, since each
game was free to organize whatever computation was available in the way most
ideal for its situation. Think of all the computer vision, signal processing,
analysis, and visualization workloads that would benefit from such a chip
being available.

There are a staggering number of interesting algorithms in the literature
that, while definitely appealing for particular situations, are incompatible
with current hardware, limiting them to exercises in theory. Worse yet,
there are far more that would be easily realizable with a single
limited addition to the pipeline, but it would be increasingly complex to
attempt to add more than a single extension. What I want is a chip designed
to run programs that fit this bandwidth pattern, and that doesn't carry the
overhead that comes from the full generality and ILP extraction that burdens
system CPUs. I want a media processor.

The graphics card is really the only place such a chip might gain a foothold.
And it might be a good way for a small hobby group to transform as much of a
hardware problem as they can into a software problem with enough geek appeal
to muster a sizable open-source effort.

But in the end it's just a daydream, and I have the feeling that it seems
so appealing on paper that there must be some flaw, some reason no
one has explored it before. So that's why I'm hoping you guys can punch
holes in it, to save me from spending more and more time considering it :P

jason watkins




