VQuake

From Vogons Wiki

Revision as of 09:29, 24 February 2013

VQuake is a version of Quake written specifically for the Rendition Verite V1000 accelerator. It works quite differently than GLQuake because the V1000 has unique strengths and weaknesses. It is based upon the Quake software renderer.


Architecture

A Usenet posting by former Rendition programmer Stefan Podell describes in detail how VQuake works and why it is not optimal for the V2x00.

Posted by Stefan Podell on October 30, 1997 at 04:34:56:

Okay, with my reputation tarnished by VQuake and VHexen2
not going any faster on the V2x00, I'm here to explain 
why this is the case.

First, the VQuake/VHexen2 engine can be roughly broken 
down into four parts (actually, most games are like 
this):

* Software setup of geometry commands
* Software creation of texture maps
* Transfer of textures and commands to the renderer
* Rendering

The first two parts are totally CPU dependent, the 
third is bus dependent, and the fourth, in my case,
is Verite dependent.

Now, some history on the design of VQuake.

When we first started working with Id on accelerated
Quake, Id's design was much like their design for Quake
2: that there would be a "driver" part of the code,
somewhat separated from the "main" game. (In Quake's
case, though, since it is a DOS game, this would be
accomplished with different executables rather than the
more elegant DLL model employed by Quake 2.) 

They had two 3D engine paths through the game. One was 
a traditional triangle/polygon based engine, which is
what they predicted all 3D accelerators would use, and
one was a fairly elaborate "span sorting" scheme which
the software renderer used.

So we set out accelerating the game using the polygon
interface. It looked great with filtering, but it
wasn't terribly fast. Even after doing polygon sorting
on the world polygons and turning off the Z buffer for
those, the performance was not great. (You must 
remember that the V1000 wasn't really designed to do
Z-buffering as its primary rendering style.) So Walt
Donovan, then with Rendition, and Michael Abrash, then
with Id, talked about using the software engine's span 
interface. 

For those who are interested, there have
been a few articles published by Michael on the guts
of the engine. I think Dr. Dobb's Journal had the best,
most detailed one. To the best of my understanding,
the engine sorts all world surfaces (floors, walls,
ceilings, and sky) on each scanline of the screen, 
and keeps track of the edges of each span. When it
goes to draw a particular surface (polygon), it is
then guaranteed that every pixel it is drawing is the
only world pixel that will be drawn at those 
coordinates. (This is hard to explain...) Suffice to
say that when the world surfaces are drawn, exactly
the minimum possible number of pixels is drawn (i.e.,
a depth complexity of 1). Meanwhile, the Z buffer
has been filled with the correct Z values, but Z
comparison is not necessary, since we know the depth
complexity is one. Next, when we start drawing objects
(monsters, weapons, etc), we turn on the Z comparison
so that the objects are properly hidden by the world.
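
The idea of resolving each scanline down to a depth complexity of 1 can be sketched as follows. This is a simplified interval sweep and not Quake's actual edge-sorting code: it assumes each surface has a single constant depth across its span, and the names `SurfaceSpan` and `resolve_scanline` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SurfaceSpan:
    surface_id: str
    x_start: int   # first pixel covered (inclusive)
    x_end: int     # one past the last pixel covered
    depth: float   # smaller is nearer; assumed constant across the span


def resolve_scanline(candidates):
    """Return non-overlapping (surface_id, x_start, x_end) spans for one scanline."""
    # The nearest surface can only change where some span starts or ends.
    edges = sorted({x for s in candidates for x in (s.x_start, s.x_end)})
    visible = []
    for left, right in zip(edges, edges[1:]):
        covering = [s for s in candidates if s.x_start <= left and s.x_end >= right]
        if not covering:
            continue  # nothing covers this stretch of the scanline
        nearest = min(covering, key=lambda s: s.depth)
        if visible and visible[-1][0] == nearest.surface_id and visible[-1][2] == left:
            # Same surface continues: extend the previous span.
            visible[-1] = (nearest.surface_id, visible[-1][1], right)
        else:
            visible.append((nearest.surface_id, left, right))
    return visible


scanline = [
    SurfaceSpan("wall", 0, 100, 10.0),
    SurfaceSpan("floor", 50, 200, 5.0),
    SurfaceSpan("sky", 0, 320, 1000.0),
]
print(resolve_scanline(scanline))
```

In this example the three candidate spans cover 570 pixels between them, but only 320 pixels are emitted, one per screen position, so no Z comparison is needed while drawing them.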

The reasoning behind this was that the Pentium was
pretty slow at drawing pixels, but fast at floating
point operations. By doing more of what the Pentium
was good at, Quake was able to do less of what the
Pentium was bad at. Overall, a performance win, with
the side benefit of allowing much more interesting 
scenes.

((((whew))))

Walt and Michael decided that since the Verite 1000
wasn't terribly good at Z-buffered pixels, that if
we let the Pentium take care of this span sorting, we
could reduce the number of pixels the Verite would 
draw. Furthermore, we'd be able to turn off the Z
compare function on the Verite. So now, rather
than having a depth complexity of around 1.5 
(about 450000 pixels at 640x480) that draw
at the peak rate of about 10MP/s on the V1000, we
had a depth complexity of 1 (300000 pixels) that 
draw at a peak rate of about 17MP/s.
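
As a rough worked example of the numbers above, the best-case time to fill the world pixels for one 640x480 frame can be computed directly. The function name `fill_time_ms` is hypothetical, and the result is only a lower bound, since the quoted rates are peak rates.

```python
def fill_time_ms(width, height, depth_complexity, megapixels_per_second):
    """Best-case milliseconds to draw one frame's world pixels at a peak fill rate."""
    pixels = width * height * depth_complexity
    return pixels / (megapixels_per_second * 1_000_000) * 1000.0


z_buffered = fill_time_ms(640, 480, 1.5, 10)   # polygon path, Z compare on
span_sorted = fill_time_ms(640, 480, 1.0, 17)  # span path, Z compare off
print(f"{z_buffered:.1f} ms vs {span_sorted:.1f} ms per frame")
```

That works out to roughly 46 ms against 18 ms for the world geometry alone, before any objects are drawn.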

As with the software renderer, we could then turn on
the Z compare and draw all the more interesting 
objects. Yes, these pixels would be as slow as always,
but there are generally far fewer of them.

So we set out writing new microcode to support Quake's
span data format. When we finally got this working,
sure enough, the performance was way better than the
original polygon-style engine.

Back to my four-part description of Quake's engine, we
traded more CPU work in stage one for less Verite work
in stage four, which ended up getting us a big win.

Note that when you increase the vertical resolution in
the game, the engine must sort more scanlines. And if
you increase the resolution in either direction, the
renderer must draw more pixels.

Also note that when Quake was written, P133's were 
pretty top-notch, and the software frame rates were
low enough that the span sorting was adequate to
keep up. With the Verite rendering much faster than
the Pentium could, suddenly the span sorting was the
bottleneck. It wasn't until we got to P200's that 
the Verite was busy most of the time.

So whether you're on a V1000 or V2x00, the CPU has
a lot of work to do.

Okay, that's part one.

Part two is texture maps. (This will be much shorter.)
The way the software renderer in Quake works is to take
a small texture tile (like a couple of bricks) and
duplicate it into a larger texture map for the world
surface while it is applying the light map (dynamic or
static). It then caches that texture map and draws 
with it. When it needs to draw that surface again, it
checks to see if the lights have changed (like when 
you fire a weapon). If they have, it must regenerate
the texture map and recache it.
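
A minimal sketch of this build-and-cache scheme, under simplifying assumptions: one light value (0-255) per texel, and a hypothetical `light_version` counter that changes whenever the lights touching a surface change. None of these names come from the Quake source.

```python
def build_surface(tile, lightmap):
    """Repeat a small tile across the surface and scale each texel by its light value."""
    tile_h, tile_w = len(tile), len(tile[0])
    return [
        [tile[y % tile_h][x % tile_w] * light // 255 for x, light in enumerate(row)]
        for y, row in enumerate(lightmap)
    ]


class SurfaceCache:
    def __init__(self):
        self._entries = {}   # surface_id -> (light_version, lit texture)
        self.rebuilds = 0    # how often the CPU had to regenerate a texture

    def get(self, surface_id, tile, lightmap, light_version):
        cached = self._entries.get(surface_id)
        if cached is not None and cached[0] == light_version:
            return cached[1]  # lights unchanged: reuse the cached texture
        lit = build_surface(tile, lightmap)
        self._entries[surface_id] = (light_version, lit)
        self.rebuilds += 1
        return lit


cache = SurfaceCache()
bricks = [[100, 200]]
lights = [[255, 255, 128, 128]]
cache.get("wall", bricks, lights, light_version=1)
cache.get("wall", bricks, lights, light_version=1)   # cache hit
cache.get("wall", bricks, lights, light_version=2)   # lights changed: rebuild
print(cache.rebuilds)
```

The rebuild counter is the part that matters for performance: every rebuild is CPU work, regardless of how fast the renderer is.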

The VQuake engine does the same thing, with the extra
step of having to download the texture map to video 
memory.

Quake also mipmaps these surfaces. The mipmap level
is chosen based on the size of the polygon (in pixels)
relative to the size of the texture map.

In VQuake, the texture cache is kept in video memory
along with the display buffers and Z buffer. The
quick equation for how much memory the display and Z
buffers take is Width * Height * 6 (3 buffers, each
16-bits deep). The rest is for texture maps (minus
about 128K for microcode).
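
The quick equation can be turned into a small calculation. The 4 MB board size used here is an assumption for illustration, and `texture_bytes` is a hypothetical helper name.

```python
MICROCODE_BYTES = 128 * 1024   # "about 128K" for microcode


def texture_bytes(total_video_bytes, width, height):
    """Video memory left for the texture cache after display/Z buffers and microcode."""
    buffer_bytes = width * height * 6   # 3 buffers, each 16 bits (2 bytes) per pixel
    return total_video_bytes - buffer_bytes - MICROCODE_BYTES


BOARD_BYTES = 4 * 1024 * 1024   # assumption: a 4 MB board
for width, height in [(320, 200), (512, 384), (640, 480)]:
    free_kb = texture_bytes(BOARD_BYTES, width, height) / 1024
    print(f"{width}x{height}: {free_kb:.0f} KB for textures")
```

Under that assumption, going from 320x200 to 640x480 cuts the texture cache from about 3.5 MB to about 2.1 MB.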

So when you increase the resolution, two things 
happen that increase the demands on the CPU for
texture map generation. First, you have less texture
memory, so textures will fall out of the cache more
often, requiring regeneration. Second, higher 
resolution mipmaps will be chosen, further straining
the texture cache.

The assembly code for generating the textures is
darn near as good as it can be. I certainly can't
think of any instructions to remove, and Michael
Abrash, who wrote it, is a genius at this stuff.

We considered doing two pass lighting on the Verite,
but after some experiments decided the CPU could do
it faster.

So again, no matter the Verite chip, the CPU will be
very busy. (The texture mapping, by the way, is
primary reason that timedemo works so much better as
a real benchmark than timerefresh. In the demo
sequences, there's lots of combat going on, which
pushes the system much harder.)

Alright. That's part two.

Part three is the bus. As you know, the currently
available Verites use the PCI bus, and are able to
use DMA asynchronously, which does not use the 
CPU. The bus activity will steal some cycles from
the CPU, but not an appreciable amount.

Simple enough.

Finally, the renderer. The V2x00 chips are *much*
faster at drawing than the V1000. The fastest the
V1000 could go was 25MP/s. The V2100 goes 40MP/s 
and the V2200 goes 50MP/s. Adding features (Z,
alpha, fog, etc.) would slow the V1000 down a lot,
while having minimal impact on the V2x00 chips.

So I understand why people were expecting VQuake
and VHexen2 to go much faster on the V2x00. But
the fact of the matter is that a faster renderer
doesn't necessarily buy you much given the
architecture of this engine. And because of that,
we're working on a V2x00-specific version of the
engine, to take advantage of the extra pixel 
power, while lightening the load on the CPU.

I must admit that I was beginning to wonder if
my beliefs about the engine's behavior were 
really true. So I just did a weird hack on
VHexen2 to test something:

I put in a check at the time drawing commands
and texture maps are sent to the Verite to see
if the game was in "timedemo mode". If it was,
I just threw away the commands and continued.
Then, when timedemo was over, drawing would
kick back in and I would see the results. The
purpose of this was to simulate an *infinitely
fast* renderer and bus (something we'd all
like to have :-) (This is also known as a 
"speed of light" test.)

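The shape of such a hack can be sketched as a wrapper around the point where work is handed to the hardware. Everything here is hypothetical (the `CommandSink` class, its flags, and the `send` callback); it only illustrates the idea of discarding work during a timed run so that CPU-side cost is all that remains.

```python
class CommandSink:
    """Forwards drawing commands and textures, or discards them during a timedemo."""

    def __init__(self, send, drop_during_timedemo=False):
        self._send = send                        # callable that really talks to the card
        self.drop_during_timedemo = drop_during_timedemo
        self.in_timedemo = False
        self.dropped = 0

    def submit(self, payload):
        if self.in_timedemo and self.drop_during_timedemo:
            self.dropped += 1                    # pretend the renderer and bus are free
            return
        self._send(payload)


sent = []
sink = CommandSink(sent.append, drop_during_timedemo=True)
sink.submit("spans for frame 1")
sink.in_timedemo = True
sink.submit("spans for frame 2")    # discarded: nothing reaches the card
sink.in_timedemo = False
sink.submit("spans for frame 3")
print(sent, sink.dropped)
```

Because geometry setup and texture generation still run in full, the frame rate measured this way is the ceiling the CPU imposes on any renderer.
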
==============================================
Here are the results of running timedemo demo1
on my current VHexen2 build (beta 3 candidate).
I ran all the tests at 320x200, 512x384, and
640x480, with antialiasing set to 0 and to 7
at all resolutions.

The first three tests are with rendering turned
on, in other words, the numbers everyone has
been responding to so far. The last set of
numbers is my "infinitely fast" renderer.

(fps are antialias = 0/antialias = 7)

The first test was on my P166MMX (64MB RAM) 
with my V1000 reference board, which runs at 
the same speed as an Intergraph 3D 100.
(These seem slower to me than what I 
remember, but I can't find where I wrote 
down my old numbers, so I just re-ran
these.)
=====================
320x200 (51.0 / 43.1)
512x384 (26.7 / 24.2)
640x480 (16.9 / 15.5)
=====================

Next, the same PC with a V2200. The first
thing you'll notice is that it's a little
slower at low resolutions. I think this is
because the span microcode has a little extra
overhead on the 2200. It also doesn't
currently interleave buffers. I'm looking 
into fixing those things.
=====================
320x200 (48.1 / 41.9)
512x384 (32.9 / 28.5)
640x480 (22.0 / 20.1)
=====================

Next, a V2200 in a P2-300 (this computer has
a pretty sucky hard drive in it and 32MB RAM,
so there was more swapping going on than
on my 166MMX)
=====================
320x200 (71.0 / 63.4)
512x384 (43.3 / 36.1)
640x480 (27.9 / 23.5)
=====================

And finally, my P166MMX with my "infinitely
fast" renderer/bus test
=====================
320x200 (56.0 / 48.2)
512x384 (44.1 / 37.8)
640x480 (32.4 / 29.7)
=====================

So you can see that at low resolutions, a 
faster CPU gets you much more by way of 
performance than a faster renderer. And as
resolution increases, infinitely fast 
rendering is a good thing :-)
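
The conclusion can be checked against the tables with a few ratios. This sketch uses only the antialias = 0 figures listed above; the dictionary keys and the `speedup` helper are hypothetical names.

```python
FPS = {
    "P166MMX + V1000":    {"320x200": 51.0, "512x384": 26.7, "640x480": 16.9},
    "P166MMX + V2200":    {"320x200": 48.1, "512x384": 32.9, "640x480": 22.0},
    "P2-300 + V2200":     {"320x200": 71.0, "512x384": 43.3, "640x480": 27.9},
    "P166MMX + infinite": {"320x200": 56.0, "512x384": 44.1, "640x480": 32.4},
}


def speedup(config, baseline, resolution):
    """Frame-rate ratio of one configuration over another at a given resolution."""
    return FPS[config][resolution] / FPS[baseline][resolution]


for res in ("320x200", "640x480"):
    cpu = speedup("P2-300 + V2200", "P166MMX + V2200", res)
    renderer = speedup("P166MMX + infinite", "P166MMX + V2200", res)
    print(f"{res}: faster CPU x{cpu:.2f}, infinitely fast renderer x{renderer:.2f}")
```

Starting from the P166MMX with a V2200, the faster CPU gives about 1.48x at 320x200 where even an infinitely fast renderer would give only about 1.16x; at 640x480 the balance reverses, to about 1.27x and 1.47x.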

Again, all this adds up to us working on 
a V2x00 version of this engine. We'll tell
you more as we know more about its progress.

This is not to say that there's no room for
improvement in my current version of the 
game. So if I figure out any cool way to make
this go faster on a V2x00, I'll definitely
put it in. There is one thing Quake 2
does that I want to try in VHexen2. I'll 
let you know.

Regards,
Stefan (maybe I should use a .plan file) Podell
