Jaguar programming tips and tricks

Posted: Tue Feb 18, 2014 4:57 am
by a31chris
Some Jaguar programming tips and tricks discovered through the years. So that these things are easy to find and never lost, I am creating this sticky, where I will gather them as I find them or as they are pointed out.

This first one may be the solution to the quoted problem:
While the 68k has the bus, the DSP and GPU can apparently run their own code, but they can not touch any of the other hardware, even on the same chip. (For instance, if the GPU wants to talk to the line buffers or even the video registers, even though they are on the same chip, it needs to acquire the bus --
And here we go...
Kskunk wrote:
Tursi wrote: Even though they have higher bus priority and can steal the bus away from the 68k, they still lose a cycle doing so and it's noticeable in throughput.
It's usually several cycles. You can't interrupt a 68K bus cycle in progress. (This is a 68K limitation.) The 68K bus cycle takes 8 Jag RISC cycles. So depending on your luck you could wait anywhere from 1 to 8 cycles. It's almost inevitable that the 68K has changed the current DRAM page, which makes it 6 to 13 cycles round trip.

This works out fine for simple 2D games, since the OP wants to lock up the bus while it works (the overhead is minimal) and if you're using the blitter, it's probably being activated at the end of a 68K bus cycle.

But once you're using the GPU and DSP, the 68K is a pretty bad idea. The DSP is also pretty difficult to use for all the same reasons. It locks up the bus for 6 cycles on reads, 12 for writes (because of the workaround for the write bug). It's better than the 68K because it can run code out of its SRAM, but it's often hard to find algorithms that do a ton of computation and almost no I/O. That's one reason the DSP is usually idle except for some sound synthesis.
Tursi wrote: While the 68k has the bus, the DSP and GPU can apparently run their own code, but they can not touch any of the other hardware, even on the same chip. (For instance, if the GPU wants to talk to the line buffers or even the video registers, even though they are on the same chip, it needs to acquire the bus -- at least according to my experimentation. Someone who can read the netlists may be able to confirm or deny that more accurately.)
Who do you know that can read netlists?

Both the GPU and DSP have a dedicated 'local bus' that can be accessed without using the main bus. On the GPU, the local bus contains SRAM, GPU control regs (for interrupts, divider, matrix, etc), the blitter, and line buffer writes (everything mapped from F02-F0F). This helps it set up polygons when scan converting without needing the bus.

The line buffer feature only works on 32-bit writes (no reads or 16-bit access), but apparently does not disturb the main bus. I'm far away from my Jaguar right now so I can't test if this really works. The netlists imply it was designed to work this way. This feature might be useful for some kind of special effect but I'm not creative enough to think of any.

On the DSP, the local bus contains SRAM, DSP control regs, math table ROM and some DSP peripherals (everything mapped from F12-F1F). This means the DSP is able to do CD access (at least the serial kind) and audio playback without touching the main bus. Off-chip stuff like joystick access obviously requires external bus cycles, but so do the UART and timers.

- KS
Emboldened areas added by moderator
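
The cycle counts Kskunk gives for stealing the bus from the 68K can be put into a small worked example. This is only a sketch of the arithmetic: the 5-cycle cost of the DRAM page change is not stated in the post, it is inferred from the difference between the "1 to 8" and "6 to 13" figures, the even spread of arrival times is my own assumption, and the function names are made up.

```c
/* Figures from the quoted post, in Jaguar RISC cycles. */
enum {
    M68K_BUS_CYCLE  = 8,  /* a 68K bus cycle cannot be interrupted */
    /* ASSUMPTION: inferred as 13 - 8 (or 6 - 1); the post does not
       state the DRAM page-change cost directly. */
    PAGE_MISS_EXTRA = 5
};

/* Round-trip cost of a GPU/DSP access that arrives while the 68K still
   has `remaining` cycles (1..8) left in its current bus cycle. */
unsigned round_trip_cycles(unsigned remaining, int page_changed)
{
    return remaining + (page_changed ? PAGE_MISS_EXTRA : 0);
}

/* Average over the eight arrival points, treated as equally likely
   (an assumption). Returned times ten to stay in integers, so 95
   means 9.5 cycles. */
unsigned average_round_trip_x10(int page_changed)
{
    unsigned total = 0;
    for (unsigned r = 1; r <= M68K_BUS_CYCLE; r++)
        total += round_trip_cycles(r, page_changed);
    return total * 10 / M68K_BUS_CYCLE;
}
```

With the page change included, the best and worst cases come out at 6 and 13 cycles, matching the post.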

The local bus Kskunk describes appears to be not so much something new as something perhaps overlooked by the two having the conversation above. From page 36 of the v8 Jaguar Tech ref manual:
To the GPU programmer the local RAM, local hardware registers, and external memory all appear in the same
address space. The GPU memory controller determines whether a transfer is local or external, and generates
the appropriate cycle. The only difference to the programmer is that only 32-bit transfers are possible within the
GPU local address space, whereas 8, 16, 32 or 64-bit transfers are permitted externally.
The local RAM sits on an internal GPU 32-bit bus. Also present on this bus are various GPU control registers,
and the Blitter control registers. When a GPU transfer occurs outside the local address space, a gateway
connects the local bus to the main bus. If a sixty-four bit transfer is requested, a special register is used for the
other half of the data.
The address space is organised as follows:
F02000 - F021FF graphics processor control registers
F02200 - F022FF Blitter registers
F02300 - F02FFF reserved
F03000 - F03FFF local RAM
F04000 - F0FFFF reserved
This local address space is also available to external devices via the I/O mechanism.
The GPU local bus can therefore perform transfers for three quite separate mechanisms. These are, in
decreasing order of priority:
- CPU I/O access
- Operand data transfer
- Instruction fetch
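
The address map from the manual can be turned into a quick decoder. This is a host-side sketch for reasoning about which accesses stay on the GPU local bus, not anything from Atari; the type and function names are made up, and the ranges are copied from the table above.

```c
#include <stdint.h>

typedef enum {
    GPU_CONTROL_REGS,  /* F02000 - F021FF */
    BLITTER_REGS,      /* F02200 - F022FF */
    GPU_LOCAL_RAM,     /* F03000 - F03FFF */
    GPU_RESERVED,      /* F02300 - F02FFF and F04000 - F0FFFF */
    EXTERNAL           /* everything else: goes through the gateway */
} gpu_region;

/* Classify an address the way the GPU memory controller is described
   as doing: local address space versus external memory. */
gpu_region classify_gpu_address(uint32_t addr)
{
    if (addr < 0xF02000u || addr > 0xF0FFFFu) return EXTERNAL;
    if (addr <= 0xF021FFu) return GPU_CONTROL_REGS;
    if (addr <= 0xF022FFu) return BLITTER_REGS;
    if (addr <= 0xF02FFFu) return GPU_RESERVED;
    if (addr <= 0xF03FFFu) return GPU_LOCAL_RAM;
    return GPU_RESERVED;
}

/* Only 32-bit transfers are possible inside the local address space;
   8, 16, 32 or 64-bit transfers are permitted externally. */
int transfer_width_allowed(uint32_t addr, unsigned bits)
{
    if (classify_gpu_address(addr) != EXTERNAL)
        return bits == 32;
    return bits == 8 || bits == 16 || bits == 32 || bits == 64;
}
```

For example, a 16-bit access to F03010 is rejected because that address is local RAM, while a 64-bit access to main RAM is fine.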

Display List Tricks

Posted: Tue Feb 18, 2014 5:00 am
by a31chris
Here is Super Burnout programmer Olivier Nallet's recounting of a Display List trick he used to get an incredible number of sprites on screen at once:
Because there was only 4 KB of memory on the GPU, I was hot-swapping portions of the assembly code, module by module, trying to reuse the cached memory as much as possible (something that is also commonly done on PS3 SPUs). Everything worked well together and I'm pretty sure I still had CPU cycles left even with more than 1000 sprites on the screen. By the way, I was displaying the 1000+ sprites on the Jag with a trick on the display list. The Jag was a killer in 2D (ok, at that time), but the only downside was that if the display list contained too many sprites, it actually ate into the bandwidth available for displaying the sprites, sometimes creating this wobbling effect on some lines (due to the fact the Jag didn't have a screen buffer but was displaying everything on the fly: a nice design to save expensive memory, but with some constraints).
So the way I resolved that was to actually use the branches of the display list. By having 3 levels of branches I was able to split the screen into 8 horizontal bands of 30 pixels or so.
Then I just had to fill each of the sub-display lists separately, so I had just 125+ sprites per display list, but 1000+ in total displayed on the screen. That way each horizontal line had almost full bandwidth to display its sprites. The nightmare was on the GPU side though: every single sprite had to be split into pieces to be placed in the proper horizontal band with the proper offset initialized. Sometimes sprites like bosses could be split over the whole 8 bands. Everything was displayed with that, even the multiple levels of tiles for the background.
Here is a link to the original post and thread courtesy of AtariAge. ... try1399401
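
The sprite-splitting step described in the quote can be sketched on the host side like this. It is a rough illustration and not the original code: the exact band height is an assumption ("30 pixels or so"), the 8 bands assume each of the 3 branch levels makes a two-way choice, and the struct and function names are hypothetical.

```c
#include <stddef.h>

enum {
    BAND_COUNT  = 8,   /* 3 levels of two-way branches: 2 * 2 * 2 */
    BAND_HEIGHT = 30   /* ASSUMPTION: the post says "30 pixels or so" */
};

typedef struct {
    int band;      /* which band's sub-display list gets this piece */
    int y;         /* first screen line covered by the piece */
    int height;    /* number of lines in the piece */
    int src_line;  /* line offset into the sprite's own data */
} sprite_piece;

/* Cut a sprite covering screen lines [y, y + height) into one piece per
   band it overlaps. Returns the number of pieces written to `out`. */
size_t split_sprite(int y, int height, sprite_piece out[BAND_COUNT])
{
    size_t n = 0;
    for (int band = 0; band < BAND_COUNT; band++) {
        int band_top    = band * BAND_HEIGHT;
        int band_bottom = band_top + BAND_HEIGHT;          /* exclusive */
        int top    = y > band_top ? y : band_top;
        int bottom = y + height < band_bottom ? y + height : band_bottom;
        if (top >= bottom)
            continue;                                      /* no overlap */
        out[n].band     = band;
        out[n].y        = top;
        out[n].height   = bottom - top;
        out[n].src_line = top - y;   /* the "proper offset" per piece */
        n++;
    }
    return n;
}
```

A 40-line sprite starting at line 25 ends up as three pieces (5, 30 and 5 lines), and a full-height boss sprite lands in all 8 bands.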

GPU in Main bug workaround rules

Posted: Tue Feb 18, 2014 9:37 am
by a31chris
Discovered, or perhaps rediscovered, around 2006 by AtariOwl and Gorf, here are the rules for working around the bug that stops the GPU from successfully running code out of main memory:

RISC in Main RAM rules:

Here they are in a nutshell.

A page is one block of 256 bytes.

All JUMP instructions must sit on an address ending in 0, 4, 8 or C hex.

All JUMP instructions that jump to an external page must jump to an address ending in 0, 4, 8 or C hex.

All JUMP instructions that jump within a page must jump to an address ending in 2, 6, A or E hex.

The JR instruction can sit anywhere.

The JR instruction follows the same destination rules as JUMP.

All JUMP or JR instructions must be followed by two NOPs, but certain instructions can be used in place of the first NOP.

Stay away from tight loops out in main RAM.
 I was just copying memory in a tight loop, one load, one store, and the gpu was slower running from main than the 68k.

As soon as I put multiple load/stores into the loop, it got faster.

But if you are writing something like a strcpy in C for the gpu, that's the kind of assembly the compiler will generate.

I'm still a proponent for gpu in main ram, but you have to be careful what you are doing, not everything running on the gpu is faster. -JagMod
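
To make the difference JagMod describes concrete, here are the two loop shapes written in C. This is only an illustration of the pattern: a C compiler is free to reorder or re-roll these, so on the real GPU the grouped version would most likely be hand-written assembly. The function names are made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Tight loop: one load and one store per iteration, the shape that was
   reported as slower than the 68k when run from main RAM. */
void copy_tight(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Unrolled loop: four loads grouped together, then four stores, so
   each pass through the loop does more work per branch. */
void copy_unrolled4(uint32_t *dst, const uint32_t *src, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        uint32_t a = src[i];
        uint32_t b = src[i + 1];
        uint32_t c = src[i + 2];
        uint32_t d = src[i + 3];
        dst[i]     = a;
        dst[i + 1] = b;
        dst[i + 2] = c;
        dst[i + 3] = d;
    }
    for (; i < n; i++)          /* leftover words when n is not a multiple of 4 */
        dst[i] = src[i];
}
```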
Since Owl already revealed it, I'll post once again the rule for main to local and local to main.

JUMP instructions only.....must sit on an address ending in 0 or 8 when going from local to main and main to local. -Gorf
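
The placement rules are easy to get wrong by hand, so here is a sketch of a checker that could be run on the host over the addresses in an assembled listing. It reflects my reading of the rules as posted in this thread, not an official tool: the function names are hypothetical, the two-NOP rule is not checked, and Gorf's local/main rule is read as "address ending in 0 or 8".

```c
#include <stdbool.h>
#include <stdint.h>

/* A page is one block of 256 bytes. */
static bool same_page(uint32_t a, uint32_t b)
{
    return (a >> 8) == (b >> 8);
}

/* JUMP must sit on an address ending in 0, 4, 8 or C hex.
   JR may sit anywhere. */
bool source_ok(uint32_t src, bool is_jr)
{
    return is_jr || (src & 3u) == 0;
}

/* Destination rules, shared by JUMP and JR:
   within the same page   -> address ending in 2, 6, A or E
   to an external page    -> address ending in 0, 4, 8 or C */
bool dest_ok(uint32_t src, uint32_t dst)
{
    if (same_page(src, dst))
        return (dst & 3u) == 2;
    return (dst & 3u) == 0;
}

bool main_ram_jump_ok(uint32_t src, uint32_t dst, bool is_jr)
{
    return source_ok(src, is_jr) && dest_ok(src, dst);
}

/* JUMP between local and main (either direction): must sit on an
   address ending in 0 or 8. */
bool crossing_jump_source_ok(uint32_t src)
{
    return (src & 7u) == 0;
}
```

For example, a JUMP at 4004 hex to 4026 hex passes (same page, destination ends in 6), while the same JUMP to 4102 hex fails because an external-page destination must end in 0, 4, 8 or C.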

Further reading: ... 20in%20Mai

Re: Jaguar programming tips and tricks

Posted: Fri Feb 28, 2014 8:56 pm
by a31chris
Parallelism on the Jaguar.
Gorf wrote: If you run the GPU in local only while the DSP runs in its local only and you run the 68k, you can run all three processors in parallel, but then you kill the bandwidth of the bus with the 68k running. If you do not run the 68k and let the blitter and the OPL use the main bus while the other two (GPU/DSP) run in local, you will get some pretty good results. However, if you run the GPU out in main using some careful and properly thought out interrupt processing, so that when the blitter and the OPL are running the GPU runs only from local (this would probably be when you are drawing the current frame and then setting up the next frame, using the blitter to move the next amount of frame info into the GPU local), you should be able to pull off some amazingly efficient processing. While the blitter and the OPL are being set up for the next frame, run the GPU out in main for AI and game logic.

As far as benchmarks, there are none for this particular method that I know of, other than a few tests which have shown that in a tight loop the GPU out in main is slower than the 68k. However, with an unrolled loop the GPU can achieve about the same performance in main as in local.

It's all a balancing act. This of course is using the 68k only for setup of the system initially and for setting up a new game level and then killing it.

The 68k should not be used at all in the main loop and should only be used in between game processing for new levels or such. Any use otherwise will only result in hammering the bus and reducing its bandwidth to one quarter and its speed to half: a serious blow to the Jaguar's performance and the main reason why most of the games had really bad frame rate performance. Using the 68k at all, even for just a few instructions, puts a serious damper on the Jaguar's performance.
Kskunk wrote: I went back to the hardware and found a new texturing hack: In parallel, the blitter can generate addresses while the GPU reorders memory access to exploit page locality. It's faster per-pixel, but you lose so much GPU that small polygons suck. See, hard to go one post without tripping over new Jaguar hacks! If only Carmack had known blah blah 640x480...

The Blitter Trick

Posted: Mon Apr 21, 2014 8:02 pm
by a31chris
There is a rumor out there of a 'Blitter Hack' that Scatologic found when developing Battlesphere that improves the Blitter's performance. From the BS development diary:
Latest cut of Battlesphere™ is running just fine. Framerate is indeed up, thanks to the special hardware 'hack' devised by Scott and Myself. Nobody has thought of this little ditty before... it's too COOL! For what it's worth, this little trick would have easily made DOOM a 320x240 game at 20-30fps all the time.. This game is running so smooth now. Things are shaping up nice...

... in the 25-30fps just about all the time. Sure, flood the screen with ships, debris, explosions, and shots and we're down to 15 or so, but man does this thing haul... Heh heh, no one's gonna figure out the little magic trick it took to make that one happen.... Reminds me of the olden days of 800 programming where there were things you could make the hardware do that the designers never dreamed of. This is so cool.

Framerate is still very high. We run constantly over 20FPS, usually between 30-60fps, depending on the amount of action onscreen. Our little Blitter Trick™ has ensured that even with lots of explosions going off at once, the framerate is really high. We're quite proud of this little 'hack' we came up with. It really works!! Not that we were anything but screaming fast before... the load management going on between the processors by our custom engines is no slouch. It's also 'generic' enough that we'll re-use most of it for our next Jag title.
Here are some other clues, from another interview with Scatologic, that the blitter 'trick' improved polygon performance:
Scatologic wrote:Our polygon engine uses the blitter in some strange ways that make it about the fastest rendering engine anyone ever wrote. Heck, we beat on the Jaguar so hard that we had to put breaks in the screen video objects to give the DSP cycles to play the audio.
But at any rate this hack is most likely out there waiting to be found by someone else.

The Graphics Programming Black Book/3D animation

Posted: Wed Apr 23, 2014 11:06 pm
by a31chris
The Graphics Programming Black Book, written by Michael Abrash, a friend of John Carmack and part of the original Quake development team.

This book is touted by Carmack himself and recommended by Thunderbird, Oppressor and Tursi lion as a must read for any serious Jag coder.
Tursi mentioned that it talks about the Quake engine and how it could be adapted for a linebuffer and he hypothesized this may be applicable for the Jaguar.

Scott Corley's guidelines to a 3D animation engine ... engine.php

Some more BLITTER tips

Posted: Mon Nov 17, 2014 8:54 pm
by a31chris
Kskunk wrote: There's another Jaguar video mode I found a long time ago while poking around with BJL. In the Jaguar docs there are some "gaps" where registers "should" be, such as between F00054 (HEQ) and F00058 (BG). If you set bit 2 in the undocumented register F00056, you get "black and white CRY" mode.

In this mode, C chooses a grayscale shade from 0-255 and I shades that intensity. This mode is well-supported by the blitter (using TOPNEN) and can produce a few interesting shading effects not possible in normal CRY mode. The downside is, obviously, no color.
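
Setting that bit might look something like the sketch below. Everything beyond the address (F00056) and the bit number (2) is an assumption: the register is treated as 16 bits wide because its neighbours sit two bytes either side of it, and because it is unknown whether it can be read back, the code works on a shadow copy instead of reading the hardware. The names are made up.

```c
#include <stdint.h>

#define UNDOC_VIDEO_REG 0xF00056u   /* undocumented register from the post */
#define GRAY_CRY_BIT    (1u << 2)   /* bit 2: "black and white CRY" mode */

/* Return the shadow value with the grayscale CRY bit set or cleared.
   A shadow copy is kept because read-back is unverified (assumption). */
uint16_t with_gray_cry(uint16_t shadow, int enable)
{
    return enable ? (uint16_t)(shadow | GRAY_CRY_BIT)
                  : (uint16_t)(shadow & ~GRAY_CRY_BIT);
}

/* Plain 16-bit register write. On real hardware the pointer would be
   (volatile uint16_t *)UNDOC_VIDEO_REG; taking it as a parameter lets
   the sketch run on a host machine too. */
void write_reg16(volatile uint16_t *reg, uint16_t value)
{
    *reg = value;
}
```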
Tursi wrote: I *did* find an unexpected [Blitter] combination that made a small performance boost, improving my previous best score - previously I was copying in phrase mode using 16-bit pixels, just because I really am using 16-bit color. I changed it to 32-bit pixels, and saw an improvement of about 20 pixels per frame (I'm not certain why the pixel depth should even matter in phrase mode...)

General tips

Posted: Wed Nov 19, 2014 8:48 pm
by a31chris
Gunstar wrote: I have to admit that I really like SuperCross3D too. Too bad about the terrible frame-rate, not so bad in practice mode or when you're way ahead of the pack though...but that should have been the slowest speed, not the fastest (as far as overall framerate goes). I often think when I'm playing it that it seems very rushed and unfinished, but if they had just dropped some things, in this case less would have been better. For example, since the framerate sucks, they should have dropped the whole "Arena Screen" thing in the background, I'm sure that's eating up hordes of processor time that the framerate could have used.
Thunderbird replied and wrote: Actually, the "Arena Screen" trick is pretty simple, and doesn't take up any extra CPU cycles (mostly). You just use the visible frame buffer as the source for the pixels on the frame currently being rendered. It's an old trick. It's not like a separate screen has to be rendered for that screen. If it were a rearview mirror or something, THEN it would have a different view which would require its own rendering.
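
The "Arena Screen" idea can be sketched in plain C as a copy from the visible buffer into a rectangle of the buffer being drawn. On the Jaguar the blitter would presumably do the copy; the nearest-neighbour scaling, the 16-bit pixels and the function name here are my own choices for illustration.

```c
#include <stdint.h>

/* Draw a scaled-down copy of the visible frame (`front`) into a
   rectangle of the frame being rendered (`back`). Both buffers are
   width x height 16-bit pixels. Nothing extra is rendered: the big
   screen simply shows the previous frame. */
void draw_arena_screen(const uint16_t *front, uint16_t *back,
                       int width, int height,
                       int dst_x, int dst_y, int dst_w, int dst_h)
{
    for (int y = 0; y < dst_h; y++) {
        int sy = y * height / dst_h;           /* nearest source line */
        for (int x = 0; x < dst_w; x++) {
            int sx = x * width / dst_w;        /* nearest source pixel */
            back[(dst_y + y) * width + (dst_x + x)] = front[sy * width + sx];
        }
    }
}
```

Because the source is the frame already on display, the in-game screen lags one frame behind, which is barely noticeable for this kind of effect.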

More BLITTER hints, rumors and allegations

Posted: Sat Mar 21, 2015 9:17 pm
by a31chris
The conventional wisdom is the Blitter cannot use the system bus very well to facilitate texture mapping. Another rumor is that a workaround has been found for this as well:
Gorf wrote: The texture mapping is the fault of the blitter, not the bus. The blitter had bugs and this was one that only allowed PIXEL mode texturing....even this has a workaround now, so this is not even true anymore.

We know the blitter has full speed access to the RAM in the GPU..yes, it's way too little but it's there.
The fact is that the blitter was not able to read more than a pixel, not that the bus could not handle it.
This is no longer true anyway. Like every console, new discoveries and workarounds come along and prove otherwise. We have discovered such workarounds.

What they knew then is very different to what WE know now. ... ga-saturn/
The Iron Soldier guys discovered a ‘hack’ which allowed the texture palette to be a texture source, doubling the speed of texture mapping for small textures
I have also experimented with various kinds of pixel transfer concepts, using combined Blitter / GPU routines, caching and data ordering techniques. In the end I worked out a routine that could make much higher use of the 64 bit / fast page mode characteristics than the usual pixel mode blitter stuff.

We worked on a new 3D engine designed around this routine to create a completely texture mapped racing game with a decent frame rate. Unfortunately, as with some other proposed projects, Atari was just too blind and inflexible to see what a big step forward we could achieve, and so we had to cancel development after a few months as we couldn't afford to proceed without Atari's support. At least some results from our research were used in IS2. -Marc

The Blitter Interrupt Stack

Posted: Sun Nov 12, 2017 8:32 am
by a31chris
Atari had suggested in one of their developer docs that the way to keep the Blitter (the processor in the Jaguar used to copy textures or draw shaded polygons) busy was to form a stack of commands, and just load them into the Blitter each time it finishes its current operation, so that's what I tried.

-Writing the Blitter data to a stack

-Enabling the Blitter Interrupt so that as soon as it finished its last operation it interrupted the processor in which the data was being prepared. In this case it was one of the fast RISC chips, the Graphics Processor Unit (GPU), which then picked up the next values stuck 'em in the Blitter and went back to its work.

Now this is great if all you have is large gouraud (a method for providing smoothly shaded polygons) or unshaded textured polygons, but if you have small polygons, the overhead makes it a false economy.

Oh, and sometimes, JUST sometimes, it can mess up a load/store command (to read or write values to/from memory).

And guess what? If it messes up the wrong store, everything can freeze. So I did a quick re-write of the interrupt part of the engine and gained a not inconsiderable amount of performance (perhaps 30% more polys/sec with small polys).

At last... some GOOD news. ... 6.html?m=1
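
The interrupt-driven approach described in that post can be modelled roughly as follows. This is a host-side sketch, not the author's engine: it uses a first-in, first-out queue where the post says "stack" (on the assumption that draw order matters), the command fields and the queue size are invented, and "starting" a command just copies it into an `active` slot where real code would write the Blitter registers.

```c
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_SIZE 16                 /* ASSUMPTION: arbitrary capacity */

typedef struct {                      /* hypothetical command fields */
    uint32_t dst, src, count, command;
} blit_cmd;

typedef struct {
    blit_cmd slots[QUEUE_SIZE];
    unsigned head;                    /* next command to start */
    unsigned tail;                    /* next free slot */
    blit_cmd active;                  /* stands in for the Blitter registers */
    bool     busy;
} blit_queue;

static void start_next(blit_queue *q)
{
    q->active = q->slots[q->head % QUEUE_SIZE];
    q->head++;
    q->busy = true;
}

/* Called by the code preparing polygons. On real hardware this would
   have to be protected against the interrupt firing in the middle. */
bool blit_submit(blit_queue *q, const blit_cmd *cmd)
{
    if (q->tail - q->head == QUEUE_SIZE)
        return false;                 /* queue full */
    q->slots[q->tail % QUEUE_SIZE] = *cmd;
    q->tail++;
    if (!q->busy)
        start_next(q);                /* Blitter idle: start right away */
    return true;
}

/* Called when the Blitter signals that its operation has finished. */
void blit_done_interrupt(blit_queue *q)
{
    q->busy = false;
    if (q->head != q->tail)
        start_next(q);                /* pick up the next queued command */
}
```

Every command costs one interrupt entry and exit, which is the overhead the post says makes this a false economy for small polygons.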

Instant access to second register bank

Posted: Tue Nov 21, 2017 11:36 pm
by a31chris
Because it's another tidbit programmers seem to have trouble finding, it will be repeated here:

'MOVEFA and MOVETA are MORE than just fast register backup mechanisms. They are instructions allowing access to the other bank while running from the opposite bank. I change ONE bit in a register and the Jaguar runs directly out of the alternate bank.' - Gorf

'Bit 14 of the FLAGS Register is responsible for the bank selection. Interrupts always execute on bank 0. Yeah! This closes my gap.' - energetic Padawan
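
A tiny model of the two register banks may help show what the quotes mean. It is a sketch under assumptions, not an emulator: it takes bit 14 set to mean bank 1, models only the bank select bit of FLAGS, and the struct and function names are invented.

```c
#include <stdint.h>

#define REGPAGE_BIT (1u << 14)   /* bit 14 of FLAGS selects the bank */

typedef struct {
    uint32_t bank[2][32];        /* two banks of 32 registers */
    uint32_t flags;              /* only the bank select bit is modelled */
} risc_regs;

/* ASSUMPTION: a set bit selects bank 1, a clear bit selects bank 0. */
static unsigned current_bank(const risc_regs *r)
{
    return (r->flags & REGPAGE_BIT) ? 1u : 0u;
}

/* Flip ONE bit and all following code runs out of the other bank. */
void switch_bank(risc_regs *r)
{
    r->flags ^= REGPAGE_BIT;
}

/* MOVETA: copy a register from the current bank to the alternate bank. */
void moveta(risc_regs *r, unsigned src, unsigned dst)
{
    unsigned cur = current_bank(r);
    r->bank[cur ^ 1u][dst & 31u] = r->bank[cur][src & 31u];
}

/* MOVEFA: copy a register from the alternate bank to the current bank. */
void movefa(risc_regs *r, unsigned src, unsigned dst)
{
    unsigned cur = current_bank(r);
    r->bank[cur][dst & 31u] = r->bank[cur ^ 1u][src & 31u];
}
```

So code can either reach across with MOVETA/MOVEFA one register at a time, or flip the bit and work in the other bank directly, keeping in mind that interrupts always run on bank 0.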