You managed to shut the 68k off completely? My understanding is that no one else has been able to do that, and I believe people still need to use the 68k at least for VBL. Not to mention the function sizes can be bigger.
Since it's been about 20 years (!) since I've thought about Jaguar fundamentals, it's hard for me to give absolute confirmation about specific details unless some code is dug up somewhere.
That said, I can quote myself from 20 years ago, thanks to someone posting this:
HVSCMAZE readme
In that readme (written by me), I apparently brag about the 68k being so shut down that it's shutty shut down down down, and the demo still runs full speed ahead. I think the example given in that message (forcing the 68k to be busy with the debugger, and the demo still running) means that I managed to completely take the 68k out of the picture, which was a big goal for all of our Jaguar stuff, as I recall. It was one of the very early lessons of the Jaguar - anything hitting the bus was a performance killer, and when the 68k was running, it just hit the bus for instructions all the time, so the goal was to make it go away. Likewise, having the GPU hit anything but its own 4k SRAM was undesirable, and should be avoided if possible. If the 68k was still servicing vertical blank interrupts, well, perhaps it was, that's more detail than I can remember at this point.
I'm certain that anybody doing homebrew has a much better mental picture of bus priorities, memory cycle hits, and real world performance numbers than I do at the moment. If I make a claim that sounds impossible, it very well may be. Best case is the code speaks for itself, if it's ever found. I'm pretty sure everything I ever knew about every corner of the Jaguar is expressed in that code...
By the way, here is another quote from me from 1995. This is from a message I sent to J Patton and Normen Kowalewski describing the GPUMGR:
"The 68000 plays no part in this except for initialization, so it can be stopped while an entire game runs in the GPU."
What about large programs? Doesn't paging code in and out of the local kill that speed difference for large programs? There are a few who found that running the time-critical stuff in local and the non-time-critical stuff out in main (GPU) is faster than losing cycles flipping absolutely everything in and out of the local.
Having time-critical code local to the GPU and running everything else out of non-local would certainly be faster than flipping everything in and out all the time. I think most overlay approaches tend to be very brute-force like this, so you're constantly forced to flip absolutely everything in and out, and there is no huge efficiency gain.
The GPUMGR didn't work that way, though. The code chunks were completely independent, relocatable sections of code. It wasn't position-independent code; instead, the GPUMGR would patch absolute addresses when a code chunk was loaded, to fix up references (subroutine calls, etc.) to other code chunks, depending on where those chunks were loaded. (This dynamic linking sounds expensive, but there are surprisingly few fixups required generally.)
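To make the load-time fixup idea concrete, here is a minimal sketch in Python. It is not the GPUMGR code (which hasn't been found yet); the names Chunk, load_chunk, and the fixup tuple layout are all made up for illustration, and it assumes every referenced chunk is already resident when the patching happens.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    # Hypothetical chunk record: a code image plus a list of fixups.
    name: str
    words: list                                  # code image, one int per word
    fixups: list = field(default_factory=list)   # (word_index, target_chunk, offset)

def load_chunk(chunk, base, sram, loaded):
    """Copy a chunk into sram at base, then patch its absolute references.

    Assumption: every target chunk named in the fixups is already in `loaded`.
    """
    sram[base:base + len(chunk.words)] = chunk.words
    loaded[chunk.name] = base
    for word_index, target, offset in chunk.fixups:
        # Overwrite the placeholder word with the target's current address.
        sram[base + word_index] = loaded[target] + offset
    return base

# Tiny demonstration: "main" calls into "helper", wherever helper landed.
sram = [0] * 16
loaded = {}
helper = Chunk("helper", [0x11, 0x22])
main = Chunk("main", [0xA0, 0, 0xA1], fixups=[(1, "helper", 0)])
load_chunk(helper, 8, sram, loaded)
load_chunk(main, 0, sram, loaded)
```

The point of the sketch is only that the patch loop runs once per fixup, so a chunk with few outside references costs very little to relocate.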
This way, you could have any subset of code chunks loaded at one time. The code chunks were timestamped when they were referenced, so that if there was a cache miss, the least-recently-used chunk would be blown out. As it happens, the most performance-critical code tends to get called a lot, and so it stays in cache. I have some figures of getting 88% cache hit, which means 88% of the time, code is being run full speed out of GPU SRAM and not hitting the bus at all. The misses (the other 12%) would have to load the code into SRAM, of course, but I think pulling all of the code into SRAM and then executing it at full speed is faster than executing RISC code over the main bus. In both cases, the code has to be loaded over the bus... so load it as quickly as you can...
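The timestamp-and-evict scheme described above can be sketched roughly like this. Again, this is an illustration and not the original code: ChunkCache, the chunk names, and the sizes are invented, the capacity is just 4k bytes counted as 1024 32-bit words, and the sketch ignores where chunks sit in SRAM (no fragmentation or placement).

```python
class ChunkCache:
    """Toy LRU cache of code chunks, evicting by oldest reference timestamp."""

    def __init__(self, capacity_words):
        self.capacity = capacity_words
        self.resident = {}   # chunk name -> (size_in_words, last_used_tick)
        self.tick = 0
        self.hits = 0
        self.misses = 0

    def used(self):
        return sum(size for size, _ in self.resident.values())

    def call(self, name, size):
        if size > self.capacity:
            raise ValueError("chunk larger than the whole cache")
        self.tick += 1
        if name in self.resident:
            self.hits += 1
        else:
            self.misses += 1
            # Blow out least-recently-used chunks until the new one fits.
            while self.used() + size > self.capacity:
                victim = min(self.resident, key=lambda n: self.resident[n][1])
                del self.resident[victim]
        # Timestamp the chunk on every reference, hit or miss.
        self.resident[name] = (size, self.tick)

# Hypothetical call trace: one hot routine, several cold ones.
cache = ChunkCache(capacity_words=1024)
sizes = {"blit": 400, "ai": 400, "menu": 400, "sound": 400}
for name in ["blit", "ai", "blit", "menu", "blit", "sound", "blit", "ai"]:
    cache.call(name, sizes[name])
```

In this trace the frequently called "blit" chunk is never evicted while the colder chunks rotate through the remaining space, which is the behavior described above. The hit rate of this toy trace means nothing about real games.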
For me, personally, the appeal of the LRU caching is that I don't have to constantly try to guess which part of the code is really performance critical. The critical code magically stays in SRAM, and the non-critical code can be as big as it needs to be. Then when new features are added, or big chunks of code are reworked, some code becomes less important, and some new code is now performance-critical. The now-important stuff will stay in cache via magic. And, of course, if a game has many performance critical parts that are each critical at a different point in time, the important bits can be all up in the SRAM during their moment of glory, and get kicked out when they go cold and it's time for some other bits to shine.
As for the "large function" problem - the GPUMGR handled things pretty granularly, think of it basically as a subroutine-by-subroutine basis, so there wasn't really an issue of some key piece of code being too large to be paged in and out. Any huge piece of code would typically be made up of reasonably-sized subroutines, and the important stuff would stay in-cache and hum along.
When you get time, in the main Jaguar forum (not the programming one here), can you tell us about the history of High Voltage Software, how you got involved, how the company got involved with Atari, etc.? HVS got their start as a company with Atari, did they not?
I'm not much of a historian, but I wouldn't mind reminiscing about some things in the main forum. With the technical stuff, we can test and prove things. With the history, everybody has their own view of events, and an incomplete set of data to work from. So I won't chime in on unsettled historical drama, if any (I can see some glimpses of such things in some posts, but hey, it's the Internet, it happens... :). If I can add some pieces to the overall story, that would be fun.
It's really good to have you here. We hope you have a great 4th. :)
Had a great 4th - just got back from the fireworks. As you can see from the timestamps on my emails, I've got a few days off work here, and I'm hitting these forums after my family goes to bed :) Thanks for the good questions, this has been a lot of fun so far...
Scott