3DO ZONE Forums

Posted: **Fri Jul 04, 2014 5:21 am**

Hey there -
Scott Corley checking in. I got a nice email a while back letting me know about the Jaguar homebrew community, and asking about the old GPUMGR and other tools...

Good to see McGroarty and Adisak chiming in here.

I've been searching some old hard drives for code, but in the meantime, I can try to answer some questions about the GPU management tool, its integration with the C compiler, what was used and not used on Ruiner, etc...

Scott

Posted: **Fri Jul 04, 2014 12:18 pm**

Hi Scott.

I had to look you up to find out who you were!

I'm simply an Atari Jag gamer. Have always been a fan of the Jag tbh, since it came out.

Turns out you were heavily involved in the design of Ruiner - a game I only became acquainted with in the last 5 years or so tbh. I didn't own it back in the day (Jag stuff was a bit low on the ground here in the UK, at the time). I really like it though. It's a fun game. The Tower level is very Alien Crush/Dragon's Fury, which is certainly no bad thing.

I think the only criticism I have of the game is the zero-frame execution of the flippers. They're not pressure sensitive at all, which unfortunately hurts the game somewhat.

Opinion is often split between Ruiner and Pinball Dreams, and although playability has always fallen in Dreams' favour, I much prefer the look and feel of Ruiner.

Anyway, enough rabble... welcome to the forums!

Posted: **Fri Jul 04, 2014 3:57 pm**

gpumgr wrote:Hey there -
Scott Corley checking in. I got a nice email a while back letting me know about the Jaguar homebrew community, and asking about the old GPUMGR and other tools...

Good to see McGroarty and Adisak chiming in here.

I've been searching some old hard drives for code, but in the meantime, I can try to answer some questions about the GPU management tool, its integration with the C compiler, what was used and not used on Ruiner, etc...

Scott

The GPU manager tool would be very, very cool to hear about. However, its need is probably less critical, since the hardware bug that stops the Jaguar's GPU from executing code from main RAM successfully now has a workaround. If you are interested, you can see this workaround in the pinned thread in this forum, 'Jaguar programming tips and tricks'.

What is REALLY needed is any information you may have on the workaround you guys created for the compiler comparison bug the risc gcc had.

Also, in this thread Mike Fulton talks about Atari's decision to make a risc gcc and how what HVS accomplished was a complete surprise to them:

http://3do.cdinteractive.co.uk/viewtopi ... =14&t=3484

And welcome to the forums Mr Corley! It is very nice having you here! We hope you enjoy your stay. You will most definitely be treated like royalty!

Posted: **Fri Jul 04, 2014 6:53 pm**

The GPU manager tool would be very, very cool to hear about. However, its need is probably less critical, since the hardware bug that stops the Jaguar's GPU from executing code from main RAM successfully now has a workaround. If you are interested, you can see this workaround in the pinned thread in this forum, 'Jaguar programming tips and tricks'.

The main goal of the GPUMGR was to avoid executing any code from main RAM. The 68k would be used to load up the GPUMGR and start it, then the 68k would be halted never to be heard from again, and then the entire program would run from the GPU cache. The GPU would page code into itself from main RAM when needed, but otherwise would not touch the main RAM, leaving the bus free for everything else. This should give significant performance improvements over running RISC code from main RAM, as the RISC instructions are all in local cache, the 68k doesn't ever hit the bus, and the cache misses (requiring new code blits) are infrequent.

What is REALLY needed is any information you may have on the workaround you guys created for the compiler comparison bug the risc gcc had.

Speaking of cache misses... the information my brain has about these things has clearly been moved to the biological equivalent of tape backup, but the info is coming back. I saw McGroarty quoted in one of these messages talking about how mind blowing it is to look back at this stuff, and how he misses it... I'm at the beginning stages of that, I haven't thought about these things in a long time but it's already reminding me of the lengths we all used to go to to get things done, it's having an impact on my current day work (in a good way...), recalling that in seemingly impossible situations there are always great rewards if you are willing to walk through the fog. Back then I don't think any of us thought anything was impossible, we just ran full speed ahead (looks like that spirit is still alive in this modern day Jaguar community :)

Scott

Posted: **Sat Jul 05, 2014 12:00 am**

gpumgr wrote:
The GPU manager tool would be very, very cool to hear about. However, its need is probably less critical, since the hardware bug that stops the Jaguar's GPU from executing code from main RAM successfully now has a workaround. If you are interested, you can see this workaround in the pinned thread in this forum, 'Jaguar programming tips and tricks'.
The main goal of the GPUMGR was to avoid executing any code from main RAM. The 68k would be used to load up the GPUMGR and start it, then the 68k would be halted never to be heard from again, and then the entire program would run from the GPU cache. The GPU would page code into itself from main RAM when needed, but otherwise would not touch the main RAM, leaving the bus free for everything else.

You managed to shut the 68k off completely? My understanding is that no one else has been able to do that, and I believe they still need to use the 68k at least for VBL. Not to mention the function sizes can be bigger.

This should give significant performance improvements over running RISC code from main RAM, as the RISC instructions are all in local cache, the 68k doesn't ever hit the bus, and the cache misses (requiring new code blits) are infrequent.

Scott

What about large programs? Doesn't paging code in and out of local RAM kill that speed difference for large programs? A few people have found that running the time-critical stuff in local RAM and the non-time-critical stuff out in main RAM (still on the GPU) is faster than losing cycles flipping absolutely everything in and out of local RAM.

When you get time, in the main Jaguar forum, not the programming one here, can you tell us about the history of High Voltage Software, how you got involved, how the company got involved with Atari, etc.? HVS got their start as a company with Atari, did they not?

It's really good to have you here. We hope you have a great 4th.

Posted: **Sat Jul 05, 2014 6:06 am**

You managed to shut the 68k off completely? My understanding is that no one else has been able to do that, and I believe they still need to use the 68k at least for VBL. Not to mention the function sizes can be bigger.

Since it's been about 20 years (!) since I've thought about Jaguar fundamentals, it's hard for me to give absolute confirmation about specific details unless some code is dug up somewhere.

That said, I can quote myself from 20 years ago, thanks to someone posting this:
HVSCMAZE readme

In that readme (written by me), I apparently brag about the 68k being so shut down that it's shutty shut down down down, and the demo still runs full speed ahead. I think the example given in that message (forcing the 68k to be busy with the debugger, and the demo still running) means that I managed to completely take the 68k out of the picture, which was a big goal for all of our Jaguar stuff, as I recall. It was one of the very early lessons of the Jaguar - anything hitting the bus was a performance killer, and when the 68k was running, it just hit the bus for instructions all the time, so the goal was to make it go away. Likewise, having the GPU hit anything but its own 4k SRAM was undesirable, and should be avoided if possible. If the 68k was still servicing vertical blank interrupts, well, perhaps it was, that's more detail than I can remember at this point.

I'm certain that anybody doing homebrew has a much better mental picture of bus priorities, memory cycle hits, and real world performance numbers than I do at the moment. If I make a claim that sounds impossible, it very well may be. Best case is the code speaks for itself, if it's ever found. I'm pretty sure everything I ever knew about every corner of the Jaguar is expressed in that code...

By the way, here is another quote from me from 1995. This is from a message I sent to J Patton and Normen Kowalewski describing the GPUMGR:

"The 68000 plays no part in this except for initialization, so it can be stopped while an entire game runs in the GPU."

What about large programs? Doesn't paging code in and out of local RAM kill that speed difference for large programs? A few people have found that running the time-critical stuff in local RAM and the non-time-critical stuff out in main RAM (still on the GPU) is faster than losing cycles flipping absolutely everything in and out of local RAM.

Having time-critical code local to the GPU and running everything else out of non-local would certainly be faster than flipping everything in and out all the time. I think most overlay approaches tend to be very brute-force like this, so you're constantly forced to flip absolutely everything in and out, and there is no huge efficiency gain.

The GPUMGR didn't work that way, though. The code chunks were completely independent, relocatable sections of code. It wasn't position-independent code; instead, the GPUMGR would patch absolute addresses when the code chunk was loaded to fix up references (subroutine calls, etc.) to other code chunks, depending on where those chunks were loaded. (This dynamic linking sounds expensive, but there are surprisingly few fixups required generally.)
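As a rough illustration of the load-time fixups described above (not the actual GPUMGR code, which has not turned up), here is a minimal Python sketch. The `Chunk` layout, the fixup tuple format, and the example addresses are all assumptions made for illustration; the real tool worked on assembled RISC code and its data format is unknown.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    name: str
    # Assembled code as a list of integer words, as if the chunk lived at address 0.
    words: list
    # Hypothetical fixup format: (index of word to patch, target chunk name, offset in target).
    fixups: list = field(default_factory=list)


def load_chunk(chunk, base, resident):
    """Return a patched copy of chunk.words for a chunk loaded at `base`.

    `resident` maps chunk name -> load address for chunks already in SRAM.
    """
    addresses = dict(resident)
    addresses[chunk.name] = base          # a chunk may also refer to itself
    patched = list(chunk.words)           # leave the main-RAM master copy untouched
    for index, target, offset in chunk.fixups:
        patched[index] = addresses[target] + offset
    return patched


# Illustrative addresses only: "draw" calls a routine 8 bytes into "math".
draw = Chunk("draw", [0x1111, 0x0000, 0x2222], fixups=[(1, "math", 8)])
patched = load_chunk(draw, base=0xF03100, resident={"math": 0xF03400})
print(hex(patched[1]))
```

Only the words named in the fixup list are touched, which matches the point that few fixups are needed per chunk. The same chunk can be patched again with different addresses the next time it is loaded somewhere else.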

This way, you could have any subset of code chunks loaded at one time. The code chunks were timestamped when they were referenced, so that if there was a cache miss, the least-recently-used chunk would be blown out. As it happens, the most performance-critical code tends to get called a lot, and so it stays in cache. I have some figures of getting 88% cache hit, which means 88% of the time, code is being run full speed out of GPU SRAM and not hitting the bus at all. The misses (the other 12%) would have to load the code into SRAM, of course, but I think pulling all of the code into SRAM and then executing it at full speed is faster than executing RISC code over the main bus. In both cases, the code has to be loaded over the bus... so load it as quickly as you can...
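The timestamp-based replacement policy can be sketched in a few lines of Python. This is only a model of the idea as described, under several assumptions: the class name `ChunkCache`, the capacity, and the chunk names and sizes are made up, and the sketch tracks only total bytes used, ignoring where chunks are placed in SRAM (so it does not model fragmentation).

```python
class ChunkCache:
    """Least-recently-used cache of code chunks, keyed by chunk name."""

    def __init__(self, capacity):
        self.capacity = capacity   # bytes of SRAM assumed available for chunks
        self.used = 0
        self.resident = {}         # name -> [size, tick of last reference]
        self.tick = 0
        self.hits = 0
        self.misses = 0

    def call(self, name, size):
        """Record a call into chunk `name`, loading it (and evicting) if needed."""
        self.tick += 1
        entry = self.resident.get(name)
        if entry is not None:
            entry[1] = self.tick   # refresh the timestamp on every reference
            self.hits += 1
            return
        self.misses += 1
        if size > self.capacity:
            raise ValueError(f"chunk {name!r} can never fit")
        # Blow out the least-recently-used chunks until the new one fits.
        while self.used + size > self.capacity:
            victim = min(self.resident, key=lambda n: self.resident[n][1])
            self.used -= self.resident.pop(victim)[0]
        self.resident[name] = [size, self.tick]
        self.used += size

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


# Hypothetical workload: one hot routine called between colder ones.
cache = ChunkCache(capacity=3072)
for name in ["inner_loop", "draw", "inner_loop", "menu", "inner_loop",
             "load_level", "inner_loop", "draw", "inner_loop"]:
    cache.call(name, size=1024)
print(sorted(cache.resident), cache.hits, cache.misses)
```

In this toy trace the frequently called `inner_loop` is never evicted, while the colder chunks rotate through the remaining space, which is the behavior the post attributes to the real manager.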

For me, personally, the appeal of the LRU caching is that I don't have to constantly try to guess which part of the code is really performance critical. The critical code magically stays in SRAM, and the non-critical code can be as big as it needs to be. Then when new features are added, or big chunks of code are reworked, some code becomes less important, and some new code is now performance-critical. The now-important stuff will stay in cache via magic. And, of course, if a game has many performance critical parts that are each critical at a different point in time, the important bits can be all up in the SRAM during their moment of glory, and get kicked out when they go cold and it's time for some other bits to shine.

As for the "large function" problem - the GPUMGR handled things pretty granularly, think of it basically as a subroutine-by-subroutine basis, so there wasn't really an issue of some key piece of code being too large to be paged in and out. Any huge piece of code would typically be made up of reasonably-sized subroutines, and the important stuff would stay in-cache and hum along.

When you get time, in the main Jaguar forum, not the programming one here, can you tell us about the history of High Voltage Software, how you got involved, how the company got involved with Atari, etc.? HVS got their start as a company with Atari, did they not?

I'm not much of a historian, but I wouldn't mind reminiscing about some things in the main forum. With the technical stuff, we can test and prove things. With the history, everybody has their own view of events, and an incomplete set of data to work from. So I won't chime in on unsettled historical drama, if any (I can see some glimpses of such things in some posts, but hey, it's the Internet, it happens... :). If I can add some pieces to the overall story, that would be fun.

It's really good to have you here. We hope you have a great 4th. :)

Had a great 4th - just got back from the fireworks. As you can see from the timestamps on my emails, I've got a few days off work here, and I'm hitting these forums after my family goes to bed :) Thanks for the good questions, this has been a lot of fun so far...

Scott

Posted: **Sat Jul 05, 2014 7:05 am**

gpumgr wrote:Having time-critical code local to the GPU and running everything else out of non-local would certainly be faster than flipping everything in and out all the time. I think most overlay approaches tend to be very brute-force like this, so you're constantly forced to flip absolutely everything in and out, and there is no huge efficiency gain.

The GPUMGR didn't work that way, though. The code chunks were completely independent, relocatable sections of code. It wasn't position-independent code; instead, the GPUMGR would patch absolute addresses when the code chunk was loaded to fix up references (subroutine calls, etc.) to other code chunks, depending on where those chunks were loaded. (This dynamic linking sounds expensive, but there are surprisingly few fixups required generally.)

This way, you could have any subset of code chunks loaded at one time. The code chunks were timestamped when they were referenced, so that if there was a cache miss, the least-recently-used chunk would be blown out. As it happens, the most performance-critical code tends to get called a lot, and so it stays in cache. I have some figures of getting 88% cache hit, which means 88% of the time, code is being run full speed out of GPU SRAM and not hitting the bus at all. The misses (the other 12%) would have to load the code into SRAM, of course, but I think pulling all of the code into SRAM and then executing it at full speed is faster than executing RISC code over the main bus. In both cases, the code has to be loaded over the bus... so load it as quickly as you can...

For me, personally, the appeal of the LRU caching is that I don't have to constantly try to guess which part of the code is really performance critical. The critical code magically stays in SRAM, and the non-critical code can be as big as it needs to be. Then when new features are added, or big chunks of code are reworked, some code becomes less important, and some new code is now performance-critical. The now-important stuff will stay in cache via magic. And, of course, if a game has many performance critical parts that are each critical at a different point in time, the important bits can be all up in the SRAM during their moment of glory, and get kicked out when they go cold and it's time for some other bits to shine.

As for the "large function" problem - the GPUMGR handled things pretty granularly, think of it basically as a subroutine-by-subroutine basis, so there wasn't really an issue of some key piece of code being too large to be paged in and out. Any huge piece of code would typically be made up of reasonably-sized subroutines, and the important stuff would stay in-cache and hum along.

What you have just described is unlike anything I have ever heard of before. You guys were absolutely amazing. My hat's off to all of you.

Posted: **Sat Jul 05, 2014 9:50 am**

Can you tell us anything about what this tool did?

The code was post processed with GCCGPUM (HVS tool).

Posted: **Sat Jul 05, 2014 11:10 am**

Hi Scott, nice to see you here.

Tell us how the tool works. The maze demo text says it was an automatic process, so I guess you write code, compile, process, and run without any manual editing steps.

This seems to be a complex system.

Did you compile the code with the risc gcc to assembly (.S) code and then have your tool process that, or did it do something before the risc compiler started to compile?

TXG/MNX (Rene)

Posted: **Sun Jul 06, 2014 1:17 am**

txg/mnx wrote:Hi Scott, nice to see you here.

Tell us how the tool works. The maze demo text says it was an automatic process, so I guess you write code, compile, process, and run without any manual editing steps.

This seems to be a complex system.

Did you compile the code with the risc gcc to assembly (.S) code and then have your tool process that, or did it do something before the risc compiler started to compile?

TXG/MNX (Rene)

Did you compile the code with the risc gcc to assembly (.S) code

You got it :)

The C was compiled by GCC to .S files, then the post-processing tool would enhance the .S, and the result would be assembled with MADMAC. I don't actually remember the details of the dreaded GCC comparison bug - in the HVSCMAZE message, I refer to a workaround for this bug. I don't know if I fixed it in post-processing, or if I just avoided the bug by changing the C code in the demo in the hopes that the bug would be fixed on the compiler side some day. If I refresh my memory a bit on the nature of the compiler comparison bug, I may be able to recall what my solution was for that.

I know that for a time the compiler had a loop bug, where a for-loop would not iterate enough times. That one might have been fixed on the compiler side. Hopefully it was.

The main job of the post-processing was to add meta-data to the assembly code. The main bit of data added was a binary tag that would indicate where it was safe to split the binary assembled output into chunks. I could determine where each function started in the GCC assembly output, and I would know that it was safe to split the code at those locations, so a static 32-bit magic number was inserted in those spots. After the code was assembled, another post-processing step would split the binary code image into chunks based on where those marks were.

Any memory references would be modified in a way that made it easy to fix them up - if a jump to a subroutine was needed, the absolute memory location that the assembler came up with would be modified in the final binary so that it was ready to be handled properly by GPUMGR. The basic mechanism was that jumps between chunks were handled via the GPUMGR. The jump would enter the GPUMGR, the GPUMGR would know the actual destination address of the other chunk in SRAM. If the chunk wasn't there, it would load it, then jump there. If there wasn't room to load the chunk, older chunks would be flushed out to make room.

This is the same way the GPUMGR worked for code written entirely in assembly, but in that case, you'd just put in macros in the code where you wanted splits to happen (between functions, with some discretion. If you wanted 5 small functions all in the same chunk, you could just put the splits on either end of those 5 functions, and then you could have relative jumps between those subroutines without going through the GPUMGR). I believe that absolute jumps that were intra-chunk would also be fixed up, so if a chunk had a long jump within itself, it would be fixed to the correct address at load time so that it would work full speed with no problem.

Bear in mind that even though the inter-chunk jump sequence described above sounds like a lot of work, the expectation was a very high hit rate, so the common case of inter-chunk jumps between already-resident chunks was quite fast.
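The "split the assembled image at the magic markers" step could look roughly like the Python sketch below. It is an illustration only, not the HVS tool: the marker value `MAGIC`, the big-endian encoding, and the 2-byte scan alignment are assumptions, since the real marker and file layout are not given in this thread.

```python
import struct

MAGIC = 0x47504D21  # hypothetical 32-bit marker; the real value is unknown


def split_chunks(image, magic=MAGIC, align=2):
    """Split an assembled binary image into chunks at each marker.

    Markers are removed from the output. Scanning steps by `align` bytes so
    that a marker-like byte pattern straddling an odd offset is not matched.
    """
    marker = struct.pack(">I", magic)   # assumed big-endian, like the 68k
    chunks = []
    start = pos = 0
    while pos + len(marker) <= len(image):
        if image[pos:pos + len(marker)] == marker:
            if pos > start:
                chunks.append(image[start:pos])
            pos += len(marker)
            start = pos
        else:
            pos += align
    if start < len(image):
        chunks.append(image[start:])    # trailing code after the last marker
    return chunks


marker = struct.pack(">I", MAGIC)
image = marker + b"\x01\x02\x03\x04" + marker + b"\x05\x06"
print(split_chunks(image))
```

A real tool would also need to make sure the marker value can never appear as legitimate code or data, and would record per-chunk fixup information at the same time; both are left out here.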

I'm probably (definitely!) getting more specific than my actual recollection allows... there are some questions that crop up here... I'm sure anyone reading this is thinking... "wha... you what? And that worked?" so yeah, I'd love to find the code and take a look at it. Another fun part is that the loader was of course all done in optimized hand-coded risc assembly with all of the joys of avoiding register scoreboard smashes and utilizing delay slots and all of that... so it would be great to see the code. The GPUMGR also had to be very careful not to trash registers itself, since it was handling function calls between chunks. It was originally designed to work with code written in assembly, rather than GCC compiled code, and there were no register conventions for function calls in assembly (at least not in mine)... it's like "yeah this function will get what it needs out of r12 because... that's where the value happens to be when I call that function! (and moving it to r0 would cost me 2 cycles!)" so the GPUMGR couldn't trash r12 or anything else on the way, and it certainly did not stash and restore all of the registers on every function call!

There's another project that hasn't been mentioned here, which was my GPU debugger. I was able to set breakpoints in GPU code managed by GPUMGR (at source level), break in WDB, and step through the GPU code with source in the debugger. There may be tools to do that now, but I believe at the time the stock tools could not set regular breakpoints in the GPU, run full speed to the breakpoint, and then source-level step-debug the GPU in WDB. For an additional level of difficulty, keep in mind that this was source-level stepping through GPUMGR chunks that could be loaded at different locations in SRAM at any given time...

Scott

Posted: **Sun Jul 06, 2014 5:41 am**

gpumgr wrote:There's another project that hasn't been mentioned here, which was my GPU debugger. I was able to set breakpoints in GPU code managed by GPUMGR (at source level), break in WDB, and step through the GPU code with source in the debugger. There may be tools to do that now, but I believe at the time the stock tools could not set regular breakpoints in the GPU, run full speed to the breakpoint, and then source-level step-debug the GPU in WDB. For an additional level of difficulty, keep in mind that this was source-level stepping through GPUMGR chunks that could be loaded at different locations in SRAM at any given time...

Scott

Since Brainstorm did not release any of the source code for the toolchain, no one in the community had the skill set to build a modern assembler for it until recently. It's all been DOSBox until about six years ago. No one, at least in the public's eye, has really done anything new with the WDB setup, so what you have will probably be brand new, never seen before, and absolutely state of the art to this community. Just like your GPU manager.

As for the comparison bug it seemed to be a stumbling block for Atari, so much so they say they did not distribute the risc gcc to anyone, even to most of their in-house. You guys seemed to have blown right through whatever problems they had without any difficulty so we'll see what turns up.

Posted: **Mon Jul 07, 2014 12:42 pm**

This is so interesting. I really hope Scott can find or remember more stuff like this.

The WDB with GPU function is awesome. Indeed, I never saw such a thing, and I don't believe anyone ever thought about it either.

Posted: **Wed Jul 09, 2014 2:37 am**

gpumgr wrote:... so it would be great to see the code.

Hello Scott. I've been saying this for years.

Posted: **Wed Jul 09, 2014 6:08 pm**

Hi MegaData,

Are you near the stuff now, so you can dig for files like the HVSCMAZE?

Scott is the writer of this code, so maybe Scott can tell us if this demo, and maybe the sources, may be released to the public.

Posted: **Tue Jul 15, 2014 1:25 am**

MegaData wrote:
gpumgr wrote:... so it would be great to see the code.
Hello Scott. I've been saying this for years.

Well, as Rene knows, I have one hard drive left to look at.
There was a period of time that I did some work at home for HVS - but probably not much. I know I had an Alpine board set up in my Chicago apartment at one time, but I may have just used it to mess around. We moved stuff around on floppies back then, so it wasn't like I could just do a "git pull" to sync stuff remotely.

We spent a fortune on FedEx to submit milestones (often driving to O'Hare late at night to drop the package at the airport, the latest possible time you could overnight a package).

The mystery hard drive I have is an old internal SCSI drive. I am working on getting an internal SCSI controller card to see if there is anything on it.

As for the rights to everything... I can say that the HVSCMAZE demo source code was from a demo published in Dr. Dobb's Journal, so as far as the original C code it's probably safe to distribute (I am not a lawyer etc).

Scott

Posted: **Thu Jul 17, 2014 9:15 am**

gpumgr wrote:
MegaData wrote:
gpumgr wrote:... so it would be great to see the code.
Hello Scott. I've been saying this for years.
Well, as Rene knows, I have one hard drive left to look at.
There was a period of time that I did some work at home for HVS - but probably not much. I know I had an Alpine board set up in my Chicago apartment at one time, but I may have just used it to mess around. We moved stuff around on floppies back then, so it wasn't like I could just do a "git pull" to sync stuff remotely. We spent a fortune on FedEx to submit milestones (often driving to O'Hare late at night to drop the package at the airport, the latest possible time you could overnight a package).

The mystery hard drive I have is an old internal SCSI drive. I am working on getting an internal SCSI controller card to see if there is anything on it.

As for the rights to everything... I can say that the HVSCMAZE demo source code was from a demo published in Dr. Dobb's Journal, so as far as the original C code it's probably safe to distribute (I am not a lawyer etc).

Scott

Hi, just get a cheap Adaptec controller from eBay. I've got some lying around here, but shipping to the US is more expensive than getting one over there.

Hello from Scott
