Aqua & UI speed theories

suzerain

Owner, Mac Game Database
Hello.

I've been using Mac OS X now for a month or so on my laptop (Titanium/400). Speed is reasonable... in other words, it's fast enough to use. But we all know that the speed of UI operations isn't what it was on OS 9, or what it is on Windows XP, either.

Now, first of all, Aqua has a lot of overhead that XP and 9 don't, which is the obvious reason for the speed hit. But, rather than comparing it to them, I have a question about how optimized X is right now.

I'm aware that, because of the PDF overhead, each open window consumes something like 1.5 MB - 2 MB of RAM (even if it's just plain text), and it will likely never be as fast as an OS that stores its windows as metadata rather than images. Nevertheless...

Basically, after I had to use 9 the other day and saw how fast it was, I remembered that QuickDraw on OS 9 was hardware accelerated system-wide.

So, the question, if anyone has any insight, is: is Aqua at all hardware accelerated? Seeing the new iMacs yesterday with 32 MB of VRAM made me think it certainly ought to be.

I'm hopeful that in 10.1.2 or some other future version, we might see a drastic speed improvement if Apple offloads the UI drawing to the graphics chip.

Or, if it's already hardware accelerated, then we'll just need to wait for the hardware to catch up to the OS.
 
Well, that's good news, because it means it's theoretically possible for it to be majorly sped up at some point, assuming it's possible to hardware accelerate it...
 
Quartz (not Aqua) is not currently hardware accelerated (at all?). Most video cards don't support 2D acceleration of a lot of what Quartz does, or at least that's how it was explained to me. They do support 3D acceleration (OpenGL), so 3D stuff is accelerated, but for now 2D in OS X is done entirely in software.

Should be interesting if Apple figures out how to hardware accelerate Quartz.
 
Don't take offense at the topic, I just couldn't resist :D

Anyways, I will explain what is actually causing all this... most of the speculation is pure BS, and I have exact reasons why. I have spent over 4 months reverse-engineering the video driver APIs for MacOS X for a Voodoo for X project, so I learned a few things about how MacOS X does things.

1) IOGraphics does all the actual talking to a video card. Quartz is really just a window server that uses PDF. The size of the 'surfaces' or windows has always been rather large when working with 32-bit. QuickDraw used a pixelmap buffer for every single window, so the difference between Quartz's windowing and QuickDraw's windowing really isn't anything important. Win9x/ME/NT/2k and *nix use the same thing, a big buffer o' pixels.

2) IOGraphics *IS* accelerated by the normal 2D acceleration features of a supported video card. However, you still need the bandwidth to push those pixels over the PCI or AGP bus. Since Quartz passes on the rasterized pixels to IOGraphics for blitting to the framebuffer, Quartz is accelerated just as much as QuickDraw is in most cases: at the time of blitting.

3) The reason why live window resize is slow (on any OS) is that the whole window gets pushed across the bus repeatedly. At 30 fps with a 1-5 MB window, that's roughly 30-150 MB/sec of data flying across the bus, which is not fun. Not to mention that Quartz has a little lag time fetching the data for rasterization of the window at the new size.
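For the curious, a quick back-of-the-envelope in C (the window sizes are just examples, assuming 32-bit pixels):

#include <stdio.h>

int main(void)
{
    /* Illustrative window sizes at 4 bytes per pixel (32-bit color). */
    const long small_window = 640L * 480L * 4;    /* ~1.2 MB backing store   */
    const long large_window = 1152L * 870L * 4;   /* ~3.8 MB backing store   */
    const double fps = 30.0;                      /* target live-resize rate */
    const double mb  = 1024.0 * 1024.0;

    printf("small window: %.1f MB/frame, %.0f MB/sec over the bus\n",
           small_window / mb, fps * small_window / mb);
    printf("large window: %.1f MB/frame, %.0f MB/sec over the bus\n",
           large_window / mb, fps * large_window / mb);
    return 0;
}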

4) Quartz itself will never be hardware accelerated unless Apple produces a card that overrides the IOKit video system, since it is already set up for Quartz to use IOGraphics for blitting after it has rasterized. However, utilizing Altivec properly on G4 systems does make a difference.

5) Overall, the best method to store window data is as raw pixelmap data. Sure it is RAM intensive, but it is certainly a lot faster when pumping the data out to video. Raw pixelmap data doesn't need any alteration or rasterization beforehand. Metadata-based systems like PDF need to be pumped through an engine of some kind before being pumped out to video.
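To illustrate point 5, here is a minimal sketch of where the extra stage sits. The names are made up for illustration; this is not the actual Quartz or QuickDraw code path:

#include <string.h>

#define PIXELS (640 * 480)

static unsigned int screen[PIXELS];    /* stand-in for the framebuffer    */
static unsigned int backing[PIXELS];   /* a window's pixel backing store  */

/* Stand-in for a PDF-style description of a window: no pixels yet. */
struct display_list { int ncommands; };

/* Hypothetical engine pass that turns the description into pixels. */
static void rasterize(const struct display_list *dl, unsigned int *out)
{
    (void)dl;
    memset(out, 0xFF, PIXELS * sizeof *out);   /* pretend something was drawn */
}

int main(void)
{
    struct display_list dl = { 3 };

    /* Raw pixelmap window: the pixels already exist, one copy and done. */
    memcpy(screen, backing, sizeof screen);

    /* Metadata (PDF-style) window: an extra engine pass has to run
       before the exact same copy can happen. */
    rasterize(&dl, backing);
    memcpy(screen, backing, sizeof screen);

    return 0;
}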

I hope this clears up some of the issues surrounding 2D acceleration in MacOS X. I personally think we have gotten so used to raw pixelmap QuickDraw in MacOS 9.x that we are still scratching our heads as to why this new system is slower. It is slower, but as it stands, it is really in the hands of the optimizers of the Quartz engine and the video card driver makers to squeeze any more performance out of the same machine. The acceleration is essentially the same as before; the pixels are just going through extra stages of processing before being given to the card.
 
What is it that you expect to be accelerated? About the only thing that is really HW accelerated on video cards right now is 2D blit copy, if I'm stating things correctly. When you drag a thing across your screen, there is one call to the video card to move a rectangle to a new location. Anything that it uncovers has to be redrawn by something. In 9 it would be the app, as the app would be told "hey, this space is dirty, fix it" and then the app would activate to redraw. This live dragging (if you used PowerWindows) was waaaaaay slower than OS X's, as OS X stores each window in the OS whether it's displayed or not. So now when something is uncovered, the OS puts the image in rather than having to activate the app. It's sweet. It's fast. It's live.
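A tiny sketch of the two models being described, with invented function names (this is not the real Window Manager or Quartz interface), just to show who does the redrawing when a region is uncovered:

#include <stdio.h>

struct rect { int x, y, w, h; };

/* OS 9 style: the system only has the screen, so the app must repaint. */
static void app_update_handler(struct rect dirty)
{
    /* The application gets an update event and redraws the exposed area
       itself; nothing appears until the app gets around to it. */
    printf("app redraws a %dx%d region\n", dirty.w, dirty.h);
}

static void os9_uncover(struct rect dirty)
{
    app_update_handler(dirty);    /* round trip through the application */
}

/* OS X style: the server keeps every window's pixels, covered or not. */
static void osx_uncover(struct rect dirty)
{
    /* The window server just re-composites from the stored window buffer;
       the application never has to wake up. */
    printf("server blits a %dx%d region from the window's buffer\n",
           dirty.w, dirty.h);
}

int main(void)
{
    struct rect r = { 0, 0, 300, 200 };
    os9_uncover(r);
    osx_uncover(r);
    return 0;
}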

Now the drop shadows, and anything with translucency, negate the bitmap copy concept, and thus require every pixel displaying more than one layer to be completely re-rendered in software. Eventually, when the video card makers care about 2D again, they'll put n-layer alpha channel hardware acceleration in, which would make Aqua (more appropriately Quartz) supa-fast. But the deal is that 9 didn't live drag, so it was dealing with orders of magnitude less drawing, and XP doesn't do nearly as much translucency, so its windows aren't as taxing to draw, and thus the amount of stuff that's rendered in software for them is minute compared to OS X. That, and Windows tosses interrupts around for its redraw code, which makes the screen seem super responsive, but makes your MP3s skip if you scroll too hard in IE, so that may be part of your experience.
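For what it's worth, here is why translucency defeats a plain rectangle copy: every pixel that shows more than one layer has to be blended arithmetically. This is just a generic source-over blend in software, not Quartz's actual code:

#include <stdint.h>

/* Blend one translucent source channel over an opaque destination channel.
   alpha is 0..255; 255 means fully opaque. A plain blit can't do this:
   the result depends on BOTH layers, so it has to be computed per pixel. */
static uint8_t blend_channel(uint8_t src, uint8_t dst, uint8_t alpha)
{
    return (uint8_t)((src * alpha + dst * (255 - alpha)) / 255);
}

/* Composite one row of a shadow/translucent layer over the pixels below it
   (RGBA, 4 bytes per pixel). */
static void composite_row(const uint8_t *src, const uint8_t *alpha,
                          uint8_t *dst, int npixels)
{
    for (int i = 0; i < npixels; i++) {
        uint8_t a = alpha[i];
        dst[4 * i + 0] = blend_channel(src[4 * i + 0], dst[4 * i + 0], a); /* R */
        dst[4 * i + 1] = blend_channel(src[4 * i + 1], dst[4 * i + 1], a); /* G */
        dst[4 * i + 2] = blend_channel(src[4 * i + 2], dst[4 * i + 2], a); /* B */
        dst[4 * i + 3] = 0xFF;                          /* screen stays opaque */
    }
}

int main(void)
{
    uint8_t shadow[8] = { 0 };                           /* 2 black shadow pixels */
    uint8_t alpha[2]  = { 128, 64 };                     /* ~50% and ~25% opaque  */
    uint8_t screen[8] = { 255, 255, 255, 255, 255, 255, 255, 255 }; /* white backdrop */

    composite_row(shadow, alpha, screen, 2);
    return 0;
}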

I'd like the scrolling of windows to be multi-threaded, in that I want instant response from the things that are HW accelerated, and let the rest catch up. But this would likely make for uglies while moving things. And that would look as bad or worse than the speed hit that dragging currently entails.

All said, live dragging is hard. Live dragging with translucency is ludicrous. This is OS ludicrous. I love it. ... I'm ludicrous. :)
 
Yes, essentially the only 2D acceleration features of a card are for 2D blitting to the card's framebuffer from RAM, from the card to another part on the card, and from the card to RAM. Also, you can do fill blits (such as filling a rect with all black for example) which are very fast.
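A fill blit is about the simplest operation a 2D engine offers. In software it amounts to something like this (generic sketch, not any particular card's driver call):

#include <stdint.h>
#include <stddef.h>

/* Software equivalent of a 'fill blit': stamp one 32-bit color into a
   rectangle of the framebuffer. With 2D acceleration, the driver hands
   the rect and the color to the chip instead of looping here. */
static void fill_rect(uint32_t *fb, size_t fb_width,
                      size_t x, size_t y, size_t w, size_t h, uint32_t color)
{
    for (size_t row = 0; row < h; row++) {
        uint32_t *line = fb + (y + row) * fb_width + x;
        for (size_t col = 0; col < w; col++)
            line[col] = color;
    }
}

int main(void)
{
    static uint32_t framebuffer[640 * 480];
    fill_rect(framebuffer, 640, 10, 10, 100, 50, 0x00000000u);  /* black rect */
    return 0;
}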

One example of good OS 9 live dragging/resizing is Hotline 1.8.5, but you will notice it doesn't EVER redraw until it is done resizing, and even then it can get a little jittery on older systems.
 
I posted before I saw your post, so now I'll ask you what you think:

Much of the 2D blitting is VRAM to VRAM and shouldn't touch the PCI or AGP bus.

There are many levels of bitmaps that might get displayed onscreen; could more of those be cached on the video card itself for improved performance? By Quartz, that is, not by the application or some ugly hack.

There are some performance improvements that we can expect just from software tweaking, but major improvements require vid-card makers to introduce new instructions / APIs to their cards.
 
Originally posted by theed
Much of the 2D blitting is VRAM to VRAM and shouldn't touch the PCI or AGP bus.

Bzzt, not quite. 2D and 3D data have to be processed by the CPU before being passed on to the card. This means that there has to be a RAM representation that is pushed across the bus. So even if we drew directly to VRAM (caching) and then moved the image, it would actually be SLOWER than just storing it in RAM and blitting directly to the card's framebuffer (2 blits vs 1). RAM is faster than pushing data to the card, so it is better to render it there.
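To make the '2 blits vs 1' accounting concrete, here is a toy version of the count for a window whose contents just changed (purely illustrative; the names are made up):

#include <stdio.h>

/* Bus/copy work needed to get an UPDATED window onto the screen.
   cached_in_vram: the window's pixels are parked in a VRAM cache.
   The point: if the contents changed, the new pixels have to cross
   the bus anyway, so a VRAM cache only adds a second copy on top. */
static int copies_for_update(int cached_in_vram)
{
    if (!cached_in_vram)
        return 1;   /* render in RAM, one RAM -> framebuffer blit        */
    return 2;       /* RAM -> VRAM cache, then VRAM cache -> framebuffer */
}

int main(void)
{
    printf("no VRAM cache, contents changed:   %d copy/copies\n", copies_for_update(0));
    printf("with VRAM cache, contents changed: %d copy/copies\n", copies_for_update(1));
    /* Only when the cached pixels did NOT change (scrolling, dragging)
       does the on-card copy win, which is the case taken up below. */
    return 0;
}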

So, with the layers (IOKit <- IOGraphics <- Quartz), it is just more efficient to store a RAM representation, and blit to the card using IOGraphics when needed.


There are many levels of bitmaps that might get displayed onscreen; could more of those be cached on the video card itself for improved performance? By Quartz, that is, not by the application or some ugly hack.

Nope, for similar reasons as above. Even if you cached it, you still have the problem of what happens when it changes. Since blitting currently only occurs when something changes, you still run into the 2 blits vs 1 blit issue like above. Keeping it in RAM and then blitting to the card is the best way.


There are some performance improvements that we can expect just from software tweaking, but major improvements require vid-card makers to introduce new instructions / APIs to their cards.

Introducing new APIs to the cards would require Apple to write a new driver structure and change the whole IOGraphics setup. Things are currently so set in stone across many OSes (and Macs are a small slice of ATi's and NVidia's sales) that you probably won't see improvements that affect OS X specifically (PDF stuff).

Personally, I think people should realize that live redraw in 32-bit is just plain slow. Yes, there are ways to speed it up, but all of them are hardware changes. A faster AGP bus, a better system bus, a better CPU, better RAM, a new vid card API... but the truth is, no matter what, pushing over a MB of data 4 bytes at a time across an AGP bus multiple times is not going to be as fast as people think it should be for a while. Just give it time. The G5 will probably boost it, and so will other things; it is just that right now, the only tweaks that will help older systems are software tweaks. Hardware tweaks will only affect new hardware, and possibly new cards, but don't expect new APIs on cards anytime soon.
 
Don't get me wrong, I have all the respect in the world for you Krev, but I just don't buy the 2 blits vs 1 thing.

Everything I've read, from BeOS drivers through mkLinux, says that on an opaque section, like the contents of a web browser window, when you scroll the slider, the content that doesn't change and simply moves is blitted VRAM to VRAM with a single API call. Then the new section that wasn't previously onscreen will be copied from RAM or created on the fly by the application that owns the window. This is the very nature of hardware acceleration, and this much IS done on anything that runs X. I've felt how X Windows runs when it copies from RAM like you state; you can actually see the window redraw. OS X has very basic HW acceleration done.
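Sketched out, the scroll path described above looks roughly like this. The hw_* calls are placeholders, not actual IOGraphics or X11 entry points:

#include <stdio.h>

struct rect { int x, y, w, h; };

/* Stub for the card's on-card copy (no bus traffic in the real thing). */
static void hw_screen_to_screen_copy(struct rect src, struct rect dst)
{
    printf("on-card copy %dx%d from (%d,%d) to (%d,%d)\n",
           src.w, src.h, src.x, src.y, dst.x, dst.y);
}

/* Stub for pushing freshly drawn pixels across the bus. */
static void hw_copy_from_ram(const void *pixels, struct rect dst)
{
    (void)pixels;
    printf("bus copy into the %dx%d exposed strip at (%d,%d)\n",
           dst.w, dst.h, dst.x, dst.y);
}

/* Scroll a view up by dy: one screen-to-screen blit moves what is already
   on the card, then only the newly exposed strip gets redrawn and pushed. */
static void scroll_view(struct rect view, int dy, const void *new_strip)
{
    struct rect src = { view.x, view.y + dy, view.w, view.h - dy };
    struct rect dst = { view.x, view.y,      view.w, view.h - dy };
    hw_screen_to_screen_copy(src, dst);

    struct rect exposed = { view.x, view.y + view.h - dy, view.w, dy };
    hw_copy_from_ram(new_strip, exposed);
}

int main(void)
{
    struct rect view = { 0, 22, 640, 400 };
    scroll_view(view, 16, 0);   /* 0 stands in for the redrawn strip */
    return 0;
}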

But since the APIs for different cards do alpha channels differently, and all of them do it weakly, there is not a reasonable native IOGraphics call to hardware accelerate anything with translucency involved.

I'll try to research my points. At this time, I stand in disagreement with some of your points.
 
Don't get me wrong either... I was only trying to counter your statement of 'most blits are VRAM to VRAM' which is bunk. You can only do a screen-to-screen copy with data already in the framebuffer, which definitely accounts for quite a bit less than half of the blitting actually done by an OS/app. But you are absolutely right about scrolling, and even live window dragging takes advantage of this. If you had a cache of constantly changing data, you would encounter the 2 blit problem as I described.

OS X's graphics model also prevents caching, and caching does have problems. You would run into problems if the cache ran out, dumping stuff into RAM, then having to manage RAM/VRAM to put the 'more important' stuff in the VRAM cache... etc. The overhead is generally not worth it.

So the result: You have a screen buffer in VRAM, and your window buffers in RAM (which have to then be pushed with new data/etc).

But the true issue is this: video cards under OS X are just told: Do an X type blit, with a rect of {Y,Z}, with data set T. So, proper optimization of when to blit, and how, is up to the manager on top of IOGraphics (X Windows, Quartz, QuickDraw for Carbon) to determine.
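In other words, the request the card actually sees is roughly this shape. These names are invented for illustration, not real IOGraphics symbols:

#include <stdint.h>

/* Rough shape of a 2D blit request as described above: an operation type,
   a destination rect, and (for copies from RAM) a pointer to the pixels. */
enum blit_op {
    BLIT_FILL,               /* fill the dest rect with a solid color    */
    BLIT_COPY_FROM_RAM,      /* push pixels across the bus into the rect */
    BLIT_SCREEN_TO_SCREEN    /* copy one on-card rect to another         */
};

struct blit_rect { int x, y, w, h; };

struct blit_request {
    enum blit_op     op;
    struct blit_rect dst;
    struct blit_rect src;        /* used for screen-to-screen copies     */
    const uint32_t  *pixels;     /* used for RAM -> framebuffer copies   */
    uint32_t         fill_color; /* used for fills                       */
};

int main(void)
{
    /* e.g. "move this 300x200 window up by 22 pixels" is just one request;
       deciding WHEN and WHAT to blit is up to whatever sits above the
       driver (Quartz, QuickDraw for Carbon, X Windows). */
    struct blit_request move = {
        BLIT_SCREEN_TO_SCREEN, { 0, 0, 300, 200 }, { 0, 22, 300, 200 }, 0, 0
    };
    (void)move;
    return 0;
}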

Sorry if I made a confusing post :p
 
All of these complaints about the live resizing in OS X. Yes, the resizing of a window in OS X does not keep up with the cursor. But, the window(s) being updated do not blink or redraw incessantly like in win2k or xp. It is smooth all the way in X, albeit slower. Also in win2k or XP when dragging a window around, all underlying windows must refresh often, kind of like how a classic window updates in X. OS X does it much better than XP. Try it yourself and you will see.
 
Still doesn't mean there aren't ways to improve it, even if it is the best out there so far. All I am trying to do is explain what is really going on behind the curtain so that people don't bash Apple for the wrong things. Just look at the petition for older ATi cards: most of the bad performance comparisons are against OS 9 on really aging cards, and only the II series is truly unsupported, while they are complaining about the Pro/Mobility series, which is supported in the older machines, just underpowered. Sure, there is room for improvement in the support (which Apple said they won't continue), but they are bashing over a lack of drivers for the II/Pro/Mobility chipsets when only the II+ and the IIc used in the beige Macs and the first couple of iMacs are without true support (AFAIK). It could be bundled into the Pro kexts; I am not sure, since I never bothered to double-check that part.

They should be bashing for poor performance without a good explanation, but they are bashing for lack of support, which is wrong.

This thread is similar (from how I read a couple posts)... they should be bashing for not having comprehensive support for 2D accelerators with the nice features, but they are bashing for lack of acceleration, which is also wrong.

Just my (many) rants.
:p
 
The term 'blitting' (to 'blit', or a 'blitter') is associated with taking a large number of pixels (usually a window or something) and pushing them onto the display or a framebuffer (which stores the pixels for a display). I don't know the origins, but it does date back to the 2D days of sprites :)
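In code, a plain software blit boils down to copying a rectangle of pixels row by row into the framebuffer, something like this generic sketch; 2D acceleration just means the card's own blitter does the copy instead of the CPU:

#include <stdint.h>
#include <string.h>

/* Copy a w x h block of 32-bit pixels from a source pixmap into the
   framebuffer at (dst_x, dst_y), one row at a time. */
static void blit(uint32_t *fb, int fb_width,
                 const uint32_t *src, int src_width,
                 int dst_x, int dst_y, int w, int h)
{
    for (int row = 0; row < h; row++)
        memcpy(fb + (size_t)(dst_y + row) * fb_width + dst_x,
               src + (size_t)row * src_width,
               (size_t)w * sizeof(uint32_t));
}

int main(void)
{
    static uint32_t framebuffer[640 * 480];
    static uint32_t window[320 * 240];      /* a window's backing store */
    blit(framebuffer, 640, window, 320, 100, 50, 320, 240);
    return 0;
}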
 
I'm sorry, but I too must call Krevinek on his info. However, it's understandable, since 99% of the posts on this subject are off the mark.

2D video acceleration relies greatly on VRAM to VRAM operations. When dragging windows around the screen, only the shadows and other translucent areas need to be recalculated and transferred over the bus to the graphics card. The bus is only used for refreshing window contents when it is noted that the contents have changed and have requested an update.

I have yet to find an app in OS X that uses hardware accelerated scrolling. This simply involves caching the scrollable area to VRAM and then blitting the viewable area to the buffer from which the screen is drawn. This is something that Apple needs to tackle in the Quartz API.

These two forms of acceleration are about all that normal video cards offer while in 2D. With our new double-buffered OS relying heavily on translucency, maybe we'll see cards and drivers redesigned to accelerate the new graphics bottlenecks.

Most of the lag is from compositing transparent on-screen elements into double-buffered windows. This is something better handled by the CPU at present, since Macs don't have insanely great bus speeds. It's not that Quartz is poorly optimized, even though there is always room for improvement; it's more that Quartz is trying to do more sophisticated things, which require more processing power. As our computers become faster, this will become a non-issue. What's up for debate is: did Apple make the wrong tradeoff between performance on current hardware and building something that won't need to be re-engineered into something better in the near future...

How's that for a first post ;)
 
I have yet to find an app in OS X that uses hardware accelerated scrolling. This simply involves caching the scrollable area to VRAM and then blitting the viewable area to the buffer from which the screen is drawn. This is something that Apple needs to tackle in the Quartz API.

Take scrolling for a spin on an unsupported card, such as a V3, and you will notice Quartz just does what theed stated earlier: it moves the unchanged content, and then refreshes the 'dirty' region.

It has been confirmed that moving windows/scrolling does use the VRAM to VRAM blit, but the caching you refer to isn't possible with the current IOGraphics interface. Plus, I again state that it would add unneeded complexity, since not everyone has 60+MB of VRAM to blow on caching the contents of a window. You will notice that even ATi/3Dfx support panels in OS 9 don't really support caching of anything but fonts and images (which are static), so this isn't exactly a great loss to OS X.

Acceleration may rely greatly on screen-to-screen blits (VRAM to VRAM, as everyone keeps calling it) for live scrolling and dragging, but that doesn't constitute the majority of the blits done by the OS. In OS X, you will notice that when a moved window is drawn, there are actually 2 blits for each frame: the screen-to-screen blit of the window itself, and then the frame around the window, which includes at least part of the title bar (IOGraphics can be told to blit multiple rects at once, so I am counting that as one blit, even though the card does 4-5). The same goes for scrolling. So even if every single blit were for moving and scrolling, under OS X it is split 50/50. Throw in any sort of changing content (pulsating buttons, progress bars, etc.) and you throw the balance in favor of RAM-to-screen blitting.

Also, cards do offer more than screen-to-screen blitting: they offer a way to quickly and efficiently blit 2D data to the screen. It is easier to take a pixmap and just write the data to the same address space than it is to calculate bounds etc. on your own and write to the appropriate address. All 2D data is given to a specific register on the card, and on 3Dfx cards you just keep writing to that register. The dstX, dstY, dstHeight, and dstWidth registers tell the card how to distribute the pixels. So you do get a nice boost there, and you can see the difference by turning QD acceleration off and on a couple of times on an older Beige G3 or 603/604 Mac with your vid card.
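Roughly, that register-style interface looks like the sketch below. The dstX/dstY/dstWidth/dstHeight names come from the paragraph above, but the layout itself is invented for illustration, not lifted from 3Dfx or ATi docs:

#include <stdint.h>
#include <stdio.h>

/* Toy model of a memory-mapped 2D blit engine: set the destination rect
   registers, then stream every pixel into a single data port and let the
   card work out the addressing on its own. */
struct blit_engine {
    volatile uint32_t dstX, dstY, dstWidth, dstHeight;
    volatile uint32_t data_port;     /* each write feeds the next pixel */
};

static void blit_pixmap(struct blit_engine *card,
                        const uint32_t *pixels, int x, int y, int w, int h)
{
    card->dstX      = (uint32_t)x;
    card->dstY      = (uint32_t)y;
    card->dstWidth  = (uint32_t)w;
    card->dstHeight = (uint32_t)h;
    for (int i = 0; i < w * h; i++)
        card->data_port = pixels[i];   /* the card distributes these itself */
}

int main(void)
{
    static struct blit_engine fake_card;   /* a real driver would map MMIO here */
    static uint32_t window[128 * 64];
    blit_pixmap(&fake_card, window, 10, 10, 128, 64);
    printf("last rect: %ux%u at (%u,%u)\n",
           (unsigned)fake_card.dstWidth, (unsigned)fake_card.dstHeight,
           (unsigned)fake_card.dstX, (unsigned)fake_card.dstY);
    return 0;
}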
 
Heh, I just realized something... something that could explain the performance of live resizing on machines with ATi cards (I got a chance to play with a new G4 with a GeForce 2MX which didn't really have any resizing issues), as well as overall performance of ATi/3Dfx cards vs the PC counterparts.

The thing that everyone, including me, overlooked is the problem of endianness on video cards. All 3Dfx cards are little-endian, most if not all ATi cards are little-endian, and all Mac-compatible NVidia chipsets can be set either way. So the thing is, you get endian swapping overhead when blitting to the card with ATi cards. I don't know when or if ATi actually produced a chip that could be big-endian, but if not, that would help explain some of the performance issues. (And why 3Dfx/ATi cards can never keep up with their PC counterparts on similar hardware: the Mac is doing more work talking to the card.)

As an off-topic trick question for programmers out there: which is faster when endian swapping, shifting/AND-ing/OR-ing or doing multiple loads/stores of each byte? To give you an idea... for a 4-byte endian swap, you need 4 shifts, 4 ANDs, 3 ORs, a load, and a store/output, *or* 4 loads and 4 stores, and possibly a load and an output (depends on the interface/code). Try to take a guess; some might be surprised which is faster (in real conditions).
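For the curious, the two approaches in the trick question look like this in C. Which one wins depends on the CPU and on how the bytes have to reach the bus, so this only states the question, not the answer:

#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Approach 1: shift/AND/OR -- 4 shifts, 4 ANDs, 3 ORs, all in registers. */
static uint32_t swap_shift(uint32_t x)
{
    return ((x >> 24) & 0x000000FFu) |
           ((x >>  8) & 0x0000FF00u) |
           ((x <<  8) & 0x00FF0000u) |
           ((x << 24) & 0xFF000000u);
}

/* Approach 2: load/store each byte in reversed order. */
static uint32_t swap_bytes(uint32_t x)
{
    uint8_t in[4], out[4];
    memcpy(in, &x, 4);
    out[0] = in[3];
    out[1] = in[2];
    out[2] = in[1];
    out[3] = in[0];
    memcpy(&x, out, 4);
    return x;
}

int main(void)
{
    uint32_t v = 0x11223344u;
    printf("%08x -> %08x / %08x\n", v, swap_shift(v), swap_bytes(v));
    return 0;
}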
 
I cannot verify what you state about endianness on video cards, but since they are a hardware solution and the data should be thrown across the bus in a largely parallel manner, all you have to do is wire the pins on the card backwards to change endianness.

And then there's always the fact that the PPC can work in either big or little endian mode, which could be a plus in these circumstances. I have a HARD time believing that this is actually an issue.
 