Quick math benchmark for you to run

rharder · Aug 10, 2001

If you've been following the "Arg! Twice as slow as Pentium" thread, you'll be familiar with my discovery that my 500Mhz G4 was taking twice as long as a Pentium to do addition and multiplication of doubles. This was using Project Builder under OS X and Yellow Box for Windows NT.

I've created a very small benchmark program that performs 10 million operations. Actually it's advancing a random number stream, but that's not important for this benchmark.

It's a bunch of additions and multiplications of doubles and longs.

Please try it out and let me know how fast it ran on whatever system you want to compile it on.

There's just one file: quickbench.c. You can compile it with

Code:

% gcc quickbench.c

on most systems. You might also try cc quickbench.c.

Using the Cygwin bash shell and gcc compiler on a 733Mhz Pentium III, this test took 3.0 minutes. That's considerably longer than the 1.8 minutes it took using Yellow Box on Windows NT. I'm starting to wonder how much compilers affect the efficiency of code.

-Rob

rharder · Aug 10, 2001

Rerunning this trimmed-down code in Yellow Box for Windows now takes 2.0 minutes. Weird.

-Rob

theed · Aug 10, 2001

This program takes about 3 minutes to run on some systems.

All that math was run in 1.6 minutes.

Dual 450 (only 1 processor was in use) running under classic. Compiled by CodeWarrior Pro 5. No particular optimization, but with code this concise it seldom makes a difference.

As for the creep up in Windows, there is a memory access bottleneck that occurs with really tight code under the x86 architecture. Contiguous locations in RAM are slow to read. That's my best guess. I remember the guys at Be bitching about it.

theed · Aug 10, 2001

Same executable on a 344MHz G3 upgrade took 2.2 minutes. 9.1 native.

Maybe Apple is trying to get sales of CodeWarrior up?!?

theed · Aug 10, 2001

This program takes about 3 minutes to run on some systems.

All that math was run in 3.7 minutes.

Same dual 450 used before, cc, no optimization flags, only one processor was in use during the test. (I watched)

I'll bet you $10 that if you port this puppy to java and run it there it'll be faster than cc and gcc. The java runtime is tight. I'd do it myself, but I don't even remember how to make a class in java right now (hangs head in shame)

Kartoffel · Aug 10, 2001

I'm nowhere near a Mac currently, but here's my results with Cygwin in Win98 on a 700MHz P3:
<pre>
% gcc -o qb0.exe quickbench.c
% gcc -O2 -o qb2.exe quickbench.c
% gcc -O3 -o qb3.exe quickbench.c
% ll qb*
-rwxr-xr-x 1 kart 544 21082 Aug 10 13:23 qb0.exe
-rwxr-xr-x 1 kart 544 20570 Aug 10 13:23 qb2.exe
-rwxr-xr-x 1 kart 544 21594 Aug 10 13:23 qb3.exe

% time ./qb0.exe
This program takes about 3 minutes to run on some systems.
All that math was run in 3.0 minutes.
real 3m1.810s
user 0m0.000s
sys 0m0.000s

% time ./qb2.exe
This program takes about 3 minutes to run on some systems.
All that math was run in 2.2 minutes.
real 2m13.630s
user 0m0.000s
sys 0m0.000s

% time ./qb3.exe
This program takes about 3 minutes to run on some systems.
All that math was run in 1.6 minutes.
real 1m35.350s
user 0m0.000s
sys 0m0.000s

</pre>
Moral of the story - it pays to optimize, even with cruddy old gcc on Windows 98 ;-)

Theed: I'll post some benchmarks under BeOS tonight.

theed · Aug 10, 2001

Be fixed the issue. If you're on Intel, memory accesses that seem contiguous are actually interspersed in memory by some pattern, so that an application's sequential memory accesses bounce to different memory spaces. Windows has apparently been doing something similar for some time. Since this is done by the OS, it took writing another Intel OS to rediscover the issue. I can only imagine Linux has solved the same issue. But it was a cheesy guess as to why seemingly more concise code would slow down 10%.

I'd love to see Be beat the snot out of windows on the same hardware. Please do try that puppy.

I'll be waiting.

Can any one write this thing as a java dealy? I really don't feel like re-learning java tonight.

My woman hates when I do things like that on Friday night.

and more CWPro5 on classic on a 450 G4:
1.3 min fully optimized generic PPC
1.2 min g4 optimized + peephole optimization. I really didn't think that would help so much. Thanks for ... antagonizing I guess.

Kartoffel · Aug 10, 2001

More results!

BeOS, 600 MHz P3, gcc 2.9-beos-000224

% time ./qb0
real 4m16.921s
user 4m8.320s
sys 0m8.086s

% time ./qb2
real 2m43.868s
user 2m38.561s
sys 0m5.012s

% time ./qb3
real 2m17.417s
user 2m12.906s
sys 0m4.248s

BeOS 1200 Mhz Thunderbird, gcc 2.9-beos-000224

% time ./qb0
real 1m47.667s
user 1m43.153s
sys 0m0.741s

% time ./qb2
real 1m16.493s
user 1m13.279s
sys 0m0.533s

% time ./qb3
real 0m49.667s
user 0m47.153s
sys 0m0.341s

MacOS X, 500 Mhz G3 (iBook), gcc 2.95.2

% time ./qb0
real 3m19.116s
user 3m17.120s
sys 0m0.310s

% time ./qb2
real 1m33.585s
user 1m32.500s
sys 0m0.110s

% time ./qb3
real 1m33.578s
user 1m32.770s
sys 0m0.080s

Or to put it another way, let's compare how many clock cycles each machine took. (I just multiplied the CPU speed by the time it took... yeah, i know it's rough. Also note that on the Mac there's no real difference between -O2 and -O3)

the -O0 case:
MacOS X, 500 MHz G3: 3m17s (~9.85e10 cycles)
BeOS, 600 MHz P3: 4m8s (~14.8e10 cycles)
Win98, 700 MHz P3: 3m1s (~12.7e10 cycles)
BeOS, 1200 MHz Tbird: 1m43s (~12.4e10 cycles)

the -O2 case:
MacOS X, 500 MHz G3: 1m33s (~4.65e10 cycles)
BeOS, 600 MHz P3: 2m38s (~9.48e10 cycles)
Win98, 700 MHz P3: 2m13s (~9.31e10 cycles)
BeOS, 1200 MHz Tbird: 1m13s (~8.76e10 cycles)

the -O3 case:
MacOS X, 500 MHz G3: 1m32s (~4.60e10 cycles)
BeOS, 600 MHz P3: 2m13s (~7.98e10 cycles)
Win98, 700 MHz P3: 1m35s (~6.65e10 cycles)
BeOS, 1200 MHz Tbird: 0m47s (~5.64e10 cycles)

Kartoffel · Aug 10, 2001

Optimizing is critical for good performance on <i>any</i> system.

Even though a riced up thunderbird finished the bench mark a lot faster than the iBook, the PPC processor performed the same task using fewer clock cycles.

Some of it's caused by the varying degrees of optimization that are possible on each arch. But still -- Megahertz really isn't the final measure of speed. Apple's "megahertz myth" claim survives unscathed! At least in this test, they're right.

Notice that the AMD Thunderbird is getting more work done per clock cycle than the Pentium 3's, too ;-)

<slashdot_mode>wooh, this is my 42nd post!</slashdot_mode>

.dev.lqd · Aug 11, 2001

[localhost:~] crim% ./quickbench &
[3] 321
[localhost:~] crim%
This program takes about 3 minutes to run on some systems.
./quickbench &
[4] 322
[localhost:~] crim%
This program takes about 3 minutes to run on some systems.

All that math was run in 1.6 minutes.

All that math was run in 1.6 minutes.

That's -two- instances finishing SIMULTANEOUSLY in 1.6 minutes. Both processes hovered around 95% CPU utilization, but then again they were two separate processes vying for clock cycles.

That was done on a G4-500DP. Were someone to multithread this bitch, I could imagine us easily clocking in around 40 seconds. Whoever had that dual800 system should get in on this, if we really want to spank them

Kartoffel · Aug 11, 2001

*.dev.lqd said: That's -two- instances finishing SIMULTANEOUSLY in 1.6 minutes. Both processes hovered around 95% CPU utilization, but then again they were two separate processes vying for clock cycles

The benchmark has one thread. You ran 2 instances of the bench mark simultaneously on a system with 2 CPUs.

Similar results are possible on a dual P3. The following is from the same 600MHz P3 I used for the single-process tests. It can finish a single instance of quickbench in about 2m17s.

[~/qb]$ ls
dual qb0 qb2 qb3 quickbench.c

[~/qb]$ cat dual
#!/bin/sh
echo -n "Starting first benchmark..."; ./qb3 &
echo -n "Starting second benchmark..."; ./qb3

[~/qb]$ time ./dual
Starting first benchmark...Starting second benchmark...
This program takes about 3 minutes to run on some systems.
This program takes about 3 minutes to run on some systems.

All that math was run in 2.4 minutes.
All that math was run in 2.4 minutes.

real 2m24.271s
user 4m27.284s
sys 0m9.368s

Playing with computers is fun. Anyone wanna loan me a dual G4?

.dev.lqd · Aug 11, 2001

*Kartoffel said: The benchmark has one thread. You ran 2 instances of the bench mark simultaneously on a system with 2 CPUs. Similar results are possible on a dual P3. The following is from the same 600MHz P3 I used for the single-process tests. It can finish a single instance of quickbench in about 2m17s.

Did I insinuate otherwise?

I believe my words were "were someone to multithread this bitch..." as in modify the code to branch off and run two beasties at once, I would expect a significant performance boost. I don't have any experience with writing multithreaded code... so I was hoping someone would step up to the plate so I could get some example code whose basis I already understand

Kartoffel · Aug 11, 2001

No problem, dev.lqd... i didn't mean to imply that you didn't understand

Just saying that the behaviour you pointed out is not specific to MacOS X.

rharder · Aug 13, 2001

This is fascinating.

I thought I'd told Project Builder on OS X to use Level 3 optimization but saw no improvement in speed. Anyone care to replicate this?

I'll start working on the Java version...

-Rob

rharder · Aug 13, 2001

Here's the java version along with the c version.

It took 4.2 minutes on a 733Mhz PIII with JDK1.4 beta. It took just as long with the -O optimization flag, which is not surprising since modern JVMs do a lot of their own optimizing anyway.

-Rob

rharder · Aug 13, 2001

I can confirm on a 733Mhz PIII that the C version took 1.6 minutes with -O3 and 3.0 minutes without it under the Cygwin bash shell and gcc compiler.

-Rob

knighthawk · Aug 14, 2001

I decided as an experiment to put a Cocoa frame around Quickbench. I wanted to have a progress bar, and in the code I have it incrementing, but the calculations take up all of the processing so it does not show up until the end. (useless, yeah). If I forced Cocoa to update the Progress Bar, it would take about 11 minutes longer, so I commented out the force-update in the source.

Attached is the compiled app. In the next post, I will attach the PB/IB source.

The results?

1.983 minutes on a G4 400mhz AGP desktop model.

Playing around, I discovered a big difference in benchmark times when you change the target from development to deployment (with optimization 3 and debug messages off). It was about 4.35 minutes with the development target.

I attached the compiled app so that other people that are not programmers could try out this benchmark and post their results.

knighthawk · Aug 14, 2001

This is the Project Builder and Interface Builder source code for the Quickbench.

I stuffed both of these files using DropStuff 6.01, so they should be able to be opened on any OS X Mac. They are really SIT files, not ZIPs.

knighthawk · Aug 14, 2001

oops, I hit the "Submit", not the Browse...

HERE is the source

rharder · Aug 14, 2001

Thanks, knighthawk.

I think we're learning a valuable lesson about how to do a final compile on our code! Who knew? Okay, optimization levels are well-known, but Wow! what a difference in performance!

And it appears that the G4 is not twice as slow as the Pentium, which is what started this whole thing.

-Rob
