# Quick math benchmark for you to run



## rharder (Aug 10, 2001)

If you've been following the "Arg! Twice as slow as Pentium" thread, you'll be familiar with my discovery that my 500MHz G4 was taking twice as long as a Pentium to do addition and multiplication of doubles. This was using Project Builder under OS X and Yellow Box for Windows NT.

I've created a very small benchmark program that performs 10 million operations. Actually, each operation advances a random number stream to its next value, but that's not important for this benchmark.

It's a bunch of additions and multiplications of doubles and longs.
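For anyone who wants a feel for what a benchmark of this kind looks like, here is a minimal sketch. It is not the contents of the real quickbench.c: `stream_t`, `stream_next`, `time_stream`, and the constants are all made up for illustration. It simply times a loop of long and double additions and multiplications using `clock()`, which reports CPU time rather than wall-clock time.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for the real generator in quickbench.c:
   one step is a long multiply-add followed by a double multiply-add. */
typedef struct {
    unsigned long state;  /* integer part of the stream (unsigned, so overflow wraps safely) */
    double        accum;  /* running double-precision value, stays below 2.0 */
} stream_t;

static void stream_next(stream_t *s)
{
    /* long multiplication and addition */
    s->state = s->state * 1103515245UL + 12345UL;
    /* double multiplication and addition */
    s->accum = s->accum * 0.5 + (double)(s->state % 1000UL) * 0.001;
}

/* Advance the stream `count` times and return the elapsed CPU time in minutes. */
static double time_stream(stream_t *s, unsigned long count)
{
    unsigned long i;
    clock_t start = clock();

    for (i = 0; i < count; i++)
        stream_next(s);

    return (double)(clock() - start) / CLOCKS_PER_SEC / 60.0;
}
```

A caller would set up `stream_t s = {1UL, 0.0};`, call `time_stream(&s, 10000000UL)`, and print the returned value with `printf("%.1f minutes\n", ...)`. Printing or otherwise using `s.accum` afterwards keeps an optimizing compiler from discarding the loop.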

Please try it out and let me know how fast it ran on whatever system you want to compile it on.

There's just one file: quickbench.c. You can compile it with

```
% gcc quickbench.c
```

on most systems. You might also try *cc quickbench.c*.

Using the Cygwin bash shell and gcc compiler on a 733Mhz Pentium III, this test took 3.0 minutes. That's considerably longer than the 1.8 minutes it took using Yellow Box on Windows NT. I'm starting to wonder how much compilers affect the efficiency of code.

-Rob


----------



## rharder (Aug 10, 2001)

Rerunning this trimmed-down code in Yellow Box for Windows now takes 2.0 minutes. Weird.

-Rob


----------



## theed (Aug 10, 2001)

This program takes about 3 minutes to run on some systems.

All that math was run in 1.6 minutes.

Dual 450 (only 1 processor was in use) running under classic.  Compiled by CodeWarrior Pro 5.  No particular optimization, but with code this concise it seldom makes a difference.

As for the creep up in Windows, there is a memory access bottleneck that occurs with really tight code under the x86 architecture.  Contiguous locations in RAM are slow to read.  That's my best guess.  I remember the guys at Be bitching about it.


----------



## theed (Aug 10, 2001)

Same executable on a 344MHz G3 upgrade took 2.2 minutes.  9.1 native.

Maybe Apple is trying to get sales of CodeWarrior up?!?


----------



## theed (Aug 10, 2001)

This program takes about 3 minutes to run on some systems.

All that math was run in 3.7 minutes.

Same dual 450 used before, cc, no optimization flags, only one processor was in use during the test.  (I watched)

I'll bet you $10 that if you port this puppy to java and run it there it'll be faster than cc and gcc.  The java runtime is tight.  I'd do it myself, but I don't even remember how to make a class in java right now (hangs head in shame)


----------



## Kartoffel (Aug 10, 2001)

I'm nowhere near a Mac currently, but here's my results with Cygwin in Win98 on a 700MHz P3:
<pre> 
% gcc -o qb0.exe quickbench.c
% gcc -O2 -o qb2.exe quickbench.c
% gcc -O3 -o qb3.exe quickbench.c
% ll qb*
-rwxr-xr-x 1 kart   544     21082 Aug 10 13:23 qb0.exe
-rwxr-xr-x 1 kart   544     20570 Aug 10 13:23 qb2.exe
-rwxr-xr-x 1 kart   544     21594 Aug 10 13:23 qb3.exe

% time ./qb0.exe
This program takes about 3 minutes to run on some systems.
All that math was run in 3.0 minutes.
real 3m1.810s
user 0m0.000s
sys 0m0.000s

% time ./qb2.exe
This program takes about 3 minutes to run on some systems.
All that math was run in 2.2 minutes.
real 2m13.630s
user 0m0.000s
sys 0m0.000s

% time ./qb3.exe
This program takes about 3 minutes to run on some systems.
All that math was run in 1.6 minutes.
real 1m35.350s
user 0m0.000s
sys 0m0.000s

</pre> 
Moral of the story - it pays to optimize, even with cruddy old gcc on Windows 98  ;-)

Theed: I'll post some benchmarks under BeOS tonight.


----------



## theed (Aug 10, 2001)

Be fixed the issue. If you're on Intel, memory accesses that seem contiguous are actually interspersed in memory by some pattern, so that an application's sequential memory accesses actually bounce to different memory spaces.  Windows has apparently been doing something similar for some time.  Since this is done by the OS, it took writing another Intel OS to rediscover the issue.  I can only imagine Linux has solved the same issue.  But it was a cheesy guess as to why seemingly more concise code would slow down 10%.

I'd love to see Be beat the snot out of windows on the same hardware.  Please do try that puppy.    I'll be waiting. 

Can anyone write this thing as a java dealy?  I really don't feel like re-learning java tonight.    My woman hates when I do things like that on Friday night.

and more CWPro5 on classic on a 450 G4:
1.3 min fully optimized generic PPC
1.2 min g4 optimized + peephole optimization.  I really didn't think that would help so much.  Thanks for ... antagonizing I guess.


----------



## Kartoffel (Aug 10, 2001)

More results!  

BeOS, 600 MHz P3, gcc 2.9-beos-000224

% time ./qb0
 real    4m16.921s
 user    4m8.320s
 sys     0m8.086s 

% time ./qb2
 real    2m43.868s
 user    2m38.561s
 sys     0m5.012s

% time ./qb3
 real    2m17.417s
 user    2m12.906s
 sys     0m4.248s 

BeOS 1200 Mhz Thunderbird, gcc 2.9-beos-000224

% time ./qb0
 real    1m47.667s
 user    1m43.153s
 sys     0m0.741s

% time ./qb2
 real    1m16.493s
 user    1m13.279s
 sys     0m0.533s

% time ./qb3
 real    0m49.667s
 user    0m47.153s
 sys     0m0.341s


MacOS X, 500 Mhz G3 (iBook), gcc 2.95.2

% time ./qb0
 real    3m19.116s
 user    3m17.120s
 sys     0m0.310s

% time ./qb2
 real    1m33.585s
 user    1m32.500s
 sys     0m0.110s

% time ./qb3
 real    1m33.578s
 user    1m32.770s
 sys     0m0.080s

Or to put it another way, let's compare how many clock cycles each machine took.  (I just multiplied the CPU speed by the time it took... yeah, i know it's rough.  Also note that on the Mac there's no real difference between -O2 and -O3)


the -O0 case:
  MacOS X, 500 MHz G3:  3m17s (~9.85e10 cycles)  
  BeOS, 600 MHz P3:     4m8s (~14.8e10 cycles)
  Win98, 700 MHz P3:    3m1s (~12.7e10 cycles)
  BeOS, 1200 MHz Tbird: 1m43s (~12.4e10 cycles)

the -O2 case:
  MacOS X, 500 MHz G3:  1m33s (~4.65e10 cycles)  
  BeOS, 600 MHz P3:     2m38s (~9.48e10 cycles)
  Win98, 700 MHz P3:    2m13s (~9.31e10 cycles)
  BeOS, 1200 MHz Tbird: 1m13s (~8.76e10 cycles)

the -O3 case:
  MacOS X, 500 MHz G3:  1m32s (~4.60e10 cycles)  
  BeOS, 600 MHz P3:     2m13s (~7.98e10 cycles)
  Win98, 700 MHz P3:    1m35s (~6.65e10 cycles)
  BeOS, 1200 MHz Tbird: 0m47s (~5.64e10 cycles)
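
The arithmetic behind these estimates is just clock rate times elapsed time. A small sketch of that calculation follows; the function names `estimated_cycles` and `print_row` are made up for illustration, and the result is only as rough as the method itself, since it ignores memory stalls, OS overhead, and timer resolution.

```c
#include <stdio.h>

/* Rough cycle estimate: clock rate in MHz times elapsed time in seconds. */
static double estimated_cycles(double mhz, int minutes, int seconds)
{
    return mhz * 1.0e6 * (double)(minutes * 60 + seconds);
}

/* Print one row in the "(~N.NNe10 cycles)" form used in the lists above. */
static void print_row(const char *label, double mhz, int minutes, int seconds)
{
    printf("  %-21s %dm%ds (~%.2fe10 cycles)\n",
           label, minutes, seconds,
           estimated_cycles(mhz, minutes, seconds) / 1.0e10);
}
```

For example, `print_row("MacOS X, 500 MHz G3:", 500.0, 1, 33)` works out to 500e6 x 93 s = 4.65e10 cycles, matching the -O2 figure for the iBook.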


----------



## Kartoffel (Aug 10, 2001)

Optimizing is critical for good performance on <i>any</i> system.

Even though a riced up Thunderbird finished the benchmark a lot faster than the iBook, the PPC processor performed the same task using fewer clock cycles.

Some of it's caused by the varying degrees of optimization that are possible on each arch.  But still -- Megahertz really isn't the final measure of speed.  Apple's "megahertz myth" claim survives unscathed!  At least in this test, they're right.

Notice that the AMD Thunderbird is getting more work done per clock cycle than the Pentium 3's, too ;-)

&lt;slashdot_mode&gt;wooh, this is my 42nd post!&lt;/slashdot_mode&gt;


----------



## .dev.lqd (Aug 11, 2001)

[localhost:~] crim% ./quickbench &
[3] 321
[localhost:~] crim% 
This program takes about 3 minutes to run on some systems.
./quickbench &
[4] 322
[localhost:~] crim% 
This program takes about 3 minutes to run on some systems.

All that math was run in 1.6 minutes.

All that math was run in 1.6 minutes.

That's -two- instances finishing SIMULTANEOUSLY in 1.6 minutes. Both processes hovered around 95% CPU utilization, but then again they were two separate processes vying for clock cycles. 

That was done on a G4-500DP. Were someone to multithread this bitch, I could imagine us easily clocking in around 40 seconds. Whoever had that dual800 system should get in on this, if we really want to spank them


----------



## Kartoffel (Aug 11, 2001)

.dev.lqd said: _That's -two- instances finishing SIMULTANEOUSLY in 1.6 minutes. Both processes hovered around 95% CPU utilization, but then again they were two separate processes vying for clock cycles_

The benchmark has one thread.  You ran 2 instances of the benchmark simultaneously on a system with 2 CPUs.

Similar results are possible on a dual P3.  The following is from the same 600MHz P3 I used for the single-process tests.  It can finish a single instance of quickbench in about 2m17s.

[~/qb]$ ls
dual  qb0  qb2  qb3  quickbench.c

[~/qb]$ cat dual
#!/bin/sh
  echo -n "Starting first benchmark..."; ./qb3 &
  echo -n "Starting second benchmark..."; ./qb3

[~/qb]$ time ./dual
Starting first benchmark...Starting second benchmark...
This program takes about 3 minutes to run on some systems.
This program takes about 3 minutes to run on some systems.

All that math was run in 2.4 minutes.
All that math was run in 2.4 minutes. 

real    2m24.271s
user    4m27.284s
sys     0m9.368s

Playing with computers is fun.  Anyone wanna loan me a dual G4?


----------



## .dev.lqd (Aug 11, 2001)

> The benchmark has one thread. You ran 2 instances of the benchmark simultaneously on a system with 2 CPUs.
> 
> Similar results are possible on a dual P3. The following is from the same 600MHz P3 I used for the single-process tests. It can finish a single instance of quickbench in about 2m17s.



Did I insinuate otherwise?  

I believe my words were "were someone to multithread this bitch..." as in modify the code to branch off and run two beasties at once, I would expect a significant performance boost. I don't have any experience with writing multithreaded code... so I was hoping someone would step up to the plate so I could get some example code whose basis I already understand
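
As a starting point for that kind of example, here is a minimal POSIX threads sketch. It is not derived from quickbench.c: `job_t`, `worker`, and `run_two_threads` are hypothetical names, and the loop body is only a placeholder for the real math. It also assumes the work can be split into two independent streams. A single random number stream is sequential by nature, so the real benchmark may not divide this cleanly.

```c
#include <pthread.h>
#include <stdio.h>

/* Hypothetical work unit: each thread advances its own generator,
   so the threads share no data and need no locking. */
typedef struct {
    unsigned long seed;    /* starting state for this thread's stream */
    unsigned long count;   /* number of steps to advance */
    unsigned long result;  /* final state, filled in by the worker */
} job_t;

static void *worker(void *arg)
{
    job_t *job = (job_t *)arg;
    unsigned long state = job->seed;
    unsigned long i;

    for (i = 0; i < job->count; i++)
        state = state * 1103515245UL + 12345UL;  /* placeholder for the real math */

    job->result = state;
    return NULL;
}

/* Run two jobs on two threads and wait for both.
   Returns 0 on success, -1 if a thread could not be created or joined. */
static int run_two_threads(job_t *a, job_t *b)
{
    pthread_t ta, tb;
    int ra, rb;

    if (pthread_create(&ta, NULL, worker, a) != 0)
        return -1;
    if (pthread_create(&tb, NULL, worker, b) != 0) {
        pthread_join(ta, NULL);   /* do not leave the first thread running */
        return -1;
    }

    ra = pthread_join(ta, NULL);
    rb = pthread_join(tb, NULL);
    return (ra != 0 || rb != 0) ? -1 : 0;
}
```

With gcc this would typically be built with the `-pthread` flag. On a dual-CPU machine each job can run on its own processor, so two half-size jobs should finish in roughly the time one half-size job takes alone, provided nothing else is competing for the CPUs.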


----------



## Kartoffel (Aug 11, 2001)

No problem, dev.lqd... i didn't mean to imply that you didn't understand.

Just saying that the behaviour you pointed out is not specific to MacOS X.


----------



## rharder (Aug 13, 2001)

This is fascinating.

I thought I'd told Project Builder on OS X to use Level 3 optimization but saw no improvement in speed. Anyone care to replicate this?

I'll start working on the Java version...

-Rob


----------



## rharder (Aug 13, 2001)

Here's the java version along with the c version.

It took 4.2 minutes on a 733Mhz PIII with JDK1.4 beta. It took just as long with the -O optimization, which is not surprising since modern JVMs do a lot of their own optimizing anyway.

-Rob


----------



## rharder (Aug 13, 2001)

I can confirm on a 733Mhz PIII that the C version took 1.6 minutes with -O3 and 3.0 minutes without it under the Cygwin bash shell and gcc compiler.

-Rob


----------



## knighthawk (Aug 14, 2001)

I decided as an experiment to put a Cocoa frame around Quickbench.  I wanted to have a progress bar, and in the code I have it incrementing, but the calculations take up all of the processing so it does not show up until the end.  (useless, yeah).  If I forced Cocoa to update the Progress Bar, it would take about 11 minutes longer, so I commented out the force-update in the source.
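
A common way to keep progress reporting cheap is to report only every so many iterations instead of on every pass through the loop. The sketch below shows that idea in plain C. It is not taken from the attached project: `progress_fn` and `run_with_progress` are hypothetical names, the loop body is placeholder work, and the callback stands in for whatever call would redraw the progress bar in the GUI.

```c
#include <stdio.h>

/* Hypothetical callback type: in a GUI app this is where the
   progress bar would be told to redraw. */
typedef void (*progress_fn)(unsigned long done, unsigned long total, void *ctx);

/* Run `total` iterations, reaching a report point only every `stride`
   iterations and once more at the very end. The callback is invoked at
   each report point if it is not NULL. Returns the number of report
   points reached. */
static unsigned long run_with_progress(unsigned long total, unsigned long stride,
                                       progress_fn report, void *ctx)
{
    unsigned long state = 1UL;
    unsigned long reports = 0;
    unsigned long i;

    if (stride == 0)
        stride = 1;   /* avoid dividing by zero below */

    for (i = 1; i <= total; i++) {
        state = state * 1103515245UL + 12345UL;  /* placeholder work */

        if (i % stride == 0 || i == total) {
            if (report != NULL)
                report(i, total, ctx);
            reports++;
        }
    }

    (void)state;  /* the result is not needed in this sketch */
    return reports;
}
```

With 10 million iterations and a stride of 100,000, the bar would be updated 100 times rather than 10 million times, which should cut the redraw overhead by roughly the same factor.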

Attached is the compiled app.  In the next post, I will attach the PB/IB source.  

The results?

1.983 minutes on a G4 400mhz AGP desktop model.

Playing around, I discovered a big difference in benchmark times when you change the target from development to deployment (with optimization 3 and debug messages off).  It was about 4.35 minutes with the development target.

I attached the compiled app so that other people that are not programmers could try out this benchmark and post their results.


----------



## knighthawk (Aug 14, 2001)

This is the Project Builder and Interface Builder source code for the Quickbench.

I stuffed both of these files using DropStuff 6.01, so they should be able to be opened on any OS X Mac.  They are really SIT files, not ZIPs.


----------



## knighthawk (Aug 14, 2001)

oops, I hit the "Submit", not the Browse...

HERE is the source


----------



## rharder (Aug 14, 2001)

Thanks, knighthawk.

I think we're learning a valuable lesson about how to do a final compile on our code! Who knew? Okay, optimization levels _are_ well-known, but Wow! what a difference in performance!

And it appears that the G4 is _not_ twice as slow as the Pentium, which is what started this whole thing.

-Rob


----------



## Matrix Agent (Aug 15, 2001)

2.033 minutes

On the 466iBook running 5F24.


----------



## FrgMstr (Aug 17, 2001)

it took 44 secs on my 1.466Ghz Athlon, could probably shave a few more secs off by shutting down all apps off a fresh reboot.


----------



## FrgMstr (Aug 18, 2001)

I know of someone who has an Athlon @1.8Ghz, I'll try and get him to run it too. Should be very quick methinks; his is using a much higher FSB and DDR RAM, so it will be interesting.

Laterz


----------



## FrgMstr (Aug 18, 2001)

1.8Ghz + DDR + (400MHZ FSB(200DDR Overclocked)) athlon does it in 29 secs flat, not bad eh

rharder, how come you lot ask for an Athlon to do the test, and when I get 'em done no one says a word? Are you all afraid of the Athlon power or something?  Soz, hit a nerve there.


----------



## rharder (Aug 27, 2001)

Sorry, I've been gone for a week, and this thread's been dead for a while.

Good performance though! I didn't think my 1.0Ghz Athlon gave a correspondingly slower performance though. Hmm.

-Rob


----------



## knighthawk (Aug 27, 2001)

rharder, have you tried using "register" keywords?  I played around with it, but I don't think that I was doing it right because it did not seem to increase the speed any.

One funny thing is that I went and changed all of the doubles to floats just to see how the performance changed.  It was actually slower!!!
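
One thing that can contribute to a result like this, offered as a guess since the modified source was not posted: in C, an unsuffixed constant such as `0.5` has type double, so mixing it with a float forces a conversion up to double and back down again on every operation. The function names below are made up for illustration.

```c
/* With a float operand and an unsuffixed constant, C converts the float
   to double, multiplies in double, then narrows the result on return. */
static float scale_mixed(float x)
{
    return x * 0.5;    /* 0.5 is a double constant */
}

/* With an f-suffixed constant the whole expression has type float. */
static float scale_float(float x)
{
    return x * 0.5f;   /* 0.5f is a float constant */
}
```

Both functions return the same value for ordinary inputs; the difference is only in the conversions the compiler has to generate. Whether that accounts for the slowdown seen here would need to be checked against the actual code and the generated assembly.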

Side note:  Apple is rumored to be engineering the G5 with the roadmap saying it will be up to 2+ ghz.


----------



## rharder (Aug 27, 2001)

Yeah, I keep hearing about a 64bit G5 processor at ridiculous speeds. I guess it's inevitable, but I hope it's sooner rather than later.

-Rob


----------

