[Simh] Simh execution speed improvement - about 45% improvement.

Fri Jan 11 14:33:59 EST 2008

Hi!

I would like to share with the simh community a few small changes that produce a faster simh executable.  In my own tests, performance went from 17.2 VUPS to 25.0 VUPS on the same machine, an execution speed improvement of about 45%.
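For reference, the 45% figure is just the ratio of the two VUPS readings.  A minimal sketch of the arithmetic (the function name speedup_percent is my own, for illustration only):

```c
#include <assert.h>

/* Percentage gain when a throughput figure rises from `before` to `after`. */
double speedup_percent(double before, double after)
{
    assert(before > 0.0);  /* a zero or negative baseline has no meaningful ratio */
    return (after / before - 1.0) * 100.0;
}
```

Calling speedup_percent(17.2, 25.0) gives a little over 45, which matches the figure quoted above.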

First, I optimised the gcc command line in the makefile so that the compiler can use the newer instructions of the host machine's CPU.

By default, GCC targets a 486-class CPU as its machine architecture.  (This is in the Intel world, of course.  The -march value I use below does not apply to an AMD CPU, but GCC supports various -march arguments, including ones that enable AMD64 instructions.)

But that default does not permit the use of MMX instructions, so I ran various simulator performance tests while varying GCC's -march parameter.

The best results were obtained by specifying -march=pentium2, and also -mfpmath=387 (so that floating-point arithmetic is done by the x87 floating-point unit; note that GCC already defaults to this on 32-bit x86 targets).

I believe those two parameters are almost universally supported, as Pentium I class machines are now quite obsolete, especially for running a simh instance.  They improved execution performance by 10% on my setup, which is hosted on a 2.4GHz quad core CPU with 4MB of cache per core and a 1066MHz FSB, running Knoppix Linux 5.1 as the host OS.
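One way to check what a given -march / -mfpmath combination actually enables is to look at the macros GCC predefines for the target.  The sketch below relies on GCC's __MMX__, __SSE__, __SSE2__, __SSE_MATH__ and __SSE2_MATH__ macros; the function names simd_level and fp_unit are hypothetical, and other compilers may not define the same macros.

```c
/* Highest x86 SIMD extension the compiler was allowed to use,
   judged from GCC's predefined target macros. */
const char *simd_level(void)
{
#if defined(__SSE2__)
    return "SSE2";
#elif defined(__SSE__)
    return "SSE";
#elif defined(__MMX__)
    return "MMX";
#else
    return "none";
#endif
}

/* Unit used for scalar floating point: SSE registers or, on x86, the x87 FPU. */
const char *fp_unit(void)
{
#if defined(__SSE2_MATH__) || defined(__SSE_MATH__)
    return "sse";
#elif defined(__i386__) || defined(__x86_64__)
    return "387";
#else
    return "non-x86";
#endif
}
```

Printing both strings from a small main, once with the default flags and once with -march=pentium2 -mfpmath=387, shows whether the flags changed anything on a particular host.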

Another parameter I changed on the GCC command line was the optimisation level, from 2 (-O2) to 9 (-O9).  GCC treats any level above 3 as -O3, so this is equivalent to -O3.  This also gave me some performance improvement.

The complete line defining CC in my modified makefile is:

CC = gcc -march=pentium2  -mfpmath=387 -std=c99 -O9 -U__STRICT_ANSI__ -g -lm -lrt $(OS_CCDEFS) -I .

(Also note the addition of -lrt, which links in the real-time library and so eliminates an error when building.)
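For context, on glibc versions before 2.17 the POSIX clock functions such as clock_gettime live in librt, which is the usual reason -lrt is needed.  Which simh routine required it in my build is not shown here, so take the sketch below only as an illustration of that kind of call; the helper names monotonic_now and elapsed_ms are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* expose clock_gettime under -std=c99 */
#endif
#include <time.h>

/* Read the monotonic clock; returns 0 on success, -1 on failure. */
int monotonic_now(struct timespec *ts)
{
    return clock_gettime(CLOCK_MONOTONIC, ts);
}

/* Milliseconds elapsed between two timestamps. */
double elapsed_ms(const struct timespec *start, const struct timespec *end)
{
    return (end->tv_sec - start->tv_sec) * 1000.0
         + (end->tv_nsec - start->tv_nsec) / 1.0e6;
}
```

The same two helpers are handy for timing a benchmark run before and after changing the compiler flags.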

After that, I compiled simh with profiling code included (by adding -pg to the gcc command line).  Studying the profile, I saw that the functions that read and write RAM and ROM account for about 50% of the execution time.  Since calling a function adds the overhead of saving and restoring host CPU registers on the stack, this significantly slows down the simulator.

How can we get rid of this execution overhead while keeping the structure and modularity of the source code?

The solution is to declare those functions as inline!

So, I modified the file vax_cpu.c like this:

inline extern int32 Read (uint32 va, int32 lnt, int32 acc);
inline extern void Write (uint32 va, int32 val, int32 lnt, int32 acc);
inline extern int32 ReadB (uint32 pa);
inline extern void WriteB (uint32 pa, int32 val);
inline extern int32 ReadLP (uint32 pa);
inline extern int32 Test (uint32 va, int32 acc, int32 *status);

(I just added the keyword inline in front of the declarations of those 6 functions, and in front of the functions themselves.)

The diff between my modified vax_cpu.c and the official one in simh37-3.zip gives:

353,358c353,358
< extern int32 Read (uint32 va, int32 lnt, int32 acc);
< extern void Write (uint32 va, int32 val, int32 lnt, int32 acc);
< extern int32 ReadB (uint32 pa);
< extern void WriteB (uint32 pa, int32 val);
< extern int32 ReadLP (uint32 pa);
< extern int32 Test (uint32 va, int32 acc, int32 *status);
---
> inline extern int32 Read (uint32 va, int32 lnt, int32 acc);
> inline extern void Write (uint32 va, int32 val, int32 lnt, int32 acc);
> inline extern int32 ReadB (uint32 pa);
> inline extern void WriteB (uint32 pa, int32 val);
> inline extern int32 ReadLP (uint32 pa);
> inline extern int32 Test (uint32 va, int32 acc, int32 *status);
379c379
< int32 get_istr (int32 lnt, int32 acc);
---
> inline int32 get_istr (int32 lnt, int32 acc);
2922c2922
< int32 get_istr (int32 lnt, int32 acc)
---
> inline int32 get_istr (int32 lnt, int32 acc)

Note the added inline keywords.

This resulted in about a 35% speed improvement!  More profiling and inlining could be done, but I did not pursue that further.
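As a general point about inline in C: without link-time optimisation, a compiler can only inline a call when the body of the called function is visible while the caller is being compiled.  A common way to arrange that is to define small accessors as static inline in the same file, or in a shared header.  The sketch below illustrates that pattern only; mem, MEM_SIZE, mem_read_byte, mem_write_byte and mem_checksum are hypothetical names and are not simh's actual memory routines.

```c
#include <stdint.h>

#define MEM_SIZE 4096u

/* Hypothetical simulated RAM, for illustration only. */
static uint8_t mem[MEM_SIZE];

/* Small accessors defined in the same translation unit as their callers,
   so the compiler is free to expand them in place. */
static inline uint8_t mem_read_byte(uint32_t pa)
{
    return mem[pa % MEM_SIZE];
}

static inline void mem_write_byte(uint32_t pa, uint8_t val)
{
    mem[pa % MEM_SIZE] = val;
}

/* Hot loop: each read below can be inlined, avoiding a call per byte. */
uint32_t mem_checksum(uint32_t start, uint32_t count)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < count; i++)
        sum += mem_read_byte(start + i);
    return sum;
}
```

Whether the compiler really inlines a given call still depends on the optimisation level and its own heuristics; inline is a hint, not a guarantee.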

One thing to note is that declaring a function inline causes its executable code to be copied to every place where the function is called.  The function's code is then no longer shared between call sites, but the overhead of the function call is eliminated.  The net result is that the size of the SIMH executable went from about 800KB to 1.4MB on the Intel CPU architecture.

To benefit from this speed improvement, you need a host machine that can hold all of this executable code in the CPU's Level 2 cache.  Since an Intel quad core has 4MB of L2 cache, this is not a problem in my setup, but it can be for those who own CPUs with 1MB of L2 cache.

Also, just to give a little glimpse of my setup, I am running 6 instances of simh-vax that form two VMS clusters, running VMS 5.5-2 and sharing about 30 disks in total, with an INGRES VMS database and an ORACLE VMS client to an external ORACLE server.

Those 6 instances are running on two desktops (3 instances of simh per desktop), each having a quad core CPU with 3 ethernet cards.

This is the output of a show cluster from the main cluster (5 instances of simh):

View of Cluster from system ID 1042  node: REGIE           11-JAN-2008 14:10:33
+-------------------+---------+
|      SYSTEMS      | MEMBERS |
+--------+----------+---------+
|  NODE  | SOFTWARE |  STATUS |
+--------+----------+---------+
| REGIE  | VMS V5.5 | MEMBER  |
| EXTRA  | VMS V5.5 | MEMBER  |
| AGENT  | VMS V5.5 | MEMBER  |
| LIVRE  | VMS V5.5 | MEMBER  |
| CYCLE  | VMS V5.5 | MEMBER  |
+--------+----------+---------+

All of this to say that I praise the authors of the very good code that was written for simh!

I would also appreciate any ideas to further improve execution speed of simh.

Very best regards,

Francois Boucher ing.

Université du Québec à Montréal.
