Software Research
Runtime Object Devirtualization (R.O.D.)
Improving JIT compiler performance by inlining virtual function calls and memory access
|
Author |
Robin H.S. Holmes |
|
Contents |
ROD – A way of making JIT compilers build faster code that is conveniently programmed with existing languages and techniques such as C++ and Object Orient Programming(OOP). |
|
Target Audience |
JIT Compiler Writers, Software Researchers, Digital Audio and Video Experts, Programmers, Compiler Makers, Script Language Designers |
|
Date |
|
|
Status |
Draft |
|
Copyright |
Copyright(c) 2009/2010 Robin Holmes. All rights reserved. This is a public domain specification. This is not intellectual property for anyone or any company or corporation. You can freely use, at no cost or obligations to you or anyone, this technology specification in your own compilers and virtual machines. |
|
Download |
This document is also available as a .pdf download from the downloads list
at http://www.code.google.com/p/juke/. (Look for the
ROD_xyz.zip file.) |
|
Warning |
Researching and making new software can be hazardous to you and your family. |
Table of Contents
1. Terms 2
2. Introduction 3
2.1. Existing optimization solutions and problems 3
2.2. Optimizing OOP(Object Oriented Programming) 3
3. Runtime Object Devirtualization 4
3.1. The “devirt” keyword 4
3.2. Applying devirt to functions 4
3.3. Applying devirt to fields and variables 5
3.4. The Devirtualization Object Table( DOT ) 5
3.5. Usage of devirt keyword 6
3.6. Potential problems when using the devirt keyword 7
4. How it works 8
4.1. The build and rebuild 8
4.2. The “this” or “Self” parameter 8
4.3. Expanding the devirtualization 8
4.4. Managing the rebuilt devirtualized code memory and threads. 8
4.5. A common devirt coding error and how to catch it 9
5. Scenarios 10
5.1. Typical JIT compilation breakdown 10
5.2. Example usage in a source code compiler 11
5.3. Speed up factor estimates 12
6. Links 13
6.1. ROD code example 13
6.2. VM Download implementing ROD 13
1 Terms
Virtual machine – A simulated machine running inside a process on a computer.
CLI – Common Language Infrastructure. An ECMA Standard(ECMA 335) that defines a byte code format and machine state for a virtual machine. The terms CLI and ECMA-335 are used inter-changeably in this document.
CIL – Common Intermediate Language. The verifiable bytecode format used by the CLI.
Garbage Collection – Periodically tracking all unused memory objects and destroying them.
Specialization – Modifying the VM so that it suits better a specific type of computer application, such as a specialized VM for computer games.
JIT – The process of transforming bytecode into machine code at runtime(Just In Time), which allows for more selective compilation by being able to target the features of the specific CPU being used to run the virtual machine, and this in turn can produce more efficient code.
Virtual function – A virtual function is overridable between an ancestor class and its descendants. This means that descendant classes can successively provide new implementations for a function with the same signature. A caller of the function calls the function by its index in a vtable, which is an array of function pointers for each class type. Thus the same code can handle two objects with a common ancestor and the call to any virtual function will call two different function implementations by understanding the common definition.
Inline virtual function – Virtual functions are typically never inlined. An innovation we want to make is to inform the JIT ahead of time about what functions that can be inlined. This is done by informing the JIT of the exact type of some variable or field. The object instances are supplied and the JIT computes the rest. The JIT can use the type information of a class(that it is informed about) to inline the virtual function. There are certain types of DSP applications that would experience huge speed ups because of this innovation.
2 Introduction
Improving the speed at which programs execute is necessary to get the most out of existing computers. Newer computer models are using parallel execution to try and boost speed, but since most software is programed as a linear instruction sequence with many statements dependent on previous statements it is difficult to optimize random bits of code for use with parallel processors. This document deals with a specification that defines how to increase by a wide magin the execution speed of certain kinds of linear instruction sequences.
2.1 Existing optimization solutions and problems
In linear instruction sequences, many methods of optimization exist to make programs run faster, such as inlining function calls, register optimization and unrolling loops. These techniques reduce things like unnecessary memory access, and unnecessary branching.
But inlining is non existent when dealing with virtual functions because exactly which implementation of a virtual function that is used at runtime is only known at runtime by looking on the objects vtable(virtual function list) at a specific index and calling whichever function pointer it finds there. Many types of software applications, such as DSP, Text Processing and graphics programs, are more easily programmed with objects and virtual functions. Using objects and virtual functions might be simpler to program than other techniques such as hand coded assembly, but it comes with a performance hit caused by many function calls and extra memory access.
2.2 Optimizing OOP(Object Oriented Programming)
OOP is a widely used programming technology. For existing static compilers and also JIT compilers the concept of object optimization does not really exist, in theory and in practice. But very often, objects are linked into a tree of objects and then a root function on the tree of objects is called which then traverses the entire tree making some data calculation, such as computing a DSP algorithm or rendering a 3D object as a picture. As the code executes it finds sometimes millions or even billions of virtual function calls that it must blindly call and use up scarce CPU resources doing so. But to date in 2009 there has not been a way to generally program an application like normal with OOP, and then somehow get the objects and virtual functions to become optimized at runtime. Objects and virtual functions cannot be optimized at compile-time because compilers do not know exactly which type of object they might be dealing with.
There are some existing tools that will generate machine code based on various object trees such as modular DSP synthesizers which can generate a massive linear function sequence which looks from the outside like thousands of function calls but can effectively become one function. The problem is that these solutions require specialized programming knowledge and are generally for very specific purposes. For our purposes here, these kinds of automatic code generators are not very sophisticated and we need something mush easier to program, and that can be implemented by modifying existing JIT compilers such as .NET, Java and Mono.
3 Runtime Object Devirtualization
Despite the existing performance problems with trees and structures of objects that use virtual functions, a technique, known as Runtime Object Devirtualization or ROD(or just plain devirtualization or devirt) is specified here which shows how existing programming languages and compilers for JIT virtual machines can be modified and improved by adding one keyword(devirt) that will enable the virtual machine to effectively optimize out and inline large chunks of redundant code at runtime, even though the programmer realizes the code as a network or tree of interlinked objects and virtual functions. It is also important to note that code which can be devirtualized in a devirtualizing virtual machine, can also run as is in a standard virtual machine or for some code such as C++ even be built directly to machine code and will produce the same effect, just that devirtualized code will run much faster. The programmer writes code using objects and virtual functions as per usual.
The magic of devirtualization is actually just a side effect of the JIT knowing about allocated objects during runtime code generation. The JIT is modified so that it can track not only the pseudo stack state of types, but also have some exact object references while doing so. In other words, the JIT makes decisions based not only on types as specified by the programmer, but also based on the specific type of the object allocated at runtime. Any JIT expert will immediately see the huge performance increase of this JIT modification for many kinds of common code problems.
3.1 The “devirt” keyword
Programmers who know C++ and Pascal know keywords like virtual and static. The new keyword proposed here, namely the devirt keyword, is used in a very similar way to existing “modifier” keywords. The devirt keyword can be applied to member functions, global variables and fields. A field is a variable on an object, a global variable is one usable from any function, and a member function is a function on an object.
3.2 Applying devirt to functions
The use of devirt begins with applying it to a function, much like you would specify the virtual keyword on a function in a language like Pascal or C++. A devirt function is seen to the outside world as having the same characteristics as a virtual function, and is called by caller functions in the same way that the caller function would call any normal virtual function. But the function is implemented as a “bridge function” or “stub function” or as Mono programmers will know it a “Trampoline”. We will call it the bridge function. That is, when the calling function calls a devirt function control is transferred to the bridge function, and the bridge function is what handles the magic of devirtualization.
When the bridge function is called, it is passed a specific object as the this parameter, which is the current object instance. The bridge function examines the objects dvtable(list of devirtualized functions) . If the specific devirt function being called has not yet been compiled or must be rebuilt then the bridge function will call other internal functions to perform the devirtualization and only then transfers control to the actual devirtualized function.
Typically the devirt modifier is only applied to root processing functions that in turn call many other virtual functions, and it is those functions that become inlined and optimized as much as possible.
Note that a new devirtualized function is built for each separate and distinct object instance that is passed as the this parameter to a devirt function. Thus devirt functions tend to apply to outer objects that contain lots of other objects and virtual functions, such as a process function in a VST plugin.
3.3 Applying devirt to fields and variables
Programmers will use devirtualization on object trees and networks of objects that do not change often, or sometimes just on one object type with many virtual functions such as a source code parser. Programmers will not use devirtualization on objects that change often, and this is because of how devirtualization works behind the scenes regarding fields and global variables marked “devirtualizable” with the devirt keyword.
A field or global variable marked devirt has the following properties at runtime.
When the field value is read from memory, it is like normal code. Nothing is different.
When the field value is written to in memory, the previous value is removed from the global “devirtualizable object table” or DOT. The DOT is a binary searchable list of all objects that have been inlined and devirtualized. All devirt functions that had previously inlined and devirtualized the old object value of the field or global variable being written to are made redundant so that next time they are called they will be re-devirtualized with the new value being written to memory. The new value is then written to memory. It is not added into the DOT until it is actually used in some devirt function.
If you tried to use devirtualization with an object in a field that was written to every millisecond or some other short period of time, the code would constantly need to binary search the DOT in order to find invalidate all functions using the specific object. This would lead to a big performance hit. Thus the rules from above about using DOT with structures that do not change very often in time is unavoidable in the current design, and rightly so because every new assignment will cause a recompilation of the actual devirt function which takes time as well.
It is the combination of devirt functions, and fields and globals marked devirt that makes it all possible to do so easily from the programmers point of view. By following a few simple rules, lots of easily written OOP virtual code can be optimized to execute a great deal faster than normal code as generated by static compilers such as C++ or Pascal.
3.4 The Devirtualization Object Table( DOT )
The DOT is a binary ordered list of objects that have been used inside devirt functions in the whole of the current program or domain in a JIT. Each item in the DOT stores another unordered list of all the devirt functions where it was used. Thus when a devirt field is modified it can quickly be found in the DOT, and then all functions where it was used can be invalidated. An invalidated devirt function will be re-devirtualized the next time it is called.
3.5 Usage of devirt keyword
.This picture shows the devirt keyword applied to a function called ProcessChain. The keyword is telling the JIT that the function is a devirt function, which means it looks like a virtual to the caller, but the implementation is a stub that calls the actual devirtualized function and also rebuilds the devirtualized function on demand.
Picture 1

Click here to download an example .cpp file showing simple usage of the devirt concept...
Picture2 shows the devirt keyword applied to some fields. When the JIT is devirtualizing a function, it can then further optimize the function when it encounters the fields marked with the devirt keyword. For example, if it is sure which object reference it will be using in an inlined root level function, it can merely refer directly to memory offsets in the devirt fields instead of dereferencing them from the this object instance. In addition it can inline functions from the devirt keywords.
Picture 2

As you can see from the above code snippets, programming devirtualization is remarkably simple and is accomplished with just the one keyword “devirt”. In spite of this simplicity there are a few things that programmers must know and watch out for in order to use devirtualization properly.
3.6 Potential problems when using the devirt keyword
Some programmers might try to use the devirt keyword all the time as often as possible, and this can actually lead to performance speed problems and synchronization errors, because each time the devirtualization must be recomputed, a large number of instructions must be executed in order to properly inline and flatten the object chain or tree. Assigning a new value to a field that is marked devirt and which is used in a devirtualized function, will cause the function to be rebuilt. The devirtualized function must always work with the current pointers because it is taking runtime allocated objects and putting direct references and pointers inside the 386 code.
If the value of a field marked with the devirt keyword is modified indirectly, ie by a pointer to the field address, then the devirtualized function will not know about it and so execute blindly even though the object referred to in the field might no longer exist. Fields marked devirt must be assigned to directly so that the JIT compiler can properly generate code to modify the DOT(Devirtualization Object Table) and all the functions that are linked to the previous value of the field. If this is not done properly it can lead to access violations and unpredictable errors.
Misunderstanding the devirt keyword. The devirt keyword does not magically devirtualize and inline any virtual function, but instead is to be used properly with computationally expensive chains of virtual functions and objects.
The devirt keyword causes the function to be built once for each different object instance passed as the this parameter. The built function is stored per object. Applying devirt to some function that will be called with many different objects will cause possible memory shortage and also a lot of time spent building the code. Generally speaking you apply devirt to an outer level functions which contains other virtual function calls and devirtualizable objects. An example of such an outer level function might be the entry call into a parser object that can parse C, C++ and Pascal. When parsing a C++ file it can optimize the entry function by inlining lots of virtual function calls and object just for the C++ object, and then make a different devirtualized function for say the Pascal parser.
4 How it works
ROD is made possible by JIT compilers, which examine byte code at runtime and then transform the byte code into machine code.
4.1 The build and rebuild
By using the assignment rules and the DOT table, and applying these rules to fields marked with devirt keyword, the JIT compiler can build code that knows when to rebuild the devirtualized function.
4.2 The “this” or “Self” parameter
When a virtual function is called it is passed an object instance(called this) as the first parameter. The passed object is the current context for the virtual function that executes. When the function must be rebuilt, normally because one of the devirt field objects it uses has changed, then the code will pass the current “this” parameter to the JIT and tell it to rebuild the function with devirtualization. Because the JIT knows the “actual” this object that will be used by the specific function call, it can track all references to the object and the type of the object. So when the bytecode specifies to do a virtual function call on the passed object, it can then inline the virtual function call instead, and it is this inlining that causes the removal of lots of redundant instructions. In .NET and Mono an instruction like LDARG.0 applies. The JIT will see this instruction and then check if it knows the current value that will be used in addition to the type of the argument. If it knows the value, then it can inline virtual calls, and it can also put direct memory references into the output machine code. For example, a reference to an integer or float value inside the object can be referenced by its allocated address at runtime, which can save some memory access instructions.
4.3 Expanding the devirtualization
As the JIT is building the function it might come across instructions that load fields and global variables from allocated objects that it knows about. If the fields that it finds in objects that are already allocated(for example the this parameter), are marked devirt, then it can load, track and devirtualize those objects as well. This is recursive up until there are no more devirt fields or objects to be found or until the inlining has gone to deep. Devirtualization functions are special and the JIT allows the inlining to go very deep by default.
Note that when a function is to big to be inlined but it is virtual and would otherwise be inlinable in a devirtualization context, then the function call can be like calling an instance method of an object, which is faster than calling it like a virtual function.
4.4 Managing the rebuilt devirtualized code memory and threads.
The ROD technique is typically used in a multithreaded environment where you have one thread that does the DSP processing and other threads that control the GUI and so on. As the devirt fields in the object tree or chain get modified and the devirt functions get rebuilt, memory must be allocated for new code. The simplest way is to just delete the entire function each time and the reallocate and replace the code pointer in the devirt function stub or bridge. The problem with this is that it would not work well if two threads were to try and call the devirt function. A solution around that would be to make two different copies of the devirt function for each thread or in other words use thread variables. Finally you could just put a critical section around the entire call and then any thread could safely call it.
4.5 A common devirt coding error and how to catch it
The major error that can be easily made with ROD is by assigning a new value to a devirt field from within a devirtualized function that is running and that uses the old value of the devirt field. This error can be easily caught because when the new value is assigned it will cause the JIT or VM to examine the DOT and see if the function is currently running(or trace the stack). Thus if the VM finds that the devirt field is being assigned to from within a devirt function that devirtualized the old value of the field, it can throw an exception and that can only be corrected by properly programming the code to be devirtualized.
5 Scenarios
Some topics dealing with various scenarios.
5.1 Typical JIT compilation breakdown
The JIT enters the function to be devirtualized.
The caller tells the JIT that the this parameter is available as a current live object.
Many references to the this parameter in the JIT can then be optimized.
If the this parameter is used to call a virtual function then the function call can be inlined if appropriate or it can be called as an instance method if to big, or the virtual function can be devirtualized itself because the JIT can just pass along the appropriate current value for the this parameter.
If an object is loaded in the JIT from a devirt field in the live “this” parameter object then it to can be tracked. Virtual calls on the loaded object can then be inlined as well. When inlining function calls like this the devirtualization can proceed into the inlined methods as well as the methods that the inlined methods call.
When the initial devirtualization loop has run its course, there will be a linear list of instructions containing instructions from the devirt function, as well as many instructions at all the points where virtual calls were inlined. This list of instructions can then be put through the standard machine code generator to produce the actual machine code and perform all the register optimization. In other words the register optimization and machine code generation is not something that the devirtualization mechanism would deal with, devirtualization is about inlining function calls and putting direct memory references into the code.
The built machine code function can then be called from the stub or bridge function that is attached to the vtable slot for the relevant devirt function. The code itself will go into the dvtable(Devirtualization vtable) on the object, which is a per-object list of functions that have been devirtualized. For every object used in the devirtualization of a devirt function, the built code will also be attached to that objects place in the global DOT(Devirtualization Object Table) In other words when an object is used by the devirtualization mechanism, it is added to a slot in the global DOT. The slot in the DOT also holds a list of all built devirtualized functions where the object was used. Thus other code that might modify devirt field values can invalidate devirtualized functions via the DOT, and invalidated devirt functions can be rebuilt by the stub on the next call to the devirt function.
Once the code is all built and in place everything works like normal, outside functions call into the devirtualized function by calling the stub on the vtable for the devirt function and it simply transfers control to the devirtualized code, after first checking if the code has been invalidated, in which case a rebuild occurs first.
5.2 Example usage in a source code compiler
A multi-language compiler can be built by having a base class for all the common functions and then derived classes to implement specific language functionality, such as a derived class for the C++ language and another derived class for the Pascal programming language. Once you tell the compiler the list of tokens it must process, it is then a case of calling the entry function on the base class of the compiler. Something like calling a function called “CompileTokens” for example. If the CompileTokens() method is marked as a devirt function, then the JIT can do the following.
Devirtualize the CompileTokens() function using the ROD technique.
Any virtual calls using the actual object instance of the compiler object in the function CompileTokens() can be inlined or transformed into devirtualized instance calls.
Any references to fields of the compiler object in question, such as “PascalCompiler” or “CPPCompiler” can be inlined as direct memory references in the machine code.
When a virtual call is made, like in this case to a method called “ParseStatement” or “ParseExpression”, then that method can be inlined as already mentioned. If the current this parameter is a hypothetical CPPCompiler object, the the call to ParseStatement will inline the virtual ParseStatement function from the CPPCompiler class, and obviously inline the ParseStatement function from the PascalCompiler class if the current this parameter was a PascalCompiler. By inlining and devirtualizing a method such as ParseStatement, or ParseExpression, those methods will also inline other methods. So the many thousands or even millions of virtual calls made when compiling a source code file can be reduced by making large parts of the code big flat linear chunks of code that executes forward and has all the call redundancy removed.
The list of tokens as input to the compiler might be a linked list stored as a field in the base class of the different compiler objects. Because the address is known when devirtualizing the many references to the current token in the compiler can be embedded as a direct pointer in the machine code.
This is a description of a hypothetical Object Oriented Programming situation. You have a base class and derived classes and lots of virtual calls in between, but when running any given instance it cannot change to a different class, and so devirtualizing is the optimal solution that speeds up the code but requires no special coding from the programmer to bring about the speed changes. The optimization is handled by the JIT behind the scenes under the direction of a few well placed devirt keywords in the source code.
A problem to consider here is that if you create many objects and then want to call the devirtualized function, that function will need to be rebuilt every time. Devirtualizing can produce some large chunks of code which must be register optimized. A programming technique in this case is to make sure that the devirt marked fields of the root object containing the devirt function are not going to change between compilations. The example of the linked list of tokens being directly inlinable might not be suitable if planning on calling the hypothetical CPPCompiler object repeatedly with many different lists of tokens, for examle when compiling a C++ program with 100 source code files. A rebuild each time might be unacceptable. An alternative would be to use a fixed list of objects that can contain a different array of tokens each time, and virtual calls into the fixed list such as AddItem() and GetToken() could still be inlined and references to the fixed list could be put directly into the machine code.
5.3 Speed up factor estimates
The current estimate for the speed up factor is variable, depending on the contents of function calls in a devirtualization context.
If all the virtual functions inside the devirt function are very big and have lots of jumps and non-devirtualizable memory references, then there might be almost no speed up at all.
If the virtual functions are quite small and can be inlined properly, and there are many of the virtual function calls, such as you might find in DSP, Audio, Word Processing and Data Processing applications, then the speedup can be huge, and by huge it is estimated that some devirtualized code might run at more than double the speed of regular code. When you properly fathom a devirtualized function that is the result of processing thousands of small, easily inlinable virtual functions, the speedup result might be more than double or even triple the speed of regular function calls. For example, the Macaw music program at www.code.google.com/p/macaw uses a Plugin Builder to make modular music instruments and effects. This kind of synth can be implemented as huge in-memory network of objects linked via inputs and outputs. By making the entry point to the processing function a devirt function, millions and even billions of functions per second can be removed. When you think of how much stack frame and unnecessary parameter handling code is removed, you can clearly expect a huge speedup, not to mention to possibility of using limited concurrency in the form of various SIMD type instructions at certain points. Not to mention the fact that so many inlined functions results in a much better register optimization scenario.
6 Links
Some links for more information.
6.1 ROD code example
An example of some C++ source code as it would be if ROD was implemented for a C++ compiler is available at code.google.com/p/juke in the downloads section. Look for the ROD_xxx.zip file in the downloads page which will always contain the latest version of this specification.
6.2 VM Download implementing ROD
ROD and the devirt keyword was initially specified by Robin Holmes on New Years Eve/Day in 2009/2010. There is no download for a working implementation yet. The virtual machine it is initially designed for is Juke, but Juke is still in very early days of construction and does not even generate 386 code yet so it will be some time before the ROD implementation is built into JUKE and Juke is working. Check back at the Juke homepage at code.google.com/p/juke in late 2010 and there might be a downloadable and working copy of the machine code then.