Tuesday, March 15, 2011

Benchmarking C#/.Net Direct3D 11 APIs vs native C++

[Update 2012/05/15: Note that the original code was fine-tuned to a particular config and may not give you the same results. I have rewritten this sample to give more accurate and predictable results. The comparison with XNA was also unfair and inaccurate; you should see something like x4 slower instead of x9. The latest SharpDX version, 2.1.0, is also only x1.35 slower than C++ now. An update of this article will follow on the new sharpdx.org website]

If you are working with a managed language like C# and you are concerned about performance, you probably know that, even though the Microsoft JIT CLR is quite efficient, it has a significant cost over a pure C++ implementation. If you don't know much about this cost, you have probably heard that managed languages are on average around 15-20% slower. If you are really concerned about this, and depending on the case, you know that the reality for a calculation-intensive managed application is more often around x2 or even x3 slower than its C++ counterpart. In this post, I'm going to present a micro-benchmark that measures the cost of calling the native Direct3D 11 API from a C# application, using various APIs: SharpDX, SlimDX, WindowsCodePack and XNA (on Windows, XNA uses Direct3D9, so the test is a little bit biased, but it gives a rough estimate).

Why is this benchmark important? Well, if you intend, like me, to build some serious 3D games with a managed C# API (don't troll me on this! ;) ), you need to know exactly what it costs to call a native Direct3D API intensively from a managed language (mainly, the cost of the interop between the managed language and the native API). If your game is GPU bound, you are unlikely to see any difference here. But if you want to apply lots of effects, with various models, particles, materials, playing with several render targets and a heavy deferred rendering technique, you are likely to perform lots of draw calls against the Direct3D API. For a AAA game, those calls can reach 3000-7000 draw submissions in instancing scenarios (see Johan Andersson's excellent DICE presentation "DirectX 11 Rendering in Battlefield 3"). If you are running at 60fps (or a lower 30fps), you have just 17ms (or 33ms) per frame to perform your whole rendering. In this short time range, draw calls can take a significant amount of time, and this is one of the main reasons why multithreaded command batching was introduced in Direct3D 11. We won't use such a technique here, as we want to evaluate raw calls.

As you are going to see, the results are pretty interesting for anyone concerned about performance who is writing C# games (or even efficient tools for a 3D middleware).


The Managed (C#) to Native (C++) interop cost


When a managed application needs to call a native API, it needs to:
  • Marshal method/function arguments from the managed world to the unmanaged world
  • Have the CLR switch from a managed execution environment to an unmanaged one (change exception handling, stacktrace state, etc.)
  • Actually call the native method
  • Then marshal output arguments and results back from the unmanaged world to the managed one
To perform a native call from a managed language, there are currently three solutions:
  • Use P/Invoke, the default interop mechanism in C#, which takes care of all the previous steps. But P/Invoke comes at a huge cost when you have to pass structures, arrays by value, strings, etc.
  • Use a C++/CLI assembly that performs hand-written marshaling to the native C++ methods. This is what SlimDX, WindowsCodePack and XNA use.
  • Use the SharpDX technique, which generates all the marshaling and interop at compile time, in a structured and consistent way, using CLR bytecode (the calli instruction) that the C# compiler doesn't emit and that is usually only available from C++/CLI.
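As a rough illustration of the first option, here is what a hand-written P/Invoke binding could look like. Note that the flat export and the DLL name below are hypothetical: the real Direct3D API is reached through COM vtables, not P/Invoke exports, so this is only a sketch of where the per-call marshaling cost comes from:

```csharp
using System;
using System.Runtime.InteropServices;

static class NaiveInterop
{
    // Hypothetical flat C wrapper around OMSetRenderTargets. On every call,
    // the CLR marshaler has to pin/convert the array argument and perform
    // the managed-to-unmanaged transition described in the steps above.
    [DllImport("d3d11wrapper.dll", CallingConvention = CallingConvention.StdCall)]
    public static extern void OMSetRenderTargets(
        IntPtr deviceContext,
        int numViews,
        [In] IntPtr[] renderTargetViews, // marshaled on each call
        IntPtr depthStencilView);
}
```

SharpDX avoids this per-call marshaler machinery by emitting the conversion code once, at compile time, and then issuing a raw calli on the COM vtable slot.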
Marshaling is in fact the most expensive part. Calling a native function directly without performing any marshaling has an overhead of around 10%, which is fine. But if you take a slightly more complex function, like ID3D11DeviceContext::OMSetRenderTargets, you can see that marshaling takes a significant amount of code:

/// <unmanaged>void ID3D11DeviceContext::OMSetRenderTargets([In] int NumViews,[In, Buffer, Optional] const ID3D11RenderTargetView** ppRenderTargetViews,[In, Optional] ID3D11DepthStencilView* pDepthStencilView)</unmanaged>
public void SetRenderTargets(int numViews, SharpDX.Direct3D11.RenderTargetView[] renderTargetViewsRef, SharpDX.Direct3D11.DepthStencilView depthStencilViewRef)
{
    unsafe
    {
        // Convert the managed array of RenderTargetView instances into a
        // stack-allocated array of native COM pointers, handling null cases.
        IntPtr* renderTargetViewsRef_ = (IntPtr*)0;
        if (renderTargetViewsRef != null)
        {
            IntPtr* renderTargetViewsRef__ = stackalloc IntPtr[renderTargetViewsRef.Length];
            renderTargetViewsRef_ = renderTargetViewsRef__;
            for (int i = 0; i < renderTargetViewsRef.Length; i++)
                renderTargetViewsRef_[i] = (renderTargetViewsRef[i] == null) ? IntPtr.Zero : renderTargetViewsRef[i].NativePointer;
        }
        // Raw calli through slot 33 of the ID3D11DeviceContext vtable.
        SharpDX.Direct3D11.LocalInterop.Callivoid(
            _nativePointer,
            numViews,
            renderTargetViewsRef_,
            (void*)((depthStencilViewRef == null) ? IntPtr.Zero : depthStencilViewRef.NativePointer),
            ((void**)(*(void**)_nativePointer))[33]);
    }
}
In the previous sample, there is no structure marshaling involved (structures are even more costly to marshal than plain method arguments), and as you can see, the marshaling code is pretty heavy: it has to handle null parameters, transform an array of managed DirectX interfaces into a corresponding array of native COM pointers, etc.

Fortunately, in SharpDX, unlike the other DirectX .NET APIs, this code is consistent over the whole generated codebase and was carefully designed to be quite efficient... but still, it obviously has a cost, and we need to know it!

Protocol used for this micro-benchmark


Writing a benchmark is error-prone, should always be taken with caution, and is relatively "narrow-minded". Of course, this benchmark is not perfect; I just hope it doesn't contain any mistake that would skew the overall trend of the results!

In order for this test to be closer to real 3D application usage, I chose to perform a very basic test on a sequence of draw calls that is typical of common drawing scenarios. The test consists of drawing triangles using 10 successive effects (vertex shaders/pixel shaders), each with its own vertex buffer, setting the viewport and the render target to the backbuffer. This loop is then run thousands of times in order to get a correct average.

The SharpDX main loop is coded like this:
var clock = new Stopwatch();
clock.Start();
for (int j = 0; j < (CommonBench.NbTests + 1); j++)
{
    for (int i = 0; i < CommonBench.NbEffects; i++)
    {
        context.InputAssembler.SetInputLayout(layout);
        context.InputAssembler.SetPrimitiveTopology(PrimitiveTopology.TriangleList);
        context.InputAssembler.SetVertexBuffers(0, vertexBufferBindings[i]);
        context.VertexShader.Set(vertexShaders[i]);
        context.Rasterizer.SetViewports(viewPort);
        context.PixelShader.Set(pixelShaders[i]);
        context.OutputMerger.SetTargets(renderView);
        context.ClearRenderTargetView(renderView, blackColor);
        context.Draw(3, 0);
    }
    if (j > 0 && (j % CommonBench.FlushLimit) == 0)
    {
        clock.Stop();
        Console.Write("{0} ({3}) - Time per pass {1:0.000000}ms - {2:000}%\r", programName, (double)clock.ElapsedMilliseconds / (j * CommonBench.NbEffects), j * 100 / (CommonBench.NbTests), arch);
        context.Flush();
        clock.Start();
    }
}
The vertex/pixel shaders involved are basic (just passing a color from VS to PS, no WorldViewProjection transform applied). The context.Flush is there to avoid measuring the flush of commands to the GPU, and the CommonBench.FlushLimit value was chosen to avoid any stall from the GPU.

I have ported this benchmark to:
  • C++, using raw native calls to the Direct3D11 API
  • SharpDX, using Direct3D11, running under the Microsoft .NET CLR 4.0 and under Mono 2.10 (trying LLVM on and off). SharpDX is the only managed API able to run under Mono.
  • SlimDX, using Direct3D11, running under the Microsoft .NET CLR 4.0. SlimDX is "NGENed", meaning that it is compiled to native code when you install it.
  • WindowsCodePack 1.1, using Direct3D11, running under the Microsoft .NET CLR 4.0
  • XNA 4.0, running under the Microsoft .NET CLR 4.0. As XNA internally uses Direct3D9, this API is not strictly equivalent to the other draw call sequences, but I tried to use almost the same calling workflow. Also, XNA doesn't run under x64.
It was tested on Win7 64-bit, i5-750 2.6GHz, AMD HD6950 graphics. All tests were done in both x86 and x64 mode, in order to measure the platform impact of the calling conventions. Tests were run 4 times for each API, taking the average of the 3 lowest.

Results


You can see the raw results in the following table. Time is measured for the simple drawing sequence (the inner for(i) loop over nbEffects). Lower is better. The ratio on the right indicates how much slower the tested API is compared to the C++ one. For example, SharpDX in x86 mode runs 1.52 times slower than its pure C++ counterpart.


Direct3D11 Simple Bench                x86 (ms)   x64 (ms)   x86 ratio   x64 ratio
Native C++ (MSVC VS2010)               0.000386   0.000262   x1.00       x1.00
Managed SharpDX (1.3, MS .NET CLR)     0.000585   0.000607   x1.52       x2.32
Managed SlimDX (June 2010, NGen)       0.000945   0.000886   x2.45       x3.38
Managed SharpDX (1.3, Mono 2.10)       0.002404   0.001872   x6.23       x7.15
Managed Windows API CodePack 1.1       0.002551   0.003219   x6.61       x12.29
Managed XNA (4.0, MS .NET CLR)         0.003751   -          x9.72       -

And the associated comparison graphs for both the x86 and x64 platforms:



The results are pretty self-explanatory, but we can highlight some interesting facts:
  • Managed Direct3D API calls are much slower than native API calls, ranging from x1.52 to x12.29 depending on the API you are using.
  • SharpDX provides the fastest managed Direct3D API, ranging from only x1.52 to x2.32 slower than C++, at least 50% faster than any other managed API.
  • All the other managed Direct3D APIs are significantly slower, ranging from x2.45 to x12.29.
  • Running this benchmark with SharpDX on Mono 2.10 is x6 to x7 slower than SharpDX on the Microsoft JIT (!)
Ok, so if you are a .NET programmer and were not aware of the performance penalty of using a managed language, you are probably surprised by these results, which could be... scary! We can balance things, though: your 3D engine is unlikely to be CPU bound on draw calls alone, but 3000-7000 calls could lead to around a 4ms impact in the best case, which is something we need to know when we design a game.
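To see where that ~4ms figure comes from, scale the measured SharpDX x86 time for one state-setup + draw sequence up to a heavy frame. This is a back-of-the-envelope sketch using the numbers from the table above, assuming each draw submission pays roughly the cost of one full benchmark sequence:

```csharp
using System;

class FrameBudget
{
    static void Main()
    {
        // Measured cost of one full state-setup + draw sequence (SharpDX, x86).
        const double sequenceMs = 0.000585;

        // Assume each of 7000 draw submissions costs roughly one such sequence.
        const int drawsPerFrame = 7000;
        double apiCostMs = sequenceMs * drawsPerFrame; // about 4.1 ms

        // The frame budget at 60fps is roughly 16.7 ms, so the API calls
        // alone would eat about a quarter of it.
        Console.WriteLine("API call cost per frame: {0:0.0} ms", apiCostMs);
    }
}
```

The same calculation with the native C++ sequence time (0.000386ms) gives about 2.7ms, so the managed overhead itself accounts for roughly 1.4ms per frame in this scenario.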

This test could also be extrapolated to other parts of a 3D engine, which will probably be slower by a factor of around x2 compared to a brute-force C++ engine. For a AAA game this would of course be an unacceptable performance penalty, but if you are a small/independent studio, this cost is relatively low compared to the cost of developing a game efficiently in C#; in the end, it's a trade-off.

If you are using the SharpDX API, you can still run at a reasonable performance. And if you really want to circumvent this interop cost in chatty API scenarios, you can design your engine to call a single native function that batches calls to the native Direct3D API.
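A sketch of that batching idea follows. The DrawCommand layout and the Sharp3DNative.dll export are hypothetical, not part of SharpDX; the point is only to show the shape of the approach:

```csharp
using System;
using System.Runtime.InteropServices;

// Commands are encoded on the managed side into a flat struct array...
[StructLayout(LayoutKind.Sequential)]
struct DrawCommand
{
    public IntPtr VertexBuffer;   // native ID3D11Buffer*
    public IntPtr VertexShader;   // native ID3D11VertexShader*
    public IntPtr PixelShader;    // native ID3D11PixelShader*
    public int VertexCount;
}

static unsafe class NativeBatcher
{
    // ...and replayed against ID3D11DeviceContext by a small hand-written
    // C++ helper, so the managed/native boundary is crossed once per frame
    // instead of once per API call.
    [DllImport("Sharp3DNative.dll")]
    public static extern void ExecuteBatch(
        IntPtr deviceContext, DrawCommand* commands, int count);
}
```

The trade-off is maintaining a native helper DLL, but for thousands of draw submissions per frame it amortizes the interop cost down to a single transition.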


You can download this benchmark: Sharp3DBench.7z.

46 comments:

  1. Hello there,

    Very interesting benchmark. I was actually somewhat surprised to see that managed calls are so cheap! I expected something around 5x slower than native calls, so 1.52x is, to me, very affordable considering the huge amount of time saved by using C# instead of C++.

    I have a question though. Why is the x64 code nearly 50% slower than the x86 one? Could the pointer size alone incur such a slowdown?

    Keep up your great work, SharpDX is awesome !

    Heandel

    ReplyDelete
  2. Hi there,

    yeah, I'd also like to know where the main slowdown in the 64-bit world is coming from... surely the longer pointers alone cannot have such a huge impact? Also, seeing as the ratios of the other frameworks differ from that of SharpDX, do you think the 64-bit degradation issue is 'fixable'?

    ReplyDelete
    Thanks! I'm out for 10 days, so I will check the 64-bit difference as soon as I'm back.

    I don't think there is a way to improve this in x64 mode, as the code is pretty straightforward, although I still have an improvement that will be available in v2 of SharpDX.

    The transition from the CLRCall calling convention to 64-bit stdcall, plus differences in the 64-bit JIT code generation, are more likely to make the difference here... but it could also come from tail call optimization in 64-bit, which is probably not used efficiently in SharpDX (I will have to check this, as CLR 4.0 changed it).

    ReplyDelete
  4. Hey Alex,

    First off.. I love SharpDX :)

    But... one thing I noticed is that it may be possible to further reduce the number of high frequency allocations and use of fixed.

    For instance, I see a pattern in your API mappings where you will do a GC alloc followed by a fixed{} on a value type.

    Have you checked what the IL looks like compared to a stackalloc with no fixed{} ?

    Also, maybe I'm missing something, but I couldn't figure out a convenient way to update resource buffers (DX11) without suffering a penalty of a couple of reference type allocations per update. The same goes for the D2D triangle sink, where you allocate managed arrays on every call.

    Maybe to avoid this, you could provide a templatized wrapper around a block of memory allocated with Marshal.AllocHGlobal(). Perhaps one for arrays and another for a singular struct (Perfect for constant buffers).

    I'm pretty sure you'd get a decent perf win from removing the GC pressure, and passing void*'s as much as possible would mitigate marshaling costs in the managed/native proxies.

    Yeah, I realize these kinds of optimizations may not be to everyone's taste (since it places a larger burden on the application code).

    cheers,
    -chris

    ReplyDelete
  5. Oh ignore the bit about void*'s I forgot you already take care of that.

    ReplyDelete
  6. Hi Christian,
    Thanks for your comment!
    "fixed" is only used for arrays and value types passed by pointer. Fixed on a value type that was allocated on the stack doesn't have a significant impact.
    I agree that the current design can generate some unwanted allocations that would be difficult to avoid (though I'm going to review them). Some of them (for example in UpdateSubresource or Map) allocate on the stack, so it shouldn't be a huge issue... but some, like arrays, are allocated on the heap, even if they are sometimes transient objects...

    I'm highly concerned about this as well, so what I'm going to provide for v2 is access to a very low-level API version alongside the current API. This low-level API won't perform any marshaling for either parameters or return values and will consist of RAW calls, almost as fast as their native counterparts. These methods could be used in very specific scenarios where, for a small part of an application, we need to process as fast as possible.

    I will also provide, for the current API, some of the currently hidden methods that avoid allocating transient objects, making the client responsible for allocating them (for example, outside of an intensive loop).

    I hope to release v2 around the end of April 2011.

    ReplyDelete
  7. Excellent, the roadmap sounds great, can't wait to get a look at V2 nearer the time.

    I have to admit to being out of touch with the latest code-gen from the compiler/jitter. Last time I paid attention was a few years back, and mostly for compact framework issues. On that, I pretty much expected the worst at all times!

    Just an FYI... The function I saw creating the most garbage for me was MapSubresource() in DeviceContext.cs. It allocates two reference types... a DataBox and a DataStream.

    Unfortunately, you need that DataBox for the UpdateSubresource() later, so I couldn't see a way around it without modding the code.

    thanks!

    ReplyDelete
    Agreed about the two allocations; I could provide another way to handle this... but it would be easier to follow up if you could log an issue about it and describe your use case/workflow a little more. Thanks!

    ReplyDelete
  9. In all honesty, I'm pretty happy to see that SharpDX is beating SlimDX in terms of perf. Keep up the good job - that whole c++/cli-like IL gen was a smart move.

    ReplyDelete
    I would like to give you a thumbs up for this project, I really like it! :)

    ReplyDelete
  11. Really nice project !
    As soon as you add the possibility to use the Effect framework under DirectX11, I will give SharpDX a try and adapt my project to it! (atm under slimDX) :)
    Keep up the good work !

    ReplyDelete
  12. All Right !

    When will your SharpDX 2.0 be ready?

    ReplyDelete
    @new, v2 is on the way, though I'm currently pretty busy moving house to Japan with my family.

    SharpDX 2.0 is going to have lots of new features and I will release it as soon as it is stable.

    ReplyDelete
    I have a question: do you think a text-intensive (DirectWrite/Direct2D) application would benefit from using SharpDX over Windows API CodePack 1.01?

    It's a WinForms datagrid control being used in a very busy trading/market data application. The performance is really good as it is (much, much better than any of the commercial WinForms GDI/GDI+ based datagrid controls available) and DirectWrite draws beautifully. The grid draws a lot of small text strings with different formatting.
    The existing solution is, to my understanding, not GPU bound; it is actually spending quite a bit of CPU, especially in hwndRenderTarget.DrawText(...) and hwndRenderTarget.EndDraw(...).

    If SharpDX performs 4x faster than the existing Win API CodePack I would love to replace it. Of course the only real answer is to implement something and then measure the difference, but before jumping into this I would like to know your opinion.

    ReplyDelete
    @krogen, it depends. I suspect that your application is currently CPU bound because of DirectWrite itself, not CodePack API overhead. So unless you have a very chatty program against the CodePack API, you probably won't see any benefit from switching from CodePack to SharpDX (the cost of a draw is much higher than the cost of the API overhead).

    That being said, using SharpDX over CodePack is still worthwhile if you want to have:
    - full support for DirectWrite/Direct2D (from what I remember, CodePack doesn't support all the callback features of these APIs)
    - AnyCPU assemblies, running transparently on x86/x64 without any GAC install

    ReplyDelete
    Thank you so much for all the hard work you've put into SharpDX. We can't wait for v2. In fact I check your blog every day, nervously waiting for the next update.
    Good luck and success for your move to japan!

    ReplyDelete
    Thanks ch, still pretty busy settling in in Tokyo, so I didn't have time to push v2 to where it should be. I still hope to release it around July with a brand new website.
    Stay tuned!

    ReplyDelete
  18. Hi Alexandre,

    Thanks for sharing this; it's a great job.

    What can you tell about the portability of such a wrapping technique? Would "calli" work on Linux with an equivalent GetProcAddress function?

    Enjoy Japan, lucky you!

    ReplyDelete
  19. Hi HelloweenScot,
    The calli instruction works with Mono, so it should work on Linux.
    Though I have found that calli is not entirely well implemented in Mono (bugs with struct arguments and possibly performance issues), something I will have to check carefully.

    ReplyDelete
  20. Hi Alex,
    thanks for the article. I also hope there's not too much neutron in Tokyo's water.

    One question: how do you generate the C# code based on the C++ API headers? What is your metaprogramming method?
    Do you use a public template/macro solution like the Text Template Transformation Toolkit (C#) or Boost (C++)?

    Regards

    ReplyDelete
  21. Hi Guillaume,
    About the code generation process, I wrote an article about it, http://code4k.blogspot.com/2010/10/managed-netc-direct3d-11-api-generated.html, though it is slightly outdated in some parts now.

    The code generation process for the next version v2.0 (alpha available from https://github.com/sharpdx/SharpDX ) is:
    0) Read mapping.xml configuration mapping rules files from various directories.
    1) the code generation is using gccxml to parse the C++.
    2) The gccxml output is mapped to an internal C++ model (previously called XIDL).
    3) Rules from configuration files are used to generate C# code.
    4) The template engine used is T4 engine though It's now using a simplified version of it from Mono.TextTemplating code.

    With gccxml, the SharpDX code generation tool is able to parse the whole Windows API (or even any third-party API) and is well suited to exposing COM interfaces. The code generation tool will probably be made available as part of a SharpDX tool chain (for example, to generate wrappers for custom C++/COM based code).

    ReplyDelete
  22. Thank you for excellent library.

    I wanted to ask - what about X3DAudio library? Will that be supported in SharpDX?

    ReplyDelete
  23. Hello Alexandre,

    I have converted my little game project (aka Voxel landscape rendering : http://www.youtube.com/watch?v=9r93LyIJLjY) from slimDX to SharpDX, it's working fine.

    While I didn't lose any FPS (didn't gain any either, but that's because I'm not "draw call" limited at the moment), the game has gained some "smoothness". With slimDX I had a GC collection fired from time to time (from inside slimdx) that made my rendering freeze for 0.01s. No more of those with SharpDX!

    What I also like is that you match the DirectX functions more directly. As there are not a lot of tutorials for slimDX/SharpDX, it's nice to find the DirectX functions almost directly in SharpDX.

    I have some questions:
    - What are the requirements for my game to run under Mono? (which .NET framework? 3.5, 4.0?)
    - I did try your alpha V2, but it crashes almost immediately in the functions responsible for writing data into memory (like UpdateSubresource). Is that normal? (I know it's still in alpha!)

    Good work so far, I'm sticking with your managed wrapper around DirectX11!

    ReplyDelete
    One more question: how do you get the shader compilation errors (not using the effect framework)?

    With slimDX, I could receive something like this (when the shader compilation failed):
    Additional information: Terran.hlsl(71,2): error X3000: syntax error: unexpected token 'output'

    With sharpDX I only get:
    Additional information: Unknown error (HRESULT = 0x80004005)

    ReplyDelete
    Fabian: ShaderBytecode.CompileFromFile has an overload that provides an out string compilationErrors argument. This will contain the exact error message, including line/column numbers.

    ReplyDelete
  26. Thanks for your report.

    @bubu, X3DAudio is planned and I'm working on it. It shouldn't take too long, as the API is very small.

    @Fabian, great to see that you were able to switch to SharpDX smoothly while getting some performance gains. For your other requests, as they could take more lines to answer, and to keep a record of them, could you please log questions/issues at https://github.com/sharpdx/SharpDX/issues?sort=created&direction=desc&_pjax=true&state=open

    Thanks!

    ReplyDelete
    @Fabian, concerning the mapping between DirectX types/functions/interfaces and SharpDX types, it will be fully integrated into the upcoming SharpDX documentation system, with the ability to search by unmanaged name directly.

    The documentation will probably also include a full listing of all the mappings.

    ReplyDelete
  28. Hey Alex,

    Any progress with v2?

    ReplyDelete
  29. @UltraHead: Still working on it:
    - I have recently integrated X3DAudio and XACT3.
    - I need to add RawInput and XInput mappings which shouldn't take too long.
    - Direct3D9 needs a little more love, at least on API parts that are the most frequently used.
    - I would like also to add WIC support for 2.0

    After this:
    - I have to improve documentation and support better import from msdn.
    - Finalize the new website

    Then I will be able to release a 2.0 beta.

    So basically, It will take several weeks to do this.

    ReplyDelete
    Good to know. Thanks... I've subscribed to your Twitter page to follow up on progress :).

    ReplyDelete
  31. Hi,
    why do you have in the XNA 4.0 version

    device.SetRenderTarget(null);
    device.Clear(Color.Black);

    inside the loop? These things should be done once per scene render, not per object render. And the triangle seems to fill 1/8 of the screen. Those things may be bottlenecked by the video card's fillrate, and as I see it, they may interfere with the number of draw calls possible.
    I see the same clearing of the screen in the other benchmarks as well. If it is intended, can you explain it please?
    Thanks.

    ReplyDelete
    @Rosen, indeed, those should have been moved out of the loop (and I don't remember why the SetRenderTarget ended up with a null), but the main goal of this benchmark was to measure the cost of the API calls. In fact, for Direct3D11 I should have set up a deferred context, though XNA has no equivalent. That's why I tried to tune the flush limits correctly on my machine for all APIs in order to avoid any stall from the GPU, but I don't think it's going to change things much for XNA (though I could have done something wrong).

    ReplyDelete
  33. Sorry, disregard the first part, realized the source was there, but then I got lost with this multi-id system at blogspot, and pasted an old version of my comment.

    ReplyDelete
  34. "This test could be also extrapolated to other parts of a 3D engine, as It will probably slower by a factor of x2 compare to a brute force C++ engine."

    I seriously doubt that.
    Marshaling might put a huge burden on the CPU, but I've done benchmarks on basic operations in both C++ and C#, and they're very close (other than generics, which are considerably slower than templates).

    ReplyDelete
    @pball81, there is in fact no speculation here. I have been working on and investigating JIT code generation for the last few years, and implementing a new 3D engine in C# at my work, and I can tell you that in lots of cases where you ask your CPU for the maximum, the JIT is far from efficient. There are lots of cases where the JIT won't inline your code, for example... And the x86 JIT code generation is not able to use SSE/SSE2 instructions to efficiently vectorize things (only x64 uses them, because it is forced to), thereby also losing all the benefits of the extra SSE/SSE2 registers... and the performance boost of using inline SSE/SSE2 and proper register usage is at least x2 in C++ compared to C#.

    In order to achieve the best performance in C#, you really need to profile your code (avoid any kind of allocation, as you would do in C++, but it is even more important here with the GC running in the background) and check the JIT-generated code... and sometimes play with some dirty JIT hacks in order to get the best from it.

    Although in some cases you won't notice a huge difference, the 10-20% overhead of JIT code compared to highly optimized C++ code is a myth, especially when you get into heavy CPU computing...

    ReplyDelete
  36. This comment has been removed by the author.

    ReplyDelete
  37. "the JIT is far from being efficient. There are lots of cases where the JIT won't inline your code for example..."

    Not only is the JITter inefficient in certain situations, there are also many core and system library operations in .NET/C# with methods that are far from efficient (and in those cases, most of the time you end up writing your own code).

    ReplyDelete
  38. So, just for kicks and giggles, I decided to integrate SharpDX into my graphics library (which used SlimDX).

    It was somewhat painless to do so, had to fix up some naming and re-order some parameters. I was a little thrown off by having to use the SharpDX.Direct3D namespace to use the FeatureLevel enumeration.

    Anyway, that's not why I'm posting on this particular blog entry. I did some crude benchmarking by drawing a multisampled scene with a rotating square and simulating motion blur on that square by drawing it 8 times, and a single untransformed textured square. I also enabled the depth buffer. When measuring against the FPS (yeah I know, FPS is not a good metric, bear with me...)

    FYI, This is on a Win7 x64 box, i7 2600, Radeon 6870:
    For SlimDX -> At the highest, I was getting about 2200 FPS (about 0.45 ms), and at the lowest about 1900 FPS (about 0.52 ms).
    For SharpDX -> At the highest, I was getting about 2200 FPS, and at the lowest, about 1700 FPS (about 0.56 ms).

    This was consistent between x86 and x64. Since I knew I wasn't using a good metric (FPS) to measure draw time, and I really wasn't taxing the card in any way, I decided to draw the same scene, but this time drawing the single square 65536 times and transforming it a little on each iteration.

    Here were my results (and I'm saying this right now, I'm pretty impressed with SharpDX):
    For SlimDX x86 -> Got a delta of 85ms, ~12 FPS
    For SharpDX x86 -> Got a delta of 36ms, ~28 FPS

    For SlimDX x64 -> Got a delta of 101ms, ~10 FPS
    For SharpDX x64 -> Got a delta of 27ms, ~37 FPS!!

    When I did the first test I was a little skeptical about your claims, but after the second... not so much.

    I did find it interesting that SharpDX didn't appear to perform as well with so little on the screen (mind you, looking at the delta, it's a very minor performance difference, approximately 0.04 ms, hardly worth worrying about in my opinion).

    You've done an amazing job with this. I may consider switching Gorgon to use this as its API once I get some more testing out of the way.

    ReplyDelete
  39. Thanks TapeWorm for this valuable feedback and glad to see that you have found some interesting results! :)

    Since this article was posted, I have worked a little more on some core Direct3D11 methods to improve performance, so I expect to check this and benchmark it again in the near future.
    There are also a couple of things I'm slowly working on that are not released yet but will improve performance as well. I'm confident that SharpDX can get under the x1.5 performance penalty against the C++ version.

    About the API differences you hit, sorry for that: if you think some of those changes are not relevant, drop an issue on SharpDX and I will have a look at it. For FeatureLevel, for example, as it is effectively shared between Direct3D10.1 and Direct3D11 in C++, I had to map the correct C++ behavior.

    And concerning the small difference (0.04ms) in the simple case, it indeed doesn't seem critical enough to worry about.

    ReplyDelete
  40. Well, that's good to hear. I look forward to seeing what you're going to come up with.

    I'll pop a suggestion or two into the bug tracker. For the most part it was just casing issues and parameter ordering, stuff that I expected to be different.

    ReplyDelete
  41. Hi, Could you tell me what your plans are for long term development and support of SharpDX?

    ReplyDelete
  42. @Heinrick,
    Long term development includes:
    - Improve documentation
    - Full Win8 Metro Style App support
    - Performance for interop and Math operations
    - Reflect next DirectX API changes

    SharpDX covers almost 100% of the DirectX API, so development for coverage will probably only include new changes.

    Concerning support, SharpDX currently has free support (most issues are resolved within 24 hours to a week). A Gold support option may be introduced this year.

    ReplyDelete
  43. Hello Alexandre!
    I wanted to ask if you've already developed 4k demos in C# with SharpDX? Is that size even possible? I have been programming since the C64 days, now in C#, but haven't made an intro for a long time and would like to do a fun project again.

    greetz

    zipmar

    ReplyDelete
    Hi Marco. I have already investigated small intros with C#, and it is not really feasible. The smallest assembly that you can get in .NET is 1536 bytes, but that already requires a bit of PE hacking. Then the major drawback is that the assembly stores long names to reference other assemblies and any used API (like System.IO.FileStream), so more than half of the executable would already be occupied by PE headers + metadata.

    ReplyDelete
  45. Hello Alexandre!
    I've thought about something. When I made my first experiments with Java and lwjgl, I couldn't even get under 11kb, and that with just a cube! My experiments with C# and GDI+ came to 7kb for a text scroller and a copper bar. Pure Java would be enough even for an old-school intro (37kb with images), but one always wants more and more.^^ Thank you for your answer! I will now learn C++ and see what comes of it. Once you've been in the demoscene, you can never get away from it ^^. Even after such a long time.

    ReplyDelete