FLEXINE - a flexible engine
Random informal unsorted incomplete notes - last updated October 9, 2001
Flexine is my new engine, based on my former ICE framework. Its main goal is, you guessed it, flexibility. Performance is important as well, but it's definitely a runner-up here. Basically, I want to be able to throw whatever I want at the engine, and let it handle the gory details. I want to be able to test any whacked algorithm in a second. I want to be able to replay levels from various games - e.g. Quake3 and Oni - regardless of their respective native data structures (BSP for Q3, octree for Oni). I want to be able to handle exotic file formats such as ROK files, from the Japanese modeler Rokkaku-Daioh. I want to be able to replace a DX7 rendering module with an OpenGL one the day OpenGL 2.0 is out. I want to be able to use DX7, DX8 or DX9 in a transparent way. I want that engine to last. I want it to be future-proof.
What I don't want is to start a new engine from scratch once again, next year or the year after, because this one was unable to evolve and stick to new trends. What I don't want is to ban some specific algorithms just because the architecture can't support them. What I don't want is to be format-dependent. What I don't want is to build one more eye-candy pusher with nothing underneath. What I really don't want is to recode the same things again and again, all over the place.
Various choices have been made in order to reach my goals. Below are some of them, but keep in mind this is only a tiny informal overview.
Main goals in order of importance:
- Ease of use
- Code reuse
- everything is written in C++, and makes good use of the language's features. In 2001, there are no valid reasons to be shy about virtual methods or multiple inheritance. I accepted them once and for all, and don't want to lose my time in endless, useless, pointless holy wars about this. In short, it pays off in the long run and the price to pay is minimal - CPU time wise - as long as you know what you're doing. If you're a tiny bit serious about professional development anyway, you already know that - and agree with it.
- more specifically:
- inline assembly has not been ruled out (you know I'm not afraid of it - worse, I actually like the smell of the metal), but it's only there for very specific tasks (SIMD or pixel shader instructions come to mind), and only introduced late - not blindly.
- the project has been cut into different dedicated modules right from the start. It's always tempting to hardcode some maths in places you need it. But it's bad. Here, everything is included in a dedicated maths DLL. It is painful when the project starts, because you get plenty of DLLs with not much in them, and the whole thing looks overkill. But before you know it you have hundreds of files in the project, and without the initial separation into multiple modules, it becomes an infernal mess. Current modules in Flexine are listed below.
ICE basic blocks:
"Flexine", or the glue using all of the above:
- various formats
Documentation and code-style choices
- I have a very strict code style, and I stick to it. Whether it's the best or not is totally irrelevant, and often people don't see this and start endless wars about the best way to write readable code. Useless. I don't claim anything here except consistency. The only rule is to stick to your rules.
- I chose Doxygen as the automatic documentation tool. Some people reported they discarded it because they didn't like the way they had to write their comments. I say this is actually the best thing about inflexible documentation tools: you lose the tiny bit of liberty which happens to be a PITA, responsible for many useless holy wars out there. You no longer have to ponder contradictory things endlessly. Doxygen wants it that way, you write it that way, period. I have better things to consider.
- Code reuse is vital. That's probably my heaviest burden so far. When I write new code, I simply can't stand not to share the maximum of it, even if the whole project needs rebuilding afterwards. I also have a very personal strategy to design my classes: I always act as if someone else was supposed to include them directly in their own engine. Of course it is not always possible, but theoretically it leads to cute, independent modules. I already released some of them just to check that theory - e.g. my triangle stripper, the consolidation code, or what's now known as Opcode. In short, I think I could extract most ICE / Flexine modules in the same way, and one could still include them and use them in another engine. That's not altruism. That's a way to build neat interfaces.
- Design patterns are vital. This goes along with code reuse - something like ideas-reuse maybe. As a single example, the publish-subscribe design pattern is used extensively, to avoid looping through all your objects in vain.
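To make the publish-subscribe point concrete, here is a minimal sketch, not Flexine's actual code. The class name EventHub, the string event names and the integer tokens are all assumptions made up for illustration. The point is only that publishing an event visits the objects that asked for it, instead of walking every object in the scene.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical event hub: names are illustrative, not Flexine's API.
class EventHub {
public:
    using Callback = std::function<void(int /*senderId*/)>;

    // Register interest in one event; returns a token used to unsubscribe.
    int Subscribe(const std::string& event, Callback cb) {
        assert(cb != nullptr);
        const int token = mNextToken++;
        mSubscribers[event].push_back({token, std::move(cb)});
        return token;
    }

    void Unsubscribe(const std::string& event, int token) {
        auto it = mSubscribers.find(event);
        if (it == mSubscribers.end()) return;
        auto& list = it->second;
        list.erase(std::remove_if(list.begin(), list.end(),
                       [token](const Entry& e) { return e.token == token; }),
                   list.end());
    }

    // Only the subscribers of this event are visited.
    // Returns the number of subscribers notified.
    int Publish(const std::string& event, int senderId) const {
        auto it = mSubscribers.find(event);
        if (it == mSubscribers.end()) return 0;
        for (const Entry& e : it->second) e.callback(senderId);
        return static_cast<int>(it->second.size());
    }

private:
    struct Entry { int token; Callback callback; };
    std::unordered_map<std::string, std::vector<Entry>> mSubscribers;
    int mNextToken = 1;
};
```

This sketch does not handle a callback that subscribes or unsubscribes while Publish is iterating; a real implementation would need a policy for that.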
- Lazy evaluation is vital. As the guys from dPVS/Umbra said, it is the key to great performance. I couldn't agree more, and my primary design strategy is always to think "how can I lazy-evaluate this ?".
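A minimal sketch of the usual way to lazy-evaluate a derived value: a dirty flag plus a cached result. The class LazyBounds and its members are hypothetical, and a 1D "extent" stands in for something like a bounding volume. Modifications only invalidate the cache; the actual work happens the first time somebody asks for the result.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical example class, not taken from ICE / Flexine.
class LazyBounds {
public:
    void AddPoint(float x) {
        mPoints.push_back(x);
        mDirty = true;              // cheap: just invalidate the cache
    }

    // Recomputes at most once per batch of modifications.
    float MaxExtent() const {
        assert(!mPoints.empty());
        if (mDirty) {
            mCachedMax = *std::max_element(mPoints.begin(), mPoints.end());
            mDirty = false;
            ++mRecomputeCount;
        }
        return mCachedMax;
    }

    // Exposed only to show how rarely the real work happens.
    int RecomputeCount() const { return mRecomputeCount; }

private:
    std::vector<float> mPoints;
    mutable float mCachedMax = 0.0f;
    mutable bool  mDirty = true;
    mutable int   mRecomputeCount = 0;
};
```

Adding a thousand points and querying once costs one recomputation, not a thousand.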
- Robustness is vital. Long, long years ago, I was a coder-wannabe on Amstrad CPC. And there was that French magazine (I think it was "AMSTAR") with some coding lessons in that good old Locomotive Basic. And there was that first, simple, beautiful rule: "Un bon programme est implantable." - read: "a good program never crashes". I know it may sound dubious in these days of random BSODs, but it has happened in the past, and even if it was on old machines with limited RAM, limited users, limited resources, limited possibilities for the code to crash, it did the job nonetheless. A PC program can crash for millions of new reasons, but I strongly feel we, as developers, are responsible for most of them. So I fight for robustness. When I was at ESME, one of my teachers (Bernard Ourghanlian, technical director at DIGITAL) clearly pointed out the problem: the current trend is to write "disposable software", later superseded by new versions or more-or-less fixed with patches. We know the reasons for this, and I'm not here to criticize or discuss them. But I definitely try not to play that game, and it shows in my code where I often - if not always - try to make any given function bullet-proof. This was one of the key requirements when I started coding ICE, and it has since evolved into a pathological crusade. In Flexine for example, as long as you play by the rules and let ICE monitor everything, you can safely mess up a lot of things without crashing - ICE's kernel does the cleaning. As a simple example, if you delete an object anywhere ("delete ptr") instead of calling a Release() method, without checking a possible reference counter, bypassing all safety nets, it usually recovers gracefully anyway. The object gets wiped out. Then, if you stored that pointer in some container (for example in a list of meshes, a list of visible objects, whatever), the container automatically knows one of its elements is now invalid, and removes it from the list - calling you back for notification if needed. Perhaps even sicker is the ability of the underlying kernel to fix now-invalid pointers all by itself, even if that pointer lies somewhere as a private member of some class. The pointer becomes null, automagically - of course you shouldn't have savagely deleted the referenced object in the first place, the point is it doesn't crash even if you do stupid things. On top of that, you're usually supposed not to use pointers anymore, but IDs. The ID-to-pointer translation is supposed to be done on-the-fly each time it's needed, and it is virtually free on PC thanks to a carefully crafted implementation.
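The notes don't show how ICE translates IDs to pointers, so the following is only a sketch of one common technique (a slot table with generation counters), not the actual kernel. HandleTable, Object and the 16/16 bit split of the ID are all assumptions. The idea it illustrates is the one described above: once an object is gone, a stale ID resolves to null instead of a dangling pointer, and the translation itself is one array index and one compare.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical types; ICE's real kernel is not shown in these notes.
struct Object { int value = 0; };

class HandleTable {
public:
    using Id = std::uint32_t;   // low 16 bits: slot, high 16 bits: generation

    Id Register(Object* obj) {
        assert(obj != nullptr);
        for (std::uint32_t i = 0; i < mSlots.size(); ++i) {
            if (mSlots[i].ptr == nullptr) {          // reuse a free slot
                mSlots[i].ptr = obj;
                return MakeId(i, mSlots[i].generation);
            }
        }
        assert(mSlots.size() < 0x10000u);
        mSlots.push_back({obj, 0});
        return MakeId(static_cast<std::uint32_t>(mSlots.size() - 1), 0);
    }

    // Called when the object dies: IDs handed out earlier stop resolving.
    void Unregister(Id id) {
        Slot* s = Find(id);
        if (s) { s->ptr = nullptr; ++s->generation; }
    }

    // ID-to-pointer translation. Returns null for stale or unknown IDs.
    Object* Resolve(Id id) {
        Slot* s = Find(id);
        return s ? s->ptr : nullptr;
    }

private:
    struct Slot { Object* ptr; std::uint16_t generation; };

    static Id MakeId(std::uint32_t slot, std::uint16_t gen) {
        return (static_cast<Id>(gen) << 16) | slot;
    }
    Slot* Find(Id id) {
        const std::uint32_t slot = id & 0xFFFFu;
        if (slot >= mSlots.size()) return nullptr;
        Slot& s = mSlots[slot];
        return (s.generation == (id >> 16) && s.ptr) ? &s : nullptr;
    }
    std::vector<Slot> mSlots;
};
```

The linear search for a free slot keeps the sketch short; a free list would be the obvious replacement. The table does not own the objects, it only tracks them.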
- Smartness is vital. But not too much of it! Often you build a complex system, and in the end it runs slower than a brute-force, or at least less clever approach. Often you optimize one thing, and another thing starts suffering from the change. It's easy to optimize one particular routine to the max. It's harder when everything gets interconnected. There's a balance to find, and no single good way to the top. This is now known as "smartness wrap-around", following one of my posts on the Algorithms list.
"Everything is precision-limited with computers, even smartness. Too much of it and you wrap around."
"Curiosity killed the cat,
Complexity killed the cache !"
So keep profiling, low profile. It's harder and harder to foresee what will come out best, and a good technique one day becomes bad practice the next. Learn, adapt, evolve, survive. "I got no idols", Juliana said ! Beware of self-proclaimed gurus. Believe no one but yourself. Yes, that's a design choice! It leads to flexibility as you don't put all your bets on a single "best" design or data structure.
- Shipping is vital. Highest priority. Don't trust a guy who's never shipped anything. Don't. Do not.
Data format choices
- Automatic serialization, usually implemented thanks to the help of virtual import / export methods and clever OFFSETOF macros, is cute and handy. Unfortunately it also has two painful drawbacks: it usually produces large files, and it implies a great dependence between the file format and your internal classes. On top of that, versioning is difficult to make automatic.
- In Flexine, I therefore went the old way. I have my proprietary format, and also support some other classic formats - for the sake of, you guessed it, flexibility. They all use the same model:
- X-importer class for a particular X format
- X-factory ISA X-importer
- scene HASA X-factory
Usually those factories only fill creation structures with incoming format-dependent data, then call the standard scene creation methods. On one hand, that way of dealing with data has a major drawback: you need to write and maintain a lot of code. On the other hand, it has several interesting advantages:
- the X-importers can (and should) be written as independent modules, say static libs. Other people can re-use them in their own engines, since only the X-factories depend on my internal classes. In the same way, I can reuse those loaders in other projects.
- If there are some changes in the internal classes (even radical ones), already existing art remains valid. It doesn't need reexporting or anything. The only piece of code which needs updating is the factory, the actual interface between the data and the engine. This is not the case with automatic serialization.
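The importer / factory split described above can be sketched as follows. Every name here (XImporter, XFactory, MeshCreate, Scene::LoadX, the OnMesh callback) and the toy one-mesh-per-line "format" are assumptions for illustration, not Flexine's real classes. The importer knows the file format and nothing about the engine; the factory ISA importer and is the only piece touching engine classes. For brevity the scene creates its factory on demand rather than holding one as a member.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// --- Engine side (hypothetical) ---
struct MeshCreate { std::string name; int vertexCount = 0; };

class Scene {
public:
    void CreateMesh(const MeshCreate& mc) { mMeshes.push_back(mc); }
    const std::vector<MeshCreate>& Meshes() const { return mMeshes; }
    bool LoadX(const std::string& data);   // defined below, uses the factory
private:
    std::vector<MeshCreate> mMeshes;
};

// --- Stand-alone importer: parses the format, engine-independent ---
class XImporter {
public:
    virtual ~XImporter() = default;

    // Toy format: one mesh per line, "name vertexCount".
    bool Import(const std::string& data) {
        std::istringstream in(data);
        std::string name;
        int vertexCount = 0;
        int count = 0;
        while (in >> name >> vertexCount) { OnMesh(name, vertexCount); ++count; }
        return count > 0;
    }
protected:
    virtual void OnMesh(const std::string& name, int vertexCount) = 0;
};

// --- Factory: fills creation structures, calls scene creation methods ---
class XFactory : public XImporter {
public:
    explicit XFactory(Scene& scene) : mScene(scene) {}
protected:
    void OnMesh(const std::string& name, int vertexCount) override {
        MeshCreate mc;
        mc.name = name;
        mc.vertexCount = vertexCount;
        mScene.CreateMesh(mc);
    }
private:
    Scene& mScene;
};

inline bool Scene::LoadX(const std::string& data) {
    XFactory factory(*this);
    return factory.Import(data);
}
```

If MeshCreate or Scene change, only XFactory needs updating; XImporter and the files on disk stay as they are.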
- Some developers build their custom format so that it's optimal for hardware T&L. While I understand the purpose, I don't think it's very future-proof, as the nature of hardware-friendly things tends to radically change over the years. As a single example, there's still no clear answer whether strips are better than lists or not. As a result, in Flexine I chose to build hardware-friendly data at runtime, on loading or sometimes on-the-fly when needed - following the lazy-evaluation paradigm. Triangle strips are not stored to disk but built on-the-fly. Consolidation is performed on demand as well. Care must be taken to ensure it doesn't slow down the loading process, but in the end it allows me to keep the art relatively independent of hardware trends. The transform between on-disk and in-engine data can furthermore be driven by a runtime performance analysis, to make sure it really fits the underlying hardware.
- All bets are off when it comes to streaming, nonetheless. Not all data formats support streaming, and you really need to design things carefully here. In any case, I wouldn't want to support streaming of automatically serialized classes...
- The low level rendering API is currently DirectX, for better or for worse. I'm not claiming it's better than OpenGL, let's leave those useless discussions behind. At one point it just seemed to evolve faster.
- This is a problem as far as compatibility is concerned. The right way to solve that problem is probably to build a rendering abstraction layer, upon which you can plug whatever you want. This is not an easy task, and the trick is probably to design relatively high-level interfaces. That way you can implement any exposed function with any low-level API, even if a given function has a native counterpart on one API, and must be recoded with multiple calls on the other. In Flexine there's a first wrapper exposing those high-level interfaces, then a particular implementation of those interfaces for each low-level API (one for DX7, one for DX8, etc).
- This is not a matter of simply wrapping a DX call with a one-liner "high-level" method. It sometimes made me rewrite complete DX functionalities. For example, I have my own DX7 version of what has been later introduced in DX8 under the name "Effects & Techniques". Exposing the E&T interfaces allows me to use the native E&T stuff in DX8, but also to keep the same application code working with, say OpenGL. Wrapping also allows one to minimize redundant render state changes. Here's a snippet of one of my posts on the GD-Algorithms list about this:
"One of the reasons I chose to wrap everything in the first place was to avoid redundant state changes by caching all states at the app level - hence avoiding to even call DX SetRenderState methods when not needed. From what I read on the DXML - and you probably know those bits better than I do - it *should* be useless because the driver *should* do it as well. But several reality-provided things have since been added on top of that cute scenario :
- when I started implementing this on DX7 (or was it DX6?) the driver behaviours were quite random, to say the least. Does that one cache things correctly? Does that other one filter redundant changes? Uh, nobody was able to provide a definitive answer, and I doubt such a thing even existed. Better safe than sorry, wrapping everything was faster than looking for dubious IHV-provided-answers-yeah-of-course-we-do-it-what-do-you-think.
- if I understand correctly (and since I'm still using DX7 I might be wrong), you can't use GetRenderState() methods with pure devices. So the only way to know the current states is to cache them at the app level anyway. Whether you should have to worry about the current states is another question, the point is: if you want to know, you have to code the thing anyway.
- but if you start playing with state blocks, it becomes messy : the state block gets applied and your app-level caches don't know about it. Screwed. Hence you need to wrap state blocks as well as render states.
- now comes the NVIDIA case: state blocks are good and fine, except... doh, on NVIDIA cards. What the hell ? What am I supposed to do ? Duplicate my code everywhere, supporting state blocks or not depending on the card ? Big mess ahead. Once again, it's way cleaner to do your own state blocks once and for all, and use them everywhere. On NVIDIA boards, they end up calling the crude render state methods. On all other boards, they end up using real state blocks. You can even profile both and choose what's best at runtime.
For all those reasons (which basically come down to a single one: peace of mind), biting the bullet is, IMHO, worth it. Extra advantages:
- the delta stuff is actually one C++ line:
StateBlock Delta = SB0 - SB1;
- you can use Effects & Techniques files in your DX7 app
- you don't bother anymore about what's best / fastest: the runtime profiler does that for you for every piece of new hardware you can imagine. I don't want to put my nose in that code again, that's not interesting, that's vain, that's painful.
- the wrapper also checks the caps for you - another painful thing to do IMHO.
There are two downsides :
- it's a lot of code
- it's admittedly slower than calling the DX methods directly. (and I definitely don't care, to me it's very very worth the price)"
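As a rough illustration of the app-level caching and the one-line delta discussed in the post above, here is a minimal sketch. It is not the actual wrapper: StateId, StateValue, CachedDevice and the counter standing in for the real DX call are assumptions, and so is the meaning given to the subtraction (the states of the left block whose value differs from, or is missing in, the right block).

```cpp
#include <cassert>
#include <map>

// Hypothetical identifiers; a real wrapper would map these to API states.
using StateId = int;
using StateValue = unsigned;

struct StateBlock {
    std::map<StateId, StateValue> states;

    // Assumed semantics: states of *this that differ from (or are absent in) 'other'.
    StateBlock operator-(const StateBlock& other) const {
        StateBlock delta;
        for (const auto& kv : states) {
            auto it = other.states.find(kv.first);
            if (it == other.states.end() || it->second != kv.second)
                delta.states.insert(kv);
        }
        return delta;
    }
};

class CachedDevice {
public:
    // Returns true if the change was forwarded to the (simulated) driver.
    bool SetRenderState(StateId id, StateValue value) {
        auto it = mCache.find(id);
        if (it != mCache.end() && it->second == value) return false; // redundant
        mCache[id] = value;
        ++mDriverCalls;         // stands in for the real low-level API call
        return true;
    }

    // Blocks go through the same cache, so the cache never gets out of sync.
    void Apply(const StateBlock& sb) {
        for (const auto& kv : sb.states) SetRenderState(kv.first, kv.second);
    }

    int DriverCalls() const { return mDriverCalls; }

private:
    std::map<StateId, StateValue> mCache;
    int mDriverCalls = 0;
};
```

With this in place, "StateBlock Delta = SB0 - SB1;" followed by Apply(Delta) touches only the states that actually change when going from SB1 to SB0.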
- Batching and dynamic lighting: here's another of my posts from the GDA:
> Does anyone know a decent order in general for renderstate costliness?
> (including turning lights on and off)
"Lights can be handled by the collision detection module. I just send both meshes + lights to the sweep-and-prune, basically. Since I already send meshes out there to handle all meshes vs meshes collisions, adding some lights to the call is virtually free. (...as long as they are point lights) Then for each mesh I keep N lights (usually N=8 but it depends on your hardware). I don't use a scene graph, I batch & sort everything as you propose (a reasonable way IMHO - and I don't think there's a "best" way). I usually sort by texture first, or by material/shader. Sometimes you want to sort by VB instead, but it depends on your card, geometries, viewpoint, anything. So the best way is to be able to change the sort keys at runtime, and do a little runtime analysis to figure out what's best. Usually textures win, but I admittedly only ever tested this on NVIDIA cards and with limited scenes. Now, if you have a lot of dynamic lights around, it may be better to sort by "mesh" first, activate N lights for the mesh, render it - possibly with multiple texture changes -, deactivate the lights and go on. If you sort by texture first, you decrease the number of texture switches but increase the number of light switches. Since light activation is really fast, it may be a win. But since a given mesh may be lit by N lights, you're actually trading one texture switch for N light switches, and all in all the winner is unclear... In any case, the sweep-and-prune initial phase is good to determine the mesh/lights interactions, regardless of how you handle them afterwards. The same method is used in Umbra, under the name "Regions Of Influence" (even if I prefer the good old sweep-and-prune label). It works reasonably well anyway. I must have some test scenes with 128 meshes lit by 128 lights, and determining what is influencing what is virtually free (just an O(n) algorithm on 256 thingies).
Well, in any case I wouldn't use a scene-graph, and I wouldn't bother too much about what's best since it has always been evolving/changing over the years, is currently unclear, and is probably going to change with future versions of DX, future cards, future whatever. So don't make it "best", make it flexible. Meanwhile: batch, sort by texture, you'll be fine - correct me people if I'm wrong, but it seems to work well here.
I would also like to back up Tom's comments about cooking your own state-blocks wrapper and delta-compressing state changes. This is a lot of code indeed, not too exciting and even pretty boring. But that's probably one of the best routes I've ever taken as far as rendering is concerned. It just makes life simpler."
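To illustrate the mesh/light pairing from the post above, here is a deliberately simplified sort-and-sweep along one axis. It is not the engine's sweep-and-prune: it re-sorts on every call (so it is O(n log n) plus the reported pairs) and ignores the frame-to-frame coherence a real implementation would exploit. The Box layout, the isLight flag and FindLitMeshes are assumptions; each point light is represented by the box enclosing its range.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct Box {                 // axis-aligned bounding box
    float min[3];
    float max[3];
    bool  isLight;           // true: box enclosing a point light's range
    int   index;             // index into the caller's mesh or light array
};

inline bool Overlap(const Box& a, const Box& b) {
    for (int i = 0; i < 3; ++i)
        if (a.max[i] < b.min[i] || b.max[i] < a.min[i]) return false;
    return true;
}

// Returns (meshIndex, lightIndex) pairs whose boxes overlap.
std::vector<std::pair<int, int>> FindLitMeshes(std::vector<Box> boxes) {
    // Sort by minimum X, then sweep along that axis.
    std::sort(boxes.begin(), boxes.end(),
              [](const Box& a, const Box& b) { return a.min[0] < b.min[0]; });

    std::vector<std::pair<int, int>> pairs;
    for (std::size_t i = 0; i < boxes.size(); ++i) {
        for (std::size_t j = i + 1; j < boxes.size(); ++j) {
            if (boxes[j].min[0] > boxes[i].max[0]) break;       // prune on X
            if (boxes[i].isLight == boxes[j].isLight) continue; // mesh-vs-light only
            if (!Overlap(boxes[i], boxes[j])) continue;
            const Box& mesh  = boxes[i].isLight ? boxes[j] : boxes[i];
            const Box& light = boxes[i].isLight ? boxes[i] : boxes[j];
            pairs.emplace_back(mesh.index, light.index);
        }
    }
    return pairs;
}
```

From the resulting pairs, each mesh can keep its N closest or strongest lights; how to pick them is left out here.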
- As in numerous engines, there's a distinction between the actual objects to render (meshes, but also various helpers) and their rendering properties (materials, textures, illumination model, in a word: shaders). Basically you can batch by shader to minimize render state changes, batch by object to minimize VB / IB switches, or batch by light to apply each of them in a separate pass - effectively bypassing the usual hardware limit of 8 lights. You're of course supposed to batch by shader or mesh as well within a light batch... It gets worse when visibility enters the picture, as you also want to render things in a rough front-to-back order to reduce overdraw. And it becomes really messy when all of this relies on traversing a scene-graph. There's no perfect pipeline, it all depends on the situation.
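Since no single batching order wins everywhere, one way to stay flexible is to make the sort key a runtime parameter and measure the result. The sketch below assumes a made-up DrawItem record and SortKey enum, and uses a simple count of shader switches as a stand-in for real profiling.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical draw record: IDs are plain integers for the sake of the example.
struct DrawItem { int shader; int vertexBuffer; float depth; };

enum class SortKey { Shader, VertexBuffer, FrontToBack };

// Reorders the batch in place according to a key chosen at runtime.
void SortBatch(std::vector<DrawItem>& items, SortKey key) {
    auto less = [key](const DrawItem& a, const DrawItem& b) {
        switch (key) {
            case SortKey::Shader:       return a.shader < b.shader;
            case SortKey::VertexBuffer: return a.vertexBuffer < b.vertexBuffer;
            case SortKey::FrontToBack:  return a.depth < b.depth;
        }
        return false;
    };
    std::stable_sort(items.begin(), items.end(), less);
}

// Counts shader changes when drawing in the current order.
int CountShaderSwitches(const std::vector<DrawItem>& items) {
    int switches = 0;
    for (std::size_t i = 1; i < items.size(); ++i)
        if (items[i].shader != items[i - 1].shader) ++switches;
    return switches;
}
```

Because the sort is stable, sorting by a secondary key first and the primary key second gives a two-level ordering without a combined comparator.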
- Actual meshes are made of several submeshes. A submesh is a group of faces sharing the same rendering properties. Each submesh has the usual geometry and topology, both locked in their respective vertex and index buffers. Most of the time a system memory copy of both buffers is kept, and used in various cases such as picking or collision detection. Hardwired topology is best expressed as indexed triangle strips - usually with a single degenerate indexed strip for each submesh -, but both strips and lists are of course supported. The system copy is always a list, stripified on-the-fly when needed. Vertex and index buffers can be either static (and optimized under DX7) or dynamic. Most of the time anyway, they're semi-dynamic: multiple meshes are stored in shared buffers, managed as LRU caches. Meshes are not stored in hardwired buffers at loading time: they're actually sent to the renderer only when they become visible for the first time. All of this is kept under the hood, in the abstract renderer, so that the process is totally invisible to the app, which doesn't want to know about the details. The day vertex buffers become obsolete, the rendering interfaces will hopefully remain valid nonetheless. Now, this management is exposed in a RenderableSurface at Renderer level, but you're not forced to use it if you don't want to. You can still access the simpler wrappers and build your own system.
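The "single degenerate indexed strip" mentioned above is usually obtained by stitching several strips together with zero-area triangles. A minimal sketch of that stitching follows; AppendStrip and CountRealTriangles are hypothetical helpers, not the actual stripper code. The last index of the previous strip and the first index of the next one are repeated, plus one extra index when needed so the next strip starts on an even position and keeps its winding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Index = std::uint16_t;

// Appends 'next' to 'strip', linking them with degenerate triangles.
void AppendStrip(std::vector<Index>& strip, const std::vector<Index>& next) {
    if (next.empty()) return;
    if (!strip.empty()) {
        strip.push_back(strip.back());     // repeat last index of previous strip
        strip.push_back(next.front());     // repeat first index of next strip
        // The next strip must start on an even position to keep its winding.
        if (strip.size() % 2 != 0) strip.push_back(next.front());
    }
    strip.insert(strip.end(), next.begin(), next.end());
}

// Counts triangles with three distinct indices (degenerate ones are skipped).
int CountRealTriangles(const std::vector<Index>& strip) {
    int count = 0;
    for (std::size_t i = 2; i < strip.size(); ++i) {
        const Index a = strip[i - 2], b = strip[i - 1], c = strip[i];
        if (a != b && b != c && a != c) ++count;
    }
    return count;
}
```

For example, stitching {0,1,2,3} and {4,5,6} gives {0,1,2,3,3,4,4,5,6}: nine indices, three real triangles, four degenerate ones.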
- The classic MAX model has been followed, where each face references a material, and each material references one or more textures. Easy, handy, works well. But it's not flexible enough. The new model supports E&T-like shaders. They can be compiled from text files or directly from memory, and they produce byte-code further captured in state blocks. A fast state block emulation path is provided - which happens to be faster on NVIDIA cards, in accordance with what Richard Huddy repeated many times on the DXML.
- DXTC compression is supported simply because there's no reason not to support it. Reduces bandwidth, speeds things up, best of all worlds.
- Parametric geometry and high-order surfaces have been taken into account, not for the sake of it, but really because the more polygons we can draw, the more bandwidth becomes an issue. I don't believe in B-patches so far : as long as it's not hardware-accelerated, it only looks like geometry compression to me. If you want geometry compression, see next paragraph. N-patches and RT-patches are "supported" simply because I wrap & expose all DX render states. Theoretically you can set up those render states in a Flexine shader and it should work like the proverbial charm. Now, I don't have a GeForce3 or a Radeon to test this! Meanwhile, I have software subdivision surfaces using the modified Butterfly algorithm. It's been designed so that it cohabits well with other things like cloth or skinning. Hence you can subdivide the result of a skinning algorithm on-the-fly, for example.
- Dedicated geometry compression algorithms have been implemented, in order to be Internet-friendly. Decompressing a mesh into a vertex buffer on-the-fly can sometimes be a good move. [more about that later]
- Shadow maps & shadow volumes are here. 8 different silhouette extraction routines and a software renderer in case render-to-texture is whacked.
More to come:
Collision detection choices
Visibility & culling
Even more to say about:
- LOD & simulation LOD
Lookin' for a planet with 96h a day. Someone ?