Flap Flap

Development Diary Frame Rate Glitches on the Xbox 360

04-April-2011

During playtesting of Flap it became obvious that Flap had some issues. Random glitches would occur in the framerate.

Steve from StevePro Studios suggested that I investigate garbage collection. He also pointed me towards two excellent articles that he wrote. I also investigated a few others. See the links on the right hand side.

To determine if there are conventional memory leaks simply watch what the process does over time. However, creating temporary strings, managed objects or boxed values will not lead to memory leaks but they will lead to garbage collection events occuring.

Other useful resources include a good presentation ny Shawn hargreaves on the subject of how managed code runs on the Xbox. http://www.microsoft.com/downloads/en/details.aspx?FamilyId=B11AD912-4158-44CC-A771-A5E044F7E3BB&displaylang=en

I also particularly like this simple description of how to use idlasm to locate boxing and unboxing. Some of them can be a bit hard to spot.


With XNA Remote Performance Monitor I can see that the main "start menu" "press start" is creating lots of strings and boxed values.
Ok, lets dig into what I'm doing wrong.

The xbox remote performance monitor is very useful because you can see the GC work occuring. As the collections occur I get glitching on the framerate. Its very small and subtle but it is definately there.

Build, deploy, start the XNA framework performance monitor, start flap.

Flap starts and then "sits" on the start menu (i.e the menu where it scans for the controller to bind player the to).

The delta column is the important one. I have 200 managed strings being allocated, 550 managed objects allocated and 200 boxed value types.

Woah. this screen just sits there and displays a string such as "Press A"
Every few seconds the GC will come along and clean up. Each time it does that it collects about a 1MB of memory. This is not good.
If I display the screen with a highscore table (10 rows with format number, name, number) I have: 3400 strings, 6350 managed objects and 700 boxed value types!
That is a large scale problem. At that level of allocatios the GC kicks in very frequently. This is not a bad place to start investigations though. It is a well defined "local" problem - i.e. I can make changes in one isolated location that I can directly measure.

The highScore table has a CompleteScoreRecord that in turn has three List<ScoreRecord>.
CompleteScorerecord and Scorerecord are both value types as they are defined as structs.
Ahh. I bet we are boxing and unboxing quite a bit here.
Ok, simple change; just make them both classes.

Run it again, measure. What effect has that had?
Very little! That wasn't it - ok, it is almost certainly the strings and string.format function.
Should I carry on guessing or should in fact use ildasm to work it out?

Ok, I know I build the string every single time I render the frame. I do this because to be honest performance is not an issue in this bit and it doesn't cause problems. I could use the same argument to not bother fixing it but this is a good localised problem so I can gain a few insights here that i can then apply elsewhere.

My score Record looks like this:
public class ScoreRecord
{
    public string Name;
    public int Score;
    public int Level;
}
This was a dumb struct (which I was using in a C++ sort of way even though C# is very different to C++ treatment of structs).

On screen this is displayed as [int] position [string] Name [int] score. The posiiton and name are left aligned and the score is right aligned.

If I add a data member to cache the display line instead of building it everyframe:
public class ScoreRecord
{
    public string Name;
    public int Score;
    public int Level;
    [XmlIgnore]
    public string displayLine;
}
During construction or allocation of a new score record I shall pre-build the displayLine. I won't persist it so I mark it as ignored. The displayLine will be a combination of the position and the player's tag.

After making this change and remeasuring:
The managed string count is now down to 1900 and the number of boxed value types is down to 200. (which are probably attributable to the rest of the screen).
To work out exactly which are ours I shall tweak the menu settings to remove the highscore display from this menu.
All of flap is data driven so it is a small change in an XML file.

Ok base numbers for the menu screen without a high score table are 1400 for the strings and 200 for the boxed values. We can conclude exactly 500 string allocations are down to the high score table. How can make those other 500 string allocations go away? I suspect pre-rolling the other string work we do will be enough.

Final changes: Preroll the invariant culture score and through in the size calculation (to work out positioning) and the whole score record should now be rendered off cached data with no "on the fly" string creation.

Great, re-examination shows that the string allocations and boxed value types is back to 1400 and 200. This is the result of "other things" going on.

My menus built on the fly every frame are a bit pants but to be honest but also not an issue that will lead to glitches. Lets go to work on those now as they are much more interesting!

If I examine the main rendering loop of the game in the remote performance monitor first to see what we are aiming for and "how bad is it".

Managed string objects 150, managed objects 1200 and boxed value types 200. Remember these are only the deltas! The values can all rise dramatically during game play but I think that is the steady state. Time to see if we can eliminate all of those.

The score is formatted everyframe and rendered. I shall keep a pre-formatted string that is updated when the score changes and use that.

internal void setScore( int n )
{
    _score = n;
    _scoreString=string.Format("Score: {0}", n );
}

And that's all it needed to get rid of the boxed value types and the managed strings.

There is the odd one that you can see pop in but not the 150/200 that we were seeing.

More worryingly though the managed object count is 4000 or so per delta. What is going on there then - because those are still leading to GC events firing? These are explicit new calls that I am making. The number is directly proportional to the number of actors on screen. What do actors do every frame? They are rendered. Time to examine the actor rendering code by eye first and then maybe use some of the other tools at our disposal.

That many calls must be something to do with the main actor class - especially as the number of allocations rises smartly as the amount going on screen also rises.

What's in the main render function?

If there is a Effect for this actor then use the effect and set up parameters. It could be that but the effects are all shared off a singleton manager class.

Then the actor itself is drawn with a simple Spritebatch.Draw call.

The other thng that actors do everyframe is get updated. What's in the update method of an actor then?

public override void Update(GameTime gameTime)
{
removeChildren();
updateSourceRectangle(gameTime);
if (!Engine.IsDisplayingMenu() && _isAlive)
{
if (isUserControlled())
{
checkForInput(gameTime);
}
determineVelocity(gameTime);
_position += Vector2.Multiply( _velocity,(float) gameTime.ElapsedGameTime.TotalSeconds);
checkAndConstrainPosition();
updateVisibility();
updateParticleEmitters(gameTime);
updateAnimations(gameTime);
updateArrowPorts(gameTime);
base.Update(gameTime);
}
}

Start at the top and work down. The removeChildren() fucntion will never be called on early levels. It only applies when knights position on walls or the bosses. This doesn't occur on level 1.

The next line is the updateSourceRectangle(). Ah ha. A possible culprit.

Most of the actors in the game use Sprite sheets. A single texture with lots of pictures on. This cuts down on texture switches and other expensive render state changes.

Each Actor has a source rectangle property that has, not surprisingly, where to get the source rectangle on the sprite sheet.

Everytime time the animation is due for an update the rectangle is moved by replacing it with a new rectangle.

SourceRectangle = new Rectangle(spriteColumn * width, spriteLine * height, width, height);

Ok, what happens if we replace that with a simple set of assignments instead?

Not great. Very little difference. What now?


determineVelocity is a possible. It has a number of overrides in most derived types.
Lets look at the arrow class:

public override void determineVelocity(GameTime gt)
{
Vector2 v = Velocity;
switch (_gravity)
{
case GravityAffected.FALLS_TO_EARTH:
v += new Vector2(0f, 0.375f * Actor.SCALE_FACTOR * (float)gt.ElapsedGameTime.TotalSeconds);
Velocity = v;
break;

... other details elided
}
}

Ah. Well that's not required. The code is written as conceptually that's what we want to do. Lets refactor out that new statement.

The Velocity is a property that wraps a member variable of the base class.

We have a double assignment and the new call. Vector2 is a struct hence lots of copying going on here.

Move the declaration of member variable velocity to be protected. This is rubbish and really we should have a by ref setters. We'll go back and add that in a minute.

That had a tiny impact. We're starting to struggle a little. Time for the profiler.

The CLR profile can profile allocations. Simply start the profiler and ask it to start flap. Play through a level or so and see what we have.

The number of allocations rather than the actual size of memory is important here.

Most of the raw memory is spent in texture allocations.

But if I look at the "histogram of allocations" I can see what has been allocating the most.

Of 400,000 total instances, 124,000 are Vector3, 62,000 lists, 62,000 PossibleLight collections and 62,000 LightValues.
Ah hah. The real time lighting that is used on many sprites is starting to look a bit guilty! Conclusion. I should have used the profiler earlier. This is a truism when profilers are involved and one I should have remembered. Don't guess; measure properly with the correct tools and don't assume you know what is really going on.

The function is quite simple - every frame, for every possible scene light it determines which three are lighting the actor. It sorts and allocates local collections for the actor.

A couple of strategies spring to mind:
The light list should belong to the actor, be passed in by reference and reset each frame rather than a new one been allocated constantly.

That's the first place to start.

The maximum lighting to be applied to a single actor is three lights. I'm newing in a List of three lights each time rather than assigning in.

These two changes alone may be enough.

Each actor now has an instance of LightValues which is passed by reference to the lighting engine (which works out which scene lights are in effect for any given actor).

The Vectors are initialised up front and then assigned into. So we still have copy overheads going on but many less "news".

Lets see if we can see the effects in the XNA Framework Remote performance Monitor.

Wow. That's a significant impact. We've just cut it in half. We're down to 2000 or so managed object allocations per delta or less when little going on the screen.

Ok, back to the profiler. Find and then knock off the next guilty party!

I run level one, play it properly and activate as many spells as possible.

Rerun the allocation profile and, again, look at the histogram of allocations first. This will tell us what types of things are being allocated and for how much memory.

I have 360,000 total allocations.
149,995 are particles. Ah ha. This was one of my "if I had to bet" candidates even though the particles are designed to be reused and not many should be allocated dynamically. Well that obviousy isn't working quite as expected!

Work out what the issue is and then come up with an appropriate strategy. Perhaps offloading to the GPU with a GPU based particle system would be best but then again that may be overegging it?

Right click on the allocation you are interested in and "show whos allocated".

90% of these allocations are the "Dragon::fireDragonsFire" method that launches the rocket powered bombs horizontally.

Ok, that should be fixable pretty easily.

The bombs are reusable. They are allocated "upfront" and then there is a list of active and inactive bombs (for when you need the next bomb). This is to prevent "newing" the bombs. This bit of the design works perfectly.

The small change is that each bomb needs to *keep* its particle systems and simply turn them on and off. What happens at the moment is the bomb changes its behaviour at firing time by adding dynamic particle systems.

Popping back into the performance monitor and see what impact that has had. I'm hoping that this will also have fixed some of the last few string allocations as well (as the particle system, like many other items in the game, are data driven with XML files and the XML no longer requires deserialisation - which in turn builds strings).

That has cut it in half again. The managed object allocations is now down to less than a 1000. We're really moving in the right direction now.

Whilst we're at it lets do the same for arrows and also the smoke damage. I know they work in the same fashion and don't really require a profile to see it.

That's now cut it down even further with a few hundred. Back to the profiler to work out where to go next.

We're now down to the PossibleLight function that builds sorted lists on the fly. we can do better there.

The list that holds the sorted set of lights is newed for each actor per frame. No need for that as we're not holding for deferred lighting - it can be one common list used by everybody.

There are also some Vector assignments that are using new vectors to go from vector2 to vector3 - those can be re-engineered as assignment operators.

The steady state on a frame is now down to a small handful of assignments. This is from a starting point of a few thousand. A couple of easy to use tools has given enough clues to solve the glitching frames.

A final run through on the xbox and what do we have left?
a few tens to a few hundred when it is particularly busy. The allocations are becoming harder to solve. Now the profiler needs to be activated just for periods and the application terminated. This "snapshot" of allocations removes all of the level load allocations and other issues that we simply aren't interested in.