
Tuesday, April 7, 2009

Handling Big XML Files on iPhone

Jim Dovey (aka @alanQuatermain) has come up with what looks like a tremendous solution for handling very large XML files on the iPhone. Jim tested with a large (22 megabyte) XML file. Parsing it with NSXMLDocument took about 123 megabytes of virtual memory (I'm assuming this was run on the simulator, since there's no way you could get that much heap allocated on the phone itself).

Running the exact same file through his own parser used only 70 kilobytes of virtual memory. That's a dramatic, several-orders-of-magnitude difference. So, if large XML files are the bane of your existence, you might want to check out his solution.

Friday, April 3, 2009

Adding CLANG to Your Build Process

Fraser Speirs has a nifty piece this morning on using Git pre-commit hooks to automatically run the CLANG Static Analyzer. I'm not a Git user myself, but I know a lot of people are, so this looks like good info for those of you who are.

Tuesday, March 31, 2009

Speed with a Catch

A while back, I wrote a post about surface normals in OpenGL ES. Yesterday on Twitter, there was some discussion about using the inverse square root function from Quake 3 to speed up the performance of iPhone OpenGL ES applications. Here is what that function looks like (converted to use GL and iPhone data types):

static inline GLfloat InvSqrt(GLfloat x)
{
    GLfloat xhalf = 0.5f * x;
    int i = *(int *)&x;             // reinterpret the float's bits as an integer
    i = 0x5f3759df - (i >> 1);      // magic initial guess for Newton's method
    x = *(GLfloat *)&i;             // reinterpret the new bits as a float
    x = x * (1.5f - xhalf * x * x); // one round of Newton's method
    return x;
}

The inverse square root can be used in several ways. Noel Llopis of Snappy Touch pointed out two uses for it on Twitter yesterday: calculating normals and doing spherical UV texture mapping. I'm still trying to wrap my head around the UV texture mapping, but I understand normals pretty well at this point, so I thought I'd see what kind of performance gain I could get from this old optimization. There are all sorts of arguments around the intertubes about whether this function still yields performance gains, but there's an easy way to find out: use it and measure with Shark.

I used my Wavefront OBJ Loader as a test, and profiled the loading of the most complex of the three objects - the airplane. The first run used my original code, which stupidly1 used sqrt(). I then re-ran it using sqrtf(), and then again using the Quake 3 InvSqrt() function above.

The results were impressive: you definitely do get a performance increase from using this decade-old function on the iPhone. Using InvSqrt() gave a 15% decrease in time spent calculating surface normals compared to sqrtf(), and a 40% decrease compared to sqrt(). That's nothing to sneeze at, especially in situations where you need to calculate normals on the fly many times a second.

Now, if you remember, this was how we calculated normals using the square root function from Math.h:

static inline GLfloat Vector3DMagnitude(Vector3D vector)
{
    return sqrt((vector.x * vector.x) + (vector.y * vector.y) + (vector.z * vector.z));
}

static inline void Vector3DNormalize(Vector3D *vector)
{
    GLfloat vecMag = Vector3DMagnitude(*vector);
    if (vecMag == 0.0)
    {
        // Zero-length vector: substitute a unit vector and bail out
        // rather than dividing by zero below.
        vector->x = 1.0;
        vector->y = 0.0;
        vector->z = 0.0;
        return;
    }

    vector->x /= vecMag;
    vector->y /= vecMag;
    vector->z /= vecMag;
}

So... how can we tweak this to use the inverse square root? Well, the inverse square root of a number is simply 1 divided by the square root of that number. In Vector3DNormalize(), we divide each of the vector's components (x, y, and z) by the magnitude of the vector, which is calculated using the square root. Since dividing a value by a number is the same as multiplying it by 1 divided by that number, we can just multiply each component by the inverse magnitude instead, like so:

static inline GLfloat Vector3DFastInverseMagnitude(Vector3D vector)
{
    return InvSqrt((vector.x * vector.x) + (vector.y * vector.y) + (vector.z * vector.z));
}

static inline void Vector3DFastNormalize(Vector3D *vector)
{
    // Guard against the zero vector before calling InvSqrt(); the bit trick
    // returns a huge (but nonzero) value for an input of zero, so testing
    // the result itself would never catch this case.
    if (vector->x == 0.0 && vector->y == 0.0 && vector->z == 0.0)
    {
        vector->x = 1.0;
        vector->y = 0.0;
        vector->z = 0.0;
        return;
    }

    GLfloat vecInverseMag = Vector3DFastInverseMagnitude(*vector);
    vector->x *= vecInverseMag;
    vector->y *= vecInverseMag;
    vector->z *= vecInverseMag;
}


Sweet, right? If we now use Vector3DFastNormalize() instead of Vector3DNormalize(), each call will be about 15% faster on the current generation of iPhones and iPod touches than with the built-in square root function.
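
To make the normals use case concrete, here's a minimal sketch of how the fast normalize might slot into a per-triangle surface normal calculation. It assumes the Vector3D struct (three GLfloat fields: x, y, z) from my earlier surface-normals post; the function name is just for illustration:

static inline Vector3D Triangle3DCalculateFastSurfaceNormal(Vector3D v1, Vector3D v2, Vector3D v3)
{
    // Two edges of the triangle sharing vertex v1
    Vector3D u = { v2.x - v1.x, v2.y - v1.y, v2.z - v1.z };
    Vector3D v = { v3.x - v1.x, v3.y - v1.y, v3.z - v1.z };

    // The cross product of the edges is perpendicular to the triangle's face
    Vector3D normal;
    normal.x = (u.y * v.z) - (u.z * v.y);
    normal.y = (u.z * v.x) - (u.x * v.z);
    normal.z = (u.x * v.y) - (u.y * v.x);

    // Scale to unit length using the InvSqrt()-based normalize
    Vector3DFastNormalize(&normal);
    return normal;
}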

But… there's a catch. Actually, two catches.

The Catches


The first catch is that this optimization isn't faster on all hardware. In fact, on some hardware it is measurably slower than using sqrtf(). That means you're gambling that future hardware will also benefit from this same optimization. Not a huge deal, and very possibly a safe bet, but you should be aware of it, and be prepared to back it out quickly should Apple release a new generation of iPhones and iPod touches that uses a different processor.

The second, and far more important, catch is the possible legal ramifications of using this code. You see, Id released the Quake 3 source code under the GNU General Public License (GPL), which is a viral license. If you use source code from a GPL project, you have to open-source your entire project under the GPL as well. Now, that's an oversimplification, and there are ways around the GPL, but as a general rule, if you use GPL'd code, you have to make your code GPL also.

But, the waters are a little murky. John Carmack has admitted that he didn't write that function, and doesn't think the other programmers at Id did either. The actual author of the code is unknown. Some of the contributors to the function have been found, but not the original author. That means the code MIGHT be in the public domain. If that's the case, its inclusion in a GPL application doesn't take it out of the public domain.

So, bottom line: is it safe to use? Probably. This function is widely known and widely used and there's been no indication that any possible rights owner has any interest in chasing down every use of this function. Are there any guarantees? Nope.

My recommendation is to use it, but make sure that everywhere you use it, you have a backup method to fall back on if you need to. If you want some assurance, you could try contacting Id's legal department for a waiver to use the function. I don't know if they'll respond, or if they'll grant it, but the folks at Id have always struck me as good people, so it might be worth an inquiry if you're risk-averse.
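
One low-tech way to keep that backup handy is a compile-time switch, so a single flag swaps every call site back to sqrtf(). Here's a minimal sketch, assuming the InvSqrt() function from above (the flag name is hypothetical):

#include <math.h>   // for sqrtf()

// Flip USE_QUAKE_INVSQRT to 0 to back the optimization out everywhere at once.
#define USE_QUAKE_INVSQRT 1

static inline GLfloat InverseSquareRoot(GLfloat x)
{
#if USE_QUAKE_INVSQRT
    return InvSqrt(x);       // the fast approximation of uncertain provenance
#else
    return 1.0f / sqrtf(x);  // the unambiguously safe fallback
#endif
}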

1 - sqrt() is a double-precision function, and OpenGL ES doesn't support the GLdouble data type, which means I was doing the calculation on twice as many bits as needed and converting from single to double precision and back again.

Monday, January 19, 2009

C++ and Performance

Here's an interesting blog post about using Objective-C++ to improve performance in iPhone apps. I may be in a minority on this, but I've never been a huge fan of the Objective-C++ language. Why saddle yourself with the limitations of two languages?

Okay, that's an oversimplification; there are times when Objective-C++ is a great choice, especially when you are writing the Mac or iPhone version of an application that runs on multiple platforms. It wouldn't, however, be the first option that leapt to my mind as a tool for optimizing a poorly performing iPhone application. Although C++ does some things faster, like method dispatch (well, technically it's a vtable lookup in C++, not a message dispatch), it still incurs some overhead for those things, so it seems to me that any optimization done in C++ is going to leave some room for further improvement. Any way you cut it, though, there's definitely some really valuable stuff in that blog posting about iPhone coding and performance, so go read it.

I'm in the middle of rewriting the drawing code for my particle engine to be faster, and I'm dealing with some of the same issues this blog post talks about. However, I chose a slightly different, more old-school option for my optimizations: static inline C functions and structs for the really time-sensitive stuff, and straight Objective-C for anything that isn't. Sure, it can be a pain to debug, but I only move code into a static inline function after I've had it running inside a regular method or function first. I use the structs to avoid object-creation overhead for things like particles, of which there will be a lot.
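
For what it's worth, here is a rough sketch of the shape of that approach; the struct fields and function names are hypothetical, not my actual engine code:

#import <OpenGLES/ES1/gl.h>

// Plain C struct: no object-creation or message-dispatch overhead for the
// thousands of particles alive at once.
typedef struct {
    GLfloat position[3];
    GLfloat velocity[3];
    GLfloat lifeRemaining;   // seconds
} Particle;

// Time-critical per-frame update as a static inline C function, so the
// compiler can inline it straight into the render loop.
static inline void ParticleUpdate(Particle *p, GLfloat deltaTime)
{
    p->position[0] += p->velocity[0] * deltaTime;
    p->position[1] += p->velocity[1] * deltaTime;
    p->position[2] += p->velocity[2] * deltaTime;
    p->lifeRemaining -= deltaTime;
}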

So far, it seems to be a good choice: I've had a 2,000-particle-per-second generator with a 12-second particle lifespan using GL_POINT rendering running at 20 frames per second on a first-generation iPhone without much noticeable skipping. That's a huge improvement over my first version, which skipped with even a few hundred particles per second. I'm hoping to see similar improvements in the other rendering types when I finish rewriting those.

Monday, October 20, 2008

iPhone "Optimized" PNGs

For the most part, Apple's developer documentation for both the Mac and iPhone is excellent: well written, accurate, and full of good code samples. There are, of course, always gaps, especially in newer areas. Sometimes these gaps are simply a matter of resource constraints; the volume of documentation the documentation teams have to maintain is staggering, and frameworks are constantly being added and updated.

At other times, Apple is tight-lipped about stuff not because of resource constraints, but rather intentionally, though it's not always clear why. Do you want to know how fast the processor in your iPhone is, or how much memory is available to your application? Both of these are obviously relevant to developing for the iPhone, but don't look to Apple for that information. The information is available (processor | memory), just not from Apple.

Here's another area where Apple is less than fully forthcoming. When it comes to the iPhone's use of PNG images, Apple tells you that Xcode "optimizes" PNGs as part of the build process and that, as a result, you should always use PNG images in your iPhone projects. PNG images will usually be bigger than JPEGs, making your application a little larger, but you'll be rewarded with some mysterious kind of improvement.

Since this information is not documented, what follows is something of an educated guess based on research and observations. I will gladly be corrected on any inaccuracies, so don't hesitate to ping me if you see something wrong.

So, just what is this magical, mystery "optimization"?



During the build, Xcode basically screws up your PNG image, turning it into something that's technically no longer a PNG at all. Xcode doesn't change your original file, but when it copies it to the Resources folder of your application bundle, it byte-swaps the red and blue octets and also "premultiplies the alpha" (if that doesn't make sense, don't worry, we'll talk about it in a moment). The PNG specification supports neither of these changes, which is why the image is technically no longer a PNG and can't be loaded by most image-editing software. Not that it really matters, since the new version exists only inside your application bundle, but it does mean that any PNG file in your app bundle should not be, for example, FTPed or e-mailed from your application directly, because the file will be corrupt as far as most software is concerned.

Byte Swapping


Byte swapping is exactly what it sounds like. Uncompressed computer graphic images are most often stored as sequences of three or four bytes, or "octets", representing each pixel in the image. Each byte in the pixel represents one of the three additive primaries (red, green, and blue), plus there's often another byte called "alpha", which we'll look at in a moment. There are other ways to store images, but this is the most common technique. The most common byte ordering used in this technique is RGB (or RGBA), which stands for Red-Green-Blue(-Alpha). This means that a single pixel is represented in memory by three or four bytes: the first representing the intensity of red, the second the intensity of green, and the third the intensity of blue (we'll ignore alpha for now). The PNG specification uses this byte ordering, as do many other image formats.

However, the iPhone's video memory uses a non-standard byte ordering of BGR, or Blue-Green-Red (I don't believe there's an alpha component in the iPhone's vRAM). Copying an image that's in memory as RGBA into video memory that expects BGR is more computationally expensive than copying, for example, BGR to BGR, which can be done with a single, fast memory-move (or "blit") operation. By doing this byte swap in the Copy Files Build Phase, your application is (in some situations) able to ignore the alpha component when loading the file into memory, so it can use the faster memory-move operation to get the image into video memory (and hence onto the screen).
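
As an illustration of the byte swap itself (a guess at the kind of transform involved, not Xcode's actual code), swapping a buffer of RGBA pixels into BGRA order might look something like this:

#include <stdint.h>
#include <stddef.h>

// Illustrative RGBA -> BGRA swap; each pixel is four octets.
static void SwapRGBAToBGRA(uint8_t *pixels, size_t pixelCount)
{
    for (size_t i = 0; i < pixelCount; i++) {
        uint8_t *p = pixels + (i * 4);
        uint8_t red = p[0];
        p[0] = p[2];   // blue moves to the first octet
        p[2] = red;    // red moves to the third; green and alpha stay put
    }
}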

Premultiplied Alpha


Okay, that makes sense, but what about this "premultiplying the alpha" thing? That one sounds kind of mysterious, doesn't it? As mentioned in the previous section, the iPhone's vRAM has no alpha component, so if we're going to ignore that component, we still need to take it into account somehow. Remember that PNG images that have been optimized for the iPhone are stored as four values representing the Blue, Green, Red, and Alpha components (in that order). Although it's called a "component", alpha isn't really a color component at all. It's more like a modifier that acts on the other three values, representing how much of whatever is beneath the color will show through. So, a color that looked like this:

B:1.0
G:0.0
R:0.0
A:0.5

Would represent the color blue, but at 50% opacity. In order for the computer to use the alpha component, it has to multiply the other three components by the alpha (and possibly by other values) before putting them into video memory. This multiplication is more computationally expensive than just copying the values into video memory. So Xcode premultiplies the three components by the alpha and stores the multiplied values. As a result, the color above would look like this after the Copy Files Build Phase:

B:0.5
G:0.0
R:0.0
A:0.5

Notice that the Blue component in this new pixel is equal to the Blue component of the previous version multiplied by the alpha. The result of this premultiplication is that the alpha component can be ignored when loading the image into memory, and the image can then be blitted directly to video memory without doing the expensive floating-point multiplications to apply the alpha. Well, at least sometimes. This premultiplied pixel is what the original color would look like when drawn over black.
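
In code, the premultiplication step amounts to something like the following; this is an illustration of the math, not Xcode's actual implementation, using normalized 0.0 to 1.0 component values as in the example above:

// Premultiply one pixel's color components by its alpha, matching the worked
// example: B:1.0 G:0.0 R:0.0 A:0.5 becomes B:0.5 G:0.0 R:0.0 A:0.5.
typedef struct {
    float blue, green, red, alpha;
} BGRAPixelF;

static inline BGRAPixelF PremultiplyAlpha(BGRAPixelF in)
{
    BGRAPixelF out;
    out.blue  = in.blue  * in.alpha;
    out.green = in.green * in.alpha;
    out.red   = in.red   * in.alpha;
    out.alpha = in.alpha;   // alpha itself is left intact
    return out;
}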

So, what happens if there's a color beneath that needs to show through? This is where things can get a little confusing. If your application is drawing the image on top of something else, or if you're using the alpha channel in the image as a mask, then the iPhone can't use this optimization. It has to do the alpha calculations (which is why the alpha channel is left intact in the premultiplication process), but the byte-swapping still offers some improvement. For the most part, you don't have to worry about this; it all happens under the hood.

The thing you should take away, though, is that you can help your iPhone know when it is safe to use the optimization. The way to do that is to check the "opaque" checkbox in Interface Builder for your image views (or other views that contain images), or to set the opaque property of your UIImageView to YES in code. This tells the iPhone that it's safe to do the faster blit operation and ignore the alpha channel, because nothing shows through an opaque object, so there's no need to do any expensive floating-point alpha calculations for it. Of course, if you are using the alpha channel in your image, or if you are doing complex image compositions at run-time, you shouldn't make the view opaque. But if you're just displaying an image in a UIImageView, checking that opaque checkbox is a really, really good idea.
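
In code, the hint is a one-liner (the image name below is just a placeholder):

// Equivalent to checking the "opaque" box in Interface Builder: promises
// UIKit that nothing shows through, enabling the fast blit path.
UIImageView *imageView = [[UIImageView alloc] initWithImage:
                          [UIImage imageNamed:@"background.png"]];
imageView.opaque = YES;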

An awful lot of the time, PNG images have 1.0 for the alpha value in every pixel. As a result, more often than not, programs run faster and use less memory because of the PNG "optimizations" done by Xcode.

What happens when you use different file types


When you use any other file type (or load a non-optimized PNG file), your iPhone has to do the byte-swapping and alpha premultiplication at load time (and possibly redo the alpha multiplication at display time). Your application basically has to do the same processing that Xcode does, but at run-time instead of at build-time. This will cost you both processor cycles and memory. One of the reasons Mobile Safari is the biggest memory hog among the built-in iPhone applications is that the images it loads to display web pages are all non-optimized, mostly JPEGs. And since JPEG is a compressed format, there's the added step of decompressing the image into memory before the premultiplication and byte-swapping can even begin.

Thursday, April 10, 2008

The Root of All Evil: Introduction to Optimizing Objective-C

Donald Knuth has a very famous quote: "[P]remature optimization is the root of all evil"1. This is a valuable idea to keep in your head while programming. It's actually okay to write inefficient code at times. In the context of Objective-C, this means it's okay to use the provided methods and objects in your first pass at an algorithm, even if you know they might not be optimal in terms of performance. Get the code working, get it working right, and then go back and optimize intelligently where it makes sense. In many places, small inefficiencies will have no noticeable effect on your application, and by leveraging thoroughly tested Cocoa code written by knowledgeable Apple and NeXT engineers, you'll likely save yourself considerable time and maintenance headaches.

But, there are times when inefficient code has an impact and needs to be addressed. In Java, Visual Basic, Ruby, and other high-level languages, there is a lot you can do to optimize your code, but there is a limit, because those languages insulate you from the inner workings with mechanisms such as garbage collection, and a lot of the code that runs when your code runs is completely outside of your control. In all of those languages, you can help the runtime know when it's okay to free memory you're done with by following good coding practices, but memory management is mostly out of your hands. Other optimizations, like loop unrolling, can be used in these languages, but the performance yield tends to be smaller than for the same optimizations in a low-level language compiled directly to machine code.

Here's where Objective-C really has a leg up. It is a high-level language and has a lot in common with Java, VB, and C#. It has garbage collection (well, not on the iPhone, yet) and exception handling, and it shields you from many low-level implementation details, as a good high-level language should. But Objective-C is a superset of C, and even though it does have a small runtime engine to enable object messaging and other features, most of your code gets compiled down to machine language. You are shielded from the low-level implementation details, but not prevented from getting to them. When you have exhausted your options for optimizing with Objective-C and Cocoa, you always have the option of lifting the hood and using good old-fashioned procedural C to optimize even more. A great Objective-C programmer needs to be at least a good C programmer.

Let's take a look at an example of optimizing a small chunk of Objective-C code.

Let's say that we have an NSData instance that contains encoded binary data stored as 8-bit ASCII characters. In this encoded data, we need to look for certain sequences of characters to make sure it contains the type of data it's supposed to. These tags are simple sequences of ASCII characters (a.k.a. C strings).

Here is my first, quickly written version leveraging built-in Cocoa objects. This was written as part of a category on NSData, so any reference to self is a reference to the NSData object holding the encoded binary data.


The Original, Horribly Non-Optimized Version




-(BOOL)containsCString:(char *)cmp
{
    NSUInteger offset = 0;
    char *byte;
    const char *bytes = [self bytes];

    // Stop strlen(cmp) bytes early so we never read past the end of the buffer
    while (offset + strlen(cmp) <= [self length])
    {
        byte = (char *)bytes + offset;  // advance to the current search position
        NSString *checkString = [NSString stringWithCString:byte length:strlen(cmp)];
        if ([checkString isEqualToString:[NSString stringWithCString:cmp length:strlen(cmp)]])
            return YES;

        offset++;
    }
    return NO;
}

Shall I leave it as an exercise for the student to identify just why this code is poor from an optimization standpoint? I'd better not, though the problems are likely obvious to many of you.

First of all, I'm allocating autoreleased objects inside of a loop, and that loop goes byte by byte through a potentially large chunk of data. None of the memory for those objects will be released until the end of the current pass through the run loop, which won't happen until after my loop finishes. I create two autoreleased NSString instances for every byte in the data and then run a string compare between them. NSString is a complex class cluster designed to deal with a multitude of string encodings, and it could potentially use multi-byte characters to represent these strings, meaning that each compare could be comparing two or even three bytes per character rather than just one. Now, the NSString class cluster is smart and probably won't use a multi-byte encoding unless it needs to, but since that's not our code and the algorithms are not documented, we can't know for sure what it's going to do, and it's possible that something in our data could trigger the use of a multi-byte encoding.

So, basically, I'm being very, very inefficient with both memory and processing power here. That may not be important, depending on how I'm using this in my application, and this version only took me a couple of minutes to write, so if I do this rarely and the chunks of encoded data are small, perhaps this is going to be fine just the way it is. But there's the potential for real problems with this code in an application. 

One way to find out if it's going to be a real problem is to set up unit tests (you do use unit tests, right?) that use representative situations similar to what the code will face in the actual application.
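
Here's a rough sketch of what such a timing test might look like under OCUnit; the test class, fixture paths, and tag value are all hypothetical, and it assumes the containsCString: category this post develops has been imported:

#import <SenTestingKit/SenTestingKit.h>

@interface ContainsCStringTests : SenTestCase
@end

@implementation ContainsCStringTests

- (void)testSearchTimeOnRepresentativeData
{
    // Hypothetical fixtures: ~3 MB chunks, one containing the tag, one not
    NSData *withTag    = [NSData dataWithContentsOfFile:@"/tmp/encoded-with-tag.dat"];
    NSData *withoutTag = [NSData dataWithContentsOfFile:@"/tmp/encoded-without-tag.dat"];

    NSDate *start = [NSDate date];
    STAssertTrue([withTag containsCString:"SOMETAG"], @"tag should be found");
    STAssertFalse([withoutTag containsCString:"SOMETAG"], @"tag should be absent");
    NSTimeInterval elapsed = -[start timeIntervalSinceNow];
    NSLog(@"search took %.2f seconds", elapsed);
}

@end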

For grins and giggles, let's run MY unit test on this method. It takes two three-megabyte chunks of encoded data, one that contains the tags being searched for and one that doesn't, and does two searches on each chunk. The result? 6.8 seconds for the searches, running on a 2.6 GHz MacBook Pro. That's definitely a potential Spinning Beachball of Death situation. I guess we'd better look at optimizing, huh?

Note that the amount of time it takes to run this code will depend on a number of factors, and it will not be the same every time, but it only varied by a few milliseconds over a dozen runs, so I'm pretty confident it's the algorithm that's slow and not something else running at the same time2.

First Optimization

The first and most obvious optimization is getting the comparison-string allocation out of the loop. That value doesn't change during the loop, so let's move it up above the while statement so the allocation happens only once per call to this method. In theory, that should significantly reduce both the memory overhead and the processing time. Here's the new version:

-(BOOL)containsCString:(char *)cmp
{
    NSUInteger offset = 0;
    char *byte;
    const char *bytes = [self bytes];

    // Allocate the comparison string once, outside the loop
    NSString *compareString = [NSString stringWithCString:cmp length:strlen(cmp)];
    while (offset + strlen(cmp) <= [self length])
    {
        byte = (char *)bytes + offset;
        NSString *checkString = [NSString stringWithCString:byte length:strlen(cmp)];
        if ([checkString isEqualToString:compareString])
            return YES;

        offset++;
    }
    return NO;
}

Let's run the unit test and see what happens. The result? This time it runs a little faster, clocking in at 3.7 seconds. A significant improvement, but not fast enough to avoid that spinning beach ball. Perhaps I'm going to have to spawn threads; I'm going to be doing this searching a lot in my application.

So, is there a way I can improve it more? Is there any way for me to avoid autoreleasing so many NSStrings, or even to avoid creating them at all? Well, there are a couple of options. I could manually allocate and release the string objects. In theory, that would be considerably more efficient than filling up the autorelease pool.

The Second Optimization

Here's what we get if we manually allocate and release our NSString instances rather than letting them pile up in the autorelease pool. This should help with our memory overhead, but it's unclear how much of a speed increase it will give us:


-(BOOL)containsCString:(char *)cmp
{
    NSUInteger offset = 0;
    char *byte;
    const char *bytes = [self bytes];

    NSString *compareString = [NSString stringWithCString:cmp length:strlen(cmp)];
    while (offset + strlen(cmp) <= [self length])
    {
        byte = (char *)bytes + offset;
        NSString *checkString = [[NSString alloc] initWithCString:byte length:strlen(cmp)];
        BOOL found = [checkString isEqualToString:compareString];
        [checkString release];  // release before returning so a match doesn't leak
        if (found)
            return YES;

        offset++;
    }
    return NO;
}
We run it and? 3.44 seconds. Hmm. An improvement, but not a huge one by any means. It does appear that creating strings is one of our bottlenecks. In a more complex method, we would probably have to resort to Shark or Instruments to identify the bottlenecks, but in this case it's pretty obvious where the problem is.

So, let's look at doing this in straight C. After all, we're just comparing chunks of memory. C can do that, and it can do it fast without the overhead of objects or string encodings or anything else.

The Final Optimization

Instead, let's forget about using NSString or NSMutableString, and just use C. We'll work with pointers, and use some old-school techniques for comparing buffers.



-(BOOL)containsCString:(const char *)cmp
{
    NSUInteger offset = 0;
    const unsigned char *bytes = [self bytes];

    // Stop strlen(cmp) bytes early so memcmp never reads past the buffer
    while (offset + strlen(cmp) <= [self length])
    {
        if (memcmp(bytes + offset, cmp, strlen(cmp)) == 0)
            return YES;

        offset++;
    }
    return NO;
}

This version doesn't create any Objective-C objects at all; it simply uses a C function that compares two chunks of memory. Now there is no overhead from objects, and none from string encodings or conversions.

So, just how much faster is this? Let's run our unit test and find out. How about 0.4 seconds? Not terrible at all for doing a byte-by-byte search through 12 megs of data, especially when half of the data it had to search didn't contain the thing it was looking for.

Concluding Thoughts


We could take this further. There are optimizations that would eke out several more milliseconds, and if our code needed to process huge chunks of data, those would probably be worth investigating (and might make a good future posting), but I've decided this is good enough performance for my needs. The point of this article was not really to teach you how to optimize your code, and it certainly doesn't show the fastest possible implementation of this algorithm. Rather, I wanted to drive home the point that Objective-C is C, and that as nice as all the object-oriented goodness is, there are times when it makes sense to roll up your sleeves and do things old-school, an option most high-level languages don't offer but Objective-C does. I also wanted you to see some of the rather common mistakes that cause memory bloat and performance bottlenecks in high-level languages, so that you'll be less likely to make them in the future.

The timings in this article were all done using the Debug build configuration. The final optimization dropped to 0.1 seconds when built with the Release configuration, and I was able to drop it even lower by implementing a reverse iterator for the end tag.
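
For the curious, the reverse-iterator idea amounts to scanning backward from the end of the buffer, which pays off when the tag you want (like an end tag) usually sits near the end of the data. Here's a hedged sketch, not my actual implementation:

// Sketch of a reverse search: start near the end and walk backward.
// Worthwhile only when the tag usually sits near the end of the buffer.
-(BOOL)containsCStringSearchingBackward:(const char *)cmp
{
    const char *bytes = [self bytes];
    size_t cmpLen = strlen(cmp);
    NSUInteger length = [self length];
    if (cmpLen == 0 || cmpLen > length)
        return NO;

    for (NSUInteger offset = length - cmpLen; ; offset--)
    {
        if (memcmp(bytes + offset, cmp, cmpLen) == 0)
            return YES;
        if (offset == 0)   // unsigned offset: break before it would wrap around
            break;
    }
    return NO;
}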


1 - Donald Knuth, "Structured Programming with go to Statements", ACM Computing Surveys, Vol. 6, No. 4, December 1974, p. 268.
2 - It does, indeed, help to know that I intentionally wrote this inefficiently.