Thursday, April 10, 2008

The Root of All Evil: Introduction to Optimizing Objective C

Donald Knuth has a very famous quote: "[P]remature optimization is the root of all evil"1. This is a valuable idea to keep in your head while programming. It's actually okay to write inefficient code at times. In the context of Objective-C, this means it's okay to use provided methods and objects in your first pass at writing algorithms, even if you know that those objects or methods might not be the optimal way to do the task in terms of performance. Get the code working, and get it working right, and then go back and optimize intelligently where it makes sense. In many places small inefficiencies will have no noticeable effect on your application, and by leveraging thoroughly tested code from Cocoa written by knowledgeable Apple and NeXT engineers, you will likely save yourself considerable time and maintenance headaches.

But there are times when inefficient code has an impact and needs to be addressed. In Java, Visual Basic, Ruby, and other high-level languages, there is a lot you can do to optimize your code, but there is a limit to what you can do, because those languages insulate you from the inner workings using mechanisms such as garbage collection, and a lot of the code that runs when you run your code is completely outside of your control. In all of those languages, you can help the runtime know when it's okay to free up memory that you're done with by following good coding practices, but memory management is mostly out of your hands. Other optimizations like loop unrolling can be used in these languages, but the performance yield tends to be smaller than for the same optimizations in a low-level language compiled directly to machine code.
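To make the loop-unrolling idea concrete, here is a minimal sketch in plain C. The function names and the unroll factor of four are my own choices for illustration, not from any particular codebase; the unrolled version does four additions per loop test instead of one.

```c
#include <stddef.h>

/* Straightforward version: one addition and one loop test per element. */
long sum_simple(const int *data, size_t count)
{
    long total = 0;
    for (size_t i = 0; i < count; i++)
        total += data[i];
    return total;
}

/* Unrolled version: four additions per loop test, with a cleanup loop
   for the 0-3 leftover elements. */
long sum_unrolled(const int *data, size_t count)
{
    long total = 0;
    size_t i = 0;
    for (; i + 4 <= count; i += 4)
        total += data[i] + data[i + 1] + data[i + 2] + data[i + 3];
    for (; i < count; i++)
        total += data[i];
    return total;
}
```

Bear in mind that modern compilers will often unroll loops for you at higher optimization levels, which is part of why hand-unrolling in a high-level language tends to pay off less.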

Here's where Objective-C really has a leg up. It is a high-level language and has a lot in common with Java, VB, and C#. It has garbage collection (well, not on the iPhone, yet) and exception handling, and it shields you from many low-level implementation details, as a good high-level language should. But Objective-C is a superset of C, and even though it does have a small runtime engine to enable object messaging and other features, most of your code gets compiled down to machine language. You are shielded from the low-level implementation details, but not prevented from getting to them. When you have exhausted your options for optimizing using Objective-C and Cocoa, you always have the option to lift the hood and use good old-fashioned procedural C to optimize even more. A great Objective-C programmer needs to be at least a good C programmer.

Let's take a look at an example of optimizing a small chunk of Objective-C code.

Let's say that we have an NSData object that contains encoded binary data stored as 8-bit ASCII characters. In this encoded data, we need to look for certain sequences of characters to make sure that it contains the type of data it's supposed to. These tags are simple sequences of ASCII characters (a.k.a. C strings).

Here is my first, quickly written version leveraging built-in Cocoa objects. This was written as part of a category on NSData, so any reference to self is a reference to the NSData object holding the encoded binary data.

I apologize for the code formatting. Blogger is not cooperating and refuses to believe me when I tell it that my code is preformatted using the pre tag.

The Original, Horribly Non-Optimized Version




-(BOOL)containsCString:(char *)cmp
{
    NSUInteger offset = 0;
    char *byte;
    const char *bytes = [self bytes];

    while (offset + strlen(cmp) <= [self length])
    {
        byte = (char *)bytes + offset;
        NSString *checkString = [NSString stringWithCString:byte length:strlen(cmp)];
        if ([checkString isEqualToString:[NSString stringWithCString:cmp length:strlen(cmp)]])
            return YES;

        offset++;
    }
    return NO;
}

Shall I leave it as an exercise for the student to identify just why this code is poor from an optimization standpoint? I'd better not, though the problems are likely obvious to many of you.

First of all, I'm allocating autoreleased objects inside of a loop, and that loop goes byte by byte through a potentially large chunk of data. None of the memory for those objects will be released until the end of the current pass through the run loop, which won't happen until after my loop finishes. I create two autoreleased NSString instances for every byte in the data and then run a string compare between them. NSString is a complex class cluster designed to deal with a multitude of string encodings, and could potentially be using multi-byte characters to represent these strings, meaning that each compare could be comparing two or even three bytes rather than just one. Now, the NSString class cluster is smart and probably won't use multi-byte encoding unless it needs to, but since that's not our code and the algorithms are not documented, we can't know for sure what it's going to do, and it's possible that something in our data could trigger the use of multi-byte encoding.

So, basically, I'm being very, very inefficient with both memory and processing power here. That may not be important, depending on how I'm using this in my application, and this version only took me a couple of minutes to write, so if I do this rarely and the chunks of encoded data are small, perhaps this is going to be fine just the way it is. But there's the potential for real problems with this code in an application. 

One way to find out if it's going to be a real problem is to set up unit tests (you do use unit tests, right?) that use representative situations similar to what the code will face in the actual application.

For grins and giggles, let's run my unit test on this method, which takes two three-megabyte chunks of encoded data - one that contains the tags being searched for and one that doesn't - and does two searches on each chunk. The result? 6.8 seconds to do the two searches on each of the two three-megabyte chunks, running on a 2.6 GHz MacBook Pro. That's definitely a potential Spinning Beachball of Death situation. I guess we'd better look at optimizing, huh?

Note that the amount of time it takes to run this code will depend on a number of factors, and it will not be the same every time, but it only varied by a few milliseconds over a dozen runs, so I'm pretty confident it's the algorithm that's slow and not something else running at the same time2.

First Optimization

The first and most obvious optimization here is getting the comparison string allocation out of the loop. That value does not change at all during the loop, so let's move it up above the while statement so that it only happens once per call to this method. In theory, that should reduce the memory overhead considerably and probably reduce the processing time as well. Here's the new version:

-(BOOL)containsCString:(char *)cmp
{
    NSUInteger offset = 0;
    char *byte;
    const char *bytes = [self bytes];
    NSString *compareString = [NSString stringWithCString:cmp length:strlen(cmp)];

    while (offset + strlen(cmp) <= [self length])
    {
        byte = (char *)bytes + offset;
        NSString *checkString = [NSString stringWithCString:byte length:strlen(cmp)];
        if ([checkString isEqualToString:compareString])
            return YES;

        offset++;
    }
    return NO;
}

Let's run the unit test and see what happens. And the result? This time it runs a little faster, clocking in at 3.7 seconds. A significant improvement, but not fast enough to avoid that spinning beach ball, and since I'm going to be doing this encoding a lot in my application, perhaps I'm going to have to spawn threads.

So, is there a way I can improve it more? Is there any way for me to avoid autoreleasing so many NSStrings, or even creating them at all? Well, there are a couple of options. I could manually allocate and release the string objects. That would be considerably more efficient than filling up the autorelease pool, in theory. 

The Second Optimization

Here's what we get if we manually allocate and release our NSString instances rather than letting them go into the autorelease pool. This should help with our memory overhead, but it's unclear how much of a speed increase it will give us:


-(BOOL)containsCString:(char *)cmp
{
    NSUInteger offset = 0;
    char *byte;
    const char *bytes = [self bytes];
    NSString *compareString = [NSString stringWithCString:cmp length:strlen(cmp)];

    while (offset + strlen(cmp) <= [self length])
    {
        byte = (char *)bytes + offset;
        NSString *checkString = [[NSString alloc] initWithCString:byte length:strlen(cmp)];
        if ([checkString isEqualToString:compareString])
        {
            [checkString release];
            return YES;
        }
        [checkString release];
        offset++;
    }
    return NO;
}

We run it, and? 3.44 seconds. Hmm. An improvement, but not a huge one by any means. It does appear that creating strings is one of our bottlenecks. In a more complex method, we would probably have to resort to using Shark or Instruments to identify where our bottlenecks were, but in this case, it's pretty obvious where the problem is.

So, let's look at doing this in straight C. After all, we're just comparing chunks of memory. C can do that, and it can do it fast without the overhead of objects or string encodings or anything else.

The Final Optimization

Instead, let's forget about using NSString or NSMutableString, and just use C. We'll work with pointers, and use some old-school techniques for comparing buffers.



-(BOOL)containsCString:(const char *)cmp
{
    NSUInteger offset = 0;
    const unsigned char *bytes = [self bytes];

    while (offset + strlen(cmp) <= [self length])
    {
        if (memcmp(bytes + offset, cmp, strlen(cmp)) == 0)
            return YES;

        offset++;
    }
    return NO;
}

This version doesn't create any new Objective-C objects; it simply uses a C function that compares two chunks of memory. Now there is no overhead from objects, and none from string encodings or conversions.

So, just how much faster is this? Let's run our unit test and find out. How about 0.4 seconds? Not bad at all for doing a byte-by-byte search through 12 megs of data, especially when half of the data it had to search didn't contain the thing it was looking for.

Concluding Thoughts


We could take this further. There are optimizations we could make that would eke out several more milliseconds, and if our code needed to process huge chunks of data, that would probably be worth investigating (and might be a good idea for a future posting), but for my purposes, I've decided that this performance is good enough for my needs. The point of this article was not really to teach you how to optimize your code, and it certainly does not show the fastest possible implementation of this algorithm. Rather, I wanted to drive home the point that Objective-C is C, and that as nice as all the object-oriented goodness is, there are times when it makes sense to roll up your sleeves and do things old school, an option that most high-level languages don't offer but Objective-C does. I also wanted you to see some of the common kinds of mistakes in high-level languages that cause memory bloat and performance bottlenecks, so that you'll be less likely to make them in the future.
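For the curious, here is one example of the kind of further optimization I have in mind, sketched as a plain C function rather than an NSData category. The function name and structure are mine, not the version I actually shipped: compute the tag's length once instead of on every iteration, and let memchr skip ahead to bytes that could possibly start a match before paying for a full memcmp.

```c
#include <stdbool.h>
#include <string.h>

/* Sketch of a faster search: strlen is hoisted out of the loop, and
   memchr scans for the tag's first byte so memcmp only runs at
   plausible match positions. */
bool contains_cstring(const unsigned char *bytes, size_t length, const char *cmp)
{
    size_t cmpLen = strlen(cmp);
    if (cmpLen == 0 || cmpLen > length)
        return false;

    const unsigned char *p = bytes;
    /* One past the last position where a full match could still fit. */
    const unsigned char *end = bytes + length - cmpLen + 1;

    while ((p = memchr(p, (unsigned char)cmp[0], end - p)) != NULL)
    {
        if (memcmp(p, cmp, cmpLen) == 0)
            return true;
        p++;
    }
    return false;
}
```

Because memchr is typically a tuned library routine, it can scan for that first byte much faster than a hand-written byte loop can.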

The timings in this article were all done using the Debug target. The final optimized version dropped to 0.1 seconds when built with the Release target, and I was able to drop it even lower by implementing a reverse iterator for the end tag.


1 - Structured Programming with go to Statements, ACM Computing Surveys, Vol. 6, No. 4, December 1974, p. 268
2 - It does, indeed, help to know that I intentionally wrote this inefficiently.
