Oct 072011


Whilst everyone is buzzing and creating fancy new Symfony2/Doctrine 2 applications, and perhaps even a shift to new frameworks/no frameworks, a great deal of us are still maintaining legacy apps and will be for some time to come. As these apps grow, we occasionally need to look back and scream at our old code and wonder why we didn’t make it more scalable or use neat optimisation tricks back when it was first conceived. The fact is, many of these “tricks” are not necessary at the time for a virgin app, and we need to develop code that is relevant to the task at hand.

That said, being aware of some of the case studies I will present to you now may help you to optimise old code, but may also allow you to think twice when you are working with new code – as the things I will describe do not take too much time to implement first time round.

Case study: Working with large resultsets and array_merge

In my sample application, the database contained many different types of organisation stored in different tables, and due to the structure of the data and the criteria for each organisation, there was no easy way to retrieve several organisations at once using pure SQL and joins (well, no convenient way). The solution is relatively simple, one query per organisation then combine the results, in this case as we go along:

$membershipClasses = array("Entity1", "Entity2", "Entity3", "Entity4", "Entity5");
    $results = array();
    foreach ($membershipClasses as $joinedTable)
      $query = $this->createQuery()->from("TablePrefix".$joinedTable." t");
      // Lots of differing criteria based on which class we were dealing with

      $results = array_merge($query->execute(array(), $hydration), $results);

This logic was used to generate a report with approximately 50,000 rows, used 1.2GB of internal memory and took around 20 minutes to complete – often failing or bringing the system to a halt.

One optimisation used in this case (many more are surely possible but lets focus on one) involves the last line – using PHP’s array_merge function. Simply switching this out to use a simple array traversal brought the execution time down to under 2 minutes, however there was no change in memory usage.

$theseResults = $query->execute(array(), $hydration);
foreach ($theseResults as $aResult)
  $results[] = $aResult;

Why is this so much faster? And why can’t we use += instead?


The “problem” with array_merge, is that even though it discards numeric keys, it still has to *check* them all first, which is a lot of overhead on 50,000 rows. We can’t use += in this case either, because that would respect the numeric keys which would start from 0 each time, and therefore the results would not “stack” as intended, unless we tell Doctrine to index the results with some unique key.

Why it takes *so* much longer is a bit of a mystery to me. I’ve looked at the c code behind it (ext/standard/array.c) and it’s, well quite frankly voodoo.

Case study: Limiting results

Something that was “fixed” (read: removed) in Doctrine 2 was the ability to limit results returned from queries, at least the “magic” part of it. The problem with limiting hydrated results is that you need to know exactly how many rows each sub-tree will contain before you can limit the query as a whole. Imagine a person with many addresses, and you want an array limited to 5 people – you can’t list add “LIMIT 5″ to the end of the query, because when this is hydrated you will most likely end up with one or 2 people, the second of whom may not have all their addresses, because you’ve told your database manager to return 5 *rows*, it has no idea how these rows relate to your model.

In Doctrine 1, adding a limit clause would cause all sorts of magic to happen, the resulting query ending up as a collection of complex subqueries each with their own limit clause, the more joins you introduced, and the more levels of join, the more complicated it becomes. First level joining is not so bad, but joining several levels deep soon starts to get heavy, and your execution speed will suffer for it. Couple this with a large resultset and you will see the smoke drifting from the server room in no time.

So, what’s the solution? Well, this one is not so clear cut – you have to experiment. Sometimes, doing it the “magic” way will work just fine, and is totally acceptable, but when the query becomes “heavy” you have a few options:

Work out your subset first

This is the default Doctrine 2 approach, it’s quite simple – you use one query to get all the ids of the subset of top level elements, then pass these IDs to another query in a “WHERE IN” clause. When you look at the resulting SQL, it can be quite scary, especially if you are dealing with many results – you can easily be saying “WHERE IN (1,2,3,4…..9999999999)”. Most dbms will handle this surprisingly well, as long as you are using the primary keys, so experiment with it and it might be the way to go.

if ($limit)
  $subQuery = $this->createQuery("foo")
                   -> //
  $ids      = $subQuery->execute(array(), DOCTRINE_CORE::HYDRATE_SINGLE_SCALAR);
  $query->andWhereIn("cf.id", $ids);

Note the use of “single scalar hydration” – we have no need for more than this (to be really picky could go straight to PDO here but for consistency this is ok). In this case, the single scalar hydration will give us a resulting array of raw IDs, which is exactly what we need to pass to the main query.

Make your own subquery

A variation on the above is to embed the “ID sucking” query into your main query, some database engines may prefer this style, but I have not noticed any significant performance gain or loss doing this so prefer the above option as it’s cleaner in the PHP code.

$query->whereIn('SELECT id FROM my_other_table WHERE blah LIMIT 50');

In a perfect world, this is the kind of thing Doctrine could have done instead of the subquery magic, but there are so many factors to consider here that I can understand why they did not go down that route.

Stick with the magic, but simplify

It is also possible to continue using the magic, but greatly improve performance by simplifying. Remove all those second+ level joins and use separate queries to get hold of the data you need. How often have you added a chain of joins just to get one snippet of data from the last node? Scrap all the joins and make a new query later where you can pass in all the IDs from your main query and get the data you need. Now you don’t need a limit clause, because you are asking for exactly the data you need in the first place.

Start with the fastest hydration mode and work upwards

This one I can’t stress enough – an ORM such as Doctrine is there to give you the tools you need when you need them, but it is often the case that applications are built with all the overhead and magic just because it can be, or because it’s the default behaviour. Stop. Look at what data you *actually* need, and especially ask yourself if you need objects in your results. Perhaps you are retrieving all your users and listing them with their profile data, but you are hydrating them as objects because you need some custom functions you’ve written, classic examples are getAge() or getFullName() which are derived from other fields. If this is the only reason you are object hydrating, consider something like this:

class User
  function getAge()
    return self::getAgeFromDOB($this["dob"]);

  function getFullName()
    return self::getFullNameFromUserArray($this);

  public static function getAgeFromDOB($dob)
    $dob = strtotime($dob);

    $year_diff  = date("Y") - date("Y", $dob);
    $month_diff = date("m") - date("m", $dob);
    $day_diff   = date("d") - date("d", $dob);

    if ($month_diff < 0 || ($month_diff == 0 && $day_diff < 0))
    return $year_diff;

  public static function getFullNameFromUserArray($user)
    return $user["first_name"] . " " . $user["first_name"];

In this example, we’ve kept the sideways compatibility of the getFullName function with array/object hydration, so it will work whether an object is passed or an array (sanity checking needed of course). This method could be expanded to support more “raw” hydration methods also. Now in our templates, we can just use:

<li><?php echo $user["username"]; ?></li>
<li><?php echo $user["Addresses"][0]["city"]; ?></li>
<li><?php echo User::getFullNameFromUserArray($user); ?></li>
<li><?php echo User::getAgeFromDOB($user["dob"]); ?></li>

This can massively speed up your application if you have long lists or are generally dealing with lots of data, especially data than spans several levels.

Lets also take a second to consider the super fast hydration methods, if we are really struggling for resources (perhaps working with large downloadable reports) and we can live with slightly less friendly arrays of data, scalar hydration can save the day. Shifting to scalar hydration means we lose the ability to later instantly switch to object hydration due to the alternate syntax and lack of nesting, but if we are considering scalar hydration in the first place we are probably in a situation where object hydration will never be practical.

<li><?php echo $results["u_username"]; ?></li>
<li><?php echo $results["a_city"]; ?></li>
<li><?php echo User::getFullNameFromUserArray(array("first_name" => $results["u_first_name"], "last_name" => $["u_last_name"])); ?></li>
<li><?php echo User::getAgeFromDOB($results["u_dob"]); ?></li>

I could go on and on but I hope these points have given you some food for thought!

May 072009

Last night I nailed an annoying bug in my bug tracker that has been tricky to locate, and was related to MySQL, transactions and apparently only certain MySQL versions which behaved correctly.

To ensure consistency in the bug tracker, I use a transaction to commit several changes. In this case, after an issue was created, it wouldn’t save stuff being inserted after the transaction->commit(). However – the id fields which were auto_inc – would all increment by one for each row that was never inserted. This would only happen on a select few MySQL versions, which lead me to believe it was actually a bug in those specific MySQL versions, and not in my code (it’s never me, of course :P ). This was also based on the assumption that doing a commit() would stop “transaction mode” and turn “autocommit” back on. One glance at the PHP manual proved me wrong.

It all did come down to the fact that in the class file that was creating an issue, the “autocommit” parameter for mysqli was never set back to “true” after committing a transaction. According to the manual, this should mean that all subsequent queries for that connection would “fail” unless a new “commit()” has been made. However, this obviously didn’t happen, as I have most of my bug tracker installations running properly, without this problem happening at all.

Under no circumstance was a commit() called after that point – which should mean that any queries being executed after that point were being rolled back. However, that only happed on a specific few MySQL versions, and none of the ones I tested on (confirmed at least 5.0.61 and 5.1.3). To make matters worse, the auto_inc id field did increment by one, even though the row was never saved!

The solution was simple – either do a second commit() at the end of the function to make sure all the queries after the initial transaction were commited, or make sure that the transaction turns on autocommit when it was done. In this case, I created a second function in the transaction class to commitAndEnd(), which turned autocommit back on after committing.

Lesson learned: Read manuals, don’t make assumptions. And don’t blindly trust MySQL.

Apr 212009

In the last few days (evenings) I have been revisiting several games that I found amusing when I was younger. Some of these games I spent endless hours playing – both single- and multiplayer. I also convinced a friend of mine to try playing some of these games with me – apparantly he also enjoyed playing some of them in the past.

As it turns out – these games aren’t as good as they once were. Or – our perceptions, expectations and gaming habits have changed. We want more. The graphical standard – which were breathtaking at the time – now makes the game look old, outdated, boring and uninteresting. The intense gameplay we experienced back then has now turned into boring tasks, repetitive assignments and uninvolving gameplay. The AI is still stunning – although now it is not because it is so good, rather that it is mindnumbingly bad.

Truth be told – I didn’t have as much fun revisiting some of these games as I thought I would.

What on earth does that have to do with PHP? Well, by coincidence, this weekend I also found a backup CD I had lying around – and it had some of my older PHP projects on it. I think you know where I’m going with this now ;)

Yes, you are right – I was amazed by how badly coded these projects were. They are riddled with amateurish mistakes, shortcuts, half-assed solutions and just in general messy code. Functionality that I still remember spending endless hours implementing turns out to be nothing more than semi-intelligent calculations written in poorly designed PHP code.

Notices are being thrown all over the place, short open tags everywhere (and completely at random), $_GLOBALS variables used inconsistently, naming confusion, duplicate functionality in different functions, no classes (well, actually there was one, with one function in it but it wasn’t even used), and no concept of how an application is structured.

Suffice to say I wasn’t as impressed looking at it now, as I was when I was writing it. However – this is a good thing! Spotting obvious mistakes in old code shows that you’ve learnt. I have learnt a tremendous amount of new stuff over the past two, three years, it’s hard to even think about it all!

So – if for nothing else than to give you a good laugh – dig up some old code and have a look at it. Try to remember the good times you had writing it – the challenges you faced and how good you felt when you finally found the ultimate solution. Which turns out was probably not so much the universal ultimate solution, but rather “the best you could come up with, at the time”.