Oct 072011


Whilst everyone is buzzing and creating fancy new Symfony2/Doctrine 2 applications, and perhaps even a shift to new frameworks/no frameworks, a great deal of us are still maintaining legacy apps and will be for some time to come. As these apps grow, we occasionally need to look back and scream at our old code and wonder why we didn’t make it more scalable or use neat optimisation tricks back when it was first conceived. The fact is, many of these “tricks” are not necessary at the time for a virgin app, and we need to develop code that is relevant to the task at hand.

That said, being aware of some of the case studies I will present to you now may help you to optimise old code, but may also allow you to think twice when you are working with new code – as the things I will describe do not take too much time to implement first time round.

Case study: Working with large resultsets and array_merge

In my sample application, the database contained many different types of organisation stored in different tables, and due to the structure of the data and the criteria for each organisation, there was no easy way to retrieve several organisations at once using pure SQL and joins (well, no convenient way). The solution is relatively simple, one query per organisation then combine the results, in this case as we go along:

$membershipClasses = array("Entity1", "Entity2", "Entity3", "Entity4", "Entity5");
    $results = array();
    foreach ($membershipClasses as $joinedTable)
      $query = $this->createQuery()->from("TablePrefix".$joinedTable." t");
      // Lots of differing criteria based on which class we were dealing with

      $results = array_merge($query->execute(array(), $hydration), $results);

This logic was used to generate a report with approximately 50,000 rows, used 1.2GB of internal memory and took around 20 minutes to complete – often failing or bringing the system to a halt.

One optimisation used in this case (many more are surely possible but lets focus on one) involves the last line – using PHP’s array_merge function. Simply switching this out to use a simple array traversal brought the execution time down to under 2 minutes, however there was no change in memory usage.

$theseResults = $query->execute(array(), $hydration);
foreach ($theseResults as $aResult)
  $results[] = $aResult;

Why is this so much faster? And why can’t we use += instead?


The “problem” with array_merge, is that even though it discards numeric keys, it still has to *check* them all first, which is a lot of overhead on 50,000 rows. We can’t use += in this case either, because that would respect the numeric keys which would start from 0 each time, and therefore the results would not “stack” as intended, unless we tell Doctrine to index the results with some unique key.

Why it takes *so* much longer is a bit of a mystery to me. I’ve looked at the c code behind it (ext/standard/array.c) and it’s, well quite frankly voodoo.

Case study: Limiting results

Something that was “fixed” (read: removed) in Doctrine 2 was the ability to limit results returned from queries, at least the “magic” part of it. The problem with limiting hydrated results is that you need to know exactly how many rows each sub-tree will contain before you can limit the query as a whole. Imagine a person with many addresses, and you want an array limited to 5 people – you can’t list add “LIMIT 5″ to the end of the query, because when this is hydrated you will most likely end up with one or 2 people, the second of whom may not have all their addresses, because you’ve told your database manager to return 5 *rows*, it has no idea how these rows relate to your model.

In Doctrine 1, adding a limit clause would cause all sorts of magic to happen, the resulting query ending up as a collection of complex subqueries each with their own limit clause, the more joins you introduced, and the more levels of join, the more complicated it becomes. First level joining is not so bad, but joining several levels deep soon starts to get heavy, and your execution speed will suffer for it. Couple this with a large resultset and you will see the smoke drifting from the server room in no time.

So, what’s the solution? Well, this one is not so clear cut – you have to experiment. Sometimes, doing it the “magic” way will work just fine, and is totally acceptable, but when the query becomes “heavy” you have a few options:

Work out your subset first

This is the default Doctrine 2 approach, it’s quite simple – you use one query to get all the ids of the subset of top level elements, then pass these IDs to another query in a “WHERE IN” clause. When you look at the resulting SQL, it can be quite scary, especially if you are dealing with many results – you can easily be saying “WHERE IN (1,2,3,4…..9999999999)”. Most dbms will handle this surprisingly well, as long as you are using the primary keys, so experiment with it and it might be the way to go.

if ($limit)
  $subQuery = $this->createQuery("foo")
                   -> //
  $ids      = $subQuery->execute(array(), DOCTRINE_CORE::HYDRATE_SINGLE_SCALAR);
  $query->andWhereIn("cf.id", $ids);

Note the use of “single scalar hydration” – we have no need for more than this (to be really picky could go straight to PDO here but for consistency this is ok). In this case, the single scalar hydration will give us a resulting array of raw IDs, which is exactly what we need to pass to the main query.

Make your own subquery

A variation on the above is to embed the “ID sucking” query into your main query, some database engines may prefer this style, but I have not noticed any significant performance gain or loss doing this so prefer the above option as it’s cleaner in the PHP code.

$query->whereIn('SELECT id FROM my_other_table WHERE blah LIMIT 50');

In a perfect world, this is the kind of thing Doctrine could have done instead of the subquery magic, but there are so many factors to consider here that I can understand why they did not go down that route.

Stick with the magic, but simplify

It is also possible to continue using the magic, but greatly improve performance by simplifying. Remove all those second+ level joins and use separate queries to get hold of the data you need. How often have you added a chain of joins just to get one snippet of data from the last node? Scrap all the joins and make a new query later where you can pass in all the IDs from your main query and get the data you need. Now you don’t need a limit clause, because you are asking for exactly the data you need in the first place.

Start with the fastest hydration mode and work upwards

This one I can’t stress enough – an ORM such as Doctrine is there to give you the tools you need when you need them, but it is often the case that applications are built with all the overhead and magic just because it can be, or because it’s the default behaviour. Stop. Look at what data you *actually* need, and especially ask yourself if you need objects in your results. Perhaps you are retrieving all your users and listing them with their profile data, but you are hydrating them as objects because you need some custom functions you’ve written, classic examples are getAge() or getFullName() which are derived from other fields. If this is the only reason you are object hydrating, consider something like this:

class User
  function getAge()
    return self::getAgeFromDOB($this["dob"]);

  function getFullName()
    return self::getFullNameFromUserArray($this);

  public static function getAgeFromDOB($dob)
    $dob = strtotime($dob);

    $year_diff  = date("Y") - date("Y", $dob);
    $month_diff = date("m") - date("m", $dob);
    $day_diff   = date("d") - date("d", $dob);

    if ($month_diff < 0 || ($month_diff == 0 && $day_diff < 0))
    return $year_diff;

  public static function getFullNameFromUserArray($user)
    return $user["first_name"] . " " . $user["first_name"];

In this example, we’ve kept the sideways compatibility of the getFullName function with array/object hydration, so it will work whether an object is passed or an array (sanity checking needed of course). This method could be expanded to support more “raw” hydration methods also. Now in our templates, we can just use:

<li><?php echo $user["username"]; ?></li>
<li><?php echo $user["Addresses"][0]["city"]; ?></li>
<li><?php echo User::getFullNameFromUserArray($user); ?></li>
<li><?php echo User::getAgeFromDOB($user["dob"]); ?></li>

This can massively speed up your application if you have long lists or are generally dealing with lots of data, especially data than spans several levels.

Lets also take a second to consider the super fast hydration methods, if we are really struggling for resources (perhaps working with large downloadable reports) and we can live with slightly less friendly arrays of data, scalar hydration can save the day. Shifting to scalar hydration means we lose the ability to later instantly switch to object hydration due to the alternate syntax and lack of nesting, but if we are considering scalar hydration in the first place we are probably in a situation where object hydration will never be practical.

<li><?php echo $results["u_username"]; ?></li>
<li><?php echo $results["a_city"]; ?></li>
<li><?php echo User::getFullNameFromUserArray(array("first_name" => $results["u_first_name"], "last_name" => $["u_last_name"])); ?></li>
<li><?php echo User::getAgeFromDOB($results["u_dob"]); ?></li>

I could go on and on but I hope these points have given you some food for thought!