Free chocolate = service outage

M & M Candy

M&M Mars offered a coupon for free chocolate, but didn’t bother to think through the effect of the promotion on its web site.  Any marketer should know that you get negative brand awareness if you offer something positive, like free chocolate, and then make the experience painful.  That is what M&M Mars did this morning.  The site asks you to register to win free chocolate each week for the summer, and if you tried before noon or so PDT today, you probably only got frustrated.  While this promotion may not have as much immediate draw as Oprah, it certainly garners a lot of attention from deal sites such as Consumerist and DealNews.

This sort of negative publicity is easily avoidable.  Proper testing methodology includes flash crowd testing from the open internet that performs an end-to-end transaction.  Their IIS servers can be made to scale, if configured and built out correctly, but that has to be proven before your customers tell you that it didn’t work.

Just as important is timing.  Good testing methodology means good communication between the marketing-communications teams and the web operations teams.  Testing like this needs to be done far enough in advance that you have time to fix something or correct an issue before you go live. In other words, testing the day before isn’t good enough and is more or less a waste of time and resources.  Start at least a week in advance, find out what happens under potential load scenarios, practice remediation strategies, etc.

So M&M Mars, next time call me first.

Liza Minnelli crashes web site

Liza Minnelli, the famous daughter of the famous Judy Garland, drew more traffic than the Sydney Opera House web site could handle, and the site crashed.  The article doesn’t say how much traffic they received, only mentioning that the technicians took hours to get the site operational again.  That tells me that the crash wasn’t just the result of a traffic spike by itself, because otherwise the site would have recovered after the traffic left.  Moreover, they appear not to have had a monitoring service, so they may not have even known that the site was experiencing problems until customers started calling to complain.

It is ironic: firms set up web sites to handle customer traffic so that they can lower costs and reduce the number of operators needed to take calls, yet this crash flooded their call operators and caused negative publicity.

Proper load testing takes time and money.  The Return On Investment is usually rather easy to see when you compare it with the damage caused by the web site crashing during an important event like this.  This was probably one of the most popular events to be at the Opera House in a while, and I doubt that the Opera House management performed end-to-end load testing as they should have.  I see this so often and it doesn’t have to happen.

Metrics: concurrent users versus rates

I frequently see confusion regarding concurrent VUs versus VUs per hour, or what should be called sessions per hour or transactions per hour. When modeling web traffic on the open Internet, the rate-based metrics are better suited to finding out what will really happen. If a site slows down, do your users know that *before* they arrive at your site or after they get there? The concurrent user model assumes that a new user doesn’t arrive until the previous user leaves. If the last user in the queue isn’t leaving, that is, the user is stuck on the system trying to perform a task, then no new user arrives. This simply doesn’t happen. A user doesn’t know and frankly doesn’t care how many other users are on your site until the user gets to the site and discovers that it is about to die, or that the user would rather die than use this site.

Transaction rate and the number of virtual users concurrently on the system affect the application server differently.  Transaction rate primarily takes CPU to process the delivered pages, while the number of concurrent users primarily affects memory.  Both are important, but they are independent variables.  If the site performs well and the scripts are modelled correctly, then the transaction rate and total number of concurrent users will match your web analytics.  If the site degrades, CPU is maxed out, but memory may not be maxed out right away.  However, as the number of concurrent users climbs, memory utilization climbs with it, as do database connections and similar per-user resources.

This means that driving the load with a rate-based metric, combined with the right scripts and use cases, gives you the best view of how your application behaves under high load.
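To put rough numbers on the difference, here is a minimal sketch in Python. The page count, think time, response times and user counts are made-up values for illustration, not measurements from any real test. It compares a closed model, where a fixed pool of virtual users each start a new session only after finishing the last one, with an open model, where sessions arrive at a fixed hourly rate no matter how the site is doing.

```python
def session_seconds(pages, response_s, think_s):
    """Length of one session: every page costs a response plus think time."""
    return pages * (response_s + think_s)


def closed_model_sessions_per_hour(virtual_users, session_s):
    """Closed model: a user starts a new session only when the last one ends."""
    return virtual_users * 3600.0 / session_s


def open_model_concurrent_users(sessions_per_hour, session_s):
    """Open model: arrivals are fixed, so concurrency follows Little's law
    (users on the system = arrival rate x time spent on the system)."""
    return sessions_per_hour / 3600.0 * session_s


# Hypothetical workload, chosen only to keep the arithmetic readable.
PAGES = 5
THINK_S = 10.0
VIRTUAL_USERS = 550
TARGET_SESSIONS_PER_HOUR = 36_000

healthy = session_seconds(PAGES, 1.0, THINK_S)    # 1 s responses
degraded = session_seconds(PAGES, 12.0, THINK_S)  # 12 s responses

for label, length in (("healthy", healthy), ("degraded", degraded)):
    closed_rate = closed_model_sessions_per_hour(VIRTUAL_USERS, length)
    open_users = open_model_concurrent_users(TARGET_SESSIONS_PER_HOUR, length)
    print(f"{label:9s} closed model: {closed_rate:8.0f} sessions/hour | "
          f"open model: {open_users:6.0f} concurrent users")
```

With these assumed numbers, the closed model quietly halves the offered load when the site slows down, while the open model keeps sending sessions and the number of users stuck on the system doubles. That second case is the memory and connection pressure real traffic would create.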

Does Geographic Distribution of load really matter?

The question of geo-distributed testing is really 3 questions. First, where should you generate the load? Second, how many locations are required to generate it? Third, can you use sample agents in some locations instead of having load generators everywhere?

The reasons for geographic distribution of load are both simple and complex. You are testing outside of the firewall for 2 primary reasons: 1) that is where your users are located and 2) you want to do end-to-end testing.

If you are only testing externally to have an end-to-end test, then you could just as easily do the load test in a loop-back scenario, i.e. generate the load on a circuit that sends the traffic out on one interface/circuit and brings it back in through the primary ingress point(s). If you have enough bandwidth and load generation, this is pretty simple and you can even use NISTnet to try to emulate latency. However, that covers only half of the reason for performing external tests. A loop-back test doesn’t really tell you about latency, even if you try to emulate it.  Moreover, it assumes that your users are sitting in your data center or lab, which is pretty unlikely.

If you wish to discover the customer’s experience of the site under load, you need to have real geo-distribution. For SUTs where there is only 1 bandwidth provider and the volume of the test is relatively small, 2 locations will probably suffice. This is especially true for situations where the customer base is concentrated in a small number of locations, for example a local retail chain that is only present in a few states. If you have customers coming to you nationally or internationally, then you need more. Given the demographics of North America, I recommend either of the following options: 2-3 load generation sites distributed across the time zones and on different ISPs, with 5-6 sample locations spread out among the rest of the high-traffic areas, or my standard practice of 9 load generation locations domestically: 3 east, 3 central and 3 west. If you are international, then you’d need to think about whether your traffic is European or APAC. Spreading the load over that many locations also lets me avoid crashing individual Content Distribution Network POPs, although it still happens. For some reason, they get annoyed at me for this.
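As a rough illustration of the 9-location layout, the sketch below splits a target session rate across 3 east, 3 central and 3 west load generation sites. The target rate, the regional percentages and the site names are all hypothetical; in practice the percentages should come from your own web analytics.

```python
def split_load(total_sessions_per_hour, region_percent, sites_per_region=3):
    """Divide an hourly session target across the sites in each region.

    region_percent maps a region name to its share of traffic in whole
    percent. Region totals are floored to whole sessions, so a few sessions
    can be dropped when the target does not divide evenly.
    """
    if sum(region_percent.values()) != 100:
        raise ValueError("region percentages must add up to 100")
    plan = {}
    for region, percent in region_percent.items():
        region_total = total_sessions_per_hour * percent // 100
        base, extra = divmod(region_total, sites_per_region)
        for site in range(sites_per_region):
            # The first `extra` sites absorb the remainder, one session each.
            plan[f"{region}-{site + 1}"] = base + (1 if site < extra else 0)
    return plan


# Hypothetical traffic mix and target, for illustration only.
REGION_PERCENT = {"east": 45, "central": 25, "west": 30}
plan = split_load(90_000, REGION_PERCENT)

for location, rate in plan.items():
    print(f"{location:10s} {rate:>7,d} sessions/hour")
```

Keeping the per-site rate visible like this also makes it easier to spot when a single location, and the CDN POP nearest to it, would be carrying a disproportionate share of the test.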

So think about the reasons you’re even testing outside of the firewall. If it is only to do a simple end-to-end test, then don’t bother paying a provider or anyone else and just loop the traffic. If you want a good representation of your traffic, plan properly and distribute the load as well as you can.  Professional load test service providers do more than just deliver some hits.

The worst thing you can do to your home page — don’t slow it down on purpose!

This one will be short, because there simply isn’t that much to say. Your home page is one of the most important pages on your site in terms of the visitor’s experience. If your site requires registration, authentication or identification, nearly all users must go through this page. It is the proverbial front door to your site and application.

On a recent load test, the test had to be aborted after 9 minutes, when the load was only at 25% of the planned total of 385,000 sessions per hour. The site was built on a LAMJ architecture, and each home page hit generated a long-running SQL query. Even very patient users, who are tolerant of slowdowns and errors, will not stick around if the home page takes several minutes. However, this site didn’t even do that! Two minutes into the test, the pages simply said

“Whoops! The social network is currently down for maintenance. Please be patient, we’re working on it!”

As you may imagine, their home page is now very fast: 0.07 seconds, in fact. That is a very fast error message that every user is seeing on the home page, and it would deliver the same for every other page too if the user actually made it that far. I don’t think I need to mention the usefulness of all users seeing that error message.

What caused this slowdown and crash, you may ask? I’m glad you did. The long-running queries exhausted the JDBC connection pool and maxed out the available number of connections, which is what caused the immediate error page.
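A back-of-the-envelope sketch shows how quickly this can happen. The session figures come from the test described above; the pool size and query duration are hypothetical, since I am not quoting the real values here. The sketch also assumes one home page hit per session and a steady arrival rate, which is simpler than a real ramp.

```python
PLANNED_SESSIONS_PER_HOUR = 385_000  # planned total load for the test
LOAD_FRACTION = 0.25                 # the test was aborted at 25% of plan


def arrivals_per_second(sessions_per_hour):
    """Convert an hourly session rate to arrivals per second."""
    return sessions_per_hour / 3600.0


def connections_needed(arrivals_per_s, query_seconds):
    """Little's law: connections in use = arrival rate x time each is held."""
    return arrivals_per_s * query_seconds


def seconds_until_exhausted(pool_size, arrivals_per_s, query_seconds):
    """Approximate time until the pool is empty, or None if it never is.

    When demand exceeds the pool, the pool fills before the first query
    finishes, so no connection is released and it drains at the arrival rate.
    """
    if connections_needed(arrivals_per_s, query_seconds) <= pool_size:
        return None
    return pool_size / arrivals_per_s


# Hypothetical values: neither is taken from the actual site.
POOL_SIZE = 100
QUERY_SECONDS = 30.0

rate = arrivals_per_second(PLANNED_SESSIONS_PER_HOUR * LOAD_FRACTION)
needed = connections_needed(rate, QUERY_SECONDS)
drained = seconds_until_exhausted(POOL_SIZE, rate, QUERY_SECONDS)

print(f"arrival rate:       {rate:.1f} home page hits/second")
print(f"connections needed: {needed:.0f} (pool holds {POOL_SIZE})")
print(f"pool exhausted in:  {drained:.1f} seconds")
```

Under these assumptions the pool would need several hundred connections just to keep up at a quarter of the planned load, and a pool of 100 would be gone within seconds. The fix is not a bigger pool; it is getting the long-running query off the home page.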

The only good thing I can say is at least they didn’t just print stack traces with DSN information contained in them. I’ve been shocked at the content of some of the stack traces I’ve seen on production sites when they encounter an error, but that’s another post.