The PerfMon Blog

October 2, 2008

The Benefits of Eco-System Management

Filed under: Performance Monitoring - Tyler Fullerton @ 2:06 pm

When talking about web applications, the Eco-System is a term used to describe the ever-increasing complexity and disparity of a web application (as it’s delivered to the users).  An eco-system approach departs from traditional web-based development methodologies, such as managing all software and hardware in-house, and instead adopts development concepts from OOP (code reuse being the biggest).  This change in behavior has made web-based development far more flexible (and reduced time to market) but has dramatically complicated the management of the resulting application (in particular, performance monitoring and SLA reporting).  With various components and pieces of functionality scattered across the internet, there needs to be a common platform for performance monitoring and SLA reporting.  The Webmetrics eco-system management platform does just this, and it does it by considering these main points:

  1. Monitoring should not be duplicated – Just like code/functionality, you do not want to duplicate your efforts.  Multiple departments these days need performance data (IT, QA, Marketing, Executives) and they all need to be on the same page.  A common misconfiguration is to have multiple services monitoring the same resource but with different test parameters.  The problem here is that if alerts and errors are not standardized then there is no way to build consensus among different departments.  Worse is if these values are being used to support an SLA.  If the company isn’t consistent with how the SLA is calculated then the company as a whole cannot (and will not) meet the SLA.  Eco-System management makes it easy to set up a single set of monitoring services that can then be shared among different groups in the organization.
  2. Thoroughly monitor web applications – You lose control as more and more functionality moves outside your firewall.  Your best bet is to not ignore these components when you’re monitoring.  With applications mashed-up from disparate components you need to make sure that not only is your application as a whole monitored, but all the bits and pieces that make it up are monitored as well.  This will help you locate and diagnose problems faster.  The Webmetrics eco-system does this by allowing you to monitor your web service calls (from the perspective of your network) and view their performance side-by-side with your other monitoring data.
  3. Keep your third parties honest – If Amazon says that EC2 is available 99.999% of the time what is your guarantee that at the end of the month (or year) it was up that amount of time?  Outages can be subjective (5 minutes of downtime can feel like eternity) and if you are put in a position to question an SLA you have with a vendor it’s best to have some sort of data to bring to the table (not just what your vendor brings).  The data can also be used to help you negotiate with your business partners at the end of the year (or contract).  The Webmetrics eco-system does this by allowing you access to comprehensive reports on performance and availability (uptime).
  4. Build trust with users – Your users look at you the same way…they want to know that they’re getting what they pay for.  A lot of companies are starting to see the benefit of building trust with users through behaviors that promote transparency of uptime.  The Webmetrics eco-system allows you to extend sharing of data to your users.  It empowers them to keep track of the performance of the services you’re offering them.  I suppose that’s good or bad depending on the stability of your systems.  For more information on transparency check out this blog.
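
To make the first point concrete, here is a minimal Python sketch of how two departments can report different availability from the very same samples simply because their test parameters differ.  The load times and the 3 and 5 second timeouts are made-up values for illustration, and `availability_percent` is a hypothetical helper, not part of any Webmetrics product.

```python
def availability_percent(load_times, timeout_seconds):
    """Percentage of samples that finished within the timeout.

    A sample slower than the timeout is counted as an error, which is
    one common (but not universal) way to define availability.
    """
    if not load_times:
        raise ValueError("need at least one sample")
    successes = sum(1 for seconds in load_times if seconds <= timeout_seconds)
    return 100.0 * successes / len(load_times)


# Hypothetical load times (in seconds) collected for one resource.
samples = [1.8, 2.1, 3.4, 1.9, 6.2, 2.0, 4.1, 1.7, 2.2, 2.5]

# Two departments, same data, different (assumed) timeout settings.
it_view = availability_percent(samples, timeout_seconds=3.0)
marketing_view = availability_percent(samples, timeout_seconds=5.0)
print(f"IT (3 s timeout): {it_view:.1f}%")
print(f"Marketing (5 s timeout): {marketing_view:.1f}%")
```

With these numbers one group reports 70% and the other 90% for the same period, which is the kind of disagreement a single shared set of monitoring services avoids.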

September 30, 2008

PerfMonBlog Performance Update

Filed under: Performance Monitoring - Tyler Fullerton @ 10:27 am

I’ve noticed lately that the performance metrics I’ve been collecting for this blog have degraded.  The statistics I’m seeing lately are:

  • A daily (9/30) average load time of 2.06 seconds.
  • A daily (9/30) uptime of 99.19%.
  • A weekly uptime of 99.73%.

Nothing too alarming.  The cause of the alerts is that the page’s load time is now starting to exceed 3 seconds for some locations (Atlanta and St. Louis) due to the addition of content (text and images).
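
To put those uptime percentages in perspective, this small Python sketch converts them into minutes of downtime.  It assumes uptime is measured against the whole period (1,440 minutes in a day, 10,080 in a week); a sample-based uptime figure only approximates this, and `downtime_minutes` is a name invented for the example.

```python
def downtime_minutes(uptime_percent, period_minutes):
    """Minutes of downtime implied by an uptime percentage over a period."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return (100.0 - uptime_percent) / 100.0 * period_minutes


MINUTES_PER_DAY = 24 * 60
MINUTES_PER_WEEK = 7 * MINUTES_PER_DAY

daily = downtime_minutes(99.19, MINUTES_PER_DAY)
weekly = downtime_minutes(99.73, MINUTES_PER_WEEK)
print(f"Daily: about {daily:.1f} minutes")    # about 11.7 minutes
print(f"Weekly: about {weekly:.1f} minutes")  # about 27.2 minutes
```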

The Newbie Introduction (A Recap)

I’ve been blogging on performance monitoring for a couple months now and it dawned on me that the information presented is probably straightforward for someone who has had past experience in performance monitoring (setting up, interacting with, etc.), but for someone new to performance monitoring it may be hard to cobble together a decent understanding of performance monitoring from a bunch of scattered concepts posted on a blog.  My goal with this post is to provide a list of questions that are integral to establishing a functional performance monitoring solution.  These questions are:

  • What problems are being solved?
  • What base are we trying to solve problems for?
  • Who is involved in solving the problems?
  • What information is required to solve the problems?
  • What is the perspective needed to solve the problems?
  • How will problem solving techniques be integrated?
  • Can future (unexpected) problems be solved?

These questions revolve around the idea of solving a problem.  That is, you have a web based resource (website, application, web service, etc.) and something about it is keeping you up at night.  So, let’s go through these one at a time:

  • What problems are being solved? This is the most important question you can ask yourself, but you already knew that :) .  You need to know where you are before you can determine if you’re moving in the right direction.  The process of implementing performance monitoring should be broken down into distinct milestones that should be in place before you even start talking to vendors.  For example, suppose your answer to this question is: “I want to know where on my site my users are going, where they enter the site, and where they leave the site!”  Talking to a performance monitoring vendor about this is only going to infuriate you as they try to sell you monitoring when all you really need is analytics (by the way, there are lots of really well established companies that can help you with analytics, such as Omniture or Coremetrics).  So know the problems that you want to solve…these are your goals.  Write them down on a piece of paper and give each one a weight (critical, nice to have, not really necessary).  Since the assumption is that you don’t know anything about monitoring you’ll have to be vague.  Problems like “My site is slow”, “It becomes unavailable a lot and I want to know when that is”, “My customers complain about performance but I don’t see it”, “I want to track where my users are going”, “I want to be able to perform failover if something happens”, “I want to have someone else host my site/content”, and “I want a cup of coffee!” are what you need to concentrate on.  They help you know where you need to go and how to get there.  The first three problems can be solved with monitoring.  The fourth, fifth, sixth, and seventh are probably not problems that you would want to solve with monitoring.  So now you’re armed with information that will allow you to successfully navigate the process of picking a vendor.
  • What base are we trying to solve problems for? That is, what exactly do you want to monitor?  In an ideal world you could monitor and collect performance metrics on every page of your site, or every business process in your web application.  The problem is that the cost and management of that solution is prohibitive and the data generated (alerts, logs, and reports) is probably more than any organization would be able to handle.  So it’s clear that you need to distinguish between what you need to monitor and what you don’t need to monitor.  This really depends on your business model and commitments to your customers (and other stakeholders).  If one of the main problems you’re trying to solve (from the previous question) is managing SLA values then you need to consider monitoring only those resources that fall under the SLA.  If many resources fall under the SLA then you could potentially re-tool your SLA so that SLA verification is based on a subset of the services you provide (this depends on how well established your SLA already is and if your clients will allow you to amend your agreement with them).  One important attitude to have during this step is honesty with yourself.  You really need to be honest and make compromises with yourself (and your organization) as to what needs to be monitored, what would be nice to monitor (but not necessary), and finally – what doesn’t need to be monitored.  You may even want to reach out beyond your current needs and ask others (customers, executives, etc.) what they might need performance metrics on.  Those needs may not fit into your current budget or plans but it’s always good to know what the future holds.
  • Who is involved in solving the problems? Gather the people that are going to help you make a decision.  If you’re the head of an IT department then you’re going to want to poll your customer base (whether this is an internal or external base) and find out what they’re asking for.  Are they even concerned with performance?  If they are, how apparent are performance issues to them?  Also, find others within your organization that can help you evaluate the performance monitoring services/tools that you will be looking at.  Marketing, for example, may be able to make great use of the data that is collected (by the way, marketing departments are almost always beneficiaries of performance monitoring data), or your development and QA departments may be interested in looking at the data that is collected.  This goes all the way up to your boss, your boss’ boss, etc.  When you have consensus among individuals in your company it will help to enrich your list of requirements.  Again, you’re just trying to build consensus at this point.  You do not have to commit to any of these requirements; you’re just building an eco-system (check out some of the other blog entries for more information on this term) view of your company’s performance monitoring needs.  Also, you may be able to expand your budget by doing this (how many IT departments have the same budget as the Marketing department?).
  • What information is required to solve the problems? Monitoring is all about data collection!  Sure, there is definitely something to be said about how that data is collected (as has been expressed in numerous posts), but at the end of the day you’re left with a set of data.  That’s all you’ve got!  Yes, you have a control panel or some other tangible artifact, but the only thing that’s going to consistently show you the ROI of performance monitoring is data (alerts, reports, graphs, logs).  So it is very important to figure out ahead of time what type of data you are looking for, how you want to display it, and how easy it is to get a presentation of the data that isn’t standard.
  • What is the perspective needed to solve the problems? The accuracy of monitoring is really quite subjective.  It depends on what you are willing to consider end-user perspective.  For example, you may want to consider only the HTML load time as the performance of your application because your goal is to only improve the delivery of the HTML (and no consideration is to be given to the various other components that make up the page).  Or you may want to consider everything that happens when a person uses your application (JavaScript execution, downloading images, etc.).  The importance of this is that you need to understand what you’re buying from a vendor and also that you will need to understand the context around the data that is collected.
  • How will problem solving techniques be integrated? No monitoring solution will be able to meet all your IT demands.  For example, monitoring is developed to provide accurate and reliable functionality that will alert you of issues and report on those issues as well as overall performance.  So it’s imperative to make sure that your monitoring solution can easily plug in to any existing tools (or future tools) your organization has.  This may be through technical solutions (API, SNMP messages, etc.) or procedural solutions (who gets alerts, how they react to them, decision trees, etc.).  Doing this will give credence to a monitoring initiative and will reduce confusion once the solution is implemented.
  • Can future (unexpected) problems be solved? Often a monitoring solution will meet the initial needs but will fail to meet future needs due to the accelerated evolution of technology.  As an example, a standard monitoring solution can easily monitor a site that relies on basic HTML but will more than likely have problems with more dynamic technologies like Flash or Ajax.

September 9, 2008

Active and Passive Monitoring Solutions

Filed under: Business Considerations, Performance Monitoring - Tyler Fullerton @ 8:48 am

Note: My brain must be on the fritz.  Dear readers, I have updated the post as I completely misused the terminology and definitions of Active and Passive monitoring.  I apologize for any inconvenience and have updated this post as of September 22nd (3pm PST).

I’ve been asked quite a few times about the distinctions between active and passive monitoring and which is the best method to consider when implementing a monitoring methodology.  In this post I’d like to provide a basic introduction to the two types of monitoring and talk briefly about their benefits and deficiencies.

First, let’s start with definitions of these terms:

  • Passive Monitoring – Performance/Availability monitoring that uses data sets generated from actual human users of a website or web application.
  • Active Monitoring – Performance/Availability monitoring that uses data sets that are generated by a consistent and automated user of a website or web application.

We can see from these definitions that in one case (Passive Monitoring) we are relying on the real world experiences of the existing user base for the website/application, similar in fashion to how web analytic data is collected.  In the other case (Active Monitoring) we are relying on the experience of a synthetic user (a piece of software that emulates an end-user’s interaction with a website/application).  Let’s start our analysis of the two methodologies by looking at their similar properties:

  1. Both can provide the same statistics (uptime, availability, errors, throughput, and other performance metrics).  Essentially, neither is limited to the basic data sets of performance monitoring solutions.
  2. Both will reflect accurate measurements that will represent the performance of the server at the time the sample was taken.  Stated differently, if the infrastructure for the website/application is under duress then the impact will be reflected in the data that is collected by the monitoring solution.  There are fringe cases where this concept breaks down for Passive Monitoring that we will discuss below.

What about the differences in these monitoring methodologies?  Here are the basic properties of an Active Monitoring solution:

  • Monitoring is performed from an emulated user.  This can be as simple as an automated process that makes base level HTTP requests (ex: Unix wget commands) or can be a complex solution using actual browsers for performing monitoring.  In either case, we are talking about a user of the website/application that is strictly software based.
  • Monitoring is consistent throughout the day and will always attempt to monitor regardless of the state of the website/application infrastructure.
  • Monitoring is consistent in configuration of the monitoring environment.  That is, every time you monitor the user (automated process) is the same.
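
As a rough illustration of the simplest kind of emulated user described above (a base level HTTP request, similar in spirit to a wget call), here is a minimal Python sketch.  The 3 second timeout, the `run_check` name, and the shape of the result dictionary are all assumptions made for this example; the `fetch` and `clock` parameters exist only so the check can be exercised without a network.

```python
import time
import urllib.request


def fetch_with_urllib(url, timeout):
    """Download a URL with the standard library and return the raw body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def run_check(url, timeout=3.0, fetch=fetch_with_urllib, clock=time.monotonic):
    """Run one synthetic sample and report success, duration, and any error.

    urllib's timeout applies to individual socket operations, so the total
    elapsed time is also compared against the timeout afterwards.
    """
    start = clock()
    try:
        body = fetch(url, timeout)
    except (OSError, ValueError) as exc:
        # URLError and socket timeouts are OSError subclasses.
        return {"url": url, "ok": False, "seconds": clock() - start,
                "bytes": 0, "error": type(exc).__name__}
    elapsed = clock() - start
    if elapsed > timeout:
        return {"url": url, "ok": False, "seconds": elapsed,
                "bytes": len(body), "error": "Timeout"}
    return {"url": url, "ok": True, "seconds": elapsed,
            "bytes": len(body), "error": None}
```

Calling `run_check` on a fixed schedule from the same machine gives the consistency described in the list: the same request, the same configuration, attempted whether or not the site is healthy.  A browser-based solution would measure far more (images, JavaScript execution) than this bare request does.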

And for Passive Monitoring solutions, the properties are:

  • Monitoring is performed by actual (human) users.  This is done by execution of JavaScript code embedded in the website/application that tracks the performance that the end user sees while accessing the site.
  • Monitoring reflects the actual usage parameters of the end users (ex: browser type, configuration, platform, etc.).  This is another way of saying that the end user perspective is accurately represented.
  • Monitoring will adapt to the demographic of the users of the website/application.
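
Once the embedded JavaScript has reported its measurements, the collection side of a Passive Monitoring solution is mostly aggregation.  The sketch below shows one way that might look in Python; the sample records, their field names, and `average_load_time_by` are hypothetical and not taken from any particular product.

```python
from collections import defaultdict


def average_load_time_by(samples, field):
    """Average the 'seconds' value of real-user samples, grouped by a field."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum of seconds, count]
    for sample in samples:
        bucket = totals[sample[field]]
        bucket[0] += sample["seconds"]
        bucket[1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


# Hypothetical measurements reported by end users' browsers.
beacons = [
    {"browser": "IE", "platform": "Windows", "seconds": 2.0},
    {"browser": "IE", "platform": "Windows", "seconds": 3.0},
    {"browser": "Firefox", "platform": "Mac", "seconds": 1.5},
]

by_browser = average_load_time_by(beacons, "browser")
by_platform = average_load_time_by(beacons, "platform")
print(by_browser)   # {'IE': 2.5, 'Firefox': 1.5}
print(by_platform)  # {'Windows': 2.5, 'Mac': 1.5}
```

Note that the groups, and how many samples land in each, are decided entirely by who visits the site, which is what the last property in the list means by adapting to the demographic of the users.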

The end goal of monitoring is really going to be the driving force in dictating which solution is the best.  Some companies may want to be able to record the experiences of their actual users; in this case Passive Monitoring is the appropriate solution.  Passive Monitoring will allow the company to collect samples on performance that are actionable in the sense that they can see what types of browsers are being used, which platforms are most important, which problems certain users are having, and how performance looks from an exact location.  This can help direct the company’s efforts when it comes to initial development and improvements of a website/application.  The Active Monitoring solution is more in tune with the task of reporting on performance to management, ensuring availability, and tracking SLAs because it is more consistent and has a reliable monitoring base that will not change over time.

Each solution has its faults as well, for Active Monitoring the faults are:

  • Does not track experiences of actual users accessing the site.
  • Does not provide statistics on browsers and platforms used by website/application end users.
  • Does not provide last mile information.

Passive Monitoring has the following faults:

  • Monitoring requires JavaScript, which can alter the performance of the website/application being monitored and can potentially break or not work altogether (if someone has JavaScript turned off).
  • Monitoring is subject to spoofing since information about a browser, platform, and other environment variables can be altered by a malicious end user.
  • Monitoring data will be sporadic and will only be collected when users are accessing the website/application.  No data will be collected during times when users are not on the site.  Therefore…
  • Issues with the website/application will not be detected until someone accesses the site.  The ability to detect problems before customers do is key and central to an on-going monitoring solution, and Passive Monitoring cannot provide it.
  • If the site becomes unavailable then no monitoring will be performed because users will be unable to interact with the website/application and therefore will not be able to execute the JavaScript that will track their experience.

The final analysis is: Passive Monitoring is great for QA and development purposes.  If your product is in Beta or not mature/critical enough for an SLA then this may be the best solution because it provides statistics on how your end users experience your website/application and the specifics of their environments.  However, if your application is central to your business or a certain level of service (performance and availability) is expected (even agreed to) by your customers then you need a more consistent and robust monitoring solution.  The Active Monitoring solution is far superior for these types of environments because it guarantees monitoring, keeps that monitoring consistent (you don’t have to distinguish samples based on environmental factors such as IE vs. Firefox), alerts you to problems early (before your customers see those problems), and provides a basis for reporting performance and availability to others in the organization as well as tracking SLAs.

August 29, 2008

PerfMon PerfOrmance

Filed under: Performance Monitoring - Tyler Fullerton @ 9:54 am

Hi everyone,

I thought I’d post some graphs on the performance of the PerfMon Blog over the last couple months.  There are a few reasons for doing this:

  1. I want to see how consistent performance of a WordPress blog is.
  2. The graphs added to the page will mean more objects being downloaded which will affect the weight of the page, which in turn will allow me to look at how adding images will affect the performance of a page.

This first graph shows the average load time of the PerfMon Blog over time.  We can see from this graph that the performance of the blog is pretty consistent over a 30 day period (around 1.84 seconds) and rarely do errors occur (the errors on the graph are timeout errors that occurred when the perfmon.wordpress.com page took longer than 3 seconds to download):

Average load time performance of the PerfMon Blog.

So far we see that the performance is pretty consistent from a high level.  The timeout errors when the page takes longer than 3 seconds to download do not concern me too much (it’s bound to happen) and a very consistent performance (that is, a non-spiky graph) is always a good sign, especially when monitoring from geographically dispersed locations.  Now let’s drill down a bit and see what the performance of the page looks like from the object level (images, CSS, JavaScript files).  Here we have a waterfall style graph called a Full-Page Breakdown:

Full-Page breakdown for PerfMon Blog

The graph shows us the performance of the items on the page and breaks down that performance into three key values: DNS lookup time, latency time (i.e. 1st packet time), and transfer time.  Since monitoring is being performed from an IE browser we know that we’re seeing the actual performance of the page (JavaScript execution, rendering and layout, as well as any objects requested asynchronously…if any).  The object level performance looks pretty good; the only thing I’d really like to comment on is the DNS lookup time (the yellow values of the graph).  It seems that because of a number of third party items (analytics and advertising code) there is a bit more DNS lookup time than I’d like.  IE performs a DNS lookup for a domain only once and then caches that information for the remainder of the session, so every time a new domain is introduced a DNS lookup needs to be performed.  We spent about 0.5 seconds doing DNS queries!
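
The one-lookup-per-new-domain behavior described above is easy to reason about with a short Python sketch.  Given the list of objects a page requests, it returns the distinct hostnames in the order they first appear, each of which would cost one DNS lookup in a fresh session under that assumption.  The URLs are placeholders, not the actual objects on this blog, and `hosts_needing_dns_lookup` is a name invented for the example.

```python
from urllib.parse import urlsplit


def hosts_needing_dns_lookup(object_urls):
    """Distinct hostnames, in first-seen order, among a page's object URLs."""
    hosts = []
    for url in object_urls:
        host = urlsplit(url).hostname
        if host is not None and host not in hosts:
            hosts.append(host)
    return hosts


# Placeholder objects: the page itself plus third party analytics and ads.
page_objects = [
    "http://blog.example.com/",
    "http://blog.example.com/style.css",
    "http://stats.example.net/tracker.js",
    "http://ads.example.org/banner.js",
    "http://stats.example.net/pixel.gif",
]

lookups = hosts_needing_dns_lookup(page_objects)
print(len(lookups), lookups)
# 3 ['blog.example.com', 'stats.example.net', 'ads.example.org']
```

Five objects but only three lookups here; every additional third party domain adds one more.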

This final graph is here to show how the PerfMon Blog performs for viewers coming from different locations.  It looks fairly good:

Uptime and average load time for PerfMon Blog

This is fairly good performance.  The average load time (per monitoring location) is represented by the green line.  This line is rather straight meaning that the performance is consistent regardless of where in the US you are viewing the PerfMon Blog from.  That’s good news!  Often this line will start out low on the right (meaning the performance is good from those locations) and will increase as the line moves to the left of the graph indicating that users further from the server hosting the web-site see poor performance.  The background of the graph is broken down into 3 values:

  1. Green – This is the percentage of successful samples taken from that location.
  2. Yellow – This is the percentage of unvalidated errors.  An error is unvalidated if another monitoring location is unable to duplicate the error that was reported.
  3. Red – This is the percentage of validated errors from the location.  An error is validated if a number of monitoring locations reported the same error.
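
The three background values can be sketched in a few lines of Python.  The post does not say how many other locations must reproduce an error before it counts as validated, so `required_confirmations=1` below is purely an assumption, as are the function names and the sample outcomes.

```python
from collections import Counter


def classify_error(confirming_locations, required_confirmations=1):
    """Label an error by how many other locations reproduced it."""
    if confirming_locations >= required_confirmations:
        return "validated"
    return "unvalidated"


def background_breakdown(outcomes):
    """Percentage of ok / unvalidated / validated samples for one location."""
    if not outcomes:
        raise ValueError("need at least one sample")
    counts = Counter(outcomes)
    total = len(outcomes)
    return {label: 100.0 * counts[label] / total
            for label in ("ok", "unvalidated", "validated")}


# Hypothetical location: 8 good samples, one error nobody else could
# reproduce, and one error that two other locations also reported.
outcomes = ["ok"] * 8 + [classify_error(0), classify_error(2)]
print(background_breakdown(outcomes))
# {'ok': 80.0, 'unvalidated': 10.0, 'validated': 10.0}
```

Here the green, yellow, and red portions for the location would be 80%, 10%, and 10% respectively.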

We can see that certain regions generally see more errors (validated and unvalidated): Salt Lake City, Boston, San Jose, Newark, Scranton, and Los Angeles.  This coincides with their higher load times as well.  Overall, the performance of the PerfMon Blog page is pretty good.  The question is: is that because it has minimal content, is hosted on a solid infrastructure, or has very few people viewing it ;) .

August 22, 2008

SOA Management Framework

Filed under: Business Considerations, Performance Monitoring - Tyler Fullerton @ 2:47 pm

I’ve been reading a book on SOA architectures by Nicolai M. Josuttis which provides a very accessible introduction to SOA (Service Oriented Architecture) design, benefits, and established best practices. One theme that keeps coming up is Collaboration, and in fact Nicolai states:

One key requirement for SOA is collaboration (pg 104).

The collaboration that Nicolai is talking about in the book is among isolated departments or business units within a company and is a key factor in ensuring the success of SOA (pg 104). This need to collaborate is a major driving force behind the decisions that need to be made to manage a SOA application.

Take the example of any company that has successfully implemented a SOA based application. The company has overcome the inflexibility of a large number of complex, distributed systems by creating a framework of services and processes that expose functionality to the consumers (users) of the application. This leaves the company with a number of services and processes that interact in a choreographed manner where no process has total control and all processes and services have limited knowledge of the overall application. The departments behind them have given up some knowledge (and control) to be a part of a more flexible federated application. (Think about U.S. history: at the Constitutional Convention of 1787 the states were basically asked to give up some of their influence and power to a centralized federal government. Some influence is gone but the resulting government is stronger and more flexible.) Now there is a flexible (and scalable) infrastructure but there isn’t any unifying view of the application (other than the application itself, but if you look at only the application you become ignorant of the components underneath).

In a SOA environment, monitoring of these services and processes is going to become more and more critical because of the limited scope of knowledge each department has. I think SOA based applications are still a relatively new concept that companies are experimenting with, so there really hasn’t been any consideration as to how to ensure the performance of these applications and distribute the results to all interested/involved parties. What I suspect is that a need will arise (if it hasn’t already) for a platform in which all functional and non-functional requirements of a SOA based environment can be managed. I’ve heard that this isn’t even possible because of all the different ideas and methodologies for implementing SOA, but it seems clear that some base framework that doesn’t contribute to the underlying architecture needs to be present for management of the architecture’s requirements.  What I’d really like to see is a platform that allows you to plug in non-functional requirements (e.g. performance monitoring, SLA management, Business Process Management, etc.) as needed. A SOA management platform would help alleviate the pains that can occur when a company’s culture acts to resist collaboration. My experience with performance monitoring tells me that unless such a platform exists, there will never be widespread adoption of monitoring for SOA based applications.

August 12, 2008

I left my heart in San Francisco (Web 2.0 that is)

Filed under: Performance Monitoring - Tyler Fullerton @ 3:26 pm

Earlier this year (March/April) Webmetrics exhibited at the O’Reilly Web 2.0 conference in San Francisco and we found that there were quite a few unanswered questions on the mind of fellow exhibitors and attendees. The most prominent questions were:

  1. Service Level Agreements (SLA) – Just about everyone who came to the Webmetrics booth had some sort of requirement for SLA reporting. Mostly we saw that the requirement was to provide an SLA to users of a service (since many of the exhibitors were companies that have a SaaS model/platform or at least were providing web services that could be used by their clients to extend functionality of existing products). There were some cases where tracking SLA values was more geared towards keeping tabs on SLAs that are offered by third parties but the overwhelming majority were looking to provide their clients with the SLAs that had been agreed upon. This indicates that many companies are becoming proactive in sharing information with their clients (in the form of SLAs). Which leads to…
  2. Collaboration – Everyone understood. Very rarely did someone not get the idea of collaborating with third parties or partners. One of the main ideas behind the Web 2.0 movement is to develop software using a service model. Just about everyone in attendance at the conference was entrenched with some sort of third party. People are naturally suspicious, which makes for a bad situation when a third party offers up some metrics that were collected in house. Often reports are generated by the provider of a service and then handed over to the user without any explanation of what the errors are or where the data was collected from, or even…god forbid, with incomplete data sets.
  3. Problems – Finally, the majority of people who stopped by the booth had experienced some sort of performance issue. In most cases it was uptime, that is, the service being provided was not available for extended periods of time (or unavailable for short periods of time very frequently). Although users of web services are becoming more sophisticated with their consumption, they really need to buckle down and pay attention to SLAs and demand (or track) SLA information (see the first point).

I will be at the Web 2.0 conference in New York City this September. Please feel free to stop by the Webmetrics booth and share your thoughts and opinions with me on the performance issues that surround web eco-systems and Web 2.0 applications.

The performance for this blog (today’s average load time and uptime):

  • Average load time is 1.78 seconds.
  • Availability (uptime) is 100%.

August 6, 2008

More MobileMe

Filed under: Load Testing, Performance Monitoring - Tyler Fullerton @ 11:22 am

The other day Om Malik wrote an interesting post about the availability issues with Apple’s MobileMe site. He made quite a few key observations related to monitoring and load testing that I wanted to reiterate here. They are:

  • There is no unified IT plan vis-a-vis applications; each has their own set of servers, IT practices and release scenarios. This is becoming more and more of a problem with the adoption of SOA architectures, SaaS, and mashup models (see yesterday’s post for more information on monitoring these types of architectures). You now need to work closer with your partners when performing load testing to ensure that all components of your application are thoroughly tested. A load test is of little value if it fails to test 25% of your application/infrastructure. This requires more up-front time planning a load test, and if you’re using a vendor for load testing then ensure that they provide services that are consultative and strive to understand your environment and infrastructure.
  • There’s no unified monitoring system. Monitoring data from all sources (server performance, external application performance, analytics, etc.) needs to be considered and in an ideal environment would be mashed up to gain new insight into the data. Consistent and fine-grained monitoring intervals are two properties that are important in effectively monitoring a web application.

Even the giants like Apple will feel the pain if they fail to ensure performance and availability of their applications through pre-release load testing of infrastructure and on-going monitoring for performance and continued availability.

The performance for this blog (today’s average load time and uptime):

  • Average load time is 1.81 seconds.
  • Availability (uptime) is 100%.

August 5, 2008

Web Eco-System Monitoring

Filed under: Performance Monitoring — Tags: , , , — Tyler Fullerton @ 3:08 pm

Seems like with cloud computing and the SaaS/PaaS models everyone is starting to really ask questions around availability and Service Level Agreements. And why shouldn’t they be? Moving your critical application functionality outside your firewall and onto the servers of a third party is a big step that makes you vulnerable to the downtime of systems you have little to no control over. Both Amazon and Twitter have had high profile downtime events recently which have caused users to really dig deep into the SLAs of these services.

Monitoring the performance of an application can become quite complex when third party services are used to provide key functionality. First, you need to be able to drill down into the various components that make up the application. Previously it would have been sufficient to monitor the performance of the web based application from just an external user’s perspective, but now your servers are themselves the users of third party components, so that monitoring needs to be done from behind your firewall. Second, on top of the added complexity of perspective, there is the sharing of the data that is being collected. If you’re a producer of services then you want to be able to share your uptime and performance SLA statistics (creating an open and trusted relationship with your users), and if you’re a consumer of third party services then you will want to share the appropriate performance data with various departments in your organization (as well as be able to present that data to your third party providers if SLAs are ever in question).
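As a rough illustration of the kind of statistics being shared, here is a small Python sketch that turns raw check results into an availability percentage and an average load time, then compares them to SLA targets. The function name, the sample format, and the target values (99.9% availability, 2.0 seconds) are assumptions for the example, not anything defined by a real SLA or by Webmetrics.

```python
from statistics import mean


def sla_report(samples, availability_target_pct=99.9, load_time_target_s=2.0):
    """Summarize monitoring samples against hypothetical SLA targets.

    samples: list of (succeeded, load_time_s) tuples from one monitoring
    service. Load time is averaged over successful checks only.
    """
    load_times = [t for ok, t in samples if ok]
    availability = 100.0 * len(load_times) / len(samples)
    avg_load = mean(load_times) if load_times else None
    meets_sla = (
        availability >= availability_target_pct
        and avg_load is not None
        and avg_load <= load_time_target_s
    )
    return {
        "availability_pct": availability,
        "avg_load_time_s": avg_load,
        "meets_sla": meets_sla,
    }


# Four made-up checks against a third party service, one of which failed.
print(sla_report([(True, 1.7), (True, 1.8), (False, 0.0), (True, 1.9)]))
```

The point of keeping this in one place is that every department, and the third party provider, sees numbers calculated the same way from the same samples.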

If you have a moment, check out the Webmetrics Eco-System monitoring platform for a solution to managing the SLAs of third party components and mashups. I would be interested to hear any feedback you have. I promise I will not focus this blog too much on Webmetrics based products and will try to keep it focused on web performance monitoring concepts and best practices. In this case I feel that keeping track of the availability and performance of third party components and services is a critical factor in managing any web application that utilizes third party services.

The performance for this blog:

  • Average load time is 1.73 seconds.
  • Availability (uptime) is 100%.

July 29, 2008

MobileMe or: How I Learned to Stop Worrying and Love the Load Test

Filed under: Load Testing, Performance Monitoring — Tags: , , , — Tyler Fullerton @ 2:24 pm

So, you’ve got your new iPhone 3G, you’ve waited in line at the Apple or AT&T store for hours and it was worth it. You probably bought the 8 Gig…but maybe you bought the 16 Gig (and like a fellow co-worker you bought the white version so people knew you had the 16 Gig iPhone). You’ve now got the cool new iPint application, or you’ve finally solved an age old problem with the BigTipper application. Furthermore, you’re extremely happy that you can use MobileMe to store your contacts, calendar appointments, tasks, etc. on “The Cloud” so that you can easily keep your iPhone and MacBook synchronized. Life is good, Windows is bad, and Apple rules….but, er, wait. What’s this? Why can’t I manage my contacts and why are the changes I made not propagated to my devices? And where are my emails?

I don’t care how good Apple products are. “The Cloud” is bad (not that it means to be). With all the hype that the new iPhone 3G is getting, some unexpected issues arose on the Apple-run me.com (MobileMe) site when it was flooded with traffic (i.e. users). These issues caused downtime, loss of functionality, and even loss of mail messages for some MobileMe users.

MobileMe is a web application that allows you to keep all your information on the web so that all your devices can be synced from one convenient location, that conveniently isn’t under your control. I don’t know what exactly Apple did to prepare for the launch of this application but it would seem that Load Testing was not performed (or not properly performed) since there was no concept of how much load the servers could handle.

A load test would have helped the IT folks at Apple understand what capacity they were currently operating at, where their likely points of failure were (and at what levels of usage those points would fail), and what infrastructure they would need to surpass those limits and prevent failure. Load testing is something that should be done before any major release and ideally would be done by someone with a proficient background in load testing and the review of load test results (either someone internal or external to the company, as long as they have some experience with load testing).

Generally load tests are launched before a major release of a web application or service. A load test will consist of a number of iterations in which virtual users (the load) are applied to the server for a duration of time (usually 1 hour). During each iteration multiple types of scenarios will be run. In the case of the MobileMe application the scenarios could have been:

  1. Login to MobileMe.
  2. Sync contacts on MobileMe.
  3. Update calendar event.

The first iteration of a load test would have consisted of thousands of virtual users (each one performing one of the above scenarios) accessing the MobileMe application. During the test, the virtual users would track performance metrics (availability and load time) for the various scenarios mentioned above. For example, as the load increases on the server we could see how the performance would degrade for someone who was trying to log in to the MobileMe application while thousands of other requests were being served by the MobileMe server. Once this initial iteration is complete, the Apple team would make revisions to their application and architecture based on the results of the load test. After those changes are made another load test iteration could be conducted to ensure that the changes have improved performance and not caused any unintended issues. Rinse and repeat.
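For the curious, here is a minimal Python sketch of what one iteration looks like in code. It is only a toy: nothing here talks to MobileMe or any real server. The `simulated_request` function, the capacity of 1000 users, and the way load time and failures grow with load are all assumptions invented for the example. A real load test would issue actual requests and hold the load for the full duration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

SCENARIOS = ("login", "sync_contacts", "update_calendar_event")


def simulated_request(scenario, virtual_users, capacity=1000):
    """Hypothetical stand-in for a real request: returns (succeeded, load_time_s)."""
    utilization = virtual_users / capacity
    load_time = 0.5 * (1 + utilization ** 2)  # assumed: slower as load rises
    return utilization <= 1.5, load_time      # assumed: fails past 150% of capacity


def run_iteration(virtual_users, request_fn=simulated_request, workers=32):
    """Run one iteration: each virtual user performs one scenario, round-robin."""
    scenarios = [SCENARIOS[i % len(SCENARIOS)] for i in range(virtual_users)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outcomes = list(pool.map(lambda s: request_fn(s, virtual_users), scenarios))

    # Group the (succeeded, load_time_s) outcomes by scenario.
    by_scenario = defaultdict(list)
    for scenario, outcome in zip(scenarios, outcomes):
        by_scenario[scenario].append(outcome)

    # Report availability and average load time per scenario.
    return {
        scenario: {
            "availability_pct": 100.0 * sum(ok for ok, _ in results) / len(results),
            "avg_load_time_s": mean(t for _, t in results),
        }
        for scenario, results in by_scenario.items()
    }
```

Calling `run_iteration(500)`, then `run_iteration(1000)`, then `run_iteration(2000)` mimics ramping up the load between iterations. With the made-up model above, load time climbs as the user count rises and availability drops to zero once the load passes the assumed failure point. Passing your own `request_fn` is where a real request would plug in.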

Here’s a link to the MobileMe status blog (http://www.apple.com/mobileme/status/) which outlines more specifics about the outage. This is a problem that all companies providing web based applications and services need to consider; it’s great when you launch an application that’s popular, it’s not so great when that popularity damages your brand and prevents your service from being usable. The iPhone is popular enough that most users will probably ignore the outage (and the lost emails between July 16th and 18th), heck, those users may even endure a couple more outages. But the majority of applications on the web do not have the brand backing of Apple/iPhone.

More on load testing offered by Webmetrics: http://www.webmetrics.com/loadtesting.html.

And here’s the performance for this blog:

  • Average load time is 1.70 seconds.
  • Availability (uptime) is 100%.