Performance Testing of the Top 100 Sites is Misleading at Best

By Dylan on January 11, 2010 10:09 am

Recently, a number of performance tests based on the top 100 web sites have been released, such as the SpriteMe savings analysis, the IE8 top 100 sites test results, and the JSMeter research. These are in direct contrast with tests such as Acid3, which attempt to test the future of the web rather than just what’s possible today.

These efforts are outstanding and highly useful, especially the JSMeter work and its valiant effort to redefine performance testing to be indicative of today’s web apps.

I completely agree with one of their stated goals:

We hope our results will convince the JavaScript community to develop and adopt benchmarks that are more representative of real web applications.

However, I disagree with their approach: they are testing the performance of today’s already-optimized sites! There is nothing in between testing today’s sites and the more unrealistic “test every feature” Acid tests.

I believe more accurate tests for tomorrow would be useful: tests of what’s pushing the limits of the web today but is not yet a top 100 site. My main objection to comparing performance across the top 100 web sites is this: the top 100 web sites are already relatively performant, because they are optimized for what’s possible today. They have improved, and continue to improve, thanks in large part to the work of Steve Souders and others in the performance optimization community.

Because server operations and bandwidth cost significant amounts of money, high-traffic web sites generally dedicate considerable resources to optimizing their sites. High-traffic web sites also face significant competition and are highly scrutinized for acceptable page load time. Budget and competition mean that popular sites do not deploy code that makes pages load slower than their desired performance threshold. Even more importantly, top 100 sites have the budget to keep their apps working when things change in the future. If you have the budget for it, you can dedicate people on your team to squeezing out performance improvements in every aspect of an application. Most web apps cannot afford to do this.

When we’re testing the performance of new browsers or analyzing page load performance, we should also be looking at what the top 100 sites will look like, in terms of features and expectations, five years from now! So how do we do that today? There’s no simple answer, but here are some ideas:

  • Test popular web apps, e.g. mint.com populated with large amounts of data
  • Test apps that don’t support IE6, e.g. Google Wave
  • Test all sections of popular sites, not just the home page, through an automated performance test harness (see the sketch after this list)
  • Test ridiculous configurations of popular applications, e.g. enable every feature in modular applications until they slow down
  • Test apps over long amounts of time in the browser, not just initial page load time
  • Test 50 apps, each in a different tab, all at once, and see how fast you can make a browser like Firefox or IE crash!
  • Test throttled networks that emulate the profile of mobile and satellite networks, slow hotel wi-fi networks that often limit the length and duration of connections, corporate proxies, tech conferences, and countries with overloaded pipes (e.g. YouTube in New Zealand)
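
As a rough illustration of the automated-harness idea above, here is a minimal sketch in Python, assuming Selenium WebDriver and a browser that exposes the Navigation Timing API. The example.com section URLs are hypothetical placeholders; this is a sketch under those assumptions, not a definitive harness.

```python
# Minimal sketch of an automated page-load harness (assumption: Selenium
# WebDriver is installed and the browser exposes the Navigation Timing API).
# It visits several sections of a site, not just the home page, and records
# full page-load times. The URLs below are hypothetical placeholders.
from selenium import webdriver

SECTIONS = [
    "https://example.com/",                      # home page
    "https://example.com/search?q=performance",  # a search results page
    "https://example.com/profile",               # a logged-in style section
    "https://example.com/settings",              # a settings-heavy section
]


def load_time_ms(driver, url):
    """Load a URL and return loadEventEnd - navigationStart in milliseconds."""
    driver.get(url)  # blocks until the page's load event has fired
    return driver.execute_script(
        "var t = window.performance.timing;"
        "return t.loadEventEnd - t.navigationStart;"
    )


def main():
    driver = webdriver.Firefox()  # or webdriver.Chrome(), etc.
    try:
        for url in SECTIONS:
            for run in range(3):  # repeat runs to smooth out variance
                print("%s run %d: %d ms" % (url, run + 1, load_time_ms(driver, url)))
    finally:
        driver.quit()


if __name__ == "__main__":
    main()
```

The same loop could be pointed at deeper sections of real applications, repeated under a throttled network profile, or run in several browsers to approximate some of the other ideas above.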

Only when browsers are pushed to their limits do we see where they break down, and how sites break them. We also need tests and tools (such as instrumented usage of YSlow, PageSpeed, SpeedTracer, etc.) comparing the most complex apps and how they perform across the various browsers, as today’s complex app is potentially tomorrow’s median site.

To be clear, I’m not saying “don’t optimize for today”. I’m saying: stop comparing cutting-edge sites to Google search results. Lumping the two together in a common test is like pitting apples against oranges because they are both round fruit.

Comments

  • Great post, Dylan. One thing I like about testing the Top 100 is that if you can find significant opportunities for savings in these sites, it’s likely that the savings are even greater in other sites. Also, getting these Top 100 sites to change (improve) their performance is a bigger bang for the buck (since the Top 100 sites have more traffic).

    Broader testing is great (Top 1000000?), but the resources it would take to churn through so many sites and pages are too much for many projects, especially ones that are Open Source. My solution to this is to crowdsource the data – as in SpriteMe and Browserscope. Crowdsourcing doesn’t work for all test suites, most typically because it’s not a controlled environment.

    Benchmarks that better reflect the real world are definitely needed.

  • Joeri

    As you point out, there’s a strong self-selection effect going on here. A page that pushes the limits of browsers will be awkward to work with, and so won’t get truly popular. Google Wave will never be a breakthrough app until it runs well on IE, even though I think that in the long run it will mean much more to the web than Facebook.

    Another issue is that what is “popular on the internet” is not the same as what is “representative of the internet”. If you go through the top 100 sites, the sort of web apps that you find are blogging tools, social networks, video streaming sites, trading sites, … The repeating pattern is that they are sites that are personal-time oriented, and that don’t have a pay-to-play gateway. For popularity, you need lots of eyeball time, and to get that you need to get people to come to you in their leisure time, and not put any roadblocks in their path (like having to pay money). You’re never going to see a focused business web app in the top 100, no matter how useful it is.

    The fact that browsers focus so much on the top 100 is evident in their slanted priorities. For example, the video and canvas tags are major priorities in the social space, and therefore in the top 100, but they’re not that big of a deal in the corporate web apps space (canvas is nice, but not essential). Browser improvement effort is heavily focused on the apps that are least focused on “getting things done”. That might not be a bad thing, but it’s something not to forget.

    Looking at Windows, Microsoft benchmarks any new OS it releases against a representative sample of apps from all categories. It doesn’t just take the 100 most-installed apps and test/optimize only for those. Maybe browser makers also need such a representative subset, with sites and web apps from all categories and complexities, regardless of how popular they are. No idea how you would make such a list, though.