Last week an interesting problem surfaced at work. An application engineer received reports of slow performance on a particular website and needed some help from my group to track down the source of the problem. This engineer had done some fantastic research on the problem and was able to answer almost every question we threw at him in regards to details surrounding the issue. I am going to try and run through the problem itself and the questions we asked which led us to the possible culprit. The solution to the problem was discovered a few days later and ended up surprising us as it was not even something we had considered could be the cause. Although the application engineer collected a lot of data in the form of trace logs and packet captures, my group didn’t examine any of this data. The problem was solved before we actually had to get in and look at the data ourselves. With a white board and some direct questions, we were able to point the engineer in the right direction. He did all the work.
A URL that was Internet facing was performing very sluggish compared to others.
When did the problem start? Unknown.
Possible causes to consider:
1. Remote end of the connection
2. Internet connectivity
4. Intrusion prevention sensor/Content filter/Other security hardware
5. Router/Switch/Load balancer problem on the internal network hosting the site
6. Server hosting the site
7. Web server software on server hosting the site(ie IIS,Apache)
8. Web site code (ie HTML,ASP,JScript,CSS,XML)
Troubleshooting: For the purposes of isolating the problem, we started with the remote connectivity and worked our way inward. From here on out, I am going to refer to the application engineer as Bob. That’s not his real name, but it’s a lot easier to type than “application engineer” or his actual name.
Had Bob checked into the remote side as being the source of the problem? Yes, he had. In fact he ran the same checks from other ISP’s and experienced the same result. That rules out item 1 on the list of possible causes.
Bob had a lot of additional information to add regarding this problem. First, this particular website was really a specfic URL that was problematic. Over a dozen URL’s using the same exact hostname were fine. It was just this one particular URL that was having a problem. That rules out item 2 as being the issue. Second, Bob stated that the problem was occurring on the internal network as well. That rules out items 3 and 4 from the list of possible causes. Now we’re getting somewhere. At this point, we know that we aren’t dealing with a problem isolated to the Internet. That’s actually a good thing because it’s never easy when you have to explain to people that you have no control over traffic once it leaves your network. It just comes off like you are passing the buck to non-network savvy people.
Bob added an additional piece that would vindicate the network hardware from being the culprit. He stated that the average MTU on all of the URL’s that were working great was somewhere over 1000 bytes. However, for the URL that was operating sluggish, the average MTU size was a little over 200bytes. Now the discussion goes on for a few minutes about how MTU size will affect performance and that 200byte average sizes are not good when compared to the other URL’s and their greater than 1000 MTU averages.
At this point, we know there is an MTU problem and that problem occurs on the external and internal network. Now I know that every switch this traffic is traversing on the internal network is going to allow an MTU of 1500, so I don’t think there is a piece of networking gear causing the problem. This seems like it is going to be something with the system itself. It turns out that this particular server hosting all of these URLs is one of several servers hiding behind a load balancer. I know my load balancer isn’t messing with the MTU, so I feel comfortable in ruling out item 5 as being the source of the problem.
Has Bob checked the server hosting these URLs? Bob indicates that there are 4 different servers behind the load balancer hosting these same URLs and they are all having the same problems. He tested the URL on each individual server and experienced latency. It is possible that we are dealing with a problem on all 4 servers, However, the odds of that being a sever hardware probem are very low. Considering the fact that these same servers host over a dozen more URL’s that are running with no problems, I am convinced that we can rule out item 6 as a possible culprit.
Now we are looking at the web server software or the site code itself as being the culprit. While I am by no means an expert when it comes to IIS, Apache, or other web server software, I am willing to bet that the issue is not with the web server software. My reasoning is that only 1 URL is experiencing the problem and over a dozen other URL’s are not. They are all using the same hostname, so one would expect any sort of MTU setting in the web server software, if there is one, to be the same across every URL.
At this point in the troubleshooting process, we figured it must be something in the code. Our recommendation to Bob was that he go back to the developers and have them check their code.
Bob came back several days later. He found the problem. Actually, there was no problem. The way the developers had coded this particular URL was what caused the problem. In this case, they had a bunch of really small CSS files that were used in conjunction with the URL that was problematic. The client would make the request and then it would have to grab tons of really small CSS files. Due to the small size of these files, the MTU itself was small. I suppose that small file sizes wouldn’t be too much of a problem, except in this case, there were too many files that had to be transferred. That is what was causing the latency.
In this particular case, there was nothing wrong with any infrastructure or server equipment. Everything was working as designed. If nothing else, it was a reminder that developers don’t always consider application performance over the network when designing software. They routinely get beat up for having poor security. I guess you can add poor network performance to the list as well. I think it is a generally accepted belief that programs are usually designed for low latency LAN environments, and very rarely are designed with WAN performance in mind. I shouldn’t be surprised to find a case like this in which the code wasn’t designed with network performance in mind at all.
I feel that it is also important to point out that it is fairly difficult to write code that takes all factors into consideration(ie Security, Network). Maybe the best solution is to involve the various entities during the testing of code to ensure it will perform properly. I can see how this issue would have been overlooked since it was a simple URL that was affected. Had it been an entire program that was affected, it might have been caught during testing.