Home-made Stress Testing
Software engineering nowadays necessitates the use of tools. From IDEs to the xUnit family of unit testing frameworks to aspect-oriented frameworks to performance testing tools, you will be hard-pressed to find a decent development shop that does not use tools at every opportunity…except on The Project.
Yes, you see, I am still waiting for The Company to send me my check for my last week of work. Meanwhile, I have already been paid for my first week of work at my new job.
So, as you can imagine, dumping a stack of purchase orders on my boss’s desk was not going to get me anywhere. This blog has chronicled the results of this frugality.
But to their credit, I was provided with four test machines (not behemoths, but regular-use workstations) to use as I saw fit.
What, then, is a poor chap to do when it comes time for performance testing?
(this is how easy it should be)
Simple: work from the ground up. What was the problem? I had to simulate 300 simultaneous users in 100 buildings spread across the American northeast hitting the same server, and somehow measure the response time in each of our use cases.
My first crack at this was to strip the UI layer from the system and replace it with a console app. Hey, layers came in handy after all. After obtaining an operational profile, I essentially wrote a script in the console app that acted out this profile.
Then, another executable spawned off 70 copies of this console app. Run that on all four test machines and you get 70 × 4 = 280 simulated users, close enough for us.
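The spawner itself was almost trivial. A minimal sketch in Python of the same idea (the real launcher was a separate executable, and `ProfileRunner.exe` is a hypothetical name, not The Project's):

```python
import subprocess
import sys

def spawn_clients(command, n):
    """Launch n independent copies of the console script at once and
    wait for all of them to finish. Returns each process's exit code."""
    # Popen returns immediately, so all n copies run concurrently;
    # wait() then blocks until each one exits.
    procs = [subprocess.Popen(command) for _ in range(n)]
    return [p.wait() for p in procs]

# On The Project the call was roughly (name is hypothetical):
#     spawn_clients(["ProfileRunner.exe"], 70)
```

The key design point is starting all copies before waiting on any of them, so the load really is simultaneous rather than sequential.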
In the script, I added code to time how long each operation of interest took. This proved inaccurate because of the overhead of console output and exception handling. I finally bit the bullet and added finer-grained time measurement code within the code-under-test itself. Yes, this did spawn a “test” branch version of some files, but the cost and inconvenience were minimal.
The time measurement code would dump the statistics into Excel sheets, and a quick VBA script tallied everything up and gave us an average response time for the system.
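The measure-dump-tally pipeline is worth sketching. This Python version stands in for the original C#/Excel/VBA pieces; the function names and CSV output are my inventions, not The Project's:

```python
import csv
import time
from contextlib import contextmanager

TIMINGS = []  # (operation_name, elapsed_seconds) rows

@contextmanager
def timed(operation):
    """Wrap an operation of interest and record how long it took,
    much as the 'test' branch of the code-under-test did."""
    start = time.perf_counter()
    try:
        yield
    finally:
        TIMINGS.append((operation, time.perf_counter() - start))

def dump_timings(path):
    """Stand-in for the Excel dump: write the raw rows out as CSV."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(TIMINGS)

def average_response(rows):
    """The VBA tally, reduced to its essence: mean elapsed time."""
    return sum(elapsed for _, elapsed in rows) / len(rows)
```

Instrumenting then becomes a matter of wrapping each operation in `with timed("save_order"): ...` and averaging the dumped rows afterward.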
For a few sprints, this worked out fine, and I was meeting the goal set forth in the requirements document. But this model was fundamentally flawed, and it soon showed when more and more operations were added to the profile. The measured response time skyrocketed, and the problem was not in the new code.
The problem: in the real operating environment, each user would be on their own machine, with its own CPU, memory, hard drive, and network connection. The effect of 70 processes executing identical code at the same time while sharing those four resources was warping the stress test. And the hardware wasn’t built for large-scale processing; these were plain, regular-use workstations.
The new problem was how to distribute the remaining 69 processes in such a way as not to distort the measurements. Gaining access to a test machine in each of the 100 buildings was not going to happen, nor would it have been manageable.
The big flash of insight came when I theorized about the layout of The Company’s WAN. A logical person would give each of the 100 buildings its own router, and traffic in building A would not be routed through building B, thus not affecting building B’s router. The traffic would build up on the server side only.
(WAN, LAN…oh man…)
After confirming this theory, I came up with my final home-made stress testing method:
- Keep the console script, but spawn only 50 copies, and on a single machine. There were, incidentally, around 50 users in my home building, so this would simulate 50 users hitting the server from our building. I dropped the timing code and used the working copy of the system.
- For the remaining 210 users, simulate the traffic on the server itself in order to tie up server resources. This was the tricky part. One of the constraints of The Project was that I had no privilege whatsoever to install any executable on our server. I had a SQL Server, and that was it.
So I had no choice but to write a script in Transact-SQL. Not only that, but the operational profile had to be translated into stored-procedure calls. It was ugly, but there was no other way.
- Finally, modify the UI to display the measured response time on screen. This consisted of simpler timing code and a text box on each screen showing the elapsed time between a click and the completion of the operation. These values were used to calculate the response time.
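The ugliest piece of the second step, translating the operational profile into stored-procedure calls, has a simple shape. The real script was hand-written T-SQL; this Python sketch (with invented procedure names and call counts) only illustrates the expansion:

```python
# Hypothetical operational profile: (stored procedure, calls per simulated user).
PROFILE = [
    ("usp_OpenWorkOrder", 5),
    ("usp_SaveWorkOrder", 3),
    ("usp_RunReport", 1),
]

def profile_to_tsql(profile, users):
    """Expand an operational profile into a T-SQL batch of EXEC
    statements, one pass through the profile per simulated user.
    The real script was written by hand; this shows the mapping."""
    lines = []
    for _ in range(users):
        for proc, calls in profile:
            lines.extend(f"EXEC {proc};" for _ in range(calls))
    return "\n".join(lines)
```

In practice the hand-written T-SQL also had to loop and fabricate plausible parameters, which is exactly where the mess the last paragraph warns about comes from.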
To put it all in action: fire up the server-side script, the local 50-process script, and a copy of the system on my own machine. Run through the use cases and record the response times displayed on the screen.
This new method successfully modeled the intended operating environment. It also uncovered some silly performance bugs, like queries that used the LIKE operator to match on a primary key. Hey, I never claimed to be a super uber guru, and I probably never will. It also uncovered the performance issue associated with sorting colored DataGrid rows.
It can be done, and it can be done cheaply. But, to be honest, the scalability of a solution like this would be directly related to how messy the server-side T-SQL script became. And I have a feeling that this is limited, at best.