…so your plain ordinary webserver just got listed on a high-traffic news site. Slashdot? Reddit? Hacker News? Well done, turns out you’re hosting something thousands of people want to read. Now thousands of people want to come to your webserver at once.
But you’ve got a problem. Your server’s hosed and none of those potential readers can see it. Worse, your story will only be top of Slashdot/Reddit/whatever for a few hours, which means you’ve got to fix the problem right now, before the world loses interest. You haven’t time to buy new hardware, set up a complex load-balancing solution or rewrite half your app’s codebase. What are you going to do?
Follow Uncle Alex’s checklist and we’ll soon have you back on the air.
This is written on the assumption that you’re running a standard, out-of-the-box setup on RedHat, Ubuntu, CentOS or some other common Linux distro with SYSV init scripts. It also presumes you have Prefork MPM (the default), MySQL as your database and no other major apps running on there. If you’re already running some funky config then this tutorial isn’t for you.
NB: Uncle Alex makes no warranties about his checklist and accepts no liability if it destroys your server, fails to solve the problem, runs over your wife or sleeps with your dog. Follow these instructions at your own risk. But you’re desperate, right?
1) Gain Access
If your server is really hosed you will find it hard to SSH in. If you have console access (e.g. on a virtual hosting package, HP’s excellent iLO or you’re stood in front of the machine) login through that – you’ll save a minute waiting to negotiate an SSH session into your overloaded machine.
Try to run ssh inside a terminal with a scrollback buffer (PuTTY, the Mac Terminal, gnome-terminal, whatever) because your server may be so laggy it’s quicker to hunt for the output of previous commands than to run them again. If you’re running SSH from the command line use the ‘-v’ flag so you can see how the connection progresses.
It might take a while…
mock@hostname:~$ ssh -v firstname.lastname@example.org
OpenSSH_5.2p1, OpenSSL 0.9.8l 5 Nov 2009
debug1: Reading configuration data /etc/ssh_config
debug1: Connecting to my.slashdotted.server [x.x.x.x] port 22.
debug1: Connection established.
....
....
debug1: channel 0: new [client-session]
debug1: Requesting email@example.com
debug1: Entering interactive session.
....
....
You have mail.
Last login: Sat Nov 6 02:13:31 2010 from your.desktop.computer
firstname.lastname@example.org:~$
Be patient. Resist the temptation to open multiple sessions when the first doesn’t connect immediately: you’ll only load the machine even more. For each connection the server needs to negotiate a secure pipeline, do authentication for your user and start a process for your shell. On a healthy Linux box this is usually instantaneous – but right now yours ain’t healthy.
The waiting is the hardest part.
Best practice be damned – become root straight away. Yeeee-haw cowboy.
email@example.com:~$ sudo -s # or 'su -' if you're in 1990
2) Now to Work. Diagnosis…
You’re on a heavily loaded machine. Commands will take forever to complete so you’ve got to make each one count.
First we want to diagnose the problem. Since interaction with a hosed server is painfully slow we need to do this in as few steps as possible. But we don’t want to kill Apache and free up resources before we’ve examined the system’s state to figure out what’s wrong…
Bandwidth or server load?
A slashdotting can kill your server in one of two ways: saturating its internet connection with the data you’re serving (rare these days, ISPs have a lot of bandwidth) or inundating your host with many more connections than it can handle. We want to figure out which it is as quickly as possible.
firstname.lastname@example.org:~$ time cat /proc/loadavg
45.25 31.26 22.73 1/135 27844

real    0m3.002s
user    0m2.002s
sys     0m1.000s
Count in your head how long this really takes to return output. One elephant, two elephant, three elephant…
The `cat /proc/loadavg` part is obvious: you want an idea of the system’s load. But why `time`? Because it tells us how long it took your host to load the `cat` program into memory, run it and print out the contents of /proc/loadavg. Perhaps it took a long time because the system’s loaded, or perhaps it ran quickly and the result took forever to reach you through a congested network. Best of all, if your shell is bash then `time` is a builtin – so we know it imposed almost no overhead of its own.
So does the machine’s timing for ‘real’ match what you counted in your head?
- If `cat /proc/loadavg` ran quickly but the load is high your network connection is likely okay. But the load is huge – what do you do? Read on…
- If the load is low (say <10) but it took forever for that output to reach you, your server’s network link is probably saturated. Your server is healthy; it just has a bandwidth issue. Skip ahead to section 4.
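Incidentally, if even forking `cat` on a thrashing box feels extravagant, bash can read the file with builtins alone – no new process at all:

```shell
# Read the load averages using only shell builtins -- no fork, no exec.
read -r load < /proc/loadavg
echo "$load"
```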
3) Getting the Load Down (or “It’s almost always the RAM”)
Look at the first three figures in /proc/loadavg. On a healthy Unix server the load averages ought to be no higher than your number of CPUs. A little higher (maybe 2x) is bearable but really high numbers mean you’ve got a system load problem. The extreme figures in our example – “45.25 31.26 22.73” – would triangulate your position as “up shit creek”.
In case you were thinking “hey, how many CPUs have I got?” here’s a simple way to find out:
email@example.com:~$ grep ^processor /proc/cpuinfo
processor : 0
processor : 1
You’ll see a ‘processor:’ line for each CPU. The server in this example has two cores.
For a better understanding of load averages read the excellent page at LinuxInsight. In short – if the first number is higher than the last your problem is getting worse.
How many apache processes?
One dead giveaway for system load issues is the number of Apache processes. That’s not to say Apache itself is the problem – but if you have hundreds of them backing up on a small server it means something’s wrong…
# on debian/ubuntu
firstname.lastname@example.org:~$ ps ax | grep -c /usr/sbin/apache2
151

# on redhat/centos/suse etc.
email@example.com:~$ ps ax | grep -c /usr/sbin/httpd
151
151? That’s a lot of processes. Why, if you’d maxed out Apache’s default ‘MaxClients’ setting then added one more for the `grep` it’d be 151. I wonder if that’s a coincidence…
If you’re running a modern webserver – one where Apache has modules enabled for PHP, Perl, Python or some other language interpreter – each of those processes is going to be big. So big that together they eat way more RAM than your computer has. For example on my little 32-bit Ubuntu 10.04 VM each Apache process eats 27Mb of physical RAM. And that’s with mod_php alone; add Python or Perl as well and it gets bigger. If you live in 64-bit land even more so.
On a machine with limited memory a 32-bit OS is almost always better.
Do you have enough RAM to run 150 processes that each need 27Mb? Apache will eat 4Gb before you even begin to think about OS services or MySQL. Unless you have tons of RAM you can’t afford that many.
[No, I don’t know why the distros all ship with a default so high. Naughty Apache, why can’t you autodetect it and set a sensible value at start time?]
So let’s figure out how much memory each of your Apache processes is using:
firstname.lastname@example.org:~$ top -u www-data -b -n 1
top - 21:21:32 up 3 days, 26 min,  2 users,  load average: 66.6 66.6 66.6
Tasks: 301 total,   1 running, 300 sleeping,   0 stopped,   0 zombie
Cpu(s): ...
Mem:    408944k total,   310848k used,    98096k free,    12172k buffers
Swap:  1048568k total,   574372k used,   474196k free,    93884k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
16228 www-data  20   0 64072  26m 4996 S  0.0  6.7   0:02.49 apache2
16275 www-data  20   0 62564  25m 3672 S  0.0  6.4   0:02.66 apache2
...
...
On a RedHat-derived machine you’ll want to use “-u apache” – they do things differently over there.
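If you’d rather have the average computed for you, a `ps`/`awk` one-liner does the arithmetic. A sketch – “apache2” is the Debian/Ubuntu binary name; swap in “httpd” on a RedHat-derived machine:

```shell
# Average resident memory per Apache worker, in kilobytes.
# "apache2" is Debian/Ubuntu's name for the binary; RedHat-family uses "httpd".
ps -C apache2 -o rss= | awk '
  { total += $1; n++ }
  END { if (n) printf "%d workers, %d KB average RSS\n", n, total / n }'
```

Divide your available memory by that per-worker figure and you have a ceiling for how many workers you can really afford.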
Woah buddy, look at that “Swap: 1048568k total, 574372k used” line. That means over half a gigabyte of memory has been swapped out to disk – more than this machine’s physical RAM. That ain’t good; your computer’s been reduced to the speed of that little mechanical head skittering around inside your hard drive. The Linux kernel does its best to swap out the less-used pages of RAM first, but with the kind of overload you’re seeing half your code is executing off of disk. Mr Babbage is on the phone and he wants his mechanical computer back…
Now you know enough about the problem to stop Apache. Kill it dead with:
# in RedHat/Suse/Centos:
email@example.com:~$ /etc/init.d/httpd stop

# in Debian/Ubuntu
firstname.lastname@example.org:~$ /etc/init.d/apache2 stop
It might take a while for all those processes to terminate. Apache will try to finish any requests it was servicing but if you have 150+ processes all fighting for memory and CPU they’ll take a minute or two to come back from swap and terminate.
All stopped? Good. Already the load is starting to fall and your machine is much more responsive to terminal commands. You’re off the air so let’s keep it short and sweet….
Open up Apache’s main config file in your favourite editor. In RedHat land it’s /etc/httpd/conf/httpd.conf; in Debian/Ubuntu you’re looking for /etc/apache2/apache2.conf.
If you’re running with the standard config that came with your machine Apache is likely using the Prefork MPM. This means it runs a bunch of processes waiting for incoming connections and will always try to keep a couple of spare for the next client. If you have enough memory that’s great; otherwise it results in what’s called “swapping yourself to death”.
Look for the MPM settings. It’ll be a section of config something like:
<IfModule mpm_prefork_module>
    StartServers          5
    MinSpareServers       5
    MaxSpareServers      10
    MaxClients          150
    MaxRequestsPerChild   0
</IfModule>
Most of these settings are sane. But look at MaxClients: it’s the upper limit on the number of processes Apache will run when the whole world is trying to connect. Reduce it to a level commensurate with your machine’s amount of RAM.
As a rule of thumb:
- For 512Mb of RAM MaxClients should be no higher than 10
- For 1Gb, 20
- For 4Gb, 75
That default of 150 is way too generous for a small server. It comes from a more innocent time before everyone had mod_php (or python or perl or whatever) loaded into Apache and processes were a lot smaller.
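That rule of thumb is really just arithmetic: physical RAM, minus headroom for the OS and database, divided by the size of one worker. A sketch – the 27Mb worker size and 256Mb headroom here are assumptions; measure your own values with `top` first:

```shell
# Back-of-envelope MaxClients: RAM minus headroom for the OS and MySQL,
# divided by per-worker size. worker_mb and headroom_mb are assumptions --
# substitute the figures you measured on your own server.
total_mb=$(awk '/^MemTotal/ { print int($2 / 1024) }' /proc/meminfo)
worker_mb=27
headroom_mb=256
echo "Suggested MaxClients: $(( (total_mb - headroom_mb) / worker_mb ))"
```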
KeepAlives considered harmful
So now you have Apache configured to run a number of processes roughly in line with what your hardware supports. While you’re tinkering with it there are a couple more useful changes you can make:
- Search for the “Timeout” setting. Usually it’s way too high – something like 300 seconds. This means that if a client browser’s connection to you hangs (like if their DSL cut out or their PC crashed) the Apache process you had serving them will hang around for 300 seconds in case they come back again. Screw ’em, drop that value right down. 15 seconds is enough.
- Search for the “KeepAliveTimeout” setting. I could write a whole essay on the way modern browsers are greedy with HTTP KeepAlives. If you can only afford a handful of Apache processes do you really want each tied up by an idle client waiting for their user to click the next link? Until your load spike subsides set it to something like how long it takes to deliver your pages to the average user. 3 seconds, not 30. HTTP KeepAlives deserve an essay all of their own – maybe one day I’ll sit down and write it. [I did]
- Depending on the kind of content you’re serving you might even switch KeepAlive to “Off”. If you’re only serving one or two objects in the average request (i.e. the page getting hammered does not contain much CSS or images) KeepAlive costs you more under high load. Revisit this if you shift content off to a CDN.
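Pulled together, the changes from this section might look like this in the config file (the values are the suggestions above, not Apache’s defaults):

```apache
Timeout 15
KeepAlive On
KeepAliveTimeout 3

<IfModule mpm_prefork_module>
    StartServers          5
    MinSpareServers       5
    MaxSpareServers      10
    MaxClients           20
    MaxRequestsPerChild   0
</IfModule>
```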
For the time being resist the temptation to mess with your php.ini or any other interpreter-level settings. They are complex and quick to anger. You want the low-hanging fruit.
By now you should have a sane Apache config, one at least vaguely suited to your hardware. Start the service up again…
# Ubuntu/Debian
email@example.com:~$ /etc/init.d/apache2 start

# RedHat/CentOS/Suse
firstname.lastname@example.org:~$ /etc/init.d/httpd start
If you made a mistake in the config Apache should tell you here (you can also check in advance with `apache2ctl configtest`, or `apachectl configtest` on RedHat). But probably you’re fine and Apache will go on its merry way.
So now you have a webserver where:
- Fewer clients can be connected at once, which means fewer of those heavyweight Apache processes will be competing for memory at any one time. Those processes which do exist now run a lot faster.
- Clients are not permitted to stay on the line for ages after you’ve delivered their web page. If KeepAlive is still on they get cut off after KeepAliveTimeout seconds; if you disabled it entirely they get one object per connection and the process is immediately free to serve another user.
Run `top` and keep an eye on those load figures. Are they going up or down? Is your website responsive again?
What about the database?
Of course, there might be some underlying reason why your Apache processes were backing up like the holding pattern above Heathrow. Maybe it was browsers holding open persistent connections forever (which you just stopped) or maybe it’s something else. Maybe your code running inside all those Apache processes is waiting to use some shared resource on the system, something that’s dog-slow and causes each process to pause until it’s available.
People have described optimising MySQL far better than I can in this article, but here are a few tips:
- Allocate an amount of RAM (mainly key_buffer_size and innodb_buffer_pool_size in my.cnf) appropriate for your server. The defaults are measly; allowing MySQL to hold more data in memory can be a big win.
- Workloads vary so there are no hard-and-fast rules
- Use `show processlist` to look for running queries
- The query cache rarely helps – writes to a table invalidate it
- Use the slow query log to identify queries taking a long time to run
- Slow queries can often be sped up by adding an index to the appropriate columns
- Use the smallest data type available for each column. Unsigned integers where appropriate. The smaller you make your table the more of it will fit in RAM.
I could go on.
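For reference, the memory and logging settings mentioned above live in my.cnf under the [mysqld] section. A sketch for a small server – the numbers are illustrative, not recommendations:

```ini
[mysqld]
# MyISAM index cache
key_buffer_size         = 128M
# InnoDB data and index cache -- the big one if your tables are InnoDB
innodb_buffer_pool_size = 512M
# Log statements slower than two seconds for later analysis
# (older MySQL versions call this log_slow_queries)
slow_query_log          = 1
slow_query_log_file     = /var/log/mysql/slow.log
long_query_time         = 2
```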
The single biggest thing you can do to solve database issues is hit the database less. See section 5 for suggestions on caching with common web applications.
4) Reducing your Bandwidth
So it turns out your server is connected to the Internet via a piece of wet string and you’re trying to push way more data than it can handle. Did your ISP tell you it was on an “unlimited” 100Mbit connection? Did they tell you how many users that 100Mbit is shared between? Oops.
Or maybe you’re just terrified of the bandwidth bill. Either way, while enduring a traffic spike it’s a good idea to reduce your network traffic as far as possible. The benefits are not solely financial:
- Delivering less data to each client means they need be connected to your host for a shorter time. Think how long it takes to deliver a 300k jpeg to the average DSL user – a couple of seconds at least, time your valuable Apache process could be using to run the code that generates your dynamic content, the only bit that really changes from user to user.
- You want to deliver as few objects as possible per page impression. The more you can shift to someone else’s server (or even eliminate entirely!) the better. Get it down to the magic number of 1 and you can turn off KeepAlives and free up a lot of Apache processes.
How to do this? Simple answer: move static assets out to a CDN.
“But that’ll take ages!” you wail. “I haven’t time to do a deal with Akamai or open an Amazon CloudFront account!”
Don’t worry, there are some very simple CDNs out there. Coral Cache is here to save the day. For free these guys will cache static objects from your server and relay them to your visitors. At times of low load Coral Cache slows down your site (objects fall out of the cache and Coral must re-retrieve them for every hit) but when you’re sustaining a dozen hits per second that isn’t a problem. Just remember to switch it off (or get an account with a proper CDN) once the panic is over.
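Using it is almost embarrassingly easy: Coral works by appending “.nyud.net” to your hostname, so pointing your image tags at the Coralized URL is often a one-line edit. A sketch – the domain and the filename “header.php” are placeholders for whichever template emits your <img> tags:

```shell
# Rewrite image references to their Coralized equivalents in a page template.
# The domain and "header.php" are placeholders -- substitute your own.
sed -i.bak 's|src="http://www.example.com/|src="http://www.example.com.nyud.net/|g' header.php
```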
Big stuff first
Firefox will help you figure out what the largest objects within a page are…
- Navigate to your page
- Right-click for the context menu
- Click “view page info”
- Move to the “Media” tab. You’ll see a table of objects loaded as parts of the page.
- At the top-right hand corner you’ll see a button for selecting what columns the table should show. Ensure “size” is checked.
- Look at the sizes of objects included in your page. Starting with the largest items ask yourself “does this really need to be generated by my server for each hit?”
Of course, if your server’s really hosed loading the page will be hard. You’ll just have to guess what the largest assets were. Unless you’ve some big embedded media (videos, audio) it’s almost certainly the images.
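You don’t have to guess completely blind, though. From any machine that can still reach the site, a header-only request reports an object’s size without downloading it (the URL here is a placeholder):

```shell
# Fetch headers only (-I) and read the Content-Length. URL is a placeholder.
curl -sI http://www.example.com/images/big-photo.jpg | grep -i '^content-length'
```

Repeat for each suspect object, biggest first.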
Hope the nerds don’t mind me using their page about steam engine design as an example. Maybe steam engine design will come back into vogue and they’ll get slashdotted.
On their page the best candidates for moving off the server are immediately obvious: those jpeg images add 131k per page view. At fifty hits a second that’s going to chew 6.4 megabytes of bandwidth every second. Yowch. I don’t wanna be the guy who pays your hosting bills!
Related: this is why big sites (eBay, Facebook etc) use a separate webserver for big static assets like images. Delivering them takes much longer than the page’s dynamic content and can be done much more efficiently using simpler types of webserver.
5) Advanced: Caching
Caching at the app level
Bad news: in their default state most common web applications (blogs, CMSes, forums and the like) are rarely efficient.
Good news: for most of the common web applications someone else has already noticed the problem and written a plugin to fix it. See if you can find a caching plugin that’ll store generated content and regurgitate it for future hits. I did this for my WordPress installation and won at least 10x extra performance for a few seconds work.
I won’t attempt to list the caching plugins for every big web app – a quick search for your application’s name plus “caching plugin” will turn up the popular options.
Caching at the PHP level
Is the app you’re serving written in PHP? There’s an easy win to be had by installing one of the opcode caches. I favour xcache but it’s a matter of personal choice.
# Ubuntu/Debian
email@example.com:~$ apt-get install php5-xcache
firstname.lastname@example.org:~$ service apache2 restart

# RedHat/CentOS/Suse
email@example.com:~$ yum install php-xcache
firstname.lastname@example.org:~$ /etc/init.d/httpd restart
Tweak the settings if you like but in both cases the cache modules should give a big improvement straight out of the box.
What does it do? Two things:
- By default for every hit to a dynamic page PHP will be loading the code from disk, interpreting it and then executing it. A code cache skips the first two steps – once the code has been loaded and interpreted it’ll be stored in memory for later use. Unless you’ve changed your code (by default the module calls stat() on the file to check) subsequent hits should mean a lot less work for PHP.
- Provides a special area of memory shared between all processes serving your data. If written to do so (look for a plugin) this can be used to cache generated page content and other commonly loaded data.
If you’re running a stock install of any OS on minimal hardware as your webserver it’s common to run into problems under load. There are plenty of simple, tried-and-tested techniques for reducing that load – with a few minutes and a little knowledge of your server’s internals you can squeeze an awful lot more performance out of a simple machine before needing to buy more hardware.