We run a large installation (100k+ users, 250-900 concurrent at any point in time, usually ~550): 3 load-balanced web servers (virtual) and 2 databases in a master/slave pair (physical). memcached runs on all 3 web servers for application caching, and on the database server for session caching.
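For context, here is roughly how we spot-check the memcached instances; this is a minimal sketch, and the hostnames web1-web3 and db1 are placeholders for our real machines:

# Quick health check of each memcached instance. "stats" is a
# standard memcached text-protocol command; -w 2 gives nc a
# 2-second timeout so a wedged instance doesn't hang the loop.
for h in web1 web2 web3 db1; do
    echo "== $h =="
    printf 'stats\nquit\n' | nc -w 2 "$h" 11211 | \
        grep -E 'curr_connections|evictions|listen_disabled_num'
done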
We had been running for months without issue and even got through our two heaviest usage periods of the year without breaking a sweat. Two days ago, however, all 3 web servers reported that they could not connect to the database, while the database itself was fine (thankfully). The stats on each web server showed load up around 200 (from the usual ~4) and the process count up around 800 (from the usual ~200).
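One thing we plan to check next time is whether the database's connection ceiling was hit, since "can't connect but the DB looks fine" often means refused connections rather than a dead server. A minimal sketch, assuming MySQL (implied by the master/slave setup):

# Compare peak connection usage against the configured ceiling;
# if Max_used_connections has reached max_connections, the web
# servers were being refused, not the database going down.
mysql -e "SHOW VARIABLES LIKE 'max_connections';
          SHOW GLOBAL STATUS LIKE 'Max_used_connections';
          SHOW GLOBAL STATUS LIKE 'Threads_connected';"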
We took a dump of all Apache logs and /var/log/*.log, but did not think to capture the process list or the dmesg output.
Investigating the logs got us nowhere. We found it very odd for all 3 web servers to become overloaded simultaneously, and we confirmed it was not a DoS attack.
We've prepared a script (below) to dump everything we can think of should this happen again. Does anyone have suggestions for anything else to dump or check, or ideas about what might have caused all 3 web servers to stop processing?
#!/bin/sh
# Snapshot system state and logs into a timestamped archive.
STAMP=$(date +%Y%m%d_%H%M%S)
DUMP="$HOME/error_dumps_$STAMP"
mkdir -p "$DUMP/apache"

dmesg > "$DUMP/dmesg.log"
ps -elf > "$DUMP/ps_elf.log"     # full long listing (ps -all misses daemons without a tty)
ps aux > "$DUMP/ps_aux.log"
vmstat 1 5 > "$DUMP/vmstat.log"  # 5 one-second samples; a bare vmstat only shows averages since boot

cp /var/log/debug "$DUMP/"
cp /var/log/*.log "$DUMP/"
cp /var/log/apache2/*.log "$DUMP/apache/"
cp /var/log/messages "$DUMP/messages.log"
cp /var/log/syslog "$DUMP/syslog.log"
cp /var/log/udev "$DUMP/udev.log"

# -C keeps relative paths in the archive instead of absolute ones
tar -zcf "$HOME/crash_logs_$STAMP.tar.gz" -C "$HOME" "error_dumps_$STAMP"
rm -rf "$DUMP"
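To make sure the dump actually gets captured next time rather than relying on someone being at a keyboard, we're also considering a small cron-driven watchdog along these lines (the threshold and the install path of the script above are placeholders):

#!/bin/sh
# Run from cron every minute; fire the dump script when the
# 1-minute load average crosses a threshold (50 is arbitrary,
# well above our normal load of ~4).
THRESHOLD=50
LOAD=$(awk '{print int($1)}' /proc/loadavg)
if [ "$LOAD" -ge "$THRESHOLD" ]; then
    /usr/local/bin/crash_dump.sh   # hypothetical path to the dump script above
fi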