Brute-force restarting of services on a machine under heavy load in response to the Slashdot effect

Posted in doh , howto

The load average is calculated as an exponential moving average of the number of processes that are in the running or runnable state (on Linux, processes in uninterruptible sleep, typically waiting on disk I/O, are counted as well). The three load average numbers you see in “uptime”, “top”, and other utilities are the one-, five-, and fifteen-minute moving averages of the system load.

# uptime
06:58:34 up 232 days, 27 min, 1 user, load average: 0.25, 0.24, 0.31
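
To make the "exponential moving average" idea concrete, here is a rough sketch of a single update step. The function name ema_step and the 5-second sampling interval are assumptions made for illustration, not something taken from this system; the point is only that each new sample nudges the average toward the current number of runnable processes.

```shell
#!/bin/sh
# ema_step PREV ACTIVE PERIOD_SECONDS
# One update of an exponential moving average.
#   PREV            previous value of the average
#   ACTIVE          number of running/runnable processes in this sample
#   PERIOD_SECONDS  averaging window (60, 300 or 900 for 1, 5, 15 minutes)
# Assumption: one sample every 5 seconds (hypothetical, for illustration).
ema_step() {
    awk -v prev="$1" -v n="$2" -v period="$3" 'BEGIN {
        decay = exp(-5 / period)                       # weight kept from the old average
        printf "%.4f\n", prev * decay + n * (1 - decay)
    }'
}
```

Starting from an idle system, `ema_step 0 1 60` moves the one-minute average only a small step toward 1, which is why the numbers lag behind sudden spikes and the fifteen-minute figure lags the most.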


So, for a single-core, single-processor machine, a load average of 1 means that on average there’s one process in the running or runnable state at all times, which means the CPU is busy roughly 100% of the time. If another process wants to run, it has to wait in the queue before being executed. For a multi-core or multi-processor system, you’re not CPU bound until the load average equals the total number of cores. You can estimate how saturated a system’s CPUs are by dividing the load average by the number of processors listed in /proc/cpuinfo:

# grep -c '^processor' /proc/cpuinfo
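
The division described above can be sketched as a small helper. The name load_per_core is made up for this example, and the usage lines assume a Linux host where /proc/loadavg and /proc/cpuinfo exist.

```shell
#!/bin/sh
# load_per_core LOAD CORES
# Prints the load average divided by the core count, to two decimals.
# A result near 1.00 suggests the CPUs are saturated.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if (cores < 1) cores = 1        # guard against dividing by zero
        printf "%.2f\n", load / cores
    }'
}

# Usage on a Linux host (shown as comments, not executed here):
#   cores=$(grep -c '^processor' /proc/cpuinfo)
#   load=$(cut -d' ' -f1 /proc/loadavg)
#   load_per_core "$load" "$cores"
```

For example, a one-minute load of 2.0 on a four-core machine gives 0.50, so about half of the CPU capacity is in use.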

Here’s a simple cron script I run to stop a PostgreSQL database if the load gets too heavy. I call it berserker.cron. It restarts the database once the load comes back down. In practice I rarely see anything between light and heavy load, but it is possible for the load to hover in the moderate range (around 3), in which case the script restarts a database that is already running. That worked out fine in this situation.

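#!/bin/sh
# Integer part of the one-minute load average. Parsing after the
# "load average:" label avoids depending on a fixed field position,
# which shifts with the uptime format (e.g. "27 min" vs "2:27").
LOAD=$(uptime | sed 's/.*load average: //' | cut -d. -f1)
# The leading "echo" makes this a dry run; remove it to act for real.
STOP="echo /etc/init.d/postgresql stop"
RESTART="echo /etc/init.d/postgresql restart"

if [ "$LOAD" -gt 3 ]; then
    # one-minute load of 4.0 or more
    echo "Load: heavy"
    $STOP
elif [ "$LOAD" -gt 2 ]; then
    # one-minute load between 3.0 and 3.99
    echo "Load: moderate"
    $RESTART
fi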

Please keep in mind this is a simple brute-force approach. Any app that’s accessing the database will start throwing connection errors while it is stopped, but berserker.cron just doesn’t care.
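
If restarting a database that is already running ever becomes a problem, one possible variation is to decide based on both the load and whether the database is up. This is only a sketch, not what berserker.cron does: the function name decide_action and the choice of thresholds (stop above 3, start again at 2 or below) are assumptions, and how you find out whether the database is running depends on your init script.

```shell
#!/bin/sh
# decide_action LOAD_INT RUNNING
#   LOAD_INT  integer part of the one-minute load average
#   RUNNING   "yes" if the database is currently up, anything else if not
# Prints one of: stop, start, none
decide_action() {
    load=$1
    running=$2
    if [ "$load" -gt 3 ] && [ "$running" = "yes" ]; then
        echo stop        # heavy load and the database is up: shut it down
    elif [ "$load" -le 2 ] && [ "$running" != "yes" ]; then
        echo start       # load has dropped and the database is down: bring it back
    else
        echo none        # in between, or already in the desired state
    fi
}

# Possible wiring (comments only; assumes the init script supports "status"):
#   if /etc/init.d/postgresql status >/dev/null 2>&1; then up=yes; else up=no; fi
#   case $(decide_action "$LOAD" "$up") in
#       stop)  /etc/init.d/postgresql stop ;;
#       start) /etc/init.d/postgresql start ;;
#   esac
```

The gap between the two thresholds means a load hovering around 3 leaves the database in whatever state it is already in, rather than bouncing it on every cron run.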

Posted by admica   @   30 October 2009
