
jmagaro88

EC2 Server Temporarily at 100% CPU?

I just took a look in my AWS interface, and I noticed that my server has occasionally spiked to 100% CPU usage. Here is what my CPU usage graph looks like over the past 3 days:

https://postimg.org/image/mrm3g56o1/

I just launched this new Laravel 5.2 site and have been consistently driving traffic to it from Google AdWords. But from my Google Analytics data, it doesn't look like there was any traffic spike that would explain CPU utilization like that. Here's what my site usage looked like over that same period:

https://postimg.org/image/jn6v7gyuf/

As you can see, my traffic was higher yesterday than it was today, but CPU utilization was next to nothing.

In addition, most of what I find when Googling about 100% CPU utilization says the server becomes unresponsive in that state. However, I was checking my site periodically for orders during the most recent period on the right side of the graph, when the server was at 100% CPU utilization for a few hours, and it was working totally fine.

Can anyone shed some light on what might have caused this? Does it seem suspicious that the CPU usage spiked so suddenly to 100% and then dropped just as suddenly without any traffic change? I used to use Rackspace Managed Cloud, but now I'm managing my own server, so this is all very new to me.

willvincent

Could be heavy DB queries, could be file indexes rebuilding... it could be a number of things. Without knowing every specific detail about what's installed and running on your server, this is impossible to answer.

BUT here's how you can find out for yourself: next time it does this, get on the box and run

sudo watch "ps aux | sort -nrk 3,3 | head -n 5"

to monitor the top 5 (number dictated by that last param) processes utilizing CPU.

Output will look something like this:

root      5438  0.1  0.1 100448  8952 ?        Sl   Jun16  43:38 /usr/bin/python /usr/bin/fail2ban-server -b -s /var/run/fail2ban/fail2ban.sock -p /var/run/fail2ban/fail2ban.pid
www-data  5414  0.0  0.0   7856  4828 ?        S    Jun16   2:20 nginx: worker process
www-data  5413  0.0  0.0   7852  4980 ?        S    Jun16   2:09 nginx: worker process
www-data  5412  0.0  0.0   7852  4836 ?        S    Jun16   2:21 nginx: worker process
www-data  5410  0.0  0.0   7852  4972 ?        S    Jun16   2:07 nginx: worker process

That will auto-refresh every 2 seconds until you hit Ctrl-C to exit the watch process. CPU utilization percentage is the third column.
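If you'd rather not sit there watching it, here's a rough sketch of the same idea as a cron job so you can look back at a spike after the fact (the log path is just an example):

# Log the header plus the top 5 CPU consumers once a minute (add via root's crontab -e)
* * * * * (date; ps aux --sort=-%cpu | head -n 6; echo) >> /var/log/cpu-top.log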

jmagaro88

Thanks for the insight. I just set up a CloudWatch alarm on Amazon to notify me the next time CPU utilization goes over 50%. I have htop installed, so I could have hopped on and seen what was happening had I known about the surge.
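(For anyone else setting that up, here's a rough AWS CLI sketch of such an alarm - the instance ID and SNS topic ARN below are placeholders:)

aws cloudwatch put-metric-alarm \
  --alarm-name cpu-above-50 \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 50 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:my-alerts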

I have a t2.small instance with 1 vCPU. If this kind of thing keeps happening, would it be better to upgrade the instance or to add a load balancer and an additional instance?

Also, is it unusual that utilization was at 100% but the site was still totally functional? Most of my googling makes it seem like 100% utilization for any sustained period knocks sites offline.

willvincent

At the end of the day though, if it's not affecting performance it probably doesn't really matter. For fun I just used stress to peg each of the four cores on my server to 100% and noticed no effect on my website whatsoever.

The modern Linux kernel is generally pretty good at handling multiple high-CPU-load processes at the same time, so unless you start experiencing performance issues I wouldn't really worry about it.
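If you want to reproduce that test yourself, stress takes roughly this form (assuming the stress package is installed, e.g. via apt-get install stress):

# Spin up 4 CPU-bound workers for 60 seconds, then check the site and the
# ps/watch output above to see whether anything actually slows down
stress --cpu 4 --timeout 60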

willvincent

> Most of my googling makes it seem like 100% utilization for any sustained period knocks sites offline.

Depends on the cause of the CPU spike, and how you have things set up. In my setup I have everything behind Varnish, so most end-user-facing pages are served straight from memory and there's virtually no CPU cost to serve a page of my site at all. But even so, unless it's part of your web stack causing the high utilization, it's probably not going to have a huge effect on it... unless the process(es) spiking the CPU are also set to not be very nice (i.e. running at a high scheduling priority).
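For example, if a cron job or queue worker is the culprit, you can start it at the lowest scheduling priority, or renice it after the fact (the job path and PID here are purely illustrative):

nice -n 19 ionice -c3 /usr/local/bin/heavy-report-job   # start at lowest CPU and I/O priority
sudo renice -n 19 -p 5438                               # or demote a process that's already running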

jmagaro88

As a follow-up, this 100% utilization kept happening for a couple more days. So I signed up for the AWS business support plan and had them look into what was happening. It turns out that during busy hours, my app was getting enough traffic to eat away at the CPU credits to the point that they were depleted by 6 PM almost every night.

Apparently, any Amazon EC2 instance in the t2 family has a set CPU allowance, and if you exceed it, your processing gets throttled back to a baseline level. This was dramatically affecting the speed of my app when it happened.

So I upgraded to an m4.large instance, as the m4 EC2 instances aren't subject to the CPU credit limitations. Now my app is blazing!
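(For a rough sense of the numbers, assuming the published t2.small figures of roughly 12 credits earned per hour and a maximum bank of about 288, where one credit is one minute of a full vCPU:)

# At a sustained 100%, a single vCPU burns 60 credits/hour while earning 12,
# so a full balance drains in roughly 288 / (60 - 12) = 6 hours
echo $(( 288 / (60 - 12) ))   # prints 6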

gregrobson

@jmagaro88 - Looking at your usage pattern, were you able to confirm that you were getting at least 400% more users during the time when CPU hit 100%? I would be wary if the CPU usage is not growing in line with active users. You might have one or two users hitting a few specific queries that perform badly.

On a related note, I had an issue yesterday with a fresh Ubuntu 16.04 installation on a t2.micro instance. The **kswapd0** process was maxing out at 100% CPU and burned all the CPU credits. The workaround mentioned in the link below appears to have fixed it.

http://askubuntu.com/questions/761885/kswapd0-use-100-cpu

As you mention, the t2 instances run on a concept of "credits": you accrue them over time, and during spikes you can spend them to use more CPU for a brief period. My t2.micro instance fell back to 10% CPU usage once the credits had been exhausted.

AWS largely design these EC2 instances for occasional bursts (e.g. a mail queue processor that might be busy at 9am and 5pm sending daily reports to several hundred users, but otherwise runs with a minimal load of less than 50 emails/hour).

jmagaro88

@gregrobson Yeah, I was definitely getting a lot of traffic. In the morning before 12 pm, I was averaging about 150 users per hour. But when the issue started to occur, I was up to about 500 users per hour. I went over the Analytics session graph with the AWS guy, and we matched the traffic spikes to the CPU usage spikes pretty closely.

That being said, I don't quite understand how CPU usage can be at 2% or less when the traffic was low, and then all of a sudden jump to 100% when the traffic increased by 3 to 4 times. I would think I would need a much bigger increase in traffic to experience that kind of CPU usage.

After my initial posts, I had set an alarm to notify me once I got over 50% CPU usage. On Monday, I got a notification and popped right into htop, but I didn't see any processes that were out of the ordinary. Nothing seemed to be consistently taking up a ton of CPU, so I just logged off. I probably should have hung around a little longer to see if anything rose to the top, because the server proceeded to stay at 100% CPU usage for about 4 hours until it finally exhausted my credits.

gregrobson

You might want to try running something like New Relic (14-day free trial, https://newrelic.com/application-monitoring/pricing) or Datadog (https://www.datadoghq.com/).

These services aggregate data across RAM, CPU, SQL queries, file I/O, etc. and can display everything that's been happening on your servers while they're running. Reading the results is a bit like reading the Matrix, but if there's a particular spike in something like query completion time or disk I/O, it will stick out like a sore thumb.

Even if your server is fine now, it would be better to get on top of any underlying issues.

michael.roper

For anyone who runs into the kswapd0 issue mentioned above (the process taking up all the CPU), I found the following command got the CPU back down again on my AWS servers provisioned with Forge:

# Echoing 3 to drop_caches frees the page cache plus dentries and inodes
sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"

I found that amongst some threads about that kswapd0 bug, and tweaked it to run correctly on my Forge instance: http://unix.stackexchange.com/questions/109496/echo-3-proc-sys-vm-drop-caches-permission-denied-as-root

CueTracker

Sorry to dig up an old thread, but I'm currently struggling with this problem and I'm not quite sure why it turns up. I don't understand whether the problem lies with my code or with the Linux config. @michael.roper's fix works well for me, but the problem returns after several hours, about once a day, oddly at a time when no scheduled tasks start. The rest of the day the CPU levels are well under 25%. The site is quite database-heavy, but the database is on RDS, and I'm using Spatie's response caching as well as Redis, so it should be OK. I don't want to spend much more money running the site on a more expensive server. Should I somehow try to schedule the solution provided by michael.roper so it runs every so often?

fideloper

Are you aware of how the t2 series uses CPU credits?

You may be eating into CPU credits during regular usage (very common on t2.small if your database is on the same server with moderate traffic).

Once your CPU goes above 20% usage, CPU credits start being consumed. (Info on that here.)

Once you eat up your credits, the server may struggle to respond to requests under load as the CPU usage is then capped at that 20% (for t2.small).

We can see that in your graph, where CPU usage gets pinned at 20% and doesn't go above it. The server probably needs to use more CPU but is capped at that level.

(In fact, seeing that graph makes me think the server is certainly too small - you should resize it to a larger instance type to handle that traffic.)

You can find graphs of CPU credit usage and credits remaining within CloudWatch.
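If you prefer the CLI, something along these lines pulls the same credit-balance data (the instance ID and date range are placeholders):

aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistics Average \
  --period 3600 \
  --start-time 2016-06-20T00:00:00Z \
  --end-time 2016-06-21T00:00:00Z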

One thing I do is use RDS at a small instance size to reduce cost (RDS is a relatively expensive hosted database). Removing the database from the web server reduces CPU usage greatly - I can often use t2.nanos if I also remove Redis from the server.

Another option of course is to use a larger server type in AWS. This is as simple as stopping the instance, changing the instance size, and restarting it. Use a larger t2 or consider going to an m4, where there is no CPU credit system.
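As a rough sketch of that resize from the AWS CLI (the instance ID is a placeholder; the instance has to be stopped before its type can be changed):

aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0
aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 --instance-type "{\"Value\": \"m4.large\"}"
aws ec2 start-instances --instance-ids i-0123456789abcdef0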
