kswapd daemons using 100% CPU although there is enough free physical memory available.

The last three weeks were absolutely terrible. It started without warning on a Tuesday afternoon: the average load on the database server climbed to around 30 and did not go down, I/O wait states were continuously above 60%, and the system was swapping heavily... but it wasn't using all of the available physical memory. Around 40% of the memory (26GB in our case) hadn't been touched by the system.

Calls with the database provider InterSystems and a ticket opened with RedHat weren't successful at the beginning. InterSystems came up with the idea that a kernel update performed a week earlier could be responsible for the strange behaviour. I had indeed done a kernel update, but that was a week ago, so why the delay before the problem appeared? I wasn't really convinced that the kernel had anything to do with what we saw. I thought that maybe a scheduled job triggered by somebody in user land was responsible for the load, and that reducing the load on the database would decrease the swapping and therefore fix the problem. THAT WAS A MISTAKE. The gaps between the incidents shrank on a daily basis: it started once a day and lasted about an hour, just to reappear twice a day for more than two hours a couple of days later. A restart of the database server kept it away for one business day. It wasn't visible out of office hours, so it was definitely tied to system load.

Another problem we saw whenever this issue appeared was that the SAMBA and CUPS services were terminating and a huge backlog of print jobs couldn't be processed. The guy from RedHat told me that he saw huge I/O on the local disk and recommended moving data storage for services to the SAN where possible. I then moved the CUPS spool directory to the SAN, but that did not change anything. I also tried to flush the cached data to disk by running this on the system:

[root@db1 ~]# echo 3 > /proc/sys/vm/drop_caches
I think I was too focused on finding a problem with the database instead of taking into account that the kernel could be the problem here. Even if we had suspected it earlier, we could not have rolled back to the old kernel, because it didn't support the database in a clustered environment. I found this. After moving the swap space to the SAN and increasing it from 8GB to 50GB, the system has been working perfectly for the last 4 days. Fortunately I was able to apply this change to the system without rebooting it. I wouldn't recommend making the change while you are seeing the problem, because disabling the swap, which is part of the change, will increase the load even more. So perform the change out of office hours, with fewer users connected to the system. It's a shame that this bug is so old and that there is no fix or patch available even in the latest kernel releases.
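
In case it helps anyone, the change itself boils down to the sequence below. It's a minimal sketch: /dev/mapper/san-swap and /dev/sda2 are hypothetical device names standing in for the new SAN-backed volume and the old local swap partition, so substitute your own.

# Prepare and enable the new 50GB swap area on the SAN-backed volume
# (hypothetical device name -- use your own).
mkswap /dev/mapper/san-swap
swapon /dev/mapper/san-swap

# Only now take the old local swap offline; doing it in this order means
# the system is never left without swap while swapoff migrates the pages
# that are currently swapped out.
swapoff /dev/sda2

# Finally, replace the old swap line in /etc/fstab with the new device
# so the change survives a reboot:
# /dev/mapper/san-swap   none   swap   defaults   0 0

Note that swapoff has to read back every page that is currently swapped out, so on a busy system it can take a while and push the load up further, which is exactly why I'd do this out of hours.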

About Juergen Caris

I am 54 years old, German, hold an MSc (Dist) and a BSc in Computer Science, and work as a Senior Server Engineer for NHS Lothian. I am responsible for the patient management system, called TrakCare. I am a UNIX/Linux guy and have been working in this sector for more than 20 years now. I am also interested in robotics, microprocessors, system monitoring, home automation and programming.