The last 3 weeks were absolutely terrible. It started without warning on a Tuesday afternoon: the average load on the database server increased to around 30 and did not go down, and I/O wait states were continuously above 60%. The system was swapping heavily, yet it wasn't using all of the available physical memory. About 40% of the memory (26GB in our case) hadn't been touched by the system at all.

Calls with the database provider InterSystems and a ticket opened with RedHat weren't successful at the beginning. InterSystems came up with the idea that a kernel update performed a week earlier could be responsible for the strange behaviour. I really had done a kernel update, but that was a week before the trouble started, so why the delay? I wasn't really confident that the kernel had anything to do with what we saw.

I thought that maybe a scheduled job triggered by somebody in user land was responsible for the load. I also thought that reducing the load on the database would decrease the swapping and therefore fix the problem. THAT WAS A MISTAKE. The gaps between these incidents shrank on a daily basis. It started as once a day for about an hour, only to reappear twice a day for more than two hours a couple of days later. A restart of the database server kept it away for one business day. It wasn't visible outside office hours, so it was definitely tied to system load.

Another problem we saw whenever this issue appeared was that the SAMBA and CUPS services were terminating and a huge load of print jobs couldn't be processed. The engineer from RedHat told me that he saw huge I/O on the local disk and recommended moving data storage for services to the SAN if possible. I then moved the CUPS spool directory to the SAN, but that did not change anything. I also tried to free up memory by dropping the kernel caches:

[root@db1 ~]# echo 3 > /proc/sys/vm/drop_caches
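For anyone chasing similar symptoms, the standard tools are enough to spot the pattern described above. These are generic examples, not a transcript of our troubleshooting sessions:

[root@db1 ~]# vmstat 5      # "wa" = time spent in I/O wait, "si"/"so" = swap-in/swap-out activity
[root@db1 ~]# free -g       # memory and swap usage in GB; this is where the untouched 26GB showed up
[root@db1 ~]# iostat -x 5   # per-device utilisation (sysstat package), to see which disk the swapping hits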
I think I was too focused on finding a problem with the database instead of taking into account that the kernel could be the problem here. Then again, even if we had suspected it, we could not have rolled back to the old kernel, because it didn't support the database in a clustered environment. I found this. After moving the swap space to the SAN and increasing it from 8GB to 50GB, the system has been working perfectly for the last 4 days. Fortunately I was able to apply this change to the system without rebooting it.

I wouldn't recommend doing the change while you are seeing the problem, because disabling the swap, which is part of the change, will increase the load even more. So perform the change outside office hours, with fewer users connected to the system. It's a shame that this bug is so old and that there is no fix or patch available even in the latest kernel releases.
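For the record, the change itself boils down to a handful of commands. The following is a minimal sketch, assuming a SAN LUN is already presented to the host as /dev/mapper/san_swap; the device names are placeholders, not the ones from our system:

[root@db1 ~]# mkswap /dev/mapper/san_swap      # format the new 50GB swap area on the SAN
[root@db1 ~]# swapon /dev/mapper/san_swap      # activate it before touching the old swap
[root@db1 ~]# swapoff /dev/sda2                # retire the old 8GB local swap (placeholder device name)
[root@db1 ~]# echo "/dev/mapper/san_swap swap swap defaults 0 0" >> /etc/fstab
[root@db1 ~]# swapon -s                        # verify the active swap areas and their sizes

Activating the new area before running swapoff on the old one keeps swap available throughout; swapoff still has to pull every swapped-out page back into RAM, which is where the extra load during the change comes from. Don't forget to remove the old swap entry from /etc/fstab as well.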