content top

Real life SQL Working Set Trimming issue

Good morning AskPerf! Leonard here to chat about a performance issue we often see on the Windows performance team surrounding SQL and Working Set Trimming. I recently worked an issue on a 2003 SP2 x64 Server, running SQL 2005, that was experiencing periods of slow performance and filling up the SQL log with the following error:

A significant part of sql server process memory has been paged out. This may result in performance degradation.

This error is covered in detail in the following article:

How to reduce paging of buffer pool memory in the 64-bit version of SQL Server

http://support.microsoft.com/kb/918483

This error is logged when the working set of SQL is trimmed, meaning it is paged out to disk. The working set of a program is just the memory it is using. Normally working set trimming isn’t necessarily a bad thing. It’s the way Windows takes memory from one program that hasn’t been used most recently (and might never be used again) and gives it to other programs. But for SQL to operate efficiently, it must keep a large part of its data in memory. When it gets trimmed, then SQL operations can take a significantly longer time to complete. Therefore we needed to determine what was using a large amount of memory on the system causing the problem. For almost all Windows performance issues we gather Perfmon data, as I did in this case. We had it capturing a circular log, and stop it after the SQL error occurred. I then produced a summary of the Perfmon logs and the following data stood out:

image

I could see a SIGNIFICANT range between Minimum and Maximum in Available Bytes, Private Bytes, Working Set, and Cache Bytes. This was occurring when there should have been a somewhat constant work load and therefore memory usage should not have varied greatly. The SQLserver process was the only process with significant working set values. When discovering something in the numbers like this, the next step is to see how some of the important counters change together over time. Looking at a visual graph is the best way to see that relationship.

Here we see a pattern where Available MB drops below 100 MB ~ every 2 hours and 10 min and working set is trimmed; mostly from SQL since it is the largest process:

image

Looking at the one where SQL gets trimmed to 2 GB at ~ 3:50 AM. I added in Cache bytes since we know it varies. I also wanted to know what processes were starting during this period. My favorite counter for that is Process\Elapsed Time. If you scale it to the graph, it is easy to see anything with a sharp angle (meaning it would intersect with the bottom of the graph) or that starts and stops. I removed all of the ones that were flat and was left with the following:

image

We can see above that Cache Bytes grow to 15 gig when all of this occurs. The processes that start at this time were bpcd and dbbackex, and the bphdb was restarted (these are all backup processes). Here is one with the bpcd process selected so we can see where it is running:

image

Zooming out we see the pattern repeats with a different backup process selected. This confirms that the cause of the working set trimming is always the same backup process:

image

Conclusion: The SQL backup (Veritas Netbackup) was causing cache bytes to grow up to ~15 GB and Windows was trimming SQL working set memory to give memory back to cache. This can be a common problem on 64 bit Windows where cache used for file IO can grow large enough to impact Windows performance. There were 2 possible solutions to this:

Solution 1

Implement SQL lock pages in memory. When memory is locked by SQL, it prevents Windows from taking that memory from SQL. The down side to that is Windows memory management might not be able to respond to even smaller request for memory from programs that need it. Because of this, only use lock pages in memory when need.

Solution 2

Another option would be to implement the Windows Dynamic Cache service. This is a service that was developed in-house to use Windows memory management API’s to control the amount of cache memory that is used on Windows. There are a number of registry keys that can be set to control the behavior and therefore it can be adapted to different issues and Windows configurations.

On this particular case, we decided to implement the Dynamic Cache service. To download the Microsoft Windows Dynamic Cache Service, check the following link:

http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=9258

After it was installed, we set the following registry key:

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\DynCache\Parameters]

RegValue: MaxSystemCacheMBytes

Type: REG_DWORD

Value: 1000

Setting this registry key allowed us to set a limit on cache memory to 1GB. We choose this value because it would provide some cache, but was small enough to not impact SQL.

Other possible values for this registry key are:

0 = Limit to 90% of Physical RAM (default)

1-99 = Limit the maximum size of the System File Cache to this percentage of Physical RAM

> 200 = Limit the maximum size of the System File Cache to x Mbytes

Once this was service was implemented, we monitored the system for a week. After this time, we confirmed that both SQL and the backup processes were stable with no performance related issues.

Hopefully this post will help you tackle those fun Working Set/Cache issues on your Windows Servers.

Additional Resources

-Leonard Severt

Read More

Troubleshooting High CPU Utilization issues using Tracelog.exe

Good day AskPerf!  Vinod Rawat here to discuss/demonstrate the benefits of using the Tracelog.exe utility to troubleshoot High CPU issues.  Tracelog.exe is a command line utility that allows us to start/stop or enable trace logging, and offers the following features:

 

  • Starts and stops trace sessions, including private trace sessions, NT Kernel Logger trace sessions, and Global Logger trace sessions
  • Configures and changes the properties of trace sessions
  • Enables and disables trace providers
  • Flushes trace session buffers
  • Lists running (real-time) trace sessions
  • Lists registered trace providers
  • Measures time spent in deferred procedure calls (DPCs) and interrupt service routines (ISRs)

NOTE Tracelog.exe is included in the tools\tracing\<Platform> subdirectory of the Windows Driver Kit (WDK), where <Platform> is either i386, amd64, or ia64.

In this post, I am going to talk about troubleshooting a High CPU issue, where no user-mode processes are visibly spiking the CPU.  With that, let’s get started!

The first thing you will want to do is download Process Explorer and run it on the affected workstation/server.  In this example, we see that Interrupts and its corresponding CPU usage shown below:

clip_image002

You can also add the following Performance Monitor Counters and see the same behavior:

Processor: % Interrupt Time

Processor: %DPC Time

Processor: %Privilege Time

This is where tracelog.exe comes in handy.  Open an Elevated Command Prompt, and browse to the folder that includes the tracelog.exe utility (see Note above).  During the high interrupt time we can capture DPC/ISR ETL tracing with the following syntax:

tracelog -start -f test.etl -dpcisr -UsePerfCounter -b 64

Switches:

-start = Starts up the test.etl trace session

-f = Log to file <name>

-dpcisr = Enable kernel events for DPC/ISR analysis

-UsePerfCounter = Use Perf Counter clock

-b 64 = Sets buffer size to <n> Kbytes

NOTE When tracing DPCs and ISRs, always add the -UsePerfCounter parameter to the command. The system timer resolution is too low to measure the time spent in these activities. Tracerpt.exe, the tool that formats DPC/ISR events, requires the performance counter clock values for its reports. (Tracerpt is included in Windows XP and later versions of Windows.)

Let the tracelog.exe run for roughly 30seconds to 1 minute, then stop the tracing with following syntax:

tracelog -stop

Now we can create a report using the Tracerpt.exe tool with the following command:

tracerpt test.etl -report dpcisr.html -f html

These switches create the report file dpcisr.html in HTML format.

Now comes the fun part (The following is just a descriptive example and what I experienced on my machine).

1. Open the html file you created above

2. Locate the DPC/ISR section in the report:

clip_image003

3. Depending on your scenario, focus on either “Distribution of DPC execution times” or “ISR processor utilization” columns:

clip_image004

4. Sort by the Percent column and you will see the highest consumer (dxgkrnl.sys) at top:

clip_image005

Dxgkrnl.sys is at top taking highest-but-nominal ISR processor time on my machine above (which is obviously normal).

For those who want to go deeper than this: When a culprit driver’s ISR routines are keeping the CPU busy, you can further zoom in and find out the exact range of time these routines are spending in:

Focus on Lower Bound and Upper Bound range and the corresponding Percent. More than 88% (61.66 + 26.59 = 88.25%) count is taking more than 25 microseconds.  Microsoft recommends that DPCs should not run longer than 100 microseconds and ISRs should not run longer than 25 microseconds. ISR routines belonging to the driver above (obscured) have exceeded this value, thus you know how to pin-point the culprit and then check the concerned ISR routines’ time taken.

Happy tracing!

Additional Resources:

-Vinod Rawat

Read More

System Data Collector Sets fail on an upgraded 2008 / 2008 R2 DC

Hello Askperf! Shamesh Pillai here to discuss an interesting issue I recently worked. On Microsoft Windows Server 2008, 2008 R2 Domain Controllers, Perfmon may fail when starting a System Data Collector Set. This includes the Active Directory Diagnostics, System Diagnostics, and System Performance Perfmon logging.

image

When we start any of the System Data Collector Sets, the following error may appear and/or the Data Collector Set may freeze:

Logon failure: the user has not been granted the requested logon type at this computer

image

When this fails, the following Events will be logged in the TaskScheduler event log:

Log Name: Microsoft-Windows-TaskScheduler/Operational
Source: Microsoft-Windows-TaskScheduler
Date: 11/30/2011 6:36:18 AM
Event ID: 101
Task Category: Task Start Failed
Level: Error
Keywords: (1)
User: SYSTEM
Computer: <Computer Name >

Description:
Task Scheduler failed to start "\Microsoft\Windows\PLA\System\{CFCD20D6-D2CE-42D3-A9CB-9916E676DE02}_System Diagnostics" task for user "<Username>". Additional Data: Error Value: 2147943785.

&

Log Name: Microsoft-Windows-TaskScheduler/Operational
Source: Microsoft-Windows-TaskScheduler
Date: 11/30/2011 6:36:18 AM
Event ID: 104
Task Category: Logon failure
Level: Error
Keywords:
User: SYSTEM
Computer: <Computer Name>
Description:
Task Scheduler failed to log on "<UserName>" . Failure occurred in "LsaLogonUser”. User Action: Ensure the credentials for the task are correctly specified. Additional Data: Error Value: 2147943785
.

While troubleshooting the issue, I observed that of all the servers experiencing this behavior, all were Domain Controllers that were upgraded from a Windows Server 2003 Domain Controller. Interestingly though, the issue does not happen on a freshly installed Microsoft Windows 2008, 2008 R2 Domain Controller, or if we upgrade a Member server. In my testing, we observed that “User Defined” Sets were not affected by this problem.

In Microsoft Windows Server 2008 and later operating systems, Perfmon uses Task Scheduler in the background. When we start a System Data Collector Set, it launches Taskeng in the current user context. To launch the job, the user needs to be included in the “Log on as a batch job” User Rights Assignment Policy.

We checked the “Log on as a batch job” Policy and discovered that the “Performance Log Users” group was not included. As soon as we added this group back to the Policy, we were able to successfully start our System Data Collector Sets.

By default, the “Performance Log Users” group is expected to have the “Log on as a batch job” user right. However, when the 2003 Domain Controller was upgraded to Server 2008, the “Performance Log Users” group did not get added to this Policy.

With that, we have concluded our post for today. Take care!

Additional Resources

-Shamesh Pillai

Read More
content top