
Bottleneck searching on Windows


Good day

It was a fine day and nothing boded ill. Then the problem arrived: some application became unacceptably slow, although a week / month / day ago everything was fine. It has to be fixed quickly, with as little time spent as possible. The problem server runs Windows Server 2003 or later.

I hope the following walkthrough is brief and clear enough to be useful both to novice administrators and to more seasoned colleagues; there is always something new to pick up. Do not rush straight into investigating the application's behavior. First check whether the server has enough performance right now. Are there any "bottlenecks" limiting it?

Perfmon will help us - a fairly powerful tool that ships with Windows. To begin with, let's define a "bottleneck": it is a resource that has reached the limit of its use. Bottlenecks usually arise from incorrect resource planning, hardware problems, or incorrect application behavior.

If we open perfmon, we will see tens or hundreds of different counters, and their sheer number does not help a quick investigation. So, to begin with, let's outline the 5 main possible "bottlenecks" to shorten the list of counters to examine.

These are the processor, RAM, the storage subsystem (HDD / SSD), the network and processes. Below we look at each of them: which counters we need and the threshold values for them.
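As a rough illustration, a short list of counters like this can also be logged from the command line with typeperf, which ships with Windows alongside perfmon. The Python sketch below only assembles the command line and does not run it. The helper name, the counter subset, the sampling values and the output file name are assumptions made for the example.

```python
# Hypothetical helper: builds (but does not run) a typeperf command line.
# The counter subset below is just one counter per "bottleneck" area.
STARTER_COUNTERS = [
    r"\Processor(_Total)\% Processor Time",
    r"\Memory\Available MBytes",
    r"\PhysicalDisk(_Total)\% Idle Time",
    r"\Network Interface(*)\Bytes Total/sec",
    r"\Process(*)\Handle Count",
]


def build_typeperf_command(counters, interval_s=15, samples=240, out_file="perf.csv"):
    """Return an argument list that could be passed to subprocess.run on Windows."""
    return (
        ["typeperf"]
        + list(counters)
        # -si: seconds between samples, -sc: number of samples,
        # -f: output format, -o: output file
        + ["-si", str(interval_s), "-sc", str(samples), "-f", "CSV", "-o", out_file]
    )


if __name__ == "__main__":
    print(build_typeperf_command(STARTER_COUNTERS))
```

With the default values this would take a sample every 15 seconds for one hour.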

Processor.

A processor overloaded with tasks obviously does not help applications run fast. To examine its resources, we pick just 4 counters:

Processor\% Processor Time

Measures the percentage of time the processor spends doing work rather than idling. This is the most intuitive counter: CPU load. MS recommends moving to a faster processor if the value stays above 85%. This depends on many factors, though; you need to know your own workload and its specifics, because the acceptable value can vary.

Processor\% User Time

Shows how much time the processor spends in user space. If the value is high, applications are taking up a lot of CPU time; it is worth looking at them, because it may be time to optimize them.

Processor\% Interrupt Time

Measures the time the processor spends servicing hardware interrupts. This counter can reveal hardware problems. MS recommends starting to worry if the value exceeds 15%. It means that some device has started responding to requests very slowly and should be checked.

System\Processor Queue Length

Shows the number of threads that are ready to run but are waiting in the queue for processor time. There is a single queue even on multiprocessor machines, so the value should be divided by the number of processors. A queue that stays long over a sustained period means the processor cannot keep up with the load; a commonly cited rule of thumb is to start investigating when it stays above roughly 2 threads per processor.
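The percentage thresholds above are easy to check automatically once some samples have been collected. The sketch below is a minimal example, assuming the samples are already available as plain lists of numbers; the function name and the data layout are hypothetical, and averaging over the samples is just one reasonable way to smooth out short spikes.

```python
# Thresholds quoted in the text; treat them as starting points, not hard limits.
CPU_THRESHOLDS = {
    r"Processor\% Processor Time": 85.0,
    r"Processor\% Interrupt Time": 15.0,
}


def cpu_warnings(samples):
    """samples: dict mapping counter name -> list of percent readings.

    Returns a dict of counters whose average exceeds the threshold,
    mapped to that average (rounded to one decimal place).
    """
    warnings = {}
    for counter, limit in CPU_THRESHOLDS.items():
        readings = samples.get(counter)
        if not readings:
            continue  # counter was not collected
        average = sum(readings) / len(readings)
        if average > limit:
            warnings[counter] = round(average, 1)
    return warnings
```

For example, three readings of 90, 95 and 88 for % Processor Time average 91, which is above the 85% line and would be reported.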

Memory.

A lack of RAM can badly hurt overall performance, forcing the system to actively use the slow HDD for swapping. But even if the server seems to have plenty of RAM, memory may "leak". A memory leak is an uncontrolled decrease in the amount of free memory caused by bugs in programs. It is also worth mentioning that, for Windows, the amount of virtual memory is the sum of RAM and the paging file.

Memory\% Committed Bytes in Use

Shows virtual memory usage. If the value exceeds 80%, you should think about adding RAM.

Memory\Available MBytes

Shows RAM usage, namely the number of megabytes available. If the value is less than 5% of the installed RAM, then again, you should consider adding RAM.

Memory\Free System Page Table Entries

The number of free entries in the system page table. This number is limited; on top of that, large pages of 2 MB or more are becoming popular these days in place of the classic 4 KB ones, which does not help keep the number of entries high. A value below 5,000 may indicate a memory leak.

Memory\Pool Nonpaged Bytes

The size of the nonpaged pool. This is a region of kernel memory that holds important data and cannot be paged out to swap. If the value exceeds 175 MB, it is most likely a memory leak. This is usually accompanied by event 2019 appearing in the system log.

Memory\Pool Paged Bytes

Similar to the previous one, but this area can be paged out to disk (swap) when it is not in use. For this counter, values above 250 MB are considered critical and are usually accompanied by event 2020 in the system log. This also points to a memory leak.

Memory\Pages/sec

The number of pages read from or written to disk per second because the required data was not in RAM. Again, a value above 1000 hints at a memory leak.
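The memory thresholds can be put together in one small check. This is only a sketch: the dictionary keys and the function name are hypothetical, and the 5% rule for available memory is interpreted here as 5% of installed RAM, which is an assumption about what the percentage refers to.

```python
MB = 1024 * 1024


def memory_warnings(reading, installed_ram_mb):
    """reading: dict holding one sample of each Memory counter (hypothetical keys).

    Returns the names of the counters that cross the thresholds from the text.
    """
    # Assumption: "less than 5%" means less than 5% of installed RAM.
    available_pct = reading["available_mb"] / installed_ram_mb * 100
    checks = [
        ("% Committed Bytes In Use", reading["committed_pct"] > 80),
        ("Available MBytes", available_pct < 5),
        ("Free System Page Table Entries", reading["free_ptes"] < 5000),
        ("Pool Nonpaged Bytes", reading["pool_nonpaged_bytes"] > 175 * MB),
        ("Pool Paged Bytes", reading["pool_paged_bytes"] > 250 * MB),
        ("Pages/sec", reading["pages_per_sec"] > 1000),
    ]
    return [name for name, exceeded in checks if exceeded]
```

A single sample is enough to show the idea, but for leak hunting the trend over hours or days matters more than any one reading.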

HDD.

An important element that can make a significant contribution to system performance.

LogicalDisk\% Free Space

Percentage of free space. Only the partitions holding system files are of interest - the OS, the paging file(s), and so on. MS recommends adding disk space if less than 15% is free, because under critical load it can run out abruptly (temp files, Windows updates, or that same paging file). But, as the saying goes, "it depends", and you need to look at how much space is really available: the paging file may have a fixed size, temp directories may have quotas that stop them from growing, and updates may arrive in small portions and rarely, or not at all.

PhysicalDisk\% Idle Time

Shows how much of the time the disk is idle. It is recommended to replace the disk with a faster one if this counter drops below the 20% threshold.

PhysicalDisk\Avg. Disk Sec/Read

The average time the disk takes to read data. Above 25 ms is already bad; for SQL Server and Exchange, 10 ms or less is recommended. The recommendation is the same as for the previous counter.

PhysicalDisk\Avg. Disk Sec/Write

The same as PhysicalDisk\Avg. Disk Sec/Read, but for writes. The critical threshold is also 25 ms.

PhysicalDisk\Avg. Disk Queue Length

Shows the average number of I/O operations waiting for the disk to become available. It is recommended to start worrying if this number is more than twice the number of spindles in the system (without RAID arrays, the number of spindles equals the number of hard drives). The remedy is the same as before: a faster disk.

Memory\Cache Bytes

The amount of memory used for the cache, part of which is the file cache. A size above 300 MB may point to a problem with HDD performance or to an application that uses the cache heavily.
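Two details of the disk counters are easy to trip over: the latency counters report seconds, while the thresholds are given in milliseconds, and the queue length has to be compared against the number of spindles. A minimal sketch of both checks follows; the function and parameter names are hypothetical.

```python
def disk_warnings(idle_pct, sec_per_read, sec_per_write, avg_queue_length, spindles=1):
    """Apply the PhysicalDisk thresholds from the text to one set of readings."""
    warnings = []
    if idle_pct < 20:
        warnings.append("% Idle Time below 20%")
    # perfmon reports latency in seconds; the thresholds are in milliseconds
    for label, seconds in (("Read", sec_per_read), ("Write", sec_per_write)):
        latency_ms = seconds * 1000
        if latency_ms > 25:
            warnings.append(f"Avg. Disk Sec/{label} is {latency_ms:.0f} ms")
    if avg_queue_length > 2 * spindles:
        warnings.append("Avg. Disk Queue Length above 2 x spindles")
    return warnings
```

For instance, a reading of 0.030 for Avg. Disk Sec/Read is 30 ms and would be flagged, while 0.005 for writes (5 ms) would not.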

Network.

In today's world the network is very important - a huge amount of data travels over it.

Network Interface\Bytes Total/Sec

The amount of data sent and received through the network adapter. A value above 70% of the interface bandwidth indicates a possible problem. Either replace the card with a faster one, or add another one to offload the first.

Network Interface\Output Queue Length

Shows the number of packets queued for sending. If the value exceeds 2, you should think about replacing the card with a faster one.
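Note the units when applying the 70% rule: Bytes Total/sec is in bytes, while link speeds are normally quoted in bits per second, so the counter has to be multiplied by 8 first. A small sketch of the arithmetic, with hypothetical function names; the link speed is passed in by hand here.

```python
def link_utilization_pct(bytes_total_per_sec, link_speed_bits_per_sec):
    """Convert Bytes Total/sec into a percentage of the link speed."""
    return bytes_total_per_sec * 8 * 100 / link_speed_bits_per_sec


def network_warnings(bytes_total_per_sec, link_speed_bits_per_sec, output_queue_length):
    """Apply the two Network Interface thresholds from the text."""
    warnings = []
    if link_utilization_pct(bytes_total_per_sec, link_speed_bits_per_sec) > 70:
        warnings.append("Bytes Total/sec above 70% of bandwidth")
    if output_queue_length > 2:
        warnings.append("Output Queue Length above 2")
    return warnings
```

On a 1 Gbit/s link, 100,000,000 bytes per second works out to 80% utilization, which is already over the line.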

Processes.

Server performance can drop catastrophically if an application is poorly optimized or starts behaving "incorrectly".

Process\Handle Count

The number of handles the process has open. These can be files as well as registry keys. A count above 10,000 can be a sign that the application is misbehaving.

Process\Thread Count

The number of threads in the process. It is worth studying the application's behavior more closely if the difference between the minimum and maximum thread count exceeds 500.

Process\Private Bytes

Shows the amount of memory allocated by the process that cannot be shared with other processes. If this value fluctuates by more than 250 between its minimum and maximum, it indicates a possible memory leak.
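Unlike most of the counters above, the process counters are judged partly by how much they swing over time rather than by a single value. The sketch below shows that kind of min/max check for handles and threads over a series of samples taken for one process; the function name and the input layout are assumptions for the example.

```python
def process_warnings(handle_counts, thread_counts):
    """Both arguments: lists of readings for one process over time."""
    warnings = []
    # Handle Count: an absolute ceiling
    if max(handle_counts) > 10_000:
        warnings.append("Handle Count above 10,000")
    # Thread Count: the spread between the lowest and highest reading
    if max(thread_counts) - min(thread_counts) > 500:
        warnings.append("Thread Count varies by more than 500")
    return warnings
```

A process whose thread count moves between 40 and 700 during the log would be flagged even though neither number looks alarming on its own.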

Most of the above counters do not give a clear-cut signal that the system has a "bottleneck". All the values given are based on averaged statistics and can vary widely from system to system. To use these counters correctly, we need to know at least how the system performs during normal operation. This is called baseline performance - a perfmon log taken from a working, freshly installed system that has no problems (the "freshly installed" part is not essential: it is never too late to take this log, or to track how the baseline changes over the long term). This is an important point that is often skipped, although later on it can seriously reduce possible system downtime and noticeably speed up the analysis of the data obtained from the above counters.
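One simple way to put a baseline to use is to compare the average of each counter in a fresh log against its average in the baseline log and look only at the counters that moved noticeably. The sketch below assumes both logs have already been loaded into dictionaries of readings; the function name and the 25% tolerance are arbitrary choices for the example, not recommended values.

```python
from statistics import mean


def compare_to_baseline(baseline, current, tolerance=0.25):
    """Both arguments: dict mapping counter name -> list of readings.

    Returns counter -> relative change (0.5 means +50%) for every counter
    whose average moved by more than `tolerance` compared with the baseline.
    """
    drift = {}
    for counter, base_values in baseline.items():
        current_values = current.get(counter)
        if not base_values or not current_values:
            continue  # counter missing from one of the logs
        base_avg = mean(base_values)
        if base_avg == 0:
            continue  # relative change is undefined for a zero baseline
        change = (mean(current_values) - base_avg) / base_avg
        if abs(change) > tolerance:
            drift[counter] = round(change, 2)
    return drift
```

A counter that doubled relative to its baseline is worth a look even when it is still below the generic thresholds listed earlier.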

Taken from: https://ru.intel.com/business/community/?automodule=blog&blogid=57161&sh...
