Problem description
The description below is a bit long, but this is a rather tricky problem. To narrow down the search, I have tried to cover what we know about it. The question is more of an ongoing investigation than a single-issue question, but I think it may help others as well. However, if you think I am wrong about some of the assumptions below, please add information or correct me in the comments.
UPDATE 19/2, 2013: We have cleared up some of the question marks in this and I have a theory about what the main problem is, which I'll update on below. Not ready to write a "solved" response to it yet, though.
UPDATE 24/4, 2013: Things have been stable in production (though I believe it is temporary) for a while now and I think it is due to two reasons. 1) port increase, and 2) reduced number of outgoing (forwarded) requests. I'll continue this update further down in the correct context.
We are currently doing an investigation in our production environment to determine why our IIS web server does not scale when too many outgoing asynchronous web service requests are being done (one incoming request may trigger multiple outgoing requests).
CPU is only at 20%, but we receive HTTP 503 errors on incoming requests and many outgoing web requests get the following exception: "SocketException: An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full". Clearly there is a scalability bottleneck somewhere and we need to find out what it is and whether it can be solved by configuration.
Application context:
We are running the IIS v7.5 integrated managed pipeline using .NET 4.5 on the Windows 2008 R2 64-bit operating system. We use only 1 worker process in IIS. Hardware varies slightly, but the machine used for examining the error is an Intel Xeon 8-core (16 hyper-threaded).
We use both asynchronous and synchronous web requests. The asynchronous ones use the new .NET async support to let each incoming request make multiple HTTP requests in the application to other servers over persistent (keep-alive) TCP connections. Synchronous request execution time is low, 0-32 ms (longer times occur due to thread context switching). For the asynchronous requests, execution time can be up to 120 ms before the requests are aborted.
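For reference, a minimal sketch of what such an outgoing call with a ~120 ms abort might look like. The method name ForwardAsync and the use of Task.Delay for the timeout are illustrative assumptions, not our actual application code:

```csharp
using System;
using System.Net;
using System.Threading.Tasks;

static class OutgoingCalls
{
    // Illustrative sketch: forward a call with WebClient over a keep-alive
    // connection and abort it if it has not completed within ~120 ms.
    public static async Task<string> ForwardAsync(Uri target)
    {
        using (var client = new WebClient())
        {
            Task<string> download = client.DownloadStringTaskAsync(target);

            if (await Task.WhenAny(download, Task.Delay(120)) != download)
            {
                client.CancelAsync();   // abort the outgoing request

                // Observe the eventual fault so it does not go unnoticed.
                var ignored = download.ContinueWith(t => t.Exception,
                    TaskContinuationOptions.OnlyOnFaulted);

                throw new TimeoutException("Outgoing request exceeded 120 ms");
            }
            return await download;      // result, or the original exception
        }
    }
}
```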
Normally each server handles up to ~1000 incoming requests/sec. Outgoing requests are at ~300 requests/sec, rising to ~600 requests/sec when the problem starts to arise. Problems only occur when outgoing async requests are enabled on the server and we go above a certain level of outgoing requests (~600 req/s).
Possible solutions to the problem:
Searching the Internet for this problem reveals a plethora of possible solution candidates. They are, however, very much dependent on the versions of .NET, IIS and the operating system, so it takes time to find something that applies to our context (anno 2013).
Below is a list of solution candidates and the conclusions we have come to so far with regard to our configuration context. So far I have categorised the detected problem areas into the following main categories:
- Some queue(s) fill up
- Problems with TCP connections and ports (UPDATE 19/2, 2013: This is the problem)
- Too slow allocation of resources
- Memory problems (UPDATE 19/2, 2013: This is most likely another problem)
1) Some queue(s) fill up
The outgoing asynchronous request exception message does indicate that some queue or buffer has been filled up, but it does not say which queue/buffer. Via the IIS forum (and a blog post referenced there) I have been able to distinguish 4 of possibly 6 (or more) different types of queues in the request pipeline, labeled A-F below.
It should be stated, though, that of all the queues defined below, we see for certain that the 1.B) ThreadPool performance counter Requests Queued gets very full during the problematic load. So the cause of the problem is likely at the .NET level and not below it (C-F).
1.A) .NET Framework level queue(s)?
We use the .NET Framework class WebClient for issuing the asynchronous calls (async support), as opposed to HttpClient, which we found had the same issue but with a far lower req/s threshold. We do not know whether the .NET Framework implementation hides any internal queue(s) above the thread pool. We don't think this is the case.
1.B) .NET ThreadPool queue
The thread pool acts as a natural queue, since the (default) .NET thread scheduler picks threads from the thread pool to execute.
Performance counter: [ASP.NET v4.0.30319].[Requests Queued].
Configuration possibilities:
- (ApplicationPool) maxConcurrentRequestsPerCPU should be 5000 (instead of the previous default of 12). So in our case it allows 5000*16 = 80,000 concurrent requests, which should be more than sufficient in our scenario.
- (processModel) autoConfig = true/false, which allows some threadPool-related configuration to be set according to the machine configuration. We use true, which is a potential error candidate since those auto-configured values may be wrong for our (high) needs (a small runtime check is sketched below).
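One way to sanity-check part of what autoConfig actually produced is to read the effective thread-pool limits at runtime. A minimal sketch, with the class name and the decision of where to log the result left as assumptions:

```csharp
using System.Threading;

static class ThreadPoolDiagnostics
{
    // Sketch: report the effective thread-pool limits so we can see what
    // processModel/autoConfig actually configured on this machine.
    public static string Describe()
    {
        int minWorker, minIo, maxWorker, maxIo, availWorker, availIo;
        ThreadPool.GetMinThreads(out minWorker, out minIo);
        ThreadPool.GetMaxThreads(out maxWorker, out maxIo);
        ThreadPool.GetAvailableThreads(out availWorker, out availIo);

        return string.Format(
            "min={0}/{1} max={2}/{3} available={4}/{5} (worker/IOCP)",
            minWorker, minIo, maxWorker, maxIo, availWorker, availIo);
    }
}
```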
1.C) Global, process wide, native queue (IIS integrated mode only)
If the thread pool is full, requests start to pile up in this native (unmanaged) queue.
Performance counter: [ASP.NET v4.0.30319].[Requests in Native Queue]
Configuration possibilities: ????
1.D) HTTP.sys kernel queue
This queue is not the same queue as 1.C) above. Here is an explanation as it was stated to me: "The HTTP.sys kernel queue is essentially a completion port on which user-mode (IIS) receives requests from kernel-mode (HTTP.sys). It has a queue limit, and when that is exceeded you will receive a 503 status code. The HTTPErr log will also indicate that this happened by logging a 503 status and QueueFull".
Performance counter: I have not been able to find any performance counter for this queue, but by enabling the IIS HTTPErr log, it should be possible to detect if this queue gets flooded.
Configuration possibilities: This is set in IIS on the application pool, advanced setting: Queue Length. The default value is 1000. I have seen recommendations to increase it to 10,000, though trying this increase has not solved our issue.
1.E) Operating system level queue(s)?
Although unlikely, I guess the OS could actually have a queue somewhere in between the network card buffer and the HTTP.sys queue.
1.F) Network card buffer
As requests arrive at the network card, it is natural that they are placed in some buffer in order to be picked up by some OS kernel thread. Since this is kernel-level execution, and thus fast, it is unlikely to be the culprit.
Windows Performance Counter: [Network Interface].[Packets Received Discarded] using the network card instance.
Configuration possibilities: ????
2) Problems with TCP connections and ports
This is a candidate that pops up here and there, though our outgoing (async) TCP requests are made over persistent (keep-alive) TCP connections. So as traffic grows, the number of ephemeral ports in use should really only grow because of the incoming requests. And we know for sure that the problem only arises when we have outgoing requests enabled.
However, the problem may still arise because each port stays allocated for the full duration of the request. An outgoing request may take as long as 120 ms to execute (before the .NET Task (thread) is cancelled), which might mean that ports are held for a longer time period. Analyzing the Windows performance counters verifies this assumption, since the number of TCPv4.[Connections Established] goes from a normal 2-3,000 up to peaks of almost 12,000 in total when the problem occurs.
We have verified that the configured maximum number of TCP connections is set to the default of 16,384. In this case it may not be the problem, although we are dangerously close to the max limit.
When we try using netstat on the server it mostly returns without any output at all; TcpView likewise shows very few items in the beginning. If we let TcpView run for a while, it soon starts to show new (incoming) connections quite rapidly (say 25 connections/sec). Almost all connections are in the TIME_WAIT state from the beginning, suggesting that they have already completed and are waiting for clean-up. Do those connections use ephemeral ports? The local port is always 80, and the remote port keeps increasing. We wanted to use TcpView in order to see the outgoing connections, but we can't see them listed at all, which is very strange. Can't these two tools handle the amount of connections we are having? (To be continued... but please fill in with info if you know it.)
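When netstat and TcpView cannot keep up, one alternative (an illustrative sketch, not something we have actually used in this investigation) is to enumerate the TCP table programmatically via IPGlobalProperties and group the connections by state:

```csharp
using System;
using System.Linq;
using System.Net.NetworkInformation;

static class TcpDiagnostics
{
    // Sketch: group the machine's active TCP connections by state, e.g. to
    // watch how many sockets sit in TimeWait versus Established.
    public static void DumpConnectionStates()
    {
        var connections = IPGlobalProperties.GetIPGlobalProperties()
                                            .GetActiveTcpConnections();

        foreach (var group in connections.GroupBy(c => c.State)
                                         .OrderByDescending(g => g.Count()))
        {
            Console.WriteLine("{0,-12} {1}", group.Key, group.Count());
        }
    }
}
```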
Furthermore, as a side note here, it was suggested in the blog post "ASP.NET Thread Usage on IIS 7.5, IIS 7.0, and IIS 6.0" that ServicePointManager.DefaultConnectionLimit should be set to int.MaxValue, as it could otherwise be a problem. But in .NET 4.5 this is already the default from the start.
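For completeness, setting it explicitly in startup code does no harm if you do not want to rely on framework defaults; a one-method sketch (class and method names are made up for the example):

```csharp
using System.Net;

static class ConnectionLimitSetup
{
    // Sketch: set the outgoing connection limit explicitly (e.g. from
    // Application_Start) instead of relying on the framework default.
    public static void Apply()
    {
        ServicePointManager.DefaultConnectionLimit = int.MaxValue;
    }
}
```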
UPDATE 19/2, 2013:
- It is reasonable to assume that we actually did hit the maximum limit of 16,384 ports. We doubled the number of ports on all servers except one, and only that old server ran into problems when we reached the old peak load of outgoing requests. So why did TCPv4.[Connections Established] never show us a number higher than ~12,000 when the problem occurred? My theory: most likely, although not yet established as fact, the performance counter TCPv4.[Connections Established] is not equal to the number of ports currently allocated. I have not had time to catch up on TCP states yet, but I am guessing that there are more TCP states than what "Connections Established" shows that keep a port occupied. Since we cannot use the "Connections Established" performance counter to detect the danger of running out of ports, it is important that we find some other way to detect when this maximum port range is reached. And as described above, we have not been able to use NetStat or the TcpView application for this on our production servers. That is a problem! (I will write more about it in what I think will be a follow-up response to this post.)
- The number of ports on Windows is limited to a maximum of 65,535 (although the first ~1000 should probably not be used). It should, however, be possible to avoid the port-exhaustion problem by decreasing the time spent in the TCP state TIME_WAIT (240 seconds by default), as described in many places, which should free up ports faster. I was at first a bit hesitant about this since we use both long-running database queries and WCF calls over TCP, and I would not want to decrease that time limit. Although I have not caught up on my TCP state-machine reading yet, I think it may not be a problem after all. The TIME_WAIT state, I believe, is only there to allow a proper close-down handshake with the client, so the actual data transfer on an existing TCP connection should not time out because of this limit. The worse case is a client that does not close down properly and instead needs to time out. I guess not all browsers may implement this correctly, and it might then only be a problem on the client side. Though I am guessing a bit here...
END UPDATE 19/2, 2013
UPDATE 24/4, 2013: We have increased the number of ports to the maximum value. At the same time we do not get as many forwarded outgoing requests as earlier. These two in combination should be the reason why we have not had any incidents. However, it is only temporary, since the number of outgoing requests is bound to increase again in the future on these servers. The problem thus lies, I think, in the fact that the port for an incoming request has to remain open during the time frame of the response to the forwarded requests. In our application, the cancellation limit for these forwarded requests is 120 ms, which can be compared to the normal <1 ms it takes to handle a non-forwarded request. So in essence, I believe the finite number of ports is the major scalability bottleneck on such high-throughput servers (>1000 requests/sec on ~16-core machines) as we are using. This, in combination with the GC work on cache reload (see below), makes the server especially vulnerable.
END UPDATE 24/4, 2013
3) Too slow allocation of resources
Our performance counters show that the number of queued requests in the thread pool (1.B) fluctuates a lot during the time of the problem. So potentially this means that we have a dynamic situation in which the queue length starts to oscillate due to changes in the environment. For instance, this would be the case if there are flooding-protection mechanisms that are activated when traffic floods us. As it is, we have a number of these mechanisms:
When things go really bad and the server responds with HTTP 503 errors, the load balancer automatically removes the web server from production for a 15-second period. This means that the other servers take the increased load during that time frame. During the "cooling period" the server may finish serving its requests, and it is automatically reinstated when the load balancer does its next ping. Of course this is only good as long as all servers don't have a problem at once. Luckily, so far, we have not been in that situation.
In the web application we have our own constructed valve (yes, it is a "valve", not a "value") triggered by the Windows performance counter for queued requests in the thread pool. A thread, started in Application_Start, checks this performance counter value every second. If the value exceeds 2000, all outgoing traffic ceases to be initiated. The next second, if the queue value is below 2000, outgoing traffic starts again.
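For illustration, the valve could look roughly like the sketch below. The counter category name and the 2000 threshold come from the description above; the class and member names are invented for the example and this is not our actual production code:

```csharp
using System.Diagnostics;
using System.Threading;

static class OutgoingTrafficValve
{
    private static volatile bool _open = true;

    // Outgoing-request code checks this flag before forwarding anything.
    public static bool IsOpen { get { return _open; } }

    // Started once from Application_Start; polls the counter every second.
    public static void Start()
    {
        var counter = new PerformanceCounter("ASP.NET v4.0.30319", "Requests Queued");
        var poller = new Thread(() =>
        {
            while (true)
            {
                _open = counter.NextValue() < 2000;  // close valve above 2000 queued
                Thread.Sleep(1000);
            }
        });
        poller.IsBackground = true;
        poller.Start();
    }
}
```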
The strange thing here is that it has not helped us avoid reaching the error scenario, since we don't have much logging of this occurring. It may mean that when traffic hits us hard, things go bad really quickly, so that the 1-second check interval is actually too coarse.
There is another aspect to this as well. When there is a need for more threads in the application pool, those threads get allocated very slowly; from what I have read, 1-2 threads per second. This is because creating threads is expensive, and because you don't want too many threads anyway in order to avoid expensive context switching in the synchronous case, which I think is natural. However, it should also mean that if a sudden large burst of traffic hits us, the number of threads will not be anywhere near enough to satisfy the need in the asynchronous scenario, and queuing of requests will start. This is a very likely problem candidate, I think. One candidate solution may then be to increase the minimum number of threads created in the ThreadPool. But I guess this may also affect the performance of the synchronously running requests.
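A hedged sketch of raising the thread-pool floor follows; the value 200 is only a placeholder to be tuned, and a similar effect can reportedly also be achieved declaratively via the processModel minWorkerThreads/minIoThreads settings:

```csharp
using System;
using System.Threading;

static class ThreadPoolWarmup
{
    // Sketch: raise the thread-pool minimum so bursts do not have to wait
    // for the slow (roughly 1-2 per second) thread injection described above.
    // The numbers here are placeholders and need tuning against real load.
    public static void Apply()
    {
        int minWorker, minIo;
        ThreadPool.GetMinThreads(out minWorker, out minIo);

        // Only ever raise the floor, never lower it.
        ThreadPool.SetMinThreads(
            Math.Max(minWorker, 200),
            Math.Max(minIo, 200));
    }
}
```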
4) Memory problems
(Joey Reyes wrote about this in a blog post.) Since objects get collected later for asynchronous requests (up to 120 ms later in our case), memory problems can arise because objects can be promoted to generation 1 and the memory will not be reclaimed as often as it should. The increased pressure on the garbage collector may very well cause extended thread context switching and further weaken the capacity of the server.
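To verify whether the GC is actually involved, a simple snapshot helper like the sketch below could be logged alongside the other counters (illustrative only; where to log the result is up to the application):

```csharp
using System;

static class GcDiagnostics
{
    // Sketch: snapshot collection counts per generation so spikes during
    // the problematic load (or during cache reloads) become visible.
    public static string Snapshot()
    {
        return string.Format(
            "gen0={0} gen1={1} gen2={2} totalMemory={3:N0} bytes",
            GC.CollectionCount(0),
            GC.CollectionCount(1),
            GC.CollectionCount(2),
            GC.GetTotalMemory(false));
    }
}
```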
However, we do not see increased GC or CPU usage during the time of the problem, so we don't think the suggested CPU-throttling mechanism is a solution for us.
UPDATE 19/2, 2013: We use a cache-swap mechanism at regular intervals, at which an (almost) full in-memory cache is reloaded into memory and the old cache can get garbage collected. At these times the GC has to work harder and steals resources from the normal request handling. The Windows performance counter for thread context switching shows that the number of context switches decreases significantly from its normal high value at times of high GC usage. I think that during such cache reloads the server is extra vulnerable to queueing up requests, and it is necessary to reduce the footprint of the GC. One potential fix to the problem would be to just fill the cache without allocating memory all the time. A bit more work, but it should be doable.
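The "fill the cache without allocating a whole new one" idea could be sketched roughly as below. The cache shape (a ConcurrentDictionary refreshed in place) is a made-up illustration, since the real cache is application specific:

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;

// Illustrative only: instead of building a brand-new cache object on every
// reload (leaving the old one for the GC), refresh entries in place so that
// far fewer large object graphs have to be collected at reload time.
class RefreshableCache<TKey, TValue>
{
    private readonly ConcurrentDictionary<TKey, TValue> _entries =
        new ConcurrentDictionary<TKey, TValue>();

    public bool TryGet(TKey key, out TValue value)
    {
        return _entries.TryGetValue(key, out value);
    }

    public void Refresh(IEnumerable<KeyValuePair<TKey, TValue>> freshData)
    {
        foreach (var pair in freshData)
        {
            _entries[pair.Key] = pair.Value;   // overwrite in place
        }
    }
}
```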
UPDATE 24/4, 2013: I am still in the middle of the cache-reload memory tweak to avoid having the GC run as much. But we normally have some 1000 queued requests temporarily when the GC runs. Since it runs on all threads, it naturally steals resources from the normal request handling. I'll update this status once the tweak has been deployed and we can see the difference.
END UPDATE 24/4, 2013
Recommended answer
I have implemented a reverse proxy through an asynchronous HTTP handler for benchmarking purposes (as part of my PhD thesis) and ran into the very same problems as you.
In order to scale, it is mandatory to have processModel set to false and to fine-tune the thread pools. I have found that, contrary to what the documentation about the processModel defaults says, many of the thread pools are not properly configured when processModel is set to true. The maxConnection setting is also important, as it limits your scalability if the limit is set too low. See http://support.microsoft.com/default.aspx?scid=kb;en-us;821268
Regarding your app running out of ports because of the TIME_WAIT delay on the socket, I have also faced the same problem because I was injecting traffic from a limited set of machines with more than 64k requests in 240 seconds. I lowered the TIME_WAIT to 30 seconds without any problems.
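For reference, the TIME_WAIT interval on Windows is commonly controlled by the TcpTimedWaitDelay registry value; a sketch of setting it to 30 seconds follows. This requires administrative rights and typically a reboot before it takes effect, so verify the behaviour on your own Windows version before relying on it:

```csharp
using Microsoft.Win32;

static class TimeWaitTweak
{
    // Sketch: lower the TIME_WAIT interval to 30 seconds via the well-known
    // TcpTimedWaitDelay registry value. Needs admin rights and usually a
    // reboot; test carefully before using this in production.
    public static void Apply()
    {
        Registry.SetValue(
            @"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters",
            "TcpTimedWaitDelay",
            30,
            RegistryValueKind.DWord);
    }
}
```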
I also mistakenly reused a proxy object to a Web Services endpoint in several threads. Although the proxy doesn't have any state, I found that the GC had a lot of problems collecting the memory associated with its internal buffers (String [] instances) and that caused my app to run out of memory.
Some interesting performance counters that you should monitor are the ones related to Queued requests, requests in execution and request time under the ASP.NET apps category. If you see queued requests or that the execution time is low but the clients see long request times, then you have some sort of contention in your server. Also monitor counters under the LocksAndThreads category looking for contention.
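A minimal polling sketch for those counters might look like this; verify the exact category, counter and instance names with perfmon on the target machine, since they vary with the installed .NET version ("__Total__" aggregates over all applications on the box):

```csharp
using System.Diagnostics;

static class AspNetCounterWatch
{
    // Sketch: poll the counters suggested above. Check the exact names in
    // perfmon first; they differ slightly between .NET/IIS versions.
    public static string Sample()
    {
        var queued = new PerformanceCounter("ASP.NET", "Requests Queued");
        var execTimeMs = new PerformanceCounter("ASP.NET", "Request Execution Time");
        var executing = new PerformanceCounter(
            "ASP.NET Applications", "Requests Executing", "__Total__");

        return string.Format("queued={0} executing={1} lastExecTimeMs={2}",
            queued.NextValue(), executing.NextValue(), execTimeMs.NextValue());
    }
}
```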