Version: 1.1
Last Modified: November 2004
This document is the definitive
launching pad for troubleshooting Network Load Balancing (NLB) on Windows®
Server 2003. This is a living document and we
expect updated versions of this document to be released every few months,
taking into account the latest kinds of troubleshooting issues seen in the
field, and the latest troubleshooting techniques.
The document includes the following
sections:
Practices –
General guidelines for troubleshooting NLB problems; includes pointers to
online resources and advice on how to most effectively use this document.
Symptoms
– A list of problem symptoms, such as “intermittent client connectivity,” and
suggestions as to how to identify the root cause of the problem.
Causes
– A list of problem root causes and their potential solutions.
This section has the following
sub-sections:
Suggestions for Using this
Document
1. It’s a good idea to gain
background knowledge about NLB before embarking on troubleshooting. Consult the
section Background Information for
pointers to NLB white papers and the NLB FAQ.
2. Gather information from the NLB
computers – from the System Event Log, NLB Manager, and “nlb.exe display”. See the section Overview of
Available Tools for details.
3. Navigate down the list of symptoms in the Symptoms section, some of which call for additional
investigative actions. If any symptom matches the problem you are seeing,
follow the associated link to the possible root causes, which include
corrective action where possible.
4. If you could not find a matching symptom, try doing
a text search in this document for a relevant keyword or error text –
“convergence”, “connectivity”, “VIP”, “DIP”, “switch”, etc.
5. If still unsuccessful,
sequentially scan the Symptoms and Causes
sections, or view the NLB FAQ (the online location of the FAQ is listed in the
section Background Information). If
relevant information isn’t found, pose your troubleshooting question to the
Microsoft online community newsgroup news:microsoft.public.windows.server.clustering.
http://www.microsoft.com/windows2000/techinfo/howitworks/cluster/nlb.asp
– An overview of NLB. Even though this is written for Windows® 2000 Server, it
largely applies to both Windows 2000 Server and Windows Server 2003.
http://www.microsoft.com/windowsserver2003/techinfo/overview/clustering.mspx
– Technical Overview of Windows Server 2003 Clustering Services
http://www.microsoft.com/windowsserver2003/evaluation/overview/technologies/clustering.mspx – What’s New in Clustering Technologies
http://www.microsoft.com/technet/prodtechnol/windowsserver2003/technologies/clustering/nlbfaq.mspx
http://www.microsoft.com/technet/prodtechnol/windowsserver2003/technologies/clustering/default.mspx
The above site provides a collection
of documents on the following topics
news:microsoft.public.windows.server.clustering
– Microsoft Communities newsgroup for posting NLB questions
http://www.microsoft.com/windowsserver2003/technologies/clustering/default.mspx
- A collection of links to documents describing clustering services.
http://www.microsoft.com/technet/prodtechnol/windowsserver2003/proddocs/entserver/microsoft_WLBS.asp
– Windows Server 2003 product documentation on NLB.
http://www.microsoft.com/windows/reskits/default.asp
– The Windows Server resource kits.
http://www.microsoft.com/resources/documentation/WindowsServ/2003/all/deployguide/en-us/dpgsdc_overview.asp
– Windows Server 2003 Deployment Guide: Planning Server Deployments.
NLB Specific Chapters:
Designing Network Load Balancing
Deploying Network Load Balancing
http://www.microsoft.com/business/reducecosts/efficiency/consolidate/msa.mspx
– Microsoft® Systems Architecture (MSA) is a technology architecture that has
been rigorously tested and proven in a partnered lab environment to provide
exceptional planning and implementation guidance.
Windows Load Balancing Service Does Not Work on Token Ring
Windows 2000 Interoperability Between MSCS and NLB
Using Terminal Server with Windows Load Balancing Service
Using Crossover Cable Causes Load Balancing Not to Work
Testing NLB with Homer Shows All Traffic Handled by a Single Host
System Error 52 When You Connect to an NLB Cluster Name
Support WebCast: Network Load Balancing in Microsoft Windows 2000
Support WebCast: Microsoft Windows Terminal Services: How to Configure
PRB: Application Center 2000 Replicates NLB Equal Load Weight Setting as Load Weight 50
PRB: Address Conflict When You Change an Application Center NLB Cluster
PRB: Adding a Cluster Member May Delete Existing IP Addresses on the Target Server
PRB: "550 Quoted Name Does Not Match IP Address" SMTP Error Message
Configuring Network Load Balancing
Only TCP/IP Should Be Bound to Virtual Network Adapter in WLBS Host
NLB Operations Affect All Network Adapters on the Server
Network Load Balancing Connection to a Virtual IP Address Not Made Across a Switch
Load Balanced Service May Not Work Properly With IP Fragmentation
L2TP Sessions Lost When Adding a Server to an NLB Cluster
IP Address Conflict Switching Between Unicast and Multicast NLB Cluster
IP Address Assignment for NLB with Multiple Network Adapters
INFO: Using NIC Teaming Adapters with Network Load Balancing May Cause Network Problems
How WLBS Handles the Dedicated IP Address
HOW TO: Install Network Load Balancing Service That Was Previously Uninstalled in Windows 2000
HOW TO: Configure Network Load Balancing Parameters in Windows 2000
HOW TO: Configure an IP Address for NLB with One Network Adapter
How to Configure WLBS with Multiple Virtual IP Addresses
How to Configure HTTPMon to Monitor NLB or WLBS Web sites
How NLB Hosts Converge When Connected to a Layer 2 Switch
FIX: Message Queuing Messages Not Validated with Network Load Balancing
Description of Network Load Balancing Features
Configuration Options for WLBS Hosts Connected to a Layer 2 Switches
Client Sessions May Be Lost While Accessing a Web Farm Program
Cannot Use Wlbs.exe Remote Control Commands From Load Balanced VPN Servers
"NLB Failed to Start" Error Message on Windows 2000 If NLB Is Not Installed
WLBS Cluster Is Unreachable from Outside Networks
NLB Cluster Does Not Converge When the MTU Size Is Less Than the Default Value
HOW TO: Set Up TCP/IP for Network Load Balancing in Windows Server 2003
HOW TO: Configure Network Load Balancing Parameters in Windows Server 2003
Cannot Ping IP Addresses After You Enable Network Load Balancing on Network Adapter
"RPC Server Is Unavailable" Error Message When You Connect to NLB Cluster Host through NLB Manager
The System Event Log often contains messages that can provide
important clues as to the cause of a problem. One of the first steps when
troubleshooting should be to view the event log for entries generated by NLB.
Several events were added to NLB in Windows Server 2003 with a view towards
troubleshooting.
NLB Manager, a GUI NLB configuration tool that
was added in Windows Server 2003, is useful for detecting configuration
mismatches among cluster nodes. NLB Manager can be installed
on Windows® XP or later. For a non-server OS, you access the binary by
installing the Windows Server 2003 Administration Pack, located in the i386
directory on the Windows Server 2003 media (run AdminPak.msi).
NLB Manager will attempt to connect
to each of the specified computers and correlate the cluster configuration
information from all of the nodes. It will report:
· Configuration mismatches, such as mixed operation modes (unicast vs.
  multicast) or mismatched port rules
· A variety of common errors; for example, a cluster IP address missing from
  the TCP/IP configuration
· The convergence state on each computer. A healthy cluster will have all
  nodes in the Active state
Configuration mismatches often
result in one or more nodes stuck perpetually in the Converging state. Consult
the Help and
The Microsoft® Network Monitor is an administrative tool for advanced
debugging. Run Network Monitor on the individual hosts as well as on test
clients to capture packet logs. For information on how to install and use it,
search for Network Monitor in the Help and
When investigating connectivity
problems, the following progression is recommended because it starts with simple
tests and moves toward those that have additional dependencies:
For each case above, start simple
and add complexity after verifying a test passes:
Nlb.exe has a couple of new diagnostic commands that can be used to assist in
troubleshooting NLB deployments. They
include:
Because the state of
port rules can change dynamically through administrative operations such as
enable, disable and drain, this command queries the NLB kernel-mode driver
to retrieve the current state, which may be transient. This command also
returns rudimentary packet statistics; however, these statistics are reset each
time the load distribution changes, in a manner that may not be entirely
consistent across hosts. Therefore, do not use them
to make absolute determinations concerning the balance of load in the system.
Consult the nlb.exe help (nlb.exe
/?) for more information and command line syntax.
Often customers want to know how
well the cluster will scale as machines are added to the cluster. Will the performance
scale linearly? If not, how much capacity does one get by adding another
machine? The scaling performance of an NLB cluster varies with the
characteristics of the service being load balanced, so there are no hard and
fast answers. The NLB FAQ (see the Background
Information section for a link) shows the scaling factor for an example.
This section offers tips on how to
collect metrics so that you can assess the scaling factor for your load
balanced application. To do this you must gather performance metrics from the
machines under a variety of test conditions. Typically you will want to know
how many requests/sec your load balanced application can handle as the number
of machines in the cluster increases.
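One simple way to express the result of such measurements is as a fraction of ideal linear scaling. The following is a minimal sketch only: the function name and the sample figures are hypothetical and are not taken from the NLB FAQ or from any real cluster.

```python
def scaling_factors(throughput_by_hosts):
    """Map host count -> measured throughput as a fraction of linear scaling.

    throughput_by_hosts: {number_of_hosts: requests_per_second}; it must
    contain an entry for a single host, which serves as the baseline.
    """
    single_host = throughput_by_hosts[1]
    return {
        hosts: rps / (hosts * single_host)
        for hosts, rps in sorted(throughput_by_hosts.items())
    }


if __name__ == "__main__":
    # Hypothetical measurements (requests/sec), not real NLB results.
    measured = {1: 1000.0, 2: 1900.0, 4: 3600.0}
    for hosts, factor in scaling_factors(measured).items():
        print(f"{hosts} host(s): {factor:.2f} of linear scaling")
```

A factor of 1.00 means throughput grew in proportion to the number of hosts; lower values show how much capacity each added host actually contributed.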
For your test setup you will need a
client tool, running on one or more client machines, to generate load for your
load balanced application. In addition you need a consistent way to offer load
to the cluster and assess how much load is offered during a given test run.
Typically, client tools offer load from a pool of threads. The amount of load
offered is usually controlled by a handful of settings such as pool size and
sleep time between requests. These properties are usually fixed for the duration
of a test run. Each thread transmits a request and synchronously blocks until
either the response is received or the attempt errors out.
Such tools can’t provide a fixed
load rate across data runs. For example, if a change were to increase the
latency of the requests this would cause a drop in total throughput (and
request rate). One compensates for this on the client side by increasing the
size of the pool as latency increases. However, changes between data runs must
be made carefully so that a comparison across data runs reflects changes in the
performance on the server rather than changes to the clients or the ambient
environment.
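The dependence of offered load on latency can be made concrete with a small model of a thread-pool client. This is a sketch under the assumption that every thread repeats a cycle of one blocking request followed by a fixed sleep; the function names are illustrative and do not correspond to any particular load tool.

```python
import math


def offered_rate(pool_size, latency_ms, sleep_ms):
    """Requests/sec offered by a closed-loop client.

    Each thread completes one request every (latency + sleep) milliseconds,
    so the pool as a whole offers pool_size requests per cycle.
    """
    return pool_size * 1000 / (latency_ms + sleep_ms)


def pool_size_for_rate(target_rate, latency_ms, sleep_ms):
    """Smallest thread pool that sustains target_rate requests/sec."""
    return math.ceil(target_rate * (latency_ms + sleep_ms) / 1000)


if __name__ == "__main__":
    # Illustrative numbers only: 50 threads, 60 ms sleep between requests.
    print(offered_rate(50, latency_ms=40, sleep_ms=60))
    print(offered_rate(50, latency_ms=90, sleep_ms=60))
    print(pool_size_for_rate(500, latency_ms=90, sleep_ms=60))
```

In this model, the same 50-thread pool offers less load once latency rises from 40 ms to 90 ms, and the pool must grow to keep the request rate at 500 requests/sec.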
It is customary to use some other
metric, such as server CPU utilization, as a guide in testing. The idea is to
compare the results of two data runs that have the same CPU utilization. (Assuming
operation in a linear region, one can also apply a scaling rule to the results to
compensate for small differences in CPU utilization across data runs.) Often
this is done by offering the server as much load as it can handle in the two
scenarios, hence driving the CPU utilization close to 100%. But with the
servers operating at peak, non-linear effects are likely to creep in and
distort the results.
Instead the following is
recommended:
Symptoms are categorized into the following
sections. Scan through the topics to identify the closest match, and refer to
the named subsection for more specific diagnosis as well as a discussion of
more specific symptoms.
Problems when Performing a
Configuration and Management Operation – A configuration or management
operation, such as binding NLB, adding a port rule or stopping a cluster, did
not succeed.
No Connectivity to VIPs – A cluster has been set up on one or more computers,
but the cluster is entirely unresponsive. No requests addressed to the VIPs
(Virtual IP Addresses or cluster IP Address) are answered, whether from clients
on the same LAN or across one or more routers. Note the distinction between
this topic and the next two topics.
Intermittent connectivity to VIPs
– Clients are seeing intermittent connectivity to the VIPs. That is, any one
client experiences intermittent connectivity. This is distinct from the next
topic, where different clients see different behavior.
Some clients can connect to VIPs
but not others – In this topic the connectivity problems are associated
with specific clients or perhaps the locality of the clients relative to the
cluster.
Cannot connect to or from DIP – This
topic discusses connectivity problems for traffic addressed to a specific DIP
(dedicated IP address) or originating from the DIP.
Uneven load balancing or poor
performance – All clients can connect all of the time, but load is not balanced
evenly among the nodes in the cluster. Or there are other performance problems
such as slow response or low throughput.
Problems relating to client
authentication – Clients are having problems authenticating to the virtual
service via SSL or Kerberos. Note: problems connecting to NLB nodes for
administrative purposes (for example, via NLB Manager or WMI) are covered in a
previous topic.
Problems relating to session
persistence – The application or protocol it uses (SSL, VPN, IPSec, etc.) requires some form of session persistence that
is not being preserved.
Problems relating to convergence
– Cluster nodes converge into separate clusters, or one or more nodes remain
in the “converging” state.
Problem specific to Application X
or Protocol Y – A problem specific to a particular application (such as
read-only file shares) or protocol (such as SSL).
This section covers symptoms for problems
involving configuration or management operations (such as binding NLB, adding a
port rule or stopping a cluster) not having the intended effect. This includes
using NLB Manager, WMI, Network Configuration UI, nlb.exe or wlbs.exe.
Note: NLB needs to be bound to one
or more network adapters on each cluster member. This is done by using NLB
Manager (recommended) or through the Network Configuration Manager.
Possible Cause:
Possible Causes:
Related Symptoms:
Further Diagnosis:
First verify that you have basic
connectivity to the host by “pinging” the host using ICMP echo (the ping.exe
command). If ICMP is not enabled on your network, you may need to use other
mechanisms such as “net view”. If you do not have IP-level connectivity to the
computer on which NLB is to be set up, then the problem is outside the scope of
this troubleshooting document.
The next step is to verify that you
are able to establish a WMI session with the host (NLB Manager uses WMI to
remotely configure a host, and WMI in turn uses RPC). Use the following steps:
If the above steps succeed, it means
that you are able to establish WMI sessions to the host.
If you have IP-level connectivity
but are not able to establish any WMI session, possible causes are explored in
the following sections:
If you can view Microsoft cimv2 classes but NLB Manager still fails,
see the following sections for possible causes:
Note: Remote control is a legacy feature in Windows Server 2003. It has
security vulnerabilities and its use is discouraged.
Further
Diagnosis:
Possible
Causes:
Possible Cause:
When there is at least one port rule
that applies to a port-specific cluster IP address (as opposed to All IP
addresses; port-specific cluster IPs are new in Windows
Server 2003), none of the port rules for that cluster, including those
that apply to All IP addresses, will show up as instances of the aforementioned
classes. This is the intended behavior. The port rules will instead show up
as instances of the new “MicrosoftNLB_PortruleEx”
class. This new class is backwards compatible, meaning that it will work even
when there is no port rule that applies to a specific cluster IP address. Use
of this new class is strongly recommended under all conditions.
Possible Cause:
Related Symptoms:
Possible
Causes:
Use NLB Manager to
avoid this kind of configuration mistake.
When configuring an NLB cluster,
basic connectivity tests need to be performed to verify that the hosts are set
up properly. This section describes symptoms commonly seen during initial
testing when a connectivity problem is likely to be encountered. While
troubleshooting connectivity problems, ensure that you test from a client
computer that is not a member of the cluster. If the following symptoms do not
address your problem, you will need to use Network Monitor (netmon) to
capture the network activity on both client and servers while reproducing the
problem. These captures can be analyzed to pinpoint the cause of the problem.
Possible Cause:
Possible Causes:
Further Diagnosis:
Possible Cause:
Possible Causes:
Further Diagnosis:
To diagnose these problems, try the
following:
Possible Causes:
· The load-balanced service on the host (or hosts) is not running, or is
  misconfigured.
Possible Causes:
· The switch to which the cluster hosts are connected may have learned the
  cluster MAC address, though this is rare. This can cause TCP traffic to be
  delivered to the wrong NLB host, resulting in a connection reset (unicast
  mode only). See the section Switch is learning the MAC Address for details.
· The switch to which the cluster hosts are connected is a layer 3 switch.
  NLB requires layer 2 switching. See the section Switch is operating in
  Layer-3 mode for details.
· The cluster has been partitioned, so multiple and/or incorrect hosts respond
  to client requests, resulting in a connection reset.
· If this is web traffic, HTTP keep-alives may be enabled on the web servers.
  In this instance, TCP resets are the expected behavior.
· A failure during the bind process may have forced NLB to use a less reliable
  mechanism to track TCP connections. This can cause multiple NLB hosts to
  service the same TCP connection, resulting in a connection reset. Check the
  event log for a related warning.
· Too many active TCP connections may have caused NLB to exhaust the resources
  used to track TCP connections and reliably ensure connection affinity. This
  can cause multiple NLB hosts to service the same TCP connection, resulting
  in a connection reset. Check the event log for a related warning.
· Network packet loss and/or delay that causes frequent TCP SYN
  retransmission, along with changes in cluster membership, can cause multiple
  NLB hosts to accept the same TCP connection, resulting in a connection
  reset.
Possible Causes:
· The switch to which the cluster hosts are connected may have learned the
  cluster MAC address, though this is rare. This can cause TCP traffic to be
  delivered to the wrong NLB host, resulting in a connection reset (unicast
  mode only). See the section Switch is learning the MAC Address for details.
· The switch to which the cluster hosts are connected is a layer 3 switch.
  NLB requires layer 2 switching. See the section Switch is operating in
  Layer-3 mode for details.
Possible Causes:
· The act of binding NLB to the network adapter may have failed either in NLB,
  TCP/IP or some other protocol or intermediate driver. Check the event log
  for a related error. NLB will fail if:
  o The network adapter is not 802.3 compliant
  o The network adapter does not support programmatically changing its MAC
    address (unicast mode only)
  o It fails to allocate a required resource
· An IP address conflict on the network caused by this host may be hampering
  its connectivity
· The network adapter to which NLB is bound does not support NDIS multi-packet
  receives. NLB has set a performance bar that requires this functionality in
  the network adapter miniport drivers. If it is not present, NLB drops all
  incoming traffic
· There may be some other problem in the network unrelated to NLB
· The switch to which the cluster hosts are connected may have learned the
  cluster MAC address on another switch port, though this is rare. This will
  intermittently prevent traffic from reaching this host (unicast mode only).
  See the section Switch is learning the MAC Address for details
Possible Causes:
· An IP address conflict on the network caused by this host may be hampering
  its connectivity
· If a drain, disable, stop or drainstop operation has been performed on the
  host, convergence must complete before its share of the client traffic will
  be taken on by other cluster members. A failure to complete convergence may
  therefore be denying service to some clients during this process.
Ensure that the problem indicates that
the client is unable to reach the service. For example, consider the case of a
client requesting service from IIS and receiving an HTTP 500 error. This is not
a connectivity problem; the client connected to IIS, but the service was not
able to respond. Proceed only if there was no such communication with the
service.
A connectivity problem of this sort
is almost always caused by one or more hosts malfunctioning in the cluster,
rather than being a problem with a specific client. To determine the host on which
to focus troubleshooting perform the following:
This symptom applies to Windows
2000 Server only. When a cluster membership change occurs, the control and data
channels for a specific client can end up on different hosts. The result is
that these clients have no connectivity to the VIP. See the section IPSec Problems
for more information.
The following possible causes apply
only if the relevant port rule has ‘None’ client affinity, or cluster
membership is changing frequently during the investigation. Otherwise the
symptom would be that described in section Some
clients can get service through the VIP but others can’t.
Possible Causes:
Possible Causes:
This troubleshooting item assumes that there is client connectivity to
the VIP. If this isn’t the case see the section No connectivity to VIPs to troubleshoot the
VIP first. Once that is resolved, return to this section if you continue to
have problems with the dedicated IP address (DIP).
Possible Causes:
Possible Causes:
Possible Causes:
Possible Causes:
For this scenario we assume that the
hosts in question are in the ‘Converged’ or ‘Converging’ state and the relevant
port rule is ‘Enabled’ on them. Before proceeding see the section Overview of Available Tools for instructions on how to verify that this is the case.
Note: NLB does not dynamically adjust the load distribution, nor does it make
distribution decisions on a per-connection basis. NLB statistically maps
incoming TCP connections to hosts, taking into account the statically
configured load weights. If there are relatively few TCP connections (fewer than
10 per host), or single affinity is enabled and there are relatively few
clients (fewer than 10 per host), uneven load balancing is to be expected.
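The effect of a small client population on a static mapping can be illustrated with a short simulation. This is a sketch only: it assumes equal load weights and uses a generic SHA-256 hash as a stand-in, which is not NLB's actual mapping algorithm. The client addresses are made up.

```python
import hashlib
from collections import Counter


def pick_host(client_ip, host_count):
    """Stand-in static mapping: hash the client IP, modulo the host count.

    This is NOT NLB's real algorithm; it only mimics a fixed, statistical
    assignment of clients to hosts with equal load weights.
    """
    digest = hashlib.sha256(client_ip.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big") % host_count


def load_per_host(client_ips, host_count):
    """Count how many clients land on each host (index 0..host_count-1)."""
    counts = Counter(pick_host(ip, host_count) for ip in client_ips)
    return [counts.get(host, 0) for host in range(host_count)]


def fake_clients(n):
    """Generate n distinct, made-up client addresses."""
    return [f"10.0.{i // 250}.{i % 250 + 1}" for i in range(n)]


if __name__ == "__main__":
    for n in (12, 4000):
        counts = load_per_host(fake_clients(n), host_count=4)
        print(f"{n} clients -> per-host counts {counts}")
```

With only a dozen clients, the per-host counts typically differ by a large fraction of the average. With thousands of clients, the relative differences shrink, which is the statistical behavior the note describes.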
Possible Causes:
Possible Causes:
Possible Causes:
Possible Cause:
Possible Cause:
This section describes problems
concerning a client’s inability to authenticate with a service running on an
NLB cluster. The first step is to attempt to reproduce the problem on a
single-node cluster. If the problem reproduces there, it is most likely unrelated
to NLB. If the problem reproduces only with a multi-node NLB cluster, the
specific authentication scheme may not have been set up for use in a clustered
environment. For example, Kerberos
does not work by default in an NLB cluster. See the section Kerberos authentication
isn’t working through NLB for information on how to configure Kerberos to
work with NLB.
This section lists problems that
occur when a client establishes a session with a particular host but subsequent
connections are directed to a different host, causing intermittent
connectivity. NLB has very limited support for session persistence: single
affinity and special support for preserving L2TP and IPSec
sessions.
Single affinity will preserve a
session from a particular IP address to a particular host as long as there are
no changes in the set of hosts that belong to the cluster. Because
load distribution and affinity are closely related, see also “How do I configure
my cluster to handle load non-uniformly?” in the NLB FAQ, which discusses the
impact of changes to load weights.
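Why a membership change can break single affinity can be seen with a toy mapping from client IP to host. The sketch below is an assumption-laden illustration: it uses a generic hash-modulo mapping, not NLB's real algorithm, so the exact fraction of clients that move applies to this toy mapping only. Host names and client addresses are hypothetical.

```python
import hashlib


def pick_host(client_ip, hosts):
    """Stand-in for single affinity: the chosen host depends only on the
    client IP and the current member list. NOT NLB's real algorithm."""
    digest = hashlib.sha256(client_ip.encode("ascii")).digest()
    return hosts[int.from_bytes(digest[:8], "big") % len(hosts)]


def remapped_fraction(client_ips, hosts_before, hosts_after):
    """Fraction of clients whose host differs between two member lists."""
    moved = sum(
        1
        for ip in client_ips
        if pick_host(ip, hosts_before) != pick_host(ip, hosts_after)
    )
    return moved / len(client_ips)


if __name__ == "__main__":
    clients = [f"192.0.2.{i}" for i in range(1, 201)]  # made-up addresses
    members = ["host1", "host2", "host3", "host4"]     # hypothetical names
    print("unchanged membership:", remapped_fraction(clients, members, members))
    print("host4 removed:", remapped_fraction(clients, members, members[:3]))
```

As long as the member list is unchanged, every client keeps its host. Once a host leaves, some clients are necessarily sent elsewhere, and any session state held on the original host is no longer reachable for them.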
Windows Server 2003 only: L2TP and IPSec sessions are preserved even if load
weights or cluster membership change, as long as the host that owns the
sessions remains in the cluster.
Possible Cause:
Possible Cause:
To diagnose these problems, try the
following:
Possible Causes:
o Hosts with a conflicting number of port rules
o Hosts with conflicting port rule settings (ranges, protocols, affinities,
  etc.)
o Multiple hosts utilizing the same host ID
o Hosts with conflicting Bi-Directional Affinity (BDA) settings. Note: BDA is
  administered by ISA Server. Check your ISA configuration.
o Mismatched cluster modes of operation (for example, unicast vs. multicast)
Convergence is described in the
section “How Does NLB Cluster Convergence Work?” in the NLB FAQ. This symptom
is caused by missed heartbeats.
o The switch is configured to block broadcast and/or multicast traffic. NLB
  heartbeats are broadcast or multicast (depending on the mode of operation)
  to all hosts.
o The network is overloaded, resulting in transient, but consistent heartbeat
  loss between subsets of the cluster.
o If cluster hosts are connected to different switches (either connected
  logically by a VLAN, or perhaps in a redundant switch configuration),
  communication between the multiple switches may be impaired, resulting in
  the total or partial loss of connectivity between subsets of the cluster.
NLB works with most TCP- and
UDP-based protocols, but makes no guarantees about sending the client back to
the same host across multiple connections or requests. Thus the application or
service must meet one of the following criteria:
Furthermore, to achieve good
load-balancing for TCP-based protocols, the client must not open a single TCP
connection and use request pipelining to the exclusion of multiple
connections. NLB load-balances TCP connections, not application requests.
See the section Kerberos
authentication for load balanced web sites for instructions on how to configure
Kerberos to work on computers that are part of an NLB cluster.
See "Does NLB support
applications using .NET Remoting?" in the NLB
FAQ.
See "Does NLB support
applications COM+ applications?" in the NLB FAQ.
See "Can NLB load-balance
NetBIOS applications?" in the NLB FAQ.
Under certain circumstances a client
can take approximately 8 to 10 seconds to establish a session when connecting
to SQL Server through NLB. This occurs when the destination is an IP address,
rather than a name, and the client uses Windows authentication.
As part of the authentication
process, the client first attempts to authenticate using Kerberos. Kerberos
requires the identity of both client and server; the client attempts to resolve
the identity of the server as part of this process. If the client can’t
associate a name with the IP address, it reverts to NTLM for authentication,
which does not require server identity.
The client’s failed attempt to resolve the
IP address to a name is what takes so long. To work around this
issue, edit the hosts file on the client (it is in %windir%\system32\drivers\etc) to map a
fake name to the shared NLB IP address. With this mapping in place, the client
skips the slow name resolution step, Kerberos fails immediately, and
authentication proceeds with NTLM.
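To confirm that reverse name resolution is the source of the delay, one can time the lookup from the client before and after editing the hosts file. The following is a rough sketch using Python's standard `socket.gethostbyaddr`; the helper name, the address 192.0.2.10, and the name `nlb-sql-vip` are placeholders invented for illustration.

```python
import socket
import time

# A hosts file entry has the form "<IP address> <name>", for example
# (hypothetical values):
#   192.0.2.10    nlb-sql-vip


def time_reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return (name_or_None, seconds_taken) for a reverse lookup of ip.

    resolver is injectable so the helper can be exercised without a network.
    """
    start = time.perf_counter()
    try:
        name = resolver(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        name = None
    return name, time.perf_counter() - start


if __name__ == "__main__":
    vip = "192.0.2.10"  # placeholder; substitute the shared NLB IP address
    name, seconds = time_reverse_lookup(vip)
    print(f"{vip} -> {name!r} in {seconds:.2f} s")
```

A lookup that fails only after several seconds is consistent with the delay described above; after the hosts file mapping is added, the same lookup should return the fake name almost immediately.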
NLB is installed by default on
Windows Server 2003; however, it may be uninstalled manually, in which case it
must be manually re-installed before NLB can be bound to any adapter locally or
remotely (via NLB Manager). To re-install NLB, click Install in the Network
Connections properties dialog box, select Service and click Add, select
“Network Load Balancing”, and then click OK. Restart the system and then
configure NLB properties.
This section lists problem causes
that prevent a management console (running NLB Manager, a WMI script or a third
party management application that uses WMI) from administering a cluster
member.
Check that the status of the “Remote
Procedure Call (RPC)” service is “Started”: in Control Panel, double-click
Services. This service is started by default and is essential for
administration of the host using NLB Manager.
There is a firewall between the NLB
Manager client and the host: Network Load Balancing Manager uses Windows
Management Instrumentation (WMI) which in turn has a dependency on Remote
Procedure Call (RPC) and Distributed Component Object Model (DCOM). This can
present problems when trying to use Network Load Balancing Manager to
administer servers that are on the other side of a firewall from the Network
Load Balancing Manager computer. By default, DCOM can randomly use a wide range
of ports. Firewalls, on the other hand, are typically configured to allow
traffic from only a limited number of specific ports. Therefore, in order to
use NLB Manager from behind a firewall to manage servers on the other side of
the firewall, you must first configure DCOM to use only a specific range of
ports. You must then configure your firewall to allow traffic through those
ports. For more information, see the link to the whitepaper, “Distributed COM
with Firewalls” at http://www.microsoft.com/com/wpaper/dcomfw.asp.
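Before changing firewall rules, it can help to check from the management computer which TCP ports on the host are actually reachable. The sketch below is a generic TCP connect probe, not an NLB or WMI tool. It assumes the RPC endpoint mapper listens on TCP port 135; the host name and the DCOM port range shown are hypothetical and should be replaced with the range you configured.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, name not resolved
        return False


if __name__ == "__main__":
    host = "nlbhost1.example.com"   # hypothetical cluster host name
    rpc_endpoint_mapper = 135       # well-known RPC endpoint mapper port
    dcom_ports = range(5000, 5021)  # assumed DCOM range; use your own
    for port in [rpc_endpoint_mapper, *dcom_ports]:
        state = "open" if tcp_port_open(host, port) else "blocked or closed"
        print(f"{host}:{port} {state}")
```

A port reported as blocked or closed may simply have no listener at that moment, so treat the output as a hint about the firewall rather than as proof.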
In addition to the steps described
in the white paper, you must also configure the firewall to allow ICMP echo
requests. Alternatively, instead of allowing ICMP echo requests, you can run NLB
Manager with the "noping" option. If you
use this option, you will experience a delay if NLB Manager attempts to contact
a server that is not available. For more information on using the "noping" option, see the Network Load Balancing
documentation. You can find this documentation by performing the following
procedure on any computer running a product in the Windows Server 2003 family:
The WMI client may not possess
Administrator credentials on the local host, which are needed to access and
interact with the instances of WMI classes in the root\microsoftnlb
namespace.
The “SeLoadDriverPrivilege” privilege is either not present, or
present but disabled, in the access token of the WMI client process/thread. To
resolve this problem, enable the “SeLoadDriverPrivilege”
privilege in the access token before calling WMI (IWbemLocator::ConnectServer
method) to connect to the computer. For information about enabling privileges
and access tokens, refer to
the Windows Server 2003 SDK documentation at “Security” -> “Authorization”
-> “About Authorization” -> “Access Control” -> “Privileges”.
A proxy ARP reply is an ARP reply
sent by one network entity on behalf of another. Proxy ARP replies are easily identified by
inspecting the source address information of the ARP reply. In a conventional Ethernet/IP ARP reply, the
ARP source hardware address is the same as the source address of the Ethernet
frame. In a proxy Ethernet/IP ARP reply,
the two source addresses differ, implying that one host is answering on behalf
of another.
NLB does not explicitly generate
proxy ARP replies; however, for other reasons critical to the NLB
load-balancing algorithm, NLB spoofs ARP replies, which gives them the
appearance of proxy ARP replies. In multicast mode,
NLB must spoof the ARP reply to map requests for virtual IP addresses to the
shared NLB multicast cluster MAC address; therefore, the source address of the
Ethernet frame (the physical MAC address of the responder) differs from the
source address of the ARP frame (the multicast cluster MAC address). In unicast mode, NLB is forced to “mask” the
source address of all outgoing frames in order to prevent switches from
learning the location of the shared NLB cluster MAC address. Therefore, the
source address of the Ethernet frame (the masked source MAC address) differs
from the source address of the ARP frame (the cluster MAC address). It is for these reasons that NLB clusters
appear to generate proxy ARP replies.
First of all, note that some router
vendors do support proxy ARP replies, but as a configuration option that is
turned off by default for security purposes, so be sure to check this
first. If a router does not accept proxy
ARP replies, there are two courses of action: (1) replace the router with one
that does accept proxy ARP replies, or (2) force NLB to generate non-proxy ARP
replies.
The first option is straightforward
and solves the issue at hand, but may be too costly to consider. The second
option makes NLB work with these routers, but has consequences, which are
discussed below. To force NLB to generate non-proxy ARP
replies, several changes need to be made to the NLB configuration and topology
of the network. Refer to the section “Even though it should, the switch
isn’t learning the cluster MAC address” in this document for information on
configuring an NLB cluster to operate in this manner.
If all hosts of an NLB cluster are
plugged into a hub and uplinked to a switch, it is
desirable for the switch to learn the location of the NLB cluster MAC address
and associate it with the port to which the hub is uplinked. The advantage of this configuration is that
it is possible to limit the flooding of cluster-bound traffic on the
switch. To enable an NLB cluster in this
mode:
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\WLBS\Parameters\Interface\{GUID}\MaskSourceMAC = 0
Where {GUID} is the
GUID of the particular NLB instance being configured. If more than one NLB
instance exists, then you can use the “ClusterIPAddress”
registry value under each {GUID} key to differentiate between instances.
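The lookup by “ClusterIPAddress” can be sketched as follows. This is a minimal sketch that works on an in-memory dictionary rather than the live registry; the function name, the dictionary layout, and the GUIDs in the usage note are hypothetical. On a real host the dictionary could be populated with Python's standard winreg module.

```python
NLB_INTERFACE_ROOT = (
    r"HKEY_LOCAL_MACHINE\System\CurrentControlSet"
    r"\Services\WLBS\Parameters\Interface"
)


def find_interface_keys(instances, cluster_ip):
    """Return the registry key paths of the NLB instances whose
    ClusterIPAddress value equals cluster_ip.

    instances maps a "{GUID}" string to a dict of the registry values
    found under that instance's key (hypothetical layout for this sketch).
    """
    return [
        NLB_INTERFACE_ROOT + "\\" + guid
        for guid, values in sorted(instances.items())
        if values.get("ClusterIPAddress") == cluster_ip
    ]
```

Given two instances with cluster IP addresses 10.0.0.1 and 10.0.1.1, calling `find_interface_keys(instances, "10.0.1.1")` returns only the key path of the second instance, which is the key to edit.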
Note that there are consequences to
this configuration choice, which are the result of the NLB hosts being homed to
a hub. While it is true that NLB
requires that cluster-bound traffic be replicated to all cluster hosts, the
return traffic from the cluster to the clients needn’t be replicated. A hub
floods inbound and outbound traffic to all ports, resulting in unnecessary
bandwidth consumption.
This issue can be circumvented by
configuring the network stack on the NLB hosts to direct outbound traffic
through a different adapter that is connected directly to a switch. However, the lack of outgoing traffic through
the hub may cause the switch to “forget” (time out) the cluster MAC address
association, which once again results in the flooding of cluster-bound traffic on
the switch until an ARP response refreshes the address tables on the switch.
The Network Monitor driver is a network
protocol driver that resides logically above NLB in the network protocol stack.
It is therefore important to realize that a
sniff captured on an adapter to which NLB is bound will show outbound
packets before NLB has seen them and inbound packets after NLB has seen
them. As a result, a sniff taken on an NLB
host will differ from a sniff taken promiscuously from another host on the same
LAN.
By default, in unicast mode, it is
true that NLB masks the source MAC address of outgoing frames to prevent the
switch from learning the cluster MAC address and associating it with a specific
switch port. However, a Network Monitor
sniff from an NLB host will show outbound packets with a source MAC address
equal to the NLB shared cluster MAC address; i.e., they appear to be
“unmasked”, which would allow the switch to learn the cluster MAC address. It is important to realize, however, that
because Network Monitor is seeing the packets before NLB, the outbound packets
simply have not been masked yet; they will be properly masked before being sent
out on the wire. Examining a promiscuous
sniff from a non-NLB host on the same LAN will show that this is true.
Likewise, in multicast mode, a Network
Monitor sniff on an NLB host will show that all source and destination MAC
addresses are the physical MAC address of the NLB host. However, a promiscuous sniff from a non-NLB
host on the same LAN will show that the cluster-bound packets are, in fact,
addressed to the NLB shared multicast cluster MAC address. The destination MAC
address is altered by NLB before Network Monitor and TCP/IP see the packet.
When the hosts of an NLB unicast
cluster are connected to a switch, NLB depends on the fact that the switch
never learns the unicast cluster MAC address, and takes special steps to prevent this
from happening. Very rarely, typically due to a failure when performing an NLB
configuration change, a switch does associate the unicast cluster MAC address
with a particular port. This causes intermittent connectivity problems. Consult
the section Intermittent
connectivity to VIPs for how to diagnose this condition. This should be a
transient condition that will correct itself in a few minutes, or the switch
can be reset or the
NLB is not supported when the hosts
are homed to a switch operating at Layer-3. Instead, create a VLAN for all the
nodes in the NLB cluster, and configure that VLAN to operate in Layer-2 mode.
In general, NLB is compatible with
Adaptive Fault Tolerance (AFT) teaming solutions from third-party vendors,
which can be used to provide fault tolerance at the link layer below NLB. However, a specific set of issues can cause
problems that need to be addressed manually.
Note that these problems apply only to the NLB unicast mode of
operation.
First, teaming solutions tend to
assign the MAC address of the team using the MAC address of the designated
primary team member. This is fine in a
non-NLB deployment, but NLB, of course, needs to modify the MAC address of the
adapter to which it is bound in unicast mode.
This is accomplished in the NDIS model by changing a registry setting in
the registry key for the adapter. While
most network adapter drivers honor this method of changing the MAC
address of an adapter, network adapter teaming solutions have historically ignored
this mechanism and do not change the MAC address of the team when this registry
setting is modified. Therefore, to
change the MAC address of the team to the NLB shared cluster MAC address, it is
necessary to manually configure the team MAC address in the teaming software
configuration utility and set it to the NLB cluster MAC address.
Second, some network adapter teaming
vendors utilize a lightweight low-level heartbeat mechanism to verify
connectivity to the network. While this
extra level of fault detection is desirable, unfortunately, it often causes
problems in NLB deployments. The teaming
driver sits below NLB in the NDIS stack, so teaming heartbeats do not traverse
the NLB driver, and hence NLB cannot mask the shared cluster MAC address in
their source address. The
result is that the switch learns the location of the shared cluster MAC address
and directs all cluster traffic to one host only. If the teaming software
provides the option, it may be necessary to turn off this heartbeat mechanism in
the teaming software configuration.
There is a known problem using this network
adapter (driver version unknown). The symptom is that ICMP traffic is delivered
but TCP traffic isn’t: a TCP SYN never makes it from the network to the
TCP stack. Update to the latest driver and retest; if this doesn’t fix the
problem, replace the adapter with a different model.
In Windows Server 2003, NLB has
raised the performance bar for network adapters that it will support. Specifically, NLB requires that adapters to
which it binds support multi-packet receives.
Any packets that are passed to NLB by the adapter through the old
receive indicate path are dropped by NLB.
The vast majority of server-class adapters satisfy this
requirement. However, in the event that
this condition occurs, NLB logs a warning event in the System event log. To resolve the issue, first ensure that the
drivers for the adapter are the latest available from the vendor. If the problem is not rectified with the
latest driver, it will be necessary to replace the adapter with one that
supports the required functionality.
An SSL session is established between
an external client and a server in the NLB cluster by following the SSL
Handshake protocol, at the end of which a session ID is issued by the server
(and sent to the client) to identify the SSL session. Subsequent TCP
connections from the client present this session ID to identify themselves as
being part of the already established SSL session. In order for the SSL session
to be preserved, these subsequent TCP connections must be accepted by the
same server in the NLB cluster that previously issued the session ID. This
is where Single affinity comes in: by definition, Single affinity causes all
TCP connections from a given client IP address to be handled by the same server
in the NLB cluster. In this case, Single affinity causes the initial TCP
connection that carries the SSL Handshake protocol (where the session ID is
issued) and the subsequent TCP connections from the client to be handled by the
same server in the NLB cluster.
However, Single affinity preserves a
session from a particular IP address to a particular host only as long as there are
no changes in the set of hosts that belong to the cluster. If cluster nodes are
added or there is a change in the load weights, session affinity may be lost.
Thus Single affinity provides “best effort” session persistence. The server
application must be able to deal with occasional misdirected sessions by
transparently re-negotiating new sessions.
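The “best effort” nature of Single affinity can be illustrated with a toy model. The sketch below is not NLB's actual algorithm: the hash function, the bucket count of 60, and the round-robin ownership rule are all assumptions chosen for illustration, and every name in it is hypothetical. It shows only the general idea that a client IP address maps to a fixed bucket, and that the owner of a bucket can change when cluster membership changes.

```python
import hashlib
import ipaddress

BIN_COUNT = 60  # assumption for this sketch, not taken from the text above


def bin_for_client(client_ip):
    """Map a client IP address to a bucket using an illustrative hash."""
    packed = ipaddress.ip_address(client_ip).packed
    digest = hashlib.sha256(packed).digest()
    return int.from_bytes(digest[:4], "big") % BIN_COUNT


def assign_bins(hosts):
    """Deal the buckets out to the hosts round-robin.

    This is only a stand-in for the negotiation the cluster performs;
    it moves far more buckets than a real cluster would need to.
    """
    return {b: hosts[b % len(hosts)] for b in range(BIN_COUNT)}


def host_for_client(client_ip, bin_owner):
    """Return the host that handles every connection from client_ip."""
    return bin_owner[bin_for_client(client_ip)]


if __name__ == "__main__":
    before = assign_bins(["A", "B", "C"])
    after = assign_bins(["A", "B", "C", "D"])  # a host joins the cluster
    client = "192.0.2.10"
    print(host_for_client(client, before), host_for_client(client, after))
```

With a fixed set of hosts, repeated calls for the same client address always return the same host. After the fourth host is added, any client whose bucket changed owner lands on a different server and has to renegotiate its session.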
There are no topics in this section
at this time.
There is a router or firewall
between client and server that is filtering out traffic. For a router,
filtering is achieved by setting access control list (ACL) entries to allow or
deny traffic.
NLB clusters homed to Extreme
Networks Layer 3 switches running ExtremeWare 6.x
will experience problems with some IP addresses when NLB is initially bound.
The IP addresses affected are those that were bound to the network adapter prior
to binding NLB. Traffic addressed to or sourced from these IP addresses will be
dropped by the switch for approximately 30 minutes, at which time the entries
are aged out.
The problem is that the original IP
addresses are retained in the IP forwarding database even though these entries
have the incorrect MAC address. Upgrade to version 7.x or higher of ExtremeWare to correct this issue.
In this scenario, a DIP is defined
in the NLB configuration, but the specified IP address has not been added to
TCP/IP. One would not expect to receive a reply to an outgoing ICMP request,
but the reply is indeed seen because of the way that NLB handles ICMP traffic.
In this case the outgoing ICMP request uses a VIP as the source IP address and
the reply is addressed to the VIP. By default NLB passes ICMP traffic up
on all hosts (this traffic is not load balanced), so the reply is seen by
the host that sent the request and the ping succeeds.
Note that UDP or TCP traffic
initiated from the host won’t work in this scenario. As with ICMP traffic, the
outgoing packets will use a VIP as the source address. Replies will be
load-balanced, and the probability is low that the packet will be load-balanced
back to the source host. Adding the DIP to TCP/IP properties will fix this
because the outbound traffic will then be sourced with the DIP.
Properly measuring load distribution
can be very tricky, and proper assessment relies on summing rate counters over
time or resetting level counters when a change to cluster membership is made.
Typically one uses performance counters from the load-balanced application or
service. Alternatively, one can use network counters such as Network
Interface\Packets Received/sec.
Since NLB load balances TCP
connections, use a performance counter primitive that reflects the connection
activity of the service. In IIS, for example, Web Service\Connection
Attempts/sec is an excellent counter. Web Service\GET Requests/sec is not a
good counter to use because multiple GET requests can be pipelined over a
single TCP connection.
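The idea of summing a rate counter over time and comparing the result with the configured load weights can be sketched as follows. This is a minimal sketch under stated assumptions: each host's counter (for example Web Service\Connection Attempts/sec) is sampled at a fixed interval, and the function names and the sample values in the usage note are hypothetical.

```python
def observed_shares(samples, interval_seconds):
    """Convert per-host rate samples into each host's share of the total.

    samples maps a host name to a list of rate readings (events per
    second), one reading every interval_seconds.
    """
    totals = {
        host: sum(rates) * interval_seconds for host, rates in samples.items()
    }
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {host: 0.0 for host in totals}
    return {host: total / grand_total for host, total in totals.items()}


def expected_shares(load_weights):
    """Convert configured load weights into expected fractions of the load."""
    total = sum(load_weights.values())
    return {host: weight / total for host, weight in load_weights.items()}
```

For instance, readings of `[30, 30]` for one host and `[10, 10]` for another give observed shares of 0.75 and 0.25, which can then be compared with `expected_shares({"A": 50, "B": 50})`. Sampling should restart whenever cluster membership changes, as noted above.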
NLB divides up the theoretical
address space of all clients into a fixed number of quanta called bins.
Convergence is the process by which cluster hosts negotiate the ownership of
these bins. Good load distribution equates to having a population of clients
that is well distributed across the bins. When the observed load distribution
differs from the distribution specified by the load weights, and sources of
error have been eliminated as the cause, there remain two possible contributors
to the observed behavior:
When both client and server reside
in the intranet environment, the client can obtain the requested content most
efficiently if the requests bypass the proxy servers. When a URL consists of an
IP address, we have observed that Internet Explorer always uses a proxy server
to reach the destination, even if the destination is an intranet address. This
can cause a small delay as it adds an unnecessary hop for the client. However,
this alone is not likely to be the cause of a significantly slower response.
Download times become extremely slow
when authentication schemes are also imposed in this scenario. We have further
observed that while HTTP/1.1 is used for the request when made directly to the
intranet destination, HTTP/1.0 is used when the request is made via the proxy
server. Thus each resource (for example, a picture in GIF format) is obtained
via an independent TCP connection. Under these conditions page load times of 10
to 50 times the normal load time are not uncommon.
The information contained in this document
represents the current view of Microsoft Corporation on the issues discussed as
of the date of publication. Because Microsoft must respond to changing market
conditions, it should not be interpreted to be a commitment on the part of
Microsoft, and Microsoft cannot guarantee the accuracy of any information
presented after the date of publication.
This document is for informational
purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS
DOCUMENT.