NLB Troubleshooting Overview for Windows® Server 2003

 

Version: 1.1

Last Modified: November 2004

 

1 Introduction

2 Practices

3 Symptoms

4 Causes

 

1 Introduction

This document is the definitive launching pad for troubleshooting Network Load Balancing (NLB) on Windows® Server 2003. This is a living document; we expect to release updated versions every few months that reflect the latest troubleshooting issues seen in the field and the latest troubleshooting techniques.

The document includes the following sections:

Practices – General guidelines for troubleshooting NLB problems; includes pointers to online resources and advice on how to most effectively use this document.

Symptoms – A list of problem symptoms, such as “intermittent client connectivity,” and suggestions as to how to identify the root cause of the problem.

Causes – A list of problem root causes and their potential solutions.

2 Practices

This section has the following sub-sections:

Suggestions for Using this Document

Background Information

Overview of Available Tools

2.1 Suggestions for Using this Document

1.   It’s a good idea to gain background knowledge about NLB before embarking on troubleshooting. Consult the section Background Information for pointers to NLB white papers and the NLB FAQ.

2.   Gather information from the NLB computers – from the System Event Log, NLB Manager, and “nlb.exe display”. See the section Overview of Available Tools for details.

3.   Navigate down the list of symptoms in the Symptoms section, some of which call for additional investigative actions. If any symptom matches the problem you are seeing, follow the associated link to the possible root causes, which include corrective action where possible.

4.   If you could not find a matching symptom, try doing a text search in this document for a relevant keyword or error text – “convergence”, “connectivity”, “VIP”, “DIP”, “switch”, etc.

5.   If still unsuccessful, sequentially scan the Symptoms and Causes sections, or view the NLB FAQ (the online location of the FAQ is listed in the section Background Information). If you still cannot find relevant information, pose your troubleshooting question to the Microsoft online community newsgroup news:microsoft.public.windows.server.clustering.

2.2 Background Information

Technical Whitepapers

http://www.microsoft.com/windows2000/techinfo/howitworks/cluster/nlb.asp – An overview of NLB. Even though this is written for Windows® 2000 Server, it largely applies to both Windows 2000 Server and Windows Server 2003.

http://www.microsoft.com/windowsserver2003/techinfo/overview/clustering.mspx – Technical Overview of Windows Server 2003 Clustering Services

http://www.microsoft.com/windowsserver2003/evaluation/overview/technologies/clustering.mspx – What’s New in Clustering Technologies

FAQ

http://www.microsoft.com/technet/prodtechnol/windowsserver2003/technologies/clustering/nlbfaq.mspx

NLB Communities Page and Newsgroup

http://www.microsoft.com/technet/prodtechnol/windowsserver2003/technologies/clustering/default.mspx

The above site provides a collection of documents on the following topics:

  • NLB Overview and Architecture documents
  • Best Practices
  • Product and Support Links
  • Clustered and Application Solutions
  • Whitepapers and Technical Documents
  • Other NLB Links

news:microsoft.public.windows.server.clustering – Microsoft Communities newsgroup for posting NLB questions

Miscellaneous Links

http://www.microsoft.com/windowsserver2003/technologies/clustering/default.mspx - A collection of links to documents describing clustering services.

http://www.microsoft.com/technet/prodtechnol/windowsserver2003/proddocs/entserver/microsoft_WLBS.asp – Windows Server 2003 product documentation on NLB.

http://www.microsoft.com/windows/reskits/default.asp – The Windows Server resource kits.

http://www.microsoft.com/resources/documentation/WindowsServ/2003/all/deployguide/en-us/dpgsdc_overview.asp – Windows Server 2003 Deployment Guide: Planning Server Deployments.

NLB Specific Chapters:

Designing Network Load Balancing

Deploying Network Load Balancing

http://www.microsoft.com/business/reducecosts/efficiency/consolidate/msa.mspx – Microsoft® Systems Architecture (MSA) is a technology architecture that has been rigorously tested and proven in a partnered lab environment to provide exceptional planning and implementation guidance.

Knowledge Base Articles

Windows 2000 Knowledge Base Articles

Windows Load Balancing Service Does Not Work on Token Ring

Windows 2000 Interoperability Between MSCS and NLB

Using Terminal Server with Windows Load Balancing Service

Using Crossover Cable Causes Load Balancing Not to Work

Testing NLB with Homer Shows All Traffic Handled by a Single Host

System Error 52 When You Connect to an NLB Cluster Name

Support WebCast: Network Load Balancing in Microsoft Windows 2000

Support WebCast: Microsoft Windows Terminal Services: How to Configure

PRB: Application Center 2000 Replicates NLB Equal Load Weight Setting as Load Weight 50

PRB: Address Conflict When You Change an Application Center NLB Cluster

PRB: Adding a Cluster Member May Delete Existing IP Addresses on the Target Server

PRB: "550 Quoted Name Does Not Match IP Address" SMTP Error Message

Configuring Network Load Balancing

Only TCP/IP Should Be Bound to Virtual Network Adapter in WLBS Host

NLB Operations Affect All Network Adapters on the Server

Network Load Balancing Connection to a Virtual IP Address Not Made Across a Switch

Load Balanced Service May Not Work Properly With IP Fragmentation

L2TP Sessions Lost When Adding a Server to an NLB Cluster

IP Address Conflict Switching Between Unicast and Multicast NLB Cluster

IP Address Assignment for NLB with Multiple Network Adapters

INFO: Using NIC Teaming Adapters with Network Load Balancing May Cause Network Problems

How WLBS Handles the Dedicated IP Address

HOW TO: Install Network Load Balancing Service That Was Previously Uninstalled in Windows 2000

HOW TO: Configure Network Load Balancing Parameters in Windows 2000

HOW TO: Configure an IP Address for NLB with One Network Adapter

How to Configure WLBS with Multiple Virtual IP Addresses

How to Configure HTTPMon to Monitor NLB or WLBS Web sites

How NLB Hosts Converge When Connected to a Layer 2 Switch

FIX: Message Queuing Messages Not Validated with Network Load Balancing

Description of Network Load Balancing Features

Configuration Options for WLBS Hosts Connected to a Layer 2 Switch

Client Sessions May Be Lost While Accessing a Web Farm Program

Cannot Use Wlbs.exe Remote Control Commands From Load Balanced VPN Servers

"NLB Failed to Start" Error Message on Windows 2000 If NLB Is Not Installed

WLBS Cluster Is Unreachable from Outside Networks

Windows Server 2003 Knowledge Base Articles

NLB Cluster Does Not Converge When the MTU Size Is Less Than the Default Value

HOW TO: Set Up TCP/IP for Network Load Balancing in Windows Server 2003

HOW TO: Configure Network Load Balancing Parameters in Windows Server 2003

Cannot Ping IP Addresses After You Enable Network Load Balancing on Network Adapter

"RPC Server Is Unavailable" Error Message When You Connect to NLB Cluster Host through NLB Manager

2.3 Overview of Available Tools

System Event Log

The System Event Log often contains messages that can provide important clues as to the cause of a problem. One of the first steps when troubleshooting should be to view the event log for entries generated by NLB. Several events were added to NLB in Windows Server 2003 with a view towards troubleshooting.

NLB Manager

NLB Manager, a GUI NLB configuration tool that was added in Windows Server 2003, is useful for detecting configuration mismatches among cluster nodes. NLB Manager can be installed on Windows® XP or later. On a non-server OS, you obtain the binary by installing the Windows Server 2003 Administration Tools Pack, located in the i386 directory on the Windows Server 2003 media (run AdminPak.msi).

NLB Manager will attempt to connect to each of the specified computers and correlate the cluster configuration information from all of the nodes. It will report:

  • Configuration mismatches, such as mixed operation modes (unicast vs. multicast) or mismatched port rules
  • A variety of common errors; for example, a cluster IP address missing from the TCP/IP configuration
  • The convergence state on each computer. A healthy cluster will have all nodes in the Active state

Configuration mismatches often result in one or more nodes stuck perpetually in the Converging state. Consult the Help and Support Center for more information about NLB Manager.
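The kind of cross-host comparison described above can be illustrated with a short sketch. This is not how NLB Manager itself is implemented; it only shows the idea of collecting one parameter set per host and reporting every parameter whose value is not identical everywhere. The host names and parameter names (such as "mode" and "port_rules") are hypothetical placeholders, and the values are assumed to be hashable.

```python
def find_mismatches(host_configs):
    """Report parameters whose values differ across hosts.

    host_configs maps a host name to a dict of parameter -> value.
    The parameter names are hypothetical; values must be hashable.
    Returns {parameter: {value: [hosts with that value]}} for each
    parameter that is not identical on every host.
    """
    # Collect every parameter name seen on any host.
    params = set()
    for config in host_configs.values():
        params.update(config)

    mismatches = {}
    for param in sorted(params):
        hosts_by_value = {}
        for host, config in host_configs.items():
            # None marks a parameter that is missing on this host.
            value = config.get(param)
            hosts_by_value.setdefault(value, []).append(host)
        if len(hosts_by_value) > 1:
            mismatches[param] = hosts_by_value
    return mismatches
```

For example, passing {"host1": {"mode": "unicast"}, "host2": {"mode": "multicast"}} reports "mode" as mismatched and lists which host holds which value, which is the situation that typically leaves nodes stuck in the Converging state.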

Microsoft® Network Monitor

The Microsoft® Network Monitor is an administrative tool for advanced debugging. Run Network Monitor on the individual hosts as well as on test clients to capture packet logs. For information on how to install and use it, search for Network Monitor in the Help and Support Center in Windows Server 2003.

General Troubleshooting Techniques

When investigating connectivity problems, work through the following progression, which starts with simple tests and moves toward tests that have additional dependencies:

  • Test with ICMP ping
  • Test the load-balanced service using an IP address
  • Test the load-balanced service via a name, invoking name resolution

For each case above, start simple and add complexity after verifying a test passes:

  • Test from the NLB host – If this fails, look for missing IP addresses in the TCP/IP or NLB configurations (NLB Manager does this check). Also check the application’s configuration if it can specify the addresses on which the service listens.
  • Test from a client on the LAN with the NLB hosts – If this fails, look for problems associated with switching. For example, examine the CAM table, which maps MAC addresses to switch ports.
  • Test from a client on a different LAN from the NLB hosts – If this fails, look for problems associated with routing. Examine the ARP table, for example.
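The test progression above can be scripted so that it is run the same way from each vantage point (the NLB host, a client on the same LAN, and a client on a different LAN). The following is a minimal sketch, assuming the load-balanced service is TCP-based and that the system ping command is available. The function names are illustrative, and the exit code of ping is only a rough signal of reachability.

```python
import socket
import subprocess
import sys


def ping(host):
    """Send one ICMP echo using the system ping command."""
    count_flag = "-n" if sys.platform.startswith("win") else "-c"
    try:
        completed = subprocess.run(
            ["ping", count_flag, "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return completed.returncode == 0


def tcp_connect(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port can be opened.

    Passing a name instead of an address also exercises name resolution.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def build_steps(vip, name, port):
    """Order the checks from fewest to most dependencies."""
    return [
        ("ICMP ping", lambda: ping(vip)),
        ("service by IP address", lambda: tcp_connect(vip, port)),
        ("service by name", lambda: tcp_connect(name, port)),
    ]


def run_progression(steps):
    """Run (label, check) pairs in order and stop at the first failure."""
    results = []
    for label, check in steps:
        passed = bool(check())
        results.append((label, passed))
        if not passed:
            break
    return results
```

Calling run_progression(build_steps(vip, name, port)) returns the checks that were attempted; the last entry is the first test that failed, which indicates where to start investigating.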

Nlb.exe has a couple of new diagnostic commands that can be used to assist in troubleshooting NLB deployments.  They include:

  • queryport:  Queries the current state of a specified port rule and returns one of the following designations:
    • Not found:  The specified port rule was not found.
    • Enabled:  The specified port rule is configured to handle incoming load.
    • Disabled:  The specified port rule is configured to handle none of the incoming load.
    • Draining:  The specified port rule is in the process of draining its active connections.

Because the state of port rules can change dynamically through administrative operations such as enable, disable and drain, this command will query the NLB kernel-mode driver to retrieve the current state, which may be transient.  This command also returns rudimentary packet statistics; however these statistics are reset each time the load distribution changes in a manner that may not be absolutely consistent across hosts.  Therefore, one should not attempt to use them to make absolute determinations concerning the balance of load in the system.

  • params:  Queries the current configuration from the NLB driver.  This output is nearly identical to that of “nlb.exe display” with the important distinction that display reads the configuration from the registry, while params reads the configuration directly from the NLB driver.  Under normal operating conditions the two parameter sets will be in sync. However, they can differ, for example, if the NLB configuration has been changed without reloading the NLB driver, or if a configuration error has forced the NLB driver to reject the parameter set in the registry and retain its current configuration.

Consult the nlb.exe help (nlb.exe /?) for more information and command line syntax.
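Because “nlb.exe display” reads from the registry and “nlb.exe params” reads from the driver, comparing the two outputs can highlight a configuration that has not been loaded. The sketch below makes no assumption about the output format; it simply captures both outputs and produces a line-by-line diff. Since the two outputs are only nearly identical, some of the reported differences will be expected noise, so treat the result as a starting point for inspection rather than a verdict.

```python
import difflib
import subprocess


def capture(command):
    """Run a command and return its standard output as a list of lines."""
    completed = subprocess.run(command, capture_output=True, text=True, check=False)
    return completed.stdout.splitlines()


def diff_outputs(registry_lines, driver_lines):
    """Return unified-diff lines between the two captured outputs."""
    return list(
        difflib.unified_diff(
            registry_lines,
            driver_lines,
            fromfile="nlb.exe display (registry)",
            tofile="nlb.exe params (driver)",
            lineterm="",
        )
    )


def compare_registry_and_driver():
    """Capture both outputs on the local NLB host and diff them."""
    return diff_outputs(
        capture(["nlb.exe", "display"]),
        capture(["nlb.exe", "params"]),
    )
```

An empty result from compare_registry_and_driver() means the two outputs were textually identical; otherwise, lines prefixed with “-” came from display and lines prefixed with “+” came from params.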

2.4 Assessing Scaling Performance of NLB

Often customers want to know how well the cluster will scale as machines are added to the cluster. Will the performance scale linearly? If not, how much capacity does one get by adding another machine? The scaling performance of an NLB cluster varies with the characteristics of the service being load balanced, so there are no hard-and-fast answers. The NLB FAQ (see the Background Information section for a link) shows the scaling factor for an example.

 

This section offers tips on how to collect metrics so that you can assess the scaling factor for your load balanced application. To do this you must gather performance metrics from the machines under a variety of test conditions. Typically you will want to know how many requests/sec your load balanced application can handle as the number of machines in the cluster increases.

 

For your test setup you will need a client tool, running on one or more client machines, to generate load for your load balanced application. In addition you need a consistent way to offer load to the cluster and assess how much load is offered during a given test run. Typically, client tools offer load from a pool of threads. The amount of load offered is usually controlled by a handful of settings such as pool size and sleep time between requests. These properties are usually fixed for the duration of a test run. Each thread transmits a request and synchronously blocks until either the response is received or the attempt errors out.
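The client behavior described above can be sketched as follows. This is an illustration of the thread-pool model only, not a replacement for a real load tool. The send_request callable is a placeholder that you would supply (for example, a function that issues one request to the VIP and blocks until the response arrives or raises an exception), and the setting names pool_size, requests_per_thread and sleep_s are assumptions made for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def worker(send_request, requests_per_thread, sleep_s):
    """Issue requests one at a time, blocking on each response."""
    ok = 0
    errors = 0
    for _ in range(requests_per_thread):
        try:
            send_request()  # blocks until a response arrives or an error is raised
            ok += 1
        except Exception:
            errors += 1
        time.sleep(sleep_s)  # fixed pause between requests on this thread
    return ok, errors


def offer_load(send_request, pool_size, requests_per_thread, sleep_s):
    """Offer load from a fixed pool of threads and summarize the run."""
    started = time.monotonic()
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        futures = [
            pool.submit(worker, send_request, requests_per_thread, sleep_s)
            for _ in range(pool_size)
        ]
        totals = [future.result() for future in futures]
    elapsed_s = time.monotonic() - started

    ok = sum(t[0] for t in totals)
    errors = sum(t[1] for t in totals)
    rate = ok / elapsed_s if elapsed_s > 0 else 0.0
    return {"ok": ok, "errors": errors, "elapsed_s": elapsed_s, "requests_per_s": rate}
```

Note that the achieved request rate depends on response latency, because each thread waits for its response before sending the next request. This is why the pool size and sleep time must be recorded with every data run.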

 

Such tools can’t provide a fixed load rate across data runs. For example, if a change were to increase the latency of the requests this would cause a drop in total throughput (and request rate). One compensates for this on the client side by increasing the size of the pool as latency increases. However, changes between data runs must be made carefully so that a comparison across data runs reflects changes in the performance on the server rather than changes to the clients or the ambient environment.

 

It is customary to use some other metric, such as server CPU utilization, as a guide in testing. The idea is to compare the results of two data runs that have the same CPU utilization. (Assuming operation in a linear region, one can also apply a scaling rule to the results to compensate for small differences in CPU utilization across data runs.) Often this is done by offering the server as much load as it can handle in the two scenarios, hence driving the CPU utilization close to 100%. But with the servers operating at peak, non-linear effects are likely to creep in and distort the results.

 

Instead the following is recommended:

  • Make a “calibration” run against a single machine with NLB unbound. Tweak the client settings until the load pushes the server to 80% CPU utilization. Save these settings.
  • Repeat the calibration procedure for the same machine, but with NLB bound. Retweak the settings to get 80% CPU utilization. You may need more test clients to get this utilization up because of an increase in latency. Save these settings.
  • Add one machine to the cluster and repeat the previous step. Repeat this process until calibration settings are determined for each cluster size of interest.
  • With the calibrated client settings run your tests with and without NLB and collect throughput metrics from each machine (also verify that CPU utilization remains at 80% through the test runs). Compare the aggregate throughput across data runs as the cluster size increases.
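Once throughput and CPU utilization have been collected for each cluster size, the comparison can be reduced to a scaling factor per cluster size. The sketch below is one possible way to do the arithmetic. It assumes the servers operate in a linear region, so that throughput can be scaled proportionally to correct for small deviations from the 80% CPU target, and it uses the one-machine run as the baseline. The data layout is an assumption made for this example.

```python
def normalize_throughput(requests_per_s, cpu_percent, target_cpu_percent=80.0):
    """Scale a host's throughput to the target CPU utilization.

    This linear rule is only reasonable for small CPU differences
    and when the server is not operating near its peak.
    """
    if cpu_percent <= 0:
        raise ValueError("cpu_percent must be positive")
    return requests_per_s * target_cpu_percent / cpu_percent


def scaling_factors(runs, target_cpu_percent=80.0):
    """Compute aggregate throughput relative to the one-machine run.

    runs maps cluster size -> list of (requests_per_s, cpu_percent),
    one tuple per host in that data run. A run for size 1 is required.
    """
    aggregate = {
        size: sum(
            normalize_throughput(rate, cpu, target_cpu_percent)
            for rate, cpu in hosts
        )
        for size, hosts in runs.items()
    }
    baseline = aggregate[1]
    return {size: aggregate[size] / baseline for size in sorted(aggregate)}
```

As an illustration with made-up numbers (not measurements), a one-machine run of 1000 requests/sec at 80% CPU and a two-machine run of 950 requests/sec per host at 80% CPU yield a scaling factor of 1.9 for the two-machine cluster. Whether the baseline is the run with NLB bound or unbound is a choice; use the same one consistently.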

 

3 Symptoms

Symptoms are categorized into the following sections. Scan through the topics to identify the closest match, and refer to the named subsection for more specific diagnosis as well as a discussion of more specific symptoms.

Problems when Performing a Configuration and Management Operation – A configuration or management operation, such as binding NLB, adding a port rule or stopping a cluster, did not succeed.

No Connectivity to VIPs – A cluster has been set up on one or more computers, but the cluster is entirely unresponsive. No requests addressed to the VIPs (Virtual IP Addresses or cluster IP Address) are answered, whether from clients on the same LAN or across one or more routers. Note the distinction between this topic and the next two topics.

Intermittent connectivity to VIPs – Clients are seeing intermittent connectivity to the VIPs. That is, any one client experiences intermittent connectivity. This is distinct from the next topic, where different clients see different behavior.

Some clients can connect to VIPs but not others – In this topic the connectivity problems are associated with specific clients or perhaps the locality of the clients relative to the cluster.

Cannot connect to or from DIP – This topic discusses connectivity problems for traffic addressed to a specific DIP (dedicated IP address) or originating from the DIP.

Uneven load balancing or poor performance – All clients can connect all of the time, but load is not balanced evenly among the nodes in the cluster. Or there are other performance problems such as slow response or low throughput.

Problems relating to client authentication – Clients are having problems authenticating to the virtual service via SSL or Kerberos. Note: problems connecting to NLB nodes for administrative purposes (for example, via NLB Manager or WMI) are covered in a previous topic.

Problems relating to session persistence – The application or protocol it uses (SSL, VPN, IPSec, etc.) requires some form of session persistence that is not being preserved.

Problems relating to convergence – Cluster nodes converge into separate clusters, or one or more nodes remain in the “converging” state.

Problem specific to Application X or Protocol Y – A problem specific to a particular application (such as read-only file shares) or protocol (such as SSL).

3.1 Problems when Performing a Configuration and Management Operation

This section covers symptoms for problems involving configuration or management operations (such as binding NLB, adding a port rule or stopping a cluster) not having the intended effect. This includes using NLB Manager, WMI, Network Configuration UI, nlb.exe or wlbs.exe.

Note: NLB needs to be bound to one or more network adapters on each cluster member. This is done by using NLB Manager (recommended) or through the Network Configuration Manager.

3.1.1 Symptom: Network Load Balancing is not listed as an available component to be installed in the LAN Connection properties.

Possible Cause:

3.1.2 Symptom: “Network Connections” properties user interface does not open/work.

Possible Causes:

  • You need to be a system administrator on the computer to be able to make configuration changes.
  • Simultaneous use of “NLB Manager” and “Network Connections” Properties user interface (NCP UI) could affect the functionality of both components. In particular, attempts to update the configuration on a host via NLB Manager when the NCP UI is active on that host will cause the update to fail. Use only one of them at a time. Use of NLB Manager is preferred.

3.1.3 Symptom: NLB Manager, WMI script or third party application cannot connect to a host on which NLB is to be set up or is already set up.

Related Symptoms:

  • Instances of WMI classes in the root\microsoftnlb WMI namespace are inaccessible.
  • The “LoadAllSettings” method in the MicrosoftNLB_ClusterSetting and MicrosoftNLB_NodeSetting NLB WMI classes returns an “Access Denied” error.
  • Authentication failure when attempting to configure a host using a third party WMI application or WMI scripts.
  • When attempting to use Network Load Balancing Manager to connect to a host in the NLB cluster, you receive the error "Host unreachable".

Further Diagnosis:

First verify that you have basic connectivity to the host by “pinging” the host using ICMP echo (the ping.exe command). If ICMP is not enabled on your network, you may need to use other mechanisms such as “net view”. If you do not have IP-level connectivity to the computer on which NLB is to be set up, then the problem is out of scope of this troubleshooting document.

The next step is to verify that you are able to establish a WMI session with the host (NLB Manager uses WMI to remotely configure a host, and WMI in turn uses RPC). Use the following steps:

  1. Launch “wbemtest.exe”.
  2. Click “Connect” and connect to the “\\<host-name-or-IP-address>\root\cimv2” namespace. If you have problems with credentials, enter the credentials of an administrator on the remote host.
  3. Click “Open Class”, enter “Win32_NetworkAdapterConfiguration” as the target class name, and then click OK. The “Object editor” window for the class will open.
  4. Click “Instances”. The “Query Result” window opens and, after a few seconds of network delay, lists the instances of the class.

If the above steps succeed, it means that you are able to establish WMI sessions to the host.

If you have IP-level connectivity but are not able to establish any WMI session, possible causes are explored in the following sections:

If you can view Microsoft cimv2 classes but NLB manager still fails, see the following sections for possible causes:

3.1.4 Symptom:  Remote Control is not working – When invoking the Network Load Balancing remote-control commands from a computer outside the cluster, there is no response from one or more cluster hosts.

Note: Remote control is a legacy feature in Windows Server 2003. It has security vulnerabilities and its use is discouraged.

Further Diagnosis:

  • Open the NLB cluster property page in NLB Manager or via the “Network Connections” Properties user interface. Ensure that remote control is enabled on the hosts which are not responding to remote-control commands. Furthermore re-enter the password for remote control access.
  • Determine whether there is connectivity to the cluster IP address configured on the NLB cluster property page. The section No Connectivity to VIPs lists the causes of problems with connectivity to the cluster IP address.
  • The cluster IP address/cluster MAC address may not be correctly configured on the hosts that are not responding to remote-control commands. If this is the cause of the problem, the hosts will also not respond to the application traffic destined to the cluster IP address.

3.1.5 Symptom: Remote Control is not working with Dedicated IP Address – When using the dedicated IP address of a host to specify it as a target for a remote-control command, there is no reply.  Specifying the host by its priority (ID), however, is successful.

Possible Causes:

3.1.6 Symptom: Port rules do NOT show up as instances of the MicrosoftNLB_PortRuleLoadbalanced, MicrosoftNLB_PortRuleFailover, or MicrosoftNLB_PortRuleDisabled WMI classes

Possible Cause:

When at least one port rule applies to a port-specific cluster IP address (as opposed to All IP addresses; port-specific cluster IPs are new in Windows Server 2003), none of the port rules for that cluster, including those that apply to All IP addresses, will show up as instances of the aforementioned classes. This is the intended behavior. The port rules will instead show up as instances of the new “MicrosoftNLB_PortruleEx” class. This new class is backwards compatible, meaning that it works even when no port rule applies to a specific cluster IP address. Use of this new class is strongly recommended under all conditions.

3.1.7 Symptom: After the cluster hosts start, they begin converging but never complete convergence.

Possible Cause:

  • Cluster and/or Port rule parameters are not consistent on all cluster hosts. NLB Manager will detect this case.

3.1.8 Symptom: IP Address Conflict – After installing Network Load Balancing and restarting a cluster host, the error "The system has detected an IP address conflict with another system on the network..." is displayed.

Related Symptoms:

  • Cluster IP address does not show up in the list displayed by “IPCONFIG.EXE”; instead an all-zero entry shows up.

Possible Causes:

  • The cluster IP address has been added to the TCP/IP properties on a computer that does not have NLB bound to the network adapter.
  • The cluster mode (unicast vs. multicast) is not configured consistently across the hosts in the cluster.

Use NLB Manager to avoid this kind of configuration mistake.

3.2 No Connectivity to VIPs

When configuring an NLB cluster, basic connectivity tests need to be performed to verify that the hosts are set up properly. This section describes symptoms commonly seen during initial testing, when a connectivity problem is likely to be encountered. While troubleshooting connectivity problems, ensure that you test from a client computer that is not a member of the cluster. If the following symptoms do not address your problem, you will need to use Network Monitor (Netmon) to capture the network activity on both the client and the servers while reproducing the problem. These captures can be analyzed to pinpoint the cause of the problem.

3.2.1 Symptom: Can’t ping the virtual IP address from within the LAN

Possible Cause:

  • The virtual IP address has not been added to the TCP/IP properties for the adapter.

3.2.2 Symptom: The virtual IP address can be pinged from within the LAN, but not outside the LAN

Possible Causes:

  • The router is rejecting the ARP reply or gratuitous ARP from the NLB hosts.
  • For the multicast or IGMP modes, there is no static ARP entry in the router to map the unicast virtual IP address to the NLB multicast MAC address.
  • The router or intervening firewall is filtering out ICMP traffic.

Further Diagnosis:

3.2.3 Symptom: Ping by virtual IP works but ping by name doesn’t

Possible Cause:

  • The administrator selected the option to automatically register this adapter’s IP addresses with WINS and/or DNS. The administrator expects that this registers the cluster name (the “Full Internet name” in the NLB setup dialog) with the virtual IP address. However, NLB does not perform any name registration tasks. Automatic name registration must not be used on the adapter to which NLB is bound. Name registration for the VIP and DIP must be made statically.
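Once the static name registration has been made, it can be verified from a client by checking that the cluster name resolves to the virtual IP address. The following is a minimal sketch using the standard resolver on the client; the cluster name and address passed in are whatever you registered, and the function name is illustrative.

```python
import socket


def resolves_to(name, expected_ip, resolver=socket.gethostbyname_ex):
    """Check whether name resolves to expected_ip.

    Returns (matched, addresses). addresses is the list of IPv4
    addresses returned by the resolver, or [] if resolution failed.
    """
    try:
        _, _, addresses = resolver(name)
    except OSError:
        # Covers resolution failures such as an unregistered name.
        return False, []
    return expected_ip in addresses, addresses
```

If the returned list is empty, the name is not registered at all; if it contains a DIP rather than the VIP, a dynamic registration from one of the hosts has probably taken the place of the static entry.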

3.2.4 Symptom: Ping works but can’t connect to the load-balanced service

Possible Causes:

  • The application or service is not started or is failing on all ‘Converged’ hosts
  • The application or service being load-balanced is not UDP- or TCP-based, nor does it run on top of IPSec. NLB only load-balances UDP, TCP and IPSec protocols.
  • The application or service is not configured to accept connections via the VIP
  • No host is ‘Converged’ in the cluster or no host has the relevant port rule in the ‘Enabled’ state. Note that clients may be able to connect to the load-balanced application when the hosts are in the ‘Converging’ state. In this case, if the cluster was previously converged, the previous convergence criteria are used to determine load distribution. If the cluster did not previously converge then no host will handle load until convergence completes.
  • See the section A firewall or router is filtering traffic between client and server to the load-balanced service but not ICMP traffic
  • See the section Extreme Networks Layer 3 switch running ExtremeWare 6.x
  • The adapter is not compatible with NLB. See the section Compaq NC3163 Fast Ethernet adapter.

Further Diagnosis:

  • Use nlb.exe (see the section Overview of Available Tools) to examine the convergence and port rule state of the hosts in the cluster
  • Determine whether you can connect to the service via the dedicated IP address, following the General Troubleshooting Techniques outlined in the section Overview of Available Tools.

3.3 Intermittent Connectivity to VIPs

3.3.1 Diagnosis

To diagnose these problems, try the following:

  • Check the configuration and state of load-balanced services, such as IIS.
  • Check the CAM table on the switch to ensure that it has not associated the cluster MAC address with any of the switch ports.
    • Since NLB does not filter ICMP by default, you can also start Netmon on all NLB hosts and ping a virtual IP to test this.  As long as the hosts are otherwise configured correctly (i.e., VIPs are added to TCP/IP, etc.), you should expect to see the ICMP requests and subsequent replies on all NLB hosts.  If they are seen only on a single host, that may indicate that the switch learned the cluster MAC address.
    • Note that teaming NIC implementations from third party vendors are often a culprit, as they often send low-level “heartbeat”-like traffic amongst themselves for failover detection.  These frames allow the switch to learn the location of the cluster MAC.
  • Check to make sure that the switch is either a layer 2 device, or that it is a layer 3 device configured to operate in layer 2 mode.
  • Use “nlb.exe query” to check the membership of the cluster to ensure that it is consistent and is what is expected.
  • Check the IIS configuration to determine whether or not HTTP keep-alives are enabled.  Note that this is not a misconfiguration.  HTTP keep-alives are a performance enhancing aspect of HTTP 1.1 that allows multiple requests to utilize the same TCP connection, thereby reducing the overhead of TCP connection negotiation.  Users should be aware of how HTTP keep-alives work and the benefits and side-effects thereof.
  • Check the event log for NLB or TCP/IP events that may indicate a problem (warnings or errors).  They include, but are not limited to:
    • IP Address conflict
    • NLB failure to allocate the necessary resources for reliable connection tracking.
    • NLB exhaustion of resources
    • NLB failure to bind to an unsupported adapter (incorrect network medium, failure to support changing the MAC address, or failure to support multi-packet receive).
    • NLB failure to allocate other necessary resources
  • Test connectivity both into and out of the cluster using ping.
  • Check Network Monitor captures to try to determine whether the problem is NLB (noted by a lack of traffic in the capture), or perhaps an application (noted by incoming requests, but a lack of responses).
  • Check connectivity to other services (other than the one being debugged), such as telnet, etc.
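When connectivity is intermittent, it helps to quantify the failures before and after each change. The sketch below repeatedly runs a probe and summarizes how often and for how long it failed. The probe shown is a plain TCP connection attempt, which assumes a TCP-based service; the function and field names are illustrative.

```python
import socket
import time


def tcp_probe(host, port, timeout_s=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def sample_connectivity(probe, attempts, interval_s=1.0, sleep=time.sleep):
    """Run probe() repeatedly and summarize the failures."""
    outcomes = []
    for i in range(attempts):
        outcomes.append(bool(probe()))
        if i < attempts - 1:
            sleep(interval_s)

    # Track the longest run of consecutive failed attempts.
    longest = 0
    current = 0
    for ok in outcomes:
        current = 0 if ok else current + 1
        longest = max(longest, current)

    failures = outcomes.count(False)
    return {
        "attempts": attempts,
        "failures": failures,
        "failure_rate": failures / attempts if attempts else 0.0,
        "longest_failure_run": longest,
    }
```

For example, sample_connectivity(lambda: tcp_probe(vip, 80), attempts=60) samples the VIP once per second for about a minute. Running the same sampling against each DIP can help separate a problem on one host from a problem in the switch or network.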

3.3.2 Symptom:  One or more web servers do not respond to requests, though they respond to pings.

Possible Causes:

  • The load-balanced service on the host (or hosts) is not running, or is misconfigured.

3.3.3 Symptom:  An unusual number of TCP connections to the cluster are being reset.

Possible Causes:

  • The switch to which the cluster hosts are connected may have learned the cluster MAC address, though this is rare.  This can cause TCP traffic to be delivered to the wrong NLB host, resulting in a connection reset (unicast mode only). See the section Switch is learning the MAC Address for details.
  • The switch to which the cluster hosts are connected is a layer 3 switch.  NLB requires layer 2 switching. See the section Switch is operating in Layer-3 mode for details.
  • The cluster has been partitioned, resulting in multiple and/or incorrect hosts responding to client requests, resulting in a connection reset.
  • If this is web traffic, HTTP keep-alives may be enabled on the web servers.  In this instance, TCP resets are the expected behavior.
  • A failure during the bind process may have forced NLB to use a less reliable mechanism to track TCP connections. This can cause multiple NLB hosts to service the same TCP connection, resulting in a connection reset.  Check the event log for a related warning.
  • Too many active TCP connections may have caused NLB to exhaust the resources used to track TCP connections and reliably ensure connection affinity. This can cause multiple NLB hosts to service the same TCP connection, resulting in a connection reset.  Check the event log for a related warning.
  • Network packet loss and/or delay that causes frequent TCP SYN retransmission, along with changes in cluster membership, can cause multiple NLB hosts to accept the same TCP connection, resulting in connection reset.

3.3.4 Symptom:  Traffic inexplicably alternates between cluster hosts, breaking TCP connections.

Possible Causes:

·         The switch to which the cluster hosts are connected may have learned the cluster MAC address, though this is rare.  This can cause TCP traffic to be delivered to the wrong NLB host, resulting in a connection reset (unicast mode only). See the section Switch is learning the MAC Address for details.

·         The switch to which the cluster hosts are connected is a layer 3 switch.  NLB requires layer 2 switching. See the section Switch is operating in Layer-3 mode for details.

3.3.5 Symptom:  The network adapter to which NLB is bound does not seem to be working properly.

Possible Causes:

·         The act of binding NLB to the network adapter may have failed either in NLB, TCP/IP or some other protocol or intermediate driver.  Check the event log for a related error.  NLB will fail if:

o        The network adapter is not 802.3 compliant

o        The network adapter does not support programmatically changing its MAC address (unicast mode only).

o        It fails to allocate a required resource

·         An IP address conflict on the network caused by this host may be hampering its connectivity

·         The network adapter to which NLB is bound does not support NDIS multi-packet receives.  NLB has set a performance bar that requires this functionality in the network adapter miniport drivers.  If it is not present, NLB drops all incoming traffic

·         There may be some other problem in the network unrelated to NLB

·         The switch to which the cluster hosts are connected may have learned the cluster MAC address on another switch port, though this is rare.  This will intermittently prevent traffic from reaching this host (unicast mode only). See the section Switch is learning the MAC Address for details

3.3.6 Symptom:  Some NLB hosts do not seem to handle traffic consistently.

Possible Causes:

·         An IP address conflict on the network caused by this host may be hampering its connectivity

·         If a drain, disable, stop or drainstop operation has been performed on the host, convergence must complete before its share of the client traffic will be taken on by other cluster members.  A failure to complete convergence may therefore be denying service to some clients during this process.

 [Back to top of Symptoms]

3.4 Some clients can connect to VIPs but not others

Ensure that the problem indicates that the client is unable to reach the service. For example consider the case of a client requesting service from IIS and receiving an HTTP 500 error. This is not a connectivity problem; the client connected to IIS, but the service was not able to respond. Proceed only if there was no such communication with the service.

A connectivity problem of this sort is almost always caused by one or more hosts malfunctioning in the cluster, rather than being a problem with a specific client. To determine the host on which to focus troubleshooting perform the following:

  • Use nlb.exe (see the section Overview of Available Tools) to check the converged state of the cluster – you need to consider only the hosts that are currently in the cluster
  • Use nlb.exe (see the section Overview of Available Tools) to check the state of the relevant port rule – you only need to consider the converged hosts whose relevant port rule is ‘Enabled’
  • With this subset of hosts, use performance counters for the load balanced application to determine which host isn’t handling traffic. You may also look at Network Interface\Packets Received/sec for the NLB NIC as a substitute. However it is an inferred measure of load and hence may not be a reliable indicator.

3.4.1 Symptom: Virtual Private Network (VPN) calls fail when cluster membership changes

This symptom applies to Windows 2000 only. When a cluster membership change occurs, the control and data channels for a specific client can end up on different hosts. The result is that these clients have no connectivity to the VIP. See the section IPSec Problems for more information.

3.4.2 Symptom: Client has intermittent connectivity to the VIP

The following possible causes apply only if the relevant port rule has ‘None’ client affinity, or cluster membership is changing frequently during the investigation. Otherwise the symptom would be that described in section Some clients can get service through the VIP but others can’t.

Possible Causes:

  • One or more hosts don’t have the VIP configured in TCP/IP
  • One or more hosts have default gateways or static routes configured such that there is no return path to the client
  • The load-balanced service or application isn’t started, or it is failing, on one or more hosts
  • The load-balanced service or application isn’t configured to accept traffic from clients via the virtual IP address on one or more hosts
  • A host is leaving the cluster, for example, via an ‘nlb stop’ command, but convergence is not completing and the previous convergence criteria are being used. Until convergence completes, the other hosts in the cluster will not handle this host’s portion of the client traffic.
  • The switch learned the MAC address of the cluster. Only the host homed to this switch port will receive cluster traffic, and it will handle only its portion of clients. This situation applies only to the Unicast mode. See the section Switch is learning the MAC Address for details.
  • The network adapter to which NLB is bound on one or more hosts is not compatible with NLB. See section Compaq NC3163 Fast Ethernet adapter for a discussion of such an example.

3.4.3 Symptom: Some clients can get service through the VIP but others can’t

Possible Causes:

[Back to top of Symptoms]

3.5 Cannot connect to or from dedicated IP address

This troubleshooting item assumes that there is client connectivity to the VIP. If this isn’t the case see the section No connectivity to VIPs to troubleshoot the VIP first. Once that is resolved, return to this section if you continue to have problems with the dedicated IP address (DIP).

3.5.1 Symptom: Can’t ping the DIP address of a cluster member

Possible Causes:

  • The DIP isn’t configured in TCP/IP
  • The switch has learned the location of the cluster MAC address (Unicast mode only). Only the host homed to the switch port associated with the cluster MAC will receive traffic. Since the DIPs are associated with the cluster MAC, only traffic destined for this host’s DIP will be received. See the section Switch is learning the MAC Address for details.

3.5.2 Symptom: Can’t ping the DIP of a peer within the same cluster

Possible Causes:

  • The cluster is configured in Unicast mode. Inter-host communication between peers of the same cluster is not possible in Unicast mode because all hosts share the same NLB MAC address.
  • The destination DIP isn’t configured in the TCP/IP properties

3.5.3 Symptom: Can’t connect to a listening service via the DIP

Possible Causes:

  • The DIP isn’t configured or isn’t listed in the correct order in the TCP/IP properties. The DIP must be configured in the initial dialog when the TCP/IP properties are rendered.
  • The DIP isn’t configured on the NLB Hosts tab. In this case, the DIP will be treated as a VIP and be load-balanced. The probability is low that the packet will be load-balanced to the host with the DIP configured in TCP/IP.
  • The listening service isn’t configured to accept requests from the DIP
  • The switch has learned the location of the cluster MAC address (Unicast mode only). Only the host homed to the switch port associated with the cluster MAC will receive traffic. Since the DIPs are associated with the cluster MAC, only traffic destined for this host’s DIP will be received. See the section Switch is learning the MAC Address for details.
  • NLB Hosts are homed to a Layer 3 switch (see section Extreme Networks Layer 3 switch running ExtremeWare 6.x)

3.5.4 Symptom: Ping works from an NLB host, but can’t initiate UDP or TCP traffic from it

Possible Causes:

  • The DIP isn’t configured on the NLB Hosts tab. In this case, the DIP will be treated as a VIP and be load-balanced. The probability is low that the packet will be load-balanced to the host with the DIP configured in TCP/IP.
  • The DIP is defined on the NLB Hosts tab but the IP address has not been added to the TCP/IP properties for the adapter. See the section Outgoing ICMP works if the DIP isn’t properly configured.
  • The switch has learned the location of the cluster MAC address (Unicast mode only), and this host is not homed to this location. Since the DIP is associated with the cluster MAC, this host will not receive replies. See the section Switch is learning the MAC Address for details.
  • NLB Hosts are homed to a Layer 3 switch (see section Extreme Networks Layer 3 switch running ExtremeWare 6.x)

3.6 Uneven load balancing or poor performance 

For this scenario we assume that the hosts in question are in the ‘Converged’ or ‘Converging’ state and the relevant port rule is ‘Enabled’ on them. Before proceeding see the section Overview of Available Tools for instructions on how to verify that this is the case.

Note: NLB does not dynamically adjust the load distribution, nor does it make distribution decisions on a per-connection basis. NLB statistically maps incoming TCP connections to hosts, taking into account the statically configured load weights. If there are relatively few TCP connections (fewer than 10 per host), or single affinity is enabled and there are relatively few clients (fewer than 10 per host), uneven load balancing is to be expected.
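The effect of a small number of connections can be illustrated with a short simulation. The sketch below is not the NLB distribution algorithm; it simply assigns each simulated connection to a host at random in proportion to hypothetical load weights and reports each host's share of the total. The host names, weights and connection counts are illustrative assumptions.

```python
import random


def simulate_distribution(weights, connections, seed=0):
    """Assign `connections` simulated connections to hosts at random,
    in proportion to `weights` (a dict of host name -> load weight).

    Statistical illustration only; this does not reproduce the NLB
    filtering algorithm.
    """
    rng = random.Random(seed)
    hosts = list(weights)
    counts = dict.fromkeys(hosts, 0)
    picks = rng.choices(hosts, weights=[weights[h] for h in hosts], k=connections)
    for host in picks:
        counts[host] += 1
    return counts


def shares(counts):
    """Convert per-host connection counts into fractions of the total."""
    total = sum(counts.values())
    return {host: count / total for host, count in counts.items()}


if __name__ == "__main__":
    # Hypothetical four-host cluster with equal load weights.
    equal_weights = {"host1": 50, "host2": 50, "host3": 50, "host4": 50}
    for n in (20, 200, 200000):
        print(n, shares(simulate_distribution(equal_weights, n)))
```

With only a handful of connections per host the printed shares typically differ noticeably from the configured weights, while with many connections they settle close to them.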

3.6.1 Symptom: The measured load distribution is not that predicted by the load weights

Possible Causes:

  • The load distribution was measured incorrectly. See the section Measuring Load Distribution.
  • The hosts are in the Converging state. Until Convergence completes, the hosts load-balance according to their previous convergence criteria.
  • Clients keep connections open much longer than the time since the last membership change. The “settle time” for redistributing load is driven by the lifetime of a typical connection and is finite only in the presence of new connections.
  • The latency is varying from host to host within the cluster either because of hardware differences, or load weights that are out of proportion to the hardware’s capabilities. Latency can lead to queuing in the application or service and cause a drag on the rate at which connections are accepted.
  • A combination of port rule affinity, in-use client IP addresses and cluster size prevent reaching the expected load-distribution. This case is very rare except perhaps in a test environment. See the section Reasons for Uneven Load Distribution for more information.

3.6.2 Symptom: A host that is not the DEFAULT host handles all of the traffic

Possible Causes:

  • The current connections were established when only one host was in the cluster. This host is not currently the DEFAULT host.
  • A test tool delivers simulated load by opening a single TCP connection and pipelining requests over the connection. To observe load-balancing each request should open a new connection.
  • The switch has learned the location of the cluster MAC address (Unicast mode only). Only the host homed to the switch port associated with the cluster MAC is receiving traffic. Service is being denied to all clients whose load is not handled by this host. See the section Switch is learning the MAC Address for details.

3.6.3 Symptom: The DEFAULT host handles all of the traffic

Possible Causes:

  • The current connections were established when only one host was in the cluster. This host is currently the DEFAULT host.
  • A port rule has not been defined to cover the port range of the load-balanced application or service.
  • A test tool delivers simulated load by opening a single TCP connection and pipelining requests over the connection. To observe load-balancing each request should open a new connection.
  • The switch has learned the location of the cluster MAC address (Unicast mode only). Only the host homed to the switch port associated with the cluster MAC is receiving traffic. Service is being denied to all clients whose load is not handled by this host. See the section Switch is learning the MAC Address for details.

3.6.4 Symptom: Traffic comes in fits and bursts and response time is slow or requests time out

Possible Cause:

  • One or more hosts have a duplex mismatch with the network device to which they are homed.

Further Diagnosis:

  • Tests with large-payload ICMP (ping) requests to the default gateway should exhibit similar behavior. For troubleshooting purposes manually configure the duplex settings on the network device and hosts.

3.6.5 Symptom: Traffic comes in fits and bursts and response time is slow or requests time out

Possible Cause:

  • Client is receiving load via proxy servers and the site requires authentication

Further Diagnosis:

  • Analyze netmon captures taken while reproducing the symptom.

[Back to top of Symptoms]

3.7 Problems relating to Client Authentication

This section describes problems concerning a client’s inability to authenticate with a service running on an NLB cluster. The first step is to attempt to reproduce the problem on a single-node cluster. If the problem reproduces there, it is most likely unrelated to NLB. If the problem reproduces only with a multi-node NLB cluster, the specific authentication scheme may not have been set up for use in a clustered environment. For example, Kerberos does not work by default in an NLB cluster. See the section Kerberos authentication isn’t working through NLB for information on how to configure Kerberos to work with NLB.

[Back to top of Symptoms]

3.8 Problems relating to Session Persistence

This section lists problems that occur when a client establishes a session with a particular host but subsequent connections are directed to a different host, causing intermittent connectivity. NLB has very limited support for session persistence: single affinity and special support for preserving L2TP and IPSec sessions.

Single affinity will preserve a session from a particular IP address to a particular host as long as there are no changes in the set of hosts that belong to the cluster. Because load-distribution and affinity are closely related see also “How do I configure my cluster to handle load non-uniformly?” in the NLB FAQ, which discusses the impact of changes to load weights.
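To see why a membership change can disturb affinity, consider a deliberately simplified model in which each client IP address is hashed onto the current list of hosts. This is not the NLB filtering algorithm; the hash (CRC-32), the host names and the client addresses are assumptions chosen only to show that a mapping that depends on the host set is stable while the set is unchanged and can move clients when it changes.

```python
import zlib


def owner(client_ip, hosts):
    """Map a client IP address to one host from the current host list.

    Simplified stand-in for affinity: the same IP and the same host list
    always give the same owner. Not the real NLB algorithm.
    """
    return hosts[zlib.crc32(client_ip.encode("ascii")) % len(hosts)]


def moved_clients(clients, hosts_before, hosts_after):
    """Return the clients whose owning host differs between two host lists."""
    return [c for c in clients if owner(c, hosts_before) != owner(c, hosts_after)]


if __name__ == "__main__":
    clients = ["10.0.0.%d" % i for i in range(1, 255)]  # hypothetical clients
    before = ["host1", "host2", "host3", "host4"]
    after = ["host1", "host2", "host3"]  # host4 has left the cluster
    print(len(moved_clients(clients, before, before)), "clients moved with stable membership")
    print(len(moved_clients(clients, before, after)), "clients moved after host4 left")
```

This modulo-based model moves more clients than strictly necessary when a host leaves, so it should be read only as an illustration of the dependency on cluster membership, not as a prediction of how many sessions NLB itself would redirect.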

Windows Server 2003 only: L2TP and IPSec sessions are preserved across load-weight and membership changes, as long as the host that owns the sessions remains in the cluster.

3.8.1 Symptom: A VPN session is established successfully, but some time later, on a subsequent re-connect, the session is lost and has to be re-negotiated

Possible Cause:

  • It is likely that the VPN session affinity has been lost and the client’s subsequent connection is directed to another server. One cause of this is that port rules are set with “None” affinity: The support for VPN on NLB clusters requires that port rules have “single” or “Class C” affinity.

3.8.2 Symptom: An SSL session is established but a subsequent attempt by the client to connect using this session fails and/or results in excessive delays when reconnecting

Possible Cause:

  • Most likely the client’s SSL session has been broken and the subsequent connection is directed to a different host. See the section SSL Session Affinity for details.

[Back to top of Symptoms]

3.9 Problems relating to Convergence

3.9.1 Diagnosis

To diagnose these problems, try the following:

  • Check the event log to try to determine the cause of convergence.  Each time a host initiates convergence, it attempts to determine why convergence started and logs the reason in the system event log.  The reason for convergence may help isolate the root cause of the convergence problem.
  • Use “nlb.exe query” to check the convergence status and current membership of the cluster.  You can check the results several times over a matter of minutes to determine the stability of the cluster membership.
  • Use NLB manager to check the NLB configuration on all hosts in the cluster and ensure that it is consistent and correct.
  • If the cluster is part of an ISA deployment, use the relevant ISA configuration and management tools to check the BDA configuration of NLB (which is managed by ISA).
  • Perform basic connectivity testing on network components such as cables and switch ports.  Try using another network adapter, cable or connecting to another switch port to verify the functionality of the hardware network components.
  • Check the duplex settings on the network adapters and the switch.
  • Check network statistics in the switch and protocol stack (netstat -s) to try to ascertain whether or not packet loss is occurring in the network.
  • Check the event logs to see if NLB or TCP/IP has reported errors relating to resource exhaustion.
  • Check the CPU usage of the servers.  If this is too high, it may indicate heartbeat loss due to resource starvation.
  • Check the configuration of the switch(es) to ensure that broadcast and/or multicast (depending on the mode that NLB is configured in) are not being blocked by the switch.
  • If hosts are connected to different switches, check VLAN and/or redundant switch configuration to ensure that connectivity exists between the cluster hosts.
  • If remote control is enabled, use “nlb query <Cluster IP address>” to remotely query the membership and state of the cluster from both a cluster node (which forces broadcast transmission of the request) and a non-cluster node.  If the results of the queries differ from each other, or from a local query, it can help ascertain which hosts are seeing what kinds of traffic.
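Repeating the local query over a few minutes, as suggested in the list above, can be scripted. The following is a minimal sketch that treats the text printed by “nlb.exe query” as opaque and only reports whether it changed between samples; it makes no assumption about the output format, so any time-varying text in the output would also register as a change. The function names, sample count and interval are illustrative.

```python
import subprocess
import time


def run_nlb_query():
    """Run "nlb.exe query" locally and return its raw text output."""
    proc = subprocess.Popen(["nlb.exe", "query"],
                            stdout=subprocess.PIPE,
                            universal_newlines=True)
    output, _ = proc.communicate()
    return output


def sample_outputs(query=run_nlb_query, samples=5, interval_seconds=30.0):
    """Collect `samples` query outputs, waiting `interval_seconds` between them."""
    outputs = []
    for i in range(samples):
        if i:
            time.sleep(interval_seconds)
        outputs.append(query())
    return outputs


def is_stable(outputs):
    """True if every sampled output is identical to the first one."""
    return all(out == outputs[0] for out in outputs)


if __name__ == "__main__":
    try:
        collected = sample_outputs()
    except FileNotFoundError:
        print("nlb.exe not found; run this on an NLB host")
    else:
        print("stable" if is_stable(collected) else "output changed between samples")
```

The query function is passed in as a parameter so that the sampling logic can be exercised without a cluster, for example by substituting a function that returns canned text.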

3.9.2 Symptom: The cluster is perpetually converging

Possible Causes:

  • An NLB configuration problem may be preventing convergence from completing.  Possibilities include:

o        Hosts with a conflicting number of port rules

o        Hosts with conflicting port rule settings (ranges, protocols, affinities, etc.)

o        Multiple hosts utilizing the same host ID

o        Hosts with conflicting Bi-Directional Affinity (BDA) settings.  Note: BDA is administered by ISA Server.  Check your ISA configuration.

o        Mismatched cluster modes of operation (for example, unicast vs. multicast)

  • A bad network adapter, cable or switch port may be preventing reliable heartbeat communication between NLB hosts.
  • Mismatched duplex settings on the network adapter and switch may be causing unreliable heartbeat communication.
  • A Windows 2000 host may be attempting to join a cluster in which Windows Server 2003-only features are in use (including BDA, virtual clusters).

3.9.3 Symptom: The cluster periodically and unexpectedly re-converges

Convergence is described in the section “How Does NLB Cluster Convergence Work?” in the NLB FAQ. This symptom is caused by missed heartbeats.

Possible Causes:

  • A bad network adapter, cable or switch port
  • Mismatched duplex settings on the network adapter and switch
  • The switch, network adapter or protocol stack is dropping packets due to insufficient resources.

3.9.4 Symptom: When convergence completes, multiple hosts claim to be the “Default” host

Possible Causes:

  • Multiple clusters may exist on the same subnet.  Each cluster has its own “Default” host
  • The cluster has partitioned itself into sub-clusters. See the symptom The cluster converges separately into sub-clusters.

3.9.5 Symptom: The cluster converges separately into sub-clusters

Possible Causes:

  • The network has been partitioned, blocking the reliable transmission of heartbeat messages between cluster hosts.  Typical causes include:

o        The switch is configured to block broadcast and/or multicast traffic.  NLB heartbeats are broadcast or multicast (depending on the mode of operation) to all hosts.

o        The network is overloaded, resulting in transient, but consistent heartbeat loss between subsets of the cluster.

o        If cluster hosts are connected to different switches (either connected logically by a VLAN, or perhaps in a redundant switch configuration), communication between the multiple switches may be impaired, resulting in the total or partial loss of connectivity between subsets of the cluster.

 [Back to top of Symptoms]

3.10 Problem specific to Application X or Protocol Y 

NLB works with most TCP- and UDP-based protocols, but makes no guarantees about sending the client back to the same host across multiple connections or requests. Thus the application or service must meet one of the following criteria:

  • It is stateless
  • The deployment maintains session state for a given request such that it is available to all hosts

Furthermore, to achieve good load-balancing for TCP-based protocols, the client must not open a single TCP connection and use request pipelining to the exclusion of multiple connections. NLB load-balances TCP connections, not application requests.

3.10.1 Symptom: Kerberos authentication isn’t working through NLB

See the section Kerberos authentication for load balanced web sites for instructions on how to configure Kerberos to work on computers that are part of an NLB cluster.

3.10.2 Symptom: .NET Remoting pipelines method invocations through a single TCP connection

See "Does NLB support applications using .NET Remoting?" in the NLB FAQ.

3.10.3 Symptom: COM+ application isn’t being load-balanced

See "Does NLB support COM+ applications?" in the NLB FAQ.

3.10.4 Symptom: Problems load-balancing a NetBIOS application

See "Can NLB load-balance NetBIOS applications?" in the NLB FAQ.

3.10.5 Symptom: SQL read-only clusters are slow

Under certain circumstances a client can take approximately 8 to 10 seconds to establish a session when connecting to SQL Server through NLB. This occurs when the destination is an IP address, rather than a name, and the client uses Windows authentication.

 

As part of the authentication process, the client first attempts to authenticate using Kerberos. Kerberos requires the identity of both client and server; the client attempts to resolve the identity of the server as part of this process. If the client can’t associate a name with the IP address it reverts to NTLM for authentication, which does not require server identity.

 

The client’s attempt to resolve the IP address to a name, which fails, is what takes so long. To work around this issue, edit the hosts file on the client (it is in %windir%\system32\drivers\etc) to map a fake name to the shared NLB IP address. The client then skips the slow name resolution step, Kerberos fails immediately, and authentication proceeds with NTLM.
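A hosts-file entry is a line consisting of an IP address followed by one or more names. The sketch below checks whether hosts-file text already maps a name to the cluster IP address and, if not, produces the line to add. The address 192.168.1.100 and the name sqlnlb are placeholders, not values from this document.

```python
def mapped_names(hosts_text, ip_address):
    """Return the names that hosts-file text maps to `ip_address`."""
    names = []
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) >= 2 and fields[0] == ip_address:
            names.extend(fields[1:])
    return names


def hosts_line(ip_address, name):
    """Format a single hosts-file entry."""
    return "%s\t%s" % (ip_address, name)


if __name__ == "__main__":
    # Sample text standing in for the contents of the client's hosts file.
    sample = "127.0.0.1\tlocalhost\n"
    cluster_ip, fake_name = "192.168.1.100", "sqlnlb"  # placeholders
    if not mapped_names(sample, cluster_ip):
        print(hosts_line(cluster_ip, fake_name))
```

To use this against a real client, read the text from the hosts file in the location given above instead of the sample string.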

 

[Back to top of Symptoms]

[Back to top]

4 Causes

4.1 Setup

4.1.1 Cause: Network Load Balancing is uninstalled

NLB is installed by default on Windows Server 2003; however, it may be uninstalled manually, in which case it must be manually re-installed before NLB can be bound to any adapter locally or remotely (via NLB Manager). To re-install NLB, click Install in the Network Connections properties dialog box, select Service and click Add, select “Network Load Balancing”, and then click OK. Restart the system and then configure NLB properties.

4.2 WMI/RPC Management Connectivity Problems

This section lists problem causes that prevent a management console (running NLB Manager, a WMI script or a third party management application that uses WMI) from administering a cluster member.

4.2.1 Cause: RPC not enabled on host

Verify that the status of the “Remote Procedure Call (RPC)” service is “Started”: in Control Panel, double-click Services. This service is started by default and is essential for administration of the host using NLB Manager.

4.2.2 Cause: Firewall between management console and host blocks RPC

There is a firewall between the NLB Manager client and the host: Network Load Balancing Manager uses Windows Management Instrumentation (WMI) which in turn has a dependency on Remote Procedure Call (RPC) and Distributed Component Object Model (DCOM). This can present problems when trying to use Network Load Balancing Manager to administer servers that are on the other side of a firewall from the Network Load Balancing Manager computer. By default, DCOM can randomly use a wide range of ports. Firewalls, on the other hand, are typically configured to allow traffic from only a limited number of specific ports. Therefore, in order to use NLB Manager from behind a firewall to manage servers on the other side of the firewall, you must first configure DCOM to use only a specific range of ports. You must then configure your firewall to allow traffic through those ports. For more information, see the link to the whitepaper, “Distributed COM with Firewalls” at http://www.microsoft.com/com/wpaper/dcomfw.asp

In addition to the steps described in the white paper, you must also configure the firewall to allow ICMP echo requests. Alternatively, instead of allowing ICMP echo requests, you can run NLB Manager with the "noping" option. If you use this option, you will experience a delay if NLB Manager attempts to contact a server that is not available. For more information on using the "noping" option, see the Network Load Balancing documentation. You can find this documentation by performing the following procedure on any computer running a product in the Windows Server 2003 family:

  1. Click Start and then click on Help and Support.
  2. Navigate to the following Help and Support topic: Administration and Scripting Tools\Command-line reference\Command-line reference A-Z\Nlbmgr.
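Before changing DCOM or firewall settings, it can help to confirm which TCP ports are reachable from the management console. The sketch below attempts a TCP connection to each port in a list. Port 135 is the well-known RPC endpoint mapper port; the port range 5000-5010 is purely an example and should be replaced with whatever range you configured for DCOM. A successful connection shows only that the port is reachable, not that WMI will work.

```python
import socket
import sys


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def blocked_ports(host, ports, timeout=3.0):
    """Return the ports in `ports` that could not be reached on `host`."""
    return [p for p in ports if not tcp_port_open(host, p, timeout)]


if __name__ == "__main__":
    # Pass the cluster member's name or address as the first argument.
    host = sys.argv[1] if len(sys.argv) > 1 else "127.0.0.1"
    ports = [135] + list(range(5000, 5011))  # 135 = RPC endpoint mapper; range is an example
    print("unreachable on", host, ":", blocked_ports(host, ports, timeout=1.0))
```

Run it from the computer on which NLB Manager runs, so that the test crosses the same firewall as the management traffic.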

4.2.3 Cause: Lack of Administrator credentials when attempting to configure a host

The WMI client may not possess Administrator credentials on the local host, which are needed to access and interact with the instances of WMI classes in the root\microsoftnlb namespace.

4.2.4 Cause: “SeLoadDriverPrivilege” privilege not present or disabled

The “SeLoadDriverPrivilege” privilege is either not present, or present but disabled, in the access token of the WMI client process/thread. To resolve this problem, enable the “SeLoadDriverPrivilege” privilege in the access token before calling WMI (IWbemLocator::ConnectServer method) to connect to the computer. For information about enabling privileges and access tokens, refer to the Windows Server 2003 SDK documentation at “Security” -> “Authorization” -> “About Authorization” -> “Access Control” -> “Privileges”.

4.3 Switch and Router Problems

4.3.1 Cause: Router doesn’t accept proxy ARP replies

A proxy ARP reply is an ARP reply sent by one network entity on behalf of another.  Proxy ARP replies are easily identified by inspecting the source address information of the ARP reply.  In a conventional Ethernet/IP ARP reply, the ARP source hardware address is the same as the source address of the Ethernet frame.  In a proxy Ethernet/IP ARP reply, the two source addresses differ, implying that one host is answering on behalf of another.
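The comparison described above can be applied to a captured frame programmatically. The sketch below parses a raw Ethernet/IPv4 ARP frame (standard layout, no VLAN tag) and reports whether the ARP sender hardware address differs from the Ethernet source address. The helper that builds test frames and the MAC addresses used in the example are made up for illustration.

```python
import struct

ETHERTYPE_ARP = 0x0806
ARP_OP_REPLY = 2


def arp_reply_addresses(frame):
    """Return (ethernet_source, arp_sender_hardware) for an Ethernet/IPv4
    ARP reply, or None if `frame` is not such a reply.

    `frame` is the raw Ethernet frame as bytes, without a VLAN tag.
    """
    if len(frame) < 42:  # 14-byte Ethernet header + 28-byte ARP payload
        return None
    eth_src = frame[6:12]
    (ethertype,) = struct.unpack("!H", frame[12:14])
    htype, ptype, hlen, plen, oper = struct.unpack("!HHBBH", frame[14:22])
    if ethertype != ETHERTYPE_ARP or (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):
        return None
    if oper != ARP_OP_REPLY:
        return None
    return eth_src, frame[22:28]


def looks_like_proxy_arp_reply(frame):
    """True if the ARP sender hardware address differs from the frame's source MAC."""
    addresses = arp_reply_addresses(frame)
    return addresses is not None and addresses[0] != addresses[1]


def build_arp_frame(eth_src, arp_sender_mac, oper=ARP_OP_REPLY):
    """Build a minimal Ethernet/IPv4 ARP frame for testing (addresses are made up)."""
    eth_header = b"\xff" * 6 + eth_src + struct.pack("!H", ETHERTYPE_ARP)
    arp_header = struct.pack("!HHBBH", 1, 0x0800, 6, 4, oper)
    sender_ip, target_ip = bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2])
    return eth_header + arp_header + arp_sender_mac + sender_ip + b"\x00" * 6 + target_ip


if __name__ == "__main__":
    adapter_mac = bytes.fromhex("020000000001")  # hypothetical frame source address
    other_mac = bytes.fromhex("02bf0a000001")    # hypothetical ARP sender address
    print(looks_like_proxy_arp_reply(build_arp_frame(adapter_mac, adapter_mac)))  # False
    print(looks_like_proxy_arp_reply(build_arp_frame(adapter_mac, other_mac)))    # True
```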

NLB does not explicitly generate proxy ARP replies, however, for other reasons critical to the NLB load-balancing algorithm, NLB spoofs ARP replies giving them the illusion of being proxy ARP replies.  In multicast mode, NLB must spoof the ARP reply to map requests for virtual IP addresses to the shared NLB multicast cluster MAC address; therefore, the source address of the Ethernet frame (the physical MAC address of the responder) differs from the source address of the ARP frame (the multicast cluster MAC address).  In unicast mode, NLB is forced to “mask” the source address of all outgoing frames in order to prevent switches from learning the location of the shared NLB cluster MAC address. Therefore, the source address of the Ethernet frame (the masked source MAC address) differs from the source address of the ARP frame (the cluster MAC address).  It is for these reasons that NLB clusters appear to generate proxy ARP replies.

Note first that some router vendors do support proxy ARP replies, but as a configuration option that is turned off by default for security purposes, so be sure to check this first.  If a router does not accept proxy ARP replies, there are two courses of action: (1) replace the router with one that does accept proxy ARP replies, or (2) force NLB to generate non-proxy ARP replies.

The first option is straight-forward and solves the issue at hand, but may be too costly to consider.  It is possible to force NLB to work with these routers, but with associated consequences discussed below.  To force NLB to generate non-proxy ARP replies, several changes need to be made to the NLB configuration and topology of the network.  Refer to the section Even though it should, the switch isn’t learning the cluster MAC address in this document for information on configuring an NLB cluster to operate in this manner.

4.3.2 Cause: Even though it should, the switch isn’t learning the cluster MAC address

If all hosts of an NLB cluster are plugged into a hub and uplinked to a switch, it is desirable for the switch to learn the location of the NLB cluster MAC address and associate it with the port to which the hub is uplinked.  The advantage of this configuration is that it is possible to limit the flooding of cluster-bound traffic on the switch.  To enable an NLB cluster in this mode:

  • If the NLB hosts are currently plugged into a switch, they must all be moved to a hub that is uplinked to the switch.
  • The cluster must be configured to operate in unicast mode.  Neither multicast nor IGMP multicast modes can be made to utilize non-proxy ARP replies.
  • Using the registry editor, or the NLB WMI provider, the “MaskSrcMac” registry setting on all NLB hosts must be set to zero (default is one) and NLB must be reloaded (nlb.exe reload) on all hosts.  With this change NLB will not alter the source MAC address of outgoing frames, including ARP replies.  The location of the registry value is:

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\WLBS\Parameters\Interface\{GUID}\MaskSrcMac = 0

Where {GUID} is the GUID of the particular NLB instance being configured.  If more than one NLB instance exists, then you can use the “ClusterIPAddress” registry value under each {GUID} key to differentiate between instances.
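To find the right instance and confirm the current setting before editing anything, the values can be read from the registry. The following is a read-only sketch using Python's standard winreg module; it relies only on the key location and the “MaskSrcMac” and “ClusterIPAddress” value names given above, and the function names are illustrative. It runs only on Windows and raises an error if the NLB key is not present.

```python
import sys

NLB_INTERFACE_KEY = r"System\CurrentControlSet\Services\WLBS\Parameters\Interface"


def interface_key(guid):
    """Registry path (under HKEY_LOCAL_MACHINE) for one NLB instance."""
    return NLB_INTERFACE_KEY + "\\" + guid


def read_mask_src_mac():
    """Return {guid: (ClusterIPAddress, MaskSrcMac)} for every NLB instance.

    Windows only; reads the registry and changes nothing.
    """
    import winreg  # imported here so the module still loads on other platforms

    results = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, NLB_INTERFACE_KEY) as root:
        subkey_count = winreg.QueryInfoKey(root)[0]
        for index in range(subkey_count):
            guid = winreg.EnumKey(root, index)
            with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, interface_key(guid)) as key:
                cluster_ip, _ = winreg.QueryValueEx(key, "ClusterIPAddress")
                mask, _ = winreg.QueryValueEx(key, "MaskSrcMac")
            results[guid] = (cluster_ip, mask)
    return results


if __name__ == "__main__" and sys.platform == "win32":
    for guid, (cluster_ip, mask) in read_mask_src_mac().items():
        print(guid, cluster_ip, "MaskSrcMac =", mask)
```

Keeping the script read-only is a deliberate choice: the change itself still has to be followed by a reload of NLB on every host, as described in the list above.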

Note that there are consequences to this configuration choice, which are the result of the NLB hosts being homed to a hub.  While it is true that NLB requires that cluster-bound traffic be replicated to all cluster hosts, the return traffic from the cluster to the clients needn’t be replicated. A hub floods inbound and outbound traffic to all ports, resulting in unnecessary bandwidth consumption. 

This issue can be circumvented by configuring the network stack on the NLB hosts to direct outbound traffic through a different adapter that is connected directly to a switch.  However, the lack of outgoing traffic through the hub may cause the switch to “forget” (timeout) the cluster MAC address association, once again, resulting in the flooding of cluster-bound traffic on the switch until an ARP response refreshes the address tables on the switch.

4.3.3 Cause: If the switch is not permitted to learn the cluster MAC address, why do I see it in a Network Monitor sniff?

Network Monitor is a network protocol that resides logically above NLB in the network protocol stack.  It is therefore important to realize that a sniff captured on an adapter to which NLB is bound will show outbound packets before NLB has seen them and inbound packets after NLB has seen them.  As a result, a sniff taken on an NLB host will differ from a sniff taken promiscuously from another host on the same LAN.

By default, in unicast mode, it is true that NLB masks the source MAC address of outgoing frames to prevent the switch from learning the cluster MAC address and associating it with a specific switch port.  However, a Network Monitor sniff from an NLB host will show outbound packets with a source MAC address equal to the NLB shared cluster MAC address; i.e., they appear to be “unmasked”, which would allow the switch to learn the cluster MAC address.  It is important to realize, however, that because Network Monitor is seeing the packets before NLB, the outbound packets simply have not been masked yet; they will be properly masked before being sent out on the wire.  Examining a promiscuous sniff from a non-NLB host on the same LAN will show that this is true.

Likewise, in multicast mode, a Network Monitor sniff on an NLB host will show that all source and destination MAC addresses are the physical MAC address of the NLB host.  However, a promiscuous sniff from a non-NLB host on the same LAN will show that the cluster-bound packets are, in fact, addressed to the NLB shared multicast cluster MAC address. NLB alters the destination MAC address before Network Monitor and TCP/IP see the packet.
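
When comparing a sniff taken on an NLB host with a promiscuous sniff, it helps to know which MAC addresses to expect on the wire. The Python sketch below assumes the commonly documented NLB addressing convention, which this section does not state: the unicast cluster MAC address is 02-BF followed by the four octets of the cluster IP address, the multicast cluster MAC address is 03-BF followed by the same octets, and, when source MAC masking is enabled, the second byte of the unicast address is replaced by the host priority (1 to 32). Treat these as assumptions to verify against your own captures. The function names are hypothetical.

```python
import ipaddress


def _format_mac(octets):
    """Format a sequence of six byte values as XX-XX-XX-XX-XX-XX."""
    return "-".join(f"{octet:02X}" for octet in octets)


def cluster_mac(cluster_ip, mode="unicast"):
    """Derive the assumed shared cluster MAC address from the cluster IP."""
    prefixes = {"unicast": (0x02, 0xBF), "multicast": (0x03, 0xBF)}
    ip_octets = tuple(ipaddress.IPv4Address(cluster_ip).packed)
    return _format_mac(prefixes[mode] + ip_octets)


def masked_source_mac(cluster_ip, host_priority):
    """Derive the assumed masked source MAC for one host in unicast mode."""
    if not 1 <= host_priority <= 32:
        raise ValueError("host priority must be between 1 and 32")
    ip_octets = tuple(ipaddress.IPv4Address(cluster_ip).packed)
    return _format_mac((0x02, host_priority) + ip_octets)


def describe_wire_source(observed_mac, cluster_ip):
    """Classify a source MAC seen in a promiscuous sniff of a unicast cluster."""
    observed = observed_mac.upper().replace(":", "-")
    if observed == cluster_mac(cluster_ip):
        return "unmasked cluster MAC"
    for priority in range(1, 33):
        if observed == masked_source_mac(cluster_ip, priority):
            return f"masked, host priority {priority}"
    return "not an NLB unicast address"
```

Under these assumptions, a promiscuous sniff of a healthy unicast cluster should show only masked source addresses; an “unmasked cluster MAC” result on the wire suggests the switch is able to learn the cluster MAC address.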

4.3.4 Cause: Switch is learning the MAC Address

When the hosts of an NLB unicast cluster are connected to a switch, NLB depends on the switch never learning the unicast cluster MAC address and takes special steps to prevent this from happening. Very rarely, typically due to a failure when performing an NLB configuration change, a switch does associate the unicast cluster MAC address with a particular port. This causes intermittent connectivity problems. Consult the section Intermittent connectivity to VIPs for how to diagnose this condition. This should be a transient condition that corrects itself within a few minutes; alternatively, the switch can be reset or the CAM entries deleted.

4.3.5 Cause: Switch is operating in Layer-3 mode

NLB is not supported when the hosts are homed to a switch operating at Layer-3. Instead, create a VLAN for all the nodes in the NLB cluster, and configure that VLAN to operate in Layer-2 mode.

4.4 Teaming NICs

4.4.1 Cause: NLB is bound to an Adaptive Fault Tolerance team

In general, NLB is compatible with Adaptive Fault Tolerance (AFT) teaming solutions from third party vendors, which can be used to provide fault-tolerance at the link layer below NLB.  However, a specific set of issues can cause problems that need to be addressed manually.  Note that these problems are only applicable in the NLB unicast mode of operation.

First, teaming solutions tend to assign the MAC address of the team using the MAC address of the designated primary team member.  This is fine in a non-NLB deployment, but in unicast mode NLB needs to modify the MAC address of the adapter to which it is bound.  In the NDIS model, this is accomplished by changing a registry setting in the registry hive for the adapter.  While most network adapter driver vendors respect this method of changing the MAC address of an adapter, network adapter teaming solutions have historically ignored it and do not change the MAC address of the team when this registry key is modified.  Therefore, in order to change the MAC address of the team to the NLB shared cluster MAC address, it is necessary to manually configure the team MAC address in the teaming software configuration utility and set it to the NLB cluster MAC address.

Second, some network adapter teaming vendors utilize a lightweight low-level heartbeat mechanism to verify connectivity to the network.  While this extra level of fault detection is desirable, it often causes problems in NLB deployments.  The teaming driver sits below NLB in the NDIS stack, so teaming heartbeats do not traverse the NLB driver, and hence NLB does not spoof the shared cluster MAC address in them. The result is that the switch learns the location of the shared cluster MAC address and directs all cluster traffic to one host only. If the teaming software provides an option to do so, it may be necessary to turn off this heartbeat mechanism in the teaming software configuration.

4.5 Unsupported network adapters

4.5.1 Cause: Compaq NC3163 Fast Ethernet adapter

There is a known problem using this network adapter (driver version unknown). The symptom is that ICMP traffic is delivered but TCP traffic isn’t: a TCP SYN never makes it from the network to the TCP stack. Update to the latest driver and retest; if this doesn’t fix the problem, replace the adapter with a different model.

4.5.2 Cause: Adapter does not support multi-packet receive functionality

In Windows Server 2003, NLB has raised the performance bar for network adapters that it will support.  Specifically, NLB requires that adapters to which it binds support multi-packet receives.  Any packets that are passed to NLB by the adapter through the old receive indicate path are dropped by NLB.  The vast majority of server-class adapters satisfy this requirement.  However, in the event that this condition occurs, NLB logs a warning event in the System event log.  To resolve the issue, first ensure that the drivers for the adapter are the latest available from the vendor.  If the problem is not rectified with the latest driver, it will be necessary to replace the adapter with one that supports the required functionality.

4.6 SSL Session Affinity

An SSL session is established between an external client and a server in the NLB cluster by following the SSL Handshake protocol, at the end of which the server issues a session ID (and sends it to the client) to identify the SSL session. Subsequent TCP connections from the client will present this session ID to identify themselves as part of the already established SSL session. In order for the SSL session to be preserved, these subsequent TCP connections must be accepted by the same server in the NLB cluster that previously issued the session ID. This is where Single affinity steps in: By definition, Single affinity causes all TCP connections from a given client IP address to be handled by the same server in the NLB cluster. In this case, Single affinity causes the initial TCP connection that carries the SSL Handshake protocol (where the session ID is issued) and subsequent TCP connections from the client to be handled by the same server in the NLB cluster.

However, Single affinity preserves a session from a particular IP address to a particular host only as long as there are no changes in the set of hosts that belong to the cluster. If cluster nodes are added or there is a change in the load weights, session affinity may be lost. Thus Single affinity provides “best effort” session persistence. The server application must be able to deal with occasional misdirected sessions by transparently re-negotiating new sessions.
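
The “best effort” nature of Single affinity can be illustrated with a simplified Python model. This is a sketch under stated assumptions, not NLB's actual algorithm: the bin count of 60, the CRC-based hash, and the round-robin bin ownership are all stand-ins chosen for illustration, and real convergence may move fewer bins than this model does.

```python
import ipaddress
import zlib

NUM_BINS = 60  # assumption: the bin count is not stated in this section


def bin_for_client(client_ip):
    """Map a client IP address to a bin with a stand-in hash (not NLB's own)."""
    packed = ipaddress.ip_address(client_ip).packed
    return zlib.crc32(packed) % NUM_BINS


def assign_bins(hosts):
    """Hypothetical bin ownership: deal the bins out to the hosts in turn."""
    return {b: hosts[b % len(hosts)] for b in range(NUM_BINS)}


def owner(client_ip, hosts):
    """Return the host that would handle every connection from client_ip."""
    return assign_bins(hosts)[bin_for_client(client_ip)]


def moved_bins(old_hosts, new_hosts):
    """List the bins whose owner differs between two cluster memberships."""
    before, after = assign_bins(old_hosts), assign_bins(new_hosts)
    return [b for b in range(NUM_BINS) if before[b] != after[b]]
```

While the host list is unchanged, owner() returns the same host for a given client address on every call, which is what keeps the SSL session ID valid. After a host is added, every client whose bin appears in moved_bins() lands on a different host and must re-negotiate its session.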

4.7 IPSec Problems

There are no topics in this section at this time.

4.8 Excessive Dropped Traffic

There are no topics in this section at this time.

4.10 Connectivity Problems

4.10.1 Cause: A firewall or router is filtering traffic between client and server

There is a router or firewall between client and server that is filtering out traffic. For a router, filtering is achieved by setting access control list (ACL) entries to allow or deny traffic.

4.10.2 Cause: Extreme Networks Layer 3 switch running ExtremeWare 6.x

NLB clusters homed to Extreme Networks Layer 3 switches running ExtremeWare 6.x will experience problems with some IP addresses when NLB is initially bound. The IP addresses affected are those that were bound to the network adapter prior to binding NLB. Traffic addressed to or sourced from these IP addresses will be dropped by the switch for approximately 30 minutes, at which time the entries are aged out.

The problem is that the original IP addresses are retained in the IP forwarding database even though these entries have the incorrect MAC address. Upgrade to version 7.x or higher of ExtremeWare to correct this issue.

4.11 Dedicated IP Address Configuration Problems

4.11.1 Cause: Outgoing ICMP echo requests work even if a DIP isn’t properly configured

In this scenario, a DIP is defined in the NLB configuration, but the specified IP address has not been added to TCP/IP. One would not expect to receive a reply to an outgoing ICMP request, but the reply is indeed seen because of the way that NLB handles ICMP traffic. In this case the outgoing ICMP request uses a VIP as the source IP address and the reply will be addressed to the VIP. By default NLB passes ICMP traffic up on all hosts (this traffic is not load balanced), so the reply will be seen by the host that sent the request and the ping will succeed.

Note that UDP or TCP traffic initiated from the host won’t work in this scenario. As with ICMP traffic, the outgoing packets will use a VIP as the source address. Replies will be load-balanced, and the probability is low that the packet will be load-balanced back to the source host. Adding the DIP to TCP/IP properties will fix this because the outbound traffic will then be sourced with the DIP.

4.12 Measuring Load Distribution

Properly measuring load distribution can be very tricky, and proper assessment relies on summing rate counters over time or resetting level counters when a change to cluster membership is made. Typically one uses performance counters from the load-balanced application or service. Alternatively, one can use network counters such as Network Interface\Packets Received/sec.

Since NLB load balances TCP connections, use a performance counter primitive that reflects the connection activity of the service. In IIS, for example, Web Service\Connection Attempts/sec is an excellent counter. Web Service\GET Requests/sec is not a good counter to use because multiple GET requests can be pipelined over a single TCP connection.
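
As a minimal sketch of the “summing rate counters over time” approach, the Python below totals per-host samples of a connection-rate counter (for example, Web Service\Connection Attempts/sec collected at a fixed interval) and compares each host's share with the share implied by its load weight. The function names and the sample data layout are assumptions made for illustration; how the samples are collected is left to the reader.

```python
def load_shares(samples):
    """Return each host's fraction of the total observed connection attempts.

    samples maps a host name to a list of rate-counter samples taken at the
    same fixed interval on every host.
    """
    totals = {host: sum(values) for host, values in samples.items()}
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {host: 0.0 for host in totals}
    return {host: total / grand_total for host, total in totals.items()}


def expected_shares(load_weights):
    """Return each host's fraction of the load implied by its load weight."""
    weight_total = sum(load_weights.values())
    return {host: weight / weight_total for host, weight in load_weights.items()}


def deviations(samples, load_weights):
    """Return observed share minus expected share for each host."""
    observed = load_shares(samples)
    expected = expected_shares(load_weights)
    return {host: observed[host] - expected[host] for host in expected}
```

A positive deviation means the host received more than its configured share over the sampling period. Samples should be discarded and collection restarted whenever cluster membership changes.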

4.13 Reasons for Uneven Load Distribution

NLB divides up the theoretical address space of all clients into a fixed number of quanta called bins. Convergence is the process by which cluster hosts negotiate the ownership of these bins. Good load distribution equates to having a population of clients that is well distributed across the bins. When the observed load distribution differs from the distribution specified by the load weights, and sources of error have been eliminated as the cause, there remain two possible contributors to the observed behavior:

  • Port rule affinity coupled with too few client IP addresses in use – If the port affinity is Single or Class C and there are few client IP addresses in use, load distribution will be coarse since the bin occupancy will be small.
  • Cluster size – Because the number of bins is constant, some cluster sizes are better than others for distributing load evenly across hosts. With a cluster size of 7 hosts, the load will vary by about 10% from one host to another. With a cluster size of 16 hosts, the load will vary by about 25%.
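
The effect of cluster size can be reproduced with a short calculation. The Python sketch below assumes 60 bins (a figure not stated in this section), equal load weights, and an even population of clients across the bins; the function names are hypothetical. It deals the bins out as evenly as possible and reports the difference between the most and least loaded hosts as a fraction of the most loaded host.

```python
NUM_BINS = 60  # assumption: the fixed bin count is not stated in this section


def bins_per_host(host_count, num_bins=NUM_BINS):
    """Split the bins as evenly as possible across equally weighted hosts."""
    base, extra = divmod(num_bins, host_count)
    # 'extra' hosts receive one more bin than the rest.
    return [base + 1] * extra + [base] * (host_count - extra)


def load_spread(host_count, num_bins=NUM_BINS):
    """Return (max - min) / max of the bins owned per host."""
    counts = bins_per_host(host_count, num_bins)
    return (max(counts) - min(counts)) / max(counts)
```

Under these assumptions, load_spread(7) is roughly 0.11 and load_spread(16) is 0.25, in line with the figures above, while cluster sizes that divide the bin count evenly, such as 4, 5, or 6 hosts, give a spread of zero.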

4.14 Reasons for Slow Download Times of Web Pages

When both client and server reside in the intranet environment, the client can obtain the requested content most efficiently if the requests bypass the proxy servers. When a URL consists of an IP address, we have observed that Internet Explorer always uses a proxy server to reach the destination, even if the destination is an intranet address. This can cause a small delay as it adds an unnecessary hop for the client. However, this alone is not likely to be the cause of a significantly slower response.

Download times become extremely slow when authentication schemes are also imposed in this scenario. We have further observed that while HTTP/1.1 is used for the request when made directly to the intranet destination, HTTP/1.0 is used when the request is made via the proxy server. Thus each resource (for example, a picture in GIF format) is obtained via an independent TCP connection. Under these conditions page load times of 10 to 50 times the normal load time are not uncommon.


The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication.

 

This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS DOCUMENT.