In Windows Server 2003, the replication process is responsible for keeping each domain controller updated with the latest Active Directory information. The replication process is also responsible for keeping DFS replicas synchronized.
As you can see, replication is a very important part of the Windows Server 2003 network operating system. So what do you do when replication fails? For that matter, how do you even know when a failure has occurred? Here are some answers to these questions, along with ways to fix the replication process.
How does replication work?
Before you can fix the replication process, you need to understand how it works. As I mentioned earlier, replication is used to keep both domain controllers and DFS replicas synchronized. There are a few other tasks that use replication as well. For the purposes of this article, I will focus my discussion on Active Directory replication that occurs between domain controllers.
If you have ever worked with Windows NT, then you are probably familiar with the PDC and BDC domain controller roles. In such an environment, if someone needs to make an update to the Security Accounts Manager, the update gets applied to the PDC. The PDC then alerts the BDCs to the update and the BDCs download the updates and use them to update their own copies of the Security Accounts Manager. This structure is known as single master replication.
In contrast, Windows 2000 and Windows 2003 use multi-master replication. In multi-master replication, there is no PDC or BDC. Every domain controller contains a writable copy of the Active Directory database. If an administrator makes an update to the Active Directory, the update is applied to the closest available domain controller. The domain controller then uses the replication process to apply the update to the other domain controllers.
Because of the multi-master replication model, the Active Directory must have a technique for resolving conflicts. For example, suppose that two different administrators are making changes to the same attribute of the same user account at the same time. Now, suppose that those changes get written to two different domain controllers. When the next replication cycle occurs, you will have two domain controllers attempting to write contradictory data to the other domain controllers.
To get around this problem, Windows relies on a "most recent change wins" policy. This means that Windows looks at the timestamp for both changes. Whichever of the two changes was made most recently will be the change that takes precedence. The other change will be overwritten.
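The "most recent change wins" rule can be sketched in a few lines of Python. This is only an illustration of the rule as described above, not how Windows implements it. The class name, field names, and sample values are all made up for the example. As far as I know, Active Directory also compares a per-attribute version number before it looks at the timestamp; that detail is left out here.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AttributeChange:
    """One write to a single attribute (hypothetical model, not an AD structure)."""
    value: str
    changed_at: datetime  # timestamp recorded when the change was written
    origin_dc: str        # label for the domain controller that took the write


def resolve_conflict(a: AttributeChange, b: AttributeChange) -> AttributeChange:
    """Return the change that survives: the one with the later timestamp."""
    return a if a.changed_at >= b.changed_at else b


# Two administrators change the same attribute two seconds apart.
change_on_dc1 = AttributeChange("555-0100", datetime(2004, 3, 1, 9, 0, 5), "DC1")
change_on_dc2 = AttributeChange("555-0199", datetime(2004, 3, 1, 9, 0, 7), "DC2")

winner = resolve_conflict(change_on_dc1, change_on_dc2)
print(winner.value, "from", winner.origin_dc)
```

Note that the losing change is simply discarded, which is why an administrator can see an update silently "undone" after the next replication cycle.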
I mention this because I've seen situations in which two administrators try to apply updates to user accounts and can't figure out why some of their changes are undone. If you suspect that you might have a replication problem, do a little checking to make sure that two or more people are not trying to update the same information at the same time.
Another aspect of replication that I want to touch on is something called inter-site replication. Inter-site replication is domain controller replication across two or more sites.
The idea behind Active Directory sites is that you want to avoid congesting slow WAN links with excessive replication traffic. Imagine for a moment that you have a domain spanning two offices and that each of the two offices has ten domain controllers. Also, imagine that these two offices are separated by a slow WAN link.
In a situation like this, every time anyone makes a change to the Active Directory, the change is replicated to nineteen other domain controllers. It also means that, since there are nineteen other domain controllers that have to be updated, nineteen different copies of the same data are flowing across your network. To make matters worse, ten separate copies of the same identical data are flowing across your WAN link.
Now, imagine that someone is performing an Active Directory-intensive process, such as creating a hundred new user accounts. This process would cause at least a thousand different update sequences to flow across your WAN link. It is very possible that all of this traffic could choke out the link, preventing other, more important, traffic from flowing across it.
The solution to this problem is to divide the domain into two sites. In a situation like this, one domain controller in each site is designated as a bridgehead server. The bridgehead server is responsible for sending and receiving batches of Active Directory updates. To see how sites work, let's return to my example of the company with ten domain controllers in each office, separated by a WAN link.
In this situation, if someone in an office made an update to a domain controller, only nine updates would be sent out instead of nineteen. These updates are designed to update the domain controllers in the local site. Remember, however, that one of these domain controllers is acting as the bridgehead server for the site. The bridgehead server receives the updates and then sends a single copy of the update across the WAN link to the remote bridgehead server. The remote bridgehead server receives the update and then distributes it to the domain controllers in the remote site.
As you can see, only a single copy of the update was transmitted across the WAN link instead of ten separate copies. When implemented correctly, sites can drastically reduce replication-related network traffic.
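The arithmetic behind the two-office example is easy to check with a short Python sketch. The function names are my own, and the model is the simplified one used in this article (two offices, every update delivered as its own copy). It is not a measurement of real replication traffic.

```python
def copies_without_sites(dcs_per_office: int) -> tuple[int, int]:
    """Return (total copies sent, copies crossing the WAN) for one change
    in a two-office domain that has not been divided into sites."""
    total_dcs = dcs_per_office * 2
    total_copies = total_dcs - 1   # every other domain controller gets a copy
    wan_copies = dcs_per_office    # one copy per domain controller in the far office
    return total_copies, wan_copies


def copies_with_sites(dcs_per_office: int) -> tuple[int, int]:
    """Return (copies sent inside the originating site, copies crossing the WAN)
    once each office is its own site with a bridgehead server."""
    local_copies = dcs_per_office - 1  # the other domain controllers in the same site
    wan_copies = 1                     # bridgehead to bridgehead
    return local_copies, wan_copies


print(copies_without_sites(10))          # total and WAN copies for one change, no sites
print(copies_with_sites(10))             # local and WAN copies for one change, with sites
print(100 * copies_without_sites(10)[1]) # WAN copies when creating 100 accounts, no sites
```

With ten domain controllers per office, the functions return the same figures quoted in the text: nineteen total copies and ten WAN copies without sites, nine local copies and one WAN copy with sites.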
Troubleshooting replication
Anytime that you make an Active Directory update and the update isn't accessible to those accessing other domain controllers within a reasonable amount of time, there's a chance that you may have a replication problem. For example, imagine that an Administrator creates a new user account. The Administrator then calls the user to say that the new user account should be ready to use within about 20 minutes (after the next replication cycle completes). After about half an hour, the user calls back and says that she can't log in because Windows is telling her that her account doesn't exist. The Administrator checks and, sure enough, the account exists. In this case, the account exists on the domain controller that the Administrator is connected to, but the account has yet to be replicated to the domain controller that is processing the user's login, thus giving the illusion that the account doesn't exist.
If the company only has a few domain controllers, the administrator can actually use the Active Directory Users And Computers console to see which domain controllers the account has been written to. To do so, simply right-click on the domain name and select the Connect To Domain Controller command from the resulting shortcut menu. In doing so, the administrator will be able to connect individually to various domain controllers and see if the new account has been replicated.
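The same server-by-server check can be scripted rather than clicked through. The Python sketch below is one possible approach, not something the console does for you. It assumes the dsquery command-line tool is available on the machine running the script, and the server names and account name are hypothetical. The lookup is passed in as a parameter so the loop can be tried out without a live domain.

```python
import subprocess


def account_on_dc(sam_name: str, dc: str) -> bool:
    """Ask one domain controller whether it holds the account.

    Assumes the dsquery tool is installed; -samid filters on the logon
    name and -s names the server to query.
    """
    result = subprocess.run(
        ["dsquery", "user", "-samid", sam_name, "-s", dc],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0 and result.stdout.strip() != ""


def dcs_missing_account(sam_name, dcs, lookup=account_on_dc):
    """Return the domain controllers on which the account was not found."""
    return [dc for dc in dcs if not lookup(sam_name, dc)]


# Stand-in lookup for demonstration: pretend only DC3 has not received the account.
replicated_to = {"DC1": True, "DC2": True, "DC3": False}


def fake_lookup(sam_name, dc):
    return replicated_to[dc]


missing = dcs_missing_account("jsmith", ["DC1", "DC2", "DC3"], lookup=fake_lookup)
print(missing)
```

Against a real domain you would drop the `lookup=fake_lookup` argument so that each server is queried through dsquery.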
This technique works great for small organizations, but what if your domain has 200 domain controllers? You don't want to have to individually check each one. This is where a tool called the Replication Monitor comes in. The Replication Monitor is a tool that allows you to see exactly what is happening with the replication process. It allows you to view the status of Active Directory replication and force replication if necessary.
The Replication Monitor is one of the Windows 2003 Support Tools and, therefore, isn't installed automatically as part of the operating system. (This tool is also included in the Windows 2000 Support Tools.) To install the Windows 2003 Support Tools, insert your Windows 2003 Server CD. Now, open My Computer and browse the CD's contents. Navigate to the CD's \SUPPORT\TOOLS folder, and then run the SUPTOOLS.MSI file.
When installation completes, there will be an option for the Support Tools on the Start | All Programs menu, but the Replication Monitor is not listed on this menu. To open the Replication Monitor, you must go to the \PROGRAM FILES\SUPPORT TOOLS folder and run the REPLMON.EXE file.
When the Replication Monitor opens, you'll see a big, mostly empty screen. This console is divided into two columns. The column on the left simply says Monitored Servers, and the column on the right says Log. The screen starts out empty for a reason: in a large organization, if all domain controllers were automatically monitored, there would be so much data displayed that it would be very difficult to make sense of it all.
The first time I ever used the Replication Monitor, I was slightly upset that I was unable to automatically monitor all of my domain controllers. After all, I wanted a tool that would tell me where replication was failing, not a tool that would make me guess which server was failing and would then tell me if my guess was right. In a way, though, the Replication Monitor does tell you which server is failing.
Let's go back to the situation in which the Administrator creates a user account but the user can't access the account because it has never been replicated.
In a situation like this, you can use the Replication Monitor in conjunction with the information that you know to figure out which domain controllers are failing to receive replication updates.
For example, the administrator knows that the domain controller on which he created the account has a copy of the account. The administrator can even find out which domain controller he is connected to by using the Connect To Domain Controller option in the Active Directory Users And Computers console. Upon selecting this option, the console will tell you which domain controller you are currently connected to before asking you which domain controller you would like to connect to.
The other useful tidbit of information in this situation is the user's physical location. By looking at which building the user is located in, the Administrator can determine if the user is trying to authenticate through a domain controller in the same site as the administrator's domain controller or through a domain controller in a remote site. For the sake of argument, let's assume that both the user and the administrator are in the same building and are, therefore, accessing domain controllers in a common site.
In a situation like this, every domain controller in a site sends updates to every other domain controller in the site. The administrator knows that the domain controller he is attached to is functional, so he can tell the Replication Monitor to monitor that domain controller. He can then watch to see which domain controllers fail to be updated. If there is a failure replicating Active Directory information to all of the other domain controllers, then the administrator's domain controller is probably the one with the problem. If, however, only one domain controller fails to receive updates, then that's the domain controller with the problems.
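The reasoning in that last step boils down to a small decision rule, sketched here in Python. The function and the server names are hypothetical, and the results dictionary stands in for whatever you observe in the Replication Monitor. Treat it as a way to think about the observations rather than a diagnostic tool.

```python
def diagnose(source_dc: str, partner_updated: dict[str, bool]) -> str:
    """Guess where the fault lies from which partners received the update.

    partner_updated maps each replication partner's name to True if the
    update from source_dc arrived there, False if it did not.
    """
    failed = sorted(dc for dc, ok in partner_updated.items() if not ok)
    if not failed:
        return "replication looks healthy"
    if len(failed) == len(partner_updated):
        # Nobody received the update, so the sender is the common factor.
        return f"suspect the source: {source_dc}"
    # Some partners are fine, so the sender works; blame the ones that failed.
    return "suspect: " + ", ".join(failed)


print(diagnose("DC1", {"DC2": True, "DC3": False, "DC4": True}))
print(diagnose("DC1", {"DC2": False, "DC3": False, "DC4": False}))
```

If the source has only one partner and that partner fails, the rule cannot tell the two servers apart, so you would need to check both.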
To perform such an operation, right-click on the Monitored Servers container within the Replication Monitor and select the Add Monitored Server command from the resulting shortcut menu. This will cause Windows to display the Add Server To Monitor dialog box. You can either enter the server's name directly or you can select the server from a list. Upon entering the server name, Windows will display the Active Directory in tree form. You will notice that multiple domains are listed.
Expand the desired domain and you will see the other domain controllers in this domain. If you look at Figure A, you will notice that there is a red X over the icon for server Brien. In this case, I have purposefully taken this server offline so that you can see what a replication failure looks like. If you select the failing server, you can see log entries that give you additional information about the failure.
In a situation like this, the first thing you would want to do is right-click on the failing server, and select the Synchronize With This Replication Partner command from the resulting shortcut menu. When you do, the Replication Monitor will attempt to force replication. Of course, in this case, forcing replication is impossible because the server is down.
Fixing a replication problem
Once you have identified the problem server, the next step is to fix the problem. In every real-life replication failure that I have ever seen, the problem was one of three things: the server was down; the server was having trouble with network communications; or the server's hard disk was full.
Therefore, I recommend going to the server and checking out the basics. Make sure that the server has plenty of hard disk space. Next, make sure that you can ping the functional domain controllers. It's important that you be able to ping by both IP address and host name. If you find that you can ping by IP address but not by host name, then it's likely that the machine is having trouble communicating with a DNS server. Make sure that TCP/IP is configured correctly and that the server's designated DNS server is functional.
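These basic checks can be written down as a short triage routine. The Python sketch below only classifies results that you have already gathered (the two ping outcomes and the free disk space); the function name and the 500 MB threshold are my own assumptions, so pick a threshold that suits your servers.

```python
import shutil

# Hypothetical threshold: treat less than 500 MB free as "low".
MIN_FREE_BYTES = 500 * 1024 * 1024


def triage(ping_ip_ok: bool, ping_name_ok: bool, free_bytes: int,
           min_free_bytes: int = MIN_FREE_BYTES) -> list[str]:
    """Turn the results of the basic checks into a list of likely problems."""
    findings = []
    if free_bytes < min_free_bytes:
        findings.append("low disk space")
    if not ping_ip_ok:
        findings.append("network problem: no reply by IP address")
    elif not ping_name_ok:
        # IP works but the name does not, which points at name resolution.
        findings.append("DNS problem: IP address replies but host name does not")
    return findings or ["basics look fine"]


# Free space on the current drive, read with the standard library.
free_here = shutil.disk_usage(".").free

# Example: the partner answers by IP address but not by host name.
print(triage(ping_ip_ok=True, ping_name_ok=False, free_bytes=10 * 1024**3))
```

The `elif` mirrors the advice above: a failed ping by host name is only meaningful as a DNS symptom once the ping by IP address has succeeded.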
If everything checks out on the server, but it still can't receive replication updates, you are not completely out of luck. The truth is that there are quite a few less common problems that can cause replication troubles. This is especially true if you are dealing with replication across a site link. For example, when replicating across a site link, your designated bridgehead server may be too busy to effectively handle its bridgehead duties. You can find a description of these less common problems and their solutions in Microsoft's TechNet.