When good Domain Controllers go bad!

Scenario

It’s a pleasant day and all is well with the world. Colleagues are skipping around the office with smiles on their faces…until…duh duh daaa! One by one, services start failing:

  • Printers go offline:
    • First, for Win7 users
    • Then for all clients
    • Can still print from server though
  • File shares go offline
  • Active Directory replication fails
  • DNS console will not open

Basically, your main Domain Controller (DC) has just taken a dump…and so have you!

These are the steps I took to troubleshoot the issues and get everything back online.

Solution

Gather Information

Run the following commands to gather useful information:

ipconfig /all > c:\ipconfig.txt (from each DC/DNS Server)
dcdiag /v /c /d /e /s: > c:\dcdiag.txt
dcdiag /test:dns /s: /DnsBasic > c:\dcdiag-dnsbasic.txt
repadmin /showrepl dc* /verbose /all /intersite > c:\showrepl.txt (dc* is a placeholder for the common name prefix of the DCs, for use when more than one DC exists and their names all begin the same way)
repadmin /replsum > c:\replsum.txt

Pore through the txt files and note down the errors. Some of mine included:

  • repadmin /showrepl
    • Last error: 1256 (0x4e8): The remote system is not available.
    • Last error: 5 (0x5): Access is denied.
    • WARNING: KCC could not add this REPLICA LINK due to error.
    • result 1722 (0x6ba): The RPC server is unavailable.
  • repadmin /replsum
    • (1722) The RPC server is unavailable.
    • (5) Access is denied.
  • dcdiag /test:dns /s: /DnsBasic
    • The host could not be resolved to an IP address. Check the DNS server, DHCP, server name, etc.
    • Got error while checking LDAP and RPC connectivity. Please check your firewall settings.
    • Error: No LDAP connectivity.
    • invalid DNS server:
    • No host records (A or AAAA) were found for this DC.
    • Warning: no DNS RPC connectivity (error or non Microsoft DNS server is running).
    • Name resolution is not functional.
  • dcdiag /v /c /d /e /s:
    • EventID: 0x40000004 – The Kerberos client received a KRB_AP_ERR_MODIFIED error from the server.
    • EventID: 0xC00004B2 – The DFS Replication service failed to contact domain controller  to access configuration information.
    • EventID: 0xC000138A – The DFS Replication service encountered an error communicating with partner for replication group Domain System Volume.
    • The replication generated an error (-2146893022): The target principal name is incorrect.
    • Error: Detected circular loop trying to locate the ISTG.
  • repadmin /syncall
    • -2146893022 (0x80090322): The target principal name is incorrect.
    • SyncAll exited with fatal Win32 error: 8440 (0x20f8): The naming context specified for this replication operation is invalid.
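
Rather than reading every file by eye, the codes can be tallied with a short script. This is a minimal Python sketch and not part of the original troubleshooting: the function names `tally_error_codes` and `as_unsigned_hex` and the regular expression are illustrative assumptions. It only picks up codes written as `decimal (0xhex)`, the shape shown in the repadmin lines above, so lines such as "(1722) The RPC server is unavailable." are not counted. It also shows that a negative code such as -2146893022 is the same 32-bit value as 0x80090322.

```python
import re
from collections import Counter

# Matches codes written as "1722 (0x6ba)" or "-2146893022 (0x80090322)".
CODE_PATTERN = re.compile(r"(-?\d+) \((0x[0-9a-fA-F]+)\)")


def tally_error_codes(text):
    """Count each 'decimal (0xhex)' error code found in the given text."""
    counts = Counter()
    for decimal, hex_code in CODE_PATTERN.findall(text):
        counts[(int(decimal), hex_code.lower())] += 1
    return counts


def as_unsigned_hex(code):
    """Show a (possibly negative) 32-bit status code as unsigned hex."""
    return hex(code & 0xFFFFFFFF)


# Sample lines copied from the errors listed above.
sample = """
Last error: 1256 (0x4e8): The remote system is not available.
Last error: 5 (0x5): Access is denied.
result 1722 (0x6ba): The RPC server is unavailable.
-2146893022 (0x80090322): The target principal name is incorrect.
"""

for (decimal, hex_code), count in tally_error_codes(sample).most_common():
    print(count, decimal, hex_code, as_unsigned_hex(decimal))
```

To run it against the saved files, read each one into a string first and pass that to `tally_error_codes`. The encoding of redirected console output varies between systems, so opening the files with a forgiving error mode is a reasonable assumption.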

Some of the information seemed to conflict: tests for certain services (like DNS) failed, yet you could still ping by name and confirm resolution using nslookup. Moving on.
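
One way to untangle that kind of conflict is to check name resolution and service reachability separately, since a name can resolve while the LDAP or RPC port behind it is unreachable. The following is a rough Python sketch under stated assumptions, not something from the original troubleshooting: the helper names (`resolves`, `port_open`, `check_dc`) are made up for illustration, and the port list covers only the well-known TCP ports for DNS, Kerberos, the RPC endpoint mapper, LDAP and SMB. It cannot tell you why a port is closed, only that it is.

```python
import socket

# Well-known TCP ports a domain controller normally listens on.
DC_PORTS = {
    "DNS": 53,
    "Kerberos": 88,
    "RPC endpoint mapper": 135,
    "LDAP": 389,
    "SMB": 445,
}


def resolves(hostname):
    """Return the addresses the name resolves to, or an empty list on failure."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_dc(hostname, ports=DC_PORTS):
    """Report name resolution and per-service TCP reachability for one host."""
    report = {"addresses": resolves(hostname)}
    for service, port in ports.items():
        report[service] = port_open(hostname, port)
    return report
```

Calling `check_dc` with the name of each DC (for example a hypothetical `dc01.example.local`) from a client and from another DC gives a quick picture of which side of the problem is DNS and which is connectivity.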

Go through the errors one by one and search online for solutions. Here are some of the URLs I used to troubleshoot errors:

By now things might seem to snowball, but stay calm and keep trying recommended steps from Microsoft, recording your steps along the way:

To stop the KDC

  1. At a command prompt, type the following command and press ENTER:
     net stop KDC
  2. If the KDC cannot stop, set its startup type to Disabled and restart the server.

To purge the ticket cache

  1. At a command prompt, type the following command and press ENTER:
     klist purge
  2. Answer Yes for each ticket.

To reset the computer account password on the PDC emulator

  1. At a command prompt, type the following command and press ENTER:
     netdom resetpwd /server:/userd:\administrator /passwordd:*

Some other commands I used included:

dcdiag /test:CheckSecurityError /s
dcdiag /testdomain:
nltest /logon_query
nltest /dclist:
nltest /domain_trusts
nltest /DSQUERYDNS
nltest /DSREGDNS
nltest /sc_verify:
nltest /dsgetdc: /force
net config rdr
dsquery * forestroot -scope subtree -filter "(serviceprincipalname=)" -attr * -s

nltest /dsgetdc: /gc gave this error:
Getting DC name failed: Status = 1355 0x54b ERROR_NO_SUCH_DOMAIN

nltest /server: /sc_query: gave this error:
I_NetLogonControl failed: Status = 1355 0x54b ERROR_NO_SUCH_DOMAIN

Know when to quit

My troubleshooting ran into a second day. By now, users were using a workaround to access printers and file shares, but the DC errors continued. At this point, I decided to demote the DC and just leave it as a file and print server, which is best practice anyway.

After taking a snapshot of the DC (via VMware vCenter), I proceeded to go through the standard steps to demote a DC:

  1. Transfer all FSMO roles to another DC – this failed with a generic error (http://social.technet.microsoft.com/Forums/en/winserverDS/thread/3f49ddbc-c948-43ac-af21-2f5a4f3dce9b).
  2. Run dcpromo to demote DC – this also failed.

Great. Now the only option was a forceful removal of the DC (http://technet.microsoft.com/en-us/library/cc731871(v=ws.10).aspx).

dcpromo /forceremoval worked fine. I then removed the DC from Sites and Services, at which point the FSMO roles were transferred to another DC, so I didn’t need to seize them. You used to have to go through a Metadata Cleanup after forcing a demotion, but now this is done for you when you remove the DC from Sites and Services. This can be confirmed by following the steps here: http://www.petri.co.il/delete_failed_dcs_from_ad.htm

Although this is much easier on Windows Server 2008 R2, you will still need to tidy up a little in other areas:

  1. Remove all entries for the failed DC from the Name Servers tab in the properties of every relevant DNS zone.
  2. Back up the DHCP database and restore it to another server.
  3. Tombstone WINS entries from the failed DC:
    1. From another DC, go to WINS > Active Registrations > right-click > Delete Owner.
    2. Select failed DC.
    3. Replicate deletion to other servers (tombstone).
    4. The new DC will then take ownership of the records.
  4. Uninstall above roles from failed DC.
  5. Update DHCP and devices with static IPs to use the new DC’s IP Address for DNS and WINS. You did spin up a new DC right?!?!

Another great tip I found was from this thread on Spiceworks:

If we really want to be safe then open a command prompt with elevated privileges and run the following command
csvde -f C:\ad_details.csv
This exports all contents of ADSI Edit to a CSV file in the root of the C drive called “ad_details.csv”. Open this in Excel and do a Find All for the name of the failed DC. If it finds any references then we have lingering objects and will need to perform a Metadata Cleanup.
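
If the export is too large to search comfortably in Excel, the same check can be scripted. This is a small Python sketch rather than anything from the Spiceworks thread: the function name `find_references`, the sample rows and the DC name `OLDDC` are all made up for illustration. It reports every cell whose value contains the name you search for, ignoring case.

```python
import csv
import io


def find_references(csv_file, needle):
    """Yield (record_number, column_name, value) for cells containing needle."""
    needle = needle.lower()
    reader = csv.DictReader(csv_file)
    for record_number, row in enumerate(reader, start=1):
        for column, value in row.items():
            # Skip empty cells and any overflow fields DictReader collects.
            if isinstance(value, str) and needle in value.lower():
                yield record_number, column, value


# Made-up sample in the general shape of a csvde export (header row first).
sample = io.StringIO(
    "DN,objectClass,name\n"
    '"CN=DC02,OU=Domain Controllers,DC=example,DC=local",computer,DC02\n'
    '"CN=OLDDC,OU=Domain Controllers,DC=example,DC=local",computer,OLDDC\n'
)

hits = list(find_references(sample, "olddc"))
for record_number, column, value in hits:
    print(record_number, column, value)
```

For the real file, open `C:\ad_details.csv` in text mode with `newline=""` and pass the file object in place of `sample`. The right encoding depends on how the export was produced, so treat that as something to check rather than assume.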

Conclusion

Although this was a nightmare to troubleshoot – and I have a chip on my shoulder as I didn’t find the root cause or fix the DC – I have more confidence in the steps to force the removal of a screwed-up DC. Next time I’ll learn to let go a little faster.

Update: I’ve just found more notes on this that may be useful in future:

Comments

  1. I’m on my second day on this endless quest to try to fix one DC that won’t replicate and has so many errors on it, but I’m at the point that it’s time for it to go. Sadly, it seems this started with a W32time problem that was not taken care of for over 1 year by the previous IT guy…the pains of Domain Controllers Arghhh!!

    • I feel your pain. I’ve seen terrible problems off the back of time-sync issues. It’s so important to have all servers in sync. If not, the strangest things can happen.