Friday, October 03, 2008

Windows Server 2003 – read this if you abruptly lose network connectivity on a restart

I rebooted our corporate Windows Server 2003 today. I was moving it to a UPS. No problem – except when I restarted I had no network connectivity.

First I saw a “service didn’t start, check the event viewer” message. The event viewer just told me I couldn’t register with the domain. I couldn’t do that because I didn’t have network access. I got the usual “may have limited connection” error.

I did all the usual things (ipconfig, repair connection, swap cables, switch accounts, log in as a local user, test everything, etc.) and every check passed. The big breakthrough came when I investigated the advanced boot options on restart. Windows 2003 includes a “Safe Mode with Networking” option (my “safe start with network”). When I booted that way I had a network connection.

There was a lot more work to do before I found that disabling the IPSEC service, then rebooting, fixed everything.

I easily blew 6-8 hours of work today.

Lesson 1: Run Safe Boot/Safe Start with networking first.

Then you work your way through this Microsoft kb article. I’ll excerpt some key points, then pass on a trick, then I’ve got to go home and finish up the work I couldn’t do today …

How to troubleshoot startup problems in Windows Server 2003

How to Start the Computer in Safe Mode
When you start the computer in Safe mode, Windows loads only the drivers and computer services that you need. You can use Safe mode when you have to identify and resolve problems that are caused by faulty drivers, programs, or services that start automatically.
If the computer starts successfully in Safe mode but it does not start in normal mode, the computer may have a conflict with the hardware settings or the resources. There may be incompatibilities with programs, services, or drivers, or there may be registry damage. In Safe mode, you can disable or remove a program, service, or device driver that may prevent the computer from starting….
How to Use System Configuration Utility

System Configuration Utility (Msconfig.exe) automates the routine troubleshooting steps that Microsoft Product Support Services technicians use when they diagnose Windows configuration issues…

… Click the General tab, and then click Selective Startup.

…Note You might be able to determine more quickly which service is causing the problem by testing the services in groups. Divide the services into two groups--select the check boxes of the first group, and clear the check boxes of the second group. Restart your computer, and then test for the problem. If the problem occurs, the faulty service is in the group with the selected check boxes. If the problem does not occur, the faulty service is in the group with the cleared check boxes. Repeat this process on the faulty group until you have isolated the faulty service.
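The halving procedure in that note can be sketched in a few lines of Python. This is only an illustration, not anything Microsoft ships: the function name and the `problem_occurs` callback are made up, and the callback stands in for the manual step of selecting a group in msconfig, restarting, and checking for the problem. It assumes exactly one faulty service that misbehaves on its own, which real service dependencies may not honor.

```python
from typing import Callable, List, Sequence


def find_faulty_service(
    services: Sequence[str],
    problem_occurs: Callable[[List[str]], bool],
) -> str:
    """Halve the candidate list until a single service remains.

    problem_occurs(enabled) is a hypothetical stand-in for: enable only
    `enabled`, restart, and report whether the problem shows up.
    """
    candidates = list(services)
    if not candidates:
        raise ValueError("no services to test")
    while len(candidates) > 1:
        mid = len(candidates) // 2
        first, second = candidates[:mid], candidates[mid:]
        # If the problem occurs with the first group enabled, the culprit
        # is in that group; otherwise it is in the group left cleared.
        candidates = first if problem_occurs(first) else second
    return candidates[0]


if __name__ == "__main__":
    # Placeholder service names, with a simulated restart test.
    names = ["Alpha", "Bravo", "IPSEC Services", "Delta", "Echo"]
    print(find_faulty_service(names, lambda enabled: "IPSEC Services" in enabled))
```

Each call to `problem_occurs` costs one restart, so a list of about a hundred services needs roughly seven restarts instead of a hundred.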

It took hours.

Here’s the trick. Boot in Safe Mode first. Then run msconfig.exe and look at the services. Assuming things work in safe mode, the ones that are running (sort by that column) are good. Now uncheck all services, check the ones that are currently running, apply, restart.

When you restart you’re in the equivalent of Safe Mode, but you can use msconfig.exe to add services in blocks.
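Here is a rough Python sketch of that bookkeeping, in case it helps to see it written down. Everything in it is hypothetical (the function name, the block size, the service names); it only computes which services to check at each restart, starting from the set that was running in Safe Mode, and does not touch msconfig itself.

```python
from typing import Iterable, List, Tuple


def plan_restarts(
    all_services: Iterable[str],
    running_in_safe_mode: Iterable[str],
    block_size: int = 5,
) -> List[Tuple[List[str], List[str]]]:
    """Return (newly added block, full enabled list) for each restart."""
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    # Services seen running in Safe Mode are treated as known good.
    baseline = sorted(set(running_in_safe_mode), key=str.lower)
    # Everything else is untested and gets added back block by block.
    untested = sorted(set(all_services) - set(baseline), key=str.lower)
    plan = []
    enabled = list(baseline)
    for start in range(0, len(untested), block_size):
        block = untested[start:start + block_size]
        enabled = enabled + block
        plan.append((block, list(enabled)))
    return plan


if __name__ == "__main__":
    steps = plan_restarts(["A", "B", "C", "D", "E"], ["B", "D"], block_size=2)
    for number, (block, enabled) in enumerate(steps, start=1):
        print(f"restart {number}: add {block}, enabled {enabled}")
```

If the network dies after a given restart, the culprit is somewhere in the block added at that step.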

The UI of this app is dismal. I sorted alphabetically, then did screen captures to a Word document to get a complete alpha-sorted list. I printed that to guide my tedious enabling of sets. (In theory the binary search approach is faster. Long story, can’t explain.)

One thing to watch for.

When you enable “Error Reporting Service” you start getting … error reports! Wow. So if it gets enabled with a bunch of other items, you might think you’ve found a problem. Wrong. It’s just that now you’re getting the error reports.

IPSEC.

So now I have to figure out what the #$!#% happened. I don’t think we’ve done any software installs on that box or tweaked any services. Did some antivirus update trigger a problem?

Update: This Experts Exchange article may be related, but the responses are not accessible. A clue:

Description: The IPSec driver has entered Block mode. IPSec will discard all inbound and outbound TCP/IP network traffic that is not permitted by boot-time IPSec Policy exemptions. User Action: To restore full unsecured TCP/IP connectivity, disable the IPSec services, and then restart the computer. For detailed troubleshooting information, review the events in the Security event log.
This suggests an interaction between Group or Local security policy, IPSEC block mode, and loss of network access. I wonder if a corruption or misconfiguration of a local policy setting could cause this.

Update: This article connects group policy file corruption to IPSEC problems and loss of network access, and points out there are definite bugs with group policy editing. I didn't touch local or group policy on our server, but perhaps another admin might have. I now see there have been nasty unfixed bugs.

Update: I'll take a look at these when I get back to work on Monday, then update this post. I think we're narrowing things down to a corruption or misconfiguration of a group policy file that activated IPSEC and blocked all TCP/IP network traffic, without leaving any meaningful entry in the event viewer.
  • http://support.microsoft.com/kb/870910: looks like a pretty pertinent kb article
  • http://support.microsoft.com/kb/914962: IPSEC bugs fixed in SP2. So did some later upgrade break them again? Clearly I need to check Windows Update for the server.
  • http://support.microsoft.com/kb/898060: After SP1 a security update broke IPSEC. Should be ok in SP2, but did it get broken again?
  • http://marc.info/?l=patchmanagement&m=121632162501913&w=2: A fairly recent DNS spoof prevention security update from Microsoft has broken IPSEC on some machines.
  • http://support.microsoft.com/default.aspx?scid=kb;en-us;816579: In place upgrades when WS 2003 is truly hosed. I don't think this applies, but nice to know.
Lots of evidence that the Windows 2003 IPSEC architecture and TCP/IP stack are pretty fragile. No wonder Microsoft famously redid the network stack in Vista. They weren't reacting to XP, they were reacting to Windows Server.

So Monday I'll look at Windows Update and try opening, reviewing and saving the IPSEC and Group Policy files. If they're corrupted they may cause other problems.

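Update 12/14/08: I'm grateful to an anonymous visitor for finding the underlying issue. S/he references two Microsoft kb articles (http://support.microsoft.com/kb/956188 and http://support.microsoft.com/kb/956189, see the comments below), and I've added a less important but related third article.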
A botched security update 953230 (MS08-037) causes a variety of Windows 2003 failures due to a UDP port conflict. Essentially Microsoft switched to random port assignments, which is good, but they forgot some ports might be in use (bad). Depending on what gets randomly whacked, you may lose a service.

The latter references the problem I had:
Event Type: Error
Event Source: IPSec
Event Category: None
Event ID: 4292
Date: Date
Time: Time
User: N/A
Computer: Server_name
Description: The IPSec driver has entered Block mode. IPSec will discard all incoming and outgoing TCP/IP network traffic that is not permitted by boot-time IPSec Policy exemptions.
User Action: To restore full unsecured TCP/IP connectivity, disable the IPSec services, and then restart the computer. For detailed troubleshooting information, review the events in the Security event log.
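If you have a pile of event text copied out of the event viewer, a few lines of Python can flag this particular entry. This is a sketch under assumptions: it expects the "Field: value" layout shown above, and the function names are mine. It only matches the Event Source and Event ID quoted in the excerpt.

```python
from typing import Dict


def parse_event(text: str) -> Dict[str, str]:
    """Split 'Field: value' lines from a pasted event into a dictionary."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so colons in the value survive.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_ipsec_block_mode(fields: Dict[str, str]) -> bool:
    """True for the IPSec / Event ID 4292 entry quoted above."""
    return fields.get("Event Source") == "IPSec" and fields.get("Event ID") == "4292"


if __name__ == "__main__":
    sample = """Event Type: Error
Event Source: IPSec
Event Category: None
Event ID: 4292
Description: The IPSec driver has entered Block mode."""
    print(is_ipsec_block_mode(parse_event(sample)))
```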
Update 12/31/08: Nope, it didn't work.

I finally got around to applying Microsoft's fix and it didn't work!

So even after I reserved these ports:
3343-3343
1645-1646
1812-1813
2883-2883
4500-4500
I still got the service failure notice on restart and lost my network connections. Guess I'll have to wait for a service pack. I removed the registry changes I'd made (why ask for trouble?) and again disabled IPSEC services.
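For anyone double-checking their own list before a restart, here is a small Python sketch that parses entries in the "low-high" form shown above and tests whether a given port falls inside one of them. The helper names are hypothetical, and this only validates the text of the entries; it does not read or write the registry.

```python
from typing import Iterable, Tuple

# The ranges listed in this post, exactly as I entered them.
RESERVED_ENTRIES = [
    "3343-3343",
    "1645-1646",
    "1812-1813",
    "2883-2883",
    "4500-4500",
]


def parse_range(entry: str) -> Tuple[int, int]:
    """Turn 'low-high' into a (low, high) pair, rejecting malformed entries."""
    parts = entry.strip().split("-")
    if len(parts) != 2:
        raise ValueError(f"expected 'low-high', got {entry!r}")
    low, high = int(parts[0]), int(parts[1])
    if not (1 <= low <= high <= 65535):
        raise ValueError(f"invalid port range {entry!r}")
    return low, high


def is_reserved(port: int, entries: Iterable[str] = RESERVED_ENTRIES) -> bool:
    """True if the port lies inside any of the listed ranges."""
    return any(low <= port <= high for low, high in map(parse_range, entries))


if __name__ == "__main__":
    for port in (4500, 1646, 500):
        print(port, is_reserved(port))
```

A typo such as "4500-450" raises a ValueError here instead of failing silently.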

8 comments:

  1. I’ve been struggling with IPSEC problems for a while. I’ve read many articles but nothing that covers the problems I’ve experienced. I manage over 50 servers using terminal services and find at times some of them will not be accessible after a remote reboot. After having someone onsite reboot the server it always comes up fine. After the remote reboots the event log shows the IPSec driver has entered Block mode, which of course prevents access. So my problem is strictly with a reboot done with terminal services. Since IPSec works fine otherwise, the policy must be fine as well.

  2. Just a follow-up to my post above. I believe I found the cause and answer to this problem and why it is intermittent. The problem and resolutions can be found here: http://support.microsoft.com/kb/956188. This article also discusses this issue: http://support.microsoft.com/kb/956189

  3. Thanks very much for following-up with the solution. I'd just disabled IPSec and put it on the back burner. Now I'll fix it.

    I added your references as an update to the original post.

  4. I had the same problem with a Windows 2003 32-bit server. The system didn't come up after WSUS patching. I rolled the system back to a pre-patching snapshot but still had networking only in safe mode. Disabling the IPSec service did the trick.

    Did you ever find a root cause for the IPSec issue?

  5. Thanks, you just saved me hours of diagnostics!!!!

  6. Yep. We had this happen to us. Windows Update, reboot, system was offline. After many dead ends, found this article, disabled the IPSEC service (previously set to automatic) and restarted the server. Obviously, something happened here after having no issues with this heretofore, and still a bit of a mystery as to why this had to happen.

  7. I have been fighting this issue for a couple days! I just found and read through this and tried turning off IPSEC with no results. Does anyone have any further updates on this issue?
