HP ProLiant DL360 G3 Red Health Light and Amber Interlock LED

by Joe Payne 16. May 2011 22:13

Well it just goes to show you are never done learning in the IT business.

A few weeks ago I decided to deploy my backup web server, as the primary was showing odd hardware errors in Server 2003. However, I couldn’t even get the backup to POST. In fact, it just stared at me with a red health LED on the front and nothing more. The power supplies didn’t even light their green LEDs, even though I could hear the fans spinning inside. Inside the server, the only indicator lit was an amber Interlock light.

So I cold-booted the primary and brought it back online. I pulled the backup DL360 G3 and brought it home for the usual kitchen-table bench testing. Yes, the wife just loves it when I do that.

But nothing I did resolved the seemingly dead server. I reseated CPUs, CPU power boards, RAM, power supplies and anything else that remotely looked removable. Nothing.

I finally got around to ordering another backup DL360 G3 last week off eBay. It arrived today. Finally, my anxiety over running live sites on a single box with no backup would soon be a thing of the past. I unboxed the new backup DL360 G3, threw it on the kitchen table too and ………. nothing. The EXACT same symptoms as the previous backup server.

So by the general laws of armchair logic, I knew this was no longer a problem with the hardware. There had to be something else. I hopped on Google, did the usual searches and spent a good 30 minutes reading post after post. Finally, I found the answer. Within 10 seconds I had both servers running great.

The problem is the recycler companies that buy these units up and sell them on eBay.  They don’t know HP units as well as they advertise.  And they DO NOT “test” them to ensure they boot.  At least they don’t AFTER they install two power supplies in a system specifically configured for a single power supply.

That’s right, the problem was that the motherboard has a configuration switch (SW2 on the DL360 G3) with 4 positions. The 4th position specifically tells the system whether there are two power supplies or just one. It’s not automatic in the G3 series.

The recyclers clearly acquired these units as single power supply, tested them, then loaded up the additional power supply under the assumption it would be auto-detected.  This isn’t true in the G3 series if the 4th switch on SW2 is not set correctly.  Toggling the 4th SW2 switch on the motherboard immediately eliminated the problem on both servers.

Now the power LED on each power supply lights up.  The Interlock light no longer lights up.  And the servers proceed to POST boot without a problem.

Hopefully this blog post will save somebody the headache and expense I’ve endured for such a painfully simple solution.


Personal | Tech Support

HP Compaq Servers and intermittent traffic errors

by Joe Payne 24. February 2009 23:09

Well since I haven't had time to post some blog entries this month, I thought I'd do so while fighting this really big problem with certain client file servers.

Over the past few months, I've noticed increasingly intermittent VPN tunnel connections using the Sonicwall GVC. Now and then my VPN connection would just "die", yet my local internet was fine. I could drop the tunnel and reconnect - everything would be fine. For a while. An inconvenience but nothing to cry about.

As time went on, this became more frequent. Recently it's gotten to the point where I have to bounce back and forth between the primary and backup firewalls trying to get a VPN connection that will stay working for 5 minutes. Wow, has it been frustrating. And it always seemed to happen when the traffic load got higher, like when a large image appeared or a logo screen popped up during RDP sessions.

These servers were pretty static for the last 2 years except for the usual Microsoft updates. All were HP Compaq DL380 G3 units running Windows Server 2003. W2K3 had been patched up to SP2, though I forget how long ago.

All along, we never really noticed any "internal" LAN issues.  Occasionally I'd get complaints on speed but usually the problem was gone by the time I was on the case (within 15-30 minutes of the report).  Chalking that up to typical network bandwidth spikes was easy.

But, enter client #2. These guys have brand new DL360 G5 units running all sorts of the newer drivers, one even running Server 2008 SP1. Client #2 starts seeing some really strange traffic issues, like DHCP suddenly not responding to station requests. Bouncing the switch seemed to help, and rebooting the DHCP server (W2K8 box) fixed it one time. Then those issues went away for about a week, so it was dismissed as deployment gremlins. All except one lonely IBM ThinkPad that simply would not get an IP from DHCP. It could ping devices on the same switch but saw nothing on the other 48-port NetGear Gigabit switch. Now things are getting bizarre.

So now things are interesting. Devices link up and see same-switch peers but cannot see across multiple switches. Since all the devices involved at both client 1 and client 2 have worked flawlessly in the past, there had to be something common to everyone. I started going down the road of 'blame the switch' and had some credible evidence too. But overall, I just wasn't seeing the same issues internally on the scale and frequency I was seeing them remotely.

Then tonight, in yet another recon mission into possible causes, I noticed something odd. Client 1 Win2K8 server was showing "no buffers" receive errors on one NIC. But not on the OTHER NIC. This made no sense since the NICs were teamed and should be multi-casting - I should see duplicate errors on both NICs.

Finally I had something more "concrete" to work with in the search engines.  I quickly came up with a very, very long thread on the HP support forums and it darn near described my issues to the last detail!  AHA!  Gotcha :)

Here's the link to the huge thread of a whole lot of frustrated HP server customers:  http://forums11.itrc.hp.com/service/forums/bizsupport/questionanswer.do?threadId=1153566

Apparently there are issues with Broadcom NICs when certain advanced TCP Offload and Chimney settings are enabled in Windows. Certain Broadcom NICs have the ability to offload some of the TCP traffic work to the NIC instead of it being handled by the OS. These features are apparently enabled in W2K3 SP2 by default and can cause all sorts of traffic issues depending on the load being put on the NIC.

The HP Compaq solution is to update the firmware on the NICs, update the HP NIC drivers, and then disable the offending features in the registry and the NIC advanced settings. The specific steps listed (your specific cp*.exe file may differ):

1: Upgrade BIOS and firmware from disk FW800.2008_0207.37.iso (firmware-8.00-0.zip)

The order is important: every time you install the driver, the settings go back to the default, which is enabled.

2: Upgrade drivers: cp008415.exe
3: Upgrade NCU (Network Configuration Utility): cp008413.exe

4: Edit the registry:
EnableRSS = 0
EnableTCPA = 0
EnableTCPChimney = 0

5: Run the following command: netsh int ip set chimney DISABLED

6: Go inside NCU and, on each NIC, go to the advanced settings:
Remove the enabled tick for TCP Offload Engine.
Remove the enabled tick for Receive-Side Scaling.

7: Reboot
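
For anyone who has to repeat steps 4 and 5 on a dozen servers, here is a rough Python sketch that just prints the command lines to run. It is not from the HP thread. The registry key path (Tcpip\Parameters) is my assumption, since the steps above don't name the key, and the helper name build_disable_commands is made up for illustration. Check the key on your own box before running anything it prints.

```python
# Assumption: the three values from step 4 live under the Tcpip\Parameters key.
# The forum steps do not name the key, so verify this on your own server.
TCPIP_PARAMS_KEY = r"HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"

# Value names taken from step 4; a DWORD of 0 disables each feature.
OFFLOAD_VALUES = ("EnableRSS", "EnableTCPA", "EnableTCPChimney")


def build_disable_commands(key=TCPIP_PARAMS_KEY, values=OFFLOAD_VALUES):
    """Return the command lines for steps 4 and 5 as plain strings.

    Nothing is executed here; the caller decides where and how to run them.
    """
    commands = [
        f'reg add "{key}" /v {name} /t REG_DWORD /d 0 /f'
        for name in values
    ]
    # Step 5, as quoted in the steps above.
    commands.append("netsh int ip set chimney DISABLED")
    return commands


if __name__ == "__main__":
    for line in build_disable_commands():
        print(line)
```

Printing the commands instead of running them keeps the script harmless on a workstation; paste the output into a command prompt on each server, after the driver install, since the driver install resets the settings.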

I'm going through all of this tonight - so far it seems to have helped Client 2 with the newer G5 servers. However, I'm not seeing the same improvement with client 1 and the G3 servers. It could be that Client 1 has 12 servers and I've only updated 1 - the problem now is keeping the VPN connection alive long enough to get the updates done on the other servers. We shall see.........



Personal | Tech Support
