
Fibre Channel -- failed H:0x5 D:0x0 P:0x0


Hi everyone... running into a pretty baffling problem as we're looking to establish FC connectivity from our ESXi 5.x hosts to an IBM Gen2 XIV.

 

We're deploying this at two separate sites with identical hardware -- save that one site is running ESXi 5.0 and the other is running ESXi 5.1.  Both configurations are a number of Dell blades with QLogic HBAs talking to an in-bladecenter Brocade switch, which is ISL'd up to our fibre "core" switch environment that the XIV hangs off directly.  We're using WWPN-based zoning and two fabrics (two core FC switches and two FC switches in each bladecenter).
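
In case it helps anyone comparing notes: the ESXi shell on 5.x ships a Python interpreter, so I've been sanity-checking the pathing with a rough sketch like the one below.  It counts paths (and dead paths) per device by parsing the "Device:" / "State:" lines that "esxcli storage core path list" prints -- the parsing assumes the default human-readable output, so treat it as a sketch rather than a tool.

import subprocess

# Parse 'esxcli storage core path list' and count paths (and dead paths) per device.
out = subprocess.Popen(
    ["esxcli", "storage", "core", "path", "list"],
    stdout=subprocess.PIPE).communicate()[0].decode("utf-8", "ignore")

paths = {}
dead = {}
device = None
for raw in out.splitlines():
    line = raw.strip()
    if line.startswith("Device:"):
        # Each path block lists its device before its state.
        device = line.split(":", 1)[1].strip()
        paths[device] = paths.get(device, 0) + 1
    elif line.startswith("State:") and device is not None:
        if "dead" in line.lower():
            dead[device] = dead.get(device, 0) + 1

for dev in sorted(paths):
    print("%-40s paths=%d dead=%d" % (dev, paths[dev], dead.get(dev, 0)))

With two HBAs per blade and two fabrics, every XIV LUN should report the same path count and zero dead paths; anything uneven would point back at zoning or the ISLs.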

 

Our problem is that at the site running ESXi 5.1, HBA rescans on the ESXi hosts take a *long* time -- when we had only a few LUNs, rescans were taking ~5 minutes... now that we have 20 or so LUNs exposed, we're up to 35-40 minutes for a rescan to complete.  This also makes host reboots take an excessive amount of time (which makes sense).  At the site running ESXi 5.0, everything is "normal" -- rescans complete in under a minute.
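
For what it's worth, we've been timing the rescans with a trivial wrapper like the one below (a rough sketch -- it just wraps "esxcli storage core adapter rescan"; swap --all for --adapter vmhbaN to time a single HBA):

import subprocess
import time

start = time.time()
# Rescan every HBA; pass ["--adapter", "vmhbaN"] instead of ["--all"] to test one HBA.
subprocess.check_call(["esxcli", "storage", "core", "adapter", "rescan", "--all"])
print("Rescan completed in %.1f seconds" % (time.time() - start))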

 

The following errors can be observed in vmkernel.log throughout the rescan (and actually at any time through the day -- they seem to be more frequent during rescan activity):

 

2013-03-29T20:24:32.112Z cpu6:8198)ScsiDeviceIO: 2329: Cmd(0x4124003e56c0) 0x1a, CmdSN 0x73f5a from world 0 to dev "eui.001738000f86088c" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
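
As far as I can tell, the codes break down like this: H:0x5 is a host (HBA/driver) status meaning the command was aborted, D:0x0 means the device itself returned GOOD, P:0x0 means no plugin error, and opcode 0x1a is MODE SENSE(6), which the VMkernel issues internally (hence "from world 0").  To see which LUNs are throwing these and how often, I've been tallying a copy of vmkernel.log with the quick sketch below -- the filename and the line format are assumptions based on the line above.

import re

# Match lines like:  ... to dev "eui.001738000f86088c" failed H:0x5 ...
PATTERN = re.compile(r'to dev "([^"]+)" failed H:0x5')

counts = {}
with open("vmkernel.log") as log:   # a copy of /var/log/vmkernel.log pulled off the host
    for line in log:
        m = PATTERN.search(line)
        if m:
            dev = m.group(1)
            counts[dev] = counts.get(dev, 0) + 1

# Print the noisiest devices first.
for dev, n in sorted(counts.items(), key=lambda item: -item[1]):
    print("%6d  %s" % (n, dev))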

 

We have tickets open with both VMware (who pointed us to our storage vendor, IBM) and IBM.  IBM made a few suggestions (nothing major), which we implemented, but they didn't help.  From their perspective, our SAN environment looks fine -- no errors on ports, zoning correct, firmware up to date, etc.  We're continuing to work the support angle, but wanted to throw the issue out here in case anyone has any suggestions.

 

At this point, our next steps are more along the lines of divide and conquer / trial and error.  The site with the newer version of ESXi is also running newer firmware on its QLogic HBAs than the site running ESXi 5.0 (where there are no problems).  We may try downgrading one of our 5.1 hosts to 5.0 to see if the problem follows.  After that, we'll try to reproduce the issue from an OS other than ESXi, and then perhaps from a standalone host attached directly to the BC switch.  And so on and so on.  Unfortunately, this is all PRD gear, so everything has to be scheduled, and it'll take a while to get through all of the trial & error.
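
Before we schedule any of that, we're grabbing the same inventory from one host at each site so the two can be diffed -- ESXi build, HBA/driver details, and installed VIBs (the QLogic driver VIB is usually named scsi-qla2xxx, but that's worth confirming on your own hosts; the HBA firmware itself we're comparing by hand).  A rough sketch of what we run:

import subprocess

def run(cmd):
    # Print the command, then let its output go straight to the terminal.
    print("### " + " ".join(cmd))
    subprocess.call(cmd)

run(["vmware", "-vl"])                                 # ESXi version and build
run(["esxcli", "storage", "core", "adapter", "list"])  # HBAs, driver name, link state
run(["esxcli", "software", "vib", "list"])             # look for the QLogic driver VIB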

 

Does anything here jump out to anyone that could help us jump-start solving the problem?

 

I'll note that in watching the logs, *some* LUNs seem to throw errors more frequently than others (and the number of errors is pretty consistent within each group).  I thought this might have to do with some of the LUNs being detected as supporting "Hardware Acceleration" and others not (which is also baffling, since these are all LUNs on the same XIV -- why wouldn't they all either support or not support HW Acceleration?).
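
To line the Hardware Acceleration question up against the error counts, I've been dumping the per-primitive VAAI status for every device with the sketch below -- it assumes the indented "... Status:" layout that "esxcli storage core device vaai status get" prints on 5.x.

import subprocess

# Print one line per device/primitive so the output can be lined up
# against the per-LUN error counts from the log.
out = subprocess.Popen(
    ["esxcli", "storage", "core", "device", "vaai", "status", "get"],
    stdout=subprocess.PIPE).communicate()[0].decode("utf-8", "ignore")

device = None
for line in out.splitlines():
    if line and not line.startswith(" "):        # unindented line = device name
        device = line.strip()
    elif "Status:" in line and device is not None:
        primitive, status = line.strip().split(":", 1)
        print("%-30s %-14s %s" % (device, primitive, status.strip()))

If the noisy group turns out to be exactly the LUNs reporting ATS/Zero/Clone as unsupported, that would at least give support something concrete to chase.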

 

Thanks in advance!

 

Ray


