Morning,
The hosts are on U1 Build 702118 and yes it is Gen 7 or 8 .. I will have to check that as it has been removed for now - but I am intrigued - what would the fix be because at the moment we are reluctant to put it back into production.
-> We ran into this issue where the blades' NICs would be online but refuse to send traffic. In our case it was a bug in the Emulex code, and after two months Emulex, via HP, was unable to release a driver that fixed it. It just kept happening: we would install a beta driver provided by HP and it would happen again at random times. We solved the issue by replacing the Emulex cards with Broadcom. This does present an issue because the Broadcom cards are unable to do FCoE, so it's a trade-off... I know this: I will never buy Emulex again. Their support teams and VMware and HP are horrible. VMware in my case was very helpful but could not solve the issue.
As for a solution - a week down the line now and we are still in the middle of a blame war, but I THINK I have fixed the issue for now. Oh, and a storage processor reboot is already planned for next week. We still have the odd issue, with standby LUNs missing, some hosts missing some LUNs etc., which is hopefully solved by the reboots.
-> These issues are critical; your storage vendor should have some solutions for these problems. I would 100% reboot ASAP, but if you are missing LUNs then your vendor also needs to root cause that issue. If your array is acting this oddly I would not trust any data on it, and I would reboot the controllers ASAP.
Host in cluster = High latency on LUN X
Host rebuilt and put back in cluster = High latency on LUN X
Put host in maintenance mode = Latency gone.
Take host out of maintenance mode = High latency on LUN X
Remove host from cluster = Latency gone
Remove host from vCenter = Latency gone
-> Odd question... What about a physical (non-VMware) host with LUN X? Given your statement above about missing LUNs etc., your array is in a very bad place.
So it smelt fishy really and if it was the storage, then why is the issue gone in the conditions above?
-> My guess: when the host is out of maintenance mode and in the cluster, HA tries to write to the datastore, and the datastore is really boinked up. When it's not in the cluster there is no HA writing to the datastore, so no metrics... Do this test: take the host out of the cluster and then run a VM workload. I bet the latency is high.
I am still waiting for the answers. Anyway, what is involved in the conditions where the latency is high? The vCenter / HA agent.
So it clearly was a VMware issue after all, and not an issue on the storage itself.
-> I'll be honest, I don't agree. I still think it's an issue with the storage array... Try a VM workload on LUN X without HA enabled and see how it is... If that works OK, then it is HA datastore heartbeats.
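-> To spell out the logic, here is a rough Python sketch, nothing more. It copies your six observations from the list above and checks whether high latency lines up with the states where HA would be touching the datastore. The HA_ACTIVE table is purely my assumption about when HA writes, not something I have confirmed, and the function names are made up for illustration. Note that a perfect match only says latency follows HA activity; it does not say whether HA is the cause or just the only thing generating I/O, which is why the workload test matters.

```python
# Observations copied from the list above: (host state, high latency on LUN X?)
OBSERVATIONS = [
    ("in cluster", True),
    ("rebuilt and back in cluster", True),
    ("maintenance mode", False),
    ("out of maintenance mode", True),
    ("removed from cluster", False),
    ("removed from vCenter", False),
]

# ASSUMPTION (a guess, not confirmed): HA only writes to the datastore
# while the host is an active member of the cluster.
HA_ACTIVE = {
    "in cluster": True,
    "rebuilt and back in cluster": True,
    "maintenance mode": False,
    "out of maintenance mode": True,
    "removed from cluster": False,
    "removed from vCenter": False,
}


def latency_tracks_ha(observations, ha_active):
    """Return True if high latency was seen in exactly the states where HA is active."""
    return all(high == ha_active[state] for state, high in observations)


def interpret_workload_test(latency_high_without_ha):
    """Interpret the proposed test: a VM workload on LUN X with HA out of the picture."""
    if latency_high_without_ha:
        # Latency without any HA involvement points back at the array.
        return "suspect the storage array"
    # A clean run without HA leaves the HA datastore heartbeats as the suspect.
    return "suspect HA datastore heartbeats"


if __name__ == "__main__":
    print(latency_tracks_ha(OBSERVATIONS, HA_ACTIVE))
    print(interpret_workload_test(True))
    print(interpret_workload_test(False))
```

Whichever branch the real workload test lands in tells you which vendor to push first.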
The only option is obviously Business Critical Support with onsite support - but we cannot really afford the $50k / year (or whatever that costs nowadays) - so we are at the mercy of VMware and other vendors involved.
-> Yeah, given your solution is almost all HP, I would look into escalating that agreement before the VMware one.
I agree that vendor fun is a pain in the butt... There is no real answer here; I end up solving it myself most of the time. Sorry I did not respond sooner, but glad I could help. I really would press the storage vendor for root cause on the storage issues... I doubt you will get anything, but at the end of the day you need to know how to keep it from happening again. Also get the Emulex cards gone if you can.
Thanks,
J