I'm having similar issues with a 3PAR StoreServ 7400. In my most recent test, however, only 1 host out of 10 had the problem. All 9 working hosts enumerated the paths as {TPG_id=258,TPG_state=STBY}{TPG_id=257,TPG_state=AO}. As a matter of fact, I have 3 other clusters which all enumerated the TPG_id as 257 and 258 - and they all switched over without issue. The non-working host enumerated the LUNs as {TPG_id=258,TPG_state=STBY}{TPG_id=256,TPG_state=AO} both before and after the failover. After a reboot, the host changed to {TPG_id=258,TPG_state=STBY}{TPG_id=257,TPG_state=AO} and a subsequent test failover succeeded without issue.
[Side note - I did not let the host get to the point where it froze. In prior failures like this, I noticed that by watching vmkernel.log I could see when the switchover did not recover the paths; failing back to the original source array at that point corrected the problem. Like you said, a reboot also corrects the problem, but that defeats the whole point of Metro Storage Clustering, in my opinion.]
My question: is it a coincidence that the working hosts have a TPG_ID of 257 for the Active/Optimized path while the failing host has a TPG_ID of 256? Does the TPG_ID have significance - does it represent something? I really doubt it is random, since I have approximately 21 hosts that enumerate the same way - {TPG_id=258,TPG_state=STBY}{TPG_id=257,TPG_state=AO} - and one broken one that enumerates with a different ID.
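For anyone who wants to run the same comparison across a larger number of hosts, here is a minimal Python sketch of how I would flag the odd one out. It assumes you have already collected one TPG string per host in the format quoted above; the host names and the helper functions (parse_tpg, active_optimized_id, find_outliers) are hypothetical and only for illustration, not part of any VMware or HP tool.

```python
import re
from collections import Counter

# Matches one group such as {TPG_id=257,TPG_state=AO}
TPG_PATTERN = re.compile(r"\{TPG_id=(\d+),TPG_state=(\w+)\}")


def parse_tpg(config):
    """Return a list of (tpg_id, tpg_state) tuples found in the string."""
    return [(int(tpg_id), state) for tpg_id, state in TPG_PATTERN.findall(config)]


def active_optimized_id(config):
    """Return the TPG_id of the first AO group, or None if there is none."""
    for tpg_id, state in parse_tpg(config):
        if state == "AO":
            return tpg_id
    return None


def find_outliers(host_configs):
    """Return {host: ao_tpg_id} for hosts whose AO TPG_id differs from the majority."""
    if not host_configs:
        return {}
    ao_ids = {host: active_optimized_id(cfg) for host, cfg in host_configs.items()}
    majority_id, _ = Counter(ao_ids.values()).most_common(1)[0]
    return {host: tpg_id for host, tpg_id in ao_ids.items() if tpg_id != majority_id}


if __name__ == "__main__":
    # Hypothetical host names; the TPG strings are the two variants seen in this thread.
    hosts = {
        "esx-01": "{TPG_id=258,TPG_state=STBY}{TPG_id=257,TPG_state=AO}",
        "esx-02": "{TPG_id=258,TPG_state=STBY}{TPG_id=257,TPG_state=AO}",
        "esx-10": "{TPG_id=258,TPG_state=STBY}{TPG_id=256,TPG_state=AO}",
    }
    print(find_outliers(hosts))
```

This only tells you which hosts deviate from the majority; it says nothing about why the ID differs or whether the difference is actually the cause of the failed switchover.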
Any help from vSphere storage gurus is appreciated! HP Support is working very hard with me to resolve this, but I thought I would throw this specific TPG_ID question out to the VMware community.