Hi Freddy
Your question would really need a very long blog article as an answer. Please forgive me if I answer in several posts - just as time allows.
For the record: due to my job I believe I see more problems with VMFS than most users who just run their own vSphere environment, so of course I have a biased view.
From my point of view it would be tempting to make a statement like:
expected survival rate after one year:
Windows 2008 system installed on certified hardware: 98%
same system on thick provisioned eagerzeroed vmdk running on ESXi: 97%
same system on thin provisioned vmdk running on ESXi: 60%
same as before plus automatic backup by Veeam or similar: 50%
I don't think those numbers are completely off, but I don't have statistics to back up such a claim.
So I would rather ask these questions:
- what happens when problems occur?
- how well are power failures or similar problems handled?
- how well does the system check for errors, and how well can it repair them itself?
- which problems can a user handle himself?
- which problems can be solved by VMware support?
- is there any documentation for troubleshooting?
- are there 3rd-party tools that can be used if a problem occurs?
- how severe do the problems have to be to result in a complete loss?
- does the filesystem itself offer any repair or self-healing features?
I have been doing remote support for these problems since about 2007 - the last 4 years as a consultant for a VMware partner.
The experience I gathered in that time can be summarized like this:
- the smallest error in a thin vmdk's mapping table renders the vmdk unreadable
- the smallest error in a snapshot's graintable renders the snapshot unreadable (see the grain table sketch after this list)
- loss of the partition table of a VMFS volume has to be expected after a power failure (see the partedUtil sketch after this list)
- for small and medium VMware customers, calling VMware support for help with damaged thin vmdks, snapshots or VMFS volumes is usually not worth the effort
- VMFS seems to keep no redundant metadata it could use to fix small problems after a reboot
- the heartbeat functions that coordinate cluster access cannot be reset by the user - that means ESXi often refuses to use/read a volume even when the reason for the lock no longer exists
- 99% of the vSphere admins I talk to in my job do not have the skills required to fix even the smallest problems with thin vmdks or snapshots
- most of the admins I talk to somehow expect VMFS to behave like NTFS - most of them are shocked when I tell them that there is no equivalent of chkdsk
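To make the graintable point less abstract: a sparse vmdk does not store guest data linearly - every read goes through a grain directory and then a grain table first. Here is a minimal sketch in Python, based on the publicly documented hosted sparse extent format from VMware's Virtual Disk Format specification; ESXi's vmfsSparse/seSparse delta files differ in the details, but the indirection principle is the same. The file name is a placeholder.

```python
import struct

SECTOR = 512

def read_sparse_header(f):
    # SparseExtentHeader: 512 bytes at the start of the extent,
    # little-endian, per the public Virtual Disk Format spec
    fmt = "<IIIQQQQIQQQB4cH"
    f.seek(0)
    fields = struct.unpack(fmt, f.read(struct.calcsize(fmt)))
    hdr = dict(zip(
        ("magic", "version", "flags", "capacity", "grainSize",
         "descriptorOffset", "descriptorSize", "numGTEsPerGT",
         "rgdOffset", "gdOffset", "overHead", "uncleanShutdown"),
        fields[:12]))
    if hdr["magic"] != 0x564D444B:           # file starts with 'KDMV'
        raise ValueError("not a sparse vmdk extent")
    return hdr

def guest_sector_to_file_offset(f, hdr, lba):
    """Translate a guest LBA into a byte offset inside the extent."""
    grain     = lba // hdr["grainSize"]
    gde_index = grain // hdr["numGTEsPerGT"]
    gte_index = grain % hdr["numGTEsPerGT"]
    # first hop: one 32-bit entry in the grain directory ...
    f.seek(hdr["gdOffset"] * SECTOR + gde_index * 4)
    gt_sector = struct.unpack("<I", f.read(4))[0]
    if gt_sector == 0:
        return None                          # grain table not allocated
    # ... second hop: one 32-bit entry in the grain table
    f.seek(gt_sector * SECTOR + gte_index * 4)
    grain_sector = struct.unpack("<I", f.read(4))[0]
    if grain_sector == 0:
        return None                          # grain never written
    return (grain_sector + lba % hdr["grainSize"]) * SECTOR

with open("disk-delta.vmdk", "rb") as f:     # placeholder file name
    hdr = read_sparse_header(f)
    print(guest_sector_to_file_offset(f, hdr, 123456))
```

Every guest sector depends on two tiny table entries. A single damaged 32-bit entry in a grain table silently strands one grain of guest data, a damaged grain directory entry strands a whole grain table's worth at once - and nothing in the format lets you rebuild the tables from the grains themselves.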
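The partition table case, on the other hand, is one of the few problems a user can fix himself, because losing the partition table does not necessarily touch the VMFS metadata behind it. This sketch only composes the usual repair command: the device name and disk size are placeholder values, the start sector 2048 is the typical ESXi 5.x layout, and the type GUID is the well-known GPT GUID for VMFS. Always verify the real numbers with "partedUtil getUsableSectors" before writing anything.

```python
# GPT type GUID that marks a partition as VMFS
VMFS_GUID = "AA31E02A400F11DB9590000C2911D1B8"

def vmfs_setptbl_cmd(disk, total_sectors, start=2048):
    # GPT keeps a 33-sector backup at the end of the disk, so the
    # last usable sector is total_sectors - 34 - but double-check it
    # with: partedUtil getUsableSectors <disk>
    end = total_sectors - 34
    return ('partedUtil setptbl "%s" gpt "1 %d %d %s 0"'
            % (disk, start, end, VMFS_GUID))

# placeholder device name and size - substitute your own values
print(vmfs_setptbl_cmd("/vmfs/devices/disks/naa.600508b1001c0a1f", 3907029168))
```

Run the printed command on the ESXi shell. It only rewrites the partition table - if the VMFS metadata itself is damaged, this will not help.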
So IMHO these aspects all sum up to:
- thin provisioned vmdks die without early warning
- the chances that a user can fix an error himself are almost non-existent
- trying repairs is a waste of time in most cases
- if mission-critical data has to be recovered within a predictable time frame, you have to expect an invoice from Kroll Ontrack starting at $5000 or more.
If my customers ask for a recommendation on thin vs. thick provisioning, I think there is only one safe answer:
To be on the safe side, thin provisioning should only be used when either:
- the VM is disposable - like a View-VM
- a solid and tested backup or replacement policy is in place, so that the loss of a thin VM just becomes a calculated loss of a few hours' worth of data
For thick provisioned VMs the story is very different.
A skilled admin can acquire the knowledge required to fix all problems caused by the vmdk layer and the VMFS filesystem.
So for a skilled admin, thick VMs behave almost the same as a Windows system running on physical hardware.
have to interrupt now - to be continued later ...
Ulli