Often when I’m speaking with customers regarding Microsoft Azure Blob Storage they are concerned about reliability. I reassure them by explaining the four replication options they have for redundancy. These are:
Locally Redundant Storage (LRS):
All data in the storage account is made highly durable, reliable and available within the datacenter by replicating transactions synchronously to three different storage nodes within a single storage cluster (these clusters are 20 racks within the same physical building). Each node is in a different fault domain and upgrade domain, so the storage account can recover from the failures that occur regularly in normal environments, such as a disk, node, or rack failure, without its availability being affected. All storage writes are performed synchronously across these three replicas in the three separate fault domains before success is returned to the client. Note that there is also a premium storage LRS option, which means the storage is based on SSDs.
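To make the "success only after all three replicas have the data" rule concrete, here is a minimal in-memory sketch. It is purely illustrative and is not how Azure implements it: `StorageNode` and `replicated_write` are hypothetical names, and a real system would also replace the failed replica rather than simply report failure.

```python
from dataclasses import dataclass, field


@dataclass
class StorageNode:
    """Hypothetical in-memory stand-in for one storage node."""
    fault_domain: int
    healthy: bool = True
    blobs: dict = field(default_factory=dict)

    def write(self, name: str, data: bytes) -> bool:
        # An unhealthy node rejects the write.
        if not self.healthy:
            return False
        self.blobs[name] = data
        return True


def replicated_write(nodes, name, data):
    """Return True only if every replica has stored the data."""
    # Each replica must sit in a different fault domain.
    if len({node.fault_domain for node in nodes}) != len(nodes):
        raise ValueError("replicas must be in distinct fault domains")
    # Synchronous: every node is written before the client gets an answer.
    results = [node.write(name, data) for node in nodes]
    return all(results)


nodes = [StorageNode(fault_domain=i) for i in range(3)]
print(replicated_write(nodes, "log.txt", b"hello"))
```

The client only sees success once all three writes have returned, which is why a single disk, node, or rack failure does not lose acknowledged data.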
Zone-Redundant Storage (ZRS):
Zone-redundant storage (ZRS) stores three replicas of your data across two to three facilities, depending on your region and how many datacenters that region has. It is designed to keep all three replicas within a single region (but may, on occasion, span two regions), giving the storage higher durability than LRS, which replicates data only within a single facility or datacenter. When the facilities are all within one region, data is replicated synchronously. When multiple regions are needed, three copies are stored synchronously within the region and data is replicated asynchronously to facilities in the other regions. If your storage account has ZRS enabled, your data is durable even if one of the facilities fails. Note that ZRS is currently available only for block blobs and not for any other kind of storage.
Geo-Redundant Storage (GRS):
As with LRS, transactions are replicated synchronously to three different storage nodes within the primary region chosen when the storage account was created (there are multiple regions throughout the world, each made up of multiple datacenters, with additional datacenters continually being provisioned). However, transactions are also queued for asynchronous replication to a secondary region that is geographically distant; Microsoft aims for a distance of 300 miles or more from the primary (geo-replication). In this secondary region the data is again made durable by replicating it to three more storage nodes, so six copies of the data exist. In the case of an entire regional outage, or a disaster where the primary location is unrecoverable, the data is therefore still available.
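The difference between the synchronous and asynchronous halves of GRS can be sketched with a simple queue. This is a toy model under my own assumptions, not Azure's implementation: `GeoReplicatedAccount`, `replicate_pending` and `copies` are hypothetical names, and each replica is just a Python dictionary.

```python
from collections import deque


class GeoReplicatedAccount:
    """Toy model: three primary replicas, three secondary replicas."""

    def __init__(self, replicas_per_region=3):
        self.primary = [dict() for _ in range(replicas_per_region)]
        self.secondary = [dict() for _ in range(replicas_per_region)]
        self.pending = deque()  # transactions waiting for geo-replication

    def write(self, name, data):
        # Synchronous part: all primary replicas are written first.
        for replica in self.primary:
            replica[name] = data
        # Asynchronous part: the transaction is only queued here.
        self.pending.append((name, data))
        return True

    def replicate_pending(self):
        """Drain the queue into the secondary region; return how many were sent."""
        sent = 0
        while self.pending:
            name, data = self.pending.popleft()
            for replica in self.secondary:
                replica[name] = data
            sent += 1
        return sent

    def copies(self, name):
        """Count how many replicas currently hold the blob."""
        return sum(name in replica for replica in self.primary + self.secondary)


account = GeoReplicatedAccount()
account.write("report.csv", b"data")
print(account.copies("report.csv"))  # only the primary replicas so far
account.replicate_pending()
print(account.copies("report.csv"))  # primary and secondary replicas
```

Anything still sitting in the queue when the primary is lost is exactly the data that has not yet reached the secondary region.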
Read-Access Geo-Redundant Storage (RA-GRS):
This is the default redundancy option when a storage account is first created. For a GRS account, you can turn on read-only access to the storage account's data in the secondary location/region. Since replication to the secondary region is asynchronous, this gives you an eventually consistent version of the data to read from (if the primary goes offline, anything between the last 5 and 15 minutes of written data may not yet be available on the secondary, so be aware of this from a recovery point objective (RPO) perspective). When read-only access is turned on in the secondary region, you get a secondary endpoint in addition to the primary endpoint for accessing your storage account, so this is not simply a failover scenario where, for want of a better term, you just refresh the page. The secondary endpoint is the same as the primary endpoint except that the storage DNS name has a suffix of "-secondary". For example, if the primary endpoint is storageaccountname.blob.core.windows.net, the secondary endpoint is storageaccountname-secondary.blob.core.windows.net.
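The naming rule for the secondary endpoint is easy to express in code. The helper below is a small sketch; `secondary_endpoint` is a name I have made up for illustration, and it only manipulates the host name string without contacting Azure.

```python
def secondary_endpoint(primary_host: str) -> str:
    """Derive the RA-GRS secondary host name from the primary host name."""
    # Split off the storage account name (the first DNS label).
    account, dot, rest = primary_host.partition(".")
    if not account or not dot or not rest:
        raise ValueError(f"not a storage endpoint host name: {primary_host!r}")
    # Append the "-secondary" suffix to the account name only.
    return f"{account}-secondary.{rest}"


print(secondary_endpoint("storageaccountname.blob.core.windows.net"))
# prints "storageaccountname-secondary.blob.core.windows.net"
```

An application that wants to keep reading during a primary outage would have to send its read requests to this second host name itself.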
To start with, locally redundant storage is reliable and durable because Microsoft stores cyclic redundancy checks (CRCs) of the data to ensure its correctness, and periodically reads and validates the CRCs to detect any bit rot (random errors that occur on disk media over time). Should a CRC check fail, the data is recovered automatically. Since each VM disk is a blob in Azure storage, the same applies to disks: should the CRC check fail on a disk, it is automatically commissioned or decommissioned.
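The idea of storing a CRC, re-checking it periodically and repairing from a good copy can be sketched as follows. This is my own simplified illustration using Python's `zlib.crc32`, not a description of Azure's internals; `store`, `is_intact` and `scrub` are hypothetical helper names.

```python
import zlib


def store(data: bytes) -> dict:
    """Keep the data together with the CRC computed at write time."""
    return {"data": bytearray(data), "crc": zlib.crc32(data)}


def is_intact(replica: dict) -> bool:
    """Recompute the CRC and compare it with the stored one."""
    return zlib.crc32(bytes(replica["data"])) == replica["crc"]


def scrub(replicas: list) -> int:
    """Repair corrupted replicas from an intact one; return how many were repaired."""
    good = next((r for r in replicas if is_intact(r)), None)
    if good is None:
        raise RuntimeError("no intact replica left to recover from")
    repaired = 0
    for replica in replicas:
        if not is_intact(replica):
            replica["data"] = bytearray(good["data"])
            replica["crc"] = good["crc"]
            repaired += 1
    return repaired


replicas = [store(b"important data") for _ in range(3)]
replicas[1]["data"][0] ^= 0x01  # simulate bit rot: flip one bit
print(is_intact(replicas[1]))   # False
print(scrub(replicas))          # 1
```

Because there are three replicas, a single corrupted copy can always be rebuilt from one of the other two.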
Customers sometimes choose LRS for data that doesn't require the additional durability of GRS, so that they benefit from its lower price compared to GRS. Typically this is non-critical or temporary data, such as log files, or data that can be recreated if it is lost, such as media or data that is also stored elsewhere in Azure storage. In addition, for organisations with geographical restrictions on where their data is stored, choosing Locally Redundant Storage ensures that the data is only stored in the location chosen for that storage account.
In the unlikely event of a major disaster that affects the primary storage location (in the context of remote storage, that is GRS and RA-GRS), Microsoft will first attempt to restore the primary location. Restoring the primary is given precedence because failing over to the secondary may lose recent changes, since replication is asynchronous, and not all apps prefer failing over if availability of the primary can be restored. Obviously this depends on the nature of the actual disaster and the impact it has. On some very rare occasions (and I mean rare), Microsoft may not be able to restore the primary location and would therefore need to perform a geo-failover.
If this should ever happen, affected customers would be notified via their subscription contact information. As part of the failover, the customer's storageaccountname.blob.core.windows.net DNS entry would be updated to point to the secondary location. Once this DNS change has been made and has propagated, the existing Blob URIs will work again. This means that you do not need to change anything: all existing URIs work the same before and after a geo-failover.
Once the failover is complete, the secondary location is considered the new primary location for the storage account. This location remains the primary unless another geo-failover occurs (that would be a major, major disaster). Once the new primary is up and functioning, Microsoft bootstraps a new secondary to make the data geo-redundant once again.
I would suggest taking a look at the SLA website https://azure.microsoft.com/en-gb/support/legal/sla/storage/v1_1/ to check that this meets your organisational requirements.