Keeping Your Apps and Data Available With HyperFlex



The Cisco HyperFlex Data Platform (HXDP) is a distributed hyperconverged infrastructure system that has been built from inception to handle individual component failures across the spectrum of hardware elements without interruption in services.  As a result, the system is highly available and capable of extensive failure handling.  In this short discussion, we'll define the types of failures, briefly explain why distributed systems are the preferred system model to handle them, describe how data redundancy affects availability, and cover what is involved in an online data rebuild in the event of the loss of data components.

It is important to note that HX comes in 4 distinct varieties.  They are Standard Data Center, Data Center No-Fabric Interconnect (DC No-FI), Stretched Cluster, and Edge clusters.  Here are the key differences:

Standard DC

  • Has Fabric Interconnects (FI)
  • Can be scaled to very large systems
  • Designed for infrastructure and VDI in enterprise environments and data centers

DC No-FI

  • Similar to standard DC HX but without FIs
  • Has scale limits
  • Reduced configuration demands
  • Designed for infrastructure and VDI in enterprise environments and data centers

Edge Cluster

  • Used in ROBO deployments
  • Comes in various node counts from 2 nodes to 8 nodes
  • Designed for smaller environments where keeping the applications or infrastructure close to the users is needed
  • No Fabric Interconnects; redundant switches are used instead

Stretched Cluster

  • Has 2 sets of FIs
  • Used for highly available DR/BC deployments with geographically synchronous redundancy
  • Deployed for both infrastructure and application VMs with extremely low outage tolerance

The HX node itself is composed of the software components required to create the storage infrastructure for the system's hypervisor.  This is done via the HX Data Platform (HXDP) that is deployed at installation on the node.  The HX Data Platform uses PCI pass-through, which removes storage (hardware) operations from the hypervisor, making the system highly performant.  The HX nodes use special plug-ins for VMware called VIBs that are used for redirection of NFS datastore traffic to the correct distributed resource, and for hardware offload of complex operations like snapshots and cloning.

A typical HX node architecture

These nodes are incorporated into a distributed ZooKeeper-based cluster as shown below. ZooKeeper is essentially a centralized service that offers distributed systems a hierarchical key-value store. It is used to provide a distributed configuration service, synchronization service, and naming registry for large distributed systems.
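To make the idea of a hierarchical key-value store concrete, here is a minimal in-memory sketch in Python. It is not ZooKeeper's API and not how HXDP uses ZooKeeper; the class name `HierarchicalStore`, its methods, and the example paths are all hypothetical and exist only to illustrate keys arranged as slash-separated paths with parents and children.

```python
class HierarchicalStore:
    """Toy in-memory hierarchical key-value store (illustration only)."""

    def __init__(self):
        # Maps an absolute path such as "/cluster/nodes/node-1" to its value.
        self._data = {"/": None}

    @staticmethod
    def _parent(path):
        # "/cluster/nodes" -> "/cluster"; "/cluster" -> "/"
        head = path.rsplit("/", 1)[0]
        return head or "/"

    def create(self, path, value=None):
        # A key can only be created under an existing parent key.
        if path in self._data:
            raise KeyError(f"{path} already exists")
        if self._parent(path) not in self._data:
            raise KeyError(f"parent of {path} does not exist")
        self._data[path] = value

    def get(self, path):
        return self._data[path]

    def children(self, path):
        # Names of the keys stored directly under `path`, sorted.
        return sorted(
            p.rsplit("/", 1)[1]
            for p in self._data
            if p != "/" and self._parent(p) == path
        )


# Hypothetical paths, chosen only for the example.
store = HierarchicalStore()
store.create("/cluster")
store.create("/cluster/nodes")
store.create("/cluster/nodes/node-1", "online")
store.create("/cluster/nodes/node-2", "online")
print(store.children("/cluster/nodes"))
```

A naming registry or configuration service can be layered on a structure like this by agreeing on where in the tree each kind of information lives.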

A distributed ZooKeeper-based cluster

To begin, let's look at all the possible types of failures that can happen and what they mean for availability.  Then we can discuss how HX handles these failures.

  • Node loss. There are various reasons why a node may go down, such as a motherboard or rack power failure.
  • Disk loss. Data drives and cache drives.
  • Loss of network interface (NIC) cards or ports. Multi-port VIC and support for add-on NICs.
  • Fabric Interconnect (FI). Not all HX systems have FIs.
  • Power supply
  • Upstream connectivity interruption

Node Network Connectivity (NIC) Failure

Each node is redundantly connected to either the FI pair or the switch pair, depending on which deployment architecture you have chosen.  The virtual NICs (vNICs) on the VIC in each node are in an active-standby mode and split between the two FIs or upstream switches.  The physical ports on the VIC are spread between each upstream device as well, and you may have additional VICs for extra redundancy if needed.

Fabric Interconnect (FI), Power Supply, and Upstream Connectivity

Let's follow up with a simple resiliency solution before examining node and disk failures.  A traditional Cisco HyperFlex single-cluster deployment consists of HX-Series nodes in Cisco UCS connected to each other and the upstream switch through a pair of fabric interconnects. A fabric interconnect pair may include one or more clusters.

In this scenario, the fabric interconnects are in a redundant active-passive primary pair.  In the event of an FI failure, the partner will take over.  This is the same for upstream switch pairs whether they are directly connected to the VICs or through the FIs as shown above.  Power supplies, of course, are in redundant pairs in the system chassis.
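The active-passive behavior described here can be sketched in a few lines of Python. This is only a conceptual model, not Cisco UCS code: the `Uplink` type, the `select_active` function, and the names "FI-A" and "FI-B" are hypothetical. It shows the rule that traffic stays on the primary while it is healthy and moves to the partner when it is not.

```python
from dataclasses import dataclass


@dataclass
class Uplink:
    """One member of a redundant pair (an FI, switch, or vNIC path)."""
    name: str
    healthy: bool = True


def select_active(primary, standby):
    """Return the member that should carry traffic, or None if both are down."""
    if primary.healthy:
        return primary
    if standby.healthy:
        return standby  # failover: the partner takes over
    return None


fi_a = Uplink("FI-A")
fi_b = Uplink("FI-B")
print(select_active(fi_a, fi_b).name)  # FI-A

fi_a.healthy = False  # simulate a failure of the primary
print(select_active(fi_a, fi_b).name)  # FI-B
```

Service is lost only when both members of the pair are down at the same time, which is why each layer (vNICs, FIs or switches, power supplies) is deployed in pairs.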

Cluster State with Number of Failed Nodes and Disks

How the number of node failures affects the storage cluster depends on:

  • Number of nodes in the cluster: Due to the nature of ZooKeeper, the response by the storage cluster is different for clusters with 3 to 4 nodes and clusters with 5 or more nodes.
  • Data Replication Factor: Set during HX Data Platform installation and cannot be changed. The options are 2 or 3 redundant replicas of your data across the storage cluster.
  • Access Policy: Can be changed from the default setting after the storage cluster is created. The options are strict, for protecting against data loss, or lenient, to support longer storage cluster availability.
  • The type
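One way to see why the node count matters is the generic majority rule that ZooKeeper ensembles follow: the ensemble stays available only while a strict majority of its members is up. The sketch below is just that arithmetic. How HX maps storage nodes onto ensemble members is not described in this post, so treating the member count as equal to the node count is an assumption made for illustration, and the function names are hypothetical.

```python
def quorum_size(members):
    """Smallest strict majority of an ensemble with `members` members."""
    return members // 2 + 1


def tolerated_failures(members):
    """How many members can be lost while a majority still remains."""
    return members - quorum_size(members)


for members in (3, 4, 5):
    print(members, quorum_size(members), tolerated_failures(members))
# 3 2 1
# 4 3 1
# 5 3 2
```

Under this rule, 3 or 4 members tolerate a single loss while 5 members tolerate two, which lines up with the 3 to 4 node and 5 or more node split above.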

The table below shows how the storage cluster functionality changes with the listed number of simultaneous node failures in a cluster with 5 or more nodes running HX 4.5(x) or greater.  The case with 3 or 4 nodes has special considerations; you can check the admin guide for this information or talk to your Cisco representative.

The same table can be used with the number of nodes that have one or more failed disks.  When using the table for disks, note that the node itself has not failed but disk(s) within the node have failed. For example: 2 indicates that there are 2 nodes that each have at least one failed disk.

There are two possible types of disks on the servers: SSDs and HDDs. When we talk about multiple disk failures in the table below, we are referring to the disks used for storage capacity. For example: If a cache SSD fails on one node and a capacity SSD or HDD fails on another node, the storage cluster remains highly available, even with an Access Policy strict setting.

The table below lists the worst-case scenario with the listed number of failed disks. This applies to any storage cluster of 3 or more nodes. For example: A 3-node cluster with Replication Factor 3, while self-healing is in progress, only shuts down if there is a total of 3 simultaneous disk failures on 3 separate nodes.
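The example above can be written as a small worst-case check. This is a sketch that generalizes from the single Replication Factor 3 example in the text: applying the same threshold to Replication Factor 2 is an assumption, the function name is hypothetical, and the intermediate read-only or degraded states governed by the Access Policy are deliberately left out. The admin guide remains the authority for the full table.

```python
def worst_case_shutdown(replication_factor, nodes_with_failed_disks):
    """True if, in the worst case, every replica of some data is affected.

    Assumes each replica of a piece of data lives on a different node, so
    data is unreachable only when as many separate nodes have failed
    capacity disks as there are replicas.
    """
    if replication_factor not in (2, 3):
        raise ValueError("replication factor must be 2 or 3")
    return nodes_with_failed_disks >= replication_factor


print(worst_case_shutdown(3, 2))  # False
print(worst_case_shutdown(3, 3))  # True
```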

3+ Node Cluster with Number of Nodes with Failed Disks

A storage cluster healing timeout is the length of time the cluster waits before automatically healing. If a disk fails, the healing timeout is 1 minute. If a node fails, the healing timeout is 2 hours. A node failure timeout takes priority if a disk and a node fail at the same time, or if a disk fails after a node failure but before the healing is finished.
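The timeout rule can be summarized in a short Python sketch. The two durations and the priority rule come from the paragraph above; the function and constant names are hypothetical and this is not HXDP code.

```python
from datetime import timedelta

DISK_HEALING_TIMEOUT = timedelta(minutes=1)
NODE_HEALING_TIMEOUT = timedelta(hours=2)


def healing_timeout(disk_failed, node_failed):
    """Return how long the cluster waits before healing, or None if healthy.

    A node failure takes priority over a disk failure when both are present.
    """
    if node_failed:
        return NODE_HEALING_TIMEOUT
    if disk_failed:
        return DISK_HEALING_TIMEOUT
    return None


print(healing_timeout(disk_failed=True, node_failed=False))  # 0:01:00
print(healing_timeout(disk_failed=True, node_failed=True))   # 2:00:00
```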

If you have deployed an HX Stretched Cluster, the effective replication factor is 4, since each of the two geographically separated locations has a local RF of 2 for site resilience.  The tolerated failure scenarios for a Stretched Cluster are out of scope for this blog, but all the details are covered in my white paper here.

In Conclusion

Cisco HyperFlex systems contain all the redundant features one might expect, like failover components.  However, they also contain replication factors for the data, as explained above, that offer redundancy and resilience for multiple node and disk failures.  These are requirements for properly designed enterprise deployments, and all factors are addressed by HX.