Unstructured data continues to see rapid growth coupled with evolving use cases and access patterns in AHEAD’s customer base. With the popularity of interfaces such as object storage, unstructured storage systems are diversifying to cover new protocols to maintain market share. The downside is that diversification adds complexity and often forces notable compromises in existing designs.
Clients are making decisions today for both new and existing applications to manage the explosive growth of data. That makes a holistic understanding of the desired outcome critical to success.
As part of the design process, the following are important for both file and object storage:
- use cases
- available platforms
- what can be leveraged in the data center
- where public cloud can be utilized
During discovery, the desired use cases, the protocols required for storage, and other design requirements must be gathered and considered as part of any larger cloud strategy.
Stepping into the technical details, today’s blog covers use case notes, typical design patterns for systems on the market today, and how to marry the two.
Use Case Notes
The traditional NAS use cases of home drives and department shares are still there but are finding their way to cloud-based service offerings like Office 365 OneDrive, or other enterprise file sync & share offerings. However, user-generated content isn’t the primary driver of growth. Applications and machine-generated data have taken this role, creating new and interesting problems for data storage that need to be solved.
I have two customer application use cases for unstructured data whose current implementations are running into issues with scale and cost.
The first is an application that stores documents as BLOBs in an Oracle database running on an Exadata system. While the response times for blob retrieval are very good (~10ms), the high growth rate is making database management difficult and driving rapid consumption of expensive storage capacity.
The second is an application that stores image files on NFS shares on a petabyte-size Dell EMC Isilon system. The files are small, reducing storage efficiency due to Isilon’s data protection design. There are also enough files that the application bumps up against recommended filesystem limits, e.g. maximum number of files in a directory.
The development team is interested in changing the application to utilize object storage, but the current NFS access will have to be preserved for as long as it takes to migrate nearly a petabyte of data to objects.
Both of these use cases cut an interesting path through available products on the market, so let’s cover some design concepts and patterns.
Every system has an origin myth: what was it created to do in the first place?
NetApp’s original FAS product was an NFS file server with, among other features, a groundbreaking redirect-on-write snapshot function that the industry has now adapted for use in most competing products.
Isilon’s original concept was a single distributed filesystem with cache coherency across many tens of nodes that applied data protection to the data itself rather than to disks in legacy RAID groups.
Both of these examples do much more today than their respective initial designs could do, but there are inherent trade-offs that can’t be totally overcome.
With this in mind, here are the major design patterns with product examples and trade-offs.
The dual-controller pattern is the tried-and-true design where a filesystem’s state is managed by one, and only one, controller. Writes are mirrored to persistent storage on a second controller so that they survive a failure of the primary.
There are many direct and hybrid examples of this design pattern. Dell EMC Unity comes to mind as a direct example. NetApp’s ONTAP 9 is a good hybrid example. Its front end is fully virtualized, allowing a filesystem to be presented on network interfaces on up to 24 controllers. However, a filesystem’s (FlexVol’s) state is actively managed on only one of those controllers. For example, writes received on a non-owning controller get forwarded over the cluster interconnect to the owning controller.
Low-latency performance benefits from the state being managed on a single controller, but bandwidth and throughput, with respect to a single filesystem/volume, are capped by the amount of resources available in a single controller. Note that this limit is always increasing given Intel core density and other factors.
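To make the ownership idea concrete, here is a minimal Python sketch of a cluster where any controller can receive a write but only the owning controller applies it. The class names, the `forwarded_in` counter, and the volume and controller names are all hypothetical and chosen for illustration; this is not how ONTAP or any other product is actually implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    """Hypothetical controller that owns the state of some volumes."""
    name: str
    volumes: dict = field(default_factory=dict)  # volume -> {path: data}
    forwarded_in: int = 0  # writes that arrived via another controller

    def apply_write(self, volume, path, data):
        # Only the owning controller ever mutates a volume's state.
        self.volumes.setdefault(volume, {})[path] = data


class Cluster:
    """Hypothetical cluster: any controller receives writes, one owns each volume."""

    def __init__(self, controllers, ownership):
        self.controllers = {c.name: c for c in controllers}
        self.ownership = ownership  # volume -> name of the owning controller

    def write(self, receiving, volume, path, data):
        owner = self.controllers[self.ownership[volume]]
        if owner.name != receiving:
            # Received on a non-owning controller: forward to the owner.
            owner.forwarded_in += 1
        owner.apply_write(volume, path, data)
        return owner.name


cluster = Cluster([Controller("ctrl-a"), Controller("ctrl-b")], {"vol1": "ctrl-a"})
owner_name = cluster.write("ctrl-b", "vol1", "/docs/report.txt", b"hello")
print(owner_name, cluster.controllers["ctrl-a"].forwarded_in)
```

Every write to `vol1` ends up on `ctrl-a` no matter where it arrives, which is why a single volume cannot use more resources than its owning controller has.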
The distributed design pattern spreads a single filesystem across many nodes, in contrast to the dual-controller approach. A data distribution algorithm is used to evenly spread data across all nodes in the cluster, e.g., using a hash algorithm bound by the size of the physical cluster. Dell EMC Isilon is a long-standing example of this approach that runs, in its latest iteration, on a dense, purpose-built Intel-based hardware platform. Pure Storage’s FlashBlade is a new example that, in an age of x86-based commodity hardware, represents both an interesting software and hardware design. But, then again, I’m an FPGA fan.
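As a rough illustration of hash-based distribution, the Python sketch below maps each file path to a node index by hashing the path and taking the result modulo the node count. This is the simplest possible scheme and is an assumption made for illustration; the function name `place`, the sample paths, and the cluster sizes are hypothetical, and real products use their own, more sophisticated placement algorithms.

```python
import hashlib
from collections import Counter


def place(key: str, node_count: int) -> int:
    """Map a key to a node index using a hash bound by the cluster size."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


# Hypothetical workload: ten thousand small image files.
keys = [f"/images/{i:08d}.jpg" for i in range(10_000)]

# Count how many files land on each node of an 8-node cluster.
before = Counter(place(k, 8) for k in keys)
print(sorted(before.items()))

# Count how many files would change nodes if a ninth node were added.
moved = sum(place(k, 8) != place(k, 9) for k in keys)
print(f"{moved} of {len(keys)} keys move when growing from 8 to 9 nodes")
```

The spread across nodes comes out close to even, but plain modulo placement relocates most keys when the cluster grows. That is one reason production systems tend to use placement schemes that limit data movement on expansion.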
The challenge in distributed designs is providing deterministic latency performance as the cluster grows. Scaling metadata/control path performance is just as critical as the data itself. I am currently tracking and/or working with several startups that I believe are making notable contributions to breaking down the metadata performance boundaries of distributed designs at scale.
Now, here is a rhetorical question: what causes a filesystem to break down for certain use cases?
Filesystems were originally developed for end users to organize their files in a manner similar to a file cabinet. This hierarchical directory and file structure, along with other metadata bits like simple UNIX permissions or SMB ACLs, adds a lot of overhead and scale limits, e.g., maximum files in a directory, maximum files in a filesystem, etc. There is minimal value in this hierarchy and its associated limits for applications.
Object storage strips away this hierarchy in favor of a massive, flat namespace for unstructured content. Each object has a global address that uniquely identifies it and its physical location. Metadata is typically included with the data, all being available via API call to applications.
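The flat-namespace idea can be sketched in a few lines of Python. The `FlatObjectStore` class, its method names, and the metadata fields below are hypothetical and exist only to show the shape of the model: no directories, one unique ID per object, and metadata stored alongside the data. It does not model physical placement and is not the API of any real object store.

```python
import uuid


class FlatObjectStore:
    """Hypothetical in-memory object store with a single flat namespace."""

    def __init__(self):
        self._objects = {}  # object ID -> (data, metadata)

    def put(self, data: bytes, **metadata) -> str:
        # Every object gets a unique ID; there is no directory hierarchy.
        object_id = uuid.uuid4().hex
        self._objects[object_id] = (data, dict(metadata))
        return object_id

    def get(self, object_id: str):
        # Data and metadata come back together from one call.
        return self._objects[object_id]

    def find(self, **criteria):
        # Locate objects by metadata instead of walking a directory tree.
        return [oid for oid, (_, meta) in self._objects.items()
                if all(meta.get(k) == v for k, v in criteria.items())]


store = FlatObjectStore()
scan_id = store.put(b"...jpeg bytes...", content_type="image/jpeg", customer="1234")
store.put(b"...pdf bytes...", content_type="application/pdf", customer="1234")

data, meta = store.get(scan_id)
print(meta["content_type"], store.find(content_type="image/jpeg") == [scan_id])
```

Because applications address objects by ID and query by metadata, limits such as maximum files per directory never enter the picture.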
For all of you CAP theorem aficionados, object storage relaxes consistency in favor of availability and partition tolerance at scale, e.g. an object update will take some time to propagate through a large system.
How large? I found an Amazon S3 blog post from April 2013 stating that they, at that time, had two trillion objects under management with 1.1 million requests per second. That was four years ago.
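Returning to the consistency trade-off, the Python sketch below shows what "an update takes some time to propagate" looks like from an application's point of view. It is a toy model under stated assumptions: the `EventuallyConsistentStore` class, its two-copy layout, and the explicit `replicate` step are all hypothetical, and real systems replicate in the background with their own consistency guarantees.

```python
from collections import deque


class EventuallyConsistentStore:
    """Hypothetical two-copy store with asynchronous replication."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self._pending = deque()  # updates not yet applied to the replica

    def put(self, key, value):
        # Acknowledge as soon as the primary has the update; replicate later.
        self.primary[key] = value
        self._pending.append((key, value))

    def get(self, key, from_replica=False):
        source = self.replica if from_replica else self.primary
        return source.get(key)

    def replicate(self):
        # Apply queued updates in order so the replica converges on the primary.
        while self._pending:
            key, value = self._pending.popleft()
            self.replica[key] = value


store = EventuallyConsistentStore()
store.put("invoice-42", "v1")
store.replicate()
store.put("invoice-42", "v2")

stale = store.get("invoice-42", from_replica=True)   # replica has not caught up
store.replicate()
fresh = store.get("invoice-42", from_replica=True)   # replica has converged
print(stale, fresh)
```

A reader that hits the replica between the update and the replication step sees the older version, which is the behavior applications have to tolerate in exchange for availability at scale.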
Object storage can be deployed on premises or consumed as a service. Current platform-as-a-service options include the aforementioned Amazon S3 or Microsoft Azure Blob Storage. On-premises options include Dell EMC ECS, IBM Cloud Object Storage (formerly Cleversafe), Scality, or Cloudian, to name a few.
Gateway products such as Avere Systems, Nasuni, and Ctera, which present a filesystem to clients via NAS protocols but persist data on object storage, are becoming more popular. These products allow the introduction of native object storage while maintaining existing NAS interfaces, among other unique features.
Mixing it Up
As I mentioned at the beginning, systems are diversifying to maintain market share. Dell EMC Isilon started out as NFS, added SMB and other file protocols, but now also supports Hadoop Filesystem (HDFS) interfaces for analytics use cases.
Pure Storage’s FlashBlade started out as NFS-only, but is now incorporating S3 object support with more protocols on the way. Scality and Dell EMC ECS are object storage platforms that also support NFS.
All of this makes the core filesystem/data store design of any product critical. A good, scalable design can be more readily adapted for new protocols and use cases.
Mapping Use Cases to Solutions
It is more important than ever to have ongoing conversations with application teams about their roadmap for unstructured content, e.g. their plans for object storage. This direction, paired with the origin myth of available products, will help to narrow down the best options.
Here are some examples:
- If you have a use case that fits Isilon’s approach but with more aggressive response time expectations than prior generations could deliver, the latest all-flash F800 might be a good fit.
- If object storage is most critical, looking at systems that were originally designed to be object storage platforms will likely yield the best results.
- If object storage paired with low latency is a requirement, an all-flash S3 object-capable system like Pure Storage FlashBlade might be a good fit.
- If your use case is in the Write Once Read Never (WORN) category where data must be kept for the long term but will likely never be accessed, a service like Amazon Glacier might fit the need.
In addition to AHEAD’s experience on the platforms mentioned above, we have a mature methodology for onboarding and recommending new technology based on specific use cases. AHEAD’s scorecard process leverages a continuously updated knowledge base of how products compare against common criteria, greatly reducing the time and effort required for our clients to do product research. The primary goal is to establish a defensible argument for any technology decision.
Contact us to get started on building out your roadmap for unstructured data.