Big data, like other kinds of data, has a life cycle, but how many organizations think about big data in this way? Right now, probably not many.
By comparison, it has taken us decades to get on top of data life cycles for standard transactional data with fixed record lengths coming from baseline “systems of record” in the enterprise. Even now, it is not uncommon for IT to sit down with the various business functions deemed to “own” this data in order to determine both business and regulatory requirements for data retention and storage.
With big data, which can be unpredictable and come in many different sizes and formats, the process isn’t so easy. Yet, if we don’t start thinking about how we are going to manage this incoming mass of unstructured and semi-structured data in our data centers—where images, videos and documents are growing at a clip of 80 percent—we may never be able to lift our heads from under it!
IT’s heritage will tell us that the easiest way to manage big data is by throwing more storage at it, because storage is relatively cheap, it’s often easy to increment without sending out any budget “alerts” to the CFO, and it’s also what organizations have always done. If you stick deduplication technology in front of your storage, you can also weed out duplicate data and use data compression to further assist in managing the storage footprint.
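To make the deduplication-plus-compression idea concrete, here is a minimal sketch of a content-addressed store. It assumes fixed-size chunks identified by SHA-256 hashes and compressed with zlib. The class name `DedupStore`, its methods, and the 4 KB chunk size are hypothetical choices for illustration, not a description of any particular storage product.

```python
import hashlib
import zlib


class DedupStore:
    """Toy content-addressed store (hypothetical): one compressed copy per unique chunk."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size  # assumed fixed chunk size, in bytes
        self.chunks = {}              # SHA-256 hex digest -> compressed chunk
        self.files = {}               # file name -> ordered list of chunk digests

    def put(self, name, data):
        """Split data into chunks and store only the chunks not seen before."""
        digests = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:
                # New content: keep a single compressed copy.
                self.chunks[digest] = zlib.compress(chunk)
            digests.append(digest)
        self.files[name] = digests

    def get(self, name):
        """Rebuild a file from its chunk references."""
        return b"".join(
            zlib.decompress(self.chunks[d]) for d in self.files[name]
        )

    def referenced_chunks(self):
        """Total chunk references across all files (what naive copies would store)."""
        return sum(len(digests) for digests in self.files.values())


# Example: two departments keep copies of the same baseline plus their own results.
baseline = b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(512))
store = DedupStore()
store.put("baseline", baseline)
store.put("dept_a", baseline + b"dept-a results")
store.put("dept_b", baseline + b"dept-b results")
print(len(store.chunks), "unique chunks for", store.referenced_chunks(), "references")
```

Because the shared baseline lines up on chunk boundaries here, each department copy adds only its own trailing chunk. Real products often use variable-size chunking so that inserted bytes do not shift every later chunk, which this sketch does not attempt.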
Unfortunately, what you do in the data center doesn’t affect the distributed servers in different areas of the company that continue to house copies of the same big data.
Ash Ashutosh, CEO of Actifio, a data management software provider, cites an example of a medical research facility that generates 100 terabytes of data from the various instruments that it uses.
According to Ashutosh, the research facility has 18 different research departments that further process the same big data, with each department adding five terabytes of additional synthesized data to the baseline data.
“Now they must manage a total of over a petabyte of data, of which less than 150 terabytes is unique,” said Ashutosh. “Yet, the entire petabyte of data is backed up, moved to a disaster recovery site, consuming additional power and space used to store it all. So now, the medical center has used over 10 petabytes of storage to manage less than 150 terabytes of real unique data. This is not efficient.”
It isn’t efficient, and it’s the kind of big data problem that can’t be solved by just sitting down with various departments and identifying data retention policies.
Ashutosh recommends virtualizing all of this data and relocating it to the data center. The data center has techniques that can identify and eliminate duplicated data, and IT staff has data management experience that business users don’t have. Technologies like virtualization from single-source servers in the data center can also provide on-demand service and access to big data for end users throughout the organization.
There is also an operational side to this that involves data and process ownership, and that can become quite political.
The various departments involved must agree to surrender their physical servers and data, and to work off centralized and virtualized data that is maintained in the data center. This is where the CIO and other C-level executives come in, because people throughout the organization have to understand and support a virtual data policy and the data management guidelines that come with it.
In most organizations, this is still a work in progress. Consider:
- Only 51 percent of worldwide servers in companies were virtualized in 2012
- 44 percent of companies are delaying storage virtualization because of cost
What are the takeaways for IT?
First, some old-fashioned data management meetings, this time about big data, should be held at both the strategic and operational levels. These meetings will undoubtedly be about policies, but they will also be about control.
Second, if IT hasn’t already done so, it should get aggressive in the data center, putting into play technologies that are proven and ready to harness the big data that enters corporate portals every day.
Now is the time, before the digital floodtides sweep you away.