If you’re in the data and analytics business, then you’re probably already aware that the loose collection of technologies referred to as “Hadoop” is in trouble. Or at least, it’s at a crossroads. And if you do a search and read a few top articles to try to understand what happened, you’ll get almost unanimous agreement: the major cloud vendors such as Amazon, Microsoft, and Google offer reliable, low-cost object storage in the cloud at virtually unlimited scale, which diminishes the value proposition offered by Hadoop and its HDFS file system. Other technical and competitive factors are cited, but this is at the heart of it.
But is that really what happened? Well, yes, but it misses the most important parts of the story. Imagine it’s the late 1800s and you’re constructing a high-rise building on unstable swampland using concrete as the primary load-bearing material. After some time, you realize things aren’t going well. Then you discover that steel has become viable as a lightweight, strong alternative to concrete. Should you switch? Maybe. But will that solve your problem? Switching to a new material does nothing about the unstable swampland.
For those who have committed to the Hadoop ecosystem and are now more than a little concerned about what to do next, there are three factors that need to be considered, in addition to the correct – but incomplete – competitive factors that brought us to this point.
Data management discipline was severely downplayed. The “schema on read” paradigm is great for storing high volumes of unrefined data quickly, but when you build production solutions, you need to describe the data (that is, apply a schema) so end users and developers can work with it far more easily. You also must profile and monitor quality, apply descriptive metadata, establish security and access mechanisms, rationalize master data, and so on. None of these needs has gone away. I have yet to meet a business analyst who is happy to spend 80+% of their time “wrangling” data, only to end up hesitant to trust its quality for their real work – finding value in the data.
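To make the distinction concrete, here is a minimal sketch of the difference between schema on read and declaring a schema up front, assuming a Spark-based environment (common in Hadoop-era stacks); the paths and field names are purely hypothetical:

```python
# Minimal sketch: schema on read vs. an explicit, documented schema.
# Paths and field names are hypothetical, for illustration only.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("schema-example").getOrCreate()

# Schema on read: fast to land the data, but every consumer must rediscover
# and re-validate the structure for themselves.
raw = spark.read.json("s3://example-bucket/landing/orders/")

# Applying a schema for a production solution: the structure is declared once,
# types are enforced on read, and downstream users can rely on documented fields.
order_schema = StructType([
    StructField("order_id", StringType(), nullable=False),
    StructField("customer_id", StringType(), nullable=False),
    StructField("amount", DoubleType(), nullable=True),
    StructField("order_ts", TimestampType(), nullable=True),
])

orders = (
    spark.read
    .schema(order_schema)
    .option("mode", "PERMISSIVE")  # keep rows, null out fields that do not conform
    .json("s3://example-bucket/landing/orders/")
)
```

The point is not the specific API but the shift in responsibility: with an explicit schema, the work of describing and validating the data is done once by the people who know it best, rather than repeated by every analyst who touches it.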
Hadoop was dramatically over-positioned. I’ve often heard the statement that “developing in Hadoop is fast”. No, it isn’t. Loading raw data for exploration and experimentation can indeed be done very quickly. But dumping piles of data into a “data lake” as an enterprise data strategy simply passes the burden on to others. In fact, if you do take the time to build a rigorous production solution in the Hadoop ecosystem, it takes more time and much more skill (and therefore carries more risk), because the technology has nowhere near the automation, performance, and workload management that have been in place and maturing for decades in commercial database technology. Low-cost storage and processing, whatever the underlying mechanism, most definitely has its place; it’s just not the right tool for every job. When steel emerged as a revolutionary building material, concrete continued (and still continues, all these years later) to play a crucial role.
Data lakes were not adequately aligned to important business initiatives. Consider the following bold statement from Gartner, published in 2014:
“Through 2018, 90% of deployed data lakes will be useless as they are overwhelmed with information assets captured for uncertain use cases.”
They could have said “70%” and it still would have been audacious – especially at the peak of the “Big Data” hype cycle. And they could have said “will fail to meet expectations”, but instead they said “will be useless”. The market was warned.
The most important part of the statement, though, is the reason given: “… overwhelmed with information assets captured for uncertain use cases.”
It has always been a mistake – and in my experience, it’s the most common mistake in an enterprise data strategy – to deploy data as a production enterprise “foundation” without directly linking every data element deployed to one or more production application needs. The ability to deploy data into Hadoop without applying structure in advance only exacerbated this problem. Again, this approach is great for landing high volumes of raw data for exploration and experimentation, but it is an abdication of responsibility when it becomes the primary approach to establishing production data across the enterprise.
Conclusion
If recent news about Hadoop is causing you to pause and consider a change in direction, don’t just think about which technology to switch to. Think about what’s really at the root of your struggles. Don’t forget the importance of data management practices. Make sure to use the right technology for the right role within a modern data and analytics ecosystem. And, most importantly, build and implement a responsible enterprise data strategy: cascade it from the company’s business strategy, its funded business initiatives, and the data and analytics required for their success. Simply dragging several petabytes of files from HDFS to S3 won’t cut it. Like architecture and engineering for physical structures, success with data and analytics requires more than chasing the latest resume-building buzzword; it requires rigor and discipline. Do not ignore the need for a firm foundation of professionalism.