Ten years ago, the journey began to find a flexible, versatile approach to building a central data store where all enterprise data could reside. The solution was the data lake—a general-purpose data storage environment that could hold practically any type of data. It would also allow business analysts and data scientists to apply the most appropriate analytics engines and tools to each data set in its original location.
Typically, these data lakes were built on Apache Hadoop and the Hadoop Distributed File System (HDFS), combined with engines such as Apache Hive and Apache Spark. As these data lakes grew, a set of problems became apparent. While the technology could physically scale to capture, store and analyze vast and varied collections of structured and unstructured data, too little attention was paid to the practicalities of embedding these capabilities into business workflows.