What Is a Data Lake?

A "data lake" is a storage repository, usually in Hadoop, that holds a vast amount of raw data in its native format until it is needed.  It's a great place for investigating, exploring, experimenting with, and refining data, in addition to archiving it.  There are various products you can use to build a data lake; all major cloud vendors, including Microsoft, AWS, and GCP, offer their own flavors of it. Data lakes are increasingly necessary because companies can now draw on so many data sources to make better business decisions, such as social networks, review web sites, online news, weather data, web logs, and sensor data.  All of these "big data" sources result in rapidly increasing data volumes and new data streams that all need to be analyzed. Some characteristics of a data lake include:

  • A place to store virtually unlimited amounts of long-term data in any format inexpensively, since Hadoop is usually a much lower-cost repository
  • Allows collection of data that you may or may not use later: “just in case”
  • Allows for easy integration of structured data, semi-structured data (e.g. XML, HTML), unstructured data (e.g. text, audio, video), and machine-generated data (e.g. sensor data)
  • A way to describe any large data pool in which the schema and data requirements are not defined until the data is queried: “just in time” or “schema on read”
  • Complements an Enterprise Data Warehouse (EDW) and can be seen as a data source for the EDW – capturing all data but only passing relevant data to the EDW
  • Frees up expensive EDW resources (storage and processing), especially for data refinement
  • Exploits the full power of a Hadoop cluster to speed up ETL processing compared with SMP data warehouse solutions
  • Allows data exploration to be performed without waiting for the EDW team to model and load the data. As an added benefit, exploration may show that the data is not useful, which saves the EDW team from wasting resources on it
  • An excellent area to run extreme analytics, such as running millions of scoring models concurrently on millions of accounts to detect fraud on credit cards, which is typically not a workload you would see running in a data warehouse
  • A place to land streaming data from, for example, IoT devices or Twitter.  This data can also be analyzed during ETL processing (e.g. scoring Twitter sentiment)
  • An on-line archive for data warehouse data that is no longer analyzed on a frequent basis
  • Some processing is better done on Hadoop than in ETL tools like SSIS
  • Also called a bit bucket, staging area, landing zone, or enterprise data hub (Cloudera)
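
The “schema on read” idea from the list above can be illustrated with a minimal Python sketch. It is not tied to Hadoop or to any vendor's product: the raw records, the field names, and the helper `read_with_schema` are all hypothetical, chosen only to show data being stored as-is and a schema being applied when a consumer queries it.

```python
import json

# Raw events as they might land in the lake: stored verbatim, with no
# schema enforced at write time. The records and field names are made up.
RAW_LINES = [
    '{"device": "sensor-1", "temp_c": "21.5", "ts": "2024-01-01T00:00:00"}',
    '{"device": "sensor-2", "temp_c": 19, "extra": "ignored"}',
    '{"user": "alice", "text": "great product"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time (hypothetical helper).

    Keeps only records that contain every field named in the schema,
    casts each of those fields, and drops all other fields.
    """
    for line in lines:
        record = json.loads(line)
        if all(field in record for field in schema):
            yield {field: cast(record[field]) for field, cast in schema.items()}


# A schema chosen by one consumer at query time, not by the writer.
TEMPERATURE_SCHEMA = {"device": str, "temp_c": float}

readings = list(read_with_schema(RAW_LINES, TEMPERATURE_SCHEMA))
for reading in readings:
    print(reading)
```

A different consumer could read the same raw lines with a different schema (say, `{"user": str, "text": str}`) without anything being reloaded or remodeled, which is the contrast with a warehouse where the schema is fixed before the data is loaded.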