Azure Synapse Analytics

“Azure Synapse Analytics is a limitless analytics service that brings together enterprise data warehousing and Big Data analytics. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources, at scale. Azure Synapse brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for immediate business intelligence and machine learning needs.”

I hope you saw this week’s announcement of Azure Synapse Analytics (public preview) – a single platform for building end-to-end analytics, from ingestion to visualization, for data engineers, data scientists, analysts, architects, and BI developers.

The two images below will give you more clarity on Azure Synapse Analytics and the Modern Data Warehouse.

The Modern Data Warehouse is a well-proven architecture used by many customers. It combines many services across different layers to achieve the goal.

Azure Synapse Analytics gives you a single service to achieve the same goal, with flexible options and integrations bundled into one product. Data engineers, data scientists, developers, and visualization experts can all work together in one place.

The image above shows Synapse Studio, which gives a single view for each and every role in your big data ecosystem.

Some key features and areas:

  1. Enterprise data warehouse
    1. Build your data warehouse on the proven foundation of the industry’s top-performing T-SQL engine.
  2. Data Discovery, Exploration and Transformation
    1. Explore and analyze data lake and operational data together in real time, using either serverless or provisioned compute.
    2. Link directly to operational databases for data analysis and reporting without impacting production.
  3. Choice of languages
    1. Use your preferred language, including T-SQL, Python, Scala, Spark SQL, and .Net—whether you use serverless or provisioned compute resources.
  4. Code-free data orchestration (ETL)
    1. Build end-to-end ETL/ELT processes in a code-free visual environment to easily ingest data from more than 85 native connectors.
  5. Deeply integrated Apache Spark and SQL engines
    1. Enhance collaboration among data professionals working on advanced analytics solutions. Easily use T-SQL queries on both your data warehouse and embedded Spark engine.
  6. Streaming ingestion & analytics
    1. Perform real-time analytics on streaming data directly in your data warehouse.
  7. Integrated AI and BI
    1. Complete your end-to-end analytics solution with deep integration of Azure Machine Learning and Power BI.
    2. Invoke machine learning models from T-SQL queries, bringing the model close to the data.
  8. Industry-leading management and security
    1. Use built-in features to ensure your data and processes are seen by only those with authorized access.
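As a rough sketch of the serverless experience described in the features above (the storage account, container, and folder layout here are hypothetical placeholders, not real resources), a T-SQL query over Parquet files sitting in the data lake might look like this:

```sql
-- Query Parquet files in the lake directly, without loading them into tables first.
-- The storage URL and path are illustrative placeholders.
SELECT TOP 10 *
FROM OPENROWSET(
        BULK 'https://mystorageaccount.dfs.core.windows.net/datalake/sales/*.parquet',
        FORMAT = 'PARQUET'
    ) AS sales;
```

With serverless compute you pay per query for the data processed, so nothing needs to be provisioned before the query runs; the same statement can also be pointed at a provisioned SQL pool when predictable performance matters.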

Video URLs

  1. Get Started link – https://azure.microsoft.com/en-us/resources/azure-synapse-analytics-toolkit/
  2. Recorded demo for analytics users – https://www.youtube.com/watch?v=xzxjpQSvDEA&t=170s
  3. Recorded demo for developers and architects – https://www.youtube.com/watch?v=G5MW93oYPOI
  4. Recorded demo for Power BI users – https://www.youtube.com/watch?v=MIXbZboW2qY

What is a Data Lake?

A “data lake” is a storage repository, often built on Hadoop, that holds a vast amount of raw data in its native format until it is needed.  It’s a great place for investigating, exploring, experimenting, and refining data, in addition to archiving it.  There are various products you can use to build a data lake; all major cloud vendors (Microsoft, AWS, and GCP) provide their own flavors.  Data lakes are becoming much more necessary now that there are so many data sources companies can use to make better business decisions, such as social networks, review websites, online news, weather data, web logs, and sensor data.  All of these “big data” sources result in rapidly increasing data volumes and new data streams that all need to be analyzed. Some characteristics of a data lake include:

  • A place to store unlimited amounts of long-term data in any format inexpensively, since Hadoop is usually a much lower-cost repository
  • Allows collection of data that you may or may not use later: “just in case”
  • Allows for easy integration of structured data, semi-structured data (e.g. XML, HTML), unstructured data (e.g. text, audio, video), and machine-generated data (e.g. sensor data)
  • A way to describe any large data pool in which the schema and data requirements are not defined until the data is queried: “just in time” or “schema on read”
  • Complements an Enterprise Data Warehouse (EDW) and can be seen as a data source for the EDW – capturing all data but only passing relevant data to the EDW
  • Frees up expensive EDW resources (storage and processing), especially for data refinement
  • Exploits the full power of a Hadoop cluster to speed up ETL processing compared with SMP data warehouse solutions
  • Allows data exploration to be performed without waiting for the EDW team to model and load the data, with the added benefit that exploration may show the data is not useful, saving the EDW team from wasting resources
  • An excellent area to run extreme analytics, such as running millions of scoring models concurrently on millions of accounts to detect fraud on credit cards, which is typically not a workload you would see running in a data warehouse
  • A place to land streaming data, for example, from IoT devices or Twitter.  This data can also be analyzed during ETL processing (i.e. scoring Twitter sentiment)
  • An online archive for data warehouse data that is no longer analyzed on a frequent basis
  • Some processing is better done in Hadoop than in ETL tools like SSIS
  • Also called a bit bucket, staging area, landing zone, or enterprise data hub (Cloudera)
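The “schema on read” characteristic above can be illustrated with a Synapse serverless T-SQL sketch: the files land in the lake with no declared schema, and the schema is supplied only at query time in the WITH clause. The storage URL, file layout, and column names below are hypothetical placeholders:

```sql
-- Schema is applied when the data is read, not when it is ingested.
SELECT logs.region, COUNT(*) AS events
FROM OPENROWSET(
        BULK 'https://mystorageaccount.dfs.core.windows.net/datalake/weblogs/*.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        FIRSTROW = 2          -- skip the header row
    )
    WITH (
        event_time DATETIME2,
        region     VARCHAR(50),
        url        VARCHAR(1000)
    ) AS logs
GROUP BY logs.region;
```

If the analysis proves useful, the same data can then be modeled properly and loaded into the EDW; if not, nothing was spent on upfront modeling.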