On Thursday Dec 3rd Microsoft announced the latest functionalities of the Azure data and analytics platform: Azure Synapse Analytics and the much-anticipated Azure Purview (previously known as Azure Data Catalog v2 or Project Babylon). There is a lot one could say about the session as a whole, but I will skip the lengthy build-up of yesterday’s announcement and move straight to the essence of this review, Purview: ‘A unified and comprehensive data governance service’.
Some of the announced features and benefits:
- Democratize access and maximize value of all your data
- Empower everyone in the organization to find and understand the data
- Govern data use and assess risks
- Data relationships and lineage across the entire data estate
- Data source scanning and classification across multiple clouds
- Business glossary – a trusted inventory of all business terms
- Enable security and compliance
This is a major and powerful addition to the Azure Data and Analytics platform, and looking at the listed functionalities, we can tick off a lot of requirements. It is a significant improvement on the first Data Catalog version in Azure.
In the introduction to the Purview announcement it was emphasized that Purview is where everyone in the organization is empowered to find and understand data. However, the demo starts by saying that Purview is where your analysts, data engineers and data scientists (your technical data specialists) can find the data they need. Having watched the demo and set up my own Purview data catalog in Azure, I agree with the latter: the Purview user interface is quite technical in terms of layout, terminology and level of detail.
The scope of this short functionality review is closely tied to the needs we see among our customers, with a specific focus on structured data warehouse data.
Below are five common requirements and a short evaluation of these in (the current preview version of) Purview.
1. End-to-end lineage from source to Power BI report
Our clients typically have data in SQL Server, either on premises or in the cloud, and the good news is that Purview allows a range of different data storage options, both cloud and on premises, to be included in the catalog. However, after importing our entire data warehouse demo environment into Purview, we discovered that views and stored procedures are not included. Creating both the data marts and the dimensional model layer of the data warehouse as views is very common, and not being able to see these objects breaks the lineage in Purview. Many of our clients also use SSAS, and this does not seem to be available as a data source (i.e. object type in the catalog) in Purview. Lineage for data loads and data processing (ETL) is not imported at the metadata level, but is captured when data is processed. This is an interesting approach, and we have not yet been able to explore how it works for lineage at scale. To achieve end-to-end lineage from source to Power BI report with a data warehouse architecture, you currently need to load data into, for example, SQL Server using ADF, avoid views for data modelling, and connect Power BI directly to the database rather than through SSAS.
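To make the effect of the missing views concrete, here is a minimal sketch in Python. It is not based on any Purview API; the object names and the lineage graph are invented for illustration. It only shows why a single uncataloged view is enough to cut the path from source to report.

```python
from collections import deque

# Hypothetical lineage edges (upstream -> downstream); all names are illustrative.
FULL_LINEAGE = {
    "src.Customer": ["dwh.Customer"],
    "dwh.Customer": ["dm.vDimCustomer"],      # a view in the dimensional layer
    "dm.vDimCustomer": ["pbi.SalesReport"],
}

def drop_objects(edges, missing):
    """Return a copy of the graph without the given objects, as if they were never cataloged."""
    return {
        src: [dst for dst in dsts if dst not in missing]
        for src, dsts in edges.items()
        if src not in missing
    }

def is_reachable(edges, start, target):
    """Breadth-first search from start; True if target is downstream of start."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# Simulate a catalog that does not contain the view.
cataloged = drop_objects(FULL_LINEAGE, {"dm.vDimCustomer"})

print(is_reachable(FULL_LINEAGE, "src.Customer", "pbi.SalesReport"))  # True
print(is_reachable(cataloged, "src.Customer", "pbi.SalesReport"))     # False
```

With the view present the report can be traced back to the source table; once the view is left out, the same query finds no path, even though every table is still in the catalog.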
2. Able to document and describe every object in the data warehouse and ETL process
Purview supports ADF, but our clients also generate business logic and do data modelling using views or stored procedures in SQL Server or Synapse. These objects are not included in the Purview catalog. For the objects that are included, you can document them down to the column level using the description property. However, once you do this, the object will never be synchronized from the data source again. We see this as a major weakness, because it means that descriptions and documentation must be maintained in every data source, which is not always feasible.
We see the same thing for the ‘contact’ tab where you can add experts and owners. The object (table) is then excluded from future data source scans and will not be updated if it changes in the database.
3. Able to tag/classify data and trace the ‘tag’
The auto-classification feature is very powerful in Purview. There are built-in classification rules available and it is possible to define your own rules and classification types. What is not yet clear is how the object classification ‘follows’ the object and is propagated with the lineage to where the object is used.
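As an illustration of what ‘following’ the object could mean, the sketch below propagates classifications downstream through a lineage graph. This is only our assumption of the desired behaviour, not a description of how Purview works; the column names, the lineage and the `propagate` function are all hypothetical.

```python
# Hypothetical lineage (upstream -> downstream) and classifications found by a scan.
LINEAGE = {
    "src.Person.Email": ["dwh.Customer.Email"],
    "dwh.Customer.Email": ["pbi.CustomerReport.Email"],
    "src.Order.Amount": ["dwh.Sales.Amount"],
}
SCANNED = {"src.Person.Email": {"Email Address"}}

def propagate(lineage, scanned):
    """Copy each classification to every object downstream of where it was found."""
    result = {obj: set(tags) for obj, tags in scanned.items()}
    stack = list(scanned)
    while stack:
        node = stack.pop()
        for child in lineage.get(node, []):
            existing = result.setdefault(child, set())
            new_tags = result[node] - existing
            if new_tags:
                # Only revisit a child when it gained tags, so cycles terminate.
                existing |= new_tags
                stack.append(child)
    return result

tags = propagate(LINEAGE, SCANNED)
for obj in sorted(tags):
    print(obj, sorted(tags[obj]))
```

In this toy example the classification found on the source column ends up on the data warehouse column and the report field, while the unrelated amount columns stay untagged. Whether Purview does something comparable is exactly what remains to be verified.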
4. Search ability on all metadata levels (from calculation or visual in Power BI to column name in data source)
The search functionality, though not tested extensively, seems to work fine in Purview. There are, though, some peculiarities in how words and phrases are identified. For example, technical object names using CamelCase are not necessarily compatible with text searches. The common ‘danger’ when searching across all data is getting lost in the search results. This is mitigated by different metadata groupings, for example on Glossary terms or Classifications, and by icons on object types, which seem to work fine. (The search experience is actually very similar to the search in Xpert BI Solution Catalog.)
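The CamelCase issue is easy to reproduce outside of any catalog. The sketch below assumes a search index that only splits text on whitespace (we do not know how Purview actually tokenizes), and shows how splitting identifiers on case boundaries would make the individual words searchable. The function names and the regular expression are our own.

```python
import re

# Matches, in order: an acronym followed by a capitalized word ("SSAS" in "SSASCube"),
# a normal word with optional leading capital, a trailing acronym, or a run of digits.
_WORD = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")

def split_identifier(name):
    """Split a technical object name into lowercase search tokens."""
    return [m.group(0).lower() for m in _WORD.finditer(name)]

def matches(term, name):
    """True if the search term equals one of the tokens in the object name."""
    return term.lower() in split_identifier(name)

# A whitespace-only tokenizer sees the whole name as a single token.
print("customer" in "DimCustomerAddress".lower().split())  # False
print(split_identifier("DimCustomerAddress"))               # ['dim', 'customer', 'address']
print(matches("customer", "DimCustomerAddress"))            # True
```

A search for ‘customer’ misses DimCustomerAddress when the name is indexed as one word, but finds it once the name is split into its parts.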
5. Supporting change and solution life cycle
Data platforms change constantly, with both data and objects being added and removed, so it is important that the data catalog reflects the data that is currently available. From a data warehouse perspective, it is also important to be able to support DEV/QA/PROD environments and to copy descriptions and documentation along with objects as they are migrated. It is not yet clear how Purview supports this. From our current understanding, describing objects is not meant to be done in Purview; descriptions are meant to be imported with the object from the source. When descriptions are added in Purview, the object is ‘frozen’ and will not be synchronized from the data source again. There are some best practices for deployment here: https://docs.microsoft.com/en-us/azure/purview/deployment-best-practices.
Lastly, I want to briefly compare Purview with our own Solution Catalog.
- Scope: Purview has an enterprise-level scope with a wide range of supported object types and data storage types. Solution Catalog is designed and built for the data warehouse, with a focus on end-to-end lineage and documentation.
- Focus: At first glance the principal difference is that Purview seems more DATA focused, i.e. focused on tables/files/storage, volume and access, whereas Solution Catalog is more METADATA focused, i.e. traceability, data flows and lineage, and documentation of every object, including views and code objects.
- User: As mentioned, Purview is quite technical in navigation and terminology, while Solution Catalog offers a view for the business user which ‘hides’ some of the technical complexity of a data warehouse solution and its ETL processes and shows, for example, data models and the bus matrix in a grid view.
Since the Solution Catalog also includes views and stored procedures as well as SSAS cubes and tabular models, the two catalogs can be complementary. Xpert BI Solution Catalog has a metadata REST API for exporting data and could provide powerful additions to these object types and lineage in Purview, if or when Purview supports this.
Good luck with your own evaluation, and please do not hesitate to contact us at BI Builders to learn more.