Today, many organizations experience a growing demand for data and insights, delivered faster. This is often fueled by digital transformation and increased desire to use analytics, AI and ML to drive more business value from data. The data warehouse is a cornerstone in delivering key parts of the data needed. However, new ways of working are needed to unleash the full productivity, scale and agility needed to meet the rapidly growing demands.
Data warehouse automation puts accumulated experience into a tool to increase productivity, improve agility and ensure scalability. The following sections explain what a data warehouse is and how data warehouse automation enables the modern, data-driven organization to achieve better business outcomes with data.
What is a Data Warehouse?
There are many perceptions of what a data warehouse is, and new technologies and tools have also changed the usage and definition over time.
A key purpose of the data warehouse is to give the right users timely access to qualified data on which decisions can be made. The term ‘one truth’ is commonly used. It implies concepts such as master data management and different modelling techniques that can be achieved through a data warehouse.
The line between a data warehouse, a data hub, a data lake and a data platform is becoming blurred as different technologies give more flexibility and may overlap. For example, a service such as Azure Synapse Analytics may provide a wide range of options for data storage, data processing and data consumption, essentially covering more than just a data warehouse.
Typically, a data warehouse is viewed as one key component of a data platform, responsible for managing structured data. The structured data is used for operational, administrative and strategic decision making. The whole data platform typically also extends to big data, covering unstructured data, data exploration, data science and R&D-type activities.
The Logical Data Warehouse (Gartner Group)
Gartner Group introduced the term ‘Logical Data Warehouse’ (LDW). The LDW is a data consolidation and virtualization architecture of multiple analytic systems. The motivation is simple: Different approaches are needed to cover fundamentally different data and analytics needs. The needs range from the known to the unknown in terms of data and the questions we want to have answered.
A financial report is an example of how well-known financial data is used to answer a well-known question, such as “how well is the business performing financially?”. This is typically solved in a data warehouse with basic analytics such as dashboarding and interactive reports.
In comparison, testing a hypothesis during a churn analysis may be answering a previously unknown question, using relatively unknown data such as describing customer behavior. This, on the other hand, is often implemented using a data lake with advanced analytics, AI or ML toolsets. The LDW is an overarching concept of how to build one, consistent architecture to cover all of these very different data and analytics needs.
The LDW consists of different data storage and usage areas such as:
- Data Warehouse
- Data Lake
- Operational Intelligence
- Data Science and Machine Learning
Read about the structured components of the LDW (Gartner)
The LDW architecture can also be referred to as a data platform when it comes to the data storage and data access philosophy. That is, you should gather and store all types of data using the appropriate technology for each data type, with the ambition of having one data access interface across the underlying storage technologies.
A typical approach could be to store structured data for answering well-known questions in the data warehouse, while storing semi-structured, unstructured or large-volume or unknown structured data in a data lake. The LDW enables the flexibility of different storage options and may hide some of the complexity of data access across different technologies.
Why Do I Need a Data Warehouse?
These are common requirements for implementing a data warehouse:
- Getting access to data across organizational units and levels.
- Combining data from various line-of-business applications in reports and analytics.
- Adding flexibility to analyze data, which is not available in line-of-business applications.
- Tracking historic changes in data over time, whereas line-of-business applications only keep the current snapshot.
- Owning and controlling data assets.
- Increasing data quality for improved decision making.
- Having master data and business glossaries for analytics.
- Adding logic for analytics that is not easily implemented and maintained in a line-of-business system or directly in a report.
- Enabling compliance with regulations and governance rules.
Looking at each requirement in isolation, many can be met by other means than a data warehouse. However, as the organization matures in the way it uses data, a data warehouse is typically introduced as a common data repository that handles all of these requirements combined.
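As an illustration of the history-tracking requirement above, here is a minimal sketch (hypothetical, not any specific tool's implementation) of a Slowly Changing Dimension Type 2 merge, one common way a data warehouse keeps the changes that line-of-business applications overwrite:

```python
from datetime import date

# Hypothetical sketch: a minimal SCD Type 2 merge. The warehouse keeps
# every version of a row with validity dates, while the source system
# only holds the current snapshot.
def apply_scd2(history, snapshot, today):
    """Merge today's source snapshot into the history table.

    history:  list of dicts with keys id, attrs, valid_from, valid_to
              (valid_to is None for the currently valid version)
    snapshot: dict mapping id -> current attribute dict in the source
    """
    current = {row["id"]: row for row in history if row["valid_to"] is None}
    for key, attrs in snapshot.items():
        row = current.get(key)
        if row is None:  # entity seen for the first time
            history.append({"id": key, "attrs": attrs,
                            "valid_from": today, "valid_to": None})
        elif row["attrs"] != attrs:  # changed: close old version, open new
            row["valid_to"] = today
            history.append({"id": key, "attrs": attrs,
                            "valid_from": today, "valid_to": None})
    return history
```

A customer moving from Oslo to Bergen would then yield two rows: the old one closed with a valid_to date, and a new current one, so reports can show the data as it was at any point in time.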
Is my organization ready for a Data Warehouse?
To identify if your organization should invest in a data warehouse, here are some of the things to investigate:
- Do you have key personnel, e.g. financial controllers, on whom the organization depends for reports and data?
- Have you experienced management meetings where different reports give slightly different numbers or different conclusions?
- Do you miss having access to historical data at any point in time?
- Does your organization keep reports in Excel spreadsheets and/or PowerPoint presentations?
- Do you spend time discussing which data is correct, instead of acting on the data?
- Do you know which data sources and business rules are applied to the reports/data on which decisions are made?
- Do you have the need to look at data across different applications?
If the answer is YES to any of these, then you should consider a data warehouse.
Learn more from these Use Cases
What Are Common Components of a Data Warehouse?
Components of a data warehouse architecture can be viewed from two perspectives: the architecture design, and the technical components, where technologies meet one or more architectural design requirements.
On an architectural level, the most common components are:
- Data source layer
- Raw data layer
- Mid-data layer and/or ETL/Flow layer
- Semantic layer
- Visualization layer
- Data governance and design best practices
When looking at the technical components that meet these architecture design requirements, there are numerous options, where budget, competencies and actual requirement priorities determine the technical choices:
- Data storage: Databases such as SQL Server, Oracle, Redshift and SAP HANA, or distributed file systems such as HDFS.
- Data integration and data processing: Tools, languages or code to extract, move, transform, store and model data such as Xpert BI, SSIS, Azure Data Factory, SQL, Python, Informatica and Matillion.
- Semantic layer: Data modeled for easier consumption and use, typically as fact and dimension tables in an in-memory engine such as OLAP cubes, SSAS Tabular and SAP HANA.
- Visualization: Tools to create and share data visualizations, reports, dashboards and analysis such as Power BI, Qlik, Tableau and Excel.
The examples listed here are just a small selection of what we typically encounter at our customers.
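To make the layers concrete, here is a minimal, hypothetical sketch in Python, with an in-memory SQLite database standing in for the warehouse; the table and view names are illustrative:

```python
import sqlite3

# Hypothetical sketch: an in-memory SQLite database stands in for the
# warehouse to show how the architectural layers map to concrete objects.
con = sqlite3.connect(":memory:")

# Raw data layer: source rows landed as-is, with load metadata added.
con.execute("CREATE TABLE raw_orders (order_id, amount, loaded_at)")
con.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)",
                [(1, 120.0, "2021-01-01"), (2, 80.0, "2021-01-01")])

# Semantic layer: the same data reshaped for easy consumption,
# here as a simple fact view that a visualization tool could query.
con.execute("""CREATE VIEW fact_sales AS
               SELECT order_id, amount FROM raw_orders""")

total = con.execute("SELECT SUM(amount) FROM fact_sales").fetchone()[0]
print(total)  # 200.0
```

In a real solution each layer would live in dedicated warehouse technology, but the separation of raw landing from consumption-ready structures is the same idea.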
In our experience, the success of a data warehouse project is not decided by technology or functionality alone. Equally important is sharing and communicating the contents and training users in how to use the solution, both for pre-defined reports and self-service. It also matters to stay agile, incorporating new requirements and more data sources and reflecting the needs of the organization.
Lack of wide user adoption
Very often a data warehouse solution is run by a few data enthusiasts within an organization, without appreciating the effort required to gain user acceptance across the range of roles that exist. Though solution ‘champions’ or ‘sponsors’ are key to success, it is also a common pitfall to rely too much on these few employees and not research the wider needs and requirements of the entire organization.
Lack of budget for continuous improvement
Take into account that data sources, requirements and data quality change over time, and that time may be soon. Having technologies that can handle change is important, but the change still needs to be performed and tested to ensure the endurance of the data quality in the solution. The resource budget should therefore continue after the implementation is ‘finished’ to account for this. Over time, usually after 12 months, there will also be a need for budget to remove technical debt or sunset ideas and solutions that were built earlier but later changed and improved.
Unclear Master data or data integration principles
As soon as the data warehouse solution combines data from more than one data source, integration rules must be in place. This might not require a master data management system or another big project, but it will require some standards and principles to maintain data quality. Often a term used in one department has a slightly different meaning in another, or the lists of customers or projects are not completely synchronized across all systems. However big or small the solution, organizations should appoint a resource who can specify and verify that the integration principles are being followed.
Lack of consistent technology choices
Another common pitfall is choosing technology without determining the actual business need first. Although IT is becoming a heavily integrated part of the business, it is most often not the actual business, but should be a supporting enabler. It is important to keep in mind that the solution must be able to co-exist in the current environment and be compatible with current solutions. Considering the skills and competencies available both inside the organization and externally can be another factor. The most common pitfall when it comes to technology is failing to evaluate the following:
- Are people in IT and line-of-business aligned around common goals?
- Will the technology choices enable the needed agility and productivity?
- Are the technologies scalable, not only in terms of data volumes, but also supporting bigger teams collaborating across disciplines?
- Is there sufficient competency available in the market at an affordable cost?
- Is automation with standardization and best-practices available?
- Is this technology compatible and interoperable with other technologies we have or plan to purchase?
- Does it support our current security requirements?
- Does it support regulatory compliance requirements?
- What is the TCO (Total Cost of Ownership) or the lifetime cost of the technology?
Data Warehouse Automation: The Next Evolutionary Step
To meet the data needs of the organization, data warehouse developers and engineers need new tools to be more productive, more agile and to scale their solutions better over time. With more data sources and increasingly complex data structures, the need for automation also increases. Many would argue it is ‘do or die’ for the traditional data warehouse.
Data Warehouse Automation (DWA) has been available for over a decade, but adoption has only recently started to accelerate. This chapter will explain why you should consider automation for your data warehouse.
“Data warehouses provide organizations with reliable, high performance analytics, but implementing and maintaining these systems can be daunting. Data warehouse automation can provide data and analytics leaders with a more productive means of meeting analytics requirements.”
Automating Data Warehouse Development by Henry Cook, Gartner Analyst
The need for automation in data warehouse solutions is a natural evolution, as in any industry or technology area. The start was standardization of modelling techniques with different approaches, such as those by Kimball, Inmon or Linstedt, all aiming to make the data warehouse more standardized and thereby more efficient. However, since these techniques are not technology specific, which is in principle a good thing, they may sometimes result in very complicated solutions. This is due to implementing complex design patterns that the technologies (ETL tools) were not designed to handle at such scale.
Automation lifts standardization to a new level: a tool can infer quite a lot of what needs to be done just by looking at metadata. It can create standardized structures and design patterns, and it can ‘hide’ the complexity of highly varied data sources behind simple user interfaces for configuration. The result is not only higher productivity, but also higher quality in the solution. Quality, as in higher data quality, but also in terms of usability and content, by having the developers focus more on implementing user requirements for data rather than on tedious, repetitive tasks.
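As a minimal, hypothetical sketch of this metadata-driven idea (the names and patterns are illustrative, not any specific tool's API), a few lines of Python can generate a standardized staging table and load statement from harvested column metadata:

```python
# Hypothetical sketch of metadata-driven automation: given column
# metadata harvested from a source system, generate a standardized
# staging table and load statement. Names and patterns are illustrative.
def generate_staging_sql(table, columns):
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    ddl = (f"CREATE TABLE stg_{table} (\n  {cols},\n"
           f"  _loaded_at TIMESTAMP\n)")
    col_list = ", ".join(name for name, _ in columns)
    load = (f"INSERT INTO stg_{table} ({col_list}, _loaded_at)\n"
            f"SELECT {col_list}, CURRENT_TIMESTAMP FROM src_{table}")
    return ddl, load

ddl, load = generate_staging_sql("customer",
                                 [("customer_id", "INT"),
                                  ("name", "VARCHAR(100)")])
```

Because every staging table comes from the same pattern, they all get identical structure and audit columns, which is exactly the standardization effect described above.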
One of the benefits of traditional ETL/ELT tools is having a never-ending list of functionalities and options. This is also their main drawback. Giving developers too many options for how to solve a particular problem also results in a lack of standardization within solutions and a risk of manual errors. The challenge is finding the right balance between agility, flexibility and creativity on one hand, and standardization and robustness on the other.
The illustration below shows how, given a set of resource constraints (budget, time, people and organizational strategies) and a continuous stream of business requirements, IT and/or developers usually face a tradeoff between agility, productivity, robustness and risk reduction when using traditional ETL/ELT tools. It is extremely difficult to check all the boxes. However, when utilizing automation, this can be achieved. Read more in our blog post on Why ETL Tools are not enough for Agile Data Warehouse development and operation.
Ideally, the automation tool, which is basically a machine, should handle boring, repetitive tasks that require high consistency and low creativity. These are also tasks that people such as developers dislike anyway. An example of a boring task is updating data lineage documentation after a code change. There is very little reason why that cannot be automated.
Developers on the other hand, should spend more of their time on high-value tasks that cannot be automated. Examples are discussing requirements with line-of-business users and finding clever ways of handling data modelling, combining data and implementing business rules. The automation tool will serve as a supporting team member by handling tasks such as data loads, performing complex optimizations, detecting design errors, estimating run-time costs and keeping the documentation always updated.
Most automation tools utilize metadata and built-in patterns to automatically build and maintain parts of the data warehouse solution or data platform solution. This helps standardize the parts of the solution where you do not want much flexibility such as data ingestion, building history-tables, optimizing order of task execution etc. The degree of automation varies from tool to tool and there is always a balance, as mentioned above, between automation/standardization and flexibility/creativity.
An important feature of automation tools is that the metadata is typically used to auto-generate technical documentation. This ensures the documentation is always up to date and reflects the actual solution implementation. The documentation no longer reflects the intended design, but the actual design, so you can even debug the solution by reading the documentation.
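The idea of generating documentation from the same metadata that drives the solution can be sketched in a few lines; the structure and names below are hypothetical:

```python
# Hypothetical sketch: lineage documentation generated from the metadata
# that drives the solution, so it always matches what actually runs.
def lineage_doc(flows):
    """flows: list of (source, transformation, target) tuples."""
    lines = ["Data lineage (auto-generated)"]
    for source, transform, target in flows:
        lines.append(f"- {source} --[{transform}]--> {target}")
    return "\n".join(lines)

doc = lineage_doc([
    ("crm.customers", "deduplicate", "stg.customers"),
    ("stg.customers", "scd2 merge", "dw.dim_customer"),
])
print(doc)
```

Regenerating this document on every deployment is what keeps it in sync with the implementation, rather than with someone's memory of the intended design.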
Don't take our word for it. Hear it from the people that matter: Customer Success Stories
Maturity Levels of Data Warehouse Automation
Data warehouse development has evolved over the years from ad hoc data delivery to using automation software. Today, most organizations aim for some form of automation to increase productivity and solution robustness.
From a decade of helping organizations using data actively and strategically, we have identified four typical maturity levels of data warehouse automation.
Maturity levels explained:
- Ad Hoc Data Delivery: Pure manual approach with data in silos rather than in a centralized data storage. Typical challenges include lack of data availability, low data quality, difficulties with combining data across different areas of the organization and overall, very limited performance. This is a starting level where organizations may recognize the strategic value of their data but lack the methods of unleashing the potential business value the data can have. Organizations in this position may be running large risks in terms of lack of compliance and potential disruption.
- Structured Manual Process: Still a manual approach without any automation, but bridging the data silos with a central data warehouse or data storage. Challenges include limited scaling of performance and productivity, a lack of standardized structures, a solution that is hard to maintain, and effectively unnecessary dependencies on key personnel. Organizations appreciating the benefits of automation have typically experienced the challenges of this level.
- Automation Framework: Developers accelerate their productivity by using an in-house developed automation framework. It can take the form of code templates, code generation scripts, best practice development methodologies and testing frameworks. One challenge is limited maintainability, as the solution and the automation framework become too tightly coupled; a change in either one requires the other to be re-tested, significantly decreasing agility over time. Also, many frameworks do not have dedicated developers who actively maintain them, so they can become outdated and eventually unsupported. Another challenge is increased dependency on key personnel, as the framework is custom-built for each organization. In addition, developing and customizing a framework takes development time away from solving requirements, significantly increasing time to market. Lastly, the overall solution complexity grows significantly without adding much business value, introducing unnecessary technical debt.
- Automation Software: As with an automation framework, developers accelerate their productivity, but the automation is delivered as software, not as a custom framework. Having the automation as software means there are dedicated developers who fix bugs, frequent releases of fixes and new features, a community where you get answers, certified developers who can implement it, other customers using, testing and improving the same software, documentation and training, coverage of new storage and compute technologies shortly after they arrive, and a support channel and a roadmap ensuring the solution is never outdated.
However, if you are pioneering new data platforms or technologies, or your business requirements are unique, you may find yourself without automation software available. In those cases, building an automation framework may be the only viable approach to improve productivity. As automation software eventually catches up and becomes available, you may want to migrate at a later point in time to get the added robustness and agility, as well as reduced maintenance costs.
Technology examples for each level:
- Ad Hoc Data Delivery: Power BI or Excel directly on a data source. Works at small scale and for building use cases.
- Structured Manual Process: Typical SSIS or Azure Data Factory with packages or pipelines designed for each business case.
- Automation Framework: Used in larger or more mature data environments, typically with high expectations on productivity but without automation software available. A typical example is creating an automation framework using configuration tables and variables/parameters in the jobs for Azure Data Factory, to avoid having to manually create all the data pipelines. Developers may also combine this with their own toolkit, approach or standards for improving productivity.
- Automation Software: Investing in a data warehouse automation tool to both increase productivity and stability, as well as simplify the data warehouse implementation process. Xpert BI from BI Builders is an example of automation software for Azure that lets developers use automation for Azure Data Factory without having to develop their own automation framework.
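The configuration-table approach behind a typical in-house framework can be sketched in a few lines of Python; the config fields and the pipeline name below are hypothetical, not Azure Data Factory's actual API:

```python
# Hypothetical sketch of the configuration-table approach: one generic,
# parameterized pipeline is invoked once per row in a config table,
# instead of hand-building a separate pipeline per source table.
CONFIG = [
    {"source_table": "customers", "target": "stg_customers", "incremental": True},
    {"source_table": "orders",    "target": "stg_orders",    "incremental": False},
]

def build_pipeline_runs(config):
    runs = []
    for row in config:
        runs.append({
            "pipeline": "generic_copy",  # one reusable, parameterized pipeline
            "parameters": {
                "source": row["source_table"],
                "sink": row["target"],
                "mode": "incremental" if row["incremental"] else "full",
            },
        })
    return runs
```

Adding a new source table then becomes a config-table insert rather than new pipeline development, which is the productivity gain such a framework aims for.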
What is the cost of a data warehouse?
It is a common misconception that a data warehouse is an expensive investment only suitable for larger organizations. With new technologies, methodologies and toolkits, the entry cost can be much lower than before. New payment structures based on service, usage and consumption let smaller organizations benefit from a small initial data warehouse investment.
When talking about the cost of a data warehouse, which can be anything from a $15K project to a $15M investment, it is normally compared to the benefits and the alternative cost. Not having insights into how your business is performing, based on qualified, timely, ready-to-use data, can lead to bad decisions or missed opportunities. It could be failing to detect a ‘Kodak’ moment before it happens. On the other hand, the potential benefits of discovering a new opportunity before the competition, or exploring the entire value chain of the business, can be substantial.
A data warehouse is typically designed to serve multiple business areas bridging different data silos. It means an investment in a data warehouse is often a strategic decision unlocking benefits across the organization. Setting the right ambition level is still a key measure to keep the initial investments down.
In our experience there are two main cost drivers with the following Pareto-split over time:
- People, as in developers and advisors (80%)
- Software, as in tools and infrastructure (20%)
The main interest for an organization is therefore to keep the costs on the people side low, or to get as much business benefit as possible delivered by the in-house people and external consultants.
What is the cost of data warehouse automation?
Data warehouse automation software typically has a low entry point and grows with more complex solutions, more users or other metrics.
The cost of data warehouse automation should be compared to:
- What is the business value of increasing the return on other investments, e.g. in Azure Data Factory, Azure Data Lake Storage and Azure Synapse Analytics?
- What is the long-term alternative cost of building the data warehouse manually or developing an automation framework in-house?
- Where do you want your data warehouse developers to spend their time, on repetitive tasks or highly challenging and value-adding tasks?
- Do you have the experience to build your own automation framework? Who will bug fix, maintain, support and develop this further over time?
- What is the level of increased business agility and productivity with the alternatives, and what is it worth to your business?
It is important for decision makers to fully understand the long-term impacts of the investment decision. Those impacts may include topics such as business agility, productivity, TCO of the whole data warehouse, quality, stability and scalability.
In our experience, once an organization sees the increased productivity and cost-efficiency of data warehouse automation over time, its value to the business keeps growing.
We encourage investigating what automation can do for your solutions, and Gartner Group has some detailed articles we can recommend.
Read more: Automating data warehouse development
The Data Warehouse and Data Catalogs
As data usage and data volumes grow, there is an increasing need for a catalog service of the various data ‘estates’ (as Microsoft has named them) within the organization. Data is stored in different locations and technologies, and in different shapes and forms. Data has classifications, metadata, quality and model designs for different purposes, and ultimately exists in a shape that meets every organizational need. This is also one of the factors that makes a catalog complicated: simplifying the technical and complex into something understandable and usable. And usable for whom? An IT professional has considerably different needs than a non-technical business user. While IT needs a catalog of data contents, volumes, usage and locations, a business user does not care where or how the data is stored, but rather what data, or even which reports and dashboards, is available to consume.
It is only a natural consequence that this need results in a growing number of different data catalogs. For larger organizations, it is very likely that there is a requirement for more than one catalog to meet different user requirements, be they technical or end-user.
The Data Warehouse and AI & ML
It is a common misconception that Artificial Intelligence and Machine Learning (AI & ML) environments only consume unknown data and must require a pure data lake approach.
For data exploration scenarios, yes: typically unknown data with unknown quality and high variety. But when operationalizing machine learning algorithms, qualified and standardized data is a prerequisite. At Google, the ratio of Data Engineers to Data Scientists is 10:1. This indicates that most of the work is done on data preparation, such as making the data structured and qualified, before the advanced ML/AI algorithms can be applied.
The data warehouse is a reliable source for structured, qualified data. Often the data is integrated with master data and put into a usable and defined context. AI/ML projects often combine these structured data with data directly from line-of-business applications or other data sources.
In our experience, the data warehouse is increasingly becoming a supporting data source for AI/ML environments. With the introduction of services such as Azure Synapse Analytics, which combine a data warehouse with a data lake and an over-arching virtualization, it also gets easier to implement. This is in line with Gartner’s definition of the Logical Data Warehouse (LDW), where Gartner argues the value of the combined architecture.
We see an increasing market demand for both agility and robustness in modern data warehouse solutions, two qualities that traditionally did not co-exist. With data warehouse automation, robustness is achieved through standardization, metadata-based architectures and data flows, while agility is provided through increased productivity and an increased ability to make changes on the fly.
Why DataOps and how can automation help you?
DataOps is an approach inspired by DevOps, Agile Development and Lean Manufacturing. It looks at the whole data lifecycle to maximize the value of data to an organization by aligning people around a shared focus on increasing agility, reducing data delivery cycle times, and minimizing errors to deliver confidence to the business user with each data delivery. It is about reacting faster to changes while minimizing risks and safeguarding the integrity of the organization’s data estate.
DataOps is based on:
- DevOps for accelerating build lifecycle using automation
- Agile Development for adding the people aspects to DevOps in the data value chain
- Lean Manufacturing with statistical process control to monitor the health of the data pipeline
According to the 2021 Gartner paper “Assessing the Capabilities of Data Warehouse Automation (DWA)”, data warehouse automation can simplify the implementation of DataOps in the following three areas:
- Accelerating the build lifecycle for data assets, e.g. with code reuse, ready design patterns, enforced standardization and active use of data source metadata
- Adding the people aspects with alignment and shorter delivery increments, e.g. by supporting data stewards with lineage, data catalogs, tagging of data and documentation
- Monitoring the health of the data pipeline, e.g. automatic error detection and cost estimation at design time, and built-in test automation at runtime
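The pipeline-monitoring idea in the list above can be illustrated with a minimal, hypothetical check that runs after each load, failing fast instead of letting bad data reach users; the function and field names are illustrative:

```python
# Hypothetical sketch of runtime monitoring: simple automated checks run
# after each data load, so errors surface in the pipeline rather than
# in a business user's report.
def check_load(rows, expected_min_rows, required_fields):
    errors = []
    if len(rows) < expected_min_rows:
        errors.append(f"row count {len(rows)} below minimum {expected_min_rows}")
    for i, row in enumerate(rows):
        for field in required_fields:
            if row.get(field) is None:
                errors.append(f"row {i}: missing {field}")
    return errors
```

Automation software typically generates and schedules such checks from metadata; the point here is only that each data delivery carries its own pass/fail signal.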
With today’s many digitalization initiatives, cloud migrations, new business use cases and growing focus on analytics, AI and ML, there is a rapidly growing complexity and increased need for governed and quality assured data delivered quickly. Using data warehouse automation you get built-in support for implementing DataOps and a fast-track to efficient and robust delivery of trusted data.
Read more in this blog post.