Today, organizations need a holistic view of an increasing variety of data to support their digitalization efforts. Common and agreed upon data structures and data formats are key ingredients to harvest the value of data through BI, analytics, AI and ML. However, managing data transformations can be inefficient and error-prone over time unless we enforce standardization and best practices.
In this article we will explain what data transformation means, outline some of the common motivations for data transformation and how to make the process robust and efficient using the concept of automation.
Data transformation at its core
Data transformation is the process of converting data from one format or structure into another format or structure. The intention is to create formats and structures that are easier to consume for different use cases and roles in the organization. For example, for reporting purposes your organization would need a unified view of vendors, customers, products, departments etc. In a data warehouse or data platform context we often refer to ETL (Extract Transform and Load) or ELT (Extract, Load and Transform) indicating that data transformation is a key part of enabling analytics.
Software applications such as ERP, CRM, HR or line-of-business applications are designed for specific business domains and predominantly data input. Examples are registering and managing work orders, budgets, time sheets and inventories. As such, they perform well for their intended purpose, but perform poorly when queried for large data volumes. In addition, they typically hold only a fraction of the complete information required by the organization. An example of a broader query could be to see accumulated sales per month per department combined with resource utilization and profit margins. The latter requires a holistic enterprise view of data and is crucial in a data-driven organization. This is where data transformation comes into play.
When transforming data you have the opportunity to create a much more flexible data structure. Data, then, can be stored in different layers – from raw data to modeled data with different data transformations depending on the needs. Typical layers may include landing zones, business rules, data marts, dimensional models and so on – with different users and use-cases accessing the data on the relevant layer.
Motivations for data transformation
There is a wide range of reasons for applying data transformations, the most common in an analytics perspective include:
- Avoiding negative impacts on daily operations: By disconnecting or decoupling data from its line-of-business applications, you can structure, model and analyze the data without slowing down or interrupting critical daily business operations. It is equally important for data warehousing and more computationally intensive machine learning and big data analytics.
- Create new, holistic view of information: When combining data across the complete ecosystem in which the organization operates, it is possible to see a consistent and holistic view of information. Data transformation is important to build an maintain data models on which the holistic view of information is based.
- Assessing and improving data quality: Through data transformation processes data quality issues can be detected and potentially dealt with automatically. For example addresses can automatically be standardized and any complete errors could be sent to a data steward for a manual fix.
- Increasing the performance of analytics: Not only machine learning, but also dashboards and reports may require optimizations to get the performance users would accept. Data transformation plays a key part typically optimizing the data structures to the intended use.
- Applying rules on security and data access: Data transformation can be used to build data structures that simplify the enforcement of data access rules i.e. who would be able to see and access different information.
- Enabling compliance: A typical motivation would be to classify, tag and mask out GDPR-sensitive data. It could also enforce data governance rules and providing data for regulatory compliance needs such as pre-defined reports.
- Providing data history: As most line-of-business applications only keep the current view on data, data transformation can be used to keep track of historic data.
On a strategic level it is about taking active ownership of your data and using that ownership to achieve your goals. We call that using data a strategic asset.
Challenges with data transformation for analytics
When going from raw data to high value, holistic information there are typically several steps of data transformation. These steps are often collectively managed in a few centralized places, the most predominant being the data warehouse.
A challenge with a typical data warehouse is that it tends to grow overly complex and hard to maintain over time. This significantly degrades the business value and business agility. A key reason for unwanted complexity is the blank sheet approach to new data warehouse development enforced by most ETL and ELT tools. The lack of standardization and best practices enforced by development tools leads to several inefficiencies over time, such as:
- Reducing solution stability and degrading its performance as common challenges may be solved in different and inconsistent ways, introducing errors and inefficient code
- Increasing the dependency on key personnel since knowing the solution becomes much harder and the risk of breaking the solution is greater
- Using too much time and focus on managing technical debt and avoiding breaking the existing solution, as opposed to implementing new business requirements
As a result, an increasing amount of resources is needed for new development and maintenance. The organization's agility and overall ability to use data as a strategic asset tend to suffers over time.
Robustness and efficiency by enforcing standardization and best practices
A trend in modern data transformation for analytics is to enforce standardization and best practices for all developers through the development tool, typically referred to as data warehouse automation tool.
It takes one step further than data integration tools focusing on ETL and ELT. It assists the developers with methodology-supported code generation according to best practices where both solution design and coding approach to common tasks are standardized. Enforcements of standardization and best practices can include:
- Minimizing technical debt and optimizing efficiency by having one single approach to common challenges such as data formats, data structures, solution architecture, data modelling paradigms, data definitions, source definitions, data loading techniques, change control, documentation and data lineage
- Minimizing dependencies on key personnel by providing standardization and best practices as software, with releases, documentation and trainings, and not developed as a by-product in a project
- Increasing stability and robustness by adding proactiveness, process support and optimizations such as fault detection, optimization of data load processes, migration support and test automation
By reducing the degree of freedom on data transformations that effectively can be standardized, the cost of maintenance can grow linearly, not exponentially. It lets developers use their energy and creativity on data transformations that cannot be standardized, namely implementing business requirements. To the organization, it can mean significantly increased business agility.
Summary
Data transformation is a prerequisite for modern organizations to get a holistic view of an increasingly complex ecosystem of data. It enables high-quality data in the right format, in the right time to the right user or purpose. However, data transformations are complicated processes and related tools and technologies, such as those enabling analytics, should be evaluated thoroughly to enable efficiency and agility over time.