
Free up time for analytics, AI and ML

Analytics, AI and ML starts with data. Data automation not only makes trusted data available faster, but lets you build a robust, sustainable platform for all future analytics needs.

Why it matters

  • Making quality-assured data available and prepared for its intended purpose typically accounts for 80% of the work in analytics, AI and ML. It makes sense to do the data part as efficiently as possible. 
  • Google employs 10 data engineers per data scientist; companies with fewer resources can leverage data automation to gain a competitive edge 
  • Most organizations today have big ambitions within AI and ML, but struggle with data availability, data quality and data governance; these processes must be sped up
  • Analytics tools such as Power BI (*) and data science tools such as SAS, Spark, C3 etc. offer limited built-in data automation, and benefit from better data ingestion and data management

How it works

  • Today, data automation is not limited to data warehousing: data can also be delivered into data lakes, in-memory processing engines (e.g. Databricks and Spark), data lakehouses, analytics sandboxes and more
  • Data interfaces and APIs let data be shared between data warehouses, data lakes and data lakehouses
  • Data automation lets data engineers work at a higher level of abstraction, making them more productive when preparing structured or semi-structured data (**) for analytics, AI and ML
  • Automatic documentation and data lineage cover all SQL code used for data preparation; custom Python can be added manually to the data lineage dependency graph
  • APIs can be generated automatically on top of any database that would otherwise be too cumbersome to load data from manually
  • Data automation is most common for batch loading of data, down to microbatches every few minutes; true real-time is currently less common, mostly due to price/performance considerations (***)
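The metadata-driven approach described above can be sketched in a few lines. This is a simplified illustration with invented table names, not Xpert BI's actual mechanism: instead of hand-writing one load script per source table, a single generator reads a metadata catalog and emits the corresponding SQL.

```python
# Minimal sketch of metadata-driven data automation (hypothetical names).
# One generator plus a metadata catalog replaces one hand-written
# load script per table.

def generate_load_sql(source_table, target_table, columns):
    """Build a batch-load statement from metadata alone."""
    col_list = ", ".join(columns)
    return (
        f"INSERT INTO {target_table} ({col_list}) "
        f"SELECT {col_list} FROM {source_table};"
    )

# Metadata catalog: one entry per table instead of one script per table.
catalog = [
    {"source": "crm.customers", "target": "dw.dim_customer",
     "columns": ["customer_id", "name", "country"]},
    {"source": "erp.orders", "target": "dw.fact_order",
     "columns": ["order_id", "customer_id", "amount"]},
]

statements = [generate_load_sql(m["source"], m["target"], m["columns"])
              for m in catalog]
```

Adding a new source then means adding one catalog entry, which is the higher level of abstraction the bullet points refer to.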

Read more


(*) Xpert BI does not replace Power BI, but provides the data

Xpert BI is a data automation tool that takes the place of a data integration tool. Power BI is a visualization tool that is best used on top of ready, integrated and quality-assured data provided by Xpert BI. 

(**) Structured vs semi-structured vs unstructured data 

Data automation is metadata-driven and focuses on structured and semi-structured data. However, unstructured data can co-exist with, and be augmented by, structured and semi-structured data.
  • Structured data: Data that conforms to tables and columns, e.g. from databases, ERP, CRM, finance, HR, production, projects, sales, marketing etc.
  • Semi-structured data: Data from APIs, flat files, user logs, metadata from unstructured data
  • Unstructured data: Data without predefined structure, such as raw text, images, video and sound; AI is typically used to categorize or recognize patterns and structures that can be turned into structured or semi-structured data.
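The step from semi-structured to structured data can be made concrete. Below is a hedged illustration with an invented API payload: a nested JSON document (semi-structured) is flattened into rows with a fixed set of columns (structured), ready for loading into tables.

```python
# Illustrative only: turning a semi-structured API payload (JSON)
# into structured rows (tables and columns). The payload is invented.
import json

payload = json.loads("""
{"orders": [
  {"id": 1, "customer": {"name": "Acme", "country": "NO"}, "amount": 120.0},
  {"id": 2, "customer": {"name": "Globex", "country": "SE"}, "amount": 80.5}
]}
""")

# Flatten nested fields into a fixed set of columns.
rows = [
    {"order_id": o["id"],
     "customer_name": o["customer"]["name"],
     "customer_country": o["customer"]["country"],
     "amount": o["amount"]}
    for o in payload["orders"]
]
```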

(***) Batch vs near realtime vs true realtime

Data automation typically focuses on batch loading of data: every day, every couple of hours, down to near-realtime loads every couple of minutes. True realtime is rare, usually due to higher costs, specialized technologies and skillsets, and the frequent lack of a business case to defend the investment. However, specialized realtime solutions can co-exist with data automation for a data warehouse, data lake or data lakehouse as part of a holistic data platform. 

  • Batch: Hours, days, weeks, months (typical data integration tools) 
  • Near realtime: Seconds, minutes (data integration tools, sometimes with added change data capture (CDC))
  • Realtime: Milliseconds (specialized technologies such as Kafka)
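The near-realtime microbatch pattern can be sketched as a watermark loop. This is a minimal illustration with an in-memory stand-in for the source system: each cycle loads only rows changed since the last high-water mark, which is how integration tools approximate CDC without true streaming.

```python
# Sketch of a near-realtime microbatch loop (hypothetical source/sink).
from datetime import datetime

def fetch_changed_rows(source, since):
    """Stand-in for a query like: SELECT * FROM t WHERE modified_at > :since."""
    return [r for r in source if r["modified_at"] > since]

source = [
    {"id": 1, "modified_at": datetime(2024, 1, 1, 10, 0)},
    {"id": 2, "modified_at": datetime(2024, 1, 1, 10, 4)},
]

watermark = datetime(2024, 1, 1, 10, 2)   # time of the last successful load
batch = fetch_changed_rows(source, watermark)   # only row 2 has changed since
watermark = max(r["modified_at"] for r in batch)  # advance the high-water mark
```

Run on a schedule of a few minutes, this gives near realtime; millisecond latency would instead require a streaming technology such as Kafka.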


Want to know more?

Schedule a talk with our product specialists to discuss your challenges and how we can help.

Get in touch