Data virtualization is any approach to data management that allows an application to retrieve and manipulate data without requiring technical details about the data, such as how it is formatted at source or where it is physically located. It can provide a single customer view (or a single view of any other entity) of the overall data.
Unlike the traditional extract, transform, load ("ETL") process, the data remains in place, and real-time access is given to the source system for the data. This reduces the risk of data errors and avoids the workload of moving data around that may never be used. Data virtualization also does not attempt to impose a single data model on the data; a federated database system is an example of such heterogeneous data. The technology also supports writing transaction data updates back to the source systems. To resolve differences in source and consumer formats and semantics, various abstraction and transformation techniques are used. The concept and its software are a subset of data integration and are commonly used within business intelligence, service-oriented architecture data services, cloud computing, enterprise search, and master data management.
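The contrast with an ETL copy can be illustrated with a minimal Python sketch. The `source_system` dictionary, the `VirtualView` class, and its `get` and `update` methods are hypothetical names invented for this illustration; they do not correspond to any particular product's API.

```python
import copy

# Hypothetical source system: an operational store keyed by customer id.
source_system = {
    101: {"name": "Ada", "balance": 250},
    102: {"name": "Grace", "balance": 90},
}


class VirtualView:
    """Reads and writes go straight to the source; nothing is copied."""

    def __init__(self, source):
        self._source = source  # a reference to the source, not a copy

    def get(self, key):
        # The record is fetched from the source at query time.
        return dict(self._source[key])

    def update(self, key, **changes):
        # Write-back: the change is applied to the source system itself.
        self._source[key].update(changes)


# ETL-style baseline: a one-off extract, only as fresh as its last load.
etl_copy = copy.deepcopy(source_system)
view = VirtualView(source_system)

# The operational system changes after the extract ran.
source_system[101]["balance"] = 300

etl_balance = etl_copy[101]["balance"]      # stale value from the extract
virtual_balance = view.get(101)["balance"]  # current value from the source

# A transactional update through the view lands in the source system.
view.update(102, balance=120)
```

In this sketch the extracted copy still reports the old balance, while the view reports the current one because it holds no data of its own.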
Data virtualization and data warehousing
Some enterprise landscapes are filled with disparate data sources, including multiple data warehouses, data marts, and/or data lakes, even though a data warehouse, if implemented correctly, should be unique and a single source of truth. Data virtualization can efficiently bridge data across data warehouses, data marts, and data lakes without having to create a whole new integrated physical data platform. Existing data infrastructure can continue performing its core functions while the data virtualization layer draws on the data from those sources. This aspect of data virtualization makes it complementary to all existing data sources and increases the availability and usage of enterprise data.
Data virtualization may also be considered as an alternative to ETL and data warehousing. It is inherently aimed at producing quick and timely insights from multiple sources without having to embark on a major data project with extensive ETL and data storage. However, data virtualization may also be extended and adapted to serve data warehousing requirements. This requires an understanding of the data storage and history requirements, along with planning and design to incorporate the right type of data virtualization, integration, and storage strategies, and infrastructure/performance optimizations (e.g., streaming, in-memory, hybrid storage).
Examples
- The Phone House, the trading name for the European operations of UK-based mobile phone retail chain Carphone Warehouse, implemented Denodo's data virtualization technology between its Spanish subsidiary's transactional systems and the Web-based systems of mobile operators.
- Novartis implemented a data virtualization tool from Composite Software to enable its researchers to quickly combine data from both internal and external sources into a searchable virtual data store.
- The storage-agnostic Primary Data data virtualization platform enables applications, servers, and clients to access data transparently while it is migrated between direct-attached, network-attached, private cloud, and public cloud storage. David Flynn, co-founder of server flash memory company Fusion-io and now Primary Data CTO, saw the need to move data across storage types to maximize efficiency with data virtualization.
- Linked Data can use a single hyperlink-based Data Source Name (DSN) to provide a connection to a virtual database layer that is internally connected to a variety of back-end data sources using ODBC, JDBC, OLE DB, ADO.NET, SOA-style services, and/or REST patterns.
- Database virtualization may use a single ODBC-based DSN to provide a connection to a similar virtual database layer.
Functionality
Data virtualization software provides some or all of the following capabilities:
- Abstraction: Abstract the technical aspects of stored data, such as location, storage structure, API, access language, and storage technology.
- Virtualized data access: Connect to different data sources and make them accessible from a common logical data access point.
- Transformation: Transform, reformat, aggregate, and improve the quality of source data for consumer use.
- Data federation: Combine result sets from across multiple source systems.
- Data delivery: Publish result sets as views and/or data services that are executed by client applications or users on request.
Data virtualization software may include functions for development, operation, and/or management.
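The capabilities listed above can be sketched end to end in a few lines of Python. This is an illustrative sketch only: the two inline sources (`CRM_CSV`, `BILLING_JSON`), their field names, and the functions `read_crm`, `read_billing`, and `customer_totals` are all assumptions made for the example, standing in for real source systems and a real virtualization layer.

```python
import csv
import io
import json

# Hypothetical sources in different formats (stand-ins for real systems).
CRM_CSV = "customer_id,full_name\n1,Ada Lovelace\n2,Grace Hopper\n"
BILLING_JSON = (
    '[{"cust": 1, "amount_cents": 1250},'
    ' {"cust": 2, "amount_cents": 400},'
    ' {"cust": 1, "amount_cents": 750}]'
)


def read_crm():
    """Abstraction and virtualized access: hide the CSV format."""
    for row in csv.DictReader(io.StringIO(CRM_CSV)):
        # Transformation: rename fields and cast types for the consumer.
        yield {"customer_id": int(row["customer_id"]), "name": row["full_name"]}


def read_billing():
    """Same logical record shape, different storage format (JSON)."""
    for rec in json.loads(BILLING_JSON):
        # Transformation: cents to currency units, 'cust' to 'customer_id'.
        yield {"customer_id": rec["cust"], "amount": rec["amount_cents"] / 100}


def customer_totals():
    """Federation and delivery: combine both result sets on request."""
    totals = {}
    for rec in read_billing():
        key = rec["customer_id"]
        totals[key] = totals.get(key, 0.0) + rec["amount"]
    return [
        {
            "customer_id": c["customer_id"],
            "name": c["name"],
            "total": totals.get(c["customer_id"], 0.0),
        }
        for c in read_crm()
    ]


result = customer_totals()
```

A consumer calling `customer_totals()` sees one combined record per customer and never needs to know that one source is CSV and the other JSON.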
Benefits include:
- Reduce risk of data errors
- Reduce systems workload through not moving data around
- Increase speed of access to data on a real-time basis
- Significantly reduce development and support time
- Increase governance and reduce risk through the use of policies
- Reduce data storage required
Drawbacks include:
- May impact the response time of operational systems, particularly if they are under-scaled to cope with unanticipated user queries or not tuned early on
- Does not impose a single data model on heterogeneous data, meaning the user has to interpret the data, unless combined with data federation and business understanding of the data
- Requires a defined governance approach to avoid budgeting issues with the shared services
- Not suitable for recording historic snapshots of data; a data warehouse is better for this
- Change management "is a huge overhead, as any changes need to be accepted by all applications and users sharing the same virtualization kit"
Technology
Some data virtualization technologies include:
- Actifio Copy Data Virtualization
- Capsenta's Ultrawrap Platform
- DataVirtuality
- Data Virtualization Platform
- Delphix Data Virtualization Platform
- Denodo Platform
- Gluent Data Platform
- HiperFabric Data Virtualization and Integration
- Querona
- Red Hat JBoss Enterprise Application Platform Data Virtualization
- Stone Bond Technologies Enterprise Enabler Data Virtualization Platform - http://www.stonebond.com
- TIBCO Data Virtualization, formerly Composite Software, previously owned by Cisco
- Veritas Provisioning File System / Data Virtualization Veritas Technologies
- XAware Data Services
History
Enterprise information integration (EII), a term first coined by Metamatrix (whose product is now known as Red Hat JBoss Data Virtualization), and federated database systems are terms used by some vendors to describe a core element of data virtualization: the capability to create relational JOINs in a federated VIEW.
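A relational JOIN in a federated VIEW can be approximated with Python's built-in `sqlite3` module, used here purely as a stand-in for a virtualization engine. The database aliases `crm` and `billing`, the table and column names, and the sample rows are assumptions made for this sketch; in practice the sources would be separate systems rather than in-memory databases.

```python
import sqlite3

# One connection acts as the single logical access point.
conn = sqlite3.connect(":memory:")

# Each ATTACH of ':memory:' creates a separate database; the two stand in
# for independent source systems (hypothetical aliases: crm, billing).
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS billing")

conn.execute("CREATE TABLE crm.customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE billing.invoices (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)
conn.executemany(
    "INSERT INTO billing.invoices VALUES (?, ?)",
    [(1, 12.5), (1, 7.5), (2, 4.0)],
)

# A TEMP view may reference tables in several attached databases.
# It stores no rows, only the query definition.
conn.execute("""
    CREATE TEMP VIEW customer_totals AS
    SELECT c.id, c.name, SUM(i.amount) AS total
    FROM crm.customers AS c
    JOIN billing.invoices AS i ON i.customer_id = c.id
    GROUP BY c.id, c.name
""")

rows = conn.execute(
    "SELECT id, name, total FROM customer_totals ORDER BY id"
).fetchall()
```

Consumers query `customer_totals` as if it were a single table, while the join across the two databases is evaluated each time the view is read.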
See also
- Data integration
- Enterprise information integration (EII)
- Master data management
- Database virtualization
- Data Federation
- Disparate system
Further reading
- Data Virtualization: Going Beyond Traditional Data Integration to Achieve Business Agility, Judith R. Davis and Robert Eve
- Data Virtualization for Business Intelligence Systems: Revolutionizing Data Integration for Data Warehouses, Rick van der Lans
- Data Integration Blueprint and Modelling: Techniques for a Scalable and Sustainable Architecture, Anthony Giordano