Like music orchestration for an ensemble of instruments, data pipeline orchestration is all about integration and synchronization, a task that grows harder as more applications become API-centric and are assembled (and re-assembled). As you orchestrate data moving around, you need to keep track of what is happening (or what has happened) so you can demonstrate to a regulator, or anyone else who needs to know, that the data is correct and has not been tampered with. Synchronization then becomes a matter of coordinating workflows across different processes, companies and networks, and following a series of data value chains, with overarching visibility and management to establish data provenance as part of your orchestration.
Data from multiple sources and the various associated data services/applications all need to come together in real time. As a result of orchestration, data policies and service levels can be better defined through automated workflows, provisioning and change management. Critical data management processes can also be automated, including data creation, cleansing, enrichment and propagation across systems. However, data management is becoming more complex, with more and more metadata being produced to describe data assets, and a greater number of applications accessing that data. Data provenance uses metadata to describe where data originated and to create a history of the data from its starting point through all of its derivative works. This is much like keeping track of all of the changes and iterations created when writing a symphony, so nothing gets lost in the process.
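As a minimal sketch of the idea, the snippet below models a provenance record and walks an asset's history back to its original sources. The field names (`asset_id`, `source_ids`, `activity`, `agent`) are illustrative, loosely inspired by the W3C PROV vocabulary, not a prescribed schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List

@dataclass
class ProvenanceRecord:
    """One step in a data asset's history: where it came from and what changed."""
    asset_id: str          # identifier of the derived data asset
    source_ids: List[str]  # upstream assets this one was derived from
    activity: str          # e.g. "cleanse", "enrich", "propagate"
    agent: str             # service or user that performed the activity
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def lineage(records: List[ProvenanceRecord], asset_id: str) -> List[ProvenanceRecord]:
    """Walk backwards from an asset through its derivations to its sources."""
    chain, frontier = [], {asset_id}
    for rec in reversed(records):  # records are appended in creation order
        if rec.asset_id in frontier:
            chain.append(rec)
            frontier |= set(rec.source_ids)
    return chain
```

Given records for, say, `raw.orders` → `clean.orders` → `mart.sales`, calling `lineage(history, "mart.sales")` returns the chain from the data mart back to the raw feed, which is exactly the "history from its starting point" described above.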
The provenance of data products generated by complex transformations, such as data orchestration workflows, can be extremely valuable to digital businesses. From it, one can determine the quality of the data based on its source, provide attribution of data sources, and track back sources of errors and iterations. Data provenance is also essential to organizations that need to drill down to the source of data in a data warehouse, track the creation of intellectual property and provide an audit trail for regulatory purposes.
Best practices for data orchestration/provenance
Ultimately, digital businesses need data orchestration/provenance to facilitate and track data flows and data consumption from disparate sources across a distributed data fabric. In addition, they require strong data governance, diverse data blending and the real-time delivery of analytics at data exchange points.
And as more applications become API-centric and are assembled (and re-assembled), business context and output value take the form of more data and more metadata (small and large). Coordinating and following what is essentially a series of data value chains requires an overarching view that does not live in a document, but can be automatically monitored, checkpointed and even dynamically updated. Other best practices for data orchestration/provenance include:
- Deploying controls that balance protection with the performance requirements for accelerating data access and transfer.
- Maintaining metadata so that changes are tracked and data assets remain transparent and understandable.
- Understanding who is generating and using data, as the volume and overall cost of data continue to grow rapidly. Data use also needs to be correlated to business processes and value/benefits, as well as feed into company risk profiles.
- Solving the operational considerations of data orchestration/provenance before the proliferation of analytics, the demand for more data and the ease with which data-driven functions can be cloned overtake you.
The challenges behind implementing data orchestration/provenance
The rate at which digital businesses create data and metadata has reached new heights, making it difficult for organizations to follow these best practices. Companies first need to overcome many business and technology challenges, such as:
- Data flows and their associated metadata are rarely documented or maintained.
- The people who know how the choreography of an application flow works may change roles or leave, and that knowledge goes with them.
- Data movement, transfers, feeds, Extract, Transform and Load (ETL) jobs, versioning, etc., all affect the health and performance of business operations, but are not as visible or operationalized as they need to be. As a result, consistency in data protection, security and quality, to name a few, may be lost as information traverses the corporate environment.
- The introduction of an ecosystem of partners and fast-paced changes to stand up new business models can quickly increase risk.
- Multi-party data exchanges will need to go beyond a single view of provenance and will likely incorporate blockchain distributed ledgers to coordinate meta information and immutable tracking.
How to facilitate the deployment of data orchestration/provenance
By publishing APIs that process information about data activities, and having application and service APIs call (or bundle) them, you can automate metadata management; in effect, you embed the maintenance of provenance into the API calls themselves. Auto-updating the data/metadata at the moment the application has the information also gives greater visibility into data variations as the data moves through the workflow pipeline. Then, by leveraging real-time event processing and monitoring throughout the platform, including pre-processed service views that summarize data activity in their respective domains of control (i.e., boundary, inspection, policy management, key management, data services, data integration and API management), you can be proactive, rather than reactive, in your data management.
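One way to embed provenance maintenance into the API calls is a wrapper that records a metadata event on every data-handling call. The sketch below is a hypothetical illustration: `PROVENANCE_LOG` stands in for a real provenance service to which a deployment would POST events:

```python
import functools
from datetime import datetime, timezone

# Stand-in for a provenance service; a real deployment would POST events to its API.
PROVENANCE_LOG = []

def with_provenance(activity):
    """Wrap a data-handling function so every call also records a provenance event,
    keeping the metadata current without any manual bookkeeping."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(asset_id, *args, **kwargs):
            result = fn(asset_id, *args, **kwargs)
            PROVENANCE_LOG.append({
                "asset_id": asset_id,
                "activity": activity,
                "function": fn.__name__,
                "at": datetime.now(timezone.utc).isoformat(),
            })
            return result
        return wrapper
    return decorator

@with_provenance("cleanse")
def cleanse(asset_id, rows):
    # Drop rows with missing values; the provenance event is written automatically.
    return [r for r in rows if all(v is not None for v in r.values())]
```

Because the event is emitted inside the call itself, the metadata is updated at the same moment the application has the information, which is the visibility property described above.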
Security, data and application design pattern relationships can also be auto-determined and, when combined with the metadata patterns of data activity, surfaced as views within a dashboard or by observing data interactions. While it is encouraged that all data interactions use a data integration service (to establish a data pattern) with extremely low latency, that level of introspection may be overkill for the task at hand; in those cases, APIs that update metadata should suffice. This mega-patterning then lets you deliver an automated, self-updating view of all data movement inside the data environment, as well as across clouds and business ecosystems (see diagram below).
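As an illustration of the pre-processed service views mentioned above, the sketch below rolls raw data-activity events up into per-domain summaries that a dashboard could render. The event fields and domain names are assumptions for the example, not a fixed schema:

```python
from collections import defaultdict

def summarize_by_domain(events):
    """Aggregate raw data-activity events into one summary view per domain
    of control (boundary, inspection, data services, etc.)."""
    views = defaultdict(lambda: {"count": 0, "assets": set()})
    for e in events:
        view = views[e["domain"]]
        view["count"] += 1           # total activity seen in this domain
        view["assets"].add(e["asset"])  # distinct data assets touched
    return dict(views)
```

Feeding the event stream through a summarizer like this, rather than inspecting every interaction, is the lighter-weight alternative the paragraph above suggests when full data-integration introspection would be overkill.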
Data Pipeline Orchestration/Provenance at the Digital Edge
The design pattern for data pipeline orchestration/provenance prescribes leveraging an Interconnection Oriented Architecture™ (IOA®) strategy, a framework for directly and securely interconnecting people, locations, clouds and data at the digital edge. Today, a company’s digital edge is where most of its business is happening and where most data is being created and exchanged. Digital edge nodes (interconnection hubs) provide the key components by which data is being managed and exchanged between business processes and applications. By taking the following steps, you can successfully leverage an IOA framework and digital edge nodes to implement real-time data orchestration/provenance to prove and verify where and how a result was produced for regulatory or other company compliance reasons:
- Apply data orchestration/provenance services to add scheduling and coordination of the data services listed below.
- Leverage a globally distributed data repository and local private storage (snapshots and access changes) for large data sets.
- Take advantage of a data integration service for data services and data translation as needed.
- Establish a provenance function to keep track of and publish APIs for automatic updates of metadata management, and to query the service.
- Leverage complex event processing to learn data relationships and trails, as well as construct views on data activities.
- Update policy enforcement to flag anomalies of rogue data access and movement.
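The last step, flagging rogue data access and movement, can be sketched as a simple policy check over access events. The (agent, asset) allow-list and event shape below are hypothetical stand-ins for a real policy-enforcement service:

```python
# Hypothetical policy: which agents may touch which data assets.
ALLOWED = {
    ("etl-svc", "orders"),
    ("report-svc", "orders"),
    ("etl-svc", "customers"),
}

def flag_anomalies(events):
    """Return access events that fall outside the (agent, asset) policy,
    so rogue data access and movement can be flagged for review."""
    return [e for e in events if (e["agent"], e["asset"]) not in ALLOWED]
```

In practice the allow-list would be derived from the learned data relationships and trails in the previous step, so that the policy keeps pace with legitimate changes in data flow.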
By implementing this data pipeline orchestration/provenance design pattern, you will realize the following benefits:
- With an integrated family of optimized functions, the arduous challenges of master data management, risk and data security become easier to meet, without compromising availability, accessibility or performance.
- These data services can be integrated with cloud services to seamlessly satisfy operational requirements, update dashboards and provide greater business insights.
- Should a data breach occur, or more likely a colossal mistake, automatic checks and balances apply protection, with data recovery as a fallback.
- Data expiration and a hardware security module (HSM), though not covered in this design pattern, are important to data orchestration/provenance. Data that should be deleted, based on policy, or migrated to long-term archival can also be managed via the integration of these data services with data orchestration/provenance services.
The most important asset in a digital business is its data. Keeping an accurate record of your data from its starting point through all of its changes is vital to keeping your digital business within company and regulatory auditing guidelines, as well as maintaining the integrity and value of that data.
In the next blog article, we’ll begin the series on the IOA Application Blueprint and the associated application design patterns for the digital edge. Through this series, you’ll learn how to localize application services and leverage APIs to create a multicloud, multi-party business application integration point for greater performance and user quality of service.
In the meantime, visit the IOA Knowledge Base for vendor-neutral blueprints that take you step-by-step through the right patterns for your architecture, or download the IOA Playbook. If you’re ready to begin architecting for the digital edge now, contact an Equinix Global Solutions Architect.
Other blog articles about securing data at the edge that you may be interested in include: