What is Modern Data Architecture
It has been said that all businesses today are IT businesses. They just happen to make their money selling different things. While that statement may not be universally true, software and computer technology have become a central part of most companies’ day-to-day operations. Of all the components of their software ecosystems, the most vital is their data. Whether it is a list of clients, a product catalog, or a production schedule, each company’s data is what makes it unique. It is the record of its history and the key to its future.
Simply put, data architecture is about organizing data. In general, a company’s data architecture describes its approach to data in three ways: how it is stored, how it is processed, and how it is used. The architecture influences data storage by describing how data is organized so that it is accurate, can be accessed efficiently, and scales with the business. Data processing involves getting the data from its sources to the data storage system and retrieving it when needed. A company’s data architecture governs the processing of data to ensure that the data is secure, that it is easily consumed by the systems that need it, and that processing is flexible enough to handle the many data sources that might feed into the system. Finally, the data architecture defines how the data will be used. It describes the interfaces that systems can use to access data and encourages the organization to view data as a shared asset.
Every company has a data architecture. Some have very well-defined systems and processes for managing their data’s storage, processing, and usage, while others have no formal system at all. In this post, we’ll talk about nine characteristics that form the basis of a modern data architecture. Together, they provide a solid foundation that will support a company’s data needs and help ensure that it can grow to meet its customers’ demands efficiently and effectively.
Eliminate Data Copies and Movement
One of the most critical and challenging tasks that a company’s data architecture needs to address is the tendency to duplicate data. This duplication naturally evolves in systems without proper management and oversight. When a new system or application is added to the business’s software ecosystem, it often requires basic information already present in the existing data infrastructure, such as user information or the product catalog. Software development teams tend to want to model and store this common information in a manner optimized for their specific needs. Eventually, this tendency can lead to problems with data consistency and degraded performance across the enterprise.
When multiple copies of the same data are held within the enterprise, other processes and procedures are required to synchronize that data and keep it consistent. These processes are often challenging to maintain as the software ecosystem continues to evolve and new uses for the shared data are required. Despite the team’s best efforts, the data will eventually become unsynchronized, and costly errors become more likely. In addition to the complexity of managing multiple copies of common data, the required synchronization tasks often lead to degraded performance. Two frequent sources of this degradation are the overhead of the actual synchronization processes themselves and the need to verify that the data being used is, in fact, up to date and correct.
A modern data architecture defines locations for common enterprise data to be stored. These common databases serve as the single source of truth for this information, thus eliminating the inefficiency that data duplication incurs. This does, however, place a burden on the centralized systems since they must serve the entire organization’s software suite. To accomplish this, a flexible and scalable storage solution must be selected.
Scale to Meet Storage Needs
An organization’s data is subject to a wide range of performance demands. Some data, such as past sales records, is rarely accessed, while information such as customers’ shopping carts might be accessed constantly. A modern data architecture provides criteria and guidelines to help quantify the access requirements for all of the enterprise’s data. It also defines minimum performance requirements that the data system must meet, and it makes provision for handling increased demand for data over time. This increase might result from long-term growth or from some transient event, such as a holiday. A modern data architecture provides the ability to increase bandwidth and storage capacity to service these higher demands and allows the data storage system to scale back when the high-demand period ends, which helps to align the organization’s operating costs with its revenue.
Establish a Common Vocabulary
Naming things is one of the hardest things to get right. Without proper guidance, teams within the IT organization will tend to adopt similar, but not identical, terms for common data. Should the central data for a system user be called user, account, customer, or something else? All of these options might be valid, but having them all used within a company’s data systems leads to confusion and inefficiency. A modern data architecture establishes a common vocabulary for standard data items. This vocabulary should be stored in a living document that can grow and evolve as the organization’s needs change. Additionally, the data architecture should establish a pattern for how to name new things when needed. A few well-designed guidelines can go a long way in accelerating development efforts and making it easier for data to be shared within the company.
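As a rough illustration of what a shared vocabulary could look like in code, the sketch below maps the near-synonyms that teams tend to invent onto one approved term. The terms, the synonym lists, and the `canonical_name` helper are all hypothetical; a real vocabulary would live in the organization’s own living document.

```python
# Hypothetical shared vocabulary: each approved term lists the synonyms
# that different teams might otherwise use for the same concept.
CANONICAL_TERMS = {
    "customer": {"user", "account", "client"},
    "product": {"item", "sku"},
}


def canonical_name(term: str) -> str:
    """Return the approved term for a name, or raise if it is unknown."""
    normalized = term.strip().lower()
    for canonical, synonyms in CANONICAL_TERMS.items():
        if normalized == canonical or normalized in synonyms:
            return canonical
    # Unknown names are rejected so that new terms go through the
    # naming guidelines instead of appearing ad hoc.
    raise KeyError(f"'{term}' is not in the shared vocabulary")
```

A check like this could run in schema reviews or build pipelines, so that a new table named `account` is flagged and renamed to `customer` before it ships.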
Ensure Security and Access Controls
A company’s data is one of its most valuable assets. Unfortunately, that also means that it is a target for misuse and theft. A modern data architecture establishes guidelines that allow each data item to be classified according to how sensitive it is. It also defines the mechanisms and processes that are to be used to ensure that each data item is adequately secured. The most sensitive data will often be protected by a “defense in depth” approach that places the information behind multiple security layers, reducing the likelihood of a successful attack. Modern data storage systems allow high levels of security to be placed on the raw data itself, so that very granular security rules can be enforced. Additional layers of security can be applied in the data access system as well as at the application level. Defining data security controls throughout the organization’s data architecture makes it less likely that a vulnerability will expose data to misuse or corruption.
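The classification idea can be sketched as follows. The sensitivity levels, the data item names, and the `can_read` helper are illustrative assumptions rather than a standard scheme, and a check like this would be only one of the several layers described above.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    """Hypothetical sensitivity levels, ordered from least to most sensitive."""
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4


# Hypothetical classification of a few data items.
CLASSIFICATION = {
    "product_catalog": Sensitivity.PUBLIC,
    "production_schedule": Sensitivity.INTERNAL,
    "client_list": Sensitivity.CONFIDENTIAL,
    "payment_details": Sensitivity.RESTRICTED,
}


def can_read(clearance: Sensitivity, data_item: str) -> bool:
    """Allow access only when the clearance meets the item's level.

    Items that have not been classified are treated as RESTRICTED,
    so the default is to deny rather than to allow.
    """
    required = CLASSIFICATION.get(data_item, Sensitivity.RESTRICTED)
    return clearance >= required
```

Denying access to unclassified items by default is a deliberate choice in this sketch: it forces new data to be classified before anyone below the highest clearance can read it.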
Curate the Data
An organization’s data is useless if it is not organized, accurate, and reliable. The data architecture defines how the information that is ingested into the databases is prepared for consumption. Data curation often involves processing raw text to categorize it into known values. This is especially important when the raw data will be used to populate a dimension that will be used to constrain queries or commands. Another critical aspect of data curation is performing any data transformations that are necessary. Raw data is often gathered using different units (e.g., kilometers and miles). The data architecture establishes rules for how data should be stored and how to convert from the incoming unit system to the data system’s standard.
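Both curation steps can be sketched in a few lines. This example assumes that kilometers are the data system’s standard unit and that country is the dimension being populated; the alias table and function names are hypothetical.

```python
# Conversion factors into the assumed standard unit (kilometers).
KM_PER_UNIT = {
    "km": 1.0,
    "m": 0.001,
    "mi": 1.609344,  # one international mile in kilometers
}

# Hypothetical mapping from raw text to the known values of a dimension.
COUNTRY_ALIASES = {
    "usa": "US",
    "u.s.": "US",
    "united states": "US",
    "uk": "GB",
    "united kingdom": "GB",
}


def to_kilometers(value: float, unit: str) -> float:
    """Convert a distance to kilometers; raises KeyError for unknown units."""
    return value * KM_PER_UNIT[unit.strip().lower()]


def curate_country(raw: str) -> str:
    """Map free-form country text to a known code, or 'UNKNOWN'."""
    return COUNTRY_ALIASES.get(raw.strip().lower(), "UNKNOWN")
```

Unknown units raise an error while unknown countries fall back to `UNKNOWN`. That split is only one possible policy: a bad unit silently corrupts a measurement, whereas an unmapped category can usually be reviewed and remapped later.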
Keep Data Accessible
An organization’s data only adds value if it is available. One of the key principles of modern data architecture is the need to ensure that the business’s stakeholders have access to the data they need when they need it. A modern data architecture leverages technology such as cloud-based databases to ensure that information is highly available and hardened against downtime caused by network outages, maintenance tasks, etc. The concept of accessibility extends beyond this, however. The organization’s data architecture must also ensure that the data can be accessed in a reasonable amount of time. For geographically distributed organizations, this often requires regional clones of common data. Care must be taken, however, to balance the need to access the data quickly with the inefficiencies that can arise from creating and maintaining these clones.
View Data as a Shared Asset
One of the most common techniques for increasing the value of custom software is designing code to be reused as efficiently as possible. When properly written and documented, reusable code saves developers time by eliminating the need to solve the same problem repeatedly. Sharing data offers similar benefits to the organization. When data is viewed as an organization-wide asset instead of being siloed within different departments, the entire data system runs more efficiently and effectively. A successful data architecture encourages data sharing and establishes patterns and expectations for how data is to be stored and documented so that the entire enterprise can benefit from its existence. Efforts along this line also tend to improve the company’s overall data systems by fostering a common vocabulary and reducing data duplication.
Ingest Multiple Data Formats
Many companies have a broad pipeline of information that they need to use to populate their databases. The information may be in the form of structured CSV files or unstructured emails or even semi-structured inputs such as JSON documents. The data architecture provides mechanisms to ingest each type of incoming information. Additionally, the architecture must specify how this information is to be stored. Many database systems can natively store XML and JSON documents in a queryable fashion. Leveraging these capabilities can significantly reduce the complexity of data import operations by deferring the processing of these documents until they are consumed. However, such a decision can’t be made lightly since unprocessed documents are more difficult to verify from a data integrity standpoint.
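A minimal sketch of format-specific ingestion, using only Python’s standard library, might look like the following. The function names and the required fields (`id` and `name`) are assumptions made for illustration; they are not part of any particular product.

```python
import csv
import io
import json


def ingest_csv(text: str) -> list[dict]:
    """Parse structured CSV text into records keyed by the header row."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def ingest_json(text: str) -> list[dict]:
    """Parse a JSON document; a single object becomes a one-item list."""
    document = json.loads(text)
    return document if isinstance(document, list) else [document]


def validate(record: dict, required=("id", "name")) -> dict:
    """Reject records that lack a required field (hypothetical rule)."""
    missing = [f for f in required if record.get(f) in ("", None)]
    if missing:
        raise ValueError(f"record is missing fields: {missing}")
    return record
```

Running `validate` at import time is the eager option. Storing the raw document and validating it only when it is consumed is the deferred option described above, which simplifies the import but postpones the discovery of bad records.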
Provide the Right Interfaces for Consumption
Many development projects interact with databases using direct queries and commands. This leads to duplication of logic across the entire application suite. Modern data architecture defines interface services that provide standardized mechanisms to allow data to be easily consumed. Such interfaces might consist of a RESTful API to power web services and applications, SQL interfaces for data analysts, or OLAP interfaces for business intelligence. By providing an abstraction from the underlying data source, each specialist can focus on solving the business problem that they have been tasked with rather than wasting time trying to coerce the data into the form that they need.
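The abstraction described here can be sketched with a small interface that application code depends on in place of direct queries. `CustomerStore`, `InMemoryCustomerStore`, and `greeting` are hypothetical names; in practice the implementation behind the interface might call a RESTful API or run SQL.

```python
from typing import Optional, Protocol


class CustomerStore(Protocol):
    """Interface that consumers code against (hypothetical)."""

    def get_customer(self, customer_id: str) -> Optional[dict]:
        ...


class InMemoryCustomerStore:
    """Stand-in implementation; a real one might wrap a database or API."""

    def __init__(self, rows: dict):
        self._rows = dict(rows)

    def get_customer(self, customer_id: str) -> Optional[dict]:
        row = self._rows.get(customer_id)
        # Return a copy so callers cannot mutate the stored record.
        return dict(row) if row is not None else None


def greeting(store: CustomerStore, customer_id: str) -> str:
    """Application logic that knows nothing about the underlying source."""
    customer = store.get_customer(customer_id)
    if customer is None:
        return "Hello, guest"
    return f"Hello, {customer['name']}"
```

Because `greeting` depends only on the interface, the storage behind it can change without touching the application, and the lookup logic is written once instead of being repeated in every consumer.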
Every company has a data architecture, whether formal or informal. A modern data architecture considers the storage, processing, and usage of all enterprise data to provide a foundation for its entire software ecosystem. When the nine principles above are taken into account, the data system becomes a secure, efficient, and organized platform that sets the company up for success and prepares it for the future.