Here’s food for thought: how can you verify that your data is being fully and efficiently utilized to enhance your business if no standard exists to ensure its basic accuracy, coverage, extensibility, and interpretability? Without a process that guarantees clean, high-quality data you can trust, the reports and insights behind your most important data-driven decisions have no backing.
Data modeling is a critical first step in defining the structure of available data, guarding against mistakes, and ultimately certifying actionable downstream results. Preparing a robust data model means knowing the process and its benefits, the various types of models, best practices, and the relevant software tools available.
Data modeling is the process of describing the structure, associations, and constraints relevant to available data, eventually encoding these rules into a reusable standard. From highlighting the most critical high-level concepts underlying data from a stakeholder’s standpoint, to precisely documenting the contents, data types, and rules governing individual columns in a database, it is an integral part of the planning stage for any analytics deployment or business intelligence project.
Data modeling notations and techniques are often graphical and intended for clarity and portability; the resulting specifications can effectively be communicated across the levels of an organization, from engineers to managers to analysts, as well as clients or other businesses. The process itself inherently encourages discussion, collaboration, and careful consideration of the transformation of requirements to implemented data reality.
The creation of, and adherence to, data modeling standards is recommended for a variety of applications — from integration and API development to the design of data repositories both on-premises and in the cloud (whether a single database, a data warehouse, a data mart, a data lake, etc.).
During this process, the emphasis is on the need for availability and organization of data, not how the data will be applied. A data model, therefore, is the formal documentation of conventions for the management of a data resource and essentially a guide to implementation, usage, or extension of such a resource.
An effective data modeling procedure leads to better allocation of human and computational resources, anticipates issues before they arise, bolsters cross-functional communication, and enforces compliance (regulatory and internal) — all while guaranteeing underlying data quality, security, and accessibility.
At the business level, a completed data model makes an organization more responsive to change, lowers risks, and increases effectiveness — ultimately reducing costs.
At the engineering level, redundancy is minimized and systems naturally become easier to integrate, easier to interface with, and more compatible with each other.
There are three overarching types of data models which should be generated while planning an analytics project. Together they range from the most abstract to the most concrete of specifications, and each involves contributions from a distinct subset of stakeholders.
First, the conceptual data model (sometimes called a conceptual schema) expresses initial requirements at a high level, and is agnostic of technologies or software. Here the general business scope (or domain) of the model is used as a launching point for discussion among process owners. Important classes in the data are selected and semantic relationships between them are described, all based on real-world instances which these classes represent.
Though these deliberations will typically involve managers and data architects, anyone should be able to understand the resulting model. The focus is on the business understanding of data — technical details are reserved for the following two models.
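At its simplest, a conceptual model is just named business entities plus the semantic relationships between them. The sketch below shows one minimal way to capture that in code; the entity and relationship names (`Customer`, `places`, `Order`) are hypothetical examples, not drawn from any particular project.

```python
# A conceptual model reduced to its essentials: business entities and the
# semantic relationships connecting them, with no technology choices yet.
# All names here are illustrative assumptions.

ENTITIES = {"Customer", "Order", "Product"}

# (subject, verb, object) triples express relationships in business terms,
# e.g. "a Customer places an Order".
RELATIONSHIPS = [
    ("Customer", "places", "Order"),
    ("Order", "contains", "Product"),
]

def is_consistent(entities, relationships):
    """Every relationship must connect two known entities."""
    return all(s in entities and o in entities for s, _, o in relationships)
```

Even at this abstract level, a consistency check like `is_consistent` catches a relationship that refers to an entity nobody has defined — exactly the kind of ambiguity the conceptual phase exists to surface.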
The conceptual schema can be translated into a logical data model, where more detailed specifications are first addressed. This type of model concerns the structure of the required data: typically, how it might be represented in an actual repository.
A single conceptual model can result in multiple logical models, depending on the complexity of the underlying data represented and on the sophistication of business requirements. Architects continue to drive the conversation as they begin to involve data engineers.
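A logical model adds attributes, abstract types, and key relationships to each entity while remaining agnostic of any particular DBMS. One way to sketch that structure, using hypothetical entities and attributes of our own invention:

```python
from dataclasses import dataclass, field

@dataclass
class Attribute:
    name: str
    type: str            # abstract type ("integer", "string"), not a vendor column type
    primary_key: bool = False

@dataclass
class Entity:
    name: str
    attributes: list
    # Foreign-key links as (local attribute, parent entity) pairs.
    references: list = field(default_factory=list)

# Illustrative logical model; entity and attribute names are assumptions.
customer = Entity("Customer", [
    Attribute("customer_id", "integer", primary_key=True),
    Attribute("name", "string"),
])
order = Entity("Order", [
    Attribute("order_id", "integer", primary_key=True),
    Attribute("customer_id", "integer"),
], references=[("customer_id", "Customer")])
```

Note what is still absent: storage engines, index strategies, and concrete column types. Those decisions are deliberately deferred to the physical model.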
Finally, specifications can be encoded in physical data models. These describe individual database tables, component columns and types, and details of keys and relationships — while accounting for performance, access, and resource allocation. The technologies involved have all been selected and accounted for (e.g. choices of local vs. cloud, types of DBMSs, and APIs).
At this point, data modeling has reached the highest level of granularity before actual implementation or deployment. The onus now lies with data engineers and database administrators.
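At the physical level, the same entities become concrete DDL against a chosen DBMS. The sketch below commits to SQLite purely for illustration (any choice of engine, column types, and constraints here is an assumption, not a prescription):

```python
import sqlite3

# A physical model names a concrete DBMS and spells out exact column
# types, keys, and constraints. SQLite and these table definitions are
# illustrative only.
DDL = """
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE "order" (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
tables = {row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")}
```

Notice how decisions absent from the logical model — `INTEGER` versus `TEXT`, `NOT NULL`, the foreign-key clause — now appear explicitly, ready for engineers to deploy.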
Note: the difference between logical and physical schemas is often blurred in practice. Some tools and engineering paradigms even combine these distinct models (and the conceptual model is sometimes omitted or left undistinguished from the logical one).
Although there are countless benefits to data modeling, there are also inherent dangers in creating and using bad models. To prevent unnecessary overhead and inefficiency, among a host of other problems, sticking to best practices is critical.
It’s important to start with the widest perspective, both temporally and cross-sectionally. Though the three types of models have a natural ordering and distinct purposes, the final specifications should be considered a living document.
The process might be required in full as a nascent business dives into analytics, or in part to extend an established data repository with new use cases. It may also be revisited for reengineering, integration of legacy systems, or migration to the cloud.
Additionally, many enterprises have data ecosystems composed of multiple components and silos, in addition to the exploding global networks of data externally available. It’s important to design flexibly and modularly, with a view to both the future and to this wider ecosystem.
Practically, this means you might start with a bubble chart of your business’s data silos as well as external or anticipated sources. This overview could reveal unconsidered entities to include in a conceptual model, or even highlight inefficiencies and lead to organization-wide improvements. You might consider this step zero of the process, but also one that should be revisited as a business or application evolves.
A beginning modeler might prefer to skip the conceptual phase as unnecessary, and a manager might imagine that architects will seamlessly translate a business idea into a logical schema without input or discussion. In reality, the conceptual model is the foundation for the entire process. This phase is independent of technological considerations precisely because the business should drive the adoption and utilization of technologies.
Conceptual modeling refines the original idea or purpose behind a project into its component business requirements and constraints. Failing to make these definitions at the beginning incurs great risk of needing to revisit assumptions down the line. Even worse, there is a danger of passing along ambiguous or incorrect interpretations of entities and relationships, translating into erroneous data.
As previously mentioned, a completed conceptual model is also a standalone communicable resource, potentially the only documentation which can be understood across an enterprise. Conceptualization should always be a well-defined step in any new data modeling process.
Again, logical modeling is the bridge between pure business requirements and technical constraints. It is the most important stage for validating compatibility between vision and technical possibility. Before work on a physical model can even be considered, entity relationship diagrams (ERDs) must be carefully assembled and reviewed, and entities themselves discussed and classified into logical buckets (uniquely identifiable entities with unique primary keys, child entities and their relational connections to parents, etc.).
Though this step is still removed from the pure technicalities which engineers will encode in implementation, it explores and validates the bare feasibility of organizing the data previously determined as a business requirement. Logical barriers and ideas incompatible with structured data storage must be discovered and resolved here.
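Some of this feasibility review can be automated. The sketch below shows two sanity checks a logical-model review might run — every entity has a primary key, and every parent an entity references actually exists — over a hypothetical schema description of our own devising:

```python
# Minimal logical-model review checks; the schema dict and its entity
# names are illustrative assumptions.

schema = {
    "Customer": {"pk": "customer_id", "parents": []},
    "Order":    {"pk": "order_id",    "parents": ["Customer"]},
}

def review(schema):
    """Return a list of problems found; an empty list means the model passes."""
    problems = []
    for name, spec in schema.items():
        if not spec.get("pk"):
            problems.append(f"{name}: no primary key")
        for parent in spec["parents"]:
            if parent not in schema:
                problems.append(f"{name}: references unknown parent {parent}")
    return problems
```

Catching an entity with no identifier, or a child pointing at an undefined parent, at this stage is far cheaper than discovering the same flaw after tables have been deployed.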
The physical model synthesizes the result of all discussions while also creating the most detailed specifications for a data storage project. Here, the focus has moved from abstract perspective and prioritization of concepts to a fully-fledged set of technical guidelines. The specific objects and the technologies containing and serving them should be codified.
Your physical schema must be an engineering reference that is complete, maximally specific, yet navigable. With this in mind, a successful physical modeling step will produce a specification that engineers can implement directly.
Data modeling is an evolving paradigm (the three-schema approach dates back to the 1970s!), and the rise of cloud storage and computing might seem to disrupt the process.
Data modeling, however, is at its core a paradigm of careful data understanding before analysis or action, and so it only grows more valuable as data moves to the cloud. It will continue to provide value for businesses and remain a critical part of planning for the foreseeable future.
Across three distinct types of schema, the data modeling procedure encompasses every aspect of planning for a data project. It is a paradigm which encourages good business practices alongside technical preparedness, and its continuing value and relevance are clear.
If you’re ready to get started, access a free trial of Talend Data Fabric and take a look at leading data modeling tools.