Trust And Budget
Data as a product is a simple concept, almost trivial. The data is presented as if it were a product: easily identifiable, described in detail (what it is, what it contains, and so on), with public quality metrics and guaranteed availability. Products are more likely to be sold (reused, in the case of data) if it is possible to build a relationship of trust with potential consumers, for example by allowing users to write reviews and FAQs so that the community can share its experience with that resource. The success of a data asset is driven by its accessibility, which is why data products must offer different access options to meet consumer needs (technical and functional): the more flexibility consumers find, the more likely they are to use the data.
Today, it is also necessary to keep previous versions of the data available, tracking the changes, and to allow access in different ways, such as a stream or flow of events (provided that this is reasonable and compatible with the kind of data), without being bound to the table structure of a database. But let’s not focus too much on the technical side. The truly revolutionary aspect is that data, as a product, now has a predefined and well-determined price, with evident effects on budgeting and reporting capacity. In a traditional system, even one as modern as a data lake, all data operations (from ingestion to retrieval to preparation) fall to the data engineers.
If they have time and can meet the demand right away, that suits the user but is not so good for the business, as it implies overcapacity. In a scenario where IT is a fixed cost to the business, capacity cannot be expanded on demand. Even if it could, it would be nearly impossible to determine how much of the additional effort is directly related to the new project, how much of the work can or could be reused in the future, how much would have had to be done anyway, and so on. When several new projects start simultaneously, IT overload can quickly become an indirect cost that is, at times, out of control. With Data Mesh, you have immediate visibility of which data is available, how much (and how) it is used, and the associated cost. Time and money are no longer unknown (or worse, indeterminable in advance).
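The cost visibility described above can be illustrated with a minimal sketch. It assumes each data product carries a fixed internal price per query and that usage is logged per consumer; the names `DataProduct`, `UsageEvent`, `price_per_query` and `cost_by_consumer` are hypothetical and do not come from any particular Data Mesh platform.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DataProduct:
    name: str
    price_per_query: float  # hypothetical internal price, set by the data owner


@dataclass(frozen=True)
class UsageEvent:
    product: str   # name of the data product that was queried
    consumer: str  # team or project that ran the queries
    queries: int   # number of queries in this event


def cost_by_consumer(products, events):
    """Attribute the cost of data usage to each consumer."""
    prices = {p.name: p.price_per_query for p in products}
    totals = defaultdict(float)
    for event in events:
        totals[event.consumer] += prices[event.product] * event.queries
    return dict(totals)


products = [DataProduct("orders", 0.5), DataProduct("customers", 2.0)]
events = [
    UsageEvent("orders", "sales", 10),
    UsageEvent("customers", "sales", 2),
    UsageEvent("customers", "marketing", 3),
]
report = cost_by_consumer(products, events)
```

With prices and usage recorded this way, the cost of a new project is a sum over its own usage events rather than an unknown share of a central team's time.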
The Turning Point
To understand the meaning of the following two points, “Self-service data infrastructure as a platform” and “Federated computational governance”, let’s draw a parallel with a type of platform that has changed the rules of the game in the past. It is perhaps an oversimplification, but useful for the purpose. Let’s imagine you’re working in logistics and need to book a hotel for a sales department meeting. There are a few requirements (number of rooms available, price within budget, conference room on-site, easy parking) and a list of “nice-to-haves” (half board or a restaurant, so you don’t have to book lunch catering; a shuttle to and from the airport for those flying; a location not too far from the airport or city centre).
The corporate Data Lake contains all the data you need: it lists all the hotels in the city where the meeting will take place. But so did the “Yellow Pages” of the past. How long does it take to find a suitable hotel using such a directory? Not only are the hotels sorted in a predetermined way (in alphabetical order), but it is necessary to check each property in sequence by phoning to find out availability, price and so on. Furthermore, the amount of information available for each hotel in the Yellow Pages is wildly inconsistent.
It would be nice to rule out a certain number of hotels up front, so you don’t have to call them, but while some hotels have bought large ad spaces where they state whether they have a conference room or restaurant, others have just listed a phone number. Suppose we also want to check the quality of a location, to avoid sending the Sales Director to a squalid inn: in that case, the complexity grows exponentially, as does the uncertainty about how long it will take to figure it out. Maybe we are lucky, and we find a colleague who has already been there and can give us feedback. Otherwise, we have to check for ourselves.
When planning the company’s sales meeting, how do you figure out how much it will cost to find the right hotel in terms of time and money? As if that weren’t enough, the Commercial Director is furious that this problem recurs every year. It is never possible to estimate, because the time it took to find a hotel last time does not indicate how long it will take the next time. If we now replace the word “hotel” with “data” in the previous example, we realize that the parallel may seem a bit extreme but is not so far-fetched. On a booking platform, by contrast, the hotel owner must list a price (cost) and availability (uptime), as well as a structured description of the hotel (dataset) that includes the address, category, etc., and all required metadata (is there parking? a restaurant? a swimming pool?
Is breakfast included? etc.). Each of these features becomes easily visible (always present and shown in the same place), searchable, and usable as a filter. People staying at the hotel (using the dataset) also leave reviews, which help both other customers in their choice and the hotel owner in improving the quality of the offer, or at least the accuracy of the description. The self-service aspect is twofold. From the user’s point of view, with this platform, the Sales department can choose and book the hotel directly, without needing (and paying for) the help of Logistics (the Data Lake engineering team).
From the owner’s point of view (hotel or data owner), it means that you can independently choose and advertise which services to offer (air-conditioned rooms, hot tubs, butler service, and so on) to satisfy and even exceed customers’ wishes and requests. In the world of data, this second aspect concerns the freedom of Data Producers to independently choose their technological path, in compliance with the standards approved by federated governance. Last but not least, the Data Mesh architecture brings ease of scalability (once all hotels/datasets are available, the system can grow to accommodate those in other cities as well, or to include new ones) and reuse.
Reuse means that the effort spent to create one solution can, at least in part, be put toward creating another. Let’s stick to the hotel analogy. If the hotel platform was built last year and you now want the same system for B&Bs, much of it can be reused, and there is no need to start from scratch. Of course, the “metadata” will be different (Bed and Breakfasts do not have conference rooms). However, it will still be possible to use the same user feedback system and the same technology to collect information on prices and availability. Once again, keeping that information up to date will be up to the owner of the B&B.
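The booking-platform analogy above can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not a reference implementation: the `DataProduct` and `Catalog` classes, their fields, and the `search` filters are all hypothetical names chosen to mirror the price, uptime, structured metadata and reviews mentioned in the text.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    name: str
    owner: str
    price: float   # hypothetical internal price, e.g. per month
    uptime: float  # availability promised by the owner, between 0 and 1
    metadata: dict = field(default_factory=dict)  # structured, filterable features
    reviews: list = field(default_factory=list)   # (rating, text) pairs from consumers

    def add_review(self, rating: int, text: str) -> None:
        if not 1 <= rating <= 5:
            raise ValueError("rating must be between 1 and 5")
        self.reviews.append((rating, text))

    def average_rating(self):
        """Return the mean rating, or None if nobody has reviewed yet."""
        if not self.reviews:
            return None
        return sum(rating for rating, _ in self.reviews) / len(self.reviews)


class Catalog:
    """Self-service catalog: owners publish, consumers search and filter."""

    def __init__(self):
        self._products = {}

    def publish(self, product: DataProduct) -> None:
        self._products[product.name] = product

    def search(self, max_price=None, min_uptime=None, **required_metadata):
        """Return products matching every given filter, sorted by name."""
        matches = []
        for product in self._products.values():
            if max_price is not None and product.price > max_price:
                continue
            if min_uptime is not None and product.uptime < min_uptime:
                continue
            if all(product.metadata.get(key) == value
                   for key, value in required_metadata.items()):
                matches.append(product)
        return sorted(matches, key=lambda p: p.name)


catalog = Catalog()
catalog.publish(DataProduct("orders", "sales-ops", 100.0, 0.999,
                            {"format": "parquet", "streaming": True}))
catalog.publish(DataProduct("web_clicks", "marketing", 40.0, 0.95,
                            {"format": "json", "streaming": True}))
affordable_streams = catalog.search(max_price=50.0, streaming=True)
```

Because every product exposes the same fields in the same place, a consumer can filter on them directly instead of "phoning" each owner. The same `Catalog` could hold a different kind of product with different metadata keys, which is the reuse the B&B example describes.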
A Project And An Organizational Change
That said, it seems Data Mesh is a no-brainer. This may be true for large companies, but building a Data Mesh is a mammoth project. If we only have three or four hotels, there is no point in building a booking platform. What is essential to keep in mind is that a Data Mesh architecture, to express its full potential, requires a profound organizational change in the company.
To cite the most prominent aspect, data engineers must “migrate” from the centre (the Data Lake) to the producers of data, to guide them in correct “data preparation”: conforming to the rules of federated governance and exposing the data correctly so that it can be found and used (thus also generating revenue for the data owner through internal sales). It also requires a change of mindset, so that the entire company can begin to consider data as a product, freeing itself from the limitations and bottlenecks of a Data Lake and reaping the benefits of a truly distributed architecture and, therefore, of the new paradigm.
Creating a Data Mesh is a vast undertaking, but it means building the solid foundation that will support the evolution of the data-driven business. We can have all the data in the world in our data lake, but if we can’t harness it effectively and sustainably, we won’t benefit from it. In today’s world, standing still means falling behind. The only way to stay competitive is to create new products, services and solutions for your customers.
To be an execution machine, you need to spend your time looking for opportunities, analyzing the market and chasing new customers, rather than looking for data in your Data Lake. Once the goal is achieved, the reward can be a relaxing weekend, looking at the serene lake from your home and remembering when your Data Lake was equally still and opaque.