
Data Stores: Hub, Mesh, Lake, Lakehouse — What Is Right for You?

Someone suggests a data store to solve your team’s need to mine data. Fine, but which one should you use? Unfortunately, the answer is not straightforward, but consider the following items as you ponder the data store type:

  • Are there security and/or governmental restrictions on some/all of the data?
  • Do some or all data owners insist they maintain control over their data?
  • Is there an increased need for analysis/research on the data?
  • Rather than a long-term centralized store, can intermittent data exchanges fulfill most needs?
  • Is the volume of new data increasing and does the expected data contain various types (structured, unstructured, and semi-structured)?

Bear in mind that most of these solutions will require additional staff to maintain the data or, in the case of a data mesh, to create and maintain the APIs that access individual data repositories. As you pursue this journey, expect to find duplicated data. Hopefully, you will identify one data repository that is considered the single source of truth; if so, consider building your new data store on it. If you might later use this new repository for retrieval-augmented generation (RAG) in an AI scenario, it will be essential to use the source considered most accurate.

The list above may get you started on your journey. Below are suggestions for potential solutions.

  • Data warehouses are designed for analytics and can meet high-security requirements. If the data is mostly structured, this type of store should be considered.
  • The data mesh model is the first choice for this scenario, as it allows owners to retain control. Some level of programming is likely required for the APIs, though.
  • If the data types vary and you expect more data to be analyzed, a data lakehouse should be considered.
  • The data hub model is good for data exchanges, as the storage area is understood to only retain the data for a short time (think of an airport hub).
  • The data lake model is likely the best for this scenario. Understand, though, that this model is designed for raw data, so any analytical users will need to clean/reformat the data to suit their uses. This may evolve into a data lakehouse to help with the analysis.
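To make the data mesh point concrete, here is a minimal sketch of a domain-owned "data product" that retains control of its storage and exposes data only through an API the owning team maintains. All names here (`SalesDataProduct`, `query_orders`, the sample orders) are illustrative assumptions, not any real library or system.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Order:
    order_id: str
    amount: float
    placed_on: date

class SalesDataProduct:
    """Domain-owned data product: the sales team controls storage and access."""

    def __init__(self):
        # In practice this would be the team's own database or lake partition;
        # a hard-coded list stands in for it here.
        self._orders = [
            Order("A-1", 120.0, date(2024, 1, 15)),
            Order("A-2", 75.5, date(2024, 3, 2)),
        ]

    def query_orders(self, since: date) -> list[Order]:
        """The published API: consumers never touch the underlying store."""
        return [o for o in self._orders if o.placed_on >= since]

product = SalesDataProduct()
recent = product.query_orders(since=date(2024, 2, 1))
print(len(recent))  # number of orders on or after the cutoff
```

The design point is that consumers depend on the published `query_orders` contract, not on the team's internal schema, which is what lets owners keep control while still sharing data.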

Bear in mind that with any central repository (e.g., a data lake), some maintenance will be needed, especially aging out data. Access to the lake, whether for putting in or getting out information, should be governed in accordance with any applicable regulations (and to help maintain the cleanliness of the data).
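The aging-out maintenance above can be sketched as a simple retention check. This is an assumption-laden illustration (the object keys, timestamps, and one-year window are invented); in practice a lake on object storage would use its platform's built-in lifecycle/expiration rules rather than hand-rolled code.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention window; real policies come from regulations.
RETENTION = timedelta(days=365)

def expired(objects, now=None):
    """Return keys of objects whose last-modified time is past retention."""
    now = now or datetime.now(timezone.utc)
    return [key for key, modified in objects.items() if now - modified > RETENTION]

# Toy stand-in for a lake listing: key -> last-modified timestamp.
lake = {
    "raw/2022/events.json": datetime(2022, 6, 1, tzinfo=timezone.utc),
    "raw/2024/events.json": datetime(2024, 6, 1, tzinfo=timezone.utc),
}
stale = expired(lake, now=datetime(2024, 12, 1, tzinfo=timezone.utc))
print(stale)  # only the 2022 object exceeds the one-year window
```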

The goal of this endeavor is to provide access to data across the organization so that better decisions can be reached. While experience and “gut instinct” are not to be dismissed, additional reliable data from a data store can substantiate those decisions, whether for or against a strategic direction.


Dan Kempton is the Sr. IT Advisor at North Carolina Department of Information Technology. An accomplished IT executive with over 35 years of experience, Dan has worked nearly equally in the private sector, including startups and mid-to-large scale companies, and the public sector. His Bachelor’s and Master’s degrees in Computer Science fuel his curiosity about adopting and incorporating technology to reach business goals. His experience spans various technical areas including system architecture and applications. He has served on multiple technology advisory boards, ANSI committees, and he is currently an Adjunct Professor at the Industrial & Systems Engineering school at NC State University. He reports directly to the CIO for North Carolina, providing technical insight and guidance on how emerging technologies could address the state’s challenges.

Image by Susanne Stöckli from Pixabay
