Data fabric?!

Data fabric refers to a comprehensive and flexible data management framework that enables organizations to seamlessly integrate, access, and manage data across diverse data sources, locations, and formats. Data fabric is designed to provide a unified and consistent view of data, regardless of where it resides, whether it’s on-premises, in the cloud, or at the edge. It plays a crucial role in modern data architectures and is particularly relevant in the context of big data, hybrid and multi-cloud environments, and distributed computing. Here are key aspects and components that define the meaning of data fabric:

  1. Data Integration and Interoperability:
    Data fabric solutions are designed to integrate data from various sources, including databases, data warehouses, data lakes, cloud services, IoT devices, and more. They enable seamless data interoperability, ensuring that data can flow freely between different systems and platforms.
  2. Unified Data Access and Management:
    Data fabric provides a unified layer for data access and management, allowing users and applications to interact with data regardless of its location or format. This abstraction layer ensures a consistent and simplified experience for data consumers.
  3. Data Abstraction and Virtualization:
    Data fabric abstracts the underlying data infrastructure, offering a logical representation of data. This means that users and applications interact with a logical view of data without needing to understand the complexities of the physical data storage or technology stack.
  4. Scalability and Flexibility:
    Data fabric solutions are designed to scale with an organization’s growing data needs. They accommodate new data sources, larger datasets, and changing requirements, making them suitable for handling big data and evolving data landscapes.
  5. Data Governance and Security:
    Data fabric incorporates features for data governance, security, and compliance. It provides controls for data access, authentication, authorization, encryption, and auditing, ensuring data is used securely and in compliance with regulations.
  6. Real-Time Data Insights:
    Data fabric enables real-time data processing and analytics by making data readily available for analysis. This facilitates data-driven decision-making and supports business intelligence initiatives.
  7. Cloud and Hybrid Cloud Support:
    Data fabric solutions are typically cloud-agnostic and can seamlessly operate in multi-cloud and hybrid cloud environments. They support data mobility, allowing data to move between on-premises and cloud resources as needed.
  8. Data Resilience and High Availability:
    Data fabric incorporates redundancy, failover, and data replication mechanisms to ensure data availability and minimize downtime in the event of failures.
  9. APIs and Data Services:
    Data fabric often exposes data through APIs and data services, making it easier for developers to access and interact with data programmatically (see the sketch after this list).
  10. Use Cases:
    Data fabric is used in a wide range of use cases, including data integration, data analytics, data warehousing, data migration, data governance, and more.
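
To make the unified-access and abstraction ideas (points 2, 3, and 9) concrete, here is a minimal Python sketch. It is not a real data fabric product: the catalog, logical names, and paths are all hypothetical, and pandas merely stands in for whatever connectors an actual fabric would use.

    import pandas as pd

    # Hypothetical catalog mapping logical dataset names to physical
    # sources. A real data fabric resolves this through metadata
    # services rather than a hard-coded dict.
    CATALOG = {
        "customers": {"type": "csv", "location": "data/customers.csv"},
        "orders": {"type": "parquet", "location": "s3://bucket/orders.parquet"},
    }

    def read(logical_name: str) -> pd.DataFrame:
        """Unified access: callers ask for a logical name and never
        see the physical format, protocol, or location."""
        source = CATALOG[logical_name]
        if source["type"] == "csv":
            return pd.read_csv(source["location"])
        if source["type"] == "parquet":
            return pd.read_parquet(source["location"])
        raise ValueError(f"Unknown source type: {source['type']}")

    # A consumer joins data across systems without knowing where it lives.
    report = read("customers").merge(read("orders"), on="customer_id")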

Data fabric is a crucial component of modern data architecture, enabling organizations to harness the full potential of their data assets, facilitate data-driven decision-making, and adapt to evolving data requirements in an increasingly complex data landscape. It provides the agility and flexibility needed to address the challenges of managing and utilizing data effectively.

Who is uncle bob?

Robert C. Martin, also known as “Uncle Bob,” is a well-known figure in the software development industry. He has authored several important books on software development and is a prominent advocate for clean code and best practices in software engineering. Here are some of his most important books:

  1. “Clean Code: A Handbook of Agile Software Craftsmanship” – This book is arguably Robert C. Martin’s most famous work. It focuses on writing clean, readable, and maintainable code. It covers principles and practices that help developers write high-quality code that is easy to understand and modify.
  2. “The Clean Coder: A Code of Conduct for Professional Programmers” – In this book, Martin discusses the qualities and behaviors that define a professional software developer. He emphasizes the importance of continuous learning, discipline, and professionalism in the field.
  3. “Agile Principles, Patterns, and Practices in C#” (or equivalent titles for other programming languages) – This book is part of Martin’s exploration of Agile software development principles. It provides practical guidance and examples for implementing Agile practices in real-world software projects.
  4. “UML for Java Programmers” – While not as widely known as his other books, this one is valuable for those interested in using the Unified Modeling Language (UML) to design and document software systems, particularly if you’re a Java developer.
  5. “Clean Architecture: A Craftsman’s Guide to Software Structure and Design” – This book delves into the architectural aspects of software development. It presents a clear and practical approach to designing systems with maintainability and flexibility in mind.
  6. “Clean Agile: Back to Basics” – In this book, Martin revisits the original values and disciplines of Agile software development and argues for a return to its small-team roots.

These books have had a significant impact on the software development community, promoting best practices, design principles, and professionalism among developers. Reading them can provide valuable insights into writing high-quality code and building software systems that stand the test of time.

Modern Data Stack in a Box with DuckDB

There is a large volume of literature (1, 2, 3) about scaling data pipelines. “Use Kafka! Build a lake house! Don’t build a lake house, use Snowflake! Don’t use Snowflake, use XYZ!” However, with advances in hardware and the rapid maturation of data software, there is a simpler approach. This article will light up the path to highly performant single-node analytics with an MDS-in-a-box open source stack: Meltano, DuckDB, dbt, & Apache Superset on Windows using Windows Subsystem for Linux (WSL). There are many options within the MDS, so if you are using another stack to build an MDS-in-a-box, please share it with the community on the DuckDB Twitter, GitHub, or Discord, or the dbt Slack! Or just stop by for a friendly debate about our choice of tools!
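
As a taste of the DuckDB core of that stack, here is a minimal Python sketch. The database file, CSV path, and column names are placeholders; in the article’s actual pipeline, Meltano handles ingestion and dbt the transformations, so this shows only the single-node query engine underneath.

    import duckdb

    # The entire "warehouse" is one local file.
    con = duckdb.connect("mds_in_a_box.duckdb")

    # DuckDB can query raw files directly (the path is a placeholder).
    con.execute("""
        CREATE OR REPLACE TABLE trips AS
        SELECT * FROM read_csv_auto('data/trips.csv')
    """)

    # Run an analytical query on the single node and get a DataFrame back.
    print(con.sql("""
        SELECT pickup_borough, count(*) AS rides
        FROM trips
        GROUP BY pickup_borough
        ORDER BY rides DESC
    """).df())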

https://duckdb.org/2022/10/12/modern-data-stack-in-a-box.html


https://www.datafold.com/

The Modern Data Stack: Past, Present, and Future

https://www.getdbt.com/blog/future-of-the-modern-data-stack

Learn how some of the most amazing companies in the world organise their data stacks, which tools they use, and why.

https://www.moderndatastack.xyz/stacks

Data Warehouse and Data Lake Modernization: From Legacy On-Premise to Cloud-Native Infrastructure

Many people talk about data warehouse modernization when they move to a cloud-native data warehouse. But what does data warehouse modernization actually mean? Why do people move away from their on-premise data warehouse? What are the benefits?

Data Warehouse and Data Lake Modernization: From Legacy On-Premise to Cloud-Native Infrastructure – Kai Waehner (ampproject.org)


What is Kappa Architecture?

Kappa Architecture – Where Every Thing Is A Stream (pathirage.org)

Kappa Architecture is a software architecture pattern. Rather than using a relational database or a key-value store like Cassandra, the canonical data store in a Kappa Architecture system is an append-only immutable log. From the log, data is streamed through a computational system and fed into auxiliary stores for serving.

Kappa Architecture is a simplification of Lambda Architecture. A Kappa Architecture system is like a Lambda Architecture system with the batch processing system removed. To replace batch processing, data is simply fed through the streaming system quickly.

But why?

Kappa Architecture revolutionizes database migrations and reorganizations: just delete your serving layer database and populate a new copy from the canonical store! Since there is no batch processing layer, only one set of code needs to be maintained.
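
A toy Python sketch of that replay idea, with a plain list standing in for the log (Kafka or similar in practice) and a dict standing in for the serving-layer database:

    # The canonical store: an append-only, immutable event log.
    event_log = [
        {"user": "alice", "amount": 10},
        {"user": "bob", "amount": 5},
        {"user": "alice", "amount": 7},
    ]

    def build_serving_store(log):
        """Stream every event through the computation and fold the
        results into a fresh serving store."""
        store = {}  # stands in for the serving-layer database
        for event in log:
            store[event["user"]] = store.get(event["user"], 0) + event["amount"]
        return store

    serving = build_serving_store(event_log)
    assert serving == {"alice": 17, "bob": 5}

    # "Migration": throw the serving store away and rebuild it by
    # replaying the same log through (possibly new) code.
    serving = build_serving_store(event_log)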

Says who?

The idea of Kappa Architecture was first described in an article by Jay Kreps from LinkedIn. Then came the talk “Turning the Database Inside Out with Apache Samza” by Martin Kleppmann at Strange Loop 2014, which inspired that site.


Log data stores

An append-only immutable log store is the canonical store in a Kappa Architecture (or Lambda Architecture) system; Apache Kafka is the best-known example.

Streaming computation systems

In Kappa Architecture, data is fed from the log store into a streaming computation system such as Apache Samza, Apache Storm, or Apache Flink.

Serving layer stores

The purpose of the serving layer is to provide optimized responses to queries. These databases aren’t used as canonical stores: at any point, you can wipe them and regenerate them from the canonical data store. Almost any database, in-memory or persistent, might be used in the serving layer. This also includes special-purpose databases, e.g. for full text search.

Strange Loop

Strange Loop is a multi-disciplinary conference that brings together the developers and thinkers building tomorrow’s technology in fields such as emerging languages, alternative databases, concurrency, distributed systems, security, and the web.

Strange Loop was created in 2009 by software developer Alex Miller and is now run by a team of St. Louis-based friends and developers under Strange Loop LLC, a for-profit venture.

Some of our guiding principles:
No marketing. Keynotes are never sold to sponsors. The conference mailing lists are never sold or given to sponsors.
Tech, not process. Talks are in general code-heavy, not process-oriented (agile, testing, etc). There are many fine speakers, topics, and conferences in the process area. This is not one of them.
Technology stew. Interesting stuff happens when you get people from different areas in the same room. Strange Loop has a broad range of topics from academia, industry, and a touch of weirdness.

About – Strange Loop (thestrangeloop.com)

Apache Impala vs Presto

Apache Impala vs Presto: What are the differences?

What is Apache Impala? Real-time Query for Hadoop. Impala is a modern, open source, MPP SQL query engine for Apache Hadoop. Impala is shipped by Cloudera, MapR, and Amazon. With Impala, you can query data, whether stored in HDFS or Apache HBase – including SELECT, JOIN, and aggregate functions – in real time.

What is Presto? Distributed SQL Query Engine for Big Data. Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.

Apache Impala and Presto both belong to the “Big Data Tools” category of the tech stack.
https://stackshare.io/stackups/impala-vs-presto
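
For a feel of what querying Presto from Python looks like, here is a sketch using the PyHive client. The host, catalog, schema, and table are placeholders, and a reachable Presto coordinator is assumed.

    from pyhive import presto  # pip install 'pyhive[presto]'

    # Placeholder coordinator; Presto listens on port 8080 by default.
    conn = presto.connect(
        host="presto-coordinator.example.com",
        port=8080,
        catalog="hive",
        schema="default",
    )
    cursor = conn.cursor()

    # An interactive analytic query federated over the warehouse.
    cursor.execute("SELECT region, count(*) FROM orders GROUP BY region")
    for region, n in cursor.fetchall():
        print(region, n)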


Some links about the good old (and dead?) SSAS Multidimensional Cube

Why doesn’t SSAS cache the entire cube?
If your cube is much larger than the memory available to SSAS, continual IO is expected and is likely to be quite well optimised. However, on a 64-bit server with a cube that is larger than 3 GB but comfortably smaller than the server memory, you might be surprised by the volume of continual IO.

http://richardlees.blogspot.com/2011/12/why-doesnt-ssas-cache-entire-cube.html

Best Practices for Performance Tuning in SSAS Cubes: define cascading attribute relationships (for example, Day > Month > Quarter > Year) and define user hierarchies of related attributes (called natural hierarchies) within each dimension, as appropriate for your data.

Remove redundant relationships between attributes to assist the query execution engine in generating the appropriate query plan. Attributes need to have either a direct or an indirect relationship to key attributes, not both.
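
A natural hierarchy only works when each member has exactly one parent at the next level (every day belongs to one month, every month to one quarter, and so on). As a rough illustration outside SSAS itself, here is a pandas sketch that checks this one-parent property; the column names and rows are made up.

    import pandas as pd

    # Made-up date-dimension rows; in SSAS these would come from the
    # dimension table itself.
    dim_date = pd.DataFrame({
        "day":     ["2024-01-01", "2024-01-02", "2024-02-01"],
        "month":   ["2024-01",    "2024-01",    "2024-02"],
        "quarter": ["2024-Q1",    "2024-Q1",    "2024-Q1"],
        "year":    [2024,         2024,         2024],
    })

    def is_strict(df, child, parent):
        """True if every child member maps to exactly one parent member,
        the property a cascading attribute relationship relies on."""
        return (df.groupby(child)[parent].nunique() <= 1).all()

    # Validate each step of Day > Month > Quarter > Year.
    for child, parent in [("day", "month"), ("month", "quarter"),
                          ("quarter", "year")]:
        print(child, "->", parent, is_strict(dim_date, child, parent))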


https://mindmajix.com/msbi/best-practices-for-performance-tuning-in-ssas-cube

https://kejserbi.wordpress.com/

https://hub.packtpub.com/query-performance-tuning-microsoft-analysis-services-part-1/

https://github.com/RichieBzzzt/SSASActivityMonitor

https://github.com/ssasdiag/SSASDiag

https://christianb7.wordpress.com/2012/11/11/analysis-services-2012-configuration-settings/

https://www.mssqltips.com/sqlservertip/2568/ssas-best-practices-and-performance-optimization-part-4-of-4/