Kappa Architecture is a software architecture pattern. Rather than using a relational SQL database or a key-value store like Cassandra, the canonical data store in a Kappa Architecture system is an append-only immutable log. From the log, data is streamed through a computational system and fed into auxiliary stores for serving.
Kappa Architecture is a simplification of Lambda Architecture. A Kappa Architecture system is like a Lambda Architecture system with the batch processing system removed. Instead of running a separate batch job, historical data is replayed from the log through the same streaming system at high speed.
Kappa Architecture revolutionizes database migrations and reorganizations: just delete your serving layer database and populate a new copy from the canonical store! Since there is no batch processing layer, only one set of code needs to be maintained.
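The rebuild-from-the-log idea can be sketched in a few lines. This is only an in-memory illustration, assuming a Python list stands in for the log and a dictionary stands in for a serving layer database; the event fields and the `build_view` helper are hypothetical names made up for this example.

```python
from collections import defaultdict

# Canonical store: an append-only list of events.
# The event shape (user, item, qty) is invented for illustration.
log = [
    {"user": "ann", "item": "book", "qty": 2},
    {"user": "bob", "item": "pen", "qty": 5},
    {"user": "ann", "item": "pen", "qty": 1},
]

def build_view(events, key):
    """Replay every event through the same code to populate a fresh serving store."""
    view = defaultdict(int)
    for event in events:
        view[event[key]] += event["qty"]
    return dict(view)

# Original serving store: total quantity per user.
per_user = build_view(log, "user")

# "Migration": drop the old store and replay the log into a differently keyed one.
del per_user
per_item = build_view(log, "item")
print(per_item)  # {'book': 2, 'pen': 6}
```

The same `build_view` function serves both the first load and the migration, which is the "one set of code" point made above.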
The idea of Kappa Architecture was first described in an article by Jay Kreps from LinkedIn. Then came the talk “Turning the database inside out with Apache Samza” by Martin Kleppmann at Strange Loop 2014, which inspired this website.
HOW DO I MAKE MY OWN?
- Questioning the Lambda Architecture
- Apache Kafka and the Next 700 Stream Processing Systems
- Article by Jay Kreps: The Log: What every software engineer should know about real-time data’s unifying abstraction
- Presentation: Discovering Kappa Architecture the hard way
- Linux Foundation Presentation: Kappa Architecture: Our Experience
- Liquid: Unifying Nearline and Offline Big Data Integration (a summary of the Liquid paper is also available)
- Article by Joan Goyeau: Functional Programming with Kafka Streams and Scala
LOG DATA STORES
An append-only immutable log store is the canonical store in a Kappa Architecture (or Lambda Architecture) system. Some log databases:
- Amazon Quantum Ledger Database (QLDB)
- Apache Kafka
- Apache Pulsar
- Amazon Kinesis
- Amazon DynamoDB Streams
- Azure Cosmos DB Change Feed
- Azure Event Hubs
- Chronicle Queue
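What the products above share is the log abstraction: records are appended at the end, each gets a position, and existing records are never modified. A toy version, assuming an in-memory list and a hypothetical `AppendOnlyLog` class (not the API of any system listed), might look like this:

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple

@dataclass
class AppendOnlyLog:
    """Hypothetical in-memory log: records can be appended and read, never changed."""
    _records: List[Any] = field(default_factory=list)

    def append(self, record: Any) -> int:
        """Store a record at the end of the log and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset: int) -> Tuple[Any, ...]:
        """Return all records at or after `offset` as an immutable tuple."""
        return tuple(self._records[offset:])

log = AppendOnlyLog()
first = log.append("account opened")
second = log.append("deposit 100")
print(first, second)     # 0 1
print(log.read_from(1))  # ('deposit 100',)
```

The class deliberately offers no update or delete method; real log stores add durability, partitioning, and retention policies on top of this basic contract.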
STREAMING COMPUTATION SYSTEMS
In Kappa Architecture, data is fed from the log store into a streaming computation system. Some distributed streaming systems:
- Amazon Kinesis
- Apache Flink
- Apache Samza
- Apache Spark
- Apache Storm
- Apache Beam
- Azure Stream Analytics
- Hazelcast Jet
- Kafka Streams
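To make the role of the streaming system concrete, here is a small sketch of a stateful job that reads from a log, keeps its position, and processes only new records on each poll. It assumes a plain Python list as the log; `RunningTotal` and its `poll` method are hypothetical and do not correspond to the API of any framework listed above.

```python
class RunningTotal:
    """Hypothetical stream job: sums integers from a log and remembers its position."""

    def __init__(self):
        self.offset = 0  # next log position to read
        self.total = 0   # state derived from everything read so far

    def poll(self, log):
        """Process only the records that arrived since the last poll."""
        for value in log[self.offset:]:
            self.total += value
        self.offset = len(log)
        return self.total

log = [3, 4]
job = RunningTotal()
job.poll(log)                # consumes 3 and 4
log.append(10)               # new data arrives on the log
incremental = job.poll(log)  # consumes only the new record

# Reprocessing from offset 0 with the same code yields the same state.
replayed = RunningTotal().poll(log)
print(incremental, replayed)  # 17 17
```

Because the incremental run and the full replay agree, a fresh job instance can stand in for a batch recomputation.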
SERVING LAYER STORES
The purpose of the serving layer is to provide optimized responses to queries. These databases aren’t used as canonical stores: at any point, you can wipe them and regenerate them from the canonical data store. Almost any database, in-memory or persistent, might be used in the serving layer. This also includes special-purpose databases, e.g. for full-text search.
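As an illustration of a special-purpose serving store, the sketch below builds a tiny full-text index from a log, wipes it, and regenerates it. The log contents and the `build_search_index` helper are invented for this example; a real deployment would use a dedicated search database rather than a dictionary.

```python
from collections import defaultdict

# Canonical log of (doc_id, text) records; contents are invented for illustration.
log = [
    (1, "kappa architecture uses a log"),
    (2, "lambda architecture adds a batch layer"),
    (1, "kappa architecture replays the log"),  # later record supersedes doc 1
]

def build_search_index(records):
    """Replay the log into an inverted index: word -> set of doc ids."""
    latest = {}
    for doc_id, text in records:  # the last record per document wins
        latest[doc_id] = text
    index = defaultdict(set)
    for doc_id, text in latest.items():
        for word in text.split():
            index[word].add(doc_id)
    return dict(index)

index = build_search_index(log)
index.clear()                         # wipe the serving store...
index = build_search_index(log)       # ...and regenerate it from the log
print(sorted(index["architecture"]))  # [1, 2]
print(index.get("uses"))              # None
```

The index never needs its own backup: everything in it is derived, so losing it costs only the time to replay the log.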