Designing Data-Intensive Applications

The book “Designing Data-Intensive Applications” is definitely one of the best technical software engineering books I’ve ever read. It is also probably the best one you can use to prepare for system design interviews at big tech companies, so take note. I personally gave three small tech talks within my team based on this book and plan to do a few more. Engineers love this stuff.

The book is a crisp, well-structured dive into the under-the-hood workings of distributed data systems, such as databases, caches, and queues, focusing on fundamentals rather than specific implementations. It is also backed by a lot of properly referenced research, so you can always go deeper into any of the given topics if you would like to.

Here is the list of all the chapters, with some notes and random comments from me (not well structured, but it gives a good idea of what the book is about):

  1. Reliable, Scalable, and Maintainable Applications
    • Martin Kleppmann introduces the main terminology and fundamentals in this chapter:
      • Reliable – system works correctly, even when faults happen.
      • Scalable – strategies for keeping performance good under load.
      • Maintainable – making life better for the engineers and operators who have to work on the system.
  2. Data Models and Query Languages
    • Data models map onto the three main types of databases: document, relational, and graph; sometimes full-text search can be considered a fourth data model. A small illustration follows below.
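    • To make the contrast concrete, here is a toy sketch of mine (in the spirit of the book’s résumé example) showing the same profile in a document model versus a relational one:
      # Document model: data with a one-to-many tree structure is kept
      # together in one self-contained record (good locality).
      profile_doc = {
          "user_id": 251,
          "name": "Bill",
          "positions": [
              {"job_title": "Co-chair", "organization": "Gates Foundation"},
              {"job_title": "Co-founder", "organization": "Microsoft"},
          ],
      }

      # Relational model: the same data normalized into flat tables,
      # linked by a foreign key and recombined with joins at read time.
      users = [(251, "Bill")]              # (user_id, name)
      positions = [                        # (user_id, job_title, organization)
          (251, "Co-chair", "Gates Foundation"),
          (251, "Co-founder", "Microsoft"),
      ]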
  3. Storage and Retrieval
    • Awesome chapter where the author builds a database starting with the simplest possible shell functions:
      #!/bin/bash
      db_set () {
           # Append the key-value pair to the end of the database file
           echo "$1,$2" >> database
      }
      db_get () {
           # Find all lines for the key, strip the key prefix,
           # and keep only the most recent (last) value
           grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
      }
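    • Usage then looks like this (an example along the lines of the book’s own):
      $ db_set 42 '{"name":"San Francisco","attractions":["Golden Gate Bridge"]}'
      $ db_get 42
      {"name":"San Francisco","attractions":["Golden Gate Bridge"]}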
      
    • He then expands this into log-structured storage engines (Bitcask, SSTables, LSM-trees, as used in Cassandra, Lucene, etc.) and update-in-place ones, with B-trees being the main example of the latter and the core of effectively all relational databases. See the toy sketch after this list.
    • In this chapter the author also contrasts two categories of databases: those optimized for transaction processing (OLTP) and those optimized for analytics (OLAP).
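    • Here is a minimal LSM-flavored sketch in Python (my own illustration, heavily simplified: no compaction, no tombstones): writes go to an in-memory memtable, which is flushed to a sorted segment (SSTable-like) once it grows too big; reads check the memtable first, then segments from newest to oldest:
      import bisect

      class ToyLSM:
          def __init__(self, memtable_limit=4):
              self.memtable = {}      # in-memory buffer of the most recent writes
              self.segments = []      # sorted (key, value) lists on "disk", newest last
              self.memtable_limit = memtable_limit

          def set(self, key, value):
              self.memtable[key] = value
              if len(self.memtable) >= self.memtable_limit:
                  # Flush: write the memtable out as a sorted segment
                  self.segments.append(sorted(self.memtable.items()))
                  self.memtable = {}

          def get(self, key):
              if key in self.memtable:
                  return self.memtable[key]
              # Search segments newest-to-oldest; binary search works
              # because each segment is sorted by key
              for segment in reversed(self.segments):
                  i = bisect.bisect_left(segment, (key,))
                  if i < len(segment) and segment[i][0] == key:
                      return segment[i][1]
              return None

      db = ToyLSM()
      db.set("user:1", "Alice")
      db.set("user:2", "Bob")
      print(db.get("user:1"))  # -> Alice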
  4. Encoding and Evolution
    • This chapter looks at different kinds of encoding, including the popular JSON and XML, but also language-specific serialization and binary formats used to write data to disk, along with all kinds of problems that come with backward compatibility and rolling upgrades (small sketch below).
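    • A tiny illustration of the compatibility concern (my own sketch, not the book’s code): during a rolling upgrade, old and new code read each other’s records, so new code must tolerate missing new fields:
      import json

      def decode_user(raw: str) -> dict:
          record = json.loads(raw)
          # Backward compatibility: a record written by old code has no
          # "locale" field, so new code falls back to a default
          record.setdefault("locale", "en-US")
          return record

      old_record = '{"id": 1, "name": "Alice"}'               # written by old code
      new_record = '{"id": 2, "name": "Bob", "locale": "de"}' # written by new code
      print(decode_user(old_record)["locale"])  # -> en-US
      print(decode_user(new_record)["locale"])  # -> de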
  5. Replication
    • The chapter on replication explains single-, multi-, and no-leader replication, touching on potential problems you might face with them, such as “split brain” or “seeing into the future” due to replication lag (e.g. an observer reading the answer to a question before the question itself has replicated), and on solutions to those problems.
    • A simple example of split brain in single-leader replication: the leader node becomes temporarily unavailable and one of the replicas is elected as the new leader, but when the old leader comes back online it still considers itself to be the leader. Such inconsistencies have to be handled in a properly designed distributed system. A lag example is sketched below.
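    • A minimal sketch of single-leader replication with lag (my own toy code, not from the book): the follower applies the leader’s log asynchronously, so reading from it can return stale data:
      class Node:
          def __init__(self):
              self.data = {}
              self.applied = 0  # how many log entries this node has applied

      leader, follower = Node(), Node()
      log = []  # the replication log of (key, value) writes

      def write(key, value):
          log.append((key, value))
          leader.data[key] = value
          leader.applied = len(log)

      def replicate(node, up_to):
          # Ship log entries asynchronously; the follower may lag behind
          for key, value in log[node.applied:up_to]:
              node.data[key] = value
          node.applied = up_to

      write("x", 1)
      write("x", 2)
      replicate(follower, up_to=1)  # follower has only seen the first write
      print(leader.data["x"])       # -> 2
      print(follower.data["x"])     # -> 1 (stale read due to replication lag)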
  6. Partitioning
    • Key-range partitioning – partitions based on ranges of sorted keys.
    • Hash partitioning – a hash function is applied to each key, and each partition owns a range of hashes (key ordering is lost, but load is spread more evenly; see the sketch after this list).
    • Local document-partitioned indexes – secondary indexes stored in the same partition as the primary data. A read requires scatter/gather across all partitions; a write goes to a single partition.
    • Global term-partitioned indexes – secondary indexes are themselves partitioned, by the indexed term. A read can be served from a single partition; a write has to update many.
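    • A minimal hash-partitioning sketch in Python (my own illustration; real systems tend to split the hash space into ranges rather than using plain modulo, which makes rebalancing cheaper):
      import hashlib

      NUM_PARTITIONS = 4

      def partition_for(key: str) -> int:
          # Use a stable hash (not Python's built-in hash(), which is
          # randomized per process) so every node agrees on the mapping
          digest = hashlib.md5(key.encode()).digest()
          return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

      for key in ["alice", "bob", "carol", "dave"]:
          print(key, "-> partition", partition_for(key))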
  7. Transactions
    • Race conditions:
      • Dirty reads – a client reads another client’s writes before they are committed.
      • Dirty writes – a client overwrites another client’s uncommitted data.
      • Read skew – a client sees different parts of the database as of different points in time. Preventable by snapshot isolation.
      • Lost updates – two clients in a read-modify-write cycle overwrite each other’s data (demonstrated in the sketch after this list). Preventable by snapshot isolation implementations that detect lost updates, or by a manual lock.
      • Write skew – the premise of a write (a fact read earlier in the transaction) is no longer true by the time the write happens. Preventable by serializable isolation.
      • Phantom reads – a transaction reads data matching a search condition while another client changes the set of matching data. Snapshot isolation prevents straightforward phantoms; phantoms in write-skew scenarios need special treatment such as index-range locks.
    • Three approaches to serializable transactions: actual serial execution, two-phase locking, and serializable snapshot isolation (SSI).
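    • Here is a small Python sketch of a lost update and its fix with an explicit lock (my own illustration; the sleep just forces the unlucky interleaving):
      import threading, time

      counter = {"value": 0}
      lock = threading.Lock()

      def unsafe_increment():
          # Read-modify-write without a lock: both threads read the same
          # old value, and one of the two updates is lost
          v = counter["value"]
          time.sleep(0.01)
          counter["value"] = v + 1

      def safe_increment():
          with lock:  # the lock makes the read-modify-write cycle atomic
              v = counter["value"]
              time.sleep(0.01)
              counter["value"] = v + 1

      for fn in (unsafe_increment, safe_increment):
          counter["value"] = 0
          threads = [threading.Thread(target=fn) for _ in range(2)]
          for t in threads:
              t.start()
          for t in threads:
              t.join()
          print(fn.__name__, "->", counter["value"])  # unsafe: 1, safe: 2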
  8. The Trouble with Distributed Systems
    • All kinds of crazy things that can go wrong when you are dealing with distributed systems are discussed in this chapter, including lost network packets, clocks going out of sync, temporary failures of nodes, heck, even sharks biting underwater network cables. One small practical takeaway about clocks is sketched below.
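    • One concrete point from the clocks discussion: never measure elapsed time with the wall clock, which NTP may step backwards or forwards; use a monotonic clock. In Python that looks like this:
      import time

      start = time.monotonic()   # monotonic clock: guaranteed to move forward
      time.sleep(0.1)            # stand-in for the operation being timed
      elapsed = time.monotonic() - start
      print(f"elapsed: {elapsed:.3f}s")

      # By contrast, time.time() reads the wall clock; a duration computed
      # from it can even come out negative if the clock is stepped backwards.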
  9. Consistency and Consensus
    • Achieving consensus means getting all of the nodes to agree on an irrevocable decision. One way to achieve this without implementing your own consensus algorithm is to “outsource” it to a tool like ZooKeeper (see the sketch below).
    • Leaderless and multi-leader systems do not require consensus, but they then need to cope with problems such as conflict resolution.
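    • As an example of such outsourcing, here is a leader-election sketch using the kazoo Python client for ZooKeeper (my assumption of a client library and a local ZooKeeper at the default port; the book does not prescribe any of this):
      from kazoo.client import KazooClient

      zk = KazooClient(hosts="127.0.0.1:2181")  # assumed local ZooKeeper
      zk.start()

      def lead():
          # Runs only on the node that ZooKeeper elected as leader;
          # leadership is released when this function returns
          print("I am the leader now")

      # Every candidate node runs this; ZooKeeper's consensus decides who wins
      election = zk.Election("/app/leader-election", "node-1")
      election.run(lead)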
  10. Batch Processing
    • Unix tools such as awk, grep, and sort follow the same processing principles as MapReduce, and the same concepts extrapolate to dataflow engines as well. The author shows how the two main problems, partitioning and fault tolerance, are solved, and then goes into join algorithms for MapReduce: sort-merge joins, broadcast hash joins, and partitioned hash joins. A tiny MapReduce-shaped word count is sketched after this list.
    • Batch processing jobs read some bounded input and produce some output, without modifying the input.
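    • Here is a minimal word count in plain Python shaped like MapReduce (my own sketch): the map stage emits key-value pairs, the shuffle brings equal keys together by sorting (which is what a real framework’s partitioned sort does), and the reduce stage aggregates each group:
      from itertools import groupby

      documents = ["the quick brown fox", "the lazy dog", "the fox"]

      # Map: emit (word, 1) for every word in every document
      mapped = [(word, 1) for doc in documents for word in doc.split()]

      # Shuffle: sort so that all pairs with the same key are adjacent,
      # much like `sort | uniq -c` in a Unix pipeline
      mapped.sort(key=lambda kv: kv[0])

      # Reduce: aggregate the values for each key
      counts = {word: sum(v for _, v in pairs)
                for word, pairs in groupby(mapped, key=lambda kv: kv[0])}
      print(counts)  # -> {'brown': 1, 'dog': 1, 'fox': 2, ...}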
  11. Stream Processing
    • Stream processing differs from batch processing in that the input is unbounded and never-ending; sorting the input therefore makes no sense, and data is delivered via message brokers.
    • Two types of message brokers are discussed: AMQP-style (order not preserved; messages are assigned to consumers and deleted once acknowledged) and log-based (order preserved; the same consumer receives all messages from the same partition; messages are retained and can be re-read, as the sketch below illustrates).
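    • A toy log-based broker in Python (my own sketch): the broker only ever appends, and each consumer tracks its own offset, which is why messages can be replayed:
      class LogPartition:
          def __init__(self):
              self.log = []            # append-only; nothing is deleted

          def append(self, message):
              self.log.append(message)

      class Consumer:
          def __init__(self, partition):
              self.partition = partition
              self.offset = 0          # position in the log, tracked per consumer

          def poll(self):
              # Deliver messages in order from the current offset onwards
              messages = self.partition.log[self.offset:]
              self.offset = len(self.partition.log)
              return messages

          def rewind(self, offset=0):
              self.offset = offset     # re-reading is just resetting the offset

      partition = LogPartition()
      consumer = Consumer(partition)
      for event in ["click", "scroll", "purchase"]:
          partition.append(event)
      print(consumer.poll())   # -> ['click', 'scroll', 'purchase']
      consumer.rewind()
      print(consumer.poll())   # -> the same messages, replayed in order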
  12. The Future of Data Systems
    • One of the interesting approaches to designing applications is to shift work from the read path to the write path. That is, you do more preprocessing when writing, via materialized views and caches, so that when you need to read, everything is already prepared and the query is super efficient. This technique comes up often when solving scaling and latency problems (small sketch below).
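    • A minimal sketch of the idea (my own code, in the spirit of the book’s Twitter timeline example from chapter 1): maintain a precomputed timeline on every write, so reads become a cheap lookup instead of a query:
      from collections import defaultdict

      followers = {"alice": ["bob", "carol"]}  # who follows whom (toy data)
      timelines = defaultdict(list)            # materialized view, built on write

      def post_tweet(author, text):
          # The write path does the heavy lifting: fan out the new tweet
          # to every follower's precomputed timeline
          for follower in followers.get(author, []):
              timelines[follower].append((author, text))

      def read_timeline(user):
          # The read path is now a trivial lookup; no query work at read time
          return timelines[user]

      post_tweet("alice", "hello world")
      print(read_timeline("bob"))  # -> [('alice', 'hello world')]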
    • I especially liked the section “Doing the Right Thing” in this last chapter, reflecting on the implications of intensive data usage and of trying to automate decisions based on data. One interesting thought experiment is to replace the word data with surveillance: “In our surveillance-driven organization we collect real-time surveillance streams and store them in our surveillance warehouse. Our surveillance scientists use advanced analytics and surveillance processing in order to derive new insights.” This becomes scary, especially when we try to justify automatically made decisions without looking at their ethical implications. Another interesting quote brought up in this section: “Machine learning is like money laundering for bias.”

Conclusion

If you are a software development engineer working on scalable distributed systems, this book is simply a must-read. Totally and absolutely recommended. I will be going through some of the chapters once or twice more, as one read is not enough. Let me end with this fancy quote brought up in the book:

“The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair.”  – Douglas Adams, Mostly Harmless (1992)