Book Review: Software Engineering at Google

Disclaimer: I work for Google. Opinions expressed in this post are my own and do not represent opinions of my employer.

NOTE: Most of this was written in Sep 2020 (a month after joining Google). I’m not too sure why I didn’t publish it back then.

I picked up the book “Software Engineering at Google” on the recommendation of one of my colleagues. Because the book is written for outsiders, for me, as a Noogler, it bridges the gap between my prior external experience and knowledge and the internals of Google that I am starting to experience.

Some criticism? Of course. Personally I think the book is way too long for the content it contains. The style changes with each chapter. Some are written like someone’s life story, some are written like publications, and some are written like promotions of a software product. I’m not saying anyone did a bad job; it is just a natural outcome of multiple people writing a book.

Do I recommend the book? Yes! Though you would gain the most from reading it provided you: 1) are a software engineer AND 2) (started working at Google OR are interested in Google OR are working for another big tech company). I would say to hold off on reading the book till later if you have little to no industry experience and are not working for a big software company. I think it would be hard to relate to many problems described in the book without prior experience. For example, the book talks about source control systems, compares distributed and centralized ones, and explains why a monorepo works well at Google.

The book is not a software engineering manual, although it covers many software engineering topics. Things are mostly covered from Google’s point of view. This means that some problems might not be applicable to you, and even for those problems that are applicable, the solutions might not be feasible or reasonable. Most software companies hardly ever need to solve for the scale of software engineers working for them, but at Google there are tens of thousands of SWEs, so this becomes important.

The book’s thesis is that Software Engineering extends beyond programming to include the maintenance of software solutions.

What follows is not so much a book review as me taking notes. Some of the notes are quotes. Some are small comments and thoughts. Skim through them to get an idea of what’s in the book.

Thesis

Question: “What is the expected life span of your code?”
Things change and what was efficient might no longer be.
Scale everything, including human time, CPU, and storage.
Hyrum’s Law (https://www.hyrumslaw.com/): “With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.”
- Example with ordering inside of a hashmap
“Churn Rule” – expertise scales better, this is probably why EngProd exists.
Beyoncé Rule – cover with CI what you like. For example, compiler upgrades.
Things are cheaper at the start of a project.
Decisions have to be made even if data is not there.
“Jevons Paradox” – “Jevons paradox occurs when technological progress or government policy increases the efficiency with which a resource is used (reducing the amount necessary for any one use), but the rate of consumption of that resource rises due to increasing demand.” – wiki
Time vs. scale. Small changes; no forking
- Figure with timeline of the developer workflow
Google is a data-driven company, but by data we also mean evidence, precedent, and argument.
Leaders that admit mistakes are more respected.
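
To make the Hyrum’s Law note above (ordering inside of a hashmap) concrete, here is a minimal Python sketch of my own; it is not from the book. The names `load_settings_v1`, `load_settings_v2`, `first_setting_name`, and `lookup` are hypothetical. The contract only promises a mapping, yet one caller quietly depends on which key comes first.

```python
def load_settings_v1():
    """Contract: returns a mapping of setting name to value.
    Iteration order is NOT part of the contract."""
    return {"timeout": 30, "retries": 3, "debug": False}


def load_settings_v2():
    """Same contract and same contents, but the internals changed
    and the keys now happen to come out sorted."""
    old = load_settings_v1()
    return {name: old[name] for name in sorted(old)}


def first_setting_name(settings):
    """Fragile caller: depends on an observable but unpromised behavior."""
    return next(iter(settings))


def lookup(settings, name):
    """Robust caller: uses only what the contract promises."""
    return settings[name]


# The two versions are equal as mappings...
same_contents = load_settings_v1() == load_settings_v2()
# ...but the fragile caller sees a different answer after the "safe" change.
before = first_setting_name(load_settings_v1())
after = first_setting_name(load_settings_v2())
```

Nothing in the contract was broken between the two versions, but the fragile caller still changes behavior, which is exactly the situation the law describes.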

Culture

Genius is a myth. It exists because of our inherent insecurities. Many technology celebrities made their achievements by working with others. Don’t hide in your cave.
Hiding is considered harmful (example with bike gear shifter).
Bus factor
Three Pillars of social interaction: humility, respect, trust.
Techniques: rephrase criticism, fail fast and often
Blameless
Politicians never admit mistakes but they are on a battlefield
Googley: thrives in ambiguity, values feedback, challenges status quo, puts the user first, cares about the team, does the right thing

Knowledge Sharing

Challenges: lack of psychological safety, information islands, duplication, skew, and fragmentation, single point of failure, all-or-nothing, parroting (copy-paste without understanding), haunted graveyards
Documenting knowledge scales better than one-to-many sharing by an expert, but an expert might be better at distilling all of the knowledge for a particular use case.
Psychological safety is the most important part of an effective team. Everyone should be comfortable asking questions. Personal comment: I find myself guilty of not always providing a safe environment for my colleagues; in the past I would use “well-actually” and interrupt frequently.
The biggest mistake beginners make is not asking questions immediately and continuing to struggle.
1-to-1 doesn’t always scale so community and chat groups can be used instead.
A few conditions are needed to justify tech talks and classes: the topic is complicated enough, the topic is stable, and teachers are needed to answer questions.
Documentation is a vital part of keeping and sharing knowledge.
“No Jerks”. Google nurtures a culture of respect.
Google provides incentives to share knowledge (for example peer bonus, kudos, etc).
go/ links and codelabs
Newsletters have their niche in delivering information.
Code readability process. I personally was always looking for a way to make code look more uniform throughout the projects I worked on. And here we are: this is implemented on the scale of an entire large company. The trade-off is increased short-term code-review latency for the long-term payoff of higher-quality code.

Engineering for Equity

Unconscious bias is always present and big companies have to acknowledge its presence and work towards ensuring equality.
Google strives to build for everyone. AI based solutions are still behind and still include bias.
It is ok to challenge existing processes to ensure greater equity.

How to Lead a Team

A really great chapter.

Roles at Google are: The Engineering Manager (people manager), The Tech Lead (tech aspects), The Tech Lead Manager (mix)
Technical people often avoid managerial roles as they are afraid to spend less time coding
“Above all, resist the urge to manage” – Steve Vinter
“Traditional managers worry about how to get things done, whereas great managers worry about what things get done (and trust their team to figure out how to do it).”
Antipatterns:
- Hire pushovers
- Ignore Low Performers
- Ignore Human Issues
- Be Everyone’s Friend (people may feel pressure to artificially reciprocate gestures of friendship).
- Compromise the Hiring Bar
- Treat your Team Like Children
Patterns:
- Lose the Ego – gain the team’s trust by apologizing when you make mistakes, not pretending to know everything, and not micromanaging.
- Be a Zen Manager – being skeptical might be something to tone down as you transition to a manager. Being calm is very important as it spreads. Ask questions.
- Be a Catalyst – build consensus.
- Remove Roadblocks – In many cases, knowing the right person can be more valuable than knowing how to solve a particular problem.
- Be a Teacher and a Mentor
- Set Clear Goals – entire team has to agree on some direction
- Be Honest – Google advises against “compliment sandwich” as it might not deliver the message. Example with analogy of stopping trains.
- Track happiness – A good way is to always ask at the end of 1:1 “What do you need?”
- Delegate, but get your hands dirty
- Don’t wait, make waves when needed
- Shield your team from chaos
- Let the team know when they do well
People Are Like Plants – very interesting analogy
Intrinsic vs. Extrinsic Motivation

Leading at Scale

Three “always”: Always be Deciding, Always be Leaving, Always be Scaling
Always Be Deciding is about trade-offs (plane story); best decision at the moment; decide and then iterate
Always be Leaving is about not being a single point of failure, avoiding bus factors
Product is a solution to a problem. Make teams work on problems not products that can be short lived.
The cycle of success: analysis, struggle, traction, reward (which is a new problem)
Important vs. Urgent – gtd, dedicated time scheduling, tracking system
Top leaders should learn to deliberately drop some balls and conserve their energy. It is good enough to focus on 20% of things.

Measuring Engineering Productivity

Google decided that having a dedicated team of experts focusing on eng productivity is a great idea. Entire research efforts were even dedicated to studying productivity.
Case-study: readability
Google measures using Goals/Signals/Metrics
Signal is the way in which we will know we have achieved our goal
Metric is exactly how we measure signals
Actions Google takes are “tool driven” – tools have to support new things instead of just telling engineers to change their behavior
For qualitative metrics consider surveys
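
The Goals/Signals/Metrics notes above can be pictured as a small data structure. This is my own sketch, not something from the book or from Google tooling: the `Goal` and `Signal` classes, the `unmeasured_signals` helper, and the example texts about readability are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Signal:
    """How we would know the goal has been achieved."""
    description: str
    metrics: list = field(default_factory=list)  # how the signal is measured


@dataclass
class Goal:
    """The desired outcome, stated without reference to any measurement."""
    description: str
    signals: list = field(default_factory=list)


def unmeasured_signals(goal):
    """Return signals that have no metric yet, i.e. gaps in the plan."""
    return [s.description for s in goal.signals if not s.metrics]


# Hypothetical example inspired by the readability case study.
readability = Goal(
    description="Engineers write higher-quality code thanks to readability",
    signals=[
        Signal(
            "Engineers with readability rate their own code quality higher",
            metrics=["Survey: share of engineers satisfied with code quality"],
        ),
        Signal("Readability review does not slow code review down too much"),
    ],
)

gaps = unmeasured_signals(readability)
```

Listing the signals first and the metrics second makes it visible when something important has no measurement at all, as the second signal shows here.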

Style Guides and Rules

Google has style guides, which are about more than just style; they are more about rules.
Rules have to be: optimized for the reader of the code, have weight and not include unnecessary details, and have to be consistent.
Rules can change, for example at scale and over very long periods of time. One example is the C++ rules and starting to use std::unique_ptr.

Code Review

This covers typical code review flow. Snapshots and LGTM.
There are three aspects: any other engineer’s LGTM, the owner’s LGTM, and readability LGTM. All three can come from one person.
Code Review benefits: code correctness, code is comprehensible, consistency, ownership promotion, knowledge sharing, historical record.
Code reviews help with validation for people who suffer from imposter syndrome.
There is a big emphasis on writing small changes
Write good change descriptions
Adding more reviewers leads to diminishing returns
Code review buckets: greenfield, behavioural changes, bug fixes, refactorings and large scale changes
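
The three approval aspects noted above (any engineer’s LGTM, the owner’s LGTM, readability LGTM) can be sketched as a simple check. This is only my illustration of the rule and not how Google’s review tooling is implemented; the `Reviewer` class and `missing_approvals` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reviewer:
    name: str
    is_owner: bool = False         # owns the code being changed
    has_readability: bool = False  # certified for the language in question


def missing_approvals(approvers):
    """Return which of the three aspects are still missing for a change."""
    missing = []
    if not approvers:
        missing.append("lgtm")
    if not any(r.is_owner for r in approvers):
        missing.append("owner")
    if not any(r.has_readability for r in approvers):
        missing.append("readability")
    return missing


# One person can cover all three aspects at once.
senior = Reviewer("senior", is_owner=True, has_readability=True)
peer = Reviewer("peer")

nobody_yet = missing_approvals([])
peer_only = missing_approvals([peer])
senior_only = missing_approvals([senior])
```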

Documentation

Most documentation comes in the form of code comments
Documentation scales over time
Arguably documentation is like code, so it should be treated like code (policies, source control, reviews, issue tracker)
Wiki-style documentation didn’t work out because there were no clear owners; documents became obsolete and duplicates appeared. This grew into the number one complaint.
Documentation has to be audience oriented.
Types of documents: reference documentation (what is code doing), design, tutorials (clear to reader), conceptual docs, landing pages (only links to other pages).
Good document should answer: WHO (audience), WHAT (purpose), WHEN (date), WHERE (location of doc, implicit), WHY (what will it help with after reading).
“CAP theorem” of document: Completeness, Accuracy, Clarity.
Technical writers are helpful for documenting the boundaries between teams/APIs but are not a great asset for team-specific documentation.

Testing Overview

What does it mean for a test to be small? It can be measured in two dimensions: size (can be defined by where they run: process, machine, wherever) and scope.
Small – single process (no db, no IO, preferably single-threaded); Medium – single machine, no network calls; Large – no restrictions. This can be enforced by attributing tests and by the frameworks that run tests.
Scope is how much code is evaluated by a test.
Flakiness can be avoided by multiple runs – this is trading CPU time for engineering time.
It is discouraged to have control flows in tests (if statements, etc).
Testing is a cultural norm at Google.
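
The small/medium/large notes above can be turned into a tiny classifier. This is a sketch under my own assumptions: the `Size` enum, `minimum_size`, and `check_declared_size` are hypothetical names, and the rules encode only my notes, not the exact constraints of any real test framework.

```python
from enum import Enum


class Size(Enum):
    SMALL = 1   # single process, no db, no IO
    MEDIUM = 2  # single machine, no network calls
    LARGE = 3   # no restrictions


def minimum_size(uses_io=False, multi_process=False,
                 uses_network=False, multi_machine=False):
    """Smallest size label that permits everything the test does."""
    if uses_network or multi_machine:
        return Size.LARGE
    if uses_io or multi_process:
        return Size.MEDIUM
    return Size.SMALL


def check_declared_size(declared, **usage):
    """Raise if a test declares a smaller size than its usage allows.

    This mimics the idea of a framework enforcing the size attribute.
    """
    needed = minimum_size(**usage)
    if declared.value < needed.value:
        raise ValueError(
            f"declared {declared.name} but needs at least {needed.name}"
        )
    return declared
```

A test that declares itself small but touches the disk would be rejected by `check_declared_size(Size.SMALL, uses_io=True)`, which is the kind of enforcement the attribute makes possible.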

Unit Testing

Aim for 80% UT and 20% broader tests.
Tests have to be maintainable.
Big emphasis on testing only via public API.
Test behaviors and not methods.
Jasmine is a great example of behavior driven testing.
A little duplication in tests is OK; not just DRY. Follow DAMP (Descriptive And Meaningful Phrases) over DRY.
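
Here is a minimal sketch of what the notes above can look like in practice: tests named after behaviors, going only through the public API, each one spelling out its own setup even if that repeats a line. The `Account` class is a made-up example of mine and the tests use Python’s standard `unittest`.

```python
import unittest


class Account:
    """Hypothetical class under test."""

    def __init__(self, balance=0):
        self._balance = balance

    @property
    def balance(self):
        return self._balance

    def withdraw(self, amount):
        if amount > self._balance:
            raise ValueError("insufficient funds")
        self._balance -= amount


class AccountBehaviorTest(unittest.TestCase):
    # One test per behavior, not one test per method: withdraw() has
    # two behaviors, so it gets two tests.

    def test_withdrawal_reduces_balance(self):
        account = Account(balance=100)  # setup repeated on purpose (DAMP)
        account.withdraw(30)
        self.assertEqual(account.balance, 70)

    def test_overdraw_is_rejected_and_balance_is_unchanged(self):
        account = Account(balance=100)  # setup repeated on purpose (DAMP)
        with self.assertRaises(ValueError):
            account.withdraw(150)
        self.assertEqual(account.balance, 100)
```

The tests never touch `_balance` directly, so the internal representation can change without breaking them.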

Test Doubles

Seams – way to make code testable; effectively DI.
Mocking: fake (similar to real impl); stub – specify return values; interaction testing (assert was called)
Google’s first choice for tests is to use real implementations. Arguably this classical testing approach is more scalable.
Fakes at Google are written and have to have tests of their own.
It is fair to ask owners of an API to create fakes.
Interaction testing is not preferred though warranted if no fake exists or we want to count number of calls (state won’t change).
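
The difference between a fake, a stub, and interaction testing from the notes above is easier to see in code. This is my own minimal sketch using Python’s standard `unittest.mock`; `FakeUserStore` and `greet` are hypothetical names and there is no real store behind them.

```python
from unittest import mock


class FakeUserStore:
    """Fake: a lightweight in-memory implementation that behaves like
    the real store would. At Google a fake like this would have tests
    of its own."""

    def __init__(self):
        self._users = {}

    def save(self, user_id, name):
        self._users[user_id] = name

    def load(self, user_id):
        return self._users.get(user_id)


def greet(store, user_id):
    """Code under test. The store is passed in, which is the seam."""
    name = store.load(user_id)
    return f"Hello, {name}!" if name else "Hello, stranger!"


# State testing with a fake: exercise the code, then check the result.
fake = FakeUserStore()
fake.save(1, "Ada")
greeting_from_fake = greet(fake, 1)

# Stubbing: hard-code the return value of a single call.
stub = mock.Mock()
stub.load.return_value = "Grace"
greeting_from_stub = greet(stub, 2)

# Interaction testing: assert how the dependency was called.
stub.load.assert_called_once_with(2)
```

The last line is the style the notes call less preferred: it checks how the code talks to its dependency and not what the code produces.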

Larger Testing

General structure of a large test: obtain system under test (SUT), seed necessary test data, perform actions using the SUT, verify behaviors
Larger Tests are necessary as UT cannot cover the scope of complicated interactions
Larger tests could be of different types: performance, load, stress testing; deployment configuration testing; exploratory testing; UAT; A/B diff testing; probes and canary; disaster recovery testing; user evaluation; etc.

Deprecation

Code is a liability, not an asset.
Migrating to new systems is expensive. In-place refactorings and incremental deprecations can keep systems running while making it easier to deliver value to users.
Deprecations: advisory and compulsory.
For compulsory deprecation it is always best to have a team of experts responsible for this.

Version Control and Branch Management

Google got stuck with centralized VCS. The book argues that there really are no distributed VCS, as all of them would require a single source of truth.
I used to work with both centralized and distributed systems. What Google does is a bit different still, as branching is not encouraged at all.
Sometimes it is OK to have two sources of truth. For instance, Red Hat and Google can have their own versions of Linux.
Arguably, if UT, integration tests, CI, and other policies are in place, dev branches are not needed.
Another disadvantage of dev branching is the difficulty of testing and isolating problems after merges.
“One Version” – for every dependency in our repository, there must be only one version of that dependency to choose.
Hopefully in the future OSS will arrive at a virtual monorepo with a “One Version” policy, which will hugely simplify OSS development over the next 10-20 years.
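
As a rough illustration of the “One Version” note above, the rule can be checked mechanically over a list of declared dependencies. This is my own sketch; the `one_version_violations` function, the project names, and the library names are all hypothetical.

```python
from collections import defaultdict


def one_version_violations(declared):
    """Find dependencies that are pinned to more than one version.

    `declared` is an iterable of (project, dependency, version) tuples.
    Returns {dependency: sorted list of versions} for each violation.
    """
    versions = defaultdict(set)
    for _project, dependency, version in declared:
        versions[dependency].add(version)
    return {dep: sorted(v) for dep, v in versions.items() if len(v) > 1}


# Hypothetical repository contents.
declared = [
    ("search", "libjson", "2.1"),
    ("ads", "libjson", "2.1"),
    ("mail", "libjson", "1.9"),   # violates One Version
    ("mail", "libcrypto", "3.0"),
]

violations = one_version_violations(declared)
```

In a monorepo the check is trivial because there is only one place a dependency can live; the sketch shows what has to be detected when that is not the case.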

Static Analysis

Static analysis has to be scalable and incremental, running analysis only on the relevant parts.

Dependency Management

Managing dependencies is hard. Big companies come up with their unique solutions.
A dependency is a contract so if you provide one this will cost you.
SemVer is a standard in versioning dependencies but is only an attempt to ensure compatibility.
Compatibility can also be ensured using test harnesses.
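
The point that SemVer is only an attempt to ensure compatibility becomes clearer when you write down what a version number actually claims. Below is a minimal sketch of mine that handles plain MAJOR.MINOR.PATCH strings only and ignores pre-release and build tags; `parse` and `claims_compatible` are hypothetical helper names.

```python
def parse(version):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of three ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def claims_compatible(current, candidate):
    """True if SemVer *claims* candidate is a drop-in upgrade from current.

    This is only the maintainer's promise encoded in the number. Whether
    the upgrade is really safe still has to be verified, e.g. with tests.
    """
    cur, cand = parse(current), parse(candidate)
    if cur[0] == 0 or cand[0] == 0:
        # In 0.y.z anything may change at any time.
        return cur == cand
    # Same major version, and not a downgrade.
    return cand[0] == cur[0] and cand >= cur
```

For example, `claims_compatible("1.4.2", "1.5.0")` is true and `claims_compatible("1.4.2", "2.0.0")` is false, but neither answer says anything about behaviors that callers depended on without a promise.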

Large Scale Changes

LSC might be semi-unique to Google and other super-large companies, but at this scale changing int32 to int64 everywhere in the code base requires special processes.

Continuous Integration

Everyone does CI these days and CI systems become ever more complex.
Actionable feedback from CI is critical.

Continuous Delivery

“Velocity is a team sport” of reliably and incrementally delivering new value fast.
Ways to achieve this are: evaluating changes in isolation; shipping only what gets used; shifting left (moving more things to earlier in the pipeline); shipping early and often.

Compute as a Service

Container-based architecture adapted to a distributed and managed environment with common infrastructure is a solution to handling scale.
