Valve's Culture, Self-Organization and Scrum

“If you don’t like change, you’re going to like irrelevance even less.” 
- General Eric Shinseki

Valve Corporation is an enormously successful game development and digital distribution company headquartered in Bellevue, Wash. In the spring of 2012, Valve's New Employee Handbook was released.

Its release has led to a number of discussions about the merit of The Cabal (what Valve calls their approach of having small cross-functional teams implement core features for their games). For me, it's hard to argue with success and everything I've read about Valve being a great place to work, so I read the handbook closely.

Ever since first reading about The Cabal in 1999, and later attending a few of Valve's conference sessions, I've been inspired. That inspiration helped lead me to agile thinking.

I felt there was a connection between agile and the kind of place, like Valve, where I wanted to work: a place where rigid process and hierarchies were considered a mismatch to creative development.

Valve's handbook states this belief near the start:

Hierarchy is great for maintaining predictability and repeatability. It simplifies planning and makes it easier to control a large group of people from the top down, which is why military organizations rely on it so heavily. But when you’re an entertainment company that’s spent the last decade going out of its way to recruit the most intelligent, innovative, talented people on Earth, telling them to sit at a desk and do what they’re told obliterates 99 percent of their value. We want innovators, and that means maintaining an environment where they’ll flourish.

Self-Organization

The handbook goes on to describe the role of an employee in this environment. The criticisms I've heard about The Cabal often say that you need the right kind of people for this to work.  

I agree, but I think that the potential pool of such people is larger if you provide mentoring to help the transition into such an environment.  

The handbook acknowledges this as a challenge:

There are a number of things we wish we were better at:
-Helping new people find their way. We wrote this book to help, but as we said above, a book can only go so far.
-Mentoring people. Not just helping new people figure things out, but proactively helping people to grow in areas where they need help is something we’re organizationally not great at. Peer reviews help, but they can only go so far.

This is a common challenge in any studio that is attempting to improve self-organization. People have to unlearn a lifetime pattern first imposed in our public education systems, which train children how to work in a task-driven, top-down hierarchical organization.  

It's an even greater challenge for a studio that has been hierarchical, since it threatens the status quo (more on this later).

Self-organization and hierarchies aren't mutually exclusive. Gabe Newell leads Valve; it's not a pure democracy, but Valve doesn't have many layers between him and an artist creating texture maps.

Nor is the artist being handed a list of texture maps he or she is assigned to create during the week. The artist is expected to be a professional and is treated like an adult by being allowed to be personally accountable.  

What self-organization does is flatten hierarchies and shorten the chains of communication between people who need to talk to one another.

Scrum and The Cabal Have the Same Goals

Scrum is a framework for iterative and incremental product development based around self-organizing teams.

Team size, sprint durations, Scrum roles, etc. are meant to foster self-organization. Every two to three weeks, the team inspects their work and practices and seeks ways to improve both.  

The roles provide clear interfaces and areas of ownership. The benefit is that after a while, every individual should find motivation in seeking these improvements.

This motivation builds on itself, accelerates and leads to a better working environment.  

Good working environments and profitability aren't mutually exclusive. Valve says it enjoys high retention rates, and in 2011 Forbes pointed out that it also makes more profit per employee than even Google or Apple!

The goal of Scrum adoption is not to "do Scrum perfectly" but to establish a framework that will lead to such a culture. It's been referred to as a starting script for self-organization.

Why Is It So Hard Then?

So why do few companies ever achieve similar cultures? Why is it so challenging for organizations that adopt the Scrum framework to become like Valve?

This is the big question.  

I believe it mostly lies in cultural resistance. As mentioned earlier, an organization that grows in a hierarchical pattern resists the adoption of self-organization.  

This applies as well to managers who see their value tied to a command-and-control structure. Even in the face of studio extinction, these forces resist change.

I once heard the comparison of a manager resisting change in a failing studio to that of the Titanic passenger with the finest cabin refusing to evacuate!

Resistance also comes from developers who focus on their tasks and discipline, and leave accountability to their bosses. This feels safe, especially in a culture that hands out blame like candy during Halloween.

Valve has had the benefit of fostering self-organization and growing by hiring people who work well in that environment.

It doesn't hurt that Valve is self-funded and somewhat isolated from external customers. It also doesn't hurt that they own their intellectual property.  

But this doesn't mean the path from a hierarchical culture to self-organization is impossible. It's definitely hard and it does take time.

There is a revolution taking place right now in how we work that may take a generation to become commonplace.  

We have more examples every year that show us how to get there and what the benefits are. Scrum can help that transition occur, if the values, not necessarily the practices, are followed.

A major goal of my "Agile Game Development: Essential Gems" course was to offer the gems of transition from practices to values.

Note: This is an updated version of an original blog post that was previously published here.

The Coffee of Destiny

A few years ago, I pulled into one of the many local Starbucks, like I did every day. But on this day, the world changed for me.

We had been applying Scrum successfully on our games for a while, but we had a problem with applying it during production. 

Often, games go through at least two phases before they launch. One is pre-production, where the core gameplay features and engine are developed. 

The second is production, where the hours of gameplay content (levels, characters, story) are developed using those features and technology. 

As much as we tried to be agile, we couldn’t eliminate that separation of phases. 

The problems were that during production there was more of a flow of handoffs from one discipline to another, and large assets, such as levels, took far longer than a single sprint to complete.

This led to a lot of unfinished work at the end of every sprint and bottlenecks between the handoffs. 

Now, if you are a straight-coffee drinker who frequents Starbucks, like me, you probably appreciate that you don't have to wait for all the lattes, cappuccinos, etc. ordered ahead of your cup of coffee to be made first. 

The barista works on those exclusively and the cashier can directly pour your coffee for you.   

That day at Starbucks, the light bulb turned on.

I thought, maybe we can learn something from this. A bit of Googling showed that Starbucks was applying a practice called Kanban.

Kanban is a Japanese word that roughly translates to “signal card.” It’s a set of practices that focuses on the flow of work and uses the state of the work in progress to signal to the people doing the work what they should do.

At Starbucks, the empty coffee cups, marked with your misspelled name and order details, are the Kanban. The number of cups waiting between the cashier and barista tells them what they should do next.

Looking at the Kanban board above, my coffee goes from the "order" column to the "leave" column directly. 

Based on the size of the line, the barista and cashier might help one another out. If the line is long and the barista has nothing to do, they’ll ask people in line what drink they want and start making that before the cashier takes their order. 

Conversely, if the line empties but there is a backlog of unfinished drinks, the cashier will join the barista in making drinks.
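The helping-out policy described above can be sketched as a small decision function. This is only an illustration, not anything Starbucks publishes: the function name, the role strings and the LONG_LINE threshold are all hypothetical.

```python
# Hypothetical threshold for what counts as a "long" line of customers.
LONG_LINE = 4

def next_action(role, people_in_line, cups_waiting):
    """Choose a worker's next step from the visible state of the work."""
    if role == "barista":
        if cups_waiting > 0:
            return "make the next drink"
        if people_in_line >= LONG_LINE:
            # Nothing to make and the line is long: pull work forward.
            return "ask people in line for their orders"
        return "wait"
    if role == "cashier":
        if people_in_line > 0:
            return "take the next order"
        if cups_waiting > 0:
            # Line is empty but drinks are backed up: join the barista.
            return "help the barista make drinks"
        return "wait"
    raise ValueError(f"unknown role: {role}")
```

Nobody assigns these tasks. Each person reads the cups and the line, and the state of the work tells them where to help.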

This benefits everyone. A key metric for Starbucks is the customer cycle time: the time between walking in the door and walking out with your drink.

The critical path for coffee drinkers and latte drinkers isn't the same, but it isn't entirely separate; much as I personally would enjoy it, there is no separate cashier line for coffee drinkers. 

Starbucks has chosen not to optimize specifically for us straight-coffee drinkers for good reason. 

This is similar to the approach you might use for asset types. Although every asset will have a large variation of effort needed (like that between coffee and a latte) and partially separate paths, measuring every asset's cycle time will still give us valuable information.   

The goal isn't to achieve a uniform cycle time for all assets, just as people who order lattes should expect to wait longer at Starbucks than us super-efficient coffee drinkers. 
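As a rough sketch of what measuring cycle time per asset class could look like, the snippet below averages the days from start to finish for each class. The asset names, classes and day numbers are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (asset name, asset class, day started, day finished)
finished_assets = [
    ("crate", "prop", 1, 3),
    ("barrel", "prop", 2, 5),
    ("harbor", "level", 1, 21),
    ("castle", "level", 4, 30),
]

def average_cycle_time_by_class(records):
    """Return the mean days from start to finish for each asset class."""
    durations = defaultdict(list)
    for _name, asset_class, started, finished in records:
        durations[asset_class].append(finished - started)
    return {cls: mean(days) for cls, days in durations.items()}

cycle_times = average_cycle_time_by_class(finished_assets)
```

Keeping the classes separate is the point: a level is the latte and a prop is the straight coffee, so each is compared only against its own history.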

Let's look at the Kanban board that shows various assets going through a game asset production pipeline:

This board includes assets that might need particle FX or animation applied to them, or neither. The important principles apply. 

We're going to measure the throughput and limit the work-in-progress (WiP) regardless of which steps are taken. Some assets will skip some steps, just as my coffee skips the barista.

Doing this can improve the entire system. As a coffee drinker, I don't care how quickly the barista can make a latte, but I greatly appreciate when the under-tasked barista helps fill coffee orders. 

This can happen in an asset production pipeline as well. As we measure throughput, we can create such policies in a production pipeline: Starbucks has far shorter coffee cycle times than barista-drink cycle times and that is fine for everyone.   

The key is to measure throughput for different asset classes and explore where and when improvements for classes can improve their cycle time without impacting the other classes. 

Most production pipelines are far more complex than this, but the same principles apply. Start by simply modeling what you're doing now. Then measure throughput and reduce WiP.  
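A minimal sketch of a WiP-limited board, assuming a simple pull rule: an asset can only enter a column that has room. The class name, the column names other than animation and particle FX, and the limits themselves are hypothetical.

```python
class KanbanBoard:
    def __init__(self, wip_limits):
        # wip_limits maps a column name to the most assets it may hold at once.
        self.wip_limits = dict(wip_limits)
        self.columns = {name: [] for name in wip_limits}

    def can_pull(self, column):
        """True if the column is below its WiP limit."""
        return len(self.columns[column]) < self.wip_limits[column]

    def pull(self, asset, column):
        """Move an asset into a column if there is room; report success."""
        if not self.can_pull(column):
            return False
        # Remove the asset from wherever it currently sits, if anywhere.
        for items in self.columns.values():
            if asset in items:
                items.remove(asset)
        self.columns[column].append(asset)
        return True


board = KanbanBoard({
    "model": 2,
    "animation": 1,
    "particle FX": 1,
    "done": float("inf"),  # finished work is not limited
})
board.pull("crate", "model")
board.pull("dragon", "model")
blocked = board.pull("torch", "model")   # refused: "model" is at its limit
board.pull("crate", "done")              # a static prop skips animation and FX
unblocked = board.pull("torch", "model") # room has opened up
```

A refused pull is the signal: instead of starting more work, whoever is free helps move something downstream first.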

How Agile Can Remedy a Bad System

Do you want to know who you can blame for the latest fighter jets running hundreds of billions of dollars over budget, getting delayed for years and never living up to their potential?

You can blame me!

Well, you can blame me and thousands of others who have worked on fighters like the F-22 and F-35. These “next generation” planes are examples of what happens when the people making them don’t communicate with one another very well.

Let me tell you a story of a usual day working on the avionics for the F-22, back in 1990.

I was a software engineer working on some code that helped the F-22 communicate digitally with other planes. This was something fighters hadn’t done before and it was an important “feature” of the jet.

If the F-22 could receive digital radar information from a large radar plane 100 miles back behind the fight, it wouldn’t have to use its own radar and expose itself.

A bad system

So this new feature required unique software and hardware. It also used encrypted data, which classified the work as secret. Much of the rest of the jet had parts that were secrets as well. Maybe the name of the jet and the wheels were the only things that weren’t secret.

So, in developing this software for prototype hardware that was itself still in development, I often had questions for the people who made the prototype. Unfortunately, these people were in a different state.

In fact, people working in more than 35 states created the F-22. For Congress to approve the enormous cost of the plane, most members of Congress had to have constituents working on it, so Lockheed Martin broke up the work to ensure approval.

Now it’s hard enough for people separated by a few time zones to communicate, but when they have to communicate about something with a secret classification, it gets downright nearly impossible.

So the conversation process was set up like this:

-I had to request permission from security to have a conversation on the phone.
-Security had to contact the hardware people and schedule a conversation. The earliest was a week away.
-I had to find something else to do for a week.
-When the appointment arrived, I was escorted to a secure room that had a scrambled phone.
-The secure room had “pink noise” (which sounds like a washing machine filling up with water) to mask any sound from leaking out.
-The scrambled phone degraded the sound quality of whoever was speaking.
-A security officer had to be present on both sides of the call.

Finally, when I was able to speak to the hardware engineer about the problem I was having with his hardware, we were not allowed to have any casual conversation to start (in case I learned something about his classified details).

With the pink noise and scrambled phone connection, he was very hard to hear. So I started yelling a question into the receiver about my problem, but before I could get my first sentence out, the security officer in the room interrupted me and said I couldn’t go into specifics on secret technology.

I tried to ask him what the point of the secure room and the scrambled phone was, but he wasn’t budging.

The conversation I waited a week for was reduced to: “I’m having a problem with something you made.”

Obviously the hardware engineer could offer little help beyond suggesting I reboot the hardware.

As a result, we had to put off getting much of the avionics to work for a year until we were able to integrate it all in a shared lab we all visited.

As you may imagine, integration was a disaster and delayed the plane for at least a year. However, we did solve the digital communication problem in a different way, described below.

Fixing a bad system

The managers I worked for on the F-22 program knew the policies of secure communication and had some control over them, but they didn’t fix them, and there was no urgency to do so.

We developers had a sense of urgency, but no power to change the policy, so we stopped communicating with our distant colleagues, which was bad.

The obvious solution is to couple the sense of urgency with the power to create change. This is a core principle in agile. It raises the urgency for fixing these problems to those in power through transparency, and it raises the power and accountability of the developers who are closest to the problems through self-organization.

Creating transparency

Transparency is created through frequent inspection of “what” we are making and “how” we are making it. In Scrum, every one to three weeks, we reconcile what we achieved, as shown in the sprint review, with what we forecast in sprint planning.

The cost and impact of practices like secure phone conversations would become more apparent because they slow the velocity of adding features such as digital communication.

It’s harder to hide behind a complex schedule or hope that the problems will magically go away in some distant integration phase. We integrate as much as possible every sprint.

This transparency and velocity are tools for inspecting a policy such as secure communications and for experimenting with new policies that should increase velocity.

Creating self-organization

Scrum also defines the retrospective, in which developers discuss every iteration what is slowing them down and experiment with new ways of working to improve their velocity.

If developers don’t have this power to try new ways of working, this virtuous cycle does not occur. Fortunately, we had a manager who trusted us and let us experiment.

Developers on secret programs have little power to override security protocols, but we can innovate within them to improve “how” we work.

In this case, when requested, our manager gave us permission to fly out and meet with our distant colleagues instead of calling them. For some reason, security policies allowed us to be left alone in a secure room with the hardware to discuss whatever we felt like.

Although it took more time to fly than to dial, we were able to solve many integration problems quickly in a day of meeting together. As the young, single engineer, I was the one that traveled the most to discuss the hardware and, after a few trips, became a limited expert on the hardware.

From then on, whenever a software engineer at our location had a problem with the hardware, they’d first come to me. If I couldn’t fix their problem, I would add it to my list of things to talk about during my next trip.

The same goes for games, actually

As a result of this new way of working, the digital communication part of the avionics did not have the integration problems that other areas did.

When I switched my career to video game development, I found similar problems occurring with distributed teams.

Although the security policies weren’t to blame, there were other problems with culture and organization that separated disciplines and put a systemic burden on their work, much like I experienced earlier.

So when someone asks me why it’s valuable to have cross-discipline teams who are co-located as much as possible, I remember the F-22.

Distributed or discipline-centric teams might not have the communication barriers I did in 1990. But when thousands of cross-discipline conversations need to occur over the course of development, any barrier in their way, such as management tools, chains of communication or a long hallway to walk down, adds delay or, worse, eliminates crucial conversations.

The moral of the story is that if your teams aren’t performing, take a look at how they’re organized. As W. Edwards Deming said, “A bad system will beat a good person every time.”