Warning: This is a bit of a rant. If at any point in this post, I seem like the crazy one, please tell me somehow. Comments, Twitter, LinkedIn. It would be really valuable to hear your thoughts. But I also encourage you to think about the situation, context and information I provided and decide for yourself whether the way we are acting is crazy. There are craziness checkpoints along the way! And if you think any of this is crazy as well, try to do something about it!
I had a work colleague recently ask a question on our internal mailing list. A summary of the question would be:
A company that is using ‘agile’ wants to know if their formula for converting story points into hours is sound because executives are doing short term (< 6 months) planning.
This is an example of how they break down their work. Seems reasonable and just like many other companies.
They also provided a formula. This is where the craziness starts, and it is startlingly common in our industry.
… and a sample of how the formula works.
Ok. At this point, all that is going through my head is “OMG”. Now comes the process by which they populate the formula.
- PMs provide the Epic Point estimate
- Developers decompose epics and provide point estimates for the user stories.
- Teams points (epic points or user story points?) per resource per sprint are tracked.
This is COMMON in my experience working with many organizations in our industry.
In all fairness to my colleague, he knows this is crazy. The developers in this organization know this is crazy. This is a typical scenario when executives and senior managers are trying to get information but using techniques that we have proven don’t work in our industry.
The good thing is, the way that they decompose work is fine and the formula will work as long as we don’t have 0 velocity or 0 team members.
The existence of the formula and the process by which we find numbers to put into it are where the craziness begins. First, a bit of information on “story points”.
Dave’s Craziness Meter
Story points are an abstraction agile teams use to REDUCE the perception of accuracy. 1 story point will require “some small” bit of effort to complete. 2 story points should require about twice as much effort as 1 story point. So that would be 2 times “some small bit of effort”. 5 story points require about 5 times “some small bit of effort”. I hope I’m clear about how we are purposefully reducing the perception of accuracy. Why? Because it is incredibly difficult (usually impossible) and time consuming to create accurate estimates for the effort required to do knowledge work using the time units desired by decision makers!
I don’t even advocate using story points to my teams anymore. As numbers, it is too easy for people to plug them into a formula to convert them into something they are not intended to be converted into. I now generally only advocate sorting work by sizes like Small, Medium and Large.
Now let’s discuss the numbers that go into the formula from the example.
To start, a PM will make a point estimate in story points. In the example, it is 200 story points. I don’t know why a PM is doing an estimate on technical work.
Point estimates are bad. They imply accuracy where it normally does not exist. We have a point estimate of an abstraction that is trying to reduce the perception of accuracy. At least it is a single point estimate using an abstraction, so it could be OK except humans are optimistic estimators. Our interpretation of an estimate usually means that we expect the epic to fall in the middle of a normal distribution of epics and we have a 50/50 chance of falling under either side of the curve.
The problem with this interpretation is that it treats the likelihood and magnitude of doing better than our estimate as the same as the likelihood and magnitude of doing worse than our estimate. And the other problem with this approach is it’s wrong. Using information from industry observations, we see that plotting the actual effort of our epics will form a log-normal distribution.
On a log-normal distribution, the chance of doing better than our estimate is smaller than the chance of doing worse than our estimate. This means our epics usually land past the mode, which means that they took more effort to complete than we expected. Simply put, when humans provide point estimates, we are usually wrong.
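To make the asymmetry concrete, here is a minimal sketch that draws efforts from a log-normal distribution and counts how often the actual effort exceeds the most likely (mode) value. The median of 200 points and the spread `sigma = 0.5` are assumptions picked for illustration, not numbers from the organization in question.

```python
import math
import random
from statistics import NormalDist

def share_worse_than_mode(mu, sigma, trials=100_000, seed=1):
    """Fraction of simulated efforts that exceed the most likely (mode) effort."""
    rng = random.Random(seed)
    mode = math.exp(mu - sigma ** 2)  # mode of a log-normal distribution
    worse = sum(rng.lognormvariate(mu, sigma) > mode for _ in range(trials))
    return worse / trials

# Assumed parameters: median effort of 200 points, moderate spread.
mu, sigma = math.log(200), 0.5

simulated = share_worse_than_mode(mu, sigma)
# For a log-normal, P(X > mode) equals the standard normal CDF at sigma.
analytic = NormalDist().cdf(sigma)

print(f"simulated share worse than mode: {simulated:.3f}")
print(f"analytic share worse than mode:  {analytic:.3f}")
```

With any positive spread the share is above 50%, and it grows as the spread grows. That is the "usually wrong, and usually late" effect in numbers.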
Let me state that again. When we estimate in knowledge work, we are usually wrong! And the formula above actually understands that!! The historical team buffer is a built-in admission that we need to add 20% to our estimates to make the numbers get closer to the actual values we observe!!! And I’ll bet that most epic actual values exceed the padded estimate!
Dave’s Craziness Meter
Ok, so point estimates are bad. And humans are bad estimators. That all came from the first point of the PM creating an epic point estimate. The rest should go more quickly. We’ve covered a lot of ground in that first exploration of the situation.
Next we have the developers decomposing the epics into stories and providing point estimates on those stories. All of the problems that we encountered in estimating above apply at this level as well.
The problem now is that it doesn’t work to simply sum up the points for the stories to get the points for the epic. If we look at the problems with point estimates, imagine doing that 20-100 times and adding the values together. We know that our estimates are wrong (by at least 20% and probably more), and how wrong we are has a significant impact on effort. The only way to potentially add up the stories to see how much the epic will be is to estimate stories with an expected value that is the mean on a log-normal distribution and run a Monte Carlo simulation using all the stories in the epic. (That is a whole other blog post!)
I’m betting that a Monte Carlo simulation isn’t being used to turn the developer’s stories estimates into a meaningful epic estimate. Never mind the observed expansion (dark matter as David Anderson refers to it) of requirements that we see as we build the software and discover what we’ve missed. It’s always there, we just needed to build the system so that we could find it.
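For the curious, a rough sketch of what such a Monte Carlo simulation could look like. Each story estimate is treated as the mean of a log-normal distribution, and the epic is the sum of one random draw per story, repeated many times. The backlog, the spread `sigma = 0.5`, and the helper names are all hypothetical. The sketch also ignores the dark matter described above, so real results would likely be worse.

```python
import math
import random

def simulate_epic(story_points, sigma=0.5, trials=10_000, seed=7):
    """Return sorted simulated epic totals, one total per trial."""
    rng = random.Random(seed)
    totals = []
    for _ in range(trials):
        total = 0.0
        for points in story_points:
            # Choose mu so the log-normal's mean equals the story estimate.
            mu = math.log(points) - sigma ** 2 / 2
            total += rng.lognormvariate(mu, sigma)
        totals.append(total)
    totals.sort()
    return totals

def percentile(sorted_values, p):
    """Nearest-rank style percentile of an already sorted list."""
    index = min(len(sorted_values) - 1, int(p / 100 * len(sorted_values)))
    return sorted_values[index]

# Hypothetical backlog: 50 stories whose point estimates sum to 200.
stories = [2, 3, 5, 5, 5] * 10

totals = simulate_epic(stories)
for p in (50, 85, 95):
    print(f"P{p}: {percentile(totals, p):.0f} points")
```

The output is a range with probabilities attached rather than a single number. That is the kind of answer a decision maker can actually plan around.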
So our estimate is probably wrong. Significantly wrong. Period. Let’s move forward.
The next part of the formula is:
and the data in our example looks like this:
10 points is the average points per resource (I hate that word) per sprint. I’m assuming that they have standardized sprints for all teams in the organization. 2 is the number of team members (much better description) that will be working on this epic.
Do all developers produce 10 points per sprint, regardless of the task? Regardless of the skill level? Are we talking senior or junior developers? Is the business problem something they’ve done before, or something novel? Is the technology known or new? Stat holidays? What if a developer is sick?
That 10pts could also be interpreted as developer velocity, and as we know, velocity can be highly variable for a team and is even more variable for individuals. In Scrum, we always talk about team velocity because it hides the variability of individual velocities that we know exists by averaging out the effort expenditure across multiple people over time. And we also know that velocity cannot be compared between teams! One team might have a velocity of 10 while another may have a velocity of 20, but the former team is more effective. The velocity number itself should not be interpreted as a standardized representation of effort that can be applied to any large scale planning activity (multiple projects, multiple teams). It only works when you know the project and you know the team and its capabilities and past delivery history in specific contexts.
Dave’s Craziness Meter
So we have an epic estimate (crazy) divided by a developer velocity # multiplied by the number of team members (crazy). The next step is to multiply by the Historical Team Buffer otherwise known as padding the whole estimate.
The number from the formula is 1.2 in order to make the whole estimate 20% larger or 120% of the original estimate.
As I mentioned above, this buffer is a built-in indication that our estimates are wrong! We know we can’t deliver the epics as expected given our estimations, so we pad the estimate to make it closer to what we actually expect to do.
Dave’s Craziness Meter
So after all of that, we’ve arrived at a value of 12 sprints required to deliver this epic.
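Reconstructed from the numbers quoted in this post (200 epic points, 10 points per resource per sprint, 2 team members, a 1.2 buffer), the calculation looks roughly like the sketch below. The function and parameter names are mine, not the organization's.

```python
def sprints_for_epic(epic_points, points_per_person_per_sprint,
                     team_members, buffer=1.2):
    """Sprints needed according to the formula as described in this post."""
    if points_per_person_per_sprint <= 0 or team_members <= 0:
        # The formula divides by velocity times team size.
        raise ValueError("velocity and team size must be greater than zero")
    team_points_per_sprint = points_per_person_per_sprint * team_members
    return epic_points / team_points_per_sprint * buffer

sprints = sprints_for_epic(200, 10, 2)
print(round(sprints, 2))  # 12.0
```

Four inputs, each one questionable, multiplied and divided into one very confident looking number.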
Phew!! Now we need to turn this into an expectation of effort expenditure or schedule.
Oddly enough, based on the question from our internal mailing list, we were not able to determine what the executives wanted. They either want to know how long it will take or how much it will cost. Unfortunately, I’d guess that they expect to get both from that number. They expect that the project will be done in 960 hours of developer effort which will occur in 12 weeks of calendar time.
That is 12 sprints at 80 hours of effort per sprint. The 80 hours seems to be 2 developers at 40 hours per week each.
So, I can with a fair bit of confidence tell you that the work week is 40 hours, 2 man weeks is 80 hours, and if a sprint is 1 week long with 2 people, the project will spend 960 hours of effort in 12 weeks.
What I can’t tell you is if both developers will work 40 effective hours per week. You almost certainly will pay them 40 hours per week, but whether all of that effort is applied to the story is another question. Emails, meetings, slow days, coffee cooler conversations, code reviews, design sessions, non-project related tasks are just some of the kinds of things that eat into that 40 hour work week. Never mind sickness or other work absences. Most people I speak with use 6 hours per day as the number for planning how much time a developer can effectively spend on a story. I’ve seen organizations where a team member regularly, for a variety of necessary reasons, spends less than 2 hours per day on a project they are supposed to be full time on.
So that does two things to our calculation. We’ve determined that the epic is now about 960 hours in effort. If the developers can’t work on it 80 hours per week, the number of iterations has to go up! If the developers are only effective 4 hours per day, that means we have to spend 24 iterations on the project to get it done. Or potentially, the developers can deliver it in 12 weeks, but have to work 40 hours of overtime per week, which means the costs go WAY up for the project as the organization has to pay overtime. Let’s not even try to incorporate the diminishing returns on overtime hours, especially in an environment where that overburdening is chronic.
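A small sketch of that arithmetic, assuming 1-week sprints with 5 working days and the 2 developers from the example. The function name and the list of effective hours per day are illustrative only.

```python
import math

def iterations_needed(effort_hours, developers, effective_hours_per_day,
                      days_per_sprint=5):
    """Whole sprints needed to burn down the effort at a given effective capacity."""
    capacity_per_sprint = developers * effective_hours_per_day * days_per_sprint
    return math.ceil(effort_hours / capacity_per_sprint)

# 960 hours of effort, 2 developers, varying effective hours per day.
for hours_per_day in (8, 6, 4, 2):
    sprints = iterations_needed(960, 2, hours_per_day)
    print(f"{hours_per_day} effective hours/day -> {sprints} sprints")
```

Dropping from 8 effective hours per day to 4 doubles the schedule, and at 2 hours per day the "12 week" epic takes 48 weeks.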
So as an executive, I wanted a fairly certain number for how much effort a specific epic is going to take so that I can do appropriate planning for the next 6 months. And 960 hours and 12 weeks calendar time is something I’m going to hang my hat on. I mean, 960 is a pretty precise number, and the “confidence” that the process has given the organization in that number must be pretty good.
Dave’s Craziness Meter
And since we’re doing organizational planning and planning multiple epics for multiple teams in the six month period, I’m also making the assumption that all epics that are 200 points are the same. And I’m assuming that all story points are the same effort. I’d have to deduce that any epic of 100 points would take exactly half as much effort or half as much time as any 200 point epic. Any 200 point epic is exactly the same as any other 200 point epic. There is a lot of confidence in our formula and approach.
Dave’s Craziness Meter
Ok. I’ve maxed out my craziness at this whole endeavor. My hope is that you think this is all a little crazy as well because if we start to accept that we’re acting a little crazy we can start to do something about it.
I’d like to clarify that this post is not intended to be critical of people who are using Agile estimation techniques and trying to fit inside of an organizational traditional project planning methodology. We are all trying to do the best that we can. But sometimes we get stuck in a way of thinking. I’m hoping that this blog post might just jolt some of you into thinking about these problems and deciding to try and do something about them.
The formula used as an example in this post is just that: a formula. And a very common kind of formula as well. The values you input into the formula will work as long as you don’t have 0 velocity or 0 team members. (No one likes division by 0.) In Kanban we use a lot of formulas and a scientific approach to understanding our work and workflows.
It’s the drivers underlying how we want to use formulas that we really need to be critical of. The formulas and the data that we feed into them have to make sense and drive us towards creating valuable data points on which meaningful decisions can be made. If we use complicated formulas and suspect or fictitious data as parameters, that makes the output worse than garbage, because it provides a false sense of accuracy and confidence: we think “this must be right. Our formula is sound.”
As long as we, as an industry, support this behavior, we are contributing to the continuation of this problem.
There are solutions to this problem, but those options will be explored in another blog post.
But in the meantime, I’d love to hear your feedback! Do you think I’m crazy or do you think our industry is crazy?
Dave, you make several great points. I believe that the process of forming estimates is useful. The effort forces the team to really think what needs to be done and how they might do it. But, they are not committing to any particular approach or implementation. In the end, estimates represent a range of possibilities, as you point out. Problems result when management takes estimates and turns them into a commitment and a deadline. Some teams estimate high to counteract this problem but that seems unethical to me. Estimating in ranges may help but some managers will latch onto the lowest number. It becomes a vicious cycle. Teams have to resist having deadlines imposed on them when the deadlines include a fixed feature set. They need to push back and stand firm. Management can have a deadline or a feature set but not both.
Thank you Vin for the insightful comments. I didn’t state it exactly, but dealing in ranges and probabilities is where I was driving towards with the article and where I will go in the follow-up articles as well.
I don’t exactly agree with you that the forming of estimates is how we should get teams to “really think what needs to be done”. Teams should be thinking about this, but not as a result of having to form an estimate. I think they should defer that level of investigation until they’ve been asked to do the work. We’re mixing up the catalyst for some sort of desired action with the desired output of the exercise. The desired output of an estimation effort is an estimate and not a design spec. Let’s not work on anything until we’ve committed to it.
And you’re right about decision makers seeing and grabbing a hold of numbers! And it is the setting of these expectations that makes our projects so difficult to manage and demoralizing to people involved with the project. I don’t think that we’ll ever be able to push back on the expectations that come from them as long as we continue to provide these “accurate” numbers. We have to do something different, which I’ll explore in the follow-up post!
Thanks again for your insights!
“I’m betting that a Monte Carlo simulation isn’t being used to turn the developer’s stories estimates into a meaningful epic estimate”. You’re probably right, but something along these lines did get built into FogBugz way back in ~2007. (http://www.joelonsoftware.com/items/2007/10/26.html).