Friday, April 27, 2007

An Oakland A's Lesson:
The Average, Without Context, Is Not The Reality.  

The Oakland Athletics are at .500 as of this writing, and that's a tribute to their pitching and defense because they are 12th of 14 teams in on-base percentage (.308) and dead last with a veritably anemic .228 batting average and a spectre of a slugging percentage at .340. By itself, that's not so exceptional as to be worth flapping my lips about, but some analytical remarks in the San Francisco Chronicle this week brought up an issue I was confronting a client with last month: In the desire to simplify analysis in baseball and beyond, we can miss nuances that are so important that they allow for a confident interpretation of a lie as the truth.

In a complex system, something like a mean average or a mode can be a valuable measure, but only if you think it through and measure context and consequences beyond that number.

Susan Slusser, writing for the San Francisco Chronicle, made or passed on this observation:

Over the past 18 games, Oakland starters have not allowed more than three runs in any outing -- the longest such stretch in the major leagues since the 1998 Indians did the same.

Seven times in the past eight games, the A's starter hasn't given up more than one run. {SNIP}

Braden walked the first man he faced, Brian Roberts, but got the next batter, Melvin Mora, to hit into a double play. Nick Markakis popped up, and with that, the A's established a major-league record: In 20 consecutive games to open the season, Oakland has not allowed a run in the first inning. (The 1990 Brewers and the 2005 Red Sox both reached the 19-game mark.)

In the two games subsequent to Slusser's article, they added two games to that streak, so there hasn't been a single start over the last 20 in which A's starters have given up more than three runs. Pretty extraordinary on first glance -- six of those starts were by ostensible 4th or 5th starters and one by a rookie acting as an emergency fill-in in his first major league start.

We can't really know if this was particularly meaningful unless we look at all the individual components (starters' games), which we will in a few paragraphs after a detour to beyond baseball.

My client was buying some used manufacturing equipment, relatively expensive and with a high transportation charge. He normally would have been decisive and bought something without ever talking with me, but the price scared him a little, and he asked me to talk with him about the comparative virtues and vices of the two-and-a-half choices he had (the third was really a half -- he was pretty sure he didn't want it, but wanted a third for the comparison). I asked him for the numbers he used and he sent me a text file with four numbers for each unit: price, presumed remaining life, mean time between failures (MTBF), and what he and I call rated effective throughput [(throughput minus defectives) per hour * % uptime].

He then rolled all of these up into an estimated cost for each choice over its life. Smart, as far as he went. Which turned out not to be far enough.
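A minimal sketch of that kind of roll-up, to make the arithmetic concrete. Every figure below is invented for illustration (the client's actual numbers aren't given here), and the class and field names are hypothetical. The point is that two quite different machines can land on nearly the same averaged cost.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    name: str
    price: float                 # purchase price plus transport, in dollars (made up)
    remaining_life_hours: float  # presumed remaining life (made up)
    good_parts_per_hour: float   # throughput minus defectives, per hour (made up)
    uptime_fraction: float       # % uptime expressed as 0..1 (made up)

    def rated_effective_throughput(self) -> float:
        # (throughput minus defectives) per hour * % uptime
        return self.good_parts_per_hour * self.uptime_fraction

    def cost_per_good_part(self) -> float:
        # Price spread over everything the unit is expected to produce in its life.
        lifetime_output = self.rated_effective_throughput() * self.remaining_life_hours
        return self.price / lifetime_output


units = [
    Unit("Unit A", 120_000, 20_000, 50, 0.90),
    Unit("Unit B", 100_000, 16_000, 55, 0.85),
]

for u in units:
    print(f"{u.name}: {u.rated_effective_throughput():.2f} good parts/hour, "
          f"${u.cost_per_good_part():.4f} per good part")
```

With these invented inputs the two units come out within a fraction of a cent of each other per good part, which is exactly the situation where the average stops being informative and the pattern of failures starts to matter.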

Because he was presuming averages, and the business he's buying this unit to move into is all about turning on a dime with short turnaround orders of custom runs. And in spite of the fact that he knows as well as anyone that mean time between failures doesn't mean you have any idea when the system will fail in your shop over time. Some systems break down a lot in their first year on a line and once the weaker parts are replaced are as indestructible and reliable as a DC-3. Some are great unless you read the maintenance protocols closely and realize that you won't be able to keep the unit's pace down to suggested rates. Some just have tolerance that's so fine that your real world system will kill it. And some break down only in clusters -- never for a long time and then, with a bunch of loosely cause-'n-effect glitches, they go out of service for a long time.

All of these units could have the same MTBF, the same price per unit of rated effective throughput, and have make-or-break differences in how each would fit this particular business. Fortunately, my client had been meeting with other people in this new trade, experienced ones, and his assistant had been researching contract maintenance providers. And what they knew broke his freeze-up. We asked his assistant to find the repair outfits that serviced both of his top choices, and lo and behold, the unit with the slightly less attractive MTBF rating spent less time down per failure because it was almost always the same small group of components that failed and it was routine for repair places to keep them on hand, while the unit with the slightly better MTBF had to order parts from a warehouse in Mexico and they sometimes got slowed at the border.

In many places, raw MTBF might have delivered the most benefit, but in this particular context, it was less about the number of failures than the time each repair took.
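A rough sketch of why repair time can outweigh MTBF, using the standard steady-state availability ratio MTBF / (MTBF + mean repair time). The hours below are invented for illustration only, not the client's figures, and the function names are hypothetical.

```python
def availability(mtbf_hours: float, mean_repair_hours: float) -> float:
    """Long-run fraction of scheduled time the unit is actually running."""
    return mtbf_hours / (mtbf_hours + mean_repair_hours)


def expected_downtime(scheduled_hours: float, mtbf_hours: float,
                      mean_repair_hours: float) -> float:
    """Expected hours lost to repairs over a stretch of scheduled time."""
    return scheduled_hours * (1 - availability(mtbf_hours, mean_repair_hours))


# Hypothetical unit with the better MTBF, but parts that must be ordered in.
slow_fix = {"mtbf_hours": 500, "mean_repair_hours": 40}
# Hypothetical unit with the worse MTBF, but parts the repair shop keeps on hand.
fast_fix = {"mtbf_hours": 450, "mean_repair_hours": 6}

for label, unit in (("better MTBF, slow repair", slow_fix),
                    ("worse MTBF, fast repair", fast_fix)):
    print(f"{label}: availability {availability(**unit):.3f}, "
          f"about {expected_downtime(2000, **unit):.0f} hours down per 2000 scheduled")
```

Even this is still an average. For a shop living on short-turnaround orders, the length of a single outage may matter more than the yearly total, which is one more reason to look at the individual failures.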

The average is not the reality; it's an artifact... of part... of reality.

So how significant was the A's 20-game streak of starters giving up no more than 3 runs? How good were they, and did they show real consistency or were there odd flukes in that run? Let's take a look. To estimate the effectiveness of each start, we're going to use a thumbnail stat left over from early sabermetrics days, Game Score, derived from a formula that includes values for length of start, hits, runs, walks and strikeouts. Game Score was designed so that an average start would be a 50, better ones getting higher scores and worse ones getting lower. While not "Truth", it's a useful indicator.

Date     Opp.  Starter  Result  IP   H  R  ER  HR  BB  SO  Game Score
Apr. 2   @SEA  Haren    L 4-0   6    4  4  0   1   1   2   57
Apr. 3   @SEA  Blanton  L 8-4   6    5  4  4   1   0   7   53
Apr. 4   @SEA  Harden   W 9-0   7    3  0  0   0   2   7   76
Apr. 5   @LAA  Gaudin   W 4-3   5    5  2  2   2   0   3   52
Apr. 6   @LAA  Kennedy  L 5-2   6    6  2  1   1   2   2   54
Apr. 7   @LAA  Haren    L 2-1   7    6  2  1   0   3   2   58
Apr. 8   @LAA  Blanton  W 2-1   5.3  5  1  1   0   2   2   54
Apr. 9   CWS   Harden   L 4-1   6    5  2  2   2   2   6   58
Apr. 10  CWS   Gaudin   W 2-1   5.7  3  1  1   0   3   6   62
Apr. 11  CWS   Kennedy  L 6-3   5    5  1  1   0   3   3   53
Apr. 13  NYY   Haren    W 5-4   5    4  3  3   0   4   5   48
Apr. 14  NYY   Blanton  L 4-3   6.7  5  3  3   1   3   5   54
Apr. 15  NYY   Harden   W 5-4   6    5  1  1   0   2   7   63
Apr. 17  LAA   Gaudin   W 4-1   7.7  4  1  1   0   2   4   69
Apr. 18  LAA   Haren    W 3-0   7    4  0  0   0   0   3   72
Apr. 20  @TEX  Blanton  W 16-4  6    7  3  3   0   1   7   52
Apr. 21  @TEX  Kennedy  L 7-0   5    3  1  1   0   3   5   59
Apr. 22  @TEX  Gaudin   L 4-3   6    4  1  1   0   2   7   65
Apr. 23  @BAL  Haren    W 6-5   7    5  1  1   1   1   5   67
Apr. 24  @BAL  Braden   W 4-2   6    3  1  1   0   1   6   67
Apr. 25  SEA   Blanton  L 2-0   9    6  2  2   2   2   6   71
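For anyone who wants to check the last number on each line, here is a minimal sketch of Bill James's original Game Score formula. It assumes the seven numbers after each game's result read as innings pitched, hits, runs, earned runs, home runs, walks and strikeouts, and that fractional innings are written as decimal thirds (5.3 for five and a third). Both are inferences from the listing, though they do reproduce the listed scores. The function name is my own.

```python
def game_score(ip: float, hits: int, runs: int, earned_runs: int,
               walks: int, strikeouts: int) -> int:
    """Bill James's original Game Score for a single start."""
    outs = round(ip * 3)                        # 5.3 -> 16 outs, 6.7 -> 20 outs
    innings_after_fourth = max(outs // 3 - 4, 0)
    unearned_runs = runs - earned_runs
    return (50
            + outs                       # +1 per out recorded
            + 2 * innings_after_fourth   # +2 per inning completed after the fourth
            + strikeouts                 # +1 per strikeout
            - 2 * hits                   # -2 per hit
            - 4 * earned_runs            # -4 per earned run
            - 2 * unearned_runs          # -2 per unearned run
            - walks)                     # -1 per walk


# Home runs are in the listing but are not an input to the formula.
print(game_score(6, 4, 4, 0, 1, 2))     # Haren, Apr. 2    -> 57
print(game_score(5.3, 5, 1, 1, 2, 2))   # Blanton, Apr. 8  -> 54
print(game_score(9, 6, 2, 2, 2, 6))     # Blanton, Apr. 25 -> 71
```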

I use a rule of thumb -- a start with a game score between 46 and 54 I call "medium", the bracketing amounts "better" or "worse". In this run, there was only one start below 50 (a close enough 48). By that rule, the 19 streak starts listed (April 4 onward) come to 12 "better" and 7 "medium" during the three-or-fewer run (Lola) run. These performances weren't fluky. What's fluke-alicious, though, is that the A's could only muster an 11-9 record during this ecstatic time (the last game that continued the streak isn't on this list -- it was finishing up as I was writing).
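The rule of thumb is easy to run against the list. This sketch takes the Game Scores of the 19 listed starts from April 4 onward (the two before that each allowed four runs, so they fall outside the streak) and treats 46 and 54 themselves as "medium", which is my reading of "between". It reports the spread alongside the mean, since the whole argument is that the mean alone hides the shape.

```python
from collections import Counter
from statistics import mean


def classify(score: int, low: int = 46, high: int = 54) -> str:
    """Bucket a Game Score using the 46-54 'medium' band (endpoints included)."""
    if score < low:
        return "worse"
    if score > high:
        return "better"
    return "medium"


# Game Scores for the listed starts, Apr. 4 through Apr. 25, in order.
streak_scores = [76, 52, 54, 58, 54, 58, 62, 53, 48, 54,
                 63, 69, 72, 52, 59, 65, 67, 67, 71]

tally = Counter(classify(s) for s in streak_scores)

print("starts:", len(streak_scores))
print("mean Game Score:", round(mean(streak_scores), 1))
print("lowest / highest:", min(streak_scores), "/", max(streak_scores))
for bucket in ("better", "medium", "worse"):
    print(bucket, tally[bucket])
```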

Are there blemishes here? In the immortal words of Browns fan Curly Howard, "Why Soy-tenlee".  Because while the starters have been legendarily consistent, the bullpen has not been. As abnormal as the starters' consistency has been, it would be almost an order of magnitude more extraordinary for a bullpen to be that consistent...their tolerances tend to be lower (some appearances start with runners on or in scoring position and tougher-than-average hitters up; if they were weak hitters, it would be probabilistically more likely the manager would leave the starter in). And their appearances are shorter, innings fewer, so there's less opportunity to smooth out rough patches into a flat average. The line 1 inning, 0 hits, 1 strikeout, a very lovely reliever line, could occur several times during a start that was crappy. The line 1 inning, 3 hits, 1 walk, 2 runs, an awful relief line, could happen within an overall "better" start. So in short run efforts (in and beyond baseball), what's good or bad has to be considered in context and you have to look at a short stretch with dramatically high or low results as a piece of the whole, not the whole.

In this case, the bullpen's inconsistency undermined the starters' efforts, and this was exacerbated in part by a gaggle of relatively short starts (under six innings). Over one 23.3-inning stretch spanning seven games, A's relievers gave up 13 runs of their own and let in 4 runners that departing starters had left on base.
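To put that stretch on a familiar scale, here is a quick conversion to runs per nine innings. It assumes "23.3 innings" means 23 and a third (70 outs), and it counts all runs since the figures don't separate earned from unearned. The function name is hypothetical.

```python
def runs_per_nine(runs: int, outs: int) -> float:
    """Runs allowed scaled to a nine-inning (27-out) game."""
    return 27 * runs / outs


bullpen_outs = 23 * 3 + 1      # 23 1/3 innings, assuming decimal thirds
own_runs = 13                  # runs charged to the relievers themselves
inherited_scored = 4           # starters' runners the relievers let in

print(round(runs_per_nine(own_runs, bullpen_outs), 2))
print(round(runs_per_nine(own_runs + inherited_scored, bullpen_outs), 2))
```

The first figure is the relievers' own rate. The second adds the inherited runners, which are charged to the starters but were let in by the bullpen.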

To wrap up on Slusser's newsbyte about the starters: the streak showed actual fine consistency, although in part "three or fewer runs in a start" takes advantage of some pretty short starts (in some cases the five-inning minimum a starter needs to get a win), and a performance like Dan Haren's April 13 start, where he gave up three runs in five innings, is less impressive than a lot of longer starts where pitchers gave up more. The relative success of this approach is what's keeping the A's at an even .500.

This quick trigger on starters who have not yielded runs or many runners is an interesting approach, not generic. In general, in the shorter starts where the pitcher hasn't been knocked out, they have pulled him at below 100, below 90, and even below 80 pitches. Not a common occurrence. Personally, I don't think it's a design for the season, just for early in the season pending Loaiza's return. I could be wrong, but if I am, I think we'll start seeing lots of churning in their bullpen, with AAA players getting looks in relief roles and probably other acquisitions.

I suspect that with their horse, their most reliable starter Barry Zito, having left through free agency before the season, and their veteran backup Esteban Loaiza, someone they are counting on for 25-35 starts and 150-225 innings, out at the start of the season with a neck problem, they are trying to win without burning out the remaining starters. And doing a decent job of it, at least on the pitching side, decent enough to be at about the 1/7th mark of the season at .500, one game out of first in a fairly flat division.

On the other hand, we may be seeing the beginnings of an innovation attempt, a new way to handle pitching workloads. If it continues to work as well as it has, we'll see others try to follow, and those others' success will hinge, to a great degree, on how well the context of their talent works in this anti-protocol. Keep watching for those five- to six-inning, 75-90 pitch starts where the starter leaves without having given up much.

As with my client, more often than not, effective analysis requires more than a look at averages (even intelligently-composed ones). Take a step back from important conclusions long enough to look at stretches of individual data points...there's a lot more useful observation to be had down there.

