
Monday, July 30, 2007

Baseball Insights in Measuring the Soft Data:
Mike Emeigh Nails JKIs  

One of Baseball's most important lessons for managers beyond baseball is in its depth of metrics -- its meaningful ones as well as its plaque ones. In this entry, we'll cover how to evaluate seriously something people "just know".

An anthropology major I went to college with ended up working on an auto assembly line in Ohio and told me one should special order a new car that was built on a Tuesday because the QC people did their best work on that day (and both management and the line knew it). He eventually ended up doing grad school research on it and found there were no hard numbers anywhere to support it -- that the involved people simply "just knew it".

I call this a JKI.

JKIs are not guaranteed to be true or to be false. They are likely to have been true in the moment the "aha" happened. In the instances when a JKI turns out to be not-generally-true, more often than not it's a case of making a judgement on too small a data pool, or a case of ignoring the specific context within which the observers initiated the JKI. (The Tuesday car-build apocrypha originated during an era when that plant had weekday shifts and Saturdays & Sundays off. What would have happened if the days off were Wednesday & Thursday? And what if it was true in one plant but the JKI was generalized to all plants?) A tonne of JKIs "stick" in contexts where they are no longer true or generally true.

One of Baseball's most persistent and truly interesting JKIs is "clutch hitting" (a posited ability to perform better either in important games or in important at bats, than the player does in less-than-important ones).

Fans generally believe some players are "clutch hitters" (Tony Perez, Derek Jeter) and some smaller pool are "chokes" (the opposite of "clutch"). Some of the new wave baseball statisticians believe they have proved clutch hitting doesn't exist...based mostly on the idea that, in general, it's not reproducible (i.e., that if a player truly had "clutch hitting ability" or was "a choke", that player would, with some high level of consistency, repeat the pattern season after season and, in general, players don't).

One of the most insightful baseball researchers is the humble, diligent, reliable Mike Emeigh (sort of a Sabermetric Edgar Martinez), and he's recently executed a study of individuals' hitting in the clutch using methods developed by the researcher known as Tango Tiger, and that Tango wrote up in the book he co-authored, The Book.  Here's some of what Emeigh wrote on Baseball Think Factory.

Baseball fans - and sportswriters - who are not oriented toward statistical analysis tend to have a fixation with the concept of “clutch hitting”. In spite of numerous studies over the years that show that clutch ability - if it exists at all - tends to be relatively small, fans still argue that so-and-so is truly a “clutch god” or a “choker”.

One issue that we’ve had in trying to evaluate clutch performance from an analytical standpoint is that it’s been difficult to come up with a consistent definition of “clutch situation” that doesn’t do one of two things:

1. aggregate too many “unlike things” together (e.g. performance with runners in scoring position, which equates runner on second/two outs with bases loaded/no outs even though there is a very different potential impact on the game situation);

2. reduces the sample size to a point where small variations have tremendous impact (e.g. performance with RISP in late/close situations, where many hitters may have no more than 20-30 appearances in a season)

What Leverage Index does is to place every plate appearance on a sliding scale based on potential game impact. As Tango notes in the quote I highlighted above, most people know clutch when they see it, even if they can’t necessarily define it. LI does an excellent job of accurately capturing the relative importance of game situations from the viewpoint of a typical fan.

{SNIP OUT HIS DESCRIPTION OF TANGO'S WEIGHTING SYSTEM & ITS APPLICATION}

One could, in this manner, develop weighted performance for each player, weighting his PA by the LI of each situation in which he appeared. If the player’s weighted performance was better than his actual performance, one could conclude that he produced more value in game-important situations (e.g. was more “clutch"); if the player’s weighted performance was worse than his actual performance, one could conclude that he produced less value in game-important situations (e.g. was more of a “choker"). The advantage of doing something like this is that every plate appearance for every player can be included in the study, and plate appearances are weighted in a more-or-less appropriate manner based on a consistent definition of the value of the PA.
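To make the weighting idea concrete, here is a minimal Python sketch. It is not Emeigh's or Tango's actual method (the description of the weighting system is snipped above); it simply assumes each plate appearance counts in proportion to its LI, and the `PlateAppearance` fields, the `ops` helper and the LI values are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class PlateAppearance:
    li: float         # Leverage Index of the situation (made-up values below)
    is_at_bat: bool   # False for walks and hit-by-pitch
    reached: bool     # True if the batter reached via hit, walk, or HBP
    bases: int        # total bases: 0 for outs and walks, 1-4 for hits


def ops(pas, use_li=False):
    """OPS over a list of plate appearances, optionally weighting each by LI."""
    def weight(pa):
        return pa.li if use_li else 1.0

    # Simplification: every PA counts toward OBP (sacrifices are ignored).
    obp = sum(weight(pa) for pa in pas if pa.reached) / sum(weight(pa) for pa in pas)
    at_bats = [pa for pa in pas if pa.is_at_bat]
    slg = sum(weight(pa) * pa.bases for pa in at_bats) / sum(weight(pa) for pa in at_bats)
    return obp + slg


# Hypothetical hitter: a homer in a big spot, two outs in quiet spots, one walk.
sample = [
    PlateAppearance(li=2.0, is_at_bat=True, reached=True, bases=4),
    PlateAppearance(li=0.5, is_at_bat=True, reached=False, bases=0),
    PlateAppearance(li=0.5, is_at_bat=True, reached=False, bases=0),
    PlateAppearance(li=1.0, is_at_bat=False, reached=True, bases=0),
]

actual = ops(sample)
weighted = ops(sample, use_li=True)
print(f"actual OPS {actual:.3f}, weighted OPS {weighted:.3f}")
```

Because this invented hitter did his damage in the high-LI plate appearance, his weighted OPS comes out above his actual OPS, which is the "more clutch" reading described above.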

If you head over to the study, you can see that he then elaborates Tango's Leverage Index weightings and proposes a simple way to evaluate individual hitters' weightings. He found the results were not of overwhelming value, for some rational statistical reasons, and notes he's publishing anyway because the findings tend to confirm the fans' general perceptions (JKI).

Looking at the group of good hitters, we have.

Top 5, weighted OPS - actual OPS:

    Carlos Delgado, .285/.391/.566 unweighted,   .310/.416/.618 weighted, 77 point gain 
    Carlos Beltran, .278/.368/.517 unweighted,   .295/.388/.550 weighted, 53 point gain 
    Albert Pujols,  .338/.429/.650 unweighted,   .345/.443/.688 weighted, 52 point gain     
    David Ortiz,    .294/.391/.609 unweighted,   .318/.412/.638 weighted, 50 point gain
    Derek Jeter,    .316/.387/.464 unweighted,   .331/.410/.482 weighted, 41 point gain
   

Bottom 5, weighted OPS - actual OPS:

    Travis Hafner, .299/.404/.590 unweighted,    .289/.399/.563 weighted, 32 point loss 
    Javy Lopez,    .298/.347/.518 unweighted,    .283/.350/.486 weighted, 29 point loss 
    Carlos Guillen,.310/.379/.483 unweighted,    .301/.382/.456 weighted, 24 point loss 
    Miguel Tejada, .306/.356/.505 unweighted,    .296/.351/.489 weighted, 21 point loss 
    Carlos Lee,    .290/.344/.513 unweighted,    .284/.344/.492 weighted, 21 point loss
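For anyone who wants to check the gain and loss columns: OPS is on-base percentage plus slugging, so each figure is just the weighted OPS minus the actual OPS, in thousandths. A small Python sketch follows; the helper names `ops_points` and `clutch_gain` are mine, not Emeigh's, and the parsing assumes every figure is below 1.000 with three decimals, as in the tables above.

```python
def ops_points(slash_line):
    """Return OPS in points (thousandths) from an 'AVG/OBP/SLG' string."""
    # Integer arithmetic on the digits avoids floating-point rounding noise.
    _avg, obp, slg = (int(part.strip().lstrip(".")) for part in slash_line.split("/"))
    return obp + slg


def clutch_gain(unweighted, weighted):
    """Weighted OPS minus actual OPS, in points."""
    return ops_points(weighted) - ops_points(unweighted)


# Three rows copied from the tables above.
rows = {
    "Carlos Delgado": (".285/.391/.566", ".310/.416/.618"),
    "Derek Jeter": (".316/.387/.464", ".331/.410/.482"),
    "Travis Hafner": (".299/.404/.590", ".289/.399/.563"),
}

for name, (unweighted, weighted) in rows.items():
    print(f"{name}: {clutch_gain(unweighted, weighted):+d} points")
```

This prints +77, +41 and -32, matching the published columns.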

The top five have been well-publicized for their “clutchiness”. The bottom 5 aren’t particularly well-known as “chokers” - with the possible exception of Tejada - but Alfonso Soriano, who was sixth from the bottom, does have something of an “unclutch” reputation.

ARod, FWIW, hit .299/.396/.562 overall, but had a weighted performance of .297/.403/.557, for a 2-point OPS gain. This placed him 24th among the 36 good hitters, and especially in comparison to Jeter probably explains a lot of the perception of ARod as a player who doesn’t produce when it counts. Manny Ramirez, who also has a bit of an “unclutch” reputation, hit .311/.412/.602 overall and .312/.429/.594 weighted, a 9-point OPS gain but with a larger loss of power than the typical good hitter showed.

While there are some mismatches between weighted performance and perception - Bobby Abreu was just behind Jeter, JD Drew and Adam Dunn were also pretty high, and Andruw Jones and Miguel Cabrera are fairly low on the list - as a general rule I think that performance weighted by LI matches perception of clutch value quite well. Whether this has any analytical significance remains to be seen, but I think it offers a starting point.

Here in clutch hitting, then, is a JKI that has a large basis in reality.

THE PROBLEMS WITH TURNING JKIs INTO ACTIONABLE INFO
In your own endeavor, as in Baseball, there is a need to examine your JKIs as well as a challenge in finding ways to examine them sensibly. As a management consultant, I generally collect JKIs when I hear them explicitly stated or alluded to -- unexamined JKIs are easy targets & there can be a lot of value in correcting 'em. 

But in Baseball, as in your own endeavor, the challenges fit into a number of basic categories you can address, and Emeigh's piece covers most of them.

#1. Recognize that the metric you define will strongly shape...if not dictate entirely... the value of the outcome.

Mike said:

Your definition of the performance measure shapes what results you arrive at. If the measure of success in a weight-training program is size, the exercises you use to get there are valued differently than if it's strength, and differently still if it's some ability to apply strength or size. I worked with a client not long ago who hired wonderfully creative salespeople who weren't selling a ton. Their approach was a standard reaction for salesfolk compensated by gross and not net...it was to try to cut prices to increase volume. When that failed to acquire enough net, they applied their creativity to inventing custom products -- whatever objection a handful of customers came up with, they invented a unique product for that objection and tried to sell it to them ("What if I could give you...").

If you measured the performance of salesfolk by how many clever & sellable products they elicited from customers (a virtuous thing that isn't core to their job), then you could build a measurement you could collect and report on. But it would be the wrong thing.

In the part of the article I snipped out, Mike discussed Tango's textured weighting system that includes both score differential and inning and pairs them. If you only considered run differential (say, a three-run lead) but not inning, you'd lose the context because hitting a two-run homer down three runs with one out in the 3rd inning is a different context from doing the same thing in the 9th inning...there are fewer outs between you and defeat and in a "normal" game, it's easier to make up three runs in the six innings after the third than it is in the single inning left when the plate appearance occurs in the 9th.

Tango defined meaning around context, and, for this purpose, the right context. They are measuring the right things.

#2. Take care neither to overdefine the measure, so you have too few sample cases to reach decent conclusions, nor to underdefine it and miss the differences in the contexts you'll face.

Mike said:

One issue that we’ve had in trying to evaluate clutch performance from an analytical standpoint is that it’s been difficult to come up with a consistent definition of “clutch situation” that doesn’t do one of two things:

1. aggregate too many “unlike things” together (e.g. performance with runners in scoring position, which equates runner on second/two outs with bases loaded/no outs even though there is a very different potential impact on the game situation);

2. reduces the sample size to a point where small variations have tremendous impact (e.g. performance with RISP in late/close situations, where many hitters may have no more than 20-30 appearances in a season)

What Leverage Index does is to place every plate appearance on a sliding scale based on potential game impact. As Tango notes in the quote I highlighted above, most people know clutch when they see it, even if they can’t necessarily define it. LI does an excellent job of accurately capturing the relative importance of game situations from the viewpoint of a typical fan.
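The sample-size half of that trade-off is easy to put numbers on. The sketch below is my illustration, not part of Emeigh's study: it treats each plate appearance as an independent coin flip with a fixed success rate (a simplifying assumption) and uses the usual binomial standard error, sqrt(p(1-p)/n). The function name `rate_standard_error` and the .300 "true" rate are hypothetical.

```python
import math


def rate_standard_error(true_rate, trials):
    """Binomial standard error of an observed rate over a number of trials."""
    return math.sqrt(true_rate * (1.0 - true_rate) / trials)


# A hypothetical .300 hitter over a late/close RISP sample vs. a full season.
for pa_count in (25, 600):
    se = rate_standard_error(0.300, pa_count)
    print(f"{pa_count:>3} PA: +/- {se * 1000:.0f} points (one standard error)")
```

Over 25 appearances, chance alone moves the observed rate by roughly 90 points in either direction; over 600 it is under 20. That is why a measure that keeps every plate appearance in play has an edge over one that slices the sample down to a couple of dozen.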

#3. Try to go into the effort with a point of view, explicitly held, but be willing to allow the data to reshape it.

The brilliant Paul Saffo, the only futurist I know of who isn't a laughable failure, calls this process "strong opinions, weakly held". Mike showed his skepticism about the wisdom of the crowds in assessing "clutch" hitting. Yet when he delivered the table of individual clutch performance, he led with the case where people were generally correct: the leaders table.

He then pointed out surprises and counter-examples (batters thought to be one thing when they were another).

The stats gang who initially posited clutch hitting doesn't exist overvalued reproducibility because they already believed it didn't exist. Some of those who argued against the cynics believed clutch hitting existed anyway simply because they wanted it to. The reality is something can exist in a few individuals that doesn't exist in the pool of the population as a whole. The fact that most batters don't have a consistent clutch/choke factor doesn't mean the factor doesn't exist, any more than the fact that about 97% of people sprout a "normal" number of adult teeth while 3% don't means that every human enters the world equipped with the potential for the same number of teeth and that variance doesn't exist (if you can't be hyperbolic, be hyperdontic).

#4. Don't overstate the quality of your measure -- and if it's not decisive, think about how to make it better.

Emeigh concluded:

While there are some mismatches between weighted performance and perception - Bobby Abreu (Jeff note: His rap is that he's a poor big-time performer) was just behind Jeter, JD Drew and Adam Dunn were also pretty high, and Andruw Jones (Jeff note: good reputation as a clutch player) and Miguel Cabrera are fairly low on the list - as a general rule I think that performance weighted by LI matches perception of clutch value quite well. Whether this has any analytical significance remains to be seen, but I think it offers a starting point.

All the performance measures you use are items you should examine...regularly if you know they leave room for significant improvement, just occasionally if they are pretty good. You check into even the effective ones once in a while because context changes over time and that reweights factors that make up the components of your measures.

He states the measure has some value. The one criticism I have of the piece is around this point #4. Having just marinated himself in this data for what must have been quite some time, Emeigh should, I think, have proposed one or more next steps to improve its analytical significance -- no one would be in a better position to JKI where improvements might be hiding.

JUST KNOW IT
There are always arguments out there about why you can't have as much accountability in some line of work as they maintain in Baseball. That's an excuse to do nothing.

Channel Mike Emeigh: look at the performance data and analyse it with insight and perspective. You are likely to see how to make changes for the better when performance is important, to be a "clutch hitter" yourself.

