Saturday, May 22, 2004
Facts are stubborn, but statistics are more pliable -- Mark Twain
In almost every organization, there's a powerful need to perform business intelligence (BI): the gathering and synthesis of information, followed by structured analysis. Organizations that are good at BI have Mark McGwire home run-sized advantages in a dozen areas, from marketing analysis to inventory control to sales-force optimization to quality control.
Of course, that's just the need. The reality is that most organizations would rather not invest in serious BI; they'd rather play seat-of-the-pants guesstimation games that impersonate BI with about as much chance of success as The Rock's drag impersonation of the Olsen Twins.
Baseball is way ahead of business organizations in this realm. The least BI-oriented baseball team is way ahead of the average billion-dollar corporation in BI acumen. It's become popular in the broader sabermetric community to ridicule certain teams that reject wholesale adoption of sabermetric analysis. Deserved or not (I'd say "element of truth, but overstated"), it's important to note that even the lowest-common-denominator major league team has a wealth of statistical information and people on staff whose job consists, to a large degree, of analysing data for the purposes of gauging prospective talent and taking advantage of opponents' weaknesses and tactical tendencies.
The challenge comes in putting the data in context to create "intelligence", and that requires a human skill that is not widely available. Beyond baseball, lazy organizations, faced with a shortage of skilled analysts, will either pimp the process and buy cheap (usually ineffective) talent, or pretend it's not worth doing. Baseball is at a bit of an advantage in this respect because, even though baseball doesn't offer big paydays to BI analysts, there are tons of numerate people willing to take less money to work in the National Pastime They Love than they would make doing the same jobs for a brake-lining manufacturer or trash-hauling service.
CONTEXT IS KING
One of the most controversial topics in the numerate baseball community is the set of numbers and techniques analysts use to analyse "ballpark effects", that is, the way that specific stadia affect individual performances. It's necessary for team managers to have some appreciation of it, because a stadium creates a playing environment that will affect, for example, whether the long fly ball a left-handed hitter pulls down the line becomes a long out or a four-bagger. Incrementally, over a long set of events in the same environment, certain players will outperform others they would lag behind in a different environment. So baseball managers incorporate a bit of this knowledge in deciding who to play on a given day and how to construct their lineup.
A baseball general manager would use a bit of this information to assign priorities in judging potential trades (a right-handed batter who hits a ton of homers but not many doubles in the Cubs' home park will likely amp up doubles and lose some homers playing in the Brewers' home field) and in thinking about the kinds of tweener players (not first and second rounders, but still of some interest) a team might draft.
The challenge with most park data is that a park affects different players to different degrees (important non-baseball comparison later). Right-handed hitters, in general, are affected by a stadium differently from left-handed ones. Two right-handed sluggers with the same homer frequency can be affected differently from each other if one hits mostly line drives while the other gets more loft on hits as a rule. By the time you chop the data up into clearly differentiated slices, most of the sample sizes will be so small as to be only marginally useful. And there are a lot of non-talent-related factors: weather patterns that vary year to year, the time of day games start, and other things.
As a result, park factor analysis is pretty controversial. The dispute isn't over whether parks affect results (a small cluster of individuals does argue they don't; they are the same MBWT flat-earthers who point to a small number of valid data points to argue there's no global warming occurring, in the face of a tidal wave of contrary evidence). It's that it's hard to know exactly why a park plays the way it does in any given case, and then how to use the numbers to inform better decisions.
Take this year's Park Factor data (courtesy of ESPN). I've trimmed a few columns, like walks (too indirect) and triples (too rare an event to be meaningful). There are still several things "wrong" with this table that a numerate person wouldn't have allowed to seep through, but let's take a quick look at the data because it's fun to consider.
The way to read the table is to understand the numbers are "normalized" to 1.000. In the league average stadium, the numbers would read 1.000 straight across. Lower numbers mean lower instances. So in Angel Stadium, runs occur at .768 of the league average (about 77% of the league average, about 23% lower than average). In Arlington, there are about 42.5% more doubles than in the average stadium.
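To make the reading rule above concrete, here is a minimal sketch that turns a normalized factor into a percentage above or below the league average. The function name `percent_vs_average` is made up for illustration; .768 is the Angel Stadium run factor quoted above, and 1.425 is simply the factor implied by "about 42.5% more doubles".

```python
def percent_vs_average(factor: float) -> float:
    """Return how far a park factor sits from the league-average 1.000, in percent.

    Negative means fewer events than average, positive means more.
    """
    return round((factor - 1.0) * 100, 1)


# .768 is the Angel Stadium run factor quoted in the text.
print(percent_vs_average(0.768))   # -23.2, i.e. about 23% fewer runs
# 1.425 is the factor implied by "about 42.5% more doubles" in Arlington.
print(percent_vs_average(1.425))   # 42.5
```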
2004 Park Factors
|Park||Runs||HR||H||2B|
|Bank One Ballpark||1.021||1.307||1.064||1.068|
|Olympic Stadium / Bithorn*||.872||.699||.992||1.012|
|Pro Player Stadium||.833||.757||1.019||1.059|
|Tropicana Field / Tokyo Dome*||1.051||1.150||1.009||.972|
|U.S. Cellular Field||1.146||1.361||1.111||1.159|
Park Factor compares the rate of stats at home vs. the rate of stats on the road.
A rate higher than 1.000 favors the hitter. Below 1.000 favors the pitcher.
PF = ((homeRS + homeRA)/(homeG)) / ((roadRS + roadRA)/(roadG))
* Teams with home games in multiple stadiums list aggregate Park Factors.
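The PF formula quoted above translates directly into a few lines of code. This is only a sketch of that one formula; the function name, argument names, and the sample totals are invented for illustration and are not real team numbers.

```python
def park_factor(home_rs, home_ra, home_g, road_rs, road_ra, road_g):
    """PF = ((homeRS + homeRA) / homeG) / ((roadRS + roadRA) / roadG)."""
    if home_g <= 0 or road_g <= 0:
        raise ValueError("need at least one home game and one road game")
    home_rate = (home_rs + home_ra) / home_g   # total runs per home game
    road_rate = (road_rs + road_ra) / road_g   # total runs per road game
    if road_rate == 0:
        raise ValueError("road run rate is zero; the factor is undefined")
    return home_rate / road_rate


# Hypothetical totals, not taken from any real team:
# 190 total runs in 20 home games, 200 total runs in 20 road games.
pf = park_factor(home_rs=100, home_ra=90, home_g=20,
                 road_rs=90, road_ra=110, road_g=20)
print(f"{pf:.2f}")   # 0.95, a park playing slightly in the pitcher's favor
```

Because both the team's own runs and its opponents' runs are counted at home and on the road, the quality of the team's hitters and pitchers largely cancels out, which is the point of the comparison.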
It's a blast to walk through these numbers, in spite of manufactured flaws and small sample size. Remember, this season is about 40 games old for most teams, so that's roughly 20 games at each stadium, a pretty small pile of numbers that could change pretty seriously with just two or three upcoming games in a park going very differently than the few played already.
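A rough sketch of how fragile a 20-game sample is, using the same runs-per-game ratio as the ESPN formula. All of the totals here are hypothetical, chosen only to show the size of the swing, and `run_factor` is an illustrative name.

```python
def run_factor(home_runs_total, home_games, road_runs_total, road_games):
    """Ratio of total runs per game at home to total runs per game on the road."""
    return (home_runs_total / home_games) / (road_runs_total / road_games)


# Hypothetical season so far: 20 home games, 20 road games.
before = run_factor(190, 20, 200, 20)

# Now suppose the next three home games are slugfests with 15 total runs each.
after = run_factor(190 + 3 * 15, 20 + 3, 200, 20)

print(f"before: {before:.2f}")   # 0.95, looks like a pitcher's park
print(f"after:  {after:.2f}")    # 1.02, now looks like a hitter's park
```

Three games are enough to flip the park from one side of 1.000 to the other, which is why early-season factors deserve a skeptical eye.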
Some of the interesting data points:
- Fenway Park was long an offensive park. After a redesign of the upper deck some years back, it became a balanced park that actually favored most pitchers a little. It took years for the perception to catch up with the reality. But in the last couple of years, it has apparently drifted back to being a hitter's park. The perception, though, is still stuck on the previous offense-suppressing environment.
- Kansas City's Kauffman Stadium was an offensive pressure-cooker the last couple of years, last year especially. The team made what they planned to be small tweaks to tone it down a little, but this year so far, it's playing as the most run-suppressing stadium in the majors.
- Houston's Enron Field, thought of as an offensive bandbox, is also so far this year playing as a pitcher-friendly place, if the numbers hold water. I wonder if this means Enron's stock price is a good value...perhaps not.
- The San Francisco Giants' stadium has, since its recent construction, been an extreme pitchers' park (well, for everyone but one guy, but he's just "seven sigma"), yet this year it has leaned pretty markedly in the other direction, and in all the categories. Perhaps it's because this year they sold the naming rights to the park to Seattle's Best Coffee and the pitchers are getting free macchiatos between innings.
SOME CRITICISM -- MAKING THESE NUMBERS BETTER
When you look at numbers, you should always be suspicious when you get certain artifacts that tell you the person putting the data together doesn't really have a knack for analysis. There are some indicators here that make me suspicious.
- There are insignificant digits in the charts. In the example I used before, Angel Stadium's run factor, the table shows the number .768. The trailing 8 is insignificant, and the thousandths (third decimal place) in every one of the numbers ought to be gone, with some rounding scheme applied so they all appear as two digits. When people throw insignificant digits at you, especially in data allegedly being presented to inform, it's one of three things: they're trying to hide a number or two in the pile by making the set harder to read; they're ignorant of the significance of the numbers they're showing you (not just statistically, but ignorant in every respect); or they know better but they're lazy. Either way, the whole chart would be a better analysis if the maker had rounded to two digits.
- There's a line for Olympic Stadium/Bithorn that I put in italics. These numbers already, as I mentioned, suffer from small sample size, but Les Expos Pauvres have to play in two different home stadia, and mingling the two very dissimilar stadia together removes every value but one from the line (trying to judge how individual Expo players' accumulated stats might be affected by being an Expo and playing in those home stadia). That line really shouldn't appear; even with triple the data it's not going to be meaningful.
- The inclusion of a pair of games played in Tokyo in the data set. I let the asterisked Tropicana Field/Tokyo pair through because the Devil Rays played but two home games there this year. But really, the Rays' and the Yankees' data for those games just shouldn't appear anywhere in the data set, because it affects each of those teams' apparent performance and affects the league averages by being dumped in there (when no other teams will be playing home or road games there).
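The rounding suggested in the first criticism above could be sketched like this. The four values are the Bank One Ballpark row from the table, plus the .768 from the Angel Stadium example; round-half-up is just one possible "rounding scheme", picked here as an assumption, and `two_digits` is an illustrative name.

```python
from decimal import Decimal, ROUND_HALF_UP


def two_digits(value: str) -> str:
    """Round a park factor, given as text, to two decimal places (half-up)."""
    return str(Decimal(value).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))


# Bank One Ballpark row as printed in the table above.
bank_one = ["1.021", "1.307", "1.064", "1.068"]
print([two_digits(v) for v in bank_one])   # ['1.02', '1.31', '1.06', '1.07']

# The Angel Stadium run factor from the earlier example.
print(two_digits(".768"))                  # 0.77
```

Working from the printed text with `Decimal` rather than from binary floats keeps the rounding exact for values that end in a 5.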
Beyond baseball, you can use BI to increase the chances for future performance improvement, but you need to use real analysts (these should be either professionally trained or even just people with very good skills, but not merely cheap leftovers you scrape up off the sidewalk like no-hit, good-glove utility infielders).
Look at numbers, analyse numbers, but remember that context is everything. And don't forget the numbers may only be as good as the analysis that went into them...as Mark Twain also said: Figures don't lie, but liars can figure.