Showing posts with label Statistics. Show all posts

Wednesday, February 16, 2011

Super Bowl Notes

A couple of items came up during the Super Bowl that touch on posts I've made here in the past. First, it was repeatedly mentioned during the game broadcast that although Green Bay ended up with a somewhat mediocre record of 10-6, all their losses were extremely competitive. Green Bay never trailed by more than 7 points during any game at any point in the season, and their 6 losses were by 3, 3, 3, 3, 4, and 4 points, for an average of 3.33. This average is one of the all-time lows, and compares well to the teams mentioned in my Close But No Cigar post from a year ago.

The second item has to do with the endgame strategy surrounding Green Bay's field goal attempt at the end of the game. Holding a three-point lead, Green Bay had fourth down from the Steelers' 5-yard line with just over 2 minutes left in the game. This is a classic situation where statisticians claim teams are routinely too conservative. Kicking the field goal hands the ball back to Pittsburgh, while going for it either wins the game with a touchdown, or pins Pittsburgh deep in their own end of the field.

That kind of analysis is not too unusual; similar arguments about 4th downs come up many times each season. However, on the Advanced NFL Stats site, an interesting addition to this argument appeared. That site has a "win probability" engine - for any given game situation, it provides the probability of each team winning, based on historical game outcomes from similar situations. For the game situation after the Green Bay field goal (~2 minutes left, in their own end of the field, down 6), the trailing team is expected to win 25% of the time. For the same situation, but with the team trailing by 3 instead of by 6, they are expected to win 21% of the time.
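To make the idea of a "win probability" lookup concrete, here is a minimal sketch of how such an estimate could be built from historical outcomes. This is not the Advanced NFL Stats implementation, which I haven't seen; the bucket sizes, the function names, and the sample records are all made up for illustration.

```python
from collections import defaultdict

def situation_key(score_diff, seconds_left, yards_from_own_goal):
    """Coarse bucket for a game state. The bucket sizes (1 minute,
    20 yards) are arbitrary assumptions, not the site's real ones."""
    return (score_diff, seconds_left // 60, yards_from_own_goal // 20)

def win_probability_table(records):
    """records: iterable of (score_diff, seconds_left,
    yards_from_own_goal, won) for the team with the ball.
    Returns {bucket: fraction of those situations that ended in a win}."""
    tally = defaultdict(lambda: [0, 0])  # bucket -> [wins, games]
    for diff, secs, yards, won in records:
        entry = tally[situation_key(diff, secs, yards)]
        entry[0] += int(won)
        entry[1] += 1
    return {key: wins / games for key, (wins, games) in tally.items()}

# Made-up records, purely to show the shape of the data.
toy_records = [
    (-6, 125, 20, True),
    (-6, 130, 25, False),
    (-3, 118, 22, False),
]
table = win_probability_table(toy_records)
print(table[situation_key(-6, 120, 20)])
```

A real engine would need far more data per bucket, and probably some smoothing between neighboring buckets, before figures like the 25% and 21% above could be trusted.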

This looks like something along the lines of my 13 is worth more than 14 post from 2009. How can it be that teams trailing by 3 in this situation are less successful than teams trailing by more? It may be that teams trailing by 3 put too much value on reaching overtime, so they play to tie the game rather than to win it outright. When teams are further behind, they are forced to avoid this bad strategy. On the other side of the ball, defenses may also play differently when they feel they have a safer lead, and that may be a poor strategy too.

In any case, it's another interesting example of a counterintuitive statistic.

Monday, March 15, 2010

Chess Query Language

It's amazing how many tools are available on the web for seemingly obscure tasks. Recently, a friend of mine was writing a short story, and he needed an answer to this question: In high-level chess games, how often do the different pieces survive through the game without being captured (ignoring kings)? In the context of this question, each of the 30 starting pieces is treated as distinct; we want to know how often the pawn that starts on the a2 square survives, how often the b2-pawn survives, etc., rather than how often general pawns survive.

I think this qualifies as an obscure question. It seems simple enough to answer in principle - just get a database of games, and write something to play through each game, tracking which pieces survive. Simple enough, but a fair bit of work. Luckily there's a tool that will do this type of thing: Chess Query Language.

CQL is quite powerful, and it's pretty straightforward to set up a CQL query. For example, to answer the above question about survival rates, I started by creating a query to see how often the white queen's rook survives:

:forany Rook R
(:position :initial $Rook[a1])
(:position :terminal $Rook[a-h1-8])

That's it. The first line creates a Rook piece designator; the second and third lines specify positions that have to exist in a game for the game to match the query. Thus the query will match any game where a rook is on the a1 square in the initial game position, and that same rook is somewhere on the board in the terminal game position.

This query took about 45 minutes to run through a database of about 2.5 million games, and found that this rook survived in about 1.4 million of them. I then repeated the query for every other piece and pawn to generate the final answer.
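For comparison, the "write something to play through each game" approach could look roughly like the following. It is a hand-rolled sketch and not CQL. It assumes each game is already a list of legal moves in coordinate form (such as "e2e4", or "e7e8q" for a promotion), whereas real databases are usually in PGN and would need a parser first. It also counts a promoted pawn as having survived, which is my own choice. All the function names are invented for the example.

```python
from collections import Counter

FILES = "abcdefgh"

def initial_board():
    """Map each starting square to (origin_square, piece_kind)."""
    back_rank = "RNBQKBNR"
    board = {}
    for i, f in enumerate(FILES):
        board[f + "1"] = (f + "1", back_rank[i])
        board[f + "2"] = (f + "2", "P")
        board[f + "7"] = (f + "7", "P")
        board[f + "8"] = (f + "8", back_rank[i])
    return board

def survivors(moves):
    """Play through the moves and return the origin squares of every
    piece still on the board at the end (kings included)."""
    board = initial_board()
    for mv in moves:
        src, dst = mv[:2], mv[2:4]
        origin, kind = board.pop(src)
        file_shift = abs(FILES.index(src[0]) - FILES.index(dst[0]))
        if kind == "K" and file_shift == 2:
            # Castling: the rook jumps over the king.
            rank = src[1]
            if dst[0] == "g":
                board["f" + rank] = board.pop("h" + rank)
            else:
                board["d" + rank] = board.pop("a" + rank)
        elif kind == "P" and src[0] != dst[0] and dst not in board:
            # En passant: the captured pawn sits beside the capturing pawn.
            del board[dst[0] + src[1]]
        if len(mv) == 5:
            kind = mv[4].upper()  # promotion keeps the pawn's identity
        board[dst] = (origin, kind)  # overwrites any captured piece
    return {origin for origin, _ in board.values()}

def survival_rates(games):
    """Fraction of games in which each starting piece survives."""
    counts, total = Counter(), 0
    for moves in games:
        counts.update(survivors(moves))
        total += 1
    return {square: n / total for square, n in counts.items()}

print(sorted(survivors(["e2e4", "d7d5", "e4d5", "d8d5"])))
```

In the short example game, the d7 pawn and then the e2 pawn are captured, so neither origin square appears in the output.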

So, I can advise that if you're ever involved in a Harry Potter-style human chess game, you should volunteer to be one of the wing pawns. Don't allow yourself to play as a knight, whatever you do.

Thursday, January 7, 2010

Close But No Cigar

A friend of mine, who cheers for the Pittsburgh Steelers, has been talking about how close their losses have been this year. The Steelers finished 9-7, but the 7 losses were by a total of only 28 points, so a couple plays one way or the other could have produced a very different result. I'm quite familiar with this line of thinking, since my team is the Chargers, and last year they lost 8 games by a total of only 34 points (including a 1-point loss on a terrible blown call in a game I drove 1000 miles to see. But I digress.) I have my trusty database, so I thought I'd take a look at how these two seasons stack up historically. The following items all consider the years from 1978 through 2008.
  • Among all teams that finished at 9-7, the 2009 Steelers' total margin of defeat (28) was the lowest. The next closest were the 1993 Broncos (30) and the 2002 Saints (35).
  • The 2009 Steelers' average margin of defeat (4.0) is the lowest among all teams that lost 7 or more games.
  • Among all teams that finished at 8-8, the 2008 Chargers' total margin of defeat (34) is tied for the lowest with the 1999 Raiders.
  • The 2008 Chargers' average margin of defeat (4.25) is the lowest among all teams that lost 8 or more games. The 2009 Steelers are the only team with 7 losses to have a lower average margin of defeat.
It looks like these two seasons were in fact quite unusual. Here are a few other interesting facts I came across while tabulating these results.
  • The 16-0 2007 Patriots are a bit of a special case, but you might say that they hold the record for smallest average margin of defeat, at zero.
  • The 1983 Redskins finished 14-2, and their two losses were by one point each.
  • The next best average margin of defeat (2.33) was by the 2000 Titans, who finished 13-3 and whose losses were by a total of 7 points.
  • The team with the worst average margin of defeat (24.7) was the 1989 Steelers. They actually made the playoffs at 9-7 (unlike this year's Steelers), but had several huge losses, including a 51-0 game against Cleveland. In the playoffs, the Steelers won their first game, and then lost to the Broncos... by 1.
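The totals and averages in these lists are simple to compute once the scores are in hand. Here is a minimal sketch, assuming each game is stored as a (points scored, points allowed) pair. The scores in the sample season are made up and don't belong to any real team.

```python
def margin_of_defeat(games):
    """games: list of (points_for, points_against) tuples.
    Returns (number of losses, total margin, average margin).
    A team with no losses gets an average of 0.0."""
    margins = [against - scored for scored, against in games if against > scored]
    if not margins:
        return 0, 0, 0.0
    total = sum(margins)
    return len(margins), total, total / len(margins)

# Made-up scores for a hypothetical team: losses by 3, 3, and 4.
season = [(24, 27), (17, 20), (31, 10), (13, 17), (20, 20), (28, 14)]
losses, total, average = margin_of_defeat(season)
print(f"{losses} losses, total margin {total}, average {average:.2f}")
```

Ties and wins are skipped, so only games where the opponent scored more contribute to the margin.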

Friday, May 29, 2009

13 is worth more than 14

In the NFL it is, anyway. Maybe.

I read a fair bit about sports, and in particular I have an interest in the statistics that people use to try to analyze them. One thing I've heard a few times as an example of a counter-intuitive stat is this: NFL teams scoring 13 points in a game win more often than teams scoring 14 points. I've recently come into possession of a database of game data, so I thought I'd have a look at this for myself.

My data contains all the regular-season and playoff games back to the 1978 season, so it's a pretty good sample size of about 7500 games. The first item to look at is the 13 vs. 14 thing, and sure enough:

13: 225-562-2 28.6%
14: 144-670-2 17.8%

There you have it - teams scoring 13 win significantly more often than teams scoring 14. Of course, the real question is what (if anything) this means. The most likely cause for this effect is the unusual way points are scored in football. Almost all scoring is through 3-point field goals and 7-point touchdowns. This means that 13 can really be thought of as 2 field goals and 1 touchdown, and 14 as 2 touchdowns. Maybe field-goal-heavy scores outperform touchdown-heavy scores in general. Let's see:

6: 18-307-0 5.5%
7: 16-621-2 2.7%

16: 249-256-0 49.3%
19: 160-129-0 55.4%
20: 543-432-2 55.7%
21: 308-405-0 43.2%

27: 568-176-0 76.3%
28: 328-152-2 68.3%

That certainly seems to support the FG vs. TD explanation, and in fact it's quite striking how poorly the multiples of 7 perform. 7 points wins 3% of games; it performs worse than 5, 6, 8, and 9 points. 14 wins 18%; it performs worse than 11, 12, and 13. 21 points is the highest score that loses more than half its games, and it performs worse than 16 (!). 28 does worse than 23, and so on.
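The percentages listed above are consistent with counting a tie as half a win, which is the usual NFL convention. A short sketch that recomputes them from the win-loss-tie records (the function name is just for illustration):

```python
def win_pct(wins, losses, ties):
    """Winning percentage with each tie counted as half a win."""
    return 100 * (wins + 0.5 * ties) / (wins + losses + ties)

# Records copied from the tables above: points scored -> (W, L, T).
records = {
    6: (18, 307, 0),
    7: (16, 621, 2),
    13: (225, 562, 2),
    14: (144, 670, 2),
    16: (249, 256, 0),
    19: (160, 129, 0),
    20: (543, 432, 2),
    21: (308, 405, 0),
    27: (568, 176, 0),
    28: (328, 152, 2),
}

for points, record in sorted(records.items()):
    print(f"{points}: {win_pct(*record):.1f}%")
```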

The FG vs. TD explanation makes sense for a few reasons. First, the time one team is scoring is time that the other team isn't. 3 successful possessions, for a TD and 2 FGs, will generally take more time than 2 successful TD possessions, leaving the opponent with less time to score their own points. Second, teams that are trailing by a large amount won't try for field goals. That is, a team losing 20-7 will have to go for a touchdown, while a team losing 10-7 is more likely to take a field goal. Finally, there may be game conditions making certain games conducive to more field goals. For instance, a game with heavy snow or fog might reduce offense, causing both teams to score few TDs.

I must admit, though, that this effect doesn't last forever. Teams scoring 49 or 56 points have won 100% of their games since 1978. I guess the lesson is that if you're going to score touchdowns, you should try to score 7 or 8 of them, not just 2 :).

Monday, February 9, 2009

30% of all numbers start with 1

Well, not exactly. However, it is true that in certain common types of data, numbers beginning with 1 appear the most frequently, making up about 30% of the values. Numbers beginning with 2 appear less frequently, about 18% of the time. Each successive digit appears with a lower frequency, until 9 shows up as the leading digit in less than 5% of the values. This property is called Benford's Law, after physicist Frank Benford.
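The standard statement of Benford's Law is that the leading digit d appears with probability log10(1 + 1/d). A quick sketch that prints the expected share for each digit:

```python
import math

def benford(digit):
    """Expected share of values whose leading digit is `digit` (1-9)."""
    return math.log10(1 + 1 / digit)

for d in range(1, 10):
    print(f"{d}: {100 * benford(d):.1f}%")
```

The nine shares add up to 100%, since the logarithms telescope to log10(10).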

This is quite a counter-intuitive thing. Why should there be more bank balances starting with a 1 than with a 9? Six times more of them, in fact. Why should this be true for the lengths of all the rivers in the world? The technical explanation is that these types of real-world values tend to be spread evenly on a logarithmic scale rather than a linear one. For a more intuitive explanation, an example is probably more helpful.

Let's say you invest $100 in an account that pays 10% annually. This means that your investment will double every 7.3 years. The investment will reach $200 after 7.3 years, so for that entire first 7.3 years the investment's value began with a 1. Now, it will take another 7.3 years to double again. However, this time it's doubling to $400, not $300. The investment is valued in the $100s for the same amount of time it is valued in the $200s and $300s combined. The point is that the investment is growing at a rate proportional to its own size. It's compounding. The investment only spends 4 years or so in the $200s, 3 years in the $300s, and so on, finally breezing through the $900s in just over a year. Then this repeats for all the 4-digit values: 7.3 years in the $1000s, 4 in the $2000s, etc.
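The time spent in each band can be worked out directly. Here is a small sketch, assuming the 10% is compounded annually and treated as smooth growth between anniversaries; the function name is only for illustration.

```python
import math

RATE = 0.10  # the 10% annual return from the example

def years_in_band(digit, rate=RATE):
    """Years for the balance to grow from digit*10^k to (digit+1)*10^k."""
    return math.log((digit + 1) / digit) / math.log(1 + rate)

total = sum(years_in_band(d) for d in range(1, 10))  # time to grow tenfold
for d in range(1, 10):
    years = years_in_band(d)
    print(f"leading digit {d}: {years:.1f} years ({100 * years / total:.1f}% of the time)")
```

The share of time spent on each leading digit comes out to log10(1 + 1/d) regardless of the interest rate, which is exactly the Benford distribution.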

The reason I've been thinking about this topic recently is that it's been mentioned in relation to Bernie Madoff's Ponzi scheme. Benford's Law is a good tool for detecting fraudulent data, because people faking such data often don't take it into account. Apparently, Madoff was sophisticated enough to generate numbers that met Benford's Law reasonably well.

As a quick real-world test of this, I thought I'd check the sizes of all the files on my computer's hard drive. The results seem to fit the Law's prediction quite well:











Digit   # Files   % Files   Benford
1       48,295    28.6%     30.1%
2       32,893    19.5%     17.6%
3       23,297    13.8%     12.5%
4       15,925     9.4%      9.7%
5       12,612     7.5%      7.9%
6       10,665     6.3%      6.7%
7        8,438     5.0%      5.8%
8        9,972     5.9%      5.1%
9        6,498     3.9%      4.6%
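A tally like this can be produced with a short script. The sketch below is not the one I used at the time; it is a minimal version that walks a directory tree, skips empty and unreadable files, and counts the leading digit of each file's size in bytes. The function name is made up for the example.

```python
import os
from collections import Counter

def leading_digit_counts(root):
    """Count files under `root` by the leading digit of their size in bytes."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # unreadable file, broken link, or file that vanished
            if size > 0:  # a zero-byte file has no leading digit
                counts[int(str(size)[0])] += 1
    return counts
```

Calling leading_digit_counts on a directory returns a Counter keyed by digit, and dividing each count by the sum of all counts gives the percentage column.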


Wikipedia Link: Benford's Law

Monday, January 26, 2009

NFL Win Probability Calculator

Two things I've been interested in for a long time are NFL football and mathematics. This makes a site like Advanced NFL Stats right up my alley. In particular I like the Win Probability Calculator that was recently posted there. You enter a game state, including the score, field position, and time remaining, and the system produces the probability of each team winning the game. I think this could be a great tool for arguments about going for it on 4th down, or for attempting 2-point conversions.