I use this blog as a soap box to preach (ahem... to talk :-) about subjects that interest me.

Tuesday, November 20, 2012

Misunderstood Science #2 - Another question of probability

In this article, I want to describe a problem that illustrates how to correctly estimate probability in a way that will surprise the non-statistician.

Here is the problem: An American friend of yours has two children. You know that one of them is a girl, but cannot remember the gender of the other child. What is the probability that they are both girls?

Many people would equate the lack of information concerning the gender of the other child with equal probability of the two possible genders, and answer 50%.

Their reasoning would be completely wrong but, as it turns out, their answer would be correct, at least in practical terms. If you read and understood my previous article on probabilities http://giuliozambon.blogspot.com.au/2012/11/misunderstood-science-question-of.html, you might know why. But let’s proceed in order.

For the genders of two children, there are four possibilities: MM, MF, FM, and FF, which represent the sample space of the problem. If we assume that boys and girls are equally probable and that the genders of the two children are independent, the four possibilities are also equally probable, at 25% each. As you know that one of the children is a girl, you can exclude the MM case. As a result, you are left with three equally probable possibilities, only one of which is FF, and can conclude that the probability of both children being girls is 1/3, or approximately 33.3%. Correspondingly, the probability that your friend’s other child is a boy is 2/3, or approximately 66.7%.
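The counting argument above can be checked with a few lines of Python. This is only a minimal sketch, assuming equally likely and independent genders as in the text; the variable names are chosen here for illustration.

```python
from fractions import Fraction
from itertools import product

# All ordered gender pairs (first child, second child); each has probability
# 1/4 under the assumption of equally likely, independent genders.
sample_space = ["".join(pair) for pair in product("MF", repeat=2)]

# Condition on "at least one child is a girl", which removes MM.
at_least_one_girl = [s for s in sample_space if "F" in s]
both_girls = [s for s in at_least_one_girl if s == "FF"]

# The remaining outcomes are equally likely, so a simple count is enough.
p_both_girls = Fraction(len(both_girls), len(at_least_one_girl))
print(at_least_one_girl)  # ['MF', 'FM', 'FF']
print(p_both_girls)       # 1/3
```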

But then, why did I say that the 50-50 answer is correct from a practical point of view? There are two reasons:
1. No parent gives the same name to their two daughters.
2. The frequencies (and hence, the inferred probabilities) of given names are very low.

Let’s start by taking into consideration that two daughters in the same family always have different names. We do so by splitting the ‘F’ of the above possibilities into ‘x’ and ‘f’, where ‘x’ indicates the girls with a particular first name, and ‘f’ the other girls, who have any other name. This results in a sample space consisting of MM, Mf, Mx, fM, ff, fx, xM, xf, and xx.

‘x’ can be any name we want, including the name of the daughter we know to belong to your friend’s family (even if we don’t know that name). Then, after discarding MM as we did before, we can also discard the other possibilities that don’t contain ‘x’. This leaves us with Mx, fx, xM, xf, and xx. Now, as parents never give two daughters the same name, we can also discard xx, and remain with the four possibilities Mx, xM, fx, and xf.
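The filtering of the enlarged sample space can be reproduced mechanically. The sketch below uses the same three symbols as the text (M, f, x); the list names are made up for this illustration.

```python
from itertools import product

# 'M' = boy, 'x' = girl with the chosen name, 'f' = girl with any other name.
sample_space = ["".join(pair) for pair in product("Mfx", repeat=2)]

# Keep only the families that include a girl named 'x' ...
with_x = [s for s in sample_space if "x" in s]
# ... and drop 'xx', since two sisters are assumed never to share a name.
remaining = [s for s in with_x if s != "xx"]

print(sample_space)  # the nine possibilities
print(with_x)        # the five possibilities containing 'x'
print(remaining)     # the four possibilities left after discarding 'xx'
```

Note that, unlike in the first enumeration, these outcomes are not equally probable, which is why the next step works with their individual probabilities.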

If we assume that boys and girls are on average equally probable and that the genders of children of the same family are independent of each other, we can calculate the probabilities associated with the four possibilities:
P_Mx = P_xM = P_M * P_x
P_fx = P_xf = P_f * P_x

We can use the frequency ‘y’ with which the name ‘x’ occurs among girls as an estimate of its probability, and rewrite the two expressions as follows:
P_Mx = P_xM = P_M * P_F * y = 0.25 * y
P_fx = P_xf = P_F * (1 - y) * P_F * y = 0.25 * (y - y²)

As you can see, if y² is much less than y (i.e., if y is much less than 1, as stated in our condition 2), all four possibilities have, for all practical purposes, the same probability. Since two of the four possibilities contain two girls, the probability that the children are both girls is then indeed 50%.
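For readers who prefer the exact value to the approximation, the four probabilities can be combined into a single expression. This is a sketch using the symbols defined above; P(FF | x), the probability of two girls given a daughter named x, is a shorthand introduced only for this derivation.

```latex
\begin{align*}
P(FF \mid x) &= \frac{P_{fx} + P_{xf}}{P_{Mx} + P_{xM} + P_{fx} + P_{xf}} \\
             &= \frac{2 \cdot 0.25\,(y - y^2)}{2 \cdot 0.25\,y + 2 \cdot 0.25\,(y - y^2)} \\
             &= \frac{1 - y}{2 - y} \approx \frac{1}{2} - \frac{y}{4} \qquad \text{for } y \ll 1
\end{align*}
```

The result is always slightly below 50%, and the shortfall is roughly a quarter of the name's frequency.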

But is it true that all names have a frequency much less than 1? If you look at the web site of the US Social Security Administration, you will find the page http://www.ssa.gov/oact/babynames/limits.html from which you can download the number of children born in any particular year and given any particular name (but only if that name was given to at least five children).

Let’s say that your friend’s daughter was born in 2011. Then, you quickly find out that of the 33,723 names listed, out of a total of 3,623,043 girls, the most frequent girl name was Sophia, which was given 21,695 times. If your friend gave his daughter the name Sophia, with y = 21,695 / 3,623,043 = 0.0060, the resulting probability for two girls (the combined probability of fx and xf divided by the total of all four possibilities) is around 49.85%.

Perhaps 49.85% is not close enough to 50%, but consider that the average occurrence of any name is 3,623,043 / 33,723 = ~107, which provides y = 0.00003. Then, the probability of two girls, without knowing the name of the daughter your friend certainly has, becomes 49.999%. Or perhaps you find out that the name of your friend’s daughter is Hilde, which in 2011 only occurred 5 times out of 3,623,043. In that case, the probability of him having two daughters is almost exactly 50%.
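A reader who wants to check the arithmetic can plug the counts quoted above into the four probabilities. The following is a rough sketch rather than a definitive calculation: the function name `p_two_girls` and the constants are invented for this example, and the counts are simply the ones given in the text.

```python
TOTAL_GIRLS = 3_623_043   # total for 2011, as quoted in the text
NAMES_LISTED = 33_723     # number of names listed for 2011, as quoted in the text

def p_two_girls(name_count: float, total: int = TOTAL_GIRLS) -> float:
    """Probability of two girls, given a daughter whose name occurs name_count times."""
    y = name_count / total                # frequency of the name among girls
    boy_and_x = 2 * 0.25 * y              # Mx and xM together
    girl_and_x = 2 * 0.25 * (y - y * y)   # fx and xf together
    return girl_and_x / (boy_and_x + girl_and_x)

cases = {
    "Sophia": 21_695,
    "average name": TOTAL_GIRLS / NAMES_LISTED,
    "Hilde": 5,
}
for label, count in cases.items():
    print(f"{label}: {100 * p_two_girls(count):.4f}%")
```

The rarer the name, the closer the printed value gets to 50%, without ever reaching it.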

All in all, we can conclude that 50% is, for all practical purposes, correct, even if the reasoning of many people to reach that value is wrong.
