Maths mums - Complex statistics - can you help?

6 replies

ohIdoliketobebesidethe · 17/04/2009 22:03

Hello.

My husband has struggled with this work-related problem and I suggested Mumsnet could answer it.

Is there a way of understanding the underlying standard deviation of a population, given 6 sets of samples for each of 28 days? Each sample has between 2000 and 7000 results per day. He does not know the distribution of the underlying population (he knows the mean) but assumes it is like the side of Mount Fuji - a mode at about 0.01 and a mean of 0.45 but with individual results as high as 10.00. He has tried the Central Limit Theorem (CLT), t-tests and an ANOVA test but is unsure if any of them can accurately work out the underlying population's standard deviation.

(Please wow him with your Mumsnet brains as I have really egged this up.)

Thanks so much.

apostrophe · 18/04/2009 23:46

This reply has been deleted

Message withdrawn

MrVibrating · 18/04/2009 23:59

With statistics, a little knowledge is a dangerous thing.

Why does he want to know the standard deviation? This information is only relevant in relation to a specific distribution, and you have stated that the distribution is not known (although we should be able to take a guess in a minute). If the distribution is skewed (like Mount Fuji), then the SD is irrelevant without the skewness anyway.

Your description of the results makes it sound like they are not taken from a population that can be described with a single distribution. They look to me like (for example) response times for a web site: most of the time these are really quick (when the web server can serve your request straight away) but quite a lot of the time it is substantially slower (when the web server is busy and has to wait for access to a file or a slice of processor time) and sometimes it takes an order of magnitude longer when something weird happens. A mean of these results is (excuse the pun) meaningless, and a 'standard' deviation even more so.

What does a frequency chart of the results look like? Plot 3 charts: one in the 0-10 range (to get an understanding of the overall distribution), one 0-1 (to look at the distribution around the mean) and one 0-0.02 (or is the mode 0.1 rather than 0.01?).
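The three charts suggested above can be roughed out as text bar charts. This is only a sketch: the data here is made up (a skewed lognormal sample capped at 10.00, chosen purely for illustration), and `frequency_table` is a hypothetical helper name, not something from the thread.

```python
import random

def frequency_table(values, low, high, bins=10):
    """Count how many values fall in each of `bins` equal-width bins on [low, high]."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if low <= v <= high:
            # A value exactly equal to `high` goes into the last bin.
            index = min(int((v - low) / width), bins - 1)
            counts[index] += 1
    return counts

# ASSUMPTION: stand-in data only, not the real results.
random.seed(1)
results = [min(random.lognormvariate(-2.0, 1.5), 10.0) for _ in range(7000)]

for low, high in [(0, 10), (0, 1), (0, 0.02)]:
    counts = frequency_table(results, low, high)
    tallest = max(max(counts), 1)  # avoid dividing by zero if a range is empty
    width = (high - low) / len(counts)
    print(f"Range {low}-{high}")
    for i, count in enumerate(counts):
        bar = "#" * (count * 50 // tallest)
        print(f"  {low + i * width:7.3f}  {bar} {count}")
```

Each range is drawn separately because a single 0-10 chart would squash everything near the mode into the first bar.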

MrVibrating · 19/04/2009 00:02

oops, x-post - nothing wrong with apostrophe's answer (hi fellow maths geek).

ohIdoliketobebesidethe · 20/04/2009 21:37

Thank you both for your responses - I have, I think, been proved right about Mumsnet brains. I will have to pin him down to answer your specific questions though, because I don't get it.

ohIdoliketobebesidethe · 20/04/2009 22:42

He does not know the distribution - he has no individual measurements, only the average of thousands of measurements (he has an average for each of 6 samples of 2000 - 7000 measurements for each of 28 days of data). His assumption that the individual measurements lie on a distribution between 0.01 and 10.00 is just an (informed) guess.

The results are indeed taken from a web company. The data is the Revenue Per Click his site receives whenever a visitor to his site clicks on an advert. The ads are provided by Google and they do not share the distribution of Revenue-Per-Click, just the total. He agrees that the underlying population of Revenue Per Clicks varies each day.

The reason he wants to know the standard deviation of the underlying population is that when he tests something new on the site (e.g., more ads) he does not know whether he can trust the change in Revenue Per Click that the test shows or whether it is just noise. He assumed (and I agree that a little knowledge and assumptions are dangerous) that with 6 identical tests each with c. 7000 clicks per day he could work out the level of noise.
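One possible way to put a number on "is it just noise" is to compare a test's Revenue Per Click directly with the spread of the identical samples from the same day. The sketch below is not the husband's method; it assumes the samples are of similar size and that their averages are roughly normally distributed, and every number in it is made up for illustration.

```python
import math
import statistics

def noise_statistic(test_rpc, identical_rpcs):
    """How many 'noise units' the test result sits from the identical samples' average.

    The denominator is the spread of the identical samples, widened by
    sqrt(1 + 1/k) because their average is itself estimated from k samples.
    """
    k = len(identical_rpcs)
    centre = statistics.fmean(identical_rpcs)
    spread = statistics.stdev(identical_rpcs)
    return (test_rpc - centre) / (spread * math.sqrt(1 + 1 / k))

# ASSUMPTION: hypothetical figures, not from the thread.
identical = [0.44, 0.47, 0.43, 0.46, 0.45, 0.48]
test_result = 0.52

t = noise_statistic(test_result, identical)
# 2.571 is the two-sided 5% critical value of Student's t with 5 degrees of
# freedom, so it only applies when there are exactly 6 identical samples.
verdict = "unlikely to be noise" if abs(t) > 2.571 else "could easily be noise"
print(f"statistic = {t:.2f}: {verdict}")
```

With only 6 identical samples per day the bar is high, which is one reason to repeat the comparison over several days rather than trust a single one.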

This is his barmy sounding 'double central limit theorem' method:

- He looks at the standard deviation across the 6 daily samples (each with 7000 clicks) and works out the standard deviation of an individual sample by dividing the st dev of the 6 samples by the sqrt of 6.
- From this he now has the st dev of one single sample.
- He now applies the CLT to the st dev of the 7000 clicks to get to the st dev of the underlying population: st dev of the underlying population = st dev of sample / sqrt(7000).
- He does this every day for 28 days and then takes the average of the 28 double-CLT st devs for the population.
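A note on the direction of the scaling: the CLT says the st dev of an average of n clicks is roughly the population st dev divided by sqrt(n), so going back from the spread of the averages to the spread of individual clicks means multiplying by sqrt(n), not dividing. The simulation below is a minimal sketch of that, assuming equal sample sizes and a made-up skewed population (exponential with mean 0.45, chosen only because its st dev is known to equal its mean).

```python
import math
import random
import statistics

CLICKS_PER_SAMPLE = 2000  # ASSUMPTION: every sample is the same size here
SAMPLES_PER_DAY = 6
DAYS = 28
TRUE_MEAN = 0.45          # for an exponential distribution the SD equals the mean

def daily_variance_estimate(sample_means, clicks_per_sample):
    """Variance of individual results implied by one day's sample averages.

    Rearranges Var(sample mean) = sigma^2 / n into sigma^2 = Var(sample mean) * n.
    """
    return statistics.variance(sample_means) * clicks_per_sample

random.seed(42)
daily_estimates = []
for _ in range(DAYS):
    means = [
        statistics.fmean(
            [random.expovariate(1 / TRUE_MEAN) for _ in range(CLICKS_PER_SAMPLE)]
        )
        for _ in range(SAMPLES_PER_DAY)
    ]
    daily_estimates.append(daily_variance_estimate(means, CLICKS_PER_SAMPLE))

# Average the daily variances (not the daily SDs), then take the square root.
pooled_sd = math.sqrt(statistics.fmean(daily_estimates))
print(f"estimated SD of individual results: {pooled_sd:.3f} (true value {TRUE_MEAN})")
```

Working within each day and only then averaging means a shift in the average Revenue Per Click from one day to the next does not get counted as noise.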

He has given up on this idea as the 6 tests, though identical in how they were set up, varied in size and he could not cope with it all.
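Samples of different sizes can be handled by weighting each sample's squared deviation by its number of clicks. The sketch below assumes all samples on a given day are drawn from the same population; the function name and the figures are hypothetical, not taken from the thread.

```python
import math

def variance_from_unequal_samples(means, sizes):
    """Estimate per-result variance from sample averages with different sample sizes.

    Each squared deviation from the size-weighted overall mean is weighted by its
    sample size; dividing by (number of samples - 1) gives an unbiased estimate
    when all samples come from the same population.
    """
    total = sum(sizes)
    overall = sum(m * n for m, n in zip(means, sizes)) / total
    weighted_ss = sum(n * (m - overall) ** 2 for m, n in zip(means, sizes))
    return weighted_ss / (len(means) - 1)

# ASSUMPTION: one day's made-up averages and click counts.
day_means = [0.44, 0.47, 0.43, 0.46, 0.45, 0.48]
day_sizes = [2000, 3500, 7000, 5200, 4100, 6300]

estimate = math.sqrt(variance_from_unequal_samples(day_means, day_sizes))
print(f"estimated SD of individual results for this day: {estimate:.2f}")
```

When every sample has the same size this reduces to the variance of the averages multiplied by the sample size.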

So today he has gone and paid some statistician to help him. I think this concept is too intense for a Mumsnet post, but I am glad that there have been enough sensible posts to make my husband a) impressed and b) give up trying to solve this alone.

apostrophe · 21/04/2009 22:26

This reply has been deleted

Message withdrawn
