Anyone know about statistics?

20 replies

StealthPolarBear · 27/08/2010 13:54

I'm trying to come up with a process to decide whether a new piece of data fits the overall trend. So if you imagine a chart with the data points on, and then a line of best fit.
When I add each new one, I'm going to say how confident I am that it's 'right' iyswim. So I can see two possibilities:

1. I have all the other data and a line of best fit. I use the LOBF to calculate an expected value for this new data point. I then see how far it is from that.
2. I add the new data point, calculate the LOBF for the data including the new point, and then I see how far it is from that.

Which is better? Or are they essentially the same?

AMumInScotland · 27/08/2010 13:57

I think you have to do 1 - you are comparing this single point with a line which already exists, to see how well it matches.

If you do 2, you might make a very big change to your LOBF, which is enough to make the new point seem like a reasonable fit (or at least no worse than the existing points) even though you had to shift the line a long way to accommodate it.

Does that seem to make sense?
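
A rough sketch of the point above, in Python with made-up numbers. The data, the suspect value and the `fit_line` helper are all illustrative assumptions, not anything from the OP's system. It fits the line both ways and compares how far the new point appears to be from each.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line of best fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up data: ten periods lying exactly on y = 100 + 2x, then a suspect 11th value.
xs = list(range(1, 11))
ys = [100 + 2 * x for x in xs]
new_x, new_y = 11, 200

# Option 1: fit on the old points only, then measure the new point against that line.
slope1, intercept1 = fit_line(xs, ys)
resid1 = new_y - (slope1 * new_x + intercept1)

# Option 2: refit with the new point included, then measure against the shifted line.
slope2, intercept2 = fit_line(xs + [new_x], ys + [new_y])
resid2 = new_y - (slope2 * new_x + intercept2)

print(f"option 1 distance: {resid1:.1f}")
print(f"option 2 distance: {resid2:.1f}")
```

With these numbers option 1 puts the new point 78 above the line, while option 2 drags the line upwards and reports only about 53, so the same value looks less unusual.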

LostArt · 27/08/2010 14:10

You need to do 1, otherwise you will change the LOBF.

Stealth - I used to work in NHS IT years ago. I'm glad to see the NHS statistics is still alive and kicking!

MrsBadger · 27/08/2010 14:14

why is the new piece of data less reliable than the old, iyswim?

MamaChris · 27/08/2010 14:17

1 sounds a sensible approach. But how will you judge how far from the LOBF is acceptable?

StealthPolarBear · 27/08/2010 14:22

exactly MrsB that was my thinking. There are actually 2 situations, one is an error in data load, and in that situation the new point is a lot less reliable than the rest (as we've had the rest loaded for a while) so 1 makes sense. However I'm also looking for patterns and trends, such as activity being coded differently or different referral patterns, in which case 2 makes sense.
I think for all these situations I need to understand what question I'm asking.

Thank you all for your help! In this case I need to go for 1 then :)

StealthPolarBear · 27/08/2010 14:24

MamaChris, it'll have to be based on previous data, so I suppose what's the furthest anything has been and still been 'ok'. I think that's what's confusing me a bit. Not sure if I'm looking at sd (would I need to plot all the data and see if it looks like a gaussian curve) or confidence levels somehow. I am a novice at all this and really want to learn but stats textbooks are so dry :o
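
One hedged way to turn "how far has anything been and still been ok" into a number is to measure the spread of the old points around the line and express the new point's distance in those units. The sketch below assumes the scatter around the line is roughly Gaussian. The data, the function names and the cut-off of 3 standard deviations are illustrative choices, not something from the thread.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares line of best fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residual_sd(xs, ys, slope, intercept):
    """Spread of the old points around the line (n - 2 because two values were fitted)."""
    squares = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(squares / (len(xs) - 2))


def sds_from_line(xs, ys, new_x, new_y):
    """Distance of the new point from the old line, in residual standard deviations."""
    slope, intercept = fit_line(xs, ys)
    sd = residual_sd(xs, ys, slope, intercept)
    return (new_y - (slope * new_x + intercept)) / sd


# Made-up history: a rising trend with a little noise.
xs = list(range(1, 11))
ys = [103, 101, 107, 106, 111, 110, 115, 117, 117, 121]

THRESHOLD = 3.0  # assumed cut-off; tune it against what has historically been 'ok'
z_plausible = sds_from_line(xs, ys, 11, 123)
z_suspect = sds_from_line(xs, ys, 11, 160)
print(abs(z_plausible) > THRESHOLD, abs(z_suspect) > THRESHOLD)
```

Plotting a histogram of the old residuals first is a reasonable check on the Gaussian assumption before trusting any cut-off.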

EdgarAllenPop · 27/08/2010 14:24

1 is the correct approach.
2 calculates the deviance from the total trend, and muddies the data-set ...

StealthPolarBear · 27/08/2010 14:26

LostArt - we don't do this at the moment! When the new data is loaded I run some queries, load them into an Excel spreadsheet so I can sort. I then scan down them and see if anything jumps out, as you can imagine a lot is missed!

StealthPolarBear · 27/08/2010 14:27

Thanks EAP, you've backed up what AMum and everyone else has been saying. That's what I'll do. So now I need to decide how far from the expected it is and whether that's asseptable (might use little Supernanny icons). Would a funnel plot have any application here?

MamaChris · 27/08/2010 14:39

I recall something about cusum plots for process control - detecting when a series of data points start to depart from the earlier trend - but I'm not sure how applicable that would be here, if you want to be able to detect single outliers.

Would the usual regression diagnostics to look for outliers help at all? You don't need to refit the model necessarily, you could just add the new point, and a new residual.
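
As a sketch of the "new residual without refitting" idea: the usual textbook diagnostic divides the new point's residual by the standard error of a prediction from the old line, which is larger the further the new x sits from the middle of the old data. Under the standard regression assumptions the result can be compared with a t distribution on n - 2 degrees of freedom. The function name and the data below are made up for illustration.

```python
import math


def standardised_new_residual(xs, ys, new_x, new_y):
    """Residual of a new point against the old line, scaled by its prediction error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual standard deviation of the old points (two fitted values, so n - 2).
    squares = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(squares / (n - 2))

    # Standard error for predicting one new observation at new_x.
    se_pred = s * math.sqrt(1 + 1 / n + (new_x - mean_x) ** 2 / sxx)
    return (new_y - (intercept + slope * new_x)) / se_pred


# Made-up history: a rising trend with a little noise.
xs = list(range(1, 11))
ys = [103, 101, 107, 106, 111, 110, 115, 117, 117, 121]

t_plausible = standardised_new_residual(xs, ys, 11, 123)
t_suspect = standardised_new_residual(xs, ys, 11, 160)
print(round(t_plausible, 2), round(t_suspect, 2))
```

The old line is never refitted here, so a bad value cannot pull the trend towards itself.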

StealthPolarBear · 27/08/2010 14:42

erm thank you :) You lost me a bit but I will look into those.
As I've said I will also be looking at the data as a whole to look at trends etc so the first thing you mentioned would be useful there.
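
For the trend-spotting side, here is a very reduced sketch of a two-sided tabular cusum on residuals from the line. It adds up how far each point sits above or below expectation, less a small allowance, and raises an alarm when the running total gets large. The allowance `k = 0.5` and limit `h = 4.0` (both in standard deviations) are commonly quoted starting values, used here as assumptions rather than recommendations, and the residuals are made up.

```python
def cusum_alarms(residuals, sd, k=0.5, h=4.0):
    """Return the positions at which a two-sided tabular cusum signals a shift."""
    high = 0.0  # running evidence that values are drifting upwards
    low = 0.0   # running evidence that values are drifting downwards
    alarms = []
    for i, residual in enumerate(residuals):
        z = residual / sd
        high = max(0.0, high + z - k)
        low = max(0.0, low - z - k)
        if high > h or low > h:
            alarms.append(i)
            high = low = 0.0  # restart the sums after signalling
    return alarms


# Made-up residuals: four ordinary points, then a run sitting 1.5 sd above the line.
residuals = [0.2, -0.5, 0.3, -0.1, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5]
alarms = cusum_alarms(residuals, sd=1.0)
print(alarms)
```

None of these points would fail a 3 sd single-point check, but the sustained run still trips the cusum, which is the sense in which it suits coding or referral changes rather than one-off load errors.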

MamaChris · 27/08/2010 16:53

Have a google for "outlier detection regression". I think cusum is quite complicated to get right, but outlier detection is fairly straightforward. Good luck!

StealthPolarBear · 27/08/2010 16:59

thank you - I will! Yes I looked it up and it looked far too complicated for me!

One last question, it's just occurred to me that there are likely to be cyclic trends - dip in activity in December and February etc, increase in January. What should I be looking into to handle this? I can compare to the same month in the previous two years but not happy about going any further back really - data not as good and coding different.

Oh one more last question, promise this one really is the last - are there any online courses or books you'd recommend? I've read a load of public health information books which were good, but I want something more applied statisticy - all the stats books are really dry. I need something with real life examples!
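
On the cyclic trends question, the "same month in the previous two years" comparison can be sketched very simply. The dictionary layout keyed by (year, month), the function names and the figures are all made up for illustration.

```python
def same_month_baseline(activity, year, month, years_back=2):
    """Average of the same calendar month over the previous `years_back` years."""
    previous = [
        activity[(year - k, month)]
        for k in range(1, years_back + 1)
        if (year - k, month) in activity
    ]
    return sum(previous) / len(previous) if previous else None


def percent_change(activity, year, month, years_back=2):
    """Percentage difference between this month and its same-month baseline."""
    baseline = same_month_baseline(activity, year, month, years_back)
    if baseline is None:
        return None  # no earlier data to compare against
    return 100.0 * (activity[(year, month)] - baseline) / baseline


# Made-up December activity counts for three years.
activity = {(2008, 12): 800, (2009, 12): 840, (2010, 12): 410}
change = percent_change(activity, 2010, 12)
print(change)
```

This sidesteps the seasonal dip because December is only ever compared with other Decembers, at the cost of ignoring any overall upward or downward trend between years.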

MrsBadger · 27/08/2010 17:43

will post if one comes to mind but we are more lab based so tend to design the expt to fit a handy stats test, iyswim (ANOVA is our fave)

StealthPolarBear · 27/08/2010 21:04

thanks :)

StealthPolarBear · 27/08/2010 21:05

I have a feeling (in fact I'm 99% sure-see what I did there?) that we have the wrong tool for the job, we have a quick DQ tool, whereas I think what we need is an actual statistical tool :( Can't see them buying that though

MamaChris · 28/08/2010 07:56

If you need to account for seasonality, you can model this explicitly - include a covariate for month of observation, or fit a shaped function. But I think here you're getting towards "proper" statistics!

What field are you in? I'll try and have a think about books, but my background is maths, and I now work in biological sciences, so the field I know well is fairly narrow.

StealthPolarBear · 28/08/2010 08:00

Thank you, yes that's the only way I could think to do it - include a "MonthNumber" in any of my calculations. Just wondered if there was a standard way.

Not sure what field I'm in tbh! I did a maths degree with the bare minimum of stats, then moved onto computers, now I'm working as a Data Manager for the NHS - doesn't require a huge statistical knowledge but this one project would benefit from it and I'd love to get into it too.

MamaChris · 28/08/2010 19:22

The problem with simply including a MonthNumber is that it means you need to fit 11 parameters (1 for each month, less 1 for a default), which over just 2 years' data means none of them will be estimated very accurately. Also it may make sense that, say, January and February are more similar than January and June, and it would be nice to take account of this. (Does it, in your data?) Which is where proper modelling of time series comes in.

Good you're not scared of a bit of algebraic notation :) I suspect what might work for you is fairly standard regression, with an underlying model of monthly trends that takes account of seasonality. Then standard diagnostics can identify outlying observations and a series of outliers might make you suspect a change in overall trend. Time series are not really my thing, but I'll have a look through my books and try and think of something useful.
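
One possible reading of "regression with an underlying model of seasonality" is a straight-line trend plus a single yearly sine and cosine pair. That uses two seasonal parameters instead of eleven, and neighbouring months automatically get similar seasonal effects. This is a sketch under that assumption, not a recommendation of a specific time series method. The helper names and the data are made up, and the small equation solver is only there to keep the example free of external libraries.

```python
import math


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def design_row(t):
    """Predictors for month index t (0, 1, 2, ...): constant, trend, yearly sine and cosine."""
    angle = 2 * math.pi * t / 12
    return [1.0, float(t), math.sin(angle), math.cos(angle)]


def fit_seasonal(ts, ys):
    """Least-squares fit via the normal equations; returns the four coefficients."""
    rows = [design_row(t) for t in ts]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    return solve(xtx, xty)


def predict(coeffs, t):
    """Expected value for month index t under the fitted model."""
    return sum(c * v for c, v in zip(coeffs, design_row(t)))


# Made-up two years of monthly data generated from known coefficients, with no noise.
true_coeffs = [500.0, 3.0, 40.0, 25.0]
ts = list(range(24))
ys = [predict(true_coeffs, t) for t in ts]

coeffs = fit_seasonal(ts, ys)
expected_next = predict(coeffs, 24)
print([round(c, 3) for c in coeffs], round(expected_next, 1))
```

The expected value for the next month can then be compared with the actual value in the same way as for a plain line of best fit.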

dignified · 05/09/2010 19:24

Spearman Rank is good for analysing data and calculates the probability for you.
