As an example of the importance of errors, imagine we are measuring the temperature of a cooling object and obtain the following two readings..
Is the difference between these two temperatures significant? There is no way of knowing without the error. If the error is plus or minus 0.001, then the difference is most definitely significant. If, however, the error happens to be 0.01, the difference is insignificant. Whenever drawing conclusions or publishing work, all errors must be accounted for and documented, or else the data does not mean much.
When working in science, it is important to distinguish between the words precise and accurate. Precise means that all the experimental values obtained are close to the same value. This value does NOT need to be the true value (I will get to this a little later); there may be some error associated with your device that is causing all the data to be shifted to an incorrect value.
Being accurate means actually obtaining values close to the true value of the quantity being measured. As an example, say you are measuring the speed of light, which happens to be 3 × 10^8 m/s (3E8), and you get values of 2.21E7, 2.23E7, 2.235E7, 2.36E7, etc. These values are very far off from the actual value of the speed of light, by about an order of magnitude. However, while this data may not be accurate, it is most definitely precise, because all the values obtained are close to each other.
This brings us to the next point of these notes: the types of errors that can occur. There are two types of errors that can affect data: systematic error and random error. Systematic error has already been mentioned; it is the error associated with the system (System---->Systematic). This means there is something flawed with the device that causes it to measure consistently incorrectly. In the previous example there was a systematic error of roughly 2.8E8, since all the values (which cluster near 2.3E7) were off from the actual speed of light, 3E8, by about this amount. The other type of error is called random error. Here, some random event causes a single data point to fall far outside the normal distribution of the data. For example, say I wanted to start a stopwatch right as someone dropped a ball, in order to measure the time it took to hit the floor. If I happened to sneeze right when he dropped it and didn't start the watch until the ball was almost touching the ground, that would be a random error: for that one data point, the measured time is much shorter than it should be, since the watch wasn't started until the ball had hardly any distance left to travel.
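To make this concrete, here is a quick Python sketch (my own, not part of the original notes) using the speed-of-light readings above, with units assumed to be m/s. The small spread shows the data is precise, while the large offset from 3E8 shows it is not accurate.

```python
# Quick check of "precise but not accurate" using the readings above.
import statistics

true_value = 3.0e8                              # speed of light, m/s
readings = [2.21e7, 2.23e7, 2.235e7, 2.36e7]    # the example measurements

mean = statistics.mean(readings)
spread = statistics.stdev(readings)             # sample standard deviation
offset = true_value - mean                      # roughly the systematic error

print(f"mean reading:       {mean:.3e}")
print(f"spread (precision): {spread:.3e}")      # ~7e5, small compared to the readings
print(f"offset (accuracy):  {offset:.3e}")      # ~2.8e8, huge
```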
When working with data points, their average value is extremely important. The average value, or mean, is the sum of all the data points divided by the number of data points. In other words..
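In symbols, for N data points x_1, x_2, ..., x_N:

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$$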
Where we denote the average value by a bar over the variable. This notation will be used a lot. While this is the most common way to get the average of a set of data, there is another way. Say, for example, that 50% of our measurements equaled 2, 20% of them equaled 4, and 30% equaled 6. The average value is then..
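That is, each value is weighted by the fraction of the time it occurs:

$$\bar{x} = (0.5)(2) + (0.2)(4) + (0.3)(6) = 1.0 + 0.8 + 1.8 = 3.6$$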
Note that this latter method is the one used when we have a distribution. A distribution is essentially the percentage of the entire data set that takes each value of x. For a large amount of data, one can formulate a distribution function f(x) whose curve shows how the values of x are distributed across the data. If we were to sum this distribution up over the whole x axis, we would obtain..
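namely

$$\int_{-\infty}^{\infty} f(x)\,dx = 1$$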
This is because a distribution is a percentage, and the sum of all the different percentages had better equal 1, or we have made a mistake somewhere. The limits of negative to positive infinity are chosen purely on principle; they represent the sum over every value that could possibly be measured. When we take the average over a distribution, we use a different notation: instead of a bar over the variable, we put angle brackets around it. To obtain the average value over the distribution, just as we did in the example above, we multiply the distribution at each value of x by the value of x itself and sum all of these up.
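In integral form, the average over the distribution is

$$\langle x \rangle = \int_{-\infty}^{\infty} x\,f(x)\,dx$$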
The quantity f(x)dx actually represents a probability. In our example of finding the average value, we said that the value 2 occurred 50% of the time; therefore, if you were to pick a value at random out of all the values obtained, there would be a 50% chance you get 2. If you integrate f(x) over an interval such as x - dx < x < x + dx, you obtain the probability of getting a value between x - dx and x + dx.
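More generally, the probability of obtaining a value anywhere between a and b is

$$P(a < x < b) = \int_{a}^{b} f(x)\,dx$$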
So let's get back to errors. The error in a measurement is defined as e_i = x_i - X, where X is the true or expected value. When adding up errors, or any data for that matter, it is helpful to use what is known as the root mean square of the data. Notice that in the above expression for the error, the error could be positive or negative depending on whether the measurement is above or below X. To get rid of this, we square the difference. Squaring the difference also has another benefit: it emphasizes the differences between numbers. The difference between 7 and 8 is only 1, while the difference between 49 and 64 is 15! Much higher! So the method typically used is to square each difference, sum up all the squares, take the mean, and then take the square root; this gives the root mean square of the data.
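In symbols, for n measurements the root mean square of the errors is

$$e_{\mathrm{rms}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} e_i^{\,2}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - X\right)^2}$$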
So if we take the square of the above error and plug it into the integral expression above for finding an average, we get what is essentially the average of the squared error.
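In terms of the distribution, the average squared error and its square root are

$$\sigma^2 = \left\langle (x - X)^2 \right\rangle = \int_{-\infty}^{\infty} (x - X)^2\, f(x)\,dx \qquad \text{and} \qquad \sigma = \sqrt{\left\langle (x - X)^2 \right\rangle}$$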
The first term is what is known as the variance, and it is a measure of how far apart the values are from each other. Note that a set of data can only have a variance if the data has an expected value. Since the integral measures the error of each data point from the expected value, it also inadvertently measures the spread of the data points from each other. Think about it: if all the data taken are EXTREMELY close to the expected value, the average of the squared errors will be smaller than if the values are spread further apart. Therefore, if the value of the integral is small, the data points are close together, and if it is large, the data points are far apart.
The second term is simply the square root of the variance, completing the root-mean-square procedure. This value is a measure of the spread of the distribution. You could also look at it as the root mean square of all the errors, giving an average error over the distribution. This term is called the standard deviation of the data.
Imagine you measure 100 pieces of data for a given experiment. The following day you measure another 100, and you continue to do this for a week. At the end of the week, you have seven sets of data, each with 100 measurements. If you wish to get the most accurate estimate of the true value of the data (X), you would average each set of data, giving you seven averages, and then average all the averages together, giving you one average value that accounts for all 700 measurements. The standard deviation of such an average will be denoted sigma sub m. What is the relation between the standard deviation of this average and the standard deviation of a single set of measurements? First consider a set of n measurements. The error of each measurement is given by..
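As before, with x_i the individual measurements and X the true value,

$$e_i = x_i - X, \qquad i = 1, 2, \dots, n$$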
And the error in the mean of these data points is..
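Writing the mean as a sum and bringing X inside,

$$E = \bar{x} - X = \frac{1}{n}\sum_{i=1}^{n} x_i - X = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - X\right) = \frac{1}{n}\sum_{i=1}^{n} e_i$$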
Some thought must go into why we were able to bring X inside the summation. You really need to understand and treat the summation as the sum of n terms. I know this may sound obvious, but a lot of the time you can get locked into thinking of it as one item, when in fact it is n items. Because X is subtracted once in each of the n terms and the whole sum is divided by n, the average of the differences between each data point and the expected value is equal to the average of the data points minus the expected value. Think about it!
So the error of the mean is the average of the individual errors. This should make sense. We now square both sides, just as we did with the individual errors, so that positive and negative contributions do not hide the size of the error.
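Squaring and then averaging over many repetitions of the experiment gives

$$E^2 = \frac{1}{n^2}\left(\sum_{i=1}^{n} e_i\right)^2 = \frac{1}{n^2}\left(\sum_{i=1}^{n} e_i^{\,2} + \sum_{i \neq j} e_i e_j\right) \quad\Longrightarrow\quad \left\langle E^2 \right\rangle = \frac{1}{n^2}\sum_{i=1}^{n}\left\langle e_i^{\,2} \right\rangle = \frac{\sigma^2}{n}$$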
Notice that the double sum went to zero. Why is this the case? Think about it: e_i and e_j are the errors of two different, independent measurements, and each is just as likely to be positive as negative. On average, then, the products e_i e_j cancel out, and as more and more measurements are taken the double sum averages to zero.
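Taking the square root gives the standard deviation of the mean:

$$\sigma_m = \sqrt{\left\langle E^2 \right\rangle} = \frac{\sigma}{\sqrt{n}}$$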
So we see that there are two ways of reducing the error in the mean and the standard deviation of the mean: either take more and more measurements, making n bigger, or make more precise measurements, making sigma smaller. The latter is usually more cost effective than the former, since the improvement from taking extra measurements only goes as one over the square root of n.
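Here is a small Python simulation (my own sketch, not from the original notes) that checks the sigma over root n behaviour by generating many sets of n measurements with an assumed true value and spread:

```python
# Check that the spread of the mean of n measurements is about sigma / sqrt(n).
import random
import statistics

true_value = 10.0   # assumed true value X
sigma = 2.0         # assumed spread of a single measurement
n = 100             # measurements per data set
trials = 2000       # number of repeated data sets

means = []
for _ in range(trials):
    data = [random.gauss(true_value, sigma) for _ in range(n)]
    means.append(statistics.mean(data))

print("predicted sigma_m = sigma/sqrt(n):", sigma / n ** 0.5)
print("observed spread of the means:     ", statistics.stdev(means))
```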
Treatment of Error in Functions
When we have a function Z=Z(A), the error in Z is given by..
This means we can write the relationship as...
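A standard way to write this, expanding Z(A + δA) to first order in the small error δA, is

$$\delta Z = Z(A + \delta A) - Z(A) \approx \frac{dZ}{dA}\,\delta A$$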
As an example of how we would find the error delta Z, we use the following function..
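As a hypothetical illustration (this particular choice of function is mine), take Z = A^n, a simple power law; the rule above then gives

$$\delta Z = \left|n A^{\,n-1}\right|\,\delta A \qquad\Longrightarrow\qquad \frac{\delta Z}{Z} = |n|\,\frac{\delta A}{A}$$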
For a function of more than one variable, the procedure is similar..
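Assuming the errors in the different variables are independent, the standard result adds their contributions in quadrature:

$$\delta Z = \sqrt{\left(\frac{\partial Z}{\partial A}\,\delta A\right)^2 + \left(\frac{\partial Z}{\partial B}\,\delta B\right)^2 + \cdots}$$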
As an example, if Z=A+B...
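then both partial derivatives equal 1, so

$$\delta Z = \sqrt{(\delta A)^2 + (\delta B)^2}$$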
Least Squares Regression Line
Imagine you have a scatter plot of data points, and you want to find the best-fit line through those points. A technique you can use is called the method of least squares. Our assumption in this derivation is that all of the error lies in the y values.
For a given pair of values, the deviation of the i'th reading is...
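Writing the i-th data point as (x_i, y_i) and the fitted line as y = mx + c,

$$d_i = y_i - (m x_i + c)$$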
The best values of m and c are the ones for which the following quantity is a minimum.
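Calling this sum S for convenience,

$$S = \sum_{i=1}^{n} d_i^{\,2} = \sum_{i=1}^{n}\left(y_i - m x_i - c\right)^2$$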
The reason we square each term is the same logic as the root-mean-square idea mentioned earlier. To obtain the minimum of the above sum of squares (hence the term least squares), you simply take the two partial derivatives, with respect to m and c, set them both equal to zero, and solve.
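Setting ∂S/∂m = 0 and ∂S/∂c = 0 gives two simultaneous equations,

$$\sum_i x_i y_i = m \sum_i x_i^{\,2} + c \sum_i x_i \qquad \text{and} \qquad \sum_i y_i = m \sum_i x_i + n c,$$

which can be solved for the slope:

$$m = \frac{n \sum_i x_i y_i - \left(\sum_i x_i\right)\left(\sum_i y_i\right)}{n \sum_i x_i^{\,2} - \left(\sum_i x_i\right)^2}$$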
Using this expression for m together with the second equation in our system, one can then find c and hence the best-fit line for a set of data points.
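Here is a minimal Python sketch of these formulas (my own, not from the original notes), assuming all of the error lies in the y values:

```python
# Least-squares fit of a line y = m*x + c using the formulas derived above.
def least_squares_fit(xs, ys):
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    c = (sum_y - m * sum_x) / n   # from the second normal equation
    return m, c

# Example with made-up data lying near y = 2x + 1:
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 9.0, 10.8]
m, c = least_squares_fit(xs, ys)
print(f"slope m = {m:.3f}, intercept c = {c:.3f}")
```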