
The Data Daily

The intuitive understanding of correlation coefficient


Correlation is one of statistics' all-time classics, yet it remains a measure that nearly everyone uses in their analysis. In the classic interpretation, correlation is a measure of the relationship or correspondence between two variables. It is usually visualized with a correlation plot and quantified by the correlation coefficient (r), which ranges from -1 to 1. The important takeaway from r is that it shows the degree of linear relationship between two variables, in terms of how a change in one variable goes along with a change in the other.

While interpreting a correlation plot is widely known and fairly straightforward, the intuition behind the equation for the correlation coefficient (r) is less widely known. That is what this article is about: why the result lies between -1 and 1, and where the signs come from.

Before we turn to r, let's recall the definition of correlation: it is a measure of the relationship or correspondence between two variables.

This is where r comes into play: it shows the degree of relationship between two variables. The sample correlation coefficient is calculated using the following formula:
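Assuming the standard Pearson definition for a sample of n pairs (x_i, y_i), which matches the steps described in this article, the formula can be written as below. Here x̄ and ȳ are the sample means and s_x, s_y the sample standard deviations; these symbol names are a notational assumption.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_x\, s_y}
```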

This can also be rewritten as:
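Assuming the standard Pearson definition for a sample, with sample means x̄, ȳ and sample standard deviations s_x, s_y (notation assumed here), an equivalent form moves the standard deviations inside the sum, so that r becomes the sum of products of standardized values divided by n - 1:

```latex
r = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)
```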

The numerator of the equation is essentially a covariance measure, showing the relationship or correspondence between the values of the X and Y variables. That correspondence tells us whether changes in the values of X go along with changes in the values of Y. For this reason we want the values of X and Y to be comparable, so we center them, i.e. we subtract the mean from each value of X and of Y (the sample mean for a sample, the population mean for a population).

The centered values of X and Y are then multiplied pair by pair, which yields both positive and negative values. Why is that?

If the paired values of X and Y are both above or both below their means, the product is positive. If the value of X is above its mean and the value of Y is below its mean (or the other way round), the product is negative.
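The sign rule can be checked on a small example. The sketch below uses hypothetical toy data and a helper named `centered_products`, both invented here purely for illustration.

```python
# Hypothetical toy data, chosen only so that both signs appear
x = [1, 2, 3, 4, 5]
y = [1, 4, 3, 2, 5]

def centered_products(x, y):
    """Return the products of paired deviations from the means."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    return [(xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)]

products = centered_products(x, y)
for xi, yi, p in zip(x, y, products):
    sign = "positive" if p > 0 else "negative" if p < 0 else "zero"
    print(f"x={xi}, y={yi}, product={p:+.1f} ({sign})")
```

With these numbers both means are 3, so the pairs (1, 1) and (5, 5) give positive products, the pairs (2, 4) and (4, 2) give negative ones, and the pair (3, 3) sits exactly on both means and contributes zero.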

The summation balances the positive and negative products against each other, and the sign of the total gives the direction in which X and Y change together. The sum is then standardized by the standard deviations of X and Y. This puts the deviations of X about its mean on the same scale as the deviations of Y about its mean. In other words, this step ensures that the values of X and Y are comparable.

Finally, the whole result is averaged by dividing by n - 1 (for a sample; for a population, the result is divided by n).
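The steps described so far (center, multiply, sum, standardize, divide by n - 1) can be sketched in plain Python. This is a minimal illustration, assuming the standard sample definition; the function name `pearson_r` and the data are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Sample correlation coefficient, computed step by step."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Step 1: center the values around their means
    dx = [xi - x_bar for xi in x]
    dy = [yi - y_bar for yi in y]

    # Step 2: multiply the paired deviations and sum the products
    sum_products = sum(a * b for a, b in zip(dx, dy))

    # Step 3: sample standard deviations (divisor n - 1)
    s_x = sqrt(sum(a * a for a in dx) / (n - 1))
    s_y = sqrt(sum(b * b for b in dy) / (n - 1))

    # Step 4: standardize by both standard deviations and divide by n - 1
    return sum_products / ((n - 1) * s_x * s_y)

# Hypothetical toy data
x = [1, 2, 3, 4, 5]
y = [1, 4, 3, 2, 5]
print(round(pearson_r(x, y), 3))  # prints 0.6
```

Passing the same list for both arguments gives a value of 1 (up to floating-point rounding), and passing a list together with its negation gives -1, which are the two ends of the range.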

The correlation coefficient (r) therefore ranges from -1 to 1 as a result of this chain of operations: the standardization divides the sum of products by the largest magnitude that sum could reach given the spread of X and Y. The negative and positive signs come from the falling and rising of the values of X and Y relative to their means. The correlation coefficient shows the strength of the relationship or correspondence between X and Y, in the sense that a change in the values of X goes along with a change in the values of Y.
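One standard way to make the bound precise is the Cauchy-Schwarz inequality. The sketch below writes a_i = x_i - x̄ and b_i = y_i - ȳ for the centered values (notation assumed here) and uses sample standard deviations with divisor n - 1, so that (n - 1) s_x s_y equals the square root of the product of the two sums of squares.

```latex
\left( \sum_{i=1}^{n} a_i b_i \right)^{2}
\le \left( \sum_{i=1}^{n} a_i^{2} \right) \left( \sum_{i=1}^{n} b_i^{2} \right)
\quad\Longrightarrow\quad
|r| = \frac{\left| \sum_{i=1}^{n} a_i b_i \right|}
           {\sqrt{\sum_{i=1}^{n} a_i^{2}}\, \sqrt{\sum_{i=1}^{n} b_i^{2}}} \le 1
```

Equality holds only when the centered values of Y are an exact multiple of the centered values of X, that is, when the points lie on a straight line.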

The correlation coefficient (r) is therefore a "normalized covariance": the values behind the covariance of the two variables are centered and standardized to make the two variables comparable. The range of r from -1 to 1 is the product of this chain of mathematical procedures. The positive and negative signs come from the combinations of falling and rising values of X and Y. The correlation coefficient is then interpreted as a degree of relationship, in the sense that a change in one variable goes along with a change in the other.
