The 9 concepts and formulas in probability that every data scientist should know


Probability is the likelihood of an event occurring. More broadly, probability is the branch of mathematics that provides models to describe random phenomena. These mathematical tools allow us to establish theoretical models for random phenomena and to use them to make predictions. Like every model, a probabilistic model is a simplification of the world. However, the model is useful as soon as it captures the essential features. In this article, we present 9 fundamental formulas and concepts in probability that every data scientist should understand and master in order to appropriately handle any project involving probability.

1. A probability is always between 0 and 1

The probability of an event is always between 0 and 1: \[0 \le P(A) \le 1\] If an event is impossible: \[P(A) = 0\] If an event is certain: \[P(A) = 1\] For example, throwing a 7 with a standard six-sided die (with faces ranging from 1 to 6) is impossible, so its probability is equal to 0. Throwing heads or tails with a coin is certain, so its probability is equal to 1.

2. Compute a probability

If the elements of a sample space (the set of all possible results of a randomized experiment) are equiprobable (i.e., all elements have the same probability), then the probability of an event occurring is equal to the number of favourable cases (number of ways it can happen) divided by the number of possible cases (total number of outcomes): \[P(A) = \frac{\text{number of favourable cases}}{\text{number of possible cases}}\]

For example, all numbers of a six-sided die are equiprobable since they all have the same probability of occurring. The probability of rolling a 3 with a die is thus \[P(3) = \frac{\text{number of favourable cases}}{\text{number of possible cases}} = \frac{1}{6}\] because there is only one favourable case (there is only one face with a 3 on it), and there are 6 possible cases (because there are 6 faces altogether).
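The counting rule for equiprobable outcomes can be sketched in R. This is only an illustration: the helper `prob()` and the variable names are hypothetical and not part of the article.

```r
# Sample space of a fair six-sided die: all faces are equiprobable
sample_space <- 1:6

# Hypothetical helper: P(event) = favourable cases / possible cases
prob <- function(is_favourable, space) {
  sum(is_favourable(space)) / length(space)
}

p_three <- prob(function(x) x == 3, sample_space)      # rolling a 3
p_seven <- prob(function(x) x == 7, sample_space)      # impossible event
p_any   <- prob(function(x) x %in% 1:6, sample_space)  # certain event

print(c(three = p_three, seven = p_seven, any_face = p_any))
```

The three results are 1/6, 0 and 1: the single favourable face, the impossible event and the certain event.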

4. Union of two events

The probability of the union of two events is the probability of either occurring: \[P(A \cup B) = P(A) + P(B) - P(A \cap B)\]

Suppose that, in a given year, a fire breaks out in house A with probability \(P(A)\), in house B with probability \(P(B)\), and in at least one of the two houses with probability 80%, so \(P(A \cup B) = 0.8\). The probability of a fire breaking out in house A or house B is \[P(A \cup B) = P(A) + P(B) - P(A \cap B) = 0.8\] By summing \(P(A)\) and \(P(B)\), the intersection of A and B, i.e. \(P(A \cap B)\), is counted twice. This is the reason we subtract it to count it only once.

If two events are mutually exclusive (i.e., two events that cannot occur simultaneously), the probability of both events occurring is equal to 0, so the above formula becomes \[P(A \cup B) = P(A) + P(B)\] For example, the event “rolling a 3” and the event “rolling a 6” on a six-sided die are two mutually exclusive events since they cannot both occur at the same time. Since their joint probability is equal to 0, the probability of rolling a 3 or 6 on a six-sided die is \[P(3 \cup 6) = P(3) + P(6) = \frac{1}{6} + \frac{1}{6} = \frac{1}{3}\]
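The addition rule can be sketched in R. The individual probabilities for each house do not appear in the text above, so the values of `p_a`, `p_b` and `p_ab` below are hypothetical, chosen only so that the union matches the stated 80%.

```r
# Hypothetical inputs, NOT taken from the article
p_a  <- 0.60  # fire in house A
p_b  <- 0.45  # fire in house B
p_ab <- 0.25  # fire in both houses

# General addition rule: the intersection is subtracted once
p_union <- p_a + p_b - p_ab

# Mutually exclusive events: rolling a 3 or a 6 with one die,
# so the intersection term is 0 and drops out
p_3_or_6 <- 1 / 6 + 1 / 6

print(p_union)
print(p_3_or_6)
```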

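5. Intersection of two events

If two events are independent, the probability of the intersection of the two events (i.e., the joint probability) is the probability of the two events occurring: \[P(A \cap B) = P(A) \cdot P(B)\] For instance, if two coins are flipped, the probability of both coins being tails is \[P(T_1 \cap T_2) = P(T_1) \cdot P(T_2) = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}\] If two events are mutually exclusive, their joint probability is equal to 0: \[P(A \cap B) = 0\]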
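As a rough check, the joint probability of two tails can also be obtained by enumerating the equiprobable outcomes of two coin flips, and the joint probability of two mutually exclusive events by enumerating a single die roll. The object names are illustrative only.

```r
# All equiprobable outcomes of flipping two coins (4 rows)
flips <- expand.grid(coin1 = c("H", "T"), coin2 = c("H", "T"),
                     stringsAsFactors = FALSE)

# Joint probability of two tails: favourable cases / possible cases
p_both_tails <- mean(flips$coin1 == "T" & flips$coin2 == "T")

# Marginal probability of tails for each coin
p_t1 <- mean(flips$coin1 == "T")
p_t2 <- mean(flips$coin2 == "T")

# Mutually exclusive events: one roll cannot be both a 3 and a 6
die <- 1:6
p_3_and_6 <- mean(die == 3 & die == 6)

print(p_both_tails)  # joint probability by counting
print(p_t1 * p_t2)   # product of the marginals
print(p_3_and_6)     # joint probability of exclusive events
```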

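6. Independence of two events

The independence of two events can be verified thanks to the above formula. If the equality holds, the two events are said to be independent, otherwise the two events are said to be dependent. Formally, the events A and B are independent if and only if \[P(A \cap B) = P(A) \cdot P(B)\]

In the example of the two coins: \[P(T_1 \cap T_2) = \frac{1}{4}\] and \[P(T_1) \cdot P(T_2) = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}\] so the following equality holds: \[P(T_1 \cap T_2) = P(T_1) \cdot P(T_2)\] The two events are thus independent, denoted \(T_1{\perp\!\!\!\perp}T_2\).

In the example of the fire breaking out in two houses (see section 4), the joint probability \(P(A \cap B)\) differs from the product \(P(A) \cdot P(B)\), so the following equality does not hold: \[P(A \cap B) = P(A) \cdot P(B)\] The two events are thus dependent (or not independent), denoted \(A \not\!\perp\!\!\!\perp B\).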
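The independence check can be written as a small function. This is a minimal sketch: `is_independent()` and its tolerance are hypothetical, and the fire probabilities are illustrative values that do not come from the article (they are only chosen to be compatible with a union of 80%).

```r
# Hypothetical helper: TRUE when P(A and B) equals P(A) * P(B),
# up to a small numerical tolerance
is_independent <- function(p_a, p_b, p_ab, tol = 1e-9) {
  abs(p_ab - p_a * p_b) < tol
}

# Two coins: P(T1) = P(T2) = 1/2 and P(T1 and T2) = 1/4
coins_independent <- is_independent(1 / 2, 1 / 2, 1 / 4)

# Fire example with hypothetical values (NOT from the article):
# P(A) = 0.60, P(B) = 0.45, P(A and B) = 0.25
fires_independent <- is_independent(0.60, 0.45, 0.25)

print(coins_independent)
print(fires_independent)
```

With these inputs the coins pass the check, while the fire events do not, because 0.60 times 0.45 is 0.27 rather than 0.25.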

7. Conditional probability

Suppose two events A and B with \(P(B) > 0\). The conditional probability of A given (knowing) B is the likelihood of event A occurring given that event B has occurred: \[P(A | B) = \frac{P(A \cap B)}{P(B)} \text{ (Eq. 1)}\] Note that, in general, the probability of A given B is not equal to the probability of B given A, that is, \(P(A | B) \ne P(B | A)\).

From the formula of the conditional probability, we can derive the multiplicative law: \[P(A \cap B) = P(A | B) \cdot P(B)\]

If two events are independent, \(P(A \cap B) = P(A) \cdot P(B)\), and, for \(P(B) > 0\): \[P(A | B) = \frac{P(A \cap B)}{P(B)}\] \[P(A | B) = \frac{P(A) \cdot P(B)}{P(B)}\] \[P(A | B) = P(A) \text{ (Eq. 2)}\] Similarly, for \(P(A) > 0\): \[P(B | A) = \frac{P(B \cap A)}{P(A)}\] \[P(B | A) = \frac{P(B) \cdot P(A)}{P(A)}\] \[P(B | A) = P(B) \text{ (Eq. 3)}\]

Equations 2 and 3 mean that knowing that one event occurred does not influence the probability of the outcome of the other event. This is in fact the definition of independence: if knowing that one event occurred does not help to predict (does not influence) the outcome of the other event, the two events are in essence independent.

From the formulas of the conditional probability and the multiplicative law, we can derive Bayes’ theorem: \[P(B | A) = \frac{P(B \cap A)}{P(A)} \text{ (from conditional probability)}\] \[P(B | A) = \frac{P(A \cap B)}{P(A)} \text{ (since } P(A \cap B) = P(B \cap A))\] \[P(B | A) = \frac{P(A | B) \cdot P(B)}{P(A)} \text{ (from multiplicative law)}\]

In order to illustrate the conditional probability and Bayes’ theorem, consider the following problem: In order to determine the presence of a disease in a person, a blood test is performed. When a person has the disease, the test reveals the disease in 80% of cases. When the disease is not present, the test is negative in 90% of cases. Experience has shown that the probability of the disease being present is 10%. A researcher would like to know the probability that an individual has the disease given that the result of the test is positive.

To answer this question, the following events are defined:

P: the result of the test is positive
D: the person has the disease

Moreover, we use a tree diagram to illustrate the statement. With \(\bar{D}\) and \(\bar{P}\) denoting the complements of D and P, its 4 scenarios are: \[P(D \cap P) = 0.1 \cdot 0.8 = 0.08\] \[P(D \cap \bar{P}) = 0.1 \cdot 0.2 = 0.02\] \[P(\bar{D} \cap P) = 0.9 \cdot 0.1 = 0.09\] \[P(\bar{D} \cap \bar{P}) = 0.9 \cdot 0.9 = 0.81\] (The sum of all 4 scenarios must be equal to 1 since these 4 scenarios include all possible cases.)

We are looking for the probability that an individual has the disease given that the result of the test is positive, \(P(D | P)\). Following the formula of the conditional probability (Eq. 1) we have: \[P(D | P) = \frac{P(D \cap P)}{P(P)}\] From the tree diagram, we can see that a positive test result is possible under two scenarios: (i) when a person has the disease, or (ii) when the person does not actually have the disease (because the test is not always correct). In order to find the probability of a positive test result, \(P(P)\), we need to sum up those two scenarios: \[P(P) = P(D \cap P) + P(\bar{D} \cap P) = 0.08 + 0.09 = 0.17\] so that \[P(D | P) = \frac{P(D \cap P)}{P(P)} = \frac{0.08}{0.17} = 0.4706\]

The probability of having the disease given that the result of the test is positive is only 47.06%. This means that in this specific case (with the same percentages), an individual has less than 1 chance out of 2 of having the disease knowing that their test is positive! This relatively small percentage is due to the facts that the disease is quite rare (only 10% of the population is affected) and that the test is not always correct (sometimes it detects the disease although it is not present, and sometimes it does not detect it although it is present). As a consequence, a larger share of the population is healthy with a positive result (9%) than sick with a positive result (8%). This explains why several diagnostic tests are often performed before a result is announced, especially for rare diseases.
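The calculation for the blood test can be reproduced in a few lines of R. The function `posterior()` and its argument names are hypothetical; only the three input percentages (10%, 80%, 90%) come from the example.

```r
# Hypothetical helper: P(D | P) from the prevalence P(D), the rate of
# positive tests among sick people P(P | D), and the rate of negative
# tests among healthy people P(not P | not D)
posterior <- function(prior, p_pos_given_d, p_neg_given_not_d) {
  # P(P): positive with the disease + positive without the disease
  p_pos <- p_pos_given_d * prior + (1 - p_neg_given_not_d) * (1 - prior)
  # Bayes' theorem
  p_pos_given_d * prior / p_pos
}

# Figures stated in the example
p_d_given_pos <- posterior(prior = 0.10,
                           p_pos_given_d = 0.80,
                           p_neg_given_not_d = 0.90)

print(round(p_d_given_pos, 4))  # 0.4706, i.e. 47.06%
```

Calling `posterior()` with a smaller `prior` shows how the rarity of the disease drives the result down, which is the point made in the discussion of the example.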

8. Accuracy measures

Based on the example of the disease and the diagnostic test presented above, we explain the most common accuracy measures: false negatives, false positives, sensitivity, specificity, positive predictive value and negative predictive value.

Before diving into the details of these accuracy measures, note how the 4 scenarios of the tree diagram are labeled: true positives (TP: disease and positive test), false negatives (FN: disease and negative test), false positives (FP: no disease and positive test) and true negatives (TN: no disease and negative test). Below, D denotes “the person has the disease”, P denotes “the test is positive” and a bar denotes the complement.

The false negatives (FN) are the people incorrectly labeled as not having the disease or the condition, when in reality it is present. It is like telling a woman who is 7 months pregnant that she is not pregnant. From the tree diagram, we have: \[FN = P(D \cap \bar{P}) = 0.1 \cdot 0.2 = 0.02\]

The false positives (FP) are the people incorrectly labeled as having the disease or the condition, when in reality it is not present. It is like telling a man he is pregnant. From the tree diagram, we have: \[FP = P(\bar{D} \cap P) = 0.9 \cdot 0.1 = 0.09\]

The sensitivity of a test, also referred to as the recall, measures the ability of a test to detect the condition when the condition is present (the percentage of sick people who are correctly identified as having the disease): \[\text{Sensitivity} = \frac{TP}{TP + FN}\] where TP is the true positives. From the tree diagram, we have: \[\text{Sensitivity} = \frac{0.08}{0.08 + 0.02} = 0.8\]

The specificity of a test measures the ability of a test to correctly exclude the condition when the condition is absent (the percentage of healthy people who are correctly identified as not having the disease): \[\text{Specificity} = \frac{TN}{TN + FP}\] where TN is the true negatives. From the tree diagram, we have: \[\text{Specificity} = \frac{0.81}{0.81 + 0.09} = 0.9\]

The positive predictive value, also referred to as the precision, is the proportion of positives that correspond to the presence of the condition, so the proportion of positive results that are true positive results: \[PPV = \frac{TP}{TP + FP}\] From the tree diagram, we have: \[PPV = \frac{0.08}{0.08 + 0.09} = 0.4706\]

The negative predictive value is the proportion of negatives that correspond to the absence of the condition, so the proportion of negative results that are true negative results: \[NPV = \frac{TN}{TN + FN}\] From the tree diagram, we have: \[NPV = \frac{0.81}{0.81 + 0.02} = 0.9759\]
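The four accuracy measures can be computed together from the four scenarios of the disease example. The sketch below assumes the standard definitions in terms of TP, FN, FP and TN; the variable names are illustrative.

```r
# Joint probabilities of the 4 scenarios, from the figures of the example
tp <- 0.10 * 0.80  # disease and positive test
fn <- 0.10 * 0.20  # disease and negative test
fp <- 0.90 * 0.10  # no disease and positive test
tn <- 0.90 * 0.90  # no disease and negative test

measures <- c(
  sensitivity = tp / (tp + fn),  # recall
  specificity = tn / (tn + fp),
  ppv         = tp / (tp + fp),  # precision
  npv         = tn / (tn + fn)
)

print(round(measures, 4))
```

Note that the positive predictive value is the same quantity as the probability of having the disease given a positive test, 47.06% in this example.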

9. Counting techniques

In order to use the formula in section 2, one must know how to count the number of possible elements. There are 3 main counting techniques in probability. See below how to count the number of possible elements in case of equiprobable results.

The number of permutations is as follows: \[P^r_n = n \times (n - 1) \times \cdots \times (n - r + 1) = \frac{n!}{(n - r)!}\] with \(r\) the length, \(n\) the number of elements and \(r \le n\). Note that \(0! = 1\) and \(k! = k \times (k - 1) \times (k - 2) \times \cdots \times 2 \times 1\) if \(k = 1, 2, \dots\) The order is important in permutations!

Count the permutations of length 2 of the set \(A = \{a, b, c, d\}\), without a letter being repeated. How many permutations do you find? By hand: \[P^2_4 = \frac{4!}{(4 - 2)!} = \frac{24}{2} = 12\] In R:

    library(gtools)
    x <- c("a", "b", "c", "d")
    # count the permutations of length 2 without repetition
    nrow(permutations(n = 4, r = 2, v = x))
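The count for the set \(A = \{a, b, c, d\}\) can also be cross-checked in base R, without any package. This is a sketch under the assumption that a permutation of length 2 is an ordered pair of distinct letters; the object names are illustrative.

```r
x <- c("a", "b", "c", "d")
n <- length(x)  # number of elements
r <- 2          # length of each permutation

# By formula: n! / (n - r)!
n_perms_formula <- factorial(n) / factorial(n - r)

# By enumeration: all ordered pairs, then drop pairs repeating a letter
pairs <- expand.grid(first = x, second = x, stringsAsFactors = FALSE)
perms <- pairs[pairs$first != pairs$second, ]

print(n_perms_formula)
print(nrow(perms))
```

Both approaches give 12. Because the order matters, the pairs (a, b) and (b, a) are counted as two different permutations.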


