Joint Probability

Probability and Stochastic Processes

Sergios Theodoridis , in Machine Learning (Second Edition), 2020

2.2.2 Discrete Random Variables

A discrete random variable x can take any value from a finite or countably infinite set X. The probability of the event, "x = x, x ∈ X," is denoted as

(2.6) P(x = x), or simply P(x).

The function P is known as the probability mass function (PMF). Being a probability of events, it has to satisfy the first axiom, so P(x) ≥ 0. Assuming that no two values in X can occur simultaneously and that after any experiment a single value will always occur, the second and third axioms combined give

(2.7) ∑_{x∈X} P(x) = 1.

The set X is also known as the sample or state space.
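As a quick sketch (not from the text), a fair six-sided die gives a PMF on the state space X = {1, …, 6} that satisfies both requirements:

```python
# PMF of a fair die over the state space X = {1, ..., 6}.
pmf = {x: 1 / 6 for x in range(1, 7)}

# First axiom: P(x) >= 0 for every x in X.
assert all(p >= 0 for p in pmf.values())

# Eq. (2.7): the probabilities over the whole state space sum to one.
assert abs(sum(pmf.values()) - 1.0) < 1e-12
```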

Joint and Conditional Probabilities

The joint probability of two events, A, B, is the probability that both events occur simultaneously, and it is denoted as P(A, B). Let us now consider two random variables, x, y, with sample spaces X = {x_1, …, x_{n_x}} and Y = {y_1, …, y_{n_y}}, respectively. Let us adopt the relative frequency definition and assume that we carry out n experiments and that each one of the values in X occurred n_1^x, …, n_{n_x}^x times and each one of the values in Y occurred n_1^y, …, n_{n_y}^y times. Then,

P(x_i) ≈ n_i^x / n, i = 1, 2, …, n_x, and P(y_j) ≈ n_j^y / n, j = 1, 2, …, n_y.

Let us denote by n_ij the number of times the values x_i and y_j occurred simultaneously. Then P(x_i, y_j) ≈ n_ij / n. Simple reasoning dictates that the total number of times, n_i^x, that the value x_i occurred is equal to

(2.8) n_i^x = ∑_{j=1}^{n_y} n_ij.

Dividing both sides of the above by n, the following sum rule readily results:

(2.9) P(x_i) = ∑_{j=1}^{n_y} P(x_i, y_j).
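The relative-frequency construction can be sketched in a few lines of Python; the alphabet of x and y values below is hypothetical:

```python
from collections import Counter
import random

random.seed(0)
n = 10_000
# Hypothetical experiment: draw n pairs (x, y); relative frequencies then
# approximate the joint and marginal probabilities, as in the text.
pairs = [(random.choice("abc"), random.choice("uv")) for _ in range(n)]

joint = {xy: c / n for xy, c in Counter(pairs).items()}               # P(x_i, y_j)
marg_x = {x: c / n for x, c in Counter(x for x, _ in pairs).items()}  # P(x_i)

# Sum rule (2.9): each marginal equals the joint summed over y.
for x, px in marg_x.items():
    assert abs(px - sum(p for (xi, _), p in joint.items() if xi == x)) < 1e-9
```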

The conditional probability of an event A, given another event B, is denoted as P(A|B), and it is defined as

(2.10) P(A|B) = P(A, B) / P(B),

provided P(B) ≠ 0. It can be shown that this is indeed a probability, in the sense that it respects all three axioms [6]. We can better grasp its physical meaning if the relative frequency definition is adopted. Let n_AB be the number of times that both events occurred simultaneously, and let n_B be the number of times event B occurred, out of n experiments. Then we have

(2.11) P(A|B) = (n_AB / n) / (n_B / n) = n_AB / n_B.

In other words, the conditional probability of an event A, given another event B, is the relative frequency that A occurred, not with respect to the total number of experiments performed, but relative to the times event B occurred.

Viewed differently, and adopting similar notation in terms of random variables, in conformity with Eq. (2.9), the definition of conditional probability leads to what is also known as the product rule of probability, written as

(2.12) P(x, y) = P(y|x) P(x) = P(x|y) P(y).

To differentiate from the joint and conditional probabilities, probabilities P ( x ) and P ( y ) are known as marginal probabilities. The product rule is generalized in a straightforward way to l random variables, i.e.,

P(x_1, x_2, …, x_l) = P(x_l | x_{l−1}, …, x_1) P(x_{l−1}, …, x_1),

which recursively leads to the product

P(x_1, x_2, …, x_l) = P(x_l | x_{l−1}, …, x_1) P(x_{l−1} | x_{l−2}, …, x_1) ⋯ P(x_1).
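The chain of conditionals telescopes back to the joint, which can be checked numerically; the joint PMF below is a hypothetical example over three binary variables:

```python
import itertools
import random

random.seed(1)
# Hypothetical joint PMF over three binary variables x1, x2, x3.
states = list(itertools.product([0, 1], repeat=3))
w = [random.random() for _ in states]
P = {s: wi / sum(w) for s, wi in zip(states, w)}

def marg(fix):
    """Marginal probability of the assignments in `fix` (index -> value)."""
    return sum(p for s, p in P.items() if all(s[i] == v for i, v in fix.items()))

# Chain rule: P(x1, x2, x3) = P(x3 | x2, x1) * P(x2 | x1) * P(x1).
for a, b, c in states:
    rhs = (P[(a, b, c)] / marg({0: a, 1: b})) \
        * (marg({0: a, 1: b}) / marg({0: a})) * marg({0: a})
    assert abs(P[(a, b, c)] - rhs) < 1e-12
```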

Statistical independence: Two random variables are said to be statistically independent if and only if their joint probability is equal to the product of the respective marginal probabilities, i.e.,

(2.13) P ( x , y ) = P ( x ) P ( y ) .
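Eq. (2.13) gives a direct numerical test for dependence; the joint table below is a hypothetical example in which the condition fails:

```python
# A hypothetical joint PMF over two binary variables, and its marginals.
Pxy = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.1, (1, 1): 0.4}
Px = {x: sum(p for (xi, _), p in Pxy.items() if xi == x) for x in (0, 1)}
Py = {y: sum(p for (_, yi), p in Pxy.items() if yi == y) for y in (0, 1)}

# Eq. (2.13) fails here, so x and y are dependent:
independent = all(abs(Pxy[x, y] - Px[x] * Py[y]) < 1e-12
                  for x in (0, 1) for y in (0, 1))
assert not independent   # P(0, 0) = 0.5, but P(x=0) P(y=0) = 0.5 * 0.6 = 0.3
```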

Bayes Theorem

The Bayes theorem is a direct consequence of the product rule and the symmetry property of the joint probability, P(x, y) = P(y, x), and it is stated as

(2.14) P(y|x) = P(x|y) P(y) / P(x),

where the marginal, P ( x ) , can be written as

P(x) = ∑_{y∈Y} P(x, y) = ∑_{y∈Y} P(x|y) P(y),

and it can be considered as the normalizing constant of the numerator on the right-hand side of Eq. (2.14), which guarantees that summing P(y|x) over all possible values of y ∈ Y results in one.

The Bayes theorem plays a central role in machine learning, and it will be the basis for developing Bayesian techniques for estimating the values of unknown parameters.

URL: https://www.sciencedirect.com/science/article/pii/B9780128188033000118

Conventional HSMMs*

Shun-Zheng Yu , in Hidden Semi-Markov Models, 2016

5.2.1 Smoothed Probabilities

The joint probability that state (i, d) transits to state j at time t + 1 and the observation sequence takes o_{1:T}, given the model parameters, is

ξ_t(i, d; j) ≜ P[S_{[t−d+1:t]} = i, S_{t+1} = j, o_{1:T} | λ] = α_t(i, d) a_{ij}(d) b_j(o_{t+1}) β_{t+1}(j, 1), i ≠ j,

and

ξ_t(i, d; i) = α_t(i, d) a_{ii}(d) b_i(o_{t+1}) β_{t+1}(i, d + 1), d ≤ D − 1.

Then the smoothed probability of state transition from state (i, d) to state j ≠ i at time t + 1, given the model and the observation sequence, is ξ_t(i, d; j) / P[o_{1:T} | λ]. The smoothed probability of being in state i for duration d at time t, given the model and the observation sequence, as defined in Eqn (2.10), is η_t(i, d) / P[o_{1:T} | λ], and we have

η_t(i, d) = ∑_{j∈S∖{i}} ξ_t(i, d; j).

We also have ξ_t(i, j) = ∑_d ξ_t(i, d; j) and γ_t(j) = γ_{t+1}(j) + ∑_{i∈S∖{j}} [ξ_t(j, i) − ξ_t(i, j)], as yielded by Eqn (2.14).

Then γ_1(i) can be used to estimate the initial probabilities π̂_i, ∑_t γ_t(j) I(o_t = v_k) can be used for the observation probabilities b̂_j(v_k), and ∑_t ξ_t(i, d; j) for the transition probabilities â_{ij}(d).
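The ξ and η quantities above can be sketched with placeholder NumPy arrays; the shapes and values are hypothetical, and the 0-based duration indexing is an assumption of this sketch, not Yu's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, V, T = 3, 4, 5, 6          # states, max duration, symbols, length (all hypothetical)
alpha = rng.random((T, N, D))    # placeholder forward variables alpha_t(i, d)
beta = rng.random((T, N, D))     # placeholder backward variables beta_t(j, d)
A = rng.random((N, N, D))        # duration-dependent transitions a_ij(d)
b = rng.random((N, V))           # observation probabilities b_j(o)
obs = rng.integers(0, V, size=T)

def xi(t, i, d, j):
    """xi_t(i, d; j) per the two displayed formulas (0-based durations)."""
    if j != i:
        return alpha[t, i, d] * A[i, j, d] * b[j, obs[t + 1]] * beta[t + 1, j, 0]
    if d < D - 1:
        return alpha[t, i, d] * A[i, i, d] * b[i, obs[t + 1]] * beta[t + 1, i, d + 1]
    return 0.0

# eta_t(i, d) as the sum of xi_t(i, d; j) over j != i.
def eta(t, i, d):
    return sum(xi(t, i, d, j) for j in range(N) if j != i)

assert eta(0, 0, 0) >= 0.0
```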

URL: https://www.sciencedirect.com/science/article/pii/B978012802767700005X

Time Series: ARIMA Methods

G.C. Tiao , in International Encyclopedia of the Social & Behavioral Sciences, 2001

3 Stationary vs. Nonstationary Processes

When the joint probability distribution of any k observations y_{t+1}, …, y_{t+k} from a stochastic process remains the same for different values of t, the process is said to be 'stationary.' In practice, this means that there is a state of equilibrium in which the overall behavior of the time series stays roughly the same over time. When this condition does not hold, the series is said to be nonstationary. The theory of stationary processes has been well developed and plays an important role in time series analysis. Although many of the time series encountered in practice exhibit a nonstationary behavior, it is frequently the case that a suitable transformation such as differencing the data will render them stationary so that the theory of stationary processes can be applied.

To illustrate, Fig. 1 shows the time series of monthly interest rates of 90-day Treasury bills from January 1985 to December 1993, and Fig. 2 the series of month to month changes of the rates (first differences). The interest rate series exhibits a drifting or wandering behavior without a stable level. On the other hand, the series of the first differences seems to fluctuate about a fixed mean level with constant variance over the observational period. This example shows that a drifting nonstationary series can be transformed into a stationary one by the differencing operation. In fact, financial time series such as stock prices, prices of derivatives, and exchange rates often behave in this manner.

Figure 1. Monthly interest rates of 90-day Treasury bills, 1985–93

Figure 2. Month to month changes of the interest rates
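The differencing idea can be sketched on a simulated series; the random-walk "interest rate" below is hypothetical, mirroring the behavior of Figs. 1 and 2:

```python
import numpy as np

rng = np.random.default_rng(42)
# A hypothetical random-walk series: nonstationary in levels, but its first
# differences recover the stationary month-to-month shocks.
steps = rng.normal(0.0, 0.1, size=120)   # i.i.d. shocks
rates = 7.0 + np.cumsum(steps)           # drifting "interest rate" level
changes = np.diff(rates)                 # first differences, as in Fig. 2

assert np.allclose(changes, steps[1:])   # differencing undoes the cumulative sum
```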

URL: https://www.sciencedirect.com/science/article/pii/B0080430767005209

Bayesian Networks

Richard E. Neapolitan , Xia Jiang , in Probabilistic Methods for Financial and Marketing Informatics, 2007

Section 3.1

Exercise 3.1

In Example 3.3 it was left as an exercise to show, for all values of s, l, and c, that

P(s, l, c) = P(s|c) P(l|c) P(c).

Show this.

Exercise 3.2

Consider the joint probability distribution P in Example 3.1.

1.

Show that P satisfies the Markov condition with the DAG in Figure 3.31 (a) and that P is equal to the product of its conditional distributions in that DAG.

Figure 3.31. The probability distribution discussed in Example 3.1 satisfies the Markov condition with the DAGs in (a) and (b), but not with the DAG in (c).

2.

Show that P satisfies the Markov condition with the DAG in Figure 3.31 (b) and that P is equal to the product of its conditional distributions in that DAG.

3.

Show that P does not satisfy the Markov condition with the DAG in Figure 3.31 (c) and that P is not equal to the product of its conditional distributions in that DAG.

Exercise 3.3

Create an arrangement of objects similar to the one in Figure 3.4, but with a different distribution of letters, shapes, and colors, so that, if random variables L, S, and C are defined as in Example 3.1, then the only independency or conditional independency among the variables is I_P(L, S). Does this distribution satisfy the Markov condition with any of the DAGs in Figure 3.31? If so, which one(s)?

Exercise 3.4

Consider the joint probability distribution of the random variables defined in Example 3.1 relative to the objects in Figure 3.4. Suppose we compute that distribution's conditional distributions for the DAG in Figure 3.31 (c), and we take their product. Theorem 3.1 says this product is a joint probability distribution that constitutes a Bayesian network with that DAG. Is this the actual joint probability distribution of the random variables?
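A generic helper (not from the book) may be useful when working these exercises: it checks whether a joint distribution over (L, S, C) equals the product of its conditional distributions for a DAG with edges C → L and C → S; the distribution below is a hypothetical example built to factorize:

```python
import itertools

def marg(P, fix):
    """Marginal probability of the assignments in `fix` (index -> value)."""
    return sum(p for st, p in P.items() if all(st[i] == v for i, v in fix.items()))

def factorizes(P):
    """True if P(l, s, c) = P(l|c) P(s|c) P(c) for every state."""
    for (l, s, c) in P:
        pc = marg(P, {2: c})
        if pc == 0:
            continue
        prod = (marg(P, {0: l, 2: c}) / pc) * (marg(P, {1: s, 2: c}) / pc) * pc
        if abs(P[(l, s, c)] - prod) > 1e-9:
            return False
    return True

# Hypothetical conditional distributions, combined into a joint:
Pc = {0: 0.5, 1: 0.5}
Pl_c = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.7}
Ps_c = {(0, 0): 0.6, (1, 0): 0.4, (0, 1): 0.1, (1, 1): 0.9}
P = {(l, s, c): Pl_c[(l, c)] * Ps_c[(s, c)] * Pc[c]
     for l, s, c in itertools.product((0, 1), repeat=3)}
assert factorizes(P)
```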

URL: https://www.sciencedirect.com/science/article/pii/B9780123704771500207

Understanding the social network

Dong Wang , ... Lance Kaplan , in Social Sensing, 2015

8.4.1 Deriving the Likelihood

The key contribution of the social-aware EM scheme lies in incorporating the role of uncertain provenance into the MLE algorithm. To compute the log likelihood, the function P(SC, z | SD, θ) needs to be computed first. The source claim graph SC can be divided into subsets, SC_j, one per claim C_j. The subset describes which sources espoused the claim and which did not. Since claims are independent, we have:

(8.5) P(SC, z | SD, θ) = ∏_{j=1}^N P(SC_j, z_j | SD, θ)

which can in turn be re-written as:

(8.6) P(SC, z | SD, θ) = ∏_{j=1}^N P(SC_j | SD, θ, z_j) P(z_j)

where P(SC_j | SD, θ, z_j) is the joint probability of all observations involving claim C_j. Unfortunately, in general, the sources that make these observations may not be independent, since they may be connected in the social network, leading to the possibility that one expressed the observation of another. Let p_ik = P(S_iC_j | S_kC_j) be the probability that source S_i makes claim C_j given that its parent S_k (in the social dissemination network) makes that claim. p_ik is referred to as the repeat ratio and can be approximately computed from graph SC, for pairs of nodes connected in graph SD, as follows:

(8.7) p_ik = (number of times S_i and S_k make the same claim) / (number of claims S_k makes)

Hence, the joint probability that a parent S_p and its children S_i make the same claim is given by P(S_pC_j) ∏_i P(S_iC_j | S_pC_j), which is P(S_pC_j) ∏_i p_ip. This probability accounts for the odds of one source repeating claims by another. Note that the model assumes that a child will not independently make the same claim as a parent. For illustration, let us now consider the special case of social network topology SD, where the network is given by a forest of two-level trees.* Hence, when considering claim C_j, sources can be divided into a set M_j of independent subgraphs, where a link exists in subgraph g ∈ M_j between a parent and child only if they are connected in the social network and the parent claimed C_j. The link implies source dependency as far as the claim in question is concerned. The intuition is that if the parent does not make the claim, then the children act as if they are independent sources. If the parent makes the claim, then each child repeats it with a given repeat probability. The assumed repeat probability determines the degree to which the algorithm accounts for redundant claims from dependent sources. The higher it is, the less credence is given to the dependent source. Two scenarios are illustrated by the two simple examples in Figure 8.1, showing the situations where source S_1, who has children S_2, S_3, and S_4, makes claim C_1 and where it does not, respectively. Note the differences in the computed probabilities of its children making claim C_1. In general, let S_g denote the parent of subgraph g and c_g denote the set of its children, if any. Equation (8.6) can then be rewritten as follows:

Figure 8.1. Simple illustrative examples for proof.

(8.8) P(SC, z | SD, θ) = ∏_{j=1}^N P(z_j) × ∏_{g∈M_j} P(S_gC_j | θ, z_j) ∏_{i∈c_g} P(S_iC_j | S_gC_j)

where

(8.9)
P(z_j) = d if z_j = 1; (1 − d) if z_j = 0.
P(S_gC_j | θ, z_j) = a_g if z_j = 1, S_gC_j = 1; (1 − a_g) if z_j = 1, S_gC_j = 0; b_g if z_j = 0, S_gC_j = 1; (1 − b_g) if z_j = 0, S_gC_j = 0.
P(S_iC_j | S_gC_j) = p_ig if S_gC_j = 1, S_iC_j = 1; (1 − p_ig) if S_gC_j = 1, S_iC_j = 0; a_i if S_gC_j = 0, S_iC_j = 1, z_j = 1; (1 − a_i) if S_gC_j = 0, S_iC_j = 0, z_j = 1; b_i if S_gC_j = 0, S_iC_j = 1, z_j = 0; (1 − b_i) if S_gC_j = 0, S_iC_j = 0, z_j = 0.
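The repeat ratio of Eq. (8.7) is simple to compute; the source-claim graph below is a hypothetical two-source example, not data from the chapter:

```python
# Hypothetical source-claim graph SC: SC[source][claim] = 1 if the source
# makes the claim, 0 otherwise.
SC = {
    "S1": {"C1": 1, "C2": 1, "C3": 0},   # parent in SD
    "S2": {"C1": 1, "C2": 0, "C3": 0},   # child of S1 in SD
}

def repeat_ratio(child, parent):
    """Eq. (8.7): fraction of the parent's claims that the child repeats."""
    same = sum(SC[child][c] and SC[parent][c] for c in SC[parent])
    return same / sum(SC[parent].values())

assert repeat_ratio("S2", "S1") == 0.5   # S2 repeats 1 of S1's 2 claims
```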

URL: https://www.sciencedirect.com/science/article/pii/B978012800867600008X

Picturing Bayesian Classifiers

Giorgio Maria Di Nunzio , Alessandro Sordoni , in Data Mining Applications with R, 2014

2.4.1.3 Poisson Model

In the Poisson model, an object is generated by a multivariate Poisson random variable. Each object o_j is represented as an m-dimensional vector of frequencies, o_j ≜ (N_{1,j}, …, N_{m,j}), and each feature count is governed by a Poisson random variable:

(2.10) N_{k,j} ~ Pois(θ_{f_k|c})

Using the NB conditional independence assumption, we can write the probability of the object as:

(2.11) P(o_j | c_i; θ) ∝ ∏_{k=1}^m θ_{f_k|c_i}^{N_{k,j}} e^{−θ_{f_k|c_i}},

and, by taking the logs we obtain:

(2.12) log P(o_j | c_i; θ) ∝ ∑_{k=1}^m [N_{k,j} log θ_{f_k|c_i} − θ_{f_k|c_i}]

With these concepts in hand, we can easily apply Equation (2.2) to calculate the posterior distribution P(c_i | o_j; θ) and classify the unlabeled object o_j. The generic function nbClassify allows us to call the correct classification function according to the class of the object x that is passed as an argument. An example of the implementation of the classification function using the Bernoulli model (Equation 2.6) is reported in Listing 2.3.

The method returns the joint probabilities P(o_j, c_i; θ̂) for the two classes, c_i and c̄_i, for all the objects' vectors contained in the dataset matrix.

Listing 2.3

Classification Function for the Bernoulli Model (File nb.R)

nbClassify.bernoulli <- function(x, dataset) {
  dataset <- prepareDataset(x, dataset)
  # precompute the sum over features of log(1 - param)
  sumnegative <- colSums(log(1 - x$features.params))
  # compute p(d, c)
  logfeatures <- log(x$features.params)
  lognfeatures <- log(1 - x$features.params)
  scores <- dataset %*% (logfeatures - lognfeatures)
  scores <- t(t(scores) + sumnegative + log(x$classes.params))
  return(scores)
}
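The Poisson score of Eq. (2.12) can be sketched in a few lines; the rates and counts below are hypothetical, and this Python sketch is an illustration, not the book's R implementation:

```python
import math

# Hypothetical class-conditional Poisson rates theta_{f_k | c_i} and a
# feature-count vector N_{k,j} for one object o_j.
theta = {"c1": [0.5, 2.0, 1.0], "c2": [1.5, 0.5, 0.2]}
counts = [1, 3, 0]

def log_score(cls):
    """Eq. (2.12): sum over features of N * log(theta) - theta."""
    return sum(n * math.log(t) - t for n, t in zip(counts, theta[cls]))

best = max(theta, key=log_score)
assert best == "c1"   # the higher-rate class for the large second count wins
```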

URL: https://www.sciencedirect.com/science/article/pii/B9780124115118000025

Learning Bayesian Networks

Richard E. Neapolitan , Xia Jiang , in Probabilistic Methods for Financial and Marketing Informatics, 2007

4.4.2 Learning a DAG in Which P Is Embedded Faithfully*

In Example 4.26 in a sense we compromised because the DAG we learned did not entail all the conditional independencies in P. This is fine if our goal is to learn a Bayesian network which will later be used to do inference. However, another application of structure learning is causal learning, which is discussed in the next subsection. When learning causes it would be better to find a DAG in which P is embedded faithfully. We discuss embedded faithfulness next.

Definition 4.2

Suppose we have a joint probability distribution P of the random variables in some set V and a DAG G = (W, E) such that V ⊆ W. We say that (G, P) satisfy the embedded faithfulness condition if all and only the conditional independencies in P are entailed by G, restricted to variables in V. Furthermore, we say that P is embedded faithfully in G.

Example 4.27

Again suppose V = {X,Y, Z,W}, and the set of conditional independencies in P is

{ I P ( X , { Y , W } ) , I P ( Y , { X , Z } ) } .

Then P is embedded faithfully in the DAG in Figure 4.17 . It is left as an exercise to show this. By including the variable H in the DAG, we are able to entail all and only the conditional independencies in P restricted to variables in V.

Figure 4.17. If the set of conditional independencies in P is {IP (X,{Y, W}), IP (Y, {X, Z})}, then P is embedded faithfully in this DAG.

Variables such as H are called hidden variables because they are not among the observed variables. By including them in the DAG, we can achieve faithfulness.

URL: https://www.sciencedirect.com/science/article/pii/B9780123704771500219

Computational Analysis and Understanding of Natural Languages: Principles, Methods and Applications

China Venkaiah Vadlamudi , Sesha Phani Deepika Vadlamudi , in Handbook of Statistics, 2018

4.4 Mutual Information

Just as previously, using the joint probabilities we can define

i(E ∩ F) = −log(p(E ∩ F)) = −log(p(E/F) · p(F)) = −log(p(E/F)) − log(p(F)) = i(E/F) + i(F)

Similarly, we have

i(F ∩ E) = i(F/E) + i(E)

If E and F are independent, since p(E ∩ F) = p(E) · p(F), we have

i(E ∩ F) = i(E) + i(F) = i(F ∩ E)

If the events are not independent, then the occurrence of an event F reduces the uncertainty of event E; that is, event F provides information about E. The amount of information provided by F about E, denoted by i_{F→E}, is i(E) − i(E/F). Substituting for i(E/F), we have i_{F→E} = i(E) − i(E/F) = i(E) − i(E ∩ F) + i(F) = i(E) + i(F) − i(E ∩ F). The quantity i_{F→E}, which is also equal to i_{E→F}, is called the mutual information, and it is denoted by i(E; F).

Example 48

Mutual information of the events given in Example 47 can be seen to be 1 bit.
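The symmetric form i(E; F) = i(E) + i(F) − i(E ∩ F) is easy to evaluate; the probabilities below are a hypothetical example (not the events of Example 47) in which the two events always co-occur:

```python
import math

# Hypothetical events: p(E) = p(F) = 1/2 and E, F always co-occur,
# so observing F fully determines E.
pE, pF, pEF = 0.5, 0.5, 0.5

def i(p):
    """Self-information in bits: i(A) = -log2 p(A)."""
    return -math.log2(p)

mutual = i(pE) + i(pF) - i(pEF)
assert mutual == 1.0   # 1 bit
```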

URL: https://www.sciencedirect.com/science/article/pii/S0169716118300233

Reasoning with Uncertain Information

Nils J. Nilsson , in Artificial Intelligence: A New Synthesis, 1998

19.2.1 A General Method

The general setting for probabilistic inference is that we have a set, V, of propositional variables V1, …, Vk, and we are given, as evidence, that the variables in a subset, ɛ, of V have certain definite values, ɛ = e (each value True or False). In agent applications, the "given" variables would typically have values determined by perceptual processes. We desire to calculate the conditional probability, p(Vi = vi |ɛ = e), that some variable, Vi, has value vi, given the evidence. We call this process probabilistic inference.

Since Vi has value True or False, there are two conditional probabilities in which we might be interested, namely, p(Vi = True|ɛ = e) and p(Vi = False|ɛ = e). Of course, we need only calculate one of these because p(Vi = True|ɛ = e) + p(Vi = False|ɛ = e) = 1, regardless of the value of ɛ. I illustrate by describing a "brute-force" method for calculating p(Vi = True|ɛ = e). Using the definition for conditional probability, we have

p(Vi = True | ɛ = e) = p(Vi = True, ɛ = e) / p(ɛ = e)

p(Vi = True, ɛ = e) is obtained by using our rule for calculating lower order joint probabilities from given higher order ones:

p(Vi = True, ɛ = e) = ∑_{Vi = True, ɛ = e} p(V1, …, Vk)

where the Vi, i = 1, …, k constitute our collection of propositional variables. That is, we sum over all values of the joint probability for which Vi = True and for which the evidence variables have their given values. The calculation of p(ɛ = e) can be done in a similar manner, although as my next example illustrates, it need not be explicitly calculated.

As an example, suppose we have joint probabilities given by

p(P, Q, R) = 0.3

p(P, Q, ¬R) = 0.2

p(P, ¬Q, R) = 0.2

p(P, ¬Q, ¬R) = 0.1

p(¬P, Q, R) = 0.05

p(¬P, Q, ¬R) = 0.1

p(¬P, ¬Q, R) = 0.05

p(¬P, ¬Q, ¬R) = 0.0

We are given ¬R as evidence and wish to calculate p(Q|¬R). Using the procedure just given, we calculate

p(Q | ¬R) = p(Q, ¬R) / p(¬R) = [p(P, Q, ¬R) + p(¬P, Q, ¬R)] / p(¬R) = (0.2 + 0.1) / p(¬R) = 0.3 / p(¬R)

Now we can either calculate the marginal p(¬R) directly or (as is usually done) calculate p(Q|¬R) by the same method just used, avoiding the calculation of p(¬R) by taking advantage of the fact that p(Q|¬R) + p(¬Q|¬R) = 1. I proceed with the latter method:

p(¬Q | ¬R) = p(¬Q, ¬R) / p(¬R) = [p(P, ¬Q, ¬R) + p(¬P, ¬Q, ¬R)] / p(¬R) = (0.1 + 0.0) / p(¬R) = 0.1 / p(¬R)

Since these two quantities must sum to one, we have that p(Q|¬R) = 0.75.
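The brute-force method above translates directly to code, using the joint table from this example:

```python
# The worked example as code: brute-force conditioning on the joint table.
joint = {  # keys are truth values of (P, Q, R)
    (True, True, True): 0.3,    (True, True, False): 0.2,
    (True, False, True): 0.2,   (True, False, False): 0.1,
    (False, True, True): 0.05,  (False, True, False): 0.1,
    (False, False, True): 0.05, (False, False, False): 0.0,
}

p_q_notr = sum(v for (_, q, r), v in joint.items() if q and not r)   # p(Q, ¬R)
p_notr = sum(v for (_, _, r), v in joint.items() if not r)           # p(¬R)
print(round(p_q_notr / p_notr, 2))   # → 0.75
```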

In general, probabilistic inference using this method is intractable because, to perform it in cases in which we have k variables, we need an explicit list of all of the 2^k values of the joint probability, p(V1, V2, …, Vk). For many problems of interest, we couldn't write down such a list even if we knew it (which we generally do not).

In view of this intractability, we might ask, "how do humans reason so effectively with uncertain information?" Pearl [Pearl 1986, Pearl 1988, Pearl 1990] surmised that we do it by formulating our knowledge of a domain in a special manner that greatly simplifies the computation of the conditional probabilities of certain variables given evidence about them. These efficient knowledge formulations involve what are called conditional independencies among various of the variables—a subject to which I now turn.

URL: https://www.sciencedirect.com/science/article/pii/B9780080499451500265

Designing of Latent Dirichlet Allocation Based Prediction Model to Detect Midlife Crisis of Losing Jobs due to Prolonged Lockdown for COVID-19

Basabdatta Das , ... Abhijit Das , in Cyber-Physical Systems, 2022

13.3.3.2 Categorization in Bayesian model

In Bayesian probability theory, one of the joint probability events is the hypothesis, and the other event is the data. We wish to examine the truth of the data, given the hypothesis. In our experiment, we have a dataset D. To examine new data, we shall observe some part of the data, while we have to assume some other part of the data. Therefore we are to find a θ so that

(13.3) h θ observed ( w i ) assumed ( w i ) .

Next, we find the likelihood of training dataset assuming that the training cases are all independent of each other.

The continuous generative model of Bayesian probability leads us to determine t such that t ~ N(t | μ_extracted_wi, σ²_extracted_wi), where θ = (μ_depressive_wij, μ_nondepressive_wij, σ_depressive_wij, …). The measurement of the probability that a word w_ij is depressive is

(13.4) P(depressive | t) = P(t | depressive) P(depressive) / [P(t | depressive) P(depressive) + P(t | ¬depressive) P(¬depressive)]

θ is derived by applying maximum likelihood.
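Eq. (13.4) can be sketched under the stated Gaussian model; the means, variance, and prior below are hypothetical values, not parameters from the chapter:

```python
import math

def normal_pdf(t, mu, sigma):
    """Density of N(mu, sigma^2) at t."""
    return math.exp(-((t - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def p_depressive(t, mu_dep, mu_non, sigma, prior=0.5):
    """Eq. (13.4): two-class posterior for the 'depressive' class."""
    num = normal_pdf(t, mu_dep, sigma) * prior
    den = num + normal_pdf(t, mu_non, sigma) * (1 - prior)
    return num / den

# A score near the "depressive" mean yields a posterior above one half.
assert p_depressive(0.9, mu_dep=1.0, mu_non=0.0, sigma=0.5) > 0.5
```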

URL: https://www.sciencedirect.com/science/article/pii/B9780128245576000030