BAYESIAN PARAMETER ESTIMATION: General theory and examples

ksyoon 2020. 4. 3. 21:15

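The Bayesian approach has been used to obtain the desired density $ p(x| \mathcal{D})$ in a special case, the multivariate Gaussian. The approach can be generalized to any situation in which the unknown density can be parameterized, under the following assumptions.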

$ \bullet \; $ The form of $ p(x| \theta) $ is assumed to be known, but the value of $ \theta$ is not known exactly.

$ \bullet \;$ Our initial knowledge about $ \theta$ is assumed to be contained in a known prior density $ p( \theta)$.

$ \bullet \;$ The rest of our knowledge about $\theta$ is contained in a set $\mathcal{D}$ of $n$ samples $x_1,x_2,...,x_n$ drawn independently according to the unknown probability density $p(x)$.

The basic problem is to compute the posterior density $ p(\theta| \mathcal{D})$, because $ p(x| \mathcal{D})$ is then computed from it:

$$ p(x| \mathcal{D}) = \int p( x| \theta)p(\theta| \mathcal{D})d \theta $$

By Bayes' formula,

$$ p(\theta | \mathcal{D})= \cfrac{p(\mathcal{D}|\theta)p(\theta)}{\int p(\mathcal{D}|\theta)p(\theta) d \theta} $$

and, under the independence assumption,

$$ p(\mathcal{D}|\theta)= \prod_{k=1}^n p(x_k | \theta) $$
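The three formulas above can be evaluated numerically on a grid of $ \theta $ values. The following is only a minimal sketch: the Gaussian likelihood with unknown mean, the flat prior on $[-5, 5]$, the sample values, and all function names are illustrative assumptions that do not come from the text, and the integrals are replaced by simple Riemann sums.

```python
import math

def gaussian_likelihood(x, theta, sigma=1.0):
    """p(x | theta) for a Gaussian with unknown mean theta (illustrative assumption)."""
    z = (x - theta) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def flat_prior(theta, low=-5.0, high=5.0):
    """p(theta): uniform on [low, high] (illustrative assumption)."""
    return 1.0 / (high - low) if low <= theta <= high else 0.0

def posterior_on_grid(data, grid):
    """Approximate p(theta | D) on an evenly spaced grid of theta values."""
    step = grid[1] - grid[0]
    # Numerator of Bayes' formula: p(D | theta) p(theta), with p(D | theta) a product.
    unnorm = [flat_prior(t) * math.prod(gaussian_likelihood(x, t) for x in data)
              for t in grid]
    evidence = sum(unnorm) * step  # Riemann sum standing in for the denominator integral
    return [u / evidence for u in unnorm]

def predictive(x, posterior, grid):
    """Approximate p(x | D) = integral of p(x | theta) p(theta | D) d theta."""
    step = grid[1] - grid[0]
    return step * sum(gaussian_likelihood(x, t) * p for t, p in zip(grid, posterior))

grid = [-5.0 + 0.01 * i for i in range(1001)]
data = [0.8, 1.2, 1.0]  # made-up samples for illustration
posterior = posterior_on_grid(data, grid)
mode = grid[posterior.index(max(posterior))]
print("posterior mode:", round(mode, 2))
print("p(x=1.0 | D):", round(predictive(1.0, posterior, grid), 4))
```

A grid is only practical for low-dimensional $ \theta $; it is used here because it mirrors the formulas term by term.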

Solving these equations raises two questions: how difficult the computation is, and whether $ p(x|\mathcal{D})$ converges to $ p(x)$. We discuss convergence briefly first and then turn to the computational question. To indicate explicitly the number of samples in a set for a single category, we write $ \mathcal{D}^n = \{x_1,....,x_n\}$. Then, for $ n > 1 $, we obtain

$$ p(\mathcal{D}^n|\theta)= p(x_n |\theta) p(\mathcal{D}^{n-1} | \theta) $$

Substituting this expression into Bayes' formula yields a recursive relation for the posterior.

$$ p(\theta|\mathcal{D}^n)= \cfrac{p(x_n| \theta)p(\theta|\mathcal{D}^{n-1})}{\int p(x_n | \theta) p(\theta | \mathcal{D}^{n-1} )d\theta} $$

Knowing that $ p(\theta | \mathcal{D}^0)= p(\theta) $, repeated use of this equation produces the sequence of densities $ p(\theta), p(\theta | x_1), p(\theta| x_1, x_2) $, and so on. Note that $ p(\theta | \mathcal{D}^n) $ depends on the samples in $\mathcal{D}^n $ but not on the order in which they arrive. This is called the $ recursive \; Bayes \; approach$.
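The recursion and its independence from sample order can be checked with a small grid-based sketch. The Gaussian likelihood, the grid range, the sample values, and the helper names `recursive_update` and `run` are assumptions made for illustration only.

```python
import math

def gaussian_likelihood(x, theta, sigma=1.0):
    """p(x | theta) for a Gaussian with unknown mean theta (illustrative assumption)."""
    z = (x - theta) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def recursive_update(current, x, grid):
    """One step: p(theta | D^n) from p(theta | D^(n-1)) and the new sample x."""
    step = grid[1] - grid[0]
    unnorm = [gaussian_likelihood(x, t) * p for t, p in zip(grid, current)]
    norm = sum(unnorm) * step  # Riemann sum for the denominator integral
    return [u / norm for u in unnorm]

def run(data, grid):
    """Start from the flat prior p(theta | D^0) = p(theta) and absorb samples one by one."""
    width = grid[-1] - grid[0]
    posterior = [1.0 / width] * len(grid)
    for x in data:
        posterior = recursive_update(posterior, x, grid)
    return posterior

grid = [-5.0 + 0.01 * i for i in range(1001)]
forward = run([0.8, 1.2, 1.0], grid)
shuffled = run([1.0, 0.8, 1.2], grid)

# Largest pointwise difference between the two final posteriors.
max_gap = max(abs(a - b) for a, b in zip(forward, shuffled))
print("max difference between orderings:", max_gap)
```

Any difference between the two orderings should be at the level of floating-point rounding, since the final posterior is the same product of likelihood terms in both cases.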

Example)

Suppose we have one-dimensional samples drawn from a uniform distribution.

$$ p(x|\theta) \sim U(0,\theta) = \begin{cases} 1/\theta \quad 0\leq x \leq \theta\\ 0 \quad otherwise,\end{cases}$$

Initially we know only that the parameter is bounded; specifically, assume $ 0\leq \theta \leq 10 $. (This is called a noninformative or "flat" prior; it is discussed in Section 3.5.2.) We use the recursive Bayes method to estimate $ \theta$ and the underlying distribution from the data $ \mathcal{D}=\{4,7,2,8\} $. Before any data arrive, $ p(\theta | \mathcal{D}^0 ) = p(\theta) = U(0,10) $. When the first data point $ x_1=4 $ arrives, we obtain an improved estimate.

$$ p(\theta | \mathcal{D}^1 ) \propto p(x|\theta)p( \theta | \mathcal{D}^0) = \begin{cases} 1/\theta \quad for \; 4\leq \theta \leq 10\\ 0 \quad otherwise,\end{cases}$$

Note that $ x \leq \theta $ gives $ 4 \leq \theta $, and the prior gives $ 0 \leq \theta \leq 10 $; combining these conditions restricts $ \theta $ to $ 4 \leq \theta \leq 10 $.

Normalization is ignored throughout. When the next data point $ x_2=7 $ arrives,

$$ p(\theta | \mathcal{D}^2 ) \propto p(x|\theta)p( \theta | \mathcal{D}^1) = \begin{cases} 1/\theta^2 \quad for \; 7\leq \theta \leq 10\\ 0 \quad otherwise,\end{cases}$$

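The remaining sample points are handled in the same way. Each iteration multiplies in another factor of $ 1/\theta $ from $ p(x | \theta) $, and the distribution is nonzero only for values of $ \theta $ at or above the largest data point sampled so far. The general form of the solution is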

$$ p(\theta | \mathcal{D}^n ) \propto 1/\theta^n \quad for \; max[\mathcal{D}^n] \leq \theta \leq 10 $$
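Because the unnormalized posterior always has the form $ \theta^{-n} $ on an interval, the recursion for this example can be tracked with three numbers. The sketch below assumes a hypothetical `(n, lower, upper)` tuple as the state; the representation and the function name `update` are illustrative choices, not something given in the text.

```python
def update(state, x):
    """Absorb one sample x.

    state = (n, lower, upper) encodes p(theta | D^n) proportional to theta**(-n)
    on lower <= theta <= upper, and zero elsewhere.
    """
    n, lower, upper = state
    # The new factor 1/theta raises the exponent; x <= theta raises the lower bound.
    return (n + 1, max(lower, x), upper)

state = (0, 0.0, 10.0)  # flat prior U(0, 10), no data yet
history = [state]
for x in [4.0, 7.0, 2.0, 8.0]:
    state = update(state, x)
    history.append(state)
    n, lower, upper = state
    print(f"after x={x}: proportional to theta^-{n} on [{lower}, {upper}]")
```

The third sample, 2, raises the exponent but leaves the lower bound at 7, which matches the general form with $ max[\mathcal{D}^n] $.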

For the given data set, the maximum-likelihood solution is $ \hat{\theta} = 8 $, which implies a uniform $ p(x | \mathcal{D}) \sim U(0,8) $.
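For comparison with the maximum-likelihood density, the Bayesian $ p(x| \mathcal{D}) $ can be approximated by inserting the posterior $ \propto \theta^{-n} $ into the integral $ \int p(x|\theta)p(\theta|\mathcal{D})d\theta $. This is a rough numerical sketch using a midpoint rule; the helper names `integrate` and `predictive` and the step count are assumptions for illustration.

```python
def integrate(f, a, b, steps=20000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    if b <= a:
        return 0.0
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

data = [4.0, 7.0, 2.0, 8.0]
n, lower, upper = len(data), max(data), 10.0

# Normalizing constant of p(theta | D), proportional to theta**(-n) on [lower, upper].
norm = integrate(lambda t: t ** (-n), lower, upper)

def predictive(x):
    """p(x | D): integrate p(x | theta) p(theta | D) over the theta values that allow x."""
    if x < 0.0 or x > upper:
        return 0.0
    start = max(x, lower)  # p(x | theta) = 1/theta requires theta >= x
    return integrate(lambda t: (1.0 / t) * t ** (-n) / norm, start, upper)

for x in (5.0, 8.5, 9.5):
    print(x, round(predictive(x), 4))
print("ML density on [0, 8]:", 1.0 / max(data))
```

Under these assumptions the Bayesian density is flat on $[0, 8]$ at a value slightly below $ 1/8 $ and keeps a decreasing tail for $ 8 < x \leq 10 $, whereas the maximum-likelihood density is exactly zero above 8.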

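The normalized $p(\theta | \mathcal{D}^n) $ is obtained as follows, where $ L_n = max[\mathcal{D}^n] $ and $ U_n = 10 $ denote the lower and upper limits of the support.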

$$ p( \theta | \mathcal{D}^1 ) = \cfrac{ \cfrac{1}{\theta} \cfrac{1}{10} }{\int_{4}^{10} \cfrac{1}{\theta} \cfrac{1}{10} d\theta} = \cfrac{\cfrac{1}{\theta} }{ln{10 \over 4}}$$

$$ p( \theta | \mathcal{D}^2 ) = \cfrac{ \cfrac{1}{\theta} \cfrac{1}{\theta} }{\int_{L_2}^{U_2} \cfrac{1}{\theta} \cfrac{1}{\theta} d\theta} = \cfrac{ 1}{\theta^2 (L_2^{-1} - U_2^{-1})} $$

Proceeding in the same way,

$$ p( \theta | \mathcal{D}^{N+1} ) = \cfrac{ \cfrac{1}{\theta^N} \cfrac{1}{\theta} }{\int_{L_{N+1}}^{U_{N+1}} \cfrac{1}{\theta^N} \cfrac{1}{\theta} d\theta} = \cfrac{ N \theta^{-N-1}}{(L_{N+1}^{-N} - U_{N+1}^{-N})} $$

Replacing $ N+1 $ with $ N $ gives, for $ N \geq 2 $,

$$ p( \theta | \mathcal{D}^{N} ) = \cfrac{ \theta^{-N} }{\int_{L_N}^{U_N} \theta^{-N} d\theta} = \cfrac{ (N-1) \theta^{-N}}{(L_{N}^{-(N-1)} - U_{N}^{-(N-1)})} $$
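The closed forms above can be sanity-checked by confirming that each one integrates to 1 over its support. The sketch below takes the lower limit as the largest sample seen so far and the upper limit as 10, and treats the first sample separately with the logarithmic normalizer; the function names and the midpoint-rule integrator are illustrative assumptions.

```python
import math

def posterior_pdf(theta, n, lower, upper=10.0):
    """Normalized p(theta | D^n) for the U(0, theta) model with a flat prior on [0, upper]."""
    if not (lower <= theta <= upper):
        return 0.0
    if n == 1:
        # Case of a single sample: the normalizer is ln(upper / lower).
        return (1.0 / theta) / math.log(upper / lower)
    k = n - 1
    return k * theta ** (-n) / (lower ** (-k) - upper ** (-k))

def integrate(f, a, b, steps=20000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

data = [4.0, 7.0, 2.0, 8.0]
areas = []
for n in range(1, len(data) + 1):
    lower = max(data[:n])  # lower limit: the largest sample seen so far
    area = integrate(lambda t: posterior_pdf(t, n, lower), lower, 10.0)
    areas.append(area)
    print(f"n={n}, support=[{lower}, 10.0], area={area:.6f}")
```

Each printed area should be close to 1 if the normalizing constants are right.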

Evaluating $ p( \theta | \mathcal{D}^{n} ) $ after each data point gives the following.

As a side note,

let $ f(x| \theta)= \cfrac{1}{\theta} $ for $ 0 \leq x \leq \theta $, and 0 otherwise.

With the samples ordered as $ x_{(1)} \leq x_{(2)} \leq ... \leq x_{(n)}$, the likelihood function is

$$ L(\theta | x) = \prod_{k=1}^n \cfrac{1}{\theta} = \cfrac{1}{\theta^{n}}=\theta^{-n} \; (*) $$

for $ 0 \leq x_{(1)} $ and $ \theta \geq x_{(n)} $, and 0 otherwise.

Taking the logarithm and differentiating gives

$$ \cfrac{d \ln L(\theta | x)}{d \theta} = -\cfrac{n}{\theta} < 0$$

Therefore $L(\theta | x) = \theta^{-n} $ is strictly decreasing for $ \theta \geq x_{(n)} $. Combining this with (*), which is zero for $ \theta < x_{(n)} $, shows that $L(\theta | x) $ is maximized at $ \theta = x_{(n)}$.

Therefore the maximum likelihood estimator for $ \theta $ is

$$ \hat{\theta} = x_{(n)} $$
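This result can be confirmed by brute force: evaluate the likelihood on a grid of $ \theta $ values and pick the largest. The grid spacing of 0.01 and the function name `likelihood` are illustrative assumptions; the data are the four samples from the example.

```python
def likelihood(theta, data):
    """L(theta | x) = theta**(-n) when every sample lies in [0, theta], else 0."""
    if theta <= 0.0 or min(data) < 0.0 or theta < max(data):
        return 0.0
    return theta ** (-len(data))

data = [4.0, 7.0, 2.0, 8.0]
grid = [i / 100 for i in range(1, 1001)]  # candidate theta values 0.01, 0.02, ..., 10.00

# The likelihood is zero below max(data) and strictly decreasing above it,
# so the grid maximizer should sit exactly at the largest sample.
theta_hat = max(grid, key=lambda t: likelihood(t, data))
print("grid maximizer:", theta_hat, "largest sample:", max(data))
```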