Machine Learning

Stanford Univ, Coursera

Normal Equation

Everything below this point is unedited.

Hypothesis: $h_{\theta}(x) = {\theta}_0 + {\theta}_1 x_1 + {\theta}_2 x_2 + {\theta}_3 x_3 + {\theta}_4 x_4 $

$n = $ the number of features
$m = $ the number of training examples
$x^{(i)}_j =$ value of feature $j$ in the $i$th training example
$x^{(i)} =$ the features of the $i$th training example
$\displaystyle \boldsymbol{x} = \begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{pmatrix} \quad\quad \text{where} \quad x_0 = 1 $
$\displaystyle \boldsymbol{{\theta}} = \begin{pmatrix} {\theta}_0 \\ {\theta}_1 \\ \vdots \\ {\theta}_n \end{pmatrix} $
Hypothesis:
$ \begin{eqnarray} h_{\theta}(x) &=& {\theta}_0 + {\theta}_1 x_1 + {\theta}_2 x_2 + {\theta}_3 x_3 + \cdots + {\theta}_n x_n \\ &=& \begin{pmatrix} {\theta}_0 & {\theta}_1 & \cdots & {\theta}_n \end{pmatrix} \begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{pmatrix} \\ &=& \boldsymbol{{\theta}}^T \boldsymbol{x} \end{eqnarray} $
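
As a minimal sketch of the vectorized form $\boldsymbol{\theta}^T \boldsymbol{x}$, the following Python computes the hypothesis as a dot product after prepending $x_0 = 1$. The function name `hypothesis` and the sample numbers are assumptions made for illustration; they do not come from the course.

```python
def hypothesis(theta, features):
    """Return h_theta(x) = theta^T x for one example.

    theta    : list of n+1 parameters [theta_0, ..., theta_n]
    features : list of n feature values [x_1, ..., x_n] (without x_0)
    """
    x = [1.0] + list(features)  # prepend x_0 = 1
    if len(theta) != len(x):
        raise ValueError("theta must have one more entry than features")
    return sum(t * xi for t, xi in zip(theta, x))


# Illustrative values only: theta = (1, 2, 3), x = (4, 5)
print(hypothesis([1.0, 2.0, 3.0], [4.0, 5.0]))  # 1 + 2*4 + 3*5 = 24.0
```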

Gradient Descent

Repeat
$\displaystyle \quad {\theta}_j := {\theta}_j - \alpha \frac{1}{m} \sum_{i=1}^{m} ( h_{\theta} (x^{(i)}) - y^{(i)} ) x^{(i)}_j $
where ${\theta}_j$ must be updated simultaneously for all $j=0,1,{\cdots},n$.
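
The simultaneous-update requirement can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the course's reference implementation: the function names, the default `alpha` and `iterations`, and the toy data are all hypothetical. The key point is that every error term is computed with the old $\boldsymbol{\theta}$ before any component is overwritten.

```python
def gradient_descent_step(theta, X, y, alpha):
    """Perform one simultaneous update of all theta_j.

    X is a list of m rows, each already including x_0 = 1.
    """
    m = len(X)
    # Errors h_theta(x^(i)) - y^(i), all computed with the OLD theta.
    errors = [sum(t * xj for t, xj in zip(theta, row)) - yi
              for row, yi in zip(X, y)]
    # Build the whole new vector at once so that no theta_j is computed
    # from a partially updated theta.
    return [
        theta[j] - alpha * sum(e * row[j] for e, row in zip(errors, X)) / m
        for j in range(len(theta))
    ]


def gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Repeat the update step a fixed number of times, starting from zeros."""
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        theta = gradient_descent_step(theta, X, y, alpha)
    return theta


# Toy data generated from y = 1 + 2 * x_1 (illustrative only).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]
theta = gradient_descent(X, y)
print(theta)  # approaches [1.0, 2.0]
```

Updating `theta[0]` in place and then using it to compute `theta[1]` would be a different algorithm, which is why the step returns a new list.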

Feature Scaling

If the ranges of the features $x_j$ differ drastically from one another, convergence takes a long time, so the features need to be brought to a comparable scale. $\displaystyle \begin{eqnarray} x^{(i)}_j &=& \frac{x^{(i)}_j - {\mu}_j}{s_j} \\ {\mu}_j &=& \frac{1}{m} \sum_{i=1}^{m} x^{(i)}_j \\ s_j &=& {\max}_{i=1,\cdots,m}(x^{(i)}_j) - {\min}_{i=1,\cdots,m}(x^{(i)}_j) \end{eqnarray} $
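
A minimal sketch of this scaling, using the range $\max - \min$ as $s_j$ exactly as in the formulas above. The function name `mean_normalize` and the sample data are assumptions for illustration. Returning $\mu$ and $s$ is a design choice of this sketch, so that new inputs can later be scaled with the same values.

```python
def mean_normalize(X):
    """Scale each feature column j to (x_j - mu_j) / s_j, with s_j = max - min.

    X is a list of m rows of raw feature values (no x_0 column).
    Returns (scaled rows, mu, s).
    """
    m = len(X)
    n = len(X[0])
    mu = [sum(row[j] for row in X) / m for j in range(n)]
    s = [max(row[j] for row in X) - min(row[j] for row in X) for j in range(n)]
    if any(sj == 0 for sj in s):
        raise ValueError("a constant feature has range 0 and cannot be scaled")
    scaled = [[(row[j] - mu[j]) / s[j] for j in range(n)] for row in X]
    return scaled, mu, s


# Illustrative data: two features on very different scales.
X = [[1000.0, 1.0], [2000.0, 3.0], [3000.0, 5.0]]
scaled, mu, s = mean_normalize(X)
print(mu)      # [2000.0, 3.0]
print(s)       # [2000.0, 4.0]
print(scaled)  # [[-0.5, -0.5], [0.0, 0.0], [0.5, 0.5]]
```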

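Note:
The summary of "Week 2: Gradient Descent in Practice I - Feature Scaling" writes $\displaystyle \begin{eqnarray} x_i &=& \frac{x_i - {\mu}_i}{s_i} \\ \end{eqnarray} $, but there $i$ and $j$ refer to different things than in the rest of the explanation, which I find confusing. The accompanying text does seem to explain it correctly, though.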

Choosing the Learning Rate

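Note
In the video "Gradient Descent in Practice II - Learning Rate", for the case where $\alpha$ is too large, a red line is drawn by hand showing $J(\theta)$ increasing while swinging alternately left and right; I suspect this is a mistake. A graph in which $J(\theta)$ increases while swinging left and right would have to have $\boldsymbol{x}$ on its horizontal axis.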

If $\alpha$ is too small, convergence takes a long time. If $\alpha$ is too large, $J(\theta)$ may fail to decrease from one iteration to the next and may never converge.
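
The effect of $\alpha$ can be observed by recording $J(\theta)$ after every iteration. The sketch below assumes the usual squared-error cost $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})^2$, which this section refers to but does not define. The function names, the toy data, and the three values of $\alpha$ are likewise hypothetical choices.

```python
def cost(theta, X, y):
    """J(theta) = (1 / 2m) * sum of squared prediction errors."""
    m = len(X)
    return sum((sum(t * xj for t, xj in zip(theta, row)) - yi) ** 2
               for row, yi in zip(X, y)) / (2 * m)


def cost_history(X, y, alpha, iterations):
    """Run gradient descent from zeros and record J(theta) at every step."""
    m = len(X)
    theta = [0.0] * len(X[0])
    history = [cost(theta, X, y)]
    for _ in range(iterations):
        errors = [sum(t * xj for t, xj in zip(theta, row)) - yi
                  for row, yi in zip(X, y)]
        # Simultaneous update of every theta_j.
        theta = [theta[j] - alpha * sum(e * row[j] for e, row in zip(errors, X)) / m
                 for j in range(len(theta))]
        history.append(cost(theta, X, y))
    return history


# Toy data generated from y = 1 + 2 * x_1 (illustrative only).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]
histories = {alpha: cost_history(X, y, alpha, 50) for alpha in (0.001, 0.1, 1.0)}
for alpha, h in histories.items():
    print(alpha, "first J:", h[0], "last J:", h[-1])
```

On this toy data, $\alpha = 0.001$ lowers $J(\theta)$ only slightly in 50 iterations, $\alpha = 0.1$ lowers it steadily toward zero, and $\alpha = 1.0$ makes it grow. Plotting each history against the iteration number is a practical way to pick $\alpha$.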

Features and Polynomial Regression

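Several features can sometimes be combined into a single feature. The hypothesis function does not have to be linear in the original features; a polynomial may be used instead. In that case, however, each added term ($x^2$, $\sqrt{x}$, and so on) should be divided by the range of values that term can take, so that it is on a scale comparable to the other features.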
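
Both ideas can be sketched together: combine two features into one, expand it into polynomial terms, then scale every term by its own range. The frontage and depth numbers and the function name `polynomial_features` are assumptions for illustration. Subtracting the mean before dividing follows the Feature Scaling formulas earlier in these notes rather than anything stated in this paragraph.

```python
import math


def polynomial_features(x_values):
    """Expand one raw feature x into the scaled columns (x, x^2, sqrt(x)).

    Each column is mean-normalized and divided by its own range, because
    x^2 and sqrt(x) take values on very different scales than x itself.
    """
    columns = [
        list(x_values),
        [x ** 2 for x in x_values],
        [math.sqrt(x) for x in x_values],
    ]
    scaled_columns = []
    for col in columns:
        mu = sum(col) / len(col)
        s = max(col) - min(col)  # range of the values this term takes
        scaled_columns.append([(v - mu) / s for v in col])
    # Transpose back to one row per training example.
    return [list(row) for row in zip(*scaled_columns)]


# Combine two features into one: area = frontage * depth.
frontage = [10.0, 20.0, 40.0]
depth = [10.0, 20.0, 40.0]
area = [f * d for f, d in zip(frontage, depth)]  # [100.0, 400.0, 1600.0]
rows = polynomial_features(area)
for row in rows:
    print(row)
```

After scaling, every column spans a range of exactly 1, so no single term dominates the gradient descent updates.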

Yoshihisa Nitta

http://nw.tsuda.ac.jp/