R for Data Science, Part 4: Data Transformation

May 9, 2019 / May 22, 2019

Variable types
You might also have noticed the row of three (or four) letter abbreviations under the column names. These describe the type of each variable:

int stands for integers.
dbl stands for doubles, or real numbers.
chr stands for character vectors, or strings.
dttm stands for date-times (a date + a time).

There are three other common types of variables that aren’t used in this dataset but you’ll encounter later in the book:

lgl stands for logical, vectors that contain only TRUE or FALSE (two-valued, i.e. Boolean).
fctr stands for factors, which R uses to represent categorical variables with fixed possible values.
date stands for dates (the date only, with no time).
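To make the abbreviations concrete, here is a small base R sketch that builds one value of each type and checks its class. The variable names (x_int and so on) are made up for illustration; they are not from the book.

```r
# One hypothetical value per type listed above
x_int  <- 1L                                   # int: integer
x_dbl  <- 1.5                                  # dbl: double (real number)
x_chr  <- "JFK"                                # chr: character string
x_lgl  <- TRUE                                 # lgl: logical
x_fct  <- factor(c("low", "high"),
                 levels = c("low", "high"))    # fctr: categorical, fixed levels
x_date <- as.Date("2013-01-01")                # date: date only
x_dttm <- as.POSIXct("2013-01-01 05:15:00",
                     tz = "UTC")               # dttm: date + time

# class() reports how R treats each value
class(x_int)
class(x_dbl)
class(x_fct)
class(x_date)
class(x_dttm)
```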

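dplyr functions
The key functions for data transformation. They are all used in the same way.

First argument: a data frame
Subsequent arguments: what to do with it
Result: a new data frame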

Pick observations by their values (filter()).
Reorder the rows (arrange()).
Pick variables by their names (select()).
Create new variables with functions of existing variables (mutate()).
Collapse many values down to a single summary (summarise()).

All of these can be used together with group_by(), which changes the scope of each function from the whole dataset to one group at a time.
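A minimal sketch of how group_by() changes the scope of a verb, assuming dplyr is installed. The toy table and its column values are invented for illustration and are not the flights data.

```r
library(dplyr)

# Hypothetical toy data: three flights from two carriers
toy <- tibble(
  carrier   = c("AA", "AA", "UA"),
  dep_delay = c(10, 30, 5)
)

# Ungrouped: summarise() collapses the whole table to one row
overall <- summarise(toy, mean_delay = mean(dep_delay))

# Grouped: the same call now returns one row per carrier
by_carrier <- summarise(group_by(toy, carrier), mean_delay = mean(dep_delay))

overall
by_carrier
```

In both calls the first argument is a data frame and the result is a new data frame; toy itself is not modified.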

filter()

filter(flights, month == 1, day == 25)              # prints the result
jan25 <- filter(flights, month == 1, day == 25)     # assigns without printing
(jan25 <- filter(flights, month == 1, day == 25))   # assigns and prints the result

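Comparisons

Inequality: >, >=, <, <=
Not equal: !=
Equal: ==

x %in% y: TRUE where x is one of the values in y

= is assignment (or names a function argument), not a comparison. Use == to test equality.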
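A short base R sketch of %in% next to the equivalent chain of == tests. The month vector is a made-up example.

```r
# Hypothetical month values
month <- c(1, 11, 12, 6)

# Two ways to ask "is the month November or December?"
with_or <- month == 11 | month == 12
with_in <- month %in% c(11, 12)

with_or
with_in

# Both give the same logical vector
identical(with_or, with_in)
```

The %in% form stays readable as the list of allowed values grows.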

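The floating point problem
Computers use finite-precision arithmetic, so numbers with infinitely many digits are stored only approximately, and identities that hold in exact mathematics can fail. For such comparisons, use near() instead of ==.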

There’s another common problem you might encounter when using ==: floating point numbers. These results might surprise you!

sqrt(2) ^ 2 == 2
#> [1] FALSE
1 / 49 * 49 == 1
#> [1] FALSE

Computers use finite precision arithmetic (they obviously can’t store an infinite number of digits!) so remember that every number you see is an approximation. Instead of relying on ==, use near():

near(sqrt(2) ^ 2,  2)
#> [1] TRUE
near(1 / 49 * 49, 1)
#> [1] TRUE
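The idea behind near() is to compare within a small tolerance instead of demanding exact equality. Below is a base R sketch of that idea; approx_equal is a hypothetical helper written for this note, not dplyr's own code, and the default tolerance is an assumption.

```r
# Hypothetical helper: TRUE when x and y differ by less than tol
approx_equal <- function(x, y, tol = sqrt(.Machine$double.eps)) {
  abs(x - y) < tol
}

approx_equal(sqrt(2)^2, 2)      # tiny rounding error is tolerated
approx_equal(1 / 49 * 49, 1)
approx_equal(1, 1.1)            # a real difference is still detected

# Base R's all.equal() follows the same principle
isTRUE(all.equal(sqrt(2)^2, 2))
```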

Logical operators
, and (between the conditions passed to filter())
& and
| or
! not

De Morgan's law
Sometimes you can simplify complicated subsetting by remembering De Morgan’s law: !(x & y) is the same as !x | !y, and !(x | y) is the same as !x & !y
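Both identities can be checked in base R by trying every combination of TRUE and FALSE. The vectors x and y below are just the four possible pairs.

```r
# All four combinations of two logical values
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)

# !(x & y) is the same as !x | !y
identical(!(x & y), !x | !y)

# !(x | y) is the same as !x & !y
identical(!(x | y), !x & !y)
```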

Missing values: NA (not available)

# Let x be Mary's age. We don't know how old she is.
x <- NA

# Let y be John's age. We don't know how old he is.
y <- NA

# Are John and Mary the same age?
x == y
#> [1] NA
# We don't know!

To keep missing values, you have to ask for them explicitly.
> df <- tibble(x = c(1, NA, 3))
> filter(df, x >= 1)
# A tibble: 2 x 1
      x
  <dbl>
1     1
2     3
> df <- tibble(x = c(1, NA, 3))
> filter(df, is.na(x) | x >= 1)
# A tibble: 3 x 1
      x
  <dbl>
1     1
2    NA
3     3
> df <- tibble(x = c(1, NA, 3))
> filter(df, x == NA | x >= 1)
# A tibble: 2 x 1
      x
  <dbl>
1     1
2     3
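→ At first I could not see what the computer was thinking here: the NA row is dropped even though the condition mentions NA.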
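Evaluating the two conditions by hand in base R shows where the NA row goes. This sketch uses a plain vector instead of a tibble, and which() stands in for filter()'s rule of keeping only the rows where the condition is TRUE.

```r
x <- c(1, NA, 3)

# x == NA is NA for every element, so only x >= 1 can make the result TRUE
cond_eq_na <- x == NA | x >= 1
cond_eq_na                      # the middle element is NA | NA, which is NA

# is.na(x) is TRUE for the missing element, and TRUE | NA is TRUE
cond_is_na <- is.na(x) | x >= 1
cond_is_na

# Keep only positions where the condition is TRUE (NA is dropped)
x[which(cond_eq_na)]
x[which(cond_is_na)]
```

So x == NA never selects anything; is.na(x) is the test that finds missing values.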

Reference: http://cse.naro.affrc.go.jp/takezawa/r-tips/r/18.html

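Checking whether a value is NULL, NA, NaN, or Inf

NULL (nothing), NA (missing value), NaN (not a number), and Inf (infinity) are special values. For NA and NaN in particular, operations usually return the value unchanged: arithmetic on NA gives NA, and arithmetic on NaN gives NaN. Comparing against NA or NaN with == also gives NA, so == cannot be used to test for them.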

 
 x <- c(1.0, NA, 3.0, 4.0)  # try to find which element is NA...
 x == NA                    # ...but any comparison with NA yields NA

[1] NA NA NA NA

A function is provided to test for each of these four values.

Function	is.null()	is.na()	is.nan()	is.finite()	is.infinite()	complete.cases()
Tests for	NULL	NA	NaN	finite	infinite	missing values (TRUE if none)
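A quick base R run of the functions in the table. The vector and the small data frame are made-up examples.

```r
# Hypothetical vector containing one of each special numeric value
x <- c(1, NA, NaN, Inf)

is.na(x)        # TRUE for NA and also for NaN
is.nan(x)       # TRUE only for NaN
is.finite(x)    # TRUE only for the ordinary number
is.infinite(x)  # TRUE only for Inf
is.null(NULL)   # NULL is tested as a whole object, not element by element

# complete.cases() works row by row: TRUE when a row has no missing value
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
complete.cases(df)
```

Note that is.na() treats NaN as missing too, so use is.nan() when the two need to be told apart.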