Latihan Analisis Regresi OLS

Author

Deri Siswara

Eksplorasi Data

Import Data

datareg <- read.csv('data/wagedata.csv')
head(datareg)
    wage education experience ethnicity area_type    region parttime
1 498.58        14         15      cauc     urban     south       no
2 205.76         9         47      cauc     urban     south      yes
3 490.39        13         14      cauc     urban      west       no
4 237.42        13          4      cauc     urban      west      yes
5 759.73        12         44      cauc     urban northeast       no
6 902.18        15         36      cauc     urban northeast       no

Deskripsi Data

Case study pelatihan analisis regresi OLS dengan RStudio menggunakan data CPS1988 yang dikumpulkan dalam Survei Penduduk (Current Population Survey, CPS) pada bulan Maret 1988 oleh Biro Sensus AS dan dianalisis oleh Bierens dan Ginther (2001).

Data ini merupakan data cross-section dari pria berusia 18 hingga 70 tahun dengan pendapatan tahunan positif lebih dari US$ 50 pada tahun 1992, yang bekerja untuk perusahaan atau organisasi dan menerima gaji atau upah sebagai karyawan.

Salah satu masalah dengan data CPS adalah bahwa data ini tidak menyediakan pengalaman kerja yang sebenarnya. Oleh karena itu, biasanya pengalaman kerja dihitung sebagai usia - pendidikan - 6 (seperti yang dilakukan oleh Bierens dan Ginther, 2001), yang dapat dianggap sebagai pengalaman potensial. Akibatnya, beberapa responden memiliki pengalaman negatif.

Keterangan Data

Variabel Jenis Deskripsi
wage num Upah (dalam dolar per minggu).
education int Jumlah tahun pendidikan.
experience int Jumlah tahun pengalaman kerja potensial.
ethnicity factor Suku. Faktor dengan level “cauc” (Kaukasia) dan “afam” (Afrika-Amerika).
area_type factor Tinggal di area urban (perkotaan) atau rural (pedesaan).
region factor Wilayah bekerja. Faktor dengan level “northeast” (Timur Laut), “midwest” (Midwest), “south” (Selatan), dan “west” (Barat).
parttime factor Apakah bekerja paruh waktu? (yes/no).
str(datareg)
'data.frame':   500 obs. of  7 variables:
 $ wage      : num  499 206 490 237 760 ...
 $ education : int  14 9 13 13 12 15 0 16 16 16 ...
 $ experience: int  15 47 14 4 44 36 48 27 10 28 ...
 $ ethnicity : chr  "cauc" "cauc" "cauc" "cauc" ...
 $ area_type : chr  "urban" "urban" "urban" "urban" ...
 $ region    : chr  "south" "south" "west" "west" ...
 $ parttime  : chr  "no" "yes" "no" "yes" ...
datareg$ethnicity <- as.factor(datareg$ethnicity)
datareg$area_type <- as.factor(datareg$area_type)
datareg$region <- as.factor(datareg$region)
datareg$parttime <- as.factor(datareg$parttime)
str(datareg)
'data.frame':   500 obs. of  7 variables:
 $ wage      : num  499 206 490 237 760 ...
 $ education : int  14 9 13 13 12 15 0 16 16 16 ...
 $ experience: int  15 47 14 4 44 36 48 27 10 28 ...
 $ ethnicity : Factor w/ 2 levels "afam","cauc": 2 2 2 2 2 2 2 2 2 2 ...
 $ area_type : Factor w/ 2 levels "rural","urban": 2 2 2 2 2 2 2 2 2 2 ...
 $ region    : Factor w/ 4 levels "midwest","northeast",..: 3 3 4 4 2 2 4 2 1 2 ...
 $ parttime  : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 1 ...
summary(datareg)
      wage           education       experience    ethnicity  area_type  
 Min.   :  54.87   Min.   : 0.00   Min.   :-2.00   afam: 43   rural:106  
 1st Qu.: 308.64   1st Qu.:12.00   1st Qu.: 8.00   cauc:457   urban:394  
 Median : 521.79   Median :12.00   Median :15.00                         
 Mean   : 604.14   Mean   :13.03   Mean   :18.42                         
 3rd Qu.: 779.28   3rd Qu.:16.00   3rd Qu.:27.00                         
 Max.   :2374.15   Max.   :18.00   Max.   :57.00                         
       region    parttime 
 midwest  :120   no :452  
 northeast:101   yes: 48  
 south    :165            
 west     :114            
                          
                          

Distribusi data untuk peubah numerik

hist(datareg$wage)

hist(datareg$education)

hist(datareg$experience)

Barplot untuk peubah kategori

barplot(table(datareg$ethnicity))

barplot(table(datareg$area_type))

barplot(table(datareg$region))

barplot(table(datareg$parttime))

Analisis Korelasi

kordata <- cor(datareg[,1:3])
round(kordata, 2)
           wage education experience
wage       1.00      0.34       0.13
education  0.34      1.00      -0.33
experience 0.13     -0.33       1.00

Visualisasi Korelasi install.packages('corrplot')

library(corrplot)
corrplot(kordata, type="lower")

Regresi OLS

\[ wage_i = \beta_0 + \beta*experiece_i + \epsilon_i \] Regresi linier sederhana

model1 <- wage ~ experience
reg1 <- lm(model1, data = datareg)
summary(reg1)

Call:
lm(formula = model1, data = datareg)

Residuals:
   Min     1Q Median     3Q    Max 
-667.8 -275.5  -79.1  165.0 1812.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  529.670     31.260  16.944   <2e-16 ***
experience     4.044      1.382   2.925   0.0036 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 405.5 on 498 degrees of freedom
Multiple R-squared:  0.01689,   Adjusted R-squared:  0.01491 
F-statistic: 8.555 on 1 and 498 DF,  p-value: 0.003603

\[ log(wage_i) = \beta_0 + \beta*experiece_i + \epsilon_i \] Regresi linier sederhana dnegan transformasi log

model2 <- log(wage) ~ experience
reg2 <- lm(model2, data = datareg)
summary(reg2)

Call:
lm(formula = model2, data = datareg)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.11088 -0.39666  0.07784  0.45918  1.67853 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 6.023923   0.052768 114.159   <2e-16 ***
experience  0.008743   0.002334   3.746    2e-04 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6845 on 498 degrees of freedom
Multiple R-squared:  0.02741,   Adjusted R-squared:  0.02546 
F-statistic: 14.03 on 1 and 498 DF,  p-value: 0.0002005

Regresi linier berganda

model3 <- log(wage) ~ experience + area_type
reg3 <- lm(model3, data = datareg)
summary(reg3)

Call:
lm(formula = model3, data = datareg)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.18294 -0.41471  0.08782  0.44539  1.90790 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    5.792031   0.079222  73.111  < 2e-16 ***
experience     0.009058   0.002303   3.934 9.56e-05 ***
area_typeurban 0.286898   0.073904   3.882 0.000118 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.675 on 497 degrees of freedom
Multiple R-squared:  0.05603,   Adjusted R-squared:  0.05223 
F-statistic: 14.75 on 2 and 497 DF,  p-value: 5.981e-07

Ubah baseline kategori

datareg$area_type <- relevel(datareg$area_type, ref = "urban")

model3 <- log(wage) ~ experience + area_type
reg3 <- lm(model3, data = datareg)
summary(reg3)

Call:
lm(formula = model3, data = datareg)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.18294 -0.41471  0.08782  0.44539  1.90790 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     6.078929   0.053932 112.714  < 2e-16 ***
experience      0.009058   0.002303   3.934 9.56e-05 ***
area_typerural -0.286898   0.073904  -3.882 0.000118 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.675 on 497 degrees of freedom
Multiple R-squared:  0.05603,   Adjusted R-squared:  0.05223 
F-statistic: 14.75 on 2 and 497 DF,  p-value: 5.981e-07

Pengujian Asumsi Residual

QQPLOT

qqnorm(reg3$residuals)
qqline(reg3$residuals)

shapiro.test(reg3$residuals)

    Shapiro-Wilk normality test

data:  reg3$residuals
W = 0.98734, p-value = 0.0002485

Install Package: install.packages("lmtest")

library(lmtest)

Uji Heteroskedastisitas

bptest(reg3) 

    studentized Breusch-Pagan test

data:  reg3
BP = 0.12002, df = 2, p-value = 0.9418

Uji Multikolinieritas

Install Package: install.packages("car")

library(car)
vif(reg3)
experience  area_type 
  1.001249   1.001249 

Uji Autokorelasi

dwtest(reg3)

    Durbin-Watson test

data:  reg3
DW = 1.9226, p-value = 0.1925
alternative hypothesis: true autocorrelation is greater than 0

Tambahan

Visualisasi data

# Boxplot
boxplot(wage ~ area_type, data = datareg, 
        main = "Upah Berdasarkan Tipe Area", 
        xlab = "Tipe Area", 
        ylab = "Dolar per Minggu", 
        col = "orange")

# Scatterplot
plot(datareg$experience, datareg$wage, 
     main = "Upah Berdasarkan Pengalaman Kerja", 
     xlab = "Pengalaman Kerja", 
     ylab = "Log Dolar per Minggu", 
     col = "blue")

line
function (x, y = NULL, iter = 1) 
{
    xy <- xy.coords(x, y, setLab = FALSE)
    ok <- complete.cases(xy$x, xy$y)
    Call <- sys.call()
    structure(.Call(C_tukeyline, as.double(xy$x[ok]), as.double(xy$y[ok]), 
        as.integer(iter), Call), class = "tukeyline")
}
<bytecode: 0x0000016445f099a0>
<environment: namespace:stats>
# Scatterplot
plot(datareg$experience, log(datareg$wage), 
     main = "Upah Berdasarkan Pengalaman Kerja", 
     xlab = "Pengalaman Kerja", 
     ylab = "Log Dolar per Minggu", 
     col = "blue")

line
function (x, y = NULL, iter = 1) 
{
    xy <- xy.coords(x, y, setLab = FALSE)
    ok <- complete.cases(xy$x, xy$y)
    Call <- sys.call()
    structure(.Call(C_tukeyline, as.double(xy$x[ok]), as.double(xy$y[ok]), 
        as.integer(iter), Call), class = "tukeyline")
}
<bytecode: 0x0000016445f099a0>
<environment: namespace:stats>
# Plot log(wage) terhadap experience
plot(log(datareg$wage) ~ datareg$experience,
     main = "Hubungan Upah dan Pengalaman Kerja",
     ylab = "Dolar per Minggu (Log)",
     xlab = "Pengalaman Kerja",
     pch = 20)

# Garis regresi pertama
abline(a = 5, b = 0.01, col = "blue", lty = 2) # lt = 2: garis putus-putus

# Garis regresi kedua dengan fungsi lm() atau OLS
abline(lm(log(wage) ~ experience, data = datareg), col = "red")

# Tambahkan label untuk kedua garis
legend("bottomright",
       legend = c("Garis Regresi: a = 5, b = 0.1", 
                  "Garis Regresi: OLS"),
       col = c("blue", "red"), # Warna garis
       lty = c(2, 1), # Jenis garis
       bty = "n") # Tidak ada kotak di sekitar legend