Examples
Regression poststratification
using StatsKit, FreqTables, ProportionalFitting
First, we generate some fake data with demographic characteristics.
# Sampling distributions for the demographic characteristics
N = 100
p_sex = Categorical([0.49, 0.51])
p_edu = Categorical([0.1, 0.3, 0.4, 0.2])
p_age = Categorical([0.1, 0.3, 0.3, 0.2, 0.1])
p_inc = LogNormal(5, 3)
p_opn = Normal(0, 3)

# Draw a random sample of background characteristics and income
df = DataFrame(
    :sex => CategoricalArray(["m", "f"][rand(p_sex, N)]),
    :edu => CategoricalArray(["no", "lo", "med", "hi"][rand(p_edu, N)], ordered = true),
    :age => CategoricalArray(["<10", "11<25", "26<50", "51<70", ">70"][rand(p_age, N)], ordered = true),
    :inc => rand(p_inc, N)
)

# The outcome (opinion) depends weakly on log-income, plus noise
df.opn = 1.5 .+ 0.1 .* log.(df.inc) .+ rand(p_opn, N)
# show the first 6 rows
first(df, 6)
6×5 DataFrame
 Row │ sex   edu   age    inc        opn
     │ Cat…  Cat…  Cat…   Float64    Float64
─────┼────────────────────────────────────────
   1 │ f     med   <10      1.49842   3.25607
   2 │ f     med   11<25  490.191     2.30299
   3 │ m     lo    51<70   57.0477   -1.12036
   4 │ f     lo    26<50  850.882     2.58526
   5 │ m     no    26<50  171.426     2.54066
   6 │ f     hi    >70    201.741     8.55065
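Before creating the weights, it is worth comparing the sample's marginal distributions with the population margins we will target below. A quick sketch using proptable (FreqTables' proportion counterpart to freqtable); note that the printed level order (alphabetical for these CategoricalArrays) is also the order in which the population margins must be specified:
# Observed sample proportions for each background variable
proptable(df, :sex)
proptable(df, :edu)
proptable(df, :age)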
Then, we use ProportionalFitting to create poststratification weights from population-level margins of these background characteristics.
# Create a cross-table of background characteristics
tab = freqtable(df, :sex, :edu, :age)
# Create margins from the population (aggregates obtained from a statistical agency)
pop_margins = ArrayMargins([[0.49, 0.51], [0.05, 0.2, 0.5, 0.25], [0.1, 0.2, 0.4, 0.15, 0.15]])
# Compute array factors
fac = ipf(Array(tab), pop_margins)
# Compute adjusted table
tab_adj = Array(fac) .* tab
2×4×5 Named Array{Float64, 3}
[:, :, age="11<25"] =
sex ╲ edu │ "hi" "lo" "med" "no"
──────────┼───────────────────────────────────────────────
"f" │ 0.00226476 0.0125168 0.0221265 0.0262771
"m" │ 0.00194546 0.00896005 0.0108611 0.0150482
[:, :, age="26<50"] =
sex ╲ edu │ "hi" "lo" "med" "no"
──────────┼───────────────────────────────────────────────
"f" │ 0.0 0.0244383 0.0462864 0.0
"m" │ 0.00854638 0.0209928 0.0556648 0.0440713
[:, :, age="51<70"] =
sex ╲ edu │ "hi" "lo" "med" "no"
──────────┼───────────────────────────────────────────
"f" │ 0.0124573 0.0344242 0.086933 0.048179
"m" │ 0.0160514 0.0295708 0.0896117 0.0827726
⋮
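As a sanity check (a small sketch, not part of the original example), the adjusted table should reproduce the target margins up to the convergence tolerance of ipf:
# Margins of the adjusted table, to compare against pop_margins
A = Array(tab_adj)
vec(sum(A, dims = (2, 3)))  # sex margin, ≈ [0.49, 0.51]
vec(sum(A, dims = (1, 3)))  # edu margin, ≈ [0.05, 0.2, 0.5, 0.25]
vec(sum(A, dims = (1, 2)))  # age margin, ≈ [0.1, 0.2, 0.4, 0.15, 0.15]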
We then create the weights by looking up each unit's adjusted cell value in the table and normalizing so the weights sum to the sample size.
# Create a poststratification weight variable by lookup from the table
df.w = [tab_adj[row...] for row in eachrow(df[:, [:sex, :edu, :age]])]
df.w = df.w ./ sum(df.w) .* N # normalize the weights
100-element Vector{Float64}:
1.1556073389818806
0.6028851610942277
0.8057217605868462
0.6658769253217086
1.200819877571924
0.23799660287095142
0.43735668244151815
0.29593433259381396
0.43735668244151815
0.34104803121706323
⋮
0.6028851610942277
0.05300828726323624
0.6028851610942277
0.33942646492302503
1.3127442474308966
0.9169183766055442
2.3686849157862353
0.43676273863855186
0.23286526308411443
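Because the weights vary across units, they reduce the effective amount of information in the sample. A standard diagnostic (a sketch, not part of the original example) is Kish's effective sample size, n_eff = (Σw)² / Σw²:
# Kish's effective sample size: values well below N indicate that the
# weighting substantially inflates the variance of weighted estimates.
n_eff = sum(df.w)^2 / sum(df.w .^ 2)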
Let's take a look at how our regression estimates change!
# perform unweighted regression
frm = @formula(opn ~ log(inc))
res = lm(frm, df)
StatsModels.TableRegressionModel{GLM.LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
opn ~ 1 + :(log(inc))
Coefficients:
────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────
(Intercept) 2.31943 0.543711 4.27 <1e-04 1.24045 3.3984
log(inc) 0.0363758 0.0955458 0.38 0.7042 -0.153232 0.225983
────────────────────────────────────────────────────────────────────────
# perform weighted regression
res_w = lm(frm, df, wts = df.w)
StatsModels.TableRegressionModel{GLM.LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
opn ~ 1 + :(log(inc))
Coefficients:
────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────
(Intercept) 1.8934 0.558464 3.39 0.0010 0.785143 3.00165
log(inc) 0.0657046 0.0991532 0.66 0.5091 -0.131062 0.262471
────────────────────────────────────────────────────────────────────────
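The weighted fit moves the intercept from about 2.32 to 1.89 and the log-income slope from 0.036 to 0.066: the poststratification weights shift the estimates toward the population composition. To put the point estimates side by side, a small sketch using coef (from StatsBase, reexported by StatsKit):
# Column 1: unweighted estimates; column 2: weighted estimates
hcat(coef(res), coef(res_w))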