# Examples

## Regression poststratification
```julia
using StatsKit, FreqTables, ProportionalFitting
```
First, we generate some fake data with demographic characteristics.
```julia
# sampling distributions for the fake data
N = 100
p_sex = Categorical([0.49, 0.51])
p_edu = Categorical([0.1, 0.3, 0.4, 0.2])
p_age = Categorical([0.1, 0.3, 0.3, 0.2, 0.1])
p_inc = LogNormal(5, 3)
p_opn = Normal(0, 3)

df = DataFrame(
    :sex => CategoricalArray(["m", "f"][rand(p_sex, N)]),
    :edu => CategoricalArray(["no", "lo", "med", "hi"][rand(p_edu, N)], ordered = true),
    :age => CategoricalArray(["<10", "11<25", "26<50", "51<70", ">70"][rand(p_age, N)], ordered = true),
    :inc => rand(p_inc, N)
)
df.opn = 1.5 .+ 0.1 .* log.(df.inc) .+ rand(p_opn, N)
```
```julia
# show the first 6 rows
first(df, 6)
```
```
6×5 DataFrame
 Row │ sex   edu   age    inc            opn
     │ Cat…  Cat…  Cat…   Float64        Float64
─────┼───────────────────────────────────────────────
   1 │ f     med   11<25  16680.0        8.38927
   2 │ f     lo    26<50     19.114      3.26116
   3 │ m     hi    51<70    109.251      5.15713
   4 │ m     lo    11<25      8.65787    0.742947
   5 │ f     med   26<50      3.60707    2.98436
   6 │ f     hi    26<50   3543.56       7.49632
```
Then, we use ProportionalFitting to create poststratification weights from known population-level margins on these background characteristics.
```julia
# Create a cross-table of background characteristics
tab = freqtable(df, :sex, :edu, :age)

# Create margins from the population (aggregates obtained from a statistical agency)
pop_margins = ArrayMargins([[0.49, 0.51], [0.05, 0.2, 0.5, 0.25], [0.1, 0.2, 0.4, 0.15, 0.15]])

# Compute array factors
fac = ipf(Array(tab), pop_margins)

# Compute adjusted table
tab_adj = Array(fac) .* tab
```
```
2×4×5 Named Array{Float64, 3}

[:, :, age=11<25] =
sex ╲ edu │          hi           lo          med           no
──────────┼───────────────────────────────────────────────────
f         │  0.00337374   0.00584372    0.0446535          0.0
m         │ 0.000759526   0.00789354          0.0     0.037476

[:, :, age=26<50] =
sex ╲ edu │         hi          lo         med          no
──────────┼──────────────────────────────────────────────
f         │  0.0120083   0.0237713   0.0635751   0.0211609
m         │ 0.00617925   0.0160548   0.0572503         0.0

[:, :, age=51<70] =
sex ╲ edu │         hi          lo         med          no
──────────┼──────────────────────────────────────────────
f         │ 0.00783063   0.0542543   0.0621859   0.0482967
m         │  0.0105774   0.0488569    0.167998         0.0

⋮
```
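For intuition, `ipf` implements classical iterative proportional fitting: alternately rescale the table along each dimension until every margin matches its target. Below is a minimal two-dimensional sketch in Python (illustration only; `ipf_2d` and the toy numbers are ours, and the package itself handles arbitrary dimensions and returns adjustment factors rather than the fitted table):

```python
# Minimal 2-D iterative proportional fitting, for intuition only.
import numpy as np

def ipf_2d(table, row_targets, col_targets, tol=1e-10, max_iter=1000):
    """Rescale `table` so its row/column sums match the target margins."""
    t = table.astype(float)
    # Express the (proportion) targets on the same scale as the table's total.
    total = t.sum()
    row_targets = np.asarray(row_targets, float) * total / np.sum(row_targets)
    col_targets = np.asarray(col_targets, float) * total / np.sum(col_targets)
    for _ in range(max_iter):
        t *= (row_targets / t.sum(axis=1))[:, None]  # match the row margin
        t *= (col_targets / t.sum(axis=0))[None, :]  # match the column margin
        if np.allclose(t.sum(axis=1), row_targets, atol=tol):
            break
    return t

tab = np.array([[10.0, 20.0], [30.0, 40.0]])
adj = ipf_2d(tab, row_targets=[0.5, 0.5], col_targets=[0.3, 0.7])
```

Each pass matches one margin exactly while disturbing the other a little less than before, which is why the loop converges.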
Next, we create a weight for each respondent by looking up their adjusted cell count and normalizing so the weights sum to the sample size.
```julia
# Create a poststratification weight variable by lookup from the table
df.w = [tab_adj[row...] for row in eachrow(df[:, [:sex, :edu, :age]])]
df.w = df.w ./ sum(df.w) .* N # normalize the weights
```
```
100-element Vector{Float64}:
 1.0053990768357948
 0.5352238685451375
 0.2381565089891971
 0.17772751988613258
 1.4314289522123937
 0.27037464774362274
 1.4314289522123937
 1.0053990768357948
 1.2890240383114946
 0.17772751988613258
 ⋮
 1.1000406828933365
 3.782574937812475
 3.782574937812475
 0.6124857769925756
 0.3614830645452731
 1.2890240383114946
 1.0874273026178114
 1.292597710562912
 1.0053990768357948
```
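The lookup-and-normalize step is simple: each unit's raw weight is the adjusted mass of its (sex, edu, age) cell, rescaled so the weights sum to N. A toy Python sketch (the cell values and sample here are hypothetical, not the data above):

```python
# Toy adjusted cell masses and a toy sample of N = 3 units.
adjusted = {("f", "med"): 0.0446535, ("m", "no"): 0.037476}
sample = [("f", "med"), ("m", "no"), ("f", "med")]

w = [adjusted[cell] for cell in sample]     # look up each unit's cell mass
total = sum(w)
w = [x * len(sample) / total for x in w]    # normalize: weights sum to N
```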
Let's take a look at how our regression estimates change!
```julia
# perform unweighted regression
frm = @formula(opn ~ log(inc))
res = lm(frm, df)
```
```
StatsModels.TableRegressionModel{GLM.LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

opn ~ 1 + :(log(inc))

Coefficients:
───────────────────────────────────────────────────────────────────────
                Coef.  Std. Error     t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────
(Intercept)  1.70299     0.593909  2.87    0.0051   0.524393   2.88158
log(inc)     0.111786    0.106306  1.05    0.2956  -0.099174   0.322747
───────────────────────────────────────────────────────────────────────
```
```julia
# perform weighted regression
res_w = lm(frm, df, wts = df.w)
```
```
StatsModels.TableRegressionModel{GLM.LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

opn ~ 1 + :(log(inc))

Coefficients:
────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error     t  Pr(>|t|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────
(Intercept)  1.57576      0.557384  2.83    0.0057   0.469655   2.68187
log(inc)     0.0641786    0.106071  0.61    0.5465  -0.146315   0.274672
────────────────────────────────────────────────────────────────────────
```
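As an aside on what `wts` does: `lm` with weights fits weighted least squares, whose coefficients solve the weighted normal equations (X′WX)β = X′Wy with W = Diagonal(w). A minimal sketch in Python with simulated data (names and numbers here are illustrative, not tied to the example above):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
y = 1.5 + 0.1 * x + rng.normal(size=n)
w = rng.uniform(0.5, 1.5, size=n)  # arbitrary positive weights

X = np.column_stack([np.ones(n), x])   # design matrix with intercept
W = np.diag(w)
# Solve the weighted normal equations (X'WX) beta = X'W y
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
```

Upweighted observations pull the fit toward themselves, which is exactly how the poststratification weights shift the estimates toward the population.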