Examples
Regression poststratification
using StatsKit, FreqTables, ProportionalFitting
First, we generate some fake data with demographic characteristics.
N = 100

# marginal distributions for the fake demographics
p_sex = Categorical([0.49, 0.51])
p_edu = Categorical([0.1, 0.3, 0.4, 0.2])
p_age = Categorical([0.1, 0.3, 0.3, 0.2, 0.1])
p_inc = LogNormal(5, 3)
p_opn = Normal(0, 3)

df = DataFrame(
    :sex => CategoricalArray(["m", "f"][rand(p_sex, N)]),
    :edu => CategoricalArray(["no", "lo", "med", "hi"][rand(p_edu, N)], ordered = true),
    :age => CategoricalArray(["<10", "11<25", "26<50", "51<70", ">70"][rand(p_age, N)], ordered = true),
    :inc => rand(p_inc, N)
)

# the opinion outcome depends weakly on log-income, plus noise
df.opn = 1.5 .+ 0.1 .* log.(df.inc) .+ rand(p_opn, N)
# show the first 6 rows
first(df, 6)
6×5 DataFrame
 Row │ sex   edu   age    inc         opn
     │ Cat…  Cat…  Cat…   Float64     Float64
─────┼──────────────────────────────────────────
   1 │ m     med   >70      882.656   -0.812009
   2 │ m     med   11<25  29808.5      5.09486
   3 │ f     lo    >70     8687.3      6.73024
   4 │ f     med   >70     4407.0     -0.811454
   5 │ f     lo    51<70     40.4411   1.17388
   6 │ m     no    26<50    254.066    2.53498
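The sample margins generally deviate from the population margins, and checking that deviation first shows what the weights will have to correct. A quick sketch (not part of the original example) using FreqTables' proptable:

# Sample marginal proportions; compare against the population margins used below
proptable(df, :sex)
proptable(df, :edu)
proptable(df, :age)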
Then we use ProportionalFitting to create poststratification weights: the cross-table of background characteristics in the sample is adjusted so that its margins match known population-level margins.
# Create a cross-table of background characteristics
tab = freqtable(df, :sex, :edu, :age)
# Target margins from the population (aggregates obtained from a statistical agency)
pop_margins = ArrayMargins([[0.49, 0.51], [0.05, 0.2, 0.5, 0.25], [0.1, 0.2, 0.4, 0.15, 0.15]])
# Compute array factors
fac = ipf(Array(tab), pop_margins)
# Compute adjusted table
tab_adj = Array(fac) .* tab
2×4×5 Named Array{Float64, 3}

[:, :, age=11<25] =
sex ╲ edu │         hi          lo         med          no
──────────┼───────────────────────────────────────────────
f         │ 0.00125777  0.00793477   0.0102014   0.0481482
m         │ 0.00106014  0.00250799  0.00859842   0.0202913

[:, :, age=26<50] =
sex ╲ edu │         hi          lo         med          no
──────────┼───────────────────────────────────────────────
f         │ 0.00186086  0.00880454   0.0241485   0.0356173
m         │ 0.00627384   0.0123685    0.050885   0.0600415

[:, :, age=51<70] =
sex ╲ edu │         hi          lo         med          no
──────────┼───────────────────────────────────────────────
f         │  0.0203398   0.0641578   0.0659878         0.0
m         │        0.0   0.0270383    0.222476         0.0

⋮
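As a sanity check, summing the adjusted table over the other two dimensions should recover each target margin (up to IPF's convergence tolerance). A minimal sketch, assuming the sex × edu × age dimension order from above:

# Collapse the adjusted table to each one-way margin
vec(sum(Array(tab_adj); dims = (2, 3)))  # sex margin, ≈ [0.49, 0.51]
vec(sum(Array(tab_adj); dims = (1, 3)))  # edu margin, ≈ [0.05, 0.2, 0.5, 0.25]
vec(sum(Array(tab_adj); dims = (1, 2)))  # age margin, ≈ [0.1, 0.2, 0.4, 0.15, 0.15]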
Next, we create the weights by looking up each observation's adjusted cell count in the table and normalizing so that the weights sum to the sample size.
# Create a poststratification weight variable by lookup from the table
df.w = [tab_adj[row...] for row in eachrow(df[:, [:sex, :edu, :age]])]
df.w = df.w ./ sum(df.w) .* N # normalize the weights
100-element Vector{Float64}:
0.4218824015628472
0.2044726309965533
0.9733003821995955
1.501592972056934
1.5256878840890673
1.4278021668422063
0.28374770778386127
0.24259103527824905
0.41018269630210424
0.05964064971052649
⋮
0.28374770778386127
1.2100581678576767
0.20937411816006465
0.4825329448644217
0.24259103527824905
1.2100581678576767
0.26007891079525
0.2044726309965533
5.290542408872401
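By construction the weights now sum to N. A common diagnostic at this point (not part of the original example) is Kish's effective sample size, which equals N for uniform weights and shrinks as the weights become more variable:

# Kish's effective sample size: (Σw)² / Σw²
ess = sum(df.w)^2 / sum(abs2, df.w)
extrema(df.w)  # a very large max/min ratio signals unstable weights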
Let's take a look at how our regression estimates change!
# perform unweighted regression
frm = @formula(opn ~ log(inc))
res = lm(frm, df)
StatsModels.TableRegressionModel{GLM.LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

opn ~ 1 + :(log(inc))

Coefficients:
───────────────────────────────────────────────────────────────────────
                Coef.  Std. Error     t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────
(Intercept)  1.29953    0.577154   2.25    0.0266   0.154189   2.44487
log(inc)     0.047735   0.0962302  0.50    0.6210  -0.143231   0.238701
───────────────────────────────────────────────────────────────────────
# perform weighted regression
res_w = lm(frm, df, wts = df.w)
StatsModels.TableRegressionModel{GLM.LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

opn ~ 1 + :(log(inc))

Coefficients:
──────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      t  Pr(>|t|)  Lower 95%   Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)   2.21259     0.644666   3.43    0.0009   0.933275   3.49191
log(inc)     -0.231303    0.111401  -2.08    0.0405  -0.452375  -0.0102316
──────────────────────────────────────────────────────────────────────────
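To see the shift at a glance, the estimates can be put side by side; a small sketch using GLM's coef and coefnames accessors (the comparison table itself is not part of the original example):

# Compare unweighted and weighted coefficient estimates
DataFrame(
    term       = coefnames(res),
    unweighted = coef(res),
    weighted   = coef(res_w),
)

After poststratification, the estimated effect of log(inc) flips sign and becomes statistically significant: with an unrepresentative sample, the unweighted and weighted analyses can lead to quite different conclusions.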