Examples

Regression poststratification

using StatsKit, FreqTables, ProportionalFitting

First, we generate some fake survey data: demographic background characteristics, income, and an opinion variable.

N = 100
p_sex = Categorical([0.49, 0.51])
p_edu = Categorical([0.1, 0.3, 0.4, 0.2])
p_age = Categorical([0.1, 0.3, 0.3, 0.2, 0.1])
p_inc = LogNormal(5, 3)
p_opn = Normal(0, 3)
df = DataFrame(
    :sex => CategoricalArray(["m", "f"][rand(p_sex, N)]),
    :edu => CategoricalArray(["no", "lo", "med", "hi"][rand(p_edu, N)], ordered = true),
    :age => CategoricalArray(["<10", "11<25", "26<50", "51<70", ">70"][rand(p_age, N)], ordered = true),
    :inc => rand(p_inc, N)
)
df.opn = 1.5 .+ 0.1 .* log.(df.inc) .+ rand(p_opn, N) # opinion depends weakly on log-income, plus noise

# show the first 6 rows
first(df, 6)
6×5 DataFrame
 Row │ sex   edu   age    inc         opn
     │ Cat…  Cat…  Cat…   Float64     Float64
─────┼───────────────────────────────────────────
   1 │ m     med   >70      882.656   -0.812009
   2 │ m     med   11<25  29808.5      5.09486
   3 │ f     lo    >70     8687.3      6.73024
   4 │ f     med   >70     4407.0     -0.811454
   5 │ f     lo    51<70     40.4411   1.17388
   6 │ m     no    26<50    254.066    2.53498
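
With only 100 observations, the sample's marginal distributions can deviate noticeably from the population margins we will target below. As an illustrative check, the sample proportions of each background characteristic can be inspected with proptable from FreqTables:

# sample proportions of each background characteristic
proptable(df, :sex)
proptable(df, :edu)
proptable(df, :age)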

Then, we create poststratification weights based on population-level margins of the background characteristics, using ProportionalFitting.

# Create a cross-table of background characteristics
tab = freqtable(df, :sex, :edu, :age)

# Create margins from the population (aggregates obtained from a statistical agency);
# the values must follow the level order of the corresponding dimension of tab
pop_margins = ArrayMargins([[0.49, 0.51], [0.05, 0.2, 0.5, 0.25], [0.1, 0.2, 0.4, 0.15, 0.15]])

# Compute array factors
fac = ipf(Array(tab), pop_margins)

# Compute adjusted table
tab_adj = Array(fac) .* tab
2×4×5 Named Array{Float64, 3}

[:, :, age=11<25] =
sex ╲ edu │         hi          lo         med          no
──────────┼───────────────────────────────────────────────
f         │ 0.00125777  0.00793477   0.0102014   0.0481482
m         │ 0.00106014  0.00250799  0.00859842   0.0202913

[:, :, age=26<50] =
sex ╲ edu │         hi          lo         med          no
──────────┼───────────────────────────────────────────────
f         │ 0.00186086  0.00880454   0.0241485   0.0356173
m         │ 0.00627384   0.0123685    0.050885   0.0600415

[:, :, age=51<70] =
sex ╲ edu │        hi         lo        med         no
──────────┼───────────────────────────────────────────
f         │ 0.0203398  0.0641578  0.0659878        0.0
m         │       0.0  0.0270383   0.222476        0.0
⋮
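
Since IPF adjusts the table until its margins match the supplied targets, a quick sanity check is to sum tab_adj over the other dimensions; each sum should (approximately) reproduce the corresponding population margin. A minimal sketch, using the objects defined above:

# marginal sums of the adjusted table, to compare against pop_margins
vec(sum(Array(tab_adj), dims = (2, 3)))  # sex margin, target [0.49, 0.51]
vec(sum(Array(tab_adj), dims = (1, 3)))  # edu margin, target [0.05, 0.2, 0.5, 0.25]
vec(sum(Array(tab_adj), dims = (1, 2)))  # age margin, target [0.1, 0.2, 0.4, 0.15, 0.15]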

Next, we create the weights by looking up each observation's adjusted cell proportion in the table and then normalizing so the weights sum to N.

# Create a poststratification weight variable by lookup from the table
df.w = [tab_adj[row...] for row in eachrow(df[:, [:sex, :edu, :age]])]
df.w = df.w ./ sum(df.w) .* N # normalize the weights
100-element Vector{Float64}:
 0.4218824015628472
 0.2044726309965533
 0.9733003821995955
 1.501592972056934
 1.5256878840890673
 1.4278021668422063
 0.28374770778386127
 0.24259103527824905
 0.41018269630210424
 0.05964064971052649
 ⋮
 0.28374770778386127
 1.2100581678576767
 0.20937411816006465
 0.4825329448644217
 0.24259103527824905
 1.2100581678576767
 0.26007891079525
 0.2044726309965533
 5.290542408872401
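
These weights can also be used for simple descriptive estimates: any weighted statistic now reflects the population margins rather than the raw sample. For example, a sketch using mean and weights from StatsBase (re-exported by StatsKit):

# unweighted vs. poststratified mean of the opinion variable
mean(df.opn), mean(df.opn, weights(df.w))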

Let's take a look at how our regression estimates change!

# perform unweighted regression
frm = @formula(opn ~ log(inc))
res = lm(frm, df)
StatsModels.TableRegressionModel{GLM.LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

opn ~ 1 + :(log(inc))

Coefficients:
───────────────────────────────────────────────────────────────────────
                Coef.  Std. Error     t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────
(Intercept)  1.29953    0.577154   2.25    0.0266   0.154189   2.44487
log(inc)     0.047735   0.0962302  0.50    0.6210  -0.143231   0.238701
───────────────────────────────────────────────────────────────────────
# perform weighted regression
res_w = lm(frm, df, wts = df.w)
StatsModels.TableRegressionModel{GLM.LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

opn ~ 1 + :(log(inc))

Coefficients:
──────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      t  Pr(>|t|)  Lower 95%   Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)   2.21259     0.644666   3.43    0.0009   0.933275   3.49191
log(inc)     -0.231303    0.111401  -2.08    0.0405  -0.452375  -0.0102316
──────────────────────────────────────────────────────────────────────────
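
To compare the two fits programmatically rather than by reading the printed tables, the coefficient vectors can be extracted with coef (part of the standard GLM interface), for example:

# intercept and log(inc) slope for the unweighted and weighted fits
coef(res), coef(res_w)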