Examples

Regression poststratification

using StatsKit, FreqTables, ProportionalFitting

First, we generate some fake data with demographic characteristics.

N = 100
p_sex = Categorical([0.49, 0.51])
p_edu = Categorical([0.1, 0.3, 0.4, 0.2])
p_age = Categorical([0.1, 0.3, 0.3, 0.2, 0.1])
p_inc = LogNormal(5, 3)
p_opn = Normal(0, 3)
df = DataFrame(
    :sex => CategoricalArray(["m", "f"][rand(p_sex, N)]),
    :edu => CategoricalArray(["no", "lo", "med", "hi"][rand(p_edu, N)], ordered = true),
    :age => CategoricalArray(["<10", "11<25", "26<50", "51<70", ">70"][rand(p_age, N)], ordered = true),
    :inc => rand(p_inc, N)
)
df.opn = 1.5 .+ log.(df.inc) .* 0.1 .+ rand(p_opn, N)

# show the first 6 rows
first(df, 6)
6×5 DataFrame
 Row │ sex   edu   age    inc         opn
     │ Cat…  Cat…  Cat…   Float64     Float64
─────┼──────────────────────────────────────────
   1 │ f     med   11<25  16680.0      8.38927
   2 │ f     lo    26<50     19.114    3.26116
   3 │ m     hi    51<70    109.251    5.15713
   4 │ m     lo    11<25      8.65787  0.742947
   5 │ f     med   26<50      3.60707  2.98436
   6 │ f     hi    26<50   3543.56     7.49632
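
For illustration (an extra step, not strictly needed), we can inspect the sample margins of the background characteristics; in general these will not match the population margins used below, which is the reason for reweighting.

# sample marginal proportions, for comparison with the population margins below
proptable(df.sex)
proptable(df.edu)
proptable(df.age)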

Then, we use ProportionalFitting to create poststratification weights from population-level margins of these background characteristics.

# Create a cross-table of background characteristics
tab = freqtable(df, :sex, :edu, :age)

# Create margins from the population (aggregates obtained from a statistical agency);
# each margin vector must follow the level order of the corresponding dimension of tab
pop_margins = ArrayMargins([[0.49, 0.51], [0.05, 0.2, 0.5, 0.25], [0.1, 0.2, 0.4, 0.15, 0.15]])

# Compute array factors
fac = ipf(Array(tab), pop_margins)

# Compute adjusted table
tab_adj = Array(fac) .* tab
2×4×5 Named Array{Float64, 3}

[:, :, age=11<25] =
sex ╲ edu │          hi           lo          med           no
──────────┼───────────────────────────────────────────────────
f         │  0.00337374   0.00584372    0.0446535          0.0
m         │ 0.000759526   0.00789354          0.0     0.037476

[:, :, age=26<50] =
sex ╲ edu │         hi          lo         med          no
──────────┼───────────────────────────────────────────────
f         │  0.0120083   0.0237713   0.0635751   0.0211609
m         │ 0.00617925   0.0160548   0.0572503         0.0

[:, :, age=51<70] =
sex ╲ edu │         hi          lo         med          no
──────────┼───────────────────────────────────────────────
f         │ 0.00783063   0.0542543   0.0621859   0.0482967
m         │  0.0105774   0.0488569    0.167998         0.0
⋮
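
As a quick sanity check (an optional addition), the margins of the adjusted table should reproduce the population margins up to the convergence tolerance of ipf:

# sum out the other dimensions to recover each margin as a plain vector
vec(sum(Array(tab_adj), dims = (2, 3)))  # ≈ [0.49, 0.51]                 (sex)
vec(sum(Array(tab_adj), dims = (1, 3)))  # ≈ [0.05, 0.2, 0.5, 0.25]       (edu, in level order)
vec(sum(Array(tab_adj), dims = (1, 2)))  # ≈ [0.1, 0.2, 0.4, 0.15, 0.15]  (age, in level order)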

Create weights by looking up each observation's adjusted count in the table and then normalizing.

# Create a poststratification weight variable by lookup from the table
df.w = [tab_adj[row...] for row in eachrow(df[:, [:sex, :edu, :age]])]
df.w = df.w ./ sum(df.w) .* N # normalize the weights so they sum to N
100-element Vector{Float64}:
 1.0053990768357948
 0.5352238685451375
 0.2381565089891971
 0.17772751988613258
 1.4314289522123937
 0.27037464774362274
 1.4314289522123937
 1.0053990768357948
 1.2890240383114946
 0.17772751988613258
 ⋮
 1.1000406828933365
 3.782574937812475
 3.782574937812475
 0.6124857769925756
 0.3614830645452731
 1.2890240383114946
 1.0874273026178114
 1.292597710562912
 1.0053990768357948
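
As an optional aside, comparing the unweighted and weighted mean of the outcome already hints at the effect of the weights; weights and the weighted mean method come from StatsBase, which StatsKit re-exports.

# unweighted vs. poststratification-weighted mean of the outcome
mean(df.opn)
mean(df.opn, weights(df.w))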

Let's take a look at how our regression estimates change!

# perform unweighted regression
frm = @formula(opn ~ log(inc))
res = lm(frm, df)
StatsModels.TableRegressionModel{GLM.LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

opn ~ 1 + :(log(inc))

Coefficients:
───────────────────────────────────────────────────────────────────────
                Coef.  Std. Error     t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────
(Intercept)  1.70299     0.593909  2.87    0.0051   0.524393   2.88158
log(inc)     0.111786    0.106306  1.05    0.2956  -0.099174   0.322747
───────────────────────────────────────────────────────────────────────
# perform weighted regression
res_w = lm(frm, df, wts = df.w)
StatsModels.TableRegressionModel{GLM.LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

opn ~ 1 + :(log(inc))

Coefficients:
────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error     t  Pr(>|t|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────
(Intercept)  1.57576      0.557384  2.83    0.0057   0.469655   2.68187
log(inc)     0.0641786    0.106071  0.61    0.5465  -0.146315   0.274672
────────────────────────────────────────────────────────────────────────
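
To compare the two fits side by side, we can also extract the coefficient vectors directly (an optional extra; the values below are read off the tables above):

# unweighted vs. weighted coefficient estimates: (Intercept), log(inc)
coef(res)    # ≈ [1.70299, 0.111786]
coef(res_w)  # ≈ [1.57576, 0.0641786]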