Examples

Regression poststratification

using StatsKit, FreqTables, ProportionalFitting

First, we generate some fake data with demographic characteristics.
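
The data are drawn at random, so a fresh run will produce different numbers than the ones shown below. If you want a reproducible run, you can fix a seed first (the seed value here is arbitrary):

using Random
Random.seed!(42)  # any fixed seed makes the example reproducible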

N = 100
p_sex = Categorical([0.49, 0.51])
p_edu = Categorical([0.1, 0.3, 0.4, 0.2])
p_age = Categorical([0.1, 0.3, 0.3, 0.2, 0.1])
p_inc = LogNormal(5, 3)
p_opn = Normal(0, 3)
df = DataFrame(
    :sex => CategoricalArray(["m", "f"][rand(p_sex, N)]),
    :edu => CategoricalArray(["no", "lo", "med", "hi"][rand(p_edu, N)], ordered = true),
    :age => CategoricalArray(["<10", "11<25", "26<50", "51<70", ">70"][rand(p_age, N)], ordered = true),
    :inc => rand(p_inc, N)
)
df.opn = 1.5 .+ 0.1 .* log.(df.inc) .+ rand(p_opn, N)

# show the first 6 rows
first(df, 6)

6×5 DataFrame
 Row │ sex   edu   age    inc        opn
     │ Cat…  Cat…  Cat…   Float64    Float64
─────┼─────────────────────────────────────────
   1 │ f     med   <10      1.49842   3.25607
   2 │ f     med   11<25  490.191     2.30299
   3 │ m     lo    51<70   57.0477   -1.12036
   4 │ f     lo    26<50  850.882     2.58526
   5 │ m     no    26<50  171.426     2.54066
   6 │ f     hi    >70    201.741     8.55065

Then, we use ProportionalFitting to create poststratification weights from population-level margins on the background characteristics.
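
The sample composition will generally not match these population margins exactly. Before reweighting, you can inspect the observed sample margins with proptable from FreqTables, for example for sex:

# observed sample margin for sex (compare with the population margin supplied below)
proptable(df, :sex)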

# Create a cross-table of background characteristics
tab = freqtable(df, :sex, :edu, :age)

# Create margins from the population (aggregates obtained from a statistical agency)
pop_margins = ArrayMargins([[0.49, 0.51], [0.05, 0.2, 0.5, 0.25], [0.1, 0.2, 0.4, 0.15, 0.15]])

# Compute array factors
fac = ipf(Array(tab), pop_margins)

# Compute adjusted table
tab_adj = Array(fac) .* tab
2×4×5 Named Array{Float64, 3}

[:, :, age="11<25"] =
sex ╲ edu │       "hi"        "lo"       "med"        "no"
──────────┼───────────────────────────────────────────────
"f"       │ 0.00226476   0.0125168   0.0221265   0.0262771
"m"       │ 0.00194546  0.00896005   0.0108611   0.0150482

[:, :, age="26<50"] =
sex ╲ edu │       "hi"        "lo"       "med"        "no"
──────────┼───────────────────────────────────────────────
"f"       │        0.0   0.0244383   0.0462864         0.0
"m"       │ 0.00854638   0.0209928   0.0556648   0.0440713

[:, :, age="51<70"] =
sex ╲ edu │      "hi"       "lo"      "med"       "no"
──────────┼───────────────────────────────────────────
"f"       │ 0.0124573  0.0344242   0.086933   0.048179
"m"       │ 0.0160514  0.0295708  0.0896117  0.0827726
⋮
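
As a quick sanity check, the margins of the adjusted table should be approximately equal to the population targets after convergence. For example, summing out the edu and age dimensions should recover the sex margin of roughly [0.49, 0.51]:

# sex margin of the adjusted table; should be ≈ [0.49, 0.51]
vec(sum(Array(tab_adj), dims = (2, 3)))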

We create the weights by looking up each observation's adjusted cell value and then normalizing so that the weights sum to N.

# Create a poststratification weight variable by lookup from the table
df.w = [tab_adj[row...] for row in eachrow(df[:, [:sex, :edu, :age]])]
df.w = df.w ./ sum(df.w) .* N # normalize the weights
100-element Vector{Float64}:
 1.1556073389818806
 0.6028851610942277
 0.8057217605868462
 0.6658769253217086
 1.200819877571924
 0.23799660287095142
 0.43735668244151815
 0.29593433259381396
 0.43735668244151815
 0.34104803121706323
 ⋮
 0.6028851610942277
 0.05300828726323624
 0.6028851610942277
 0.33942646492302503
 1.3127442474308966
 0.9169183766055442
 2.3686849157862353
 0.43676273863855186
 0.23286526308411443
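
To see how the weights shift the effective composition of the sample, you can compare unweighted and weighted frequency tables; freqtable accepts a weights keyword argument:

# unweighted vs. weighted sex distribution (the weighted counts sum to N by construction)
freqtable(df, :sex)
freqtable(df, :sex, weights = df.w)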

Let's take a look at how our regression estimates change!

# perform unweighted regression
frm = @formula(opn ~ log(inc))
res = lm(frm, df)
StatsModels.TableRegressionModel{GLM.LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

opn ~ 1 + :(log(inc))

Coefficients:
────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error     t  Pr(>|t|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────
(Intercept)  2.31943     0.543711   4.27    <1e-04   1.24045    3.3984
log(inc)     0.0363758   0.0955458  0.38    0.7042  -0.153232   0.225983
────────────────────────────────────────────────────────────────────────
# perform weighted regression
res_w = lm(frm, df, wts = df.w)
StatsModels.TableRegressionModel{GLM.LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

opn ~ 1 + :(log(inc))

Coefficients:
────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error     t  Pr(>|t|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────
(Intercept)  1.8934      0.558464   3.39    0.0010   0.785143   3.00165
log(inc)     0.0657046   0.0991532  0.66    0.5091  -0.131062   0.262471
────────────────────────────────────────────────────────────────────────
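
One way to put the two fits side by side is to collect the estimates into a small table using coef and coeftable:

# side-by-side comparison of the estimated coefficients
DataFrame(
    term = coeftable(res).rownms,
    unweighted = coef(res),
    weighted = coef(res_w),
)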