SimPaths Repository Guide¶

A guide to navigating the SimPaths repository structure and codebase.

Table of Contents¶

Repository Structure
Core Components
Key Directories Explained
Sub-package Detail
Data Pipeline Reference
Development Workflow
Code Navigation Tips
Additional Resources

Repository Structure¶

SimPaths/
├── config/                         # Configuration files for simulations
│   ├── default.yml                 # Default simulation parameters
│   ├── test_create_database.yml    # Database creation test config
│   └── test_run.yml                # Test run configuration
│
├── documentation/                  # Comprehensive documentation
│   ├── figures/                    # Diagrams and illustrations
│   ├── wiki/                       # Full documentation website
│   │   ├── getting-started/        # Setup and first simulation guides
│   │   ├── overview/               # Model description and modules
│   │   ├── user-guide/             # Running simulations
│   │   ├── developer-guide/        # Extending the model
│   │   │   └── repository-guide.md # Repository guide (copy for website)
│   │   ├── jasmine-reference/      # JAS-mine library reference
│   │   ├── research/               # Published papers
│   │   └── validation/             # Model validation results
│   ├── repository-guide.md         # Repository structure and navigation guide
│   ├── SimPaths_Variable_Codebook.xlsx    # Codebook of all variables in SimPaths
│   ├── SimPaths_Stata_Parameters.xlsx     # Comparison of parameters: Stata do-files vs Java code
│   └── SimPathsUK_Schedule.xlsx           # Detailed schedule of events and corresponding classes
│
├── input/                          # Input data and parameters
│   ├── InitialPopulations/         # Starting population data
│   │   ├── training/               # De-identified training population (included in repo)
│   │   └── compile/                # Stata pipeline: builds populations, estimates regressions
│   │       ├── do_emphist/         # Employment history reconstruction sub-pipeline
│   │       └── RegressionEstimates/  # Regression coefficient estimation scripts
│   ├── EUROMODoutput/              # Tax-benefit model outputs
│   │   └── training/               # Training UKMOD outputs (included in repo)
│   ├── DoFilesTarget/              # Stata scripts that generate alignment targets
│   ├── align_*.xlsx                # Alignment files (population, employment, etc.)
│   ├── reg_*.xlsx                  # Regression parameter files
│   ├── scenario_*.xlsx             # Scenario configuration files
│   ├── projections_*.xlsx          # Mortality/fertility projections
│   ├── DatabaseCountryYear.xlsx    # Database metadata
│   ├── EUROMODpolicySchedule.xlsx  # Policy schedule
│   ├── policy parameters.xlsx      # Tax-benefit parameters
│   ├── validation_statistics.xlsx  # Validation targets
│   └── input.mv.db                 # H2 donor database (generated by setup)
│
├── output/                         # Simulation outputs
│   ├── [timestamp]_[seed]_[run]/   # Timestamped output folders
│   │   ├── csv/
│   │   │   ├── Statistics1.csv          # Income distribution, Gini, S-Index
│   │   │   ├── Statistics2<N>.csv       # Demographics by age and gender
│   │   │   ├── Statistics3<N>.csv       # Alignment diagnostics
│   │   │   ├── Person<N>.csv            # Person-level output
│   │   │   ├── BenefitUnit<N>.csv       # Benefit-unit-level output
│   │   │   └── Household<N>.csv         # Household-level output
│   │   ├── database/                    # Run-specific persistence output
│   │   └── input/                       # Copied run input artifacts
│   └── logs/                       # Log files (with -f flag on multirun)
│
├── src/                            # Source code
│   ├── main/
│   │   ├── java/simpaths/
│   │   │   ├── data/               # Data handling and parameters
│   │   │   ├── experiment/         # Simulation execution classes
│   │   │   └── model/              # Core model implementation
│   │   │       ├── decisions/      # Intertemporal optimisation grids
│   │   │       ├── enums/          # Categorical variable definitions
│   │   │       ├── taxes/          # EUROMOD donor matching
│   │   │       └── lifetime_incomes/  # Synthetic income trajectory generation
│   │   └── resources/              # Configuration resources
│   └── test/                       # Test classes
│
├── validation/                     # Validation scripts and results
│   ├── 01_estimate_validation/     # Estimation validation
│   └── 02_simulated_output_validation/  # Output validation
│
├── pom.xml                         # Maven build configuration
├── singlerun.jar                   # Executable for single runs
├── multirun.jar                    # Executable for multiple runs
└── README.md                       # Project overview

Core Components¶

1. Entry Points¶

SimPathsStart (`src/main/java/simpaths/experiment/SimPathsStart.java`)¶

Main class for single simulation execution
Handles GUI and command-line interfaces
Manages database setup phases
Key methods:
main(): Entry point
runGUIdialog(): Launch GUI
runGUIlessSetup(): Command-line setup

SimPathsMultiRun (`src/main/java/simpaths/experiment/SimPathsMultiRun.java`)¶

Coordinates multiple simulation runs
Manages parallel execution
Aggregates results across runs
Configurable via YAML files

2. Core Model¶

SimPathsModel (`src/main/java/simpaths/model/SimPathsModel.java`)¶

Central simulation manager
Extends AbstractSimulationManager from JAS-mine
Defines the simulation schedule via buildSchedule()
Manages all simulation modules and processes
Key responsibilities:
Population initialization
Event scheduling
Module coordination
Time progression

3. Data & Parameters¶

Parameters (`src/main/java/simpaths/data/Parameters.java`)¶

Global parameter storage
Loads regression coefficients from Excel
Manages country-specific configurations
Stores alignment targets
Key data structures:
Regression coefficient maps
Policy parameters
Alignment targets
EUROMOD variable definitions

Key Directories Explained¶

`/src/main/java/simpaths/`¶

`data/`¶

Purpose: Data handling, parameter management, and utility classes

Parameters.java: Global parameter storage and Excel data loading
ManagerRegressions.java: Regression coefficient management
CallEUROMOD.java / CallEMLight.java: Interface with tax-benefit models
filters/: Collection filters for querying simulated populations
startingpop/: Initial population data parsing
statistics/: Statistical utilities

`experiment/`¶

Purpose: Simulation execution and coordination

SimPathsStart.java: Single-run entry point
SimPathsMultiRun.java: Multi-run orchestration
SimPathsCollector.java: Output collection and aggregation
SimPathsObserver.java: GUI updates and monitoring

`model/`¶

Purpose: Core simulation logic

SimPathsModel.java: Main simulation manager
Person.java: Individual-level processes and attributes
BenefitUnit.java: Fiscal unit processes
Household.java: Residential unit processes
decisions/: Labour supply and consumption optimization
enums/: Type-safe enumerations (Gender, Country, HealthStatus, etc.)
taxes/: Tax-benefit donor matching system
lifetime_incomes/: Lifetime income projection utilities

`/input/`¶

Critical input files:

File Pattern	Purpose
`align_*.xlsx`	Alignment targets (population, employment, education, etc.)
`reg_*.xlsx`	Regression parameters for behavioral processes
`scenario_*.xlsx`	Policy scenarios and projections
`projections_*.xlsx`	Demographic projections (mortality, fertility)
`DatabaseCountryYear.xlsx`	Tracks current database country/year
`EUROMODpolicySchedule.xlsx`	Tax-benefit policy schedule
`policy parameters.xlsx`	Detailed policy parameters

Subdirectories:

- InitialPopulations/: Starting population databases
- EUROMODoutput/: Tax-benefit donor population data
- DoFilesTarget/: Stata-generated alignment targets

`/config/`¶

YAML configuration files override default parameters. The main file is default.yml, which contains several configuration sections:

model_args: SimPathsModel parameters (alignment switches, behavioral responses)
collector_args: Output options (CSV, database, statistics)
parameter_args: Data directories and input years
innovation_args: Experimental parameters for sensitivity analysis

Additional configuration files for testing: test_create_database.yml, test_run.yml

Sub-package Detail¶

The following sub-packages are self-contained subsystems whose internals are not obvious from the class names alone.

`model/decisions/` — IO engine¶

When intertemporal optimisation (IO) is enabled, computing optimal consumption–labour choices for every agent at every time step would be prohibitively slow. This package therefore solves the problem once, before the simulation runs. It constructs a grid covering all meaningful combinations of state variables (wealth, age, health, family status, etc.), then works backwards from the end of life to find the optimal choice at each grid point (backward induction). During the simulation, agents look up their current state in the pre-computed grid instead of re-solving the problem.

Class	Purpose
`DecisionParams`	Defines the state-space dimensions and grid parameters for the optimisation problem.
`ManagerPopulateGrids`	Populates the state-space grid points and evaluates value functions by backward induction.
`ManagerSolveGrids`	Solves for optimal policy at each grid point.
`ManagerFileGrids`	Reads and writes pre-computed grids to disk, so they can be reused across runs.
`Grids`	Container for the set of solved decision grids.
`States`	Enumerates the state variables that define each grid point.
`Expectations` / `LocalExpectations`	Computes expected future values over stochastic transitions.
`CESUtility`	CES utility function used in the optimisation.
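The "solve once, look up later" idea can be illustrated with a toy problem. The sketch below is not the SimPaths implementation: it assumes a single state variable (integer wealth), a log utility function in place of CES, no income and no interest, and the class name `ToyBackwardInduction` is hypothetical.

```java
/** Toy backward induction: choose consumption each period to maximise discounted utility. */
final class ToyBackwardInduction {
    private final int periods;       // decision periods t = 0 .. periods - 1
    private final int maxWealth;     // wealth grid is 0 .. maxWealth in integer units
    private final double beta;       // discount factor (assumed, not a SimPaths parameter)
    private final double[][] value;  // value[t][w]: best attainable utility from state (t, w)
    private final int[][] policy;    // policy[t][w]: optimal consumption at state (t, w)

    ToyBackwardInduction(int periods, int maxWealth, double beta) {
        this.periods = periods;
        this.maxWealth = maxWealth;
        this.beta = beta;
        // value[periods][w] stays 0: nothing is valued after the end of life.
        this.value = new double[periods + 1][maxWealth + 1];
        this.policy = new int[periods][maxWealth + 1];
        solve();
    }

    /** log(1 + c), so that zero consumption is well defined. */
    private static double utility(int consumption) {
        return Math.log(1.0 + consumption);
    }

    /** Work backwards from the last period, storing the best choice at every grid point. */
    private void solve() {
        for (int t = periods - 1; t >= 0; t--) {
            for (int w = 0; w <= maxWealth; w++) {
                double best = Double.NEGATIVE_INFINITY;
                int bestConsumption = 0;
                for (int c = 0; c <= w; c++) {
                    double v = utility(c) + beta * value[t + 1][w - c];
                    if (v > best) {
                        best = v;
                        bestConsumption = c;
                    }
                }
                value[t][w] = best;
                policy[t][w] = bestConsumption;
            }
        }
    }

    /** Simulation-time lookup: no optimisation happens here. */
    int lookupConsumption(int t, int wealth) {
        return policy[t][wealth];
    }

    public static void main(String[] args) {
        ToyBackwardInduction grid = new ToyBackwardInduction(3, 10, 0.95);
        for (int t = 0; t < 3; t++) {
            System.out.println("t=" + t + " wealth=10 consume=" + grid.lookupConsumption(t, 10));
        }
    }
}
```

The expensive triple loop runs once in the constructor; every later call to `lookupConsumption` is an array read, which is the property that makes the pre-computed grid worthwhile in a large simulation.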

`model/taxes/` — EUROMOD donor matching¶

Imputes taxes and benefits onto simulated benefit units by matching them to pre-computed EUROMOD donor records.

Class	Purpose
`DonorTaxImputation`	Main entry point. Implements the three-step matching process: coarse-exact matching on characteristics, income proximity filtering, and candidate selection/averaging.
`KeyFunction` / `KeyFunction1`–`4`	Four progressively relaxed matching-key definitions. The system tries the tightest key first and falls back through wider keys if no donors are found.
`DonorKeys`	Builds composite matching keys from benefit-unit characteristics.
`DonorTaxUnit` / `DonorPerson`	Represent the pre-computed EUROMOD donor records loaded from the database.
`CandidateList`	Ranked list of donor matches for a given benefit unit, sorted by income proximity.
`Match` / `Matches`	Store the final selected donor(s) and their imputed tax-benefit values.

The taxes/database/ sub-package handles loading donor data from the H2 database into memory (TaxDonorDataParser, DatabaseExtension, MatchIndices).
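The fallback through progressively relaxed keys can be sketched as follows. This is an illustration under assumptions, not the SimPaths code: the `Unit` fields, the three key definitions, and all class and method names are hypothetical, and the income-proximity and averaging steps are omitted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Illustrative only: try the tightest matching key first, then looser ones. */
final class KeyFallbackSketch {

    /** Hypothetical, heavily simplified benefit-unit description. */
    static final class Unit {
        final int adults;
        final int children;
        final int ageBand;

        Unit(int adults, int children, int ageBand) {
            this.adults = adults;
            this.children = children;
            this.ageBand = ageBand;
        }
    }

    /** Key definitions ordered from tightest to loosest (three here for brevity). */
    static final List<Function<Unit, String>> KEYS = List.of(
        u -> u.adults + "|" + u.children + "|" + u.ageBand,
        u -> u.adults + "|" + u.children,
        u -> String.valueOf(u.adults));

    /** One lookup table per key level: key string -> positions of donors in the input list. */
    static List<Map<String, List<Integer>>> buildIndex(List<Unit> donors) {
        List<Map<String, List<Integer>>> index = new ArrayList<>();
        for (Function<Unit, String> key : KEYS) {
            Map<String, List<Integer>> byKey = new HashMap<>();
            for (int id = 0; id < donors.size(); id++) {
                byKey.computeIfAbsent(key.apply(donors.get(id)), k -> new ArrayList<>()).add(id);
            }
            index.add(byKey);
        }
        return index;
    }

    /** Returns the first (tightest) key level with at least one donor, or -1 if none match. */
    static int matchedLevel(Unit unit, List<Map<String, List<Integer>>> index) {
        for (int level = 0; level < KEYS.size(); level++) {
            if (index.get(level).containsKey(KEYS.get(level).apply(unit))) {
                return level;
            }
        }
        return -1;
    }

    /** Donor ids found at the tightest matching level; empty if no level matches. */
    static List<Integer> findDonors(Unit unit, List<Map<String, List<Integer>>> index) {
        int level = matchedLevel(unit, index);
        return level < 0 ? List.of() : index.get(level).get(KEYS.get(level).apply(unit));
    }
}
```

Building one index per key level up front keeps each lookup cheap, at the cost of holding several maps in memory.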

`model/lifetime_incomes/` — synthetic income trajectories¶

When IO is enabled, this package creates projected income paths for birth cohorts using an AR(2) process anchored to age-gender geometric means, and matches simulated persons to donor income profiles.

Class	Purpose
`ManagerProjectLifetimeIncomes`	Generates the synthetic income trajectory database for all birth cohorts in the simulation horizon.
`LifetimeIncomeImputation`	Matches each simulated person to a donor income trajectory via binary search on the income CDF.
`AnnualIncome`	Implements the AR(2) income process with age-gender anchoring.
`BirthCohort`	Groups individuals by birth year for cohort-level income projection.
`Individual`	Entity carrying age dummies and log GDP per capita for income regression.
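The two techniques named above, an AR(2) process around an anchor path and a binary search on a CDF, can be sketched in a few lines. The functional form is an assumption made for illustration (the AR(2) is applied to deviations from the anchor, with zero initial deviations), and the class and method names are hypothetical rather than taken from `AnnualIncome` or `LifetimeIncomeImputation`.

```java
import java.util.Arrays;

/** Illustrative sketch of an anchored AR(2) path and a CDF lookup by binary search. */
final class IncomeTrajectorySketch {

    /**
     * Assumed form: dev[t] = rho1 * dev[t-1] + rho2 * dev[t-2] + shocks[t],
     * logIncome[t] = anchor[t] + dev[t], with deviations before t = 0 set to zero.
     */
    static double[] simulateLogIncome(double[] anchor, double rho1, double rho2, double[] shocks) {
        int n = anchor.length;
        double[] dev = new double[n];
        double[] logIncome = new double[n];
        for (int t = 0; t < n; t++) {
            double lag1 = t >= 1 ? dev[t - 1] : 0.0;
            double lag2 = t >= 2 ? dev[t - 2] : 0.0;
            dev[t] = rho1 * lag1 + rho2 * lag2 + shocks[t];
            logIncome[t] = anchor[t] + dev[t];
        }
        return logIncome;
    }

    /**
     * Returns the index of the first entry whose cumulative probability is at least u.
     * Requires cdf to be non-decreasing with a final value of at least u.
     */
    static int donorIndexForQuantile(double[] cdf, double u) {
        int lo = 0;
        int hi = cdf.length - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (cdf[mid] >= u) {
                hi = mid;        // answer is at mid or to its left
            } else {
                lo = mid + 1;    // answer is strictly to the right
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        double[] anchor = {10.0, 10.0, 10.0};
        double[] shocks = {1.0, 0.0, 0.0};
        System.out.println(Arrays.toString(simulateLogIncome(anchor, 0.5, 0.2, shocks)));
        System.out.println(donorIndexForQuantile(new double[] {0.2, 0.5, 0.9, 1.0}, 0.95));
    }
}
```

In the example, a single shock in the first year decays over later years at a rate set by the two autoregressive coefficients, while the anchor keeps the path tied to the assumed age-gender mean.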

Output CSV filenames follow the pattern <EntityClass><RunNumber>.csv. With a single run the suffix is 1 (for example, Person1.csv); with multiple runs, each run produces its own numbered file.
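Because the run number is appended directly to the entity name, names such as Statistics21.csv are easy to misread. The helper below is a hypothetical illustration of the naming pattern, not code from the collector.

```java
/** Illustrative helper for the <EntityClass><RunNumber>.csv naming pattern. */
final class OutputFileNames {

    /** Concatenates the entity name and run number with no separator. */
    static String csvName(String entityClass, int runNumber) {
        return entityClass + runNumber + ".csv";
    }

    public static void main(String[] args) {
        System.out.println(csvName("Person", 1));       // Person1.csv
        System.out.println(csvName("Statistics2", 1));  // Statistics21.csv
    }
}
```

Read this way, Statistics21.csv is the Statistics2 output of run 1, not a twenty-first statistics file.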

For a description of the variables in output CSV files, see documentation/SimPaths_Variable_Codebook.xlsx. For a description of each reg_*, align_*, and scenario_* input file, see Model Parameterisation on the website.

Data Pipeline Reference¶

This section explains how the simulation-ready input files in input/ are generated from raw survey data, and what to do if you need to update or extend them.

The pipeline has three independent parts: (1) initial populations, (2) regression coefficients, (3) alignment targets. Each can be re-run separately.

Data sources¶

Source	Description	Access
UKHLS (Understanding Society)	Main household panel survey; waves 1 to O (UKDA-6614-stata)	Requires EUL licence from UK Data Service
BHPS (British Household Panel Survey)	Historical predecessor to UKHLS; used for pre-2009 employment history	Bundled with UKHLS EUL
WAS (Wealth and Assets Survey)	Biennial survey of household wealth; waves 1 to 7 (UKDA-7215-stata)	Requires EUL licence from UK Data Service
EUROMOD / UKMOD	Tax-benefit microsimulation system	See Tax-Benefit Donors (UK) on the website

Part 1 — Initial populations (`input/InitialPopulations/compile/`)¶

What it produces: Annual CSV files population_initial_UK_<year>.csv used as the starting population for each simulation run.

Master script: input/InitialPopulations/compile/00_master.do

The pipeline runs in numbered stages:

Script	What it does
`01_prepare_UKHLS_pooled_data.do`	Pools and standardises UKHLS waves
`02_create_UKHLS_variables.do`	Constructs all required variables (demographics, labour, health, income, wealth flags) and applies simulation-consistency rules (retirement as absorbing state, education age bounds, work/hours consistency)
`02_01_checks.do`	Data quality checks
`03_social_care_received.do`	Social care receipt variables
`04_social_care_provided.do`	Informal care provision variables
`05_create_benefit_units.do`	Groups individuals into benefit units (tax units) following UK tax-benefit rules
`06_reweight_and_slice.do`	Reweighting and year-specific slicing
`07_was_wealth_data.do`	Prepares Wealth and Assets Survey data
`08_wealth_to_ukhls.do`	Merges WAS wealth into UKHLS records
`09_finalise_input_data.do`	Final cleaning and formatting
`10_check_yearly_data.do`	Per-year consistency checks
`99_training_data.do`	Produces the de-identified training population committed to `input/InitialPopulations/training/`

Employment history sub-pipeline (`compile/do_emphist/`)¶

Reconstructs each respondent's monthly employment history from January 2007 onwards by combining UKHLS and BHPS interview records. The output variable liwwh (months employed since Jan 2007) feeds into the labour supply models.

Script	Purpose
`00_Master_emphist.do`	Master; sets parameters and calls sub-scripts
`01_Intdate.do` – `07_Empcal1a.do`	Sequential stages: interview dating, BHPS linkage, employment spell reconstruction, new-entrant identification

Part 2 — Regression coefficients (`input/InitialPopulations/compile/RegressionEstimates/`)¶

What it produces: The reg_*.xlsx coefficient tables read by Parameters.java at simulation startup.

Master script: input/InitialPopulations/compile/RegressionEstimates/master.do

Note: Income and union-formation regressions depend on predicted wages, so reg_wages.do must complete before reg_income.do and reg_partnership.do. All other scripts can run in any order.

Required Stata packages: fre, tsspell, carryforward, outreg2, oparallel, gologit2, winsor, reghdfe, ftools, require

Script	Module	Method
`reg_wages.do`	Hourly wages	Heckman selection model (males and females separately)
`reg_income.do`	Non-labour income	Hurdle model (selection + amount); requires predicted wages
`reg_partnership.do`	Partnership formation/dissolution	Probit; requires predicted wages
`reg_education.do`	Education transitions	Generalised ordered logit
`reg_fertility.do`	Fertility	Probit
`reg_health.do`	Physical health (SF-12 PCS)	Linear regression
`reg_health_mental.do`	Mental health (GHQ-12, SF-12 MCS)	Linear regression
`reg_health_wellbeing.do`	Life satisfaction	Linear regression
`reg_home_ownership.do`	Homeownership transitions	Probit
`reg_retirement.do`	Retirement	Probit
`reg_leave_parental_home.do`	Leaving parental home	Probit
`reg_socialcare.do`	Social care receipt and provision	Probit / ordered logit
`reg_unemployment.do`	Unemployment transitions	Probit
`reg_financial_distress.do`	Financial distress	Probit
`programs.do`	Shared utility programs called by the estimation scripts	—
`variable_update.do`	Prepares and recodes variables before estimation	—

After running, output Excel files are placed in input/ (overwriting the existing reg_*.xlsx files).
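When running scripts individually rather than through the master script, the ordering constraint on reg_wages.do can be checked mechanically. The sketch below is a hypothetical helper, not part of the repository. It assumes that a prerequisite missing from the list has already been run in an earlier session.

```java
import java.util.List;
import java.util.Map;

/** Illustrative check that a proposed script order respects the wage-prediction dependency. */
final class RegressionRunOrder {

    /** Scripts that need predicted wages, mapped to the script that produces them. */
    static final Map<String, List<String>> PREREQUISITES = Map.of(
        "reg_income.do", List.of("reg_wages.do"),
        "reg_partnership.do", List.of("reg_wages.do"));

    /**
     * Returns false if any script appears before one of its prerequisites.
     * A prerequisite absent from the list is assumed to have completed earlier.
     */
    static boolean isValidOrder(List<String> order) {
        for (int i = 0; i < order.size(); i++) {
            for (String prerequisite : PREREQUISITES.getOrDefault(order.get(i), List.of())) {
                if (order.indexOf(prerequisite) > i) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidOrder(List.of("reg_wages.do", "reg_health.do", "reg_income.do")));
        System.out.println(isValidOrder(List.of("reg_income.do", "reg_wages.do")));
    }
}
```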

Part 3 — Alignment targets (`input/DoFilesTarget/`)¶

What it produces: The align_*.xlsx and *_targets.xlsx files that the alignment modules use to rescale simulated rates.

Script	Output file
`01_employment_shares_initpopdata.do`	`input/employment_targets.xlsx` — employment shares by benefit-unit subgroup and year
`01_inSchool_targets_initpopdata.do`	`input/inSchool_targets.xlsx` — school participation rates by year
`03_calculate_partneredShare_initialPop_BUlogic.do`	`input/partnered_share_targets.xlsx` — partnership shares by year
`03_calculate_partnership_target.do`	Supplementary partnership targets
`02_person_risk_employment_stats.do`	`employment_risk_emp_stats.csv` — person-level at-risk diagnostics used for employment alignment group construction

Population projection targets (align_popProjections.xlsx) and fertility/mortality projections (projections_*.xlsx) come from ONS published projections and are not generated by these scripts.

When to re-run each part¶

Situation	What to re-run
Adding a new data year to the simulation	Part 1 (re-slice the population for the new year) + Part 3 (update alignment targets)
Re-estimating a behavioural module	Part 2 (the affected `reg_*.do` script only) + Stage 1 validation (`validation/01_estimate_validation/`)
Updating employment alignment targets	Part 3 (`01_employment_shares_initpopdata.do`)

After re-running any part, re-run setup (singlerun -Setup or multirun -DBSetup) to rebuild input/input.mv.db before running the simulation.

Setup-generated artifacts¶

Running setup (multirun -DBSetup) creates or refreshes three files in input/:

input.mv.db — H2 database of EUROMOD donor tax-benefit outcomes
EUROMODpolicySchedule.xlsx — maps simulation years to EUROMOD policy systems
DatabaseCountryYear.xlsx — year-specific macro parameters

These must exist before any simulation run. If they are missing, re-run setup.
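A pre-flight check for the three artifacts listed above might look like the following. This is a hypothetical helper, not something SimPaths ships. Only the three file names come from this guide.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Illustrative pre-run check for the setup-generated files in input/. */
final class SetupArtifactCheck {

    static final List<String> REQUIRED = List.of(
        "input.mv.db",
        "EUROMODpolicySchedule.xlsx",
        "DatabaseCountryYear.xlsx");

    /** Returns the names of required files that are not present as regular files in inputDir. */
    static List<String> missingArtifacts(Path inputDir) {
        List<String> missing = new ArrayList<>();
        for (String name : REQUIRED) {
            if (!Files.isRegularFile(inputDir.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> missing = missingArtifacts(Path.of("input"));
        System.out.println(missing.isEmpty()
            ? "Setup artifacts present."
            : "Missing, re-run setup: " + missing);
    }
}
```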

Training mode¶

The repository includes de-identified training data under input/InitialPopulations/training/ and input/EUROMODoutput/training/. If no initial-population CSV files are found in the main input location, SimPaths automatically switches to training mode. Training mode supports development and CI but is not intended for research interpretation.
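The fallback rule can be pictured as a simple directory check. The sketch below is an assumption about the shape of that logic, not the actual SimPaths code: the class and method names are hypothetical, and the glob is derived from the population_initial_UK_<year>.csv naming described in Part 1.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustrative check: fall back to training data when no initial-population CSV is found. */
final class TrainingModeCheck {

    /** True if the directory is missing or holds no file matching population_initial_*.csv. */
    static boolean shouldUseTrainingData(Path initialPopulationsDir) {
        if (!Files.isDirectory(initialPopulationsDir)) {
            return true;
        }
        try (DirectoryStream<Path> csvFiles =
                 Files.newDirectoryStream(initialPopulationsDir, "population_initial_*.csv")) {
            return !csvFiles.iterator().hasNext();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = Path.of("input", "InitialPopulations");
        System.out.println(shouldUseTrainingData(dir) ? "training mode" : "full input data");
    }
}
```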

Logging¶

With -f on multirun.jar, logs are written to output/logs/run_<seed>.txt (stdout) and output/logs/run_<seed>.log (log4j).

Development Workflow¶

1. Understanding the Code¶

Start here:

1. SimPathsStart.java: understand initialization
2. SimPathsModel.java: understand the simulation loop (buildSchedule())
3. Person.java, BenefitUnit.java, Household.java: understand agents
4. Module-specific methods in Person.java (e.g., health(), education(), fertility())

2. Key Design Patterns¶

JAS-mine Event Scheduling:

// In SimPathsModel.buildSchedule()
getEngine().getEventQueue().scheduleRepeat(
    new SingleTargetEvent(this, Processes.UpdateYear),
    0.0,   // Start time
    1.0    // Repeat interval
);

Regression-based processes:

double score = Parameters.getRegression(RegressionName.HealthMentalHMLevel)
    .getScore(regressors, Person.class.getDeclaredField("les_c4_lag1"));

Alignment:

ResamplingAlignment.align(
    population,              // Collection to align
    filter,                  // Subgroup filter
    closure,                 // Alignment closure
    targetValue              // Target to match
);
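The general idea behind alignment, adjusting simulated outcomes so that an aggregate share matches a target, can be shown with a simplified stand-in. The sketch below is not the JAS-mine `ResamplingAlignment` algorithm and does not use its API: it flips randomly chosen outcomes until the count matches, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Toy alignment: flip randomly chosen outcomes until the event share matches a target. */
final class ToyAlignment {

    /** Returns a copy of outcomes with exactly round(targetShare * n) true values. */
    static boolean[] alignToShare(boolean[] outcomes, double targetShare, Random rng) {
        if (targetShare < 0.0 || targetShare > 1.0) {
            throw new IllegalArgumentException("targetShare must lie in [0, 1]");
        }
        boolean[] aligned = outcomes.clone();
        int targetCount = (int) Math.round(targetShare * aligned.length);

        List<Integer> withEvent = new ArrayList<>();
        List<Integer> withoutEvent = new ArrayList<>();
        for (int i = 0; i < aligned.length; i++) {
            if (aligned[i]) {
                withEvent.add(i);
            } else {
                withoutEvent.add(i);
            }
        }
        Collections.shuffle(withEvent, rng);
        Collections.shuffle(withoutEvent, rng);

        int excess = withEvent.size() - targetCount;
        for (int k = 0; k < excess; k++) {
            aligned[withEvent.get(k)] = false;    // too many events: switch some off
        }
        for (int k = 0; k < -excess; k++) {
            aligned[withoutEvent.get(k)] = true;  // too few events: switch some on
        }
        return aligned;
    }

    static int countTrue(boolean[] values) {
        int count = 0;
        for (boolean v : values) {
            if (v) {
                count++;
            }
        }
        return count;
    }
}
```

Only the minimum number of outcomes is changed, so individuals whose simulated outcome already points in the required direction keep it.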

3. Adding New Features¶

Example: Add a new person attribute

Add field to Person.java:
```
private Integer newAttribute;
```

Add getter/setter:

public Integer getNewAttribute() { return newAttribute; }
public void setNewAttribute(Integer value) { this.newAttribute = value; }

Initialize in constructor or relevant process method
Update database schema if persisting (in PERSON_VARIABLES_INITIAL)
Add to outputs in SimPathsCollector.java if needed

See: documentation/wiki/developer-guide/how-to/new-variable.md

4. Modifying Parameters¶

Regression coefficients: Edit Excel files in input/reg_*.xlsx

Policy parameters: Edit input/policy parameters.xlsx

Alignment targets: Edit input/align_*.xlsx

Simulation options: Edit config/default.yml or use GUI

5. Adding GUI Parameters¶

Example:

@GUIparameter(description = "Enable new feature")
private Boolean enableNewFeature = true;

This automatically adds the parameter to the GUI interface.

See: documentation/wiki/developer-guide/how-to/add-gui-parameters.md

6. Testing¶

Run tests via:

mvn test

Or via IDE test runner.

7. Version Control¶

Branch naming conventions:

- feature/your-feature-name: new features
- bugfix/issue-number-description: bug fixes
- docs/documentation-topic: documentation updates
- experimental/your-description: experimental work

Main branches:

- main: stable release
- develop: development integration

Code Navigation Tips¶

Find where a process runs:

1. Search for the process name in SimPathsModel.buildSchedule()
2. Follow the method call to the implementation

Find regression parameters:

1. Search for Parameters.getRegression(RegressionName.XXX)
2. The corresponding Excel file is in input/reg_XXX.xlsx

Find alignment logic:

1. Search for classes ending in Alignment (e.g., FertilityAlignment.java)
2. Check buildSchedule() for when alignment occurs

Understand data flow:

1. Input: Excel files → Parameters.java → Coefficient maps
2. Process: Regression score → Probability → Random draw → State change
3. Output: SimPathsCollector.java → CSV/Database
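The Process step of the data flow (score, probability, random draw, state change) can be made concrete with a small sketch. It is illustrative only: a logistic link is used as a stand-in because Java has no built-in normal CDF, whereas many SimPaths processes are estimated as probits, and the names below are hypothetical.

```java
import java.util.Random;

/** Illustrative score -> probability -> random draw -> state change pipeline. */
final class TransitionSketch {

    /** Linear score: the dot product of coefficients and regressors. */
    static double score(double[] coefficients, double[] regressors) {
        double sum = 0.0;
        for (int i = 0; i < coefficients.length; i++) {
            sum += coefficients[i] * regressors[i];
        }
        return sum;
    }

    /** Logistic link, used here in place of the probit link. */
    static double logistic(double score) {
        return 1.0 / (1.0 + Math.exp(-score));
    }

    /** The event occurs when a uniform draw on [0, 1) falls below the probability. */
    static boolean eventOccurs(double probability, double uniformDraw) {
        return uniformDraw < probability;
    }

    public static void main(String[] args) {
        double[] coefficients = {-1.0, 0.05};  // made-up intercept and age coefficient
        double[] regressors = {1.0, 30.0};     // constant term and age
        double probability = logistic(score(coefficients, regressors));
        boolean newState = eventOccurs(probability, new Random(42).nextDouble());
        System.out.println("p=" + probability + " event=" + newState);
    }
}
```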

Additional Resources¶

Full Documentation: See documentation/wiki/ for comprehensive guides
Issues: GitHub Issues
