Trade Variable Treatment in the ML Model

Exports, Imports & Balance — raw vs normalised · NZIPL PC Model

Published

April 9, 2026

How Exports, Imports, and Net Balance Are Treated in the ML Model

Context: NZIPL Predicted Competitiveness (PC) model — Random Forest pipeline (Ishana Ratan, NZIPL).

In the ML Model (Random Forest)

The features fed into the RF include both raw absolute export volumes and RCA values:

“Independent variable spec: (1) raw exports of components similar to / solar; (2) RCA in exports of components similar to / solar” — Model Inputs 2.21.docx, Methodology Notes

GDP and population are also included as separate covariates (“World Development variables”). So the model sees absolute trade alongside country size, letting the RF learn to disentangle “this country exports a lot because it’s huge” from “this country exports a lot in this specific product relative to its size.”

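The target variable is binary: whether RCA > 1 in the final product. RCA is already trade-normalised, though not GDP-normalised. Assuming the standard Balassa definition, it is the product's share of a country's total exports divided by the same product's share of total world exports.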
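To make the normalisation concrete, here is a minimal sketch of the RCA calculation and the binary target. It assumes the standard Balassa index; the source does not state the exact formula used in the pipeline. The function name `balassa_rca`, the country and product labels, and all numbers are hypothetical.

```python
from collections import defaultdict

def balassa_rca(exports):
    """Compute Balassa RCA from a dict of (country, product) -> export value (USD).

    RCA = (product's share of the country's exports)
          / (product's share of world exports).
    """
    country_total = defaultdict(float)
    product_total = defaultdict(float)
    world_total = 0.0
    for (country, product), value in exports.items():
        country_total[country] += value
        product_total[product] += value
        world_total += value

    rca = {}
    for (country, product), value in exports.items():
        country_share = value / country_total[country]
        world_share = product_total[product] / world_total
        rca[(country, product)] = country_share / world_share
    return rca

# Made-up illustrative numbers: country A is large, country B is small.
toy_exports = {
    ("A", "solar"): 50.0, ("A", "other"): 950.0,
    ("B", "solar"): 30.0, ("B", "other"): 70.0,
}

rca = balassa_rca(toy_exports)

# Binary target in the spirit of the model: 1 if RCA > 1, else 0.
target = {key: int(value > 1.0) for key, value in rca.items()}
```

In this toy data, country A exports more solar in absolute terms (50 vs 30) yet has RCA below 1, while country B has RCA above 1. That is the gap between raw volume and normalised competitiveness that the features and target sit on either side of.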

What This Means for SHAP Values

The `shap_mean_z` scores in `pc_features.csv` are the mean absolute SHAP contribution of each HS code across all country-year observations in the training data. Because raw export volumes were among the features, a large-country effect could be embedded: China's absolute exports of a product can be enormous even if its RCA is modest. However, because the scores are standardised (z-scored, |z|) and averaged across all countries, this scale bias is partially absorbed, though not removed.
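One plausible reading of how a score like `shap_mean_z` could be built is sketched below: take the mean absolute SHAP value per feature across observations, then z-score those means across features. This is an assumption for illustration only; the source does not specify the exact order of the averaging and standardising steps. The function name, feature labels, and SHAP values are all hypothetical.

```python
from statistics import mean, pstdev

def mean_abs_shap_z(shap_rows, feature_names):
    """Mean |SHAP| per feature, z-scored across features.

    shap_rows: one list of per-feature SHAP values per country-year observation.
    """
    n_features = len(feature_names)
    mean_abs = [mean(abs(row[j]) for row in shap_rows) for j in range(n_features)]

    mu = mean(mean_abs)
    sigma = pstdev(mean_abs)
    if sigma == 0:
        # All features equally important: z-scores are undefined, return zeros.
        return {name: 0.0 for name in feature_names}
    return {name: (m - mu) / sigma for name, m in zip(feature_names, mean_abs)}

# Hypothetical SHAP values for two observations and three HS-code features.
toy_shap = [
    [0.2, -0.1, 0.0],
    [-0.4, 0.1, 0.0],
]
scores = mean_abs_shap_z(toy_shap, ["hs_A", "hs_B", "hs_C"])
```

Under this reading, the score ranks HS codes by average importance relative to one another, so a feature's scale matters only through how strongly it moves the model's predictions.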

Why `log_gdp` in Our Regression Makes Sense

In the scatterplot regression (`exports % GDP ~ SHAP + log_gdp + type + stage`), the Y variable is already GDP-normalised. Adding `log_gdp` controls for a residual size effect that survives normalisation: larger economies tend to have deeper supply chains, with more products traded at positive (even if small) intensity, which can inflate the trade share showing up for any given product. The `log_gdp` coefficient in the HLM therefore absorbs this systematic country-size signal, leaving SHAP to explain the product-level variation.
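Written out, the fixed-effects part of that specification might look as follows. The indices (c for country, p for product) and the treatment of type and stage as categorical effects are assumptions for illustration. The grouping or random-effects structure of the HLM is not stated in the source and is omitted here.

```latex
\text{exports\_pct\_gdp}_{cp}
  = \beta_0
  + \beta_1\,\text{SHAP}_{p}
  + \beta_2\,\log(\text{GDP}_{c})
  + \gamma_{\text{type}(p)}
  + \delta_{\text{stage}(p)}
  + \varepsilon_{cp}
```

Here \beta_2 carries the country-size signal described above, and \beta_1 is the coefficient of interest on product-level SHAP importance.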

Summary Table

Layer	Variable	Treatment
ML model features	Exports per HS code	Raw (absolute USD), alongside RCA values
ML model target	RCA in final product	Trade-normalised, binary (RCA > 1 = competitive)
Regression Y	Exports per product	% of GDP (normalised)
Regression control	`log_gdp`	Absorbs residual scale effect

Conceptual Note

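The slight mismatch (SHAP derived from absolute features, Y measured as GDP share) is real but benign. Within a single country-year, dividing by GDP rescales every product by the same constant, so products with higher absolute exports also have higher GDP-normalised exports, and the correlation between SHAP importance and trade intensity should be robust to the choice of normalisation. The mismatch matters only across countries, and the `log_gdp` control is there to absorb that remaining gap.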
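The within-country versus cross-country distinction can be checked with a short sketch. All product names, country labels, and figures below are made up for illustration.

```python
def ranking(values_by_key):
    """Return keys sorted from largest to smallest value."""
    return sorted(values_by_key, key=values_by_key.get, reverse=True)

# Hypothetical single country-year: three products, one GDP figure.
exports_usd = {"prod_1": 5.0e9, "prod_2": 2.0e8, "prod_3": 7.5e8}
gdp_usd = 2.5e11
exports_pct_gdp = {p: 100.0 * v / gdp_usd for p, v in exports_usd.items()}

# Dividing by one constant cannot change the product ordering.
same_order_within_country = ranking(exports_usd) == ranking(exports_pct_gdp)

# Hypothetical single product across two countries of very different size.
product_exports_usd = {"large_economy": 5.0e9, "small_economy": 1.0e9}
gdp_by_country = {"large_economy": 2.0e13, "small_economy": 1.0e11}
product_pct_gdp = {
    c: 100.0 * product_exports_usd[c] / gdp_by_country[c]
    for c in product_exports_usd
}

# Across countries the divisor differs, so the ordering can flip.
order_flips_across_countries = (
    ranking(product_exports_usd) != ranking(product_pct_gdp)
)
```

The first comparison holds by construction; the second shows the kind of cross-country reordering that a country-size control is meant to account for.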

A cleaner future specification would re-derive SHAP from a model that uses GDP-normalised trade as features throughout. This would make the SHAP scores directly comparable to the normalised Y variable in the regression without needing the `log_gdp` correction.