NZIPL Predicted Competitiveness — ML Pipeline Review

Random Forest · SHAP Feature Importance · Replication Roadmap

Author

NZIPL Clean Value Chain Explorer

Published

April 8, 2026

1 Executive Summary

This document reviews the Predicted Competitiveness (PC) machine learning pipeline developed by NZIPL for the Atlas of the Global Clean Industrial Base. The pipeline:

  1. Uses BACI bilateral trade data to compute Revealed Comparative Advantage (RCA) per country × HS6 product × year
  2. Trains a Random Forest classifier for each of the 10 clean technologies to predict whether a country will achieve RCA > 1 (i.e., become competitively specialised)
  3. Applies SHAP (SHapley Additive exPlanations) to extract which product-level RCA variables most drive each prediction
  4. Exports predicted_comp scores (0–1 probability) and per-feature shap_mean_z importance values — the two datasets now embedded in CVCE

On the SHAP ↔︎ Trade correlation

The scatter plots show a positive correlation between shap_mean_z (SHAP feature importance) and trade volumes. This is expected and is not a flaw. SHAP values measure how much a product’s export RCA drives the prediction of technology competitiveness. Because both the model inputs and the target are derived from trade data, products with larger trade volumes tend to receive higher SHAP scores: the model is learning that existing export strength in related sectors predicts future competitiveness. This is the product-space logic of economic complexity.


2 Notebook Inventory

The Python notebooks are located at: OneDrive/Bentley Allan's files - NZIPL/Projects/001 Industrial Base/Modeling -- ML

Code
notebooks <- tribble(
  ~Group,              ~Notebook,                        ~Purpose,
  "Core ML",    "Analysis/RCA Batteries Analysis.ipynb", "RF model + SHAP + rankings (Batteries)",
  "Core ML",    "Analysis/RCA Solar Analysis.ipynb",     "RF model + SHAP + rankings (Solar)",
  "Core ML",    "Analysis/RCA Wind Analysis.ipynb",      "RF model + SHAP + rankings (Wind)",
  "Core ML",    "Analysis/RCA {Tech} Analysis.ipynb",    "Same pattern for remaining 7 techs",
  "Features",   "RCA Construction/Battery/Batteries_Vars.ipynb", "Feature matrix: lagged RCA for 69 battery HS codes + 20 proximate + macro",
  "Features",   "RCA Construction/{Tech}/{Tech}_Vars.ipynb",     "Same pattern for each of 10 techs",
  "Data Prep",  "RCA/Code + Cleaned Files/RCA.ipynb",    "Base RCA calculation from BACI",
  "Data Prep",  "ML Code/Product Proximity Update.ipynb","Product proximity (phi) matrix calculation",
  "Data Prep",  "ML Data/Bentley_Backup/Imputation script.ipynb", "KNN imputation (k=3) for missing trade values",
  "Data Prep",  "ML Data/Trade Data/GTA code.ipynb",     "Global Trade Alert policy intervention processing",
  "Viz",        "ML Code/Nature Paper Figures.ipynb",    "Publication bubble charts (country × tech × score)",
  "Viz",        "ML Code/Old/Country Deep Dives.ipynb",  "RCA time series per country",
  "Utility",    "FDI Methodology/Firm Names/Fuzzymatch.ipynb", "Fuzzy name matching for FDI firm data"
)

notebooks |>
  kable(col.names = c("Group", "Notebook", "Purpose"),
        caption = "NZIPL ML Pipeline — Notebook Inventory") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE, font_size = 13) |>
  column_spec(1, bold = TRUE, color = nzipl_green) |>
  row_spec(0, bold = TRUE, background = nzipl_green, color = "white")
NZIPL ML Pipeline — Notebook Inventory
Group      Notebook                                          Purpose
Core ML    Analysis/RCA Batteries Analysis.ipynb             RF model + SHAP + rankings (Batteries)
Core ML    Analysis/RCA Solar Analysis.ipynb                 RF model + SHAP + rankings (Solar)
Core ML    Analysis/RCA Wind Analysis.ipynb                  RF model + SHAP + rankings (Wind)
Core ML    Analysis/RCA {Tech} Analysis.ipynb                Same pattern for remaining 7 techs
Features   RCA Construction/Battery/Batteries_Vars.ipynb     Feature matrix: lagged RCA for 69 battery HS codes + 20 proximate + macro
Features   RCA Construction/{Tech}/{Tech}_Vars.ipynb         Same pattern for each of 10 techs
Data Prep  RCA/Code + Cleaned Files/RCA.ipynb                Base RCA calculation from BACI
Data Prep  ML Code/Product Proximity Update.ipynb            Product proximity (phi) matrix calculation
Data Prep  ML Data/Bentley_Backup/Imputation script.ipynb    KNN imputation (k=3) for missing trade values
Data Prep  ML Data/Trade Data/GTA code.ipynb                 Global Trade Alert policy intervention processing
Viz        ML Code/Nature Paper Figures.ipynb                Publication bubble charts (country × tech × score)
Viz        ML Code/Old/Country Deep Dives.ipynb              RCA time series per country
Utility    FDI Methodology/Firm Names/Fuzzymatch.ipynb       Fuzzy name matching for FDI firm data
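
The inventory lists KNN imputation with k=3 for missing trade values. As a rough illustration of what that step does, the sketch below fills each gap with the mean of the three most similar rows. It is a simplified pure-Python stand-in, not the notebook's code: the function name knn_impute, the toy matrix, and the distance rescaling (modelled on the nan-Euclidean idea behind scikit-learn's KNNImputer) are assumptions.

```python
import math


def knn_impute(rows, k=3):
    """Fill None entries with the mean of the k nearest rows that observe that column.

    Distance is Euclidean over the columns both rows observe, rescaled by the
    share of columns used. This is an illustrative simplification.
    """
    n_cols = len(rows[0])

    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        squared = sum((x - y) ** 2 for x, y in shared)
        return math.sqrt(squared * n_cols / len(shared))

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are other rows where column j is observed.
            donors = [
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            donors = [d for d in donors if d[0] != math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:
                filled[i][j] = sum(v for _, v in nearest) / len(nearest)
    return filled


# Toy export matrix (made-up values): rows are country-years, columns are products.
exports = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, None],
    [0.9, 1.9, 2.9],
    [1.0, 2.2, 3.1],
    [9.0, 9.0, 9.0],
]
imputed = knn_impute(exports, k=3)
print(imputed[1])
```

The missing cell is filled from the three rows closest to it on the observed columns; the outlying last row is never chosen as a donor.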

3 Pipeline Architecture

3.1 End-to-end flow

┌──────────────────────────────────────────────────────────┐
│  1. RAW DATA                                             │
│     BACI bilateral trade (HS6, 2002–2023)               │
│     FDI % GDP · Tariffs · Manufacturing % GDP           │
└──────────────────────┬───────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────┐
│  2. RCA CALCULATION   (RCA.ipynb)                       │
│                                                          │
│     RCA(c,p,y) = [X_cpy / X_c·y] / [X_·py / X_··y]    │
│                                                          │
│     → rca_cyh.csv  (all country-product-year RCAs)      │
│     → phi_equal_year.csv  (product proximity matrix)    │
└──────────────────────┬───────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────┐
│  3. FEATURE ENGINEERING  ({Tech}_Vars.ipynb × 10)       │
│                                                          │
│     b_{HS6}  (×60-90)  : lagged RCA for tech HS codes  │
│     nb_{HS6} (×15-25)  : lagged RCA for proximate prods│
│     FDI, tariffs, manufacturing % GDP, trade openness  │
│                                                          │
│     → battery_trade.csv, solar_trade.csv, …  (~90 cols) │
└──────────────────────┬───────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────┐
│  4. KNN IMPUTATION  (Imputation script.ipynb)           │
│     k = 3 · fills missing export values                 │
│     → final_trade_df_imputed.csv                        │
└──────────────────────┬───────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────┐
│  5. RANDOM FOREST CLASSIFICATION  (Analysis notebooks)  │
│                                                          │
│     Target:   y = int(RCA_tech > 1)                     │
│     Model:    RFC(n_estimators=100, max_depth=10,        │
│                   random_state=28)                       │
│     Split:    train_test_split(test_size=0.25)           │
│                                                          │
│     → predicted_comp = predict_proba(X)[:, 1]           │
│       (probability of competitive specialisation)       │
└──────────────────────┬───────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────┐
│  6. SHAP FEATURE IMPORTANCE  (same Analysis notebooks)  │
│                                                          │
│     explainer = shap.TreeExplainer(rf_model)            │
│     shap_vals = explainer.shap_values(X)[1]  ← class 1 │
│                                                          │
│     z-normalise by year:  z = (x − μ_year) / σ_year    │
│     shap_mean_z = mean(|z|)  per feature                │
│                                                          │
│     → per-tech SHAP importance CSVs                     │
│       (now consolidated in data/pc/pc_features.csv)     │
└──────────────────────────────────────────────────────────┘
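
Step 2 of the flow can be made concrete with a small worked example. The sketch below computes the Balassa RCA index per country, product and year from a list of export flows. The function name balassa_rca, the placeholder country codes AAA and BBB, and the export values are illustrative assumptions; RCA.ipynb itself works on BACI data with pandas.

```python
from collections import defaultdict


def balassa_rca(flows):
    """flows: iterable of (country, product, year, export_value).

    Returns {(country, product, year): RCA}, where RCA is the product's share
    in the country's exports divided by its share in world exports that year.
    """
    x_cpy = defaultdict(float)  # exports of product p by country c in year y
    x_cy = defaultdict(float)   # total exports of country c in year y
    x_py = defaultdict(float)   # world exports of product p in year y
    x_y = defaultdict(float)    # world exports in year y
    for c, p, y, v in flows:
        x_cpy[(c, p, y)] += v
        x_cy[(c, y)] += v
        x_py[(p, y)] += v
        x_y[y] += v

    rca = {}
    for (c, p, y), v in x_cpy.items():
        if x_cy[(c, y)] == 0 or x_py[(p, y)] == 0:
            continue  # skip undefined ratios
        country_share = v / x_cy[(c, y)]
        world_share = x_py[(p, y)] / x_y[y]
        rca[(c, p, y)] = country_share / world_share
    return rca


# Toy flows (made-up values) for two placeholder countries and two HS6 codes.
flows = [
    ("AAA", "850760", 2020, 80.0),
    ("AAA", "270900", 2020, 20.0),
    ("BBB", "850760", 2020, 20.0),
    ("BBB", "270900", 2020, 80.0),
]
rca = balassa_rca(flows)
print(rca[("AAA", "850760", 2020)] > 1)  # AAA is specialised in 850760
```

In this toy world AAA devotes 80% of its exports to a product that makes up 50% of world trade, so its RCA exceeds 1 and the binary target used in step 5 would be 1.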

3.2 Model specification

Code
specs <- tribble(
  ~Parameter,         ~Value,         ~Notes,
  "Algorithm",        "RandomForestClassifier", "scikit-learn; tree-based, scale-invariant",
  "n_estimators",     "100",          "Number of decision trees",
  "max_depth",        "10",           "Max tree depth; limits overfitting",
  "random_state",     "28",           "Fixed for reproducibility across all techs",
  "Train/test split", "75% / 25%",    "Stratified random split",
  "Cross-validation", "5-fold GridSearchCV", "Only for Solar & Wind; others use fixed params",
  "Target variable",  "int(RCA > 1)", "Binary: 1 = competitive, 0 = not",
  "Output",           "predict_proba[:, 1]", "Probability of competitiveness; range 0–1",
  "Lag structure",    "1-year lag on all IV features", "Avoids look-ahead bias",
  "Missing data",     "KNN (k=3)",    "Pre-imputed before model training"
)

specs |>
  kable(col.names = c("Parameter", "Value", "Notes"),
        caption = "Random Forest Model Specification") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = TRUE, font_size = 13) |>
  column_spec(1, bold = TRUE) |>
  row_spec(0, bold = TRUE, background = nzipl_green, color = "white")
Random Forest Model Specification
Parameter         Value                          Notes
Algorithm         RandomForestClassifier         scikit-learn; tree-based, scale-invariant
n_estimators      100                            Number of decision trees
max_depth         10                             Max tree depth; limits overfitting
random_state      28                             Fixed for reproducibility across all techs
Train/test split  75% / 25%                      Stratified random split
Cross-validation  5-fold GridSearchCV            Only for Solar & Wind; others use fixed params
Target variable   int(RCA > 1)                   Binary: 1 = competitive, 0 = not
Output            predict_proba[:, 1]            Probability of competitiveness; range 0–1
Lag structure     1-year lag on all IV features  Avoids look-ahead bias
Missing data      KNN (k=3)                      Pre-imputed before model training
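
A minimal sketch of this specification with scikit-learn, run on synthetic data. The feature matrix, the coefficients that generate the target, and the reuse of random_state=28 for the split are assumptions for illustration; the table fixes only the classifier hyperparameters, the stratified 75/25 split, and the int(RCA > 1) target.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for a feature matrix: 400 country-years x 6 lagged RCA features.
X = rng.gamma(shape=1.0, scale=1.0, size=(400, 6))

# Synthetic technology RCA driven by the first two features plus noise (assumption).
rca_tech = 0.6 * X[:, 0] + 0.4 * X[:, 1] + rng.normal(0.0, 0.2, size=400)
y = (rca_tech > 1).astype(int)  # binary target: 1 = competitive

# Stratified 75/25 split; the seed for the split is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=28
)

rf_model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=28)
rf_model.fit(X_train, y_train)

# Probability of class 1 for every row, as in the pipeline's predicted_comp output.
predicted_comp = rf_model.predict_proba(X)[:, 1]
test_accuracy = rf_model.score(X_test, y_test)
print(round(test_accuracy, 3))
```

Scoring all rows, as the flow diagram does, mixes in-sample and held-out probabilities, so held-out accuracy is the more conservative check of model quality.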

4 SHAP Feature Importance — What We Have

The file data/pc/pc_features.csv contains the mean absolute SHAP z-score (shap_mean_z) for each HS6 code × technology combination that passed the importance threshold.
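
To make the shap_mean_z statistic concrete, here is a hedged sketch of the z-normalise-by-year and mean |z| steps. It assumes that the yearly mean and standard deviation are pooled over all features' SHAP values in that year; the source does not state the pooling, so the Analysis notebooks may differ. The function name, feature names, and SHAP values are invented for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def shap_mean_abs_z(records):
    """records: list of (year, feature, shap_value).

    Returns {feature: mean |z|}, with z computed against the mean and standard
    deviation of all SHAP values in the same year (pooling is an assumption).
    """
    by_year = defaultdict(list)
    for year, _, value in records:
        by_year[year].append(value)
    stats = {yr: (mean(vals), pstdev(vals)) for yr, vals in by_year.items()}

    abs_z = defaultdict(list)
    for year, feature, value in records:
        mu, sigma = stats[year]
        if sigma == 0:
            continue  # z is undefined when a year has no spread
        abs_z[feature].append(abs((value - mu) / sigma))
    return {feature: mean(zs) for feature, zs in abs_z.items()}


# Made-up SHAP values for three features over two years.
records = [
    (2020, "b_850760", 0.30), (2020, "b_282540", 0.10), (2020, "nb_847990", -0.10),
    (2021, "b_850760", 0.40), (2021, "b_282540", 0.00), (2021, "nb_847990", -0.10),
]
scores = shap_mean_abs_z(records)
print(sorted(scores, key=scores.get, reverse=True))
```

Because the absolute value is taken after standardising, a feature scores highly when its SHAP values sit far from the yearly average in either direction.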

4.1 Coverage

Code
feat |>
  filter(!is.na(hs_code)) |>
  group_by(tech) |>
  summarise(
    `HS codes` = n_distinct(hs_code),
    `SHAP range` = paste0(round(min(shap_mean_z), 3), " – ",
                          round(max(shap_mean_z), 3)),
    `Mean SHAP` = round(mean(shap_mean_z), 3),
    `Top category` = {
      # pick(everything()) replaces cur_data(), deprecated since dplyr 1.1.0
      df <- pick(everything()) |> group_by(category) |>
        summarise(s = mean(shap_mean_z)) |> slice_max(s, n = 1)
      df$category[1]
    },
    .groups = "drop"
  ) |>
  rename(Technology = tech) |>
  kable(caption = "SHAP feature coverage per technology") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = TRUE, font_size = 13) |>
  row_spec(0, bold = TRUE, background = nzipl_green, color = "white")
SHAP feature coverage per technology
Technology     HS codes  SHAP range     Mean SHAP  Top category
Batteries      31        0.504 – 0.656  0.565      Machinery
Biofuel        66        0.512 – 0.811  0.645      Machinery
Electrolyzers  42        0.504 – 0.717  0.576      Chemicals
Geothermal     54        0.502 – 0.725  0.601      Electronics
Heat Pumps     26        0.515 – 0.822  0.644      Machinery
Magnets        8         0.529 – 0.667  0.583      Chemicals
Nuclear        22        0.506 – 0.699  0.556      Machinery
Solar          34        0.504 – 0.672  0.554      Machinery
Transmission   24        0.550 – 0.817  0.696      Industrial Materials
Wind           50        0.501 – 0.696  0.558      Chemicals

4.2 SHAP distribution by category

Code
feat |>
  filter(!is.na(hs_code)) |>
  mutate(tech = factor(tech, levels = names(TECH_COLORS))) |>
  ggplot(aes(x = shap_mean_z, y = reorder(category, shap_mean_z),
             color = category)) +
  geom_jitter(height = 0.2, alpha = 0.6, size = 1.5) +
  stat_summary(fun = mean, geom = "point", size = 3.5, shape = 18, color = "black") +
  facet_wrap(~ tech, ncol = 5, scales = "free_x") +
  scale_color_brewer(palette = "Dark2", guide = "none") +
  labs(title = "SHAP Importance Distribution by Category and Technology",
       subtitle = "Each dot = one HS6 code · Diamond = category mean",
       x = "SHAP mean |z-score|", y = NULL) +
  theme_nzipl()

4.3 Top-10 SHAP features per technology

Code
feat |>
  filter(!is.na(hs_code)) |>
  group_by(tech) |>
  slice_max(shap_mean_z, n = 10) |>
  ungroup() |>
  mutate(hs_code = formatC(as.integer(hs_code), width = 6, flag = "0"),
         shap_mean_z = round(shap_mean_z, 4)) |>
  select(Technology = tech, `HS Code` = hs_code,
         Description = description, Category = category,
         `SHAP mean |z|` = shap_mean_z) |>
  kable(caption = "Top-10 SHAP features per technology (by mean |z-score|)") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = TRUE, font_size = 11) |>
  row_spec(0, bold = TRUE, background = nzipl_green, color = "white") |>
  collapse_rows(columns = 1, valign = "middle")
Top-10 SHAP features per technology (by mean |z-score|)
Technology HS Code Description Category SHAP mean |z|
Batteries 391190 Polysulphides, polysulphones and similar products of chemical synthesis n.e.s. in chapter 39; in primary forms Chemicals 0.6565
847990 Machines and mechanical appliances; parts, of those having individual functions Machinery 0.6478
370130 Photographic plates and film; in the flat, sensitised, unexposed, with any side exceeding 225mm, of any materials other than paper, paperboard or textiles Chemicals 0.6424
903190 Instruments, appliances and machines; parts and accessories for those measuring or checking devices of heading no. 9031 Machinery 0.6392
282540 Nickel oxides and hydroxides Chemicals 0.6192
902790 Microtomes and parts and accessories thereof Machinery 0.6099
291030 Epoxides, epoxyalcohols, epoxyphenols and epoxyethers; with a three-membered ring and their halogenated, sulphonated, nitrated or nitrosated derivatives, 1-chloro-2,3-epoxypropane (epichlorohydrin) Chemicals 0.6020
282200 Cobalt oxides and hydroxides; commercial cobalt oxides Chemicals 0.5977
382490 Chemical products, preparations and residual products of the chemical or allied industries, n.e.s. or included in heading no. 3824 Chemicals 0.5843
903180 Instruments, appliances and machines; for measuring or checking n.e.s. in chapter 90 Machinery 0.5817
Biofuel 940390 Furniture; parts Machinery 0.8113
381590 Reaction initiators, reaction accelerators and catalytic preparations, unsupported, n.e.s. or included Chemicals 0.8098
844190 Machinery; parts of machinery for making up paper pulp, paper or paperboard, including cutting machines of all kinds Machinery 0.7910
841939 Dryers; for products n.e.s. in heading no. 8419, not used for domestic purposes Machinery 0.7764
121299 Vegetable products (including unroasted chicory roots, chicorium intybus sativum variety); n.e.s. in chapter 12, fresh, chilled, frozen or dried, ground or unground, primarily for human consumption Other 0.7746
841381 Pumps and liquid elevators; n.e.s. in heading no. 8413 Machinery 0.7726
441520 Wood; pallets, box pallets and other load boards; pallet collars Industrial Materials 0.7718
842121 Machinery; for filtering or purifying water Machinery 0.7502
281122 Silicon dioxide Chemicals 0.7448
841410 Pumps; vacuum Machinery 0.7413
Electrolyzers 845610 Machine-tools; operated by laser or other light or photon beam process Machinery 0.7167
841480 Pumps and compressors; for air, vacuum or gas, n.e.s. in heading no. 8414 Machinery 0.7120
390710 Polyacetals; in primary forms Chemicals 0.6961
381512 Catalysts, supported; reaction initiators, reaction accelerators and catalytic preparations, with precious metal or precious metal compounds as the active substance, n.e.s. or included Chemicals 0.6612
842121 Machinery; for filtering or purifying water Machinery 0.6519
841370 Pumps; centrifugal, n.e.s. in heading no. 8413, for liquids Machinery 0.6458
841490 Pumps and compressors; parts, of air or vacuum pumps, air or other gas compressors and fans, ventilating or recycling hoods incorporating a fan Machinery 0.6453
381519 Catalysts, supported; reaction initiators, reaction accelerators and catalytic preparations, with an active substance other than nickel or precious metals or their compounds, n.e.s. or included Chemicals 0.6421
711021 Metals; palladium, unwrought or in powder form Metals 0.6421
842139 Machinery; for filtering or purifying gases, other than intake air filters for internal combustion engines Machinery 0.6374
Geothermal 722490 Steel, alloy; semi-finished products Metals 0.7247
842199 Machinery; parts for filtering or purifying liquids or gases Machinery 0.7138
845931 Machine-tools; for boring-milling by removing metal, numerically controlled Machinery 0.7132
845891 Lathes; for removing metal, numerically controlled, other than horizontal lathes Machinery 0.6970
851580 Welding machines and apparatus; n.e.s. in heading no. 8515, whether or not capable of cutting Electronics 0.6941
845811 Lathes; for removing metal, horizontal, numerically controlled Machinery 0.6938
720270 Ferro-alloys; ferro-molybdenum Metals 0.6911
842139 Machinery; for filtering or purifying gases, other than intake air filters for internal combustion engines Machinery 0.6876
730721 Steel, stainless; tube or pipe fittings, flanges, of stainless steel Metals 0.6804
401693 Rubber; vulcanised (other than hard rubber), gaskets, washers and other seals, of non-cellular rubber Chemicals 0.6800
Heat Pumps 841410 Pumps; vacuum Machinery 0.8217
841430 Compressors; of a kind used in refrigerating equipment Machinery 0.7803
841590 Air conditioning machines; with motor driven fan and elements for temperature control, parts thereof Machinery 0.7517
846241 Machine-tools; punching or notching machines (including presses), including combined punching and shearing machines, numerically controlled, for working metal Machinery 0.7439
841490 Pumps and compressors; parts, of air or vacuum pumps, air or other gas compressors and fans, ventilating or recycling hoods incorporating a fan Machinery 0.7278
841899 Refrigerating or freezing equipment; parts thereof, other than furniture Machinery 0.7152
853400 Circuits; printed Electronics 0.7095
820720 Tools, interchangeable; (for machine or hand tools, whether or not power-operated), dies for drawing or extruding metal Metals 0.6900
841582 Air conditioning machines; containing a motor driven fan, other than window or wall types, incorporating a refrigerating unit Machinery 0.6842
854121 Electrical apparatus; transistors, (other than photosensitive), with a dissipation rate of less than 1W Electronics 0.6747
Magnets 820720 Tools, interchangeable; (for machine or hand tools, whether or not power-operated), dies for drawing or extruding metal Metals 0.6669
847982 Machines; for mixing, kneading, crushing, grinding, screening, sifting, homogenising, emulsifying or stirring Machinery 0.6036
283324 Sulphates; of nickel Chemicals 0.5930
680421 Millstones, grindstones, grinding wheels and the like; of agglomerated synthetic or natural diamond Industrial Materials 0.5852
847990 Machines and mechanical appliances; parts, of those having individual functions Machinery 0.5751
851440 Heating equipment; for the heat treatment of materials by induction or dielectric loss, industrial or laboratory, other than furnaces and ovens Electronics 0.5583
260300 Copper ores and concentrates Metals 0.5537
260600 Aluminium ores and concentrates Metals 0.5289
Nuclear 841940 Distilling or rectifying plant; not used for domestic purposes Machinery 0.6995
721922 Steel, stainless; flat-rolled, width 600mm or more, hot-rolled, (not in coils), of a thickness of 4.75mm or more but not exceeding 10mm Metals 0.6501
720270 Ferro-alloys; ferro-molybdenum Metals 0.6333
841989 Machinery, plant and laboratory equipment; for treating materials by change of temperature, other than for making hot drinks or cooking or heating food Machinery 0.6170
840420 Boilers; condensers, for steam or other vapour power units Machinery 0.5963
282520 Lithium oxide and hydroxide Chemicals 0.5817
840690 Turbines; parts of steam and other vapour turbines Machinery 0.5785
261390 Molybdenum ores and concentrates; other than roasted Metals 0.5707
722790 Steel, alloy; bars and rods, hot-rolled, in irregularly wound coils, n.e.s. in heading no. 7227 Metals 0.5611
722011 Steel, stainless; flat-rolled, width less than 600mm, hot-rolled, of a thickness of 4.75mm or more Metals 0.5430
Solar 830249 Mountings, fittings and similar articles; suitable for other than buildings or furniture, of base metal Metals 0.6722
321490 Mastics; n.e.s. in heading no. 3214 Chemicals 0.6681
940390 Furniture; parts Machinery 0.6591
392350 Plastics; stoppers, lids, caps and other closures, for the conveyance or packing of goods Chemicals 0.6246
851430 Furnaces and ovens; industrial or laboratory, other than those functioning by induction, dielectric loss or resistance heated Electronics 0.6240
760429 Aluminium; alloys, bars, rods and profiles, other than hollow Metals 0.6019
392010 Plastics; plates, sheets, film, foil and strip, of polymers of ethylene, non-cellular and not reinforced, laminated, supported or similarly combined with other materials Chemicals 0.5885
730890 Iron or steel; structures and parts thereof, n.e.s. in heading no. 7308 Metals 0.5780
320890 Paints and varnishes; based on polymers n.e.s. in heading no. 3208, dispersed or dissolved in a non-aqueous medium Chemicals 0.5717
841989 Machinery, plant and laboratory equipment; for treating materials by change of temperature, other than for making hot drinks or cooking or heating food Machinery 0.5591
Transmission 270900 Oils; petroleum oils and oils obtained from bituminous minerals, crude Industrial Materials 0.8174
847990 Machines and mechanical appliances; parts, of those having individual functions Machinery 0.8147
850490 Electrical transformers, static converters and inductors; parts thereof Electronics 0.8145
847981 Machines and mechanical appliances; for treating metal, including electric wire coil-winders Machinery 0.8052
722611 Steel, alloy; flat-rolled, width less than 600mm, of silicon-electrical steel, grain-oriented Metals 0.7743
847989 Machines and mechanical appliances; n.e.s. in item no. 8479.8, having individual functions Machinery 0.7709
846241 Machine-tools; punching or notching machines (including presses), including combined punching and shearing machines, numerically controlled, for working metal Machinery 0.7643
842890 Lifting machinery; handling, loading or unloading machinery n.e.s. in heading no. 8428 Machinery 0.7355
850300 Electric motors and generators; parts suitable for use solely or principally with the machines of heading no. 8501 or 8502 Electronics 0.7307
845811 Lathes; for removing metal, horizontal, numerically controlled Machinery 0.7247
Wind 730799 Iron or steel; tube or pipe fittings, n.e.s. in item no. 7307.9, other than stainless steel Metals 0.6965
903180 Instruments, appliances and machines; for measuring or checking n.e.s. in chapter 90 Machinery 0.6823
732690 Iron or steel; articles n.e.s. in heading no. 7326 Metals 0.6669
850431 Electrical transformers; n.e.s. in item no. 8504.2, having a power handling capacity not exceeding 1kVA Electronics 0.6213
401693 Rubber; vulcanised (other than hard rubber), gaskets, washers and other seals, of non-cellular rubber Chemicals 0.6193
401699 Rubber; vulcanised (other than hard rubber), articles n.e.s. in heading no. 4016, of non-cellular rubber Chemicals 0.6173
732399 Iron or steel; table, kitchen and other household articles and parts thereof, of iron or steel n.e.s. in heading no. 7323 Metals 0.6138
871690 Trailers, semi-trailers and other vehicles not mechanically propelled; parts thereof for heading no. 8716 Machinery 0.6111
321490 Mastics; n.e.s. in heading no. 3214 Chemicals 0.6017
854460 Insulated electric conductors; for a voltage exceeding 1000 volts Electronics 0.5962

5 Predicted Competitiveness Scores — What We Have

5.1 Country × Technology coverage

Code
pc_top3 <- pc |>
  group_by(tech) |>
  filter(year == max(year)) |>
  slice_max(pc_score, n = 3, with_ties = FALSE) |>
  summarise(Top3 = paste(iso3, collapse = ", "), .groups = "drop")

pc |>
  group_by(tech) |>
  summarise(
    Countries = n_distinct(iso3),
    Years     = paste0(min(year), "–", max(year)),
    `Mean PC` = round(mean(pc_score, na.rm = TRUE), 3),
    .groups = "drop"
  ) |>
  left_join(pc_top3, by = "tech") |>
  rename(Technology = tech, `Top 3 (latest yr)` = Top3) |>
  kable(caption = "Predicted competitiveness coverage") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = TRUE, font_size = 13) |>
  row_spec(0, bold = TRUE, background = nzipl_green, color = "white")
Predicted competitiveness coverage
Technology     Countries  Years      Mean PC  Top 3 (latest yr)
Batteries      155        2003–2024  0.050    KOR, CHN, JPN
Biofuel        155        2003–2024  0.209    NLD, ZAF, UGA
Electrolyzers  155        2003–2024  0.078    ITA, CHN, KOR
Geothermal     155        2003–2024  0.130    DEU, ITA, POL
Heat Pumps     155        2003–2024  0.142    CHN, THA, ITA
Magnets        155        2003–2024  0.049    CHN, JPN, MYS
Nuclear        155        2003–2024  0.063    USA, RUS, CZE
Solar          155        2003–2024  0.059    JPN, CHN, MYS
Transmission   155        2003–2024  0.207    DNK, AUT, CHN
Wind           155        2003–2024  0.049    CHN, DNK, IND

5.2 PC score evolution for key countries

Code
focus <- c("CHN","USA","DEU","IND","KOR","JPN","BRA","VNM","CHL","AUS")

pc |>
  filter(iso3 %in% focus) |>
  mutate(tech = factor(tech, levels = names(TECH_COLORS))) |>
  ggplot(aes(x = year, y = pc_score, color = iso3, group = iso3)) +
  geom_line(linewidth = 0.7, alpha = 0.85) +
  geom_point(data = ~ filter(.x, year == max(year)),
             size = 2) +
  facet_wrap(~ tech, ncol = 5) +
  scale_color_brewer(palette = "Paired", name = "Country") +
  scale_y_continuous(limits = c(0, 1), labels = scales::percent) +
  labs(title = "Predicted Competitiveness Scores — Selected Countries",
       subtitle = "Probability of achieving RCA > 1 in each clean technology",
       x = NULL, y = "Predicted Competitiveness") +
  theme_nzipl() +
  theme(legend.position = "bottom",
        legend.key.size = unit(0.4, "cm"))

5.3 Latest-year rankings heatmap

Code
top_countries <- pc |>
  filter(year == max(year)) |>
  group_by(iso3) |>
  summarise(avg_pc = mean(pc_score, na.rm = TRUE), .groups = "drop") |>
  slice_max(avg_pc, n = 30) |>
  pull(iso3)

pc |>
  filter(year == max(year), iso3 %in% top_countries) |>
  mutate(
    country = if_else(is.na(country) | country == "", iso3, country),
    tech    = factor(tech, levels = names(TECH_COLORS))
  ) |>
  ggplot(aes(x = tech, y = reorder(country, pc_score, mean),
             fill = pc_score)) +
  geom_tile(color = "white", linewidth = 0.4) +
  geom_text(aes(label = round(pc_score, 2)),
            size = 3.2, color = "white", fontface = "bold",
            family = "Archivo") +
  scale_fill_gradientn(
    colors  = c("#f8fafc", "#d1fae5", "#6ee7b7", "#059669", nzipl_green),
    limits  = c(0, 1),
    name    = "PC Score",
    labels  = scales::percent
  ) +
  labs(title = "Predicted Competitiveness — Top 30 Countries (latest year)",
       subtitle = paste0("Year: ", max(pc$year),
                        "  ·  Score = P(RCA > 1)"),
       x = NULL, y = NULL) +
  theme_nzipl() +
  theme(axis.text.x = element_text(angle = 35, hjust = 1, size = 11),
        legend.position = "right")


6 RCA Data — Input to the Model

The pc_rca.parquet file contains the actual RCA values used as model features.

Code
rca_latest <- rca_all |>
  filter(year == max(year)) |>
  group_by(tech, iso3, country, region) |>
  summarise(mean_rca = mean(rca, na.rm = TRUE), .groups = "drop")

# Top 25 countries by mean RCA (matches n = 25 and the plot title)
top25 <- rca_latest |>
  group_by(iso3) |> summarise(m = mean(mean_rca), .groups = "drop") |>
  slice_max(m, n = 25) |> pull(iso3)

rca_latest |>
  filter(iso3 %in% top25) |>
  mutate(country = if_else(is.na(country) | country == "", iso3, country),
         tech = factor(tech, levels = names(TECH_COLORS))) |>
  ggplot(aes(x = tech, y = reorder(country, mean_rca, mean),
             fill = pmin(mean_rca, 3))) +
  geom_tile(color = "white", linewidth = 0.4) +
  scale_fill_gradientn(
    colors = c("#f8fafc", "#fef3c7", "#f59e0b", "#b45309", "#7c2d12"),
    name   = "RCA\n(capped at 3)",
    limits = c(0, 3)
  ) +
  labs(title = "Mean RCA Across Product Categories — Top 25 Countries",
       subtitle = paste0("Year: ", max(rca_all$year),
                        "  ·  RCA > 1 = Competitive specialisation"),
       x = NULL, y = NULL) +
  theme_nzipl() +
  theme(axis.text.x = element_text(angle = 35, hjust = 1))


7 Why SHAP Correlates with Trade — and What It Means

7.1 The causal logic

Product-space theory

The NZIPL model is grounded in economic complexity theory (Hidalgo & Hausmann 2009). Products that are “proximate” — meaning countries tend to export them together — share underlying productive capabilities (skilled labour, specialised equipment, supply chain infrastructure). The Random Forest is learning this capability-sharing structure: countries that already have RCA in upstream/related products are more likely to develop RCA in the target technology.
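
The notion of proximity can be illustrated with the definition commonly used in the product-space literature: the proximity of two products is the smaller of the two conditional probabilities that a country with RCA > 1 in one also has RCA > 1 in the other. The sketch below assumes the pipeline's phi matrix follows that definition, which the source does not confirm; the function name, country labels, and RCA values are made up.

```python
from itertools import combinations


def product_proximity(rca, threshold=1.0):
    """rca: {country: {product: RCA value}} for a single year.

    Returns {(prod_i, prod_j): phi} for prod_i < prod_j, where
    phi = |countries specialised in both| / max(|specialised in i|, |specialised in j|),
    which equals the minimum of the two conditional probabilities.
    """
    exporters = {}
    for country, products in rca.items():
        for product, value in products.items():
            if value > threshold:
                exporters.setdefault(product, set()).add(country)

    phi = {}
    for i, j in combinations(sorted(exporters), 2):
        both = len(exporters[i] & exporters[j])
        phi[(i, j)] = both / max(len(exporters[i]), len(exporters[j]))
    return phi


# Toy RCA table (made-up values): four countries, three products.
rca = {
    "A": {"P1": 1.5, "P2": 1.2, "P3": 0.3},
    "B": {"P1": 2.0, "P2": 1.1, "P3": 0.1},
    "C": {"P1": 1.3, "P2": 0.4, "P3": 1.8},
    "D": {"P1": 0.2, "P2": 0.5, "P3": 2.5},
}
phi = product_proximity(rca)
print(max(phi, key=phi.get))  # the most proximate pair of products
```

Products P1 and P2 are co-exported by two of the three countries specialised in P1, so they are the closest pair, while P2 and P3 share no specialised exporter.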

SHAP measures the contribution of each input RCA variable to the model’s prediction. High-SHAP products are the most informative capability proxies for the target technology.
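
To show what "contribution to the prediction" means, the sketch below computes exact Shapley values for a single observation by enumerating every feature coalition. This brute-force version is for intuition only: the pipeline uses shap.TreeExplainer, which is designed for tree ensembles. The scoring function, the baseline, and the input values are hypothetical.

```python
from itertools import combinations
from math import factorial


def exact_shapley(predict, x, baseline):
    """Exact Shapley values for one observation.

    Features outside a coalition are replaced by their baseline value
    (a single-reference simplification).
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i when added to coalition s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_score(z):
    # Hypothetical score: two related-product RCAs interact, a third matters little.
    return 0.5 * z[0] * z[1] + 0.1 * z[2]


x = [2.0, 1.5, 0.4]         # observed RCA values (made up)
baseline = [0.5, 0.5, 0.5]  # reference RCA values (made up)
phi = exact_shapley(toy_score, x, baseline)
print([round(p, 4) for p in phi])
```

The contributions add up to the difference between the score at x and the score at the baseline, which is the additivity property that lets SHAP values be read as a per-feature decomposition of a prediction.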

7.2 Observed correlation is not circularity

Code
feat |>
  filter(!is.na(hs_code)) |>
  group_by(tech, category) |>
  summarise(
    n          = n(),
    mean_shap  = round(mean(shap_mean_z), 4),
    sd_shap    = round(sd(shap_mean_z), 4),
    .groups    = "drop"
  ) |>
  rename(Technology = tech, Category = category,
         `N codes` = n, `Mean SHAP |z|` = mean_shap,
         `SD SHAP` = sd_shap) |>
  arrange(Technology, desc(`Mean SHAP |z|`)) |>
  kable(caption = "Average SHAP importance by technology × SHAP category") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = TRUE, font_size = 12) |>
  row_spec(0, bold = TRUE, background = nzipl_green, color = "white") |>
  collapse_rows(columns = 1, valign = "middle")
Average SHAP importance by technology × SHAP category
Technology Category N codes Mean SHAP |z| SD SHAP
Batteries Machinery 10 0.5770 0.0442
Chemicals 15 0.5682 0.0471
Metals 3 0.5561 0.0127
Electronics 2 0.5286 0.0181
Industrial Materials 1 0.5044 NA
Biofuel Machinery 18 0.6721 0.0884
Industrial Materials 14 0.6474 0.0641
Other 10 0.6446 0.0787
Chemicals 17 0.6442 0.0871
Metals 7 0.5721 0.0394
Electrolyzers Chemicals 15 0.5791 0.0520
Machinery 19 0.5777 0.0690
Metals 7 0.5702 0.0474
Electronics 1 0.5380 NA
Geothermal Electronics 3 0.6505 0.0467
Machinery 27 0.6092 0.0620
Chemicals 6 0.6035 0.0740
Metals 14 0.5842 0.0708
Industrial Materials 4 0.5627 0.0340
Heat Pumps Machinery 12 0.6756 0.0994
Electronics 3 0.6703 0.0416
Chemicals 1 0.6373 NA
Metals 8 0.6147 0.0587
Industrial Materials 2 0.5398 0.0253
Magnets Chemicals 1 0.5930 NA
Machinery 2 0.5894 0.0202
Industrial Materials 1 0.5852 NA
Metals 3 0.5832 0.0736
Electronics 1 0.5583 NA
Nuclear Machinery 5 0.5999 0.0690
Metals 11 0.5515 0.0490
Chemicals 5 0.5307 0.0305
Industrial Materials 1 0.5194 NA
Solar Machinery 4 0.5651 0.0665
Chemicals 13 0.5632 0.0426
Metals 8 0.5566 0.0570
Electronics 8 0.5350 0.0408
Industrial Materials 1 0.5085 NA
Transmission Industrial Materials 1 0.8174 NA
Electronics 3 0.7534 0.0535
Machinery 12 0.7043 0.0807
Chemicals 1 0.6687 NA
Metals 7 0.6450 0.0819
Wind Chemicals 9 0.5674 0.0467
Metals 15 0.5599 0.0571
Electronics 6 0.5554 0.0520
Machinery 19 0.5552 0.0436
Industrial Materials 1 0.5007 NA

Interpretation guide:

SHAP |z|     Interpretation
> 0.75       Critical: direct industrial prerequisite for this technology
0.60 – 0.75  High: closely related capability or upstream material
0.50 – 0.60  Moderate: broader industrial base indicator
< 0.50       Low: weak signal, general economic complexity proxy

The observed range is narrow (0.50–0.82 across all products), and no exported value falls below roughly 0.50, which indicates that a threshold filter was applied: only above-threshold features were exported to the CSV. As a result, the “Low” band in the guide above does not occur in the exported data. The absolute scale matters less than the relative ranking within a technology.
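
Since the ranking within a technology is what matters, a small sketch of per-technology ranking may help. It uses six rows taken from the Top-10 table above; the function name rank_within_tech is hypothetical.

```python
from collections import defaultdict


def rank_within_tech(rows):
    """rows: iterable of (tech, hs_code, shap_mean_z).

    Returns {(tech, hs_code): rank}, where rank 1 is the highest score
    within that technology.
    """
    by_tech = defaultdict(list)
    for tech, hs_code, score in rows:
        by_tech[tech].append((score, hs_code))

    ranks = {}
    for tech, items in by_tech.items():
        ordered = sorted(items, reverse=True)
        for position, (_, hs_code) in enumerate(ordered, start=1):
            ranks[(tech, hs_code)] = position
    return ranks


# Rows copied from the Top-10 table (three per technology).
rows = [
    ("Batteries", "391190", 0.6565),
    ("Batteries", "847990", 0.6478),
    ("Batteries", "282540", 0.6192),
    ("Transmission", "270900", 0.8174),
    ("Transmission", "847990", 0.8147),
    ("Transmission", "850490", 0.8145),
]
ranks = rank_within_tech(rows)
print(ranks[("Batteries", "847990")], ranks[("Transmission", "850490")])
```

A score of 0.6478 ranks second among these Batteries rows, while the larger 0.8145 ranks third among the Transmission rows, which is why scores should be compared within a technology rather than across technologies.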


8 Replication Roadmap in CVCE (R)

8.1 Feasibility assessment

Code
components <- tribble(
  ~Component,              ~Original,                    ~R_equivalent,          ~Status,           ~Difficulty,
  "RCA calculation",       "pandas/numpy",               "Already in CVCE pipeline (bilateral_ds)", "✅ Done", "Easy",
  "Product proximity (φ)", "Custom scipy/pandas",        "Compute from bilateral_ds with dplyr", "🔲 Missing", "Hard",
  "Feature matrix",        "10 per-tech Python notebooks", "New 11_build_ml_features.R", "🔲 Missing", "Medium",
  "KNN imputation",        "sklearn KNNImputer",         "VIM::kNN() or missForest", "🔲 Missing", "Easy",
  "Random Forest",         "sklearn RandomForestClassifier", "ranger (faster, same accuracy)", "🔲 Missing", "Easy",
  "SHAP values",           "shap.TreeExplainer",         "fastshap or treeshap", "🔲 Missing", "Easy",
  "PC scores export",      "predict_proba[:,1]",         "ranger predictions",  "🔲 Missing", "Easy",
  "Model validation",      "GridSearchCV 5-fold",        "caret or tidymodels",  "🔲 Missing", "Medium"
)

components |>
  kable(col.names = c("Component", "Original (Python)", "R Equivalent",
                      "Status", "Difficulty"),
        caption = "CVCE replication roadmap: Python → R") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = TRUE, font_size = 13) |>
  column_spec(4, bold = TRUE) |>
  row_spec(0, bold = TRUE, background = nzipl_green, color = "white")
CVCE replication roadmap: Python → R
| Component | Original (Python) | R Equivalent | Status | Difficulty |
|---|---|---|---|---|
| RCA calculation | pandas/numpy | Already in CVCE pipeline (bilateral_ds) | ✅ Done | Easy |
| Product proximity (φ) | Custom scipy/pandas | Compute from bilateral_ds with dplyr | 🔲 Missing | Hard |
| Feature matrix | 10 per-tech Python notebooks | New 11_build_ml_features.R | 🔲 Missing | Medium |
| KNN imputation | sklearn KNNImputer | VIM::kNN() or missForest | 🔲 Missing | Easy |
| Random Forest | sklearn RandomForestClassifier | ranger (faster, same accuracy) | 🔲 Missing | Easy |
| SHAP values | shap.TreeExplainer | fastshap or treeshap | 🔲 Missing | Easy |
| PC scores export | predict_proba[:,1] | ranger predictions | 🔲 Missing | Easy |
| Model validation | GridSearchCV 5-fold | caret or tidymodels | 🔲 Missing | Medium |

8.2 Proposed R implementation

The replication would add one new build script to the CVCE pipeline:

# scripts/build_data/11_build_ml_pc.R  (proposed)
#
# Replicates the Python NZIPL ML pipeline in R:
#   1. Compute RCA matrix from bilateral_ds
#   2. Build product proximity (phi) from co-export conditional probabilities (see 8.3)
#   3. Construct per-tech feature matrices (lagged RCA + macro)
#   4. Train ranger Random Forest (equivalent to sklearn RFC)
#   5. Compute SHAP values via fastshap
#   6. Export pc_scores.parquet + pc_features.csv (replacing datawheel inputs)

library(ranger)     # RF: similar to sklearn RFC
library(fastshap)   # SHAP values for any ML model
library(VIM)        # KNN imputation

Key advantages of replication:

- Re-run with each BACI update (currently 2024 data available, model trained on 2023)
- Add new features: ORBIS firm counts, patent data, supply chain distance
- Extend to new technologies or product-type subsets
- Produce SHAP values at the country-year level (not just aggregated)

8.3 The one hard piece: Product Proximity

The phi matrix (product-product proximity, used to identify which non-technology products to include as features) is the only component that requires significant new code. The formula is:

φ(p,p') = min[ P(RCA_p > 1 | RCA_p' > 1), P(RCA_p' > 1 | RCA_p > 1) ]
        = min[ (countries with RCA>1 in both) / (countries with RCA>1 in p'),
               (countries with RCA>1 in both) / (countries with RCA>1 in p) ]

This can be computed from the existing bilateral data in R, but it requires comparing every pair among the ~5,000 HS6 products (on the order of 12.5 million pairs), which is computationally intensive (~2–4 hours for the full matrix, or ~10 min for the green tech subset).
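As a rough illustration of the formula, the sketch below computes φ for a toy binary specialisation matrix in base R. It is not the notebooks' implementation: the matrix `M`, the country and product labels, and the helper `compute_phi()` are all hypothetical. It relies on the fact that the smaller of the two conditional probabilities is the co-occurrence count divided by the larger of the two product counts, so a single cross-product replaces an explicit loop over pairs.

```r
# Toy binary matrix: rows = countries, columns = products.
# M[c, p] = 1 when country c has RCA > 1 in product p (values are made up).
M <- matrix(
  c(1, 1, 0,
    1, 0, 1,
    1, 1, 1,
    0, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("AAA", "BBB", "CCC", "DDD"), c("p1", "p2", "p3"))
)

compute_phi <- function(M) {
  both     <- crossprod(M)   # both[p, p'] = countries with RCA > 1 in both products
  ubiquity <- diag(both)     # countries with RCA > 1 in each product
  # Dividing by the larger ubiquity yields the smaller conditional probability.
  denom <- outer(ubiquity, ubiquity, pmax)
  phi   <- both / denom
  phi[denom == 0] <- 0       # products in which no country is specialised
  diag(phi) <- 0             # self-proximity is not used
  phi
}

phi <- compute_phi(M)
print(round(phi, 3))
```

For `p1` and `p3`, two countries are specialised in both, three in `p1` and two in `p3`, so φ = min(2/3, 2/2) = 2/3, which matches the matrix entry. On real data, `M` would be built per year from the RCA table by thresholding at 1.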


9 Known Limitations

Code
limits <- tribble(
  ~Issue,                   ~Description,                                    ~Severity,
  "Survivorship bias",      "Only countries with continuous trade records included; sparse traders dropped", "Medium",
  "Forward SHAP bias",      "SHAP computed on full dataset, not temporally held-out test set", "High",
  "Lag depth",              "1-year lag may be insufficient for policy impacts (typically 2–3 years)", "Low",
  "Hyperparameter tuning",  "Fixed params (n=100, depth=10) across all techs; Solar/Wind only ones with GridSearchCV", "Medium",
  "Class imbalance",        "Not handled; ~10-40% of country-years are 'competitive' depending on tech", "Medium",
  "RCA volatility",         "Single extreme trade events can spike RCA for one year", "Low",
  "Target variable scope",  "RCA > 1 on any category; not distinguishing manufacturing vs raw material competitiveness", "Medium",
  "SHAP range compression", "Only above-threshold features exported; absolute z-scores not comparable across techs", "Low"
)

limits |>
  kable(col.names = c("Issue", "Description", "Severity"),
        caption = "Known model limitations") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = TRUE, font_size = 13) |>
  column_spec(3,
    color = ifelse(limits$Severity == "High", "#dc2626",
                   ifelse(limits$Severity == "Medium", "#d97706", "#16a34a")),
    bold = TRUE) |>
  row_spec(0, bold = TRUE, background = nzipl_green, color = "white")
Known model limitations
| Issue | Description | Severity |
|---|---|---|
| Survivorship bias | Only countries with continuous trade records included; sparse traders dropped | Medium |
| Forward SHAP bias | SHAP computed on full dataset, not temporally held-out test set | High |
| Lag depth | 1-year lag may be insufficient for policy impacts (typically 2–3 years) | Low |
| Hyperparameter tuning | Fixed params (n=100, depth=10) across all techs; Solar/Wind only ones with GridSearchCV | Medium |
| Class imbalance | Not handled; ~10-40% of country-years are 'competitive' depending on tech | Medium |
| RCA volatility | Single extreme trade events can spike RCA for one year | Low |
| Target variable scope | RCA > 1 on any category; not distinguishing manufacturing vs raw material competitiveness | Medium |
| SHAP range compression | Only above-threshold features exported; absolute z-scores not comparable across techs | Low |
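One possible way to address the "Forward SHAP bias" row in a replication would be to fit the model on earlier years only and compute SHAP values on later, held-out years. The base R sketch below shows only the splitting step. The `panel` data frame, its column names, the cutoff year, and the `temporal_split()` helper are hypothetical and do not reflect how the notebooks partition their data.

```r
# Hypothetical country-year panel; column names and values are illustrative only.
panel <- data.frame(
  iso3        = rep(c("AAA", "BBB"), each = 5),
  year        = rep(2019:2023, times = 2),
  rca_lag1    = c(0.4, 0.6, 0.9, 1.1, 1.3, 1.2, 1.1, 0.9, 0.8, 0.7),
  competitive = c(0, 0, 0, 1, 1, 1, 1, 0, 0, 0)
)

# Split by time rather than at random, so the test set lies strictly
# after every training observation.
temporal_split <- function(df, cutoff_year) {
  list(
    train = df[df$year <= cutoff_year, , drop = FALSE],
    test  = df[df$year >  cutoff_year, , drop = FALSE]
  )
}

parts <- temporal_split(panel, cutoff_year = 2021)

# Share of 'competitive' country-years in each partition, which is also
# a quick check on the class imbalance noted in the table.
train_share <- mean(parts$train$competitive)
test_share  <- mean(parts$test$competitive)
print(c(train = train_share, test = test_share))
```

The model would then be fitted on `parts$train`, with predictions and SHAP values computed on `parts$test`, so that reported feature importance reflects out-of-sample behaviour.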

10 Data Files Summary

Code
tribble(
  ~File,                      ~Rows,    ~Columns,       ~Source,
  "data/pc/pc_features.csv",  paste0(nrow(feat), " (", sum(!is.na(feat$hs_code)), " with HS code)"),
                               "tech, hs_code, description, category, shap_mean_z",
                               "Consolidated from 10 per-tech SHAP CSVs",
  "data/pc/pc_scores.parquet", format(nrow(pc), big.mark=","),
                               "year, iso3, tech, pc_score, country, region",
                               "predicted_competitiveness sheet of nzipl_data_20251006.xlsx",
  "data/pc/pc_rca.parquet",   format(nrow(rca_all), big.mark=","),
                               "iso3, category, tech, year, rca, country, region",
                               "rca_{tech} sheets of nzipl_data_20251006.xlsx",
  "data/pc/pc_countries.csv", format(nrow(ctry), big.mark=","),
                               "iso3, country, region",
                               "country sheet of nzipl_data_20251006.xlsx"
) |>
  kable(col.names = c("File", "Rows", "Columns", "Source"),
        caption = "CVCE data/pc/ file inventory") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = TRUE, font_size = 12) |>
  column_spec(1, monospace = TRUE, bold = TRUE) |>
  row_spec(0, bold = TRUE, background = nzipl_green, color = "white")
CVCE data/pc/ file inventory
| File | Rows | Columns | Source |
|---|---|---|---|
| data/pc/pc_features.csv | 402 (357 with HS code) | tech, hs_code, description, category, shap_mean_z | Consolidated from 10 per-tech SHAP CSVs |
| data/pc/pc_scores.parquet | 34,100 | year, iso3, tech, pc_score, country, region | predicted_competitiveness sheet of nzipl_data_20251006.xlsx |
| data/pc/pc_rca.parquet | 171,120 | iso3, category, tech, year, rca, country, region | rca_{tech} sheets of nzipl_data_20251006.xlsx |
| data/pc/pc_countries.csv | 155 | iso3, country, region | country sheet of nzipl_data_20251006.xlsx |

Generated by NZIPL Clean Value Chain Explorer · April 08, 2026