How the ML Competitiveness Model Works
A Visual Explainer — No Machine Learning Background Required
This explainer was written for colleagues at NZIPL and incoming staff who are curious about how the ML competitiveness model works. Almost no maths or code — just intuition, pictures, and real data from the Solar technology value chain. The model was built by Ishana Ratan (NZIPL Research Associate) using Python’s scikit-learn and shap libraries.
1 The Question We Are Trying to Answer
The core question is deceptively simple:
Given everything we know about a country’s trade patterns today, how likely is it to be a competitive exporter of a clean technology product in the future?
For example: Should we expect India to become competitive in Solar modules? What about Vietnam in Wind components, or Chile in Battery materials?
We want a data-driven answer — one that looks at hundreds of countries and products simultaneously, learns patterns from history, and produces a score between 0 and 1 for each country and technology.
2 Step 1 — What Does “Competitive” Mean Here?
We define competitiveness using a standard trade-economics concept: the Revealed Comparative Advantage (RCA).
The idea is simple: a country is considered competitive in a product if it exports more than its fair share of that product relative to the world.
\[ \text{RCA}_{c,p} = \frac{\text{Exports}_{c,p} / \text{Total exports}_c}{\text{World exports}_p / \text{Total world trade}} \]
- RCA > 1 → the country exports proportionally more of product \(p\) than the world average → Competitive
- RCA ≤ 1 → the country exports proportionally less → Not (yet) competitive
This converts a continuous trade number into a binary label — 1 = competitive, 0 = not competitive — and this label is what the model learns to predict.
A binary label (competitive / not competitive) makes the model simpler and more interpretable than trying to predict exact RCA values. It also avoids the model being dominated by extreme exporters like China. Statistical advisor Ben Bagozzi (University of Delaware) recommended this approach specifically for this kind of imbalanced trade data.
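The RCA formula and the binary label can be sketched in a few lines of Python. This is a toy illustration of the definition above, not the production pipeline; the function and variable names are invented here:

```python
def rca(exports_cp, total_exports_c, world_exports_p, total_world_trade):
    """Revealed Comparative Advantage: country c's export share in product p,
    divided by product p's share of total world trade."""
    country_share = exports_cp / total_exports_c
    world_share = world_exports_p / total_world_trade
    return country_share / world_share

def competitive(rca_value):
    """The binary label the model learns to predict: 1 if RCA > 1, else 0."""
    return 1 if rca_value > 1 else 0

# Toy example: solar modules are 4% of a country's exports but only
# 2% of world trade, so RCA = 2 and the country is labelled competitive.
score = rca(4.0, 100.0, 2.0, 100.0)
print(score, competitive(score))  # 2.0 1
```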
3 Step 2 — What Does the Model Learn From?
The model takes as input a set of features (also called predictors or variables) describing each country’s current economic and trade situation. These were carefully chosen to capture different dimensions of industrial capability:
| Feature type | Examples | Economic intuition |
|---|---|---|
| Trade in related products | RCA in solar wafers, polysilicon, glass | “Can you make the ingredients?” |
| Trade in similar products | Proximity-weighted export similarity | “Are neighbouring capabilities present?” |
| Product centrality | Eigenvector centrality in export network | “Is the country well-connected in trade?” |
| Foreign direct investment | FDI in solar manufacturing | “Are companies investing here?” |
| Trade barriers | Tariffs, non-tariff measures | “Can products flow in/out?” |
| Domestic conditions | GDP, population, institutional quality | “Is the broader environment supportive?” |
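One way to picture the model's input is as one row per (country, year), with one column per feature. A toy sketch — the feature names and values below are invented for illustration and do not come from the real dataset:

```python
# Hypothetical feature row for one (country, year) observation.
features = {
    "rca_solar_wafers": 1.4,       # trade in related products
    "export_similarity": 0.62,     # proximity-weighted similarity
    "export_centrality": 0.31,     # position in the export network
    "fdi_solar_usd_m": 250.0,      # FDI in solar manufacturing
    "avg_tariff_pct": 3.2,         # trade barriers
    "gdp_per_capita_usd": 9100.0,  # domestic conditions
}
label = 1  # RCA > 1 in solar modules this year -> "competitive"
```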
4 Step 3 — The Random Forest Algorithm
The model uses a Random Forest — an ensemble of many decision trees. Here is the intuition:
4.1 What is a Decision Tree?
Think of it as a flowchart of yes/no questions. For example:
```
Is RCA in polysilicon > 0.8?
├── YES → Is GDP per capita > $8,000?
│         ├── YES → Predict: COMPETITIVE (1)
│         └── NO  → Predict: NOT COMPETITIVE (0)
└── NO  → Is FDI in solar > $200M?
          ├── YES → Predict: COMPETITIVE (1)
          └── NO  → Predict: NOT COMPETITIVE (0)
```
Each question splits the data into two groups. The tree keeps asking questions until it reaches a leaf node (a final prediction). Each leaf says: “among all the historical cases that matched this combination of conditions, X% were competitive.”
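The flowchart above translates directly into nested if/else logic. Here is a literal toy transcription (real trees learn their questions and thresholds from data rather than having them hand-written):

```python
def toy_tree(rca_polysilicon, gdp_per_capita, fdi_solar_musd):
    """Hand-written decision tree mirroring the flowchart above."""
    if rca_polysilicon > 0.8:
        # Already makes the key input: check the broader economy.
        return 1 if gdp_per_capita > 8_000 else 0
    else:
        # No input capability yet: is investment flowing in?
        return 1 if fdi_solar_musd > 200 else 0

print(toy_tree(1.2, 12_000, 50))  # competitive via the polysilicon branch
print(toy_tree(0.3, 5_000, 300))  # competitive via the FDI branch
```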
4.2 What makes it a Forest?
Instead of one tree, the Random Forest grows hundreds of trees (typically 100–500), each trained on:
- A random subsample of the training data (so each tree sees different examples)
- A random subset of features at each split (so trees don’t all ask the same questions)
The final prediction is the vote share: if 73 out of 100 trees classify a country as competitive, the model’s predicted competitiveness score is 0.73.
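This vote-share behaviour is what scikit-learn's `RandomForestClassifier` provides out of the box: `predict_proba` returns the fraction of trees backing each class. A minimal sketch on synthetic data — not the NZIPL pipeline or its real features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))            # 500 toy observations, 6 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic "competitive" label

# 300 trees, each grown on a bootstrap sample of the rows, considering
# a random subset of features (sqrt of the total) at every split.
forest = RandomForestClassifier(n_estimators=300, max_features="sqrt",
                                bootstrap=True, random_state=0).fit(X, y)

# Column 1 of predict_proba is the share of trees voting "competitive" —
# the same quantity the explainer calls a PC score.
pc_scores = forest.predict_proba(X[:5])[:, 1]
print(pc_scores)
```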
4.3 Why Random Forest and not a simpler model?
The team evaluated several approaches (from the methodology notes):
| Model | Pros | Cons |
|---|---|---|
| Linear Regression | Simple, interpretable | Assumes straight-line relationships; struggles with interactions |
| Lasso/Ridge | Handles many variables | Still assumes linear relationships |
| Random Forest | Captures non-linear patterns and interactions | Slightly harder to interpret |
| Gradient Boosting | Often highest accuracy | Prone to overfitting; harder to tune |
Random Forest was chosen because clean-tech competitiveness involves conditional relationships — e.g., having RCA in polysilicon only matters for Solar competitiveness if a country also has basic manufacturing infrastructure. Linear models cannot capture these “and” conditions; decision trees can.
Model performance (Solar): Area under ROC curve ≈ 0.90 (excellent; a random model would score 0.50).
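AUC has an intuitive reading: pick one competitive case and one non-competitive case at random; AUC is the probability that the model scores the competitive case higher. With scikit-learn it is a single call (the labels and scores below are toy numbers, not the real Solar results):

```python
from sklearn.metrics import roc_auc_score

# True labels and model scores for six toy (country, technology) cases.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]

# 8 of the 9 competitive/non-competitive pairs are ranked correctly.
auc = roc_auc_score(y_true, y_score)
print(round(auc, 3))  # 0.889
```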
5 Step 4 — What are SHAP Values?
Once the model is trained, a natural question is: why did it predict a particular score? SHAP (SHapley Additive exPlanations) provides the answer.
5.1 The intuition
Imagine you want to know why a country got a high competitiveness score of 0.78. SHAP asks: how much did each feature “push” the score up or down from the baseline?
For example:

- High RCA in silicon wafers: pushed score UP by +0.18
- Low FDI in solar: pushed score DOWN by −0.05
- High GDP: pushed score UP by +0.08
- Low trade barriers: pushed score UP by +0.12
The SHAP values sum to the difference between the country’s predicted score (0.78) and the average score across all countries (say, 0.45).
5.2 SHAP Feature Importance: aggregating across all countries
To get an overall measure of how important each feature is across the entire dataset, NZIPL uses the mean absolute z-score of SHAP values (shap_mean_z):
- For each (country, year) observation, SHAP calculates a push value for every feature
- Those push values are standardised (converted to z-scores) to make them comparable across features with different units
- The mean of the absolute z-scores across all observations gives the overall importance of each feature
A feature with shap_mean_z = 2.5 matters more than one with shap_mean_z = 0.3: its pushes on predictions are larger and more consistent across the dataset.
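Under those definitions, the computation is a few numpy lines. This is my reconstruction of the described statistic from the bullet points above, not NZIPL's actual code, using pooled standardisation and toy numbers:

```python
import numpy as np

# Toy SHAP matrix: 4 observations (rows) x 3 features (columns).
shap_values = np.array([
    [ 0.20, -0.01,  0.05],
    [-0.15,  0.02, -0.04],
    [ 0.30,  0.01,  0.06],
    [-0.25, -0.02, -0.03],
])

# Standardise the pushes to z-scores, then average their absolute
# values over observations -> one importance number per feature.
z = (shap_values - shap_values.mean()) / shap_values.std()
shap_mean_z = np.abs(z).mean(axis=0)
print(shap_mean_z)  # larger value = more important feature
```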
6 Step 5 — What the Scores Mean
After training, the model is applied to all countries and years to produce Predicted Competitiveness (PC) scores — a number between 0 and 1 for each (country × technology × year) combination.
- A score of 0.85 means 85% of the trees in the forest classified that country as competitive in Solar — a strong signal
- A score of 0.50 is ambiguous — the model is uncertain
- A score of 0.10 means the model sees little evidence of competitive fundamentals at this time
- Scores change over time as the country’s trade patterns, FDI, and policies evolve
7 Step 6 — From ML Scores to Policy Intelligence
The NZIPL Clean Value Chain Explorer combines the ML scores with the Green Dictionary and BACI bilateral trade data to answer policy-relevant questions:
7.1 What can this tell a policymaker?
- Identify emerging competitors early: A country with a rising PC score — even if it does not yet have high RCA — may be building the product-space and capability foundations for future competitiveness
- Target industrial policy at capability gaps: The feature importance chart (Step 4) tells you which products are most predictive of competitiveness. Countries missing those products face structural gaps
- Distinguish structural vs cyclical changes: A country whose PC score drops sharply in one year but recovers is likely experiencing a trade shock; a country whose score steadily declines has structural issues
- Compare value chain positions: Countries that are competitive in Upstream (Raw Material, Processed Material) vs Downstream (Final Product) are in structurally different positions with different policy implications
8 Appendix: Technical Summary
| Component | Detail |
|---|---|
| Model type | Random Forest (binary classification) |
| Target variable | Binary RCA > 1 in the final product (e.g. solar modules) |
| Training data | Country-level trade, FDI, policy, institutional indicators · 2003–2024 |
| Feature categories | Trade in related/similar products, product-space centrality, FDI, trade barriers, domestic conditions |
| Key hyperparameters | 100–500 trees · random feature subsets at each split · bootstrap sampling |
| Predicted output | PC Score = proportion of trees classifying country as competitive (0–1) |
| Feature importance | SHAP mean |z| = standardised average push of each feature across all predictions |
| Model performance | Solar AUC (ROC) ≈ 0.90 · F1 score reported per technology |
| Technologies covered | Batteries, Biofuel, Electrolyzers, Geothermal, Heat Pumps, Magnets, Nuclear, Solar, Transmission, Wind |
| Built by | Ishana Ratan (NZIPL Research Associate) · Python scikit-learn + shap · Advised by Ben Bagozzi (U of Delaware) |
8.1 Glossary
RCA (Revealed Comparative Advantage): A country’s share of exports in a product relative to its share of total world trade. RCA > 1 = competitive.
Feature: Any variable the model uses as input (e.g., “RCA in silicon wafers”, “FDI in solar manufacturing”).
Decision Tree: A flowchart of yes/no questions about the features; each branch leads to a prediction.
Random Forest: An ensemble of many decision trees, each trained on a random subsample of data; final prediction = vote average.
SHAP Value: The numerical contribution of one feature to one prediction — how much it pushed the score up or down from the baseline.
SHAP mean |z|: The average absolute SHAP value (standardised) across all observations — overall feature importance.
PC Score: Predicted Competitiveness score (0–1) = proportion of trees voting “competitive”.
AUC (Area Under ROC Curve): Model performance measure; 1.0 = perfect, 0.5 = random guessing, ≥ 0.8 = good.
F1 Score: Balances precision (when the model says “competitive”, is it right?) and recall (does it find all competitive cases?). Appropriate for imbalanced data (most countries are not competitive in most products).