Model Data Inventory — Shared Material (PC / RCA model + optimization tool)
Compiled: 2026-06-25 · Author of source material: Alon & Ishana (NZIPL modeling team)
Purpose: inventory of two shared drops on OneDrive (000 Data Infrastructure/): what each contains and, critically, whether the material is reproducible and whether EVs are covered.
1. CICE Updated/ — the model inputs & per-tech RCA analyses
Folder holding the Predicted-Competitiveness (PC) / RCA model material.
| Item | Size | What it is |
|---|---|---|
| ML_vars.zip | 968 MB | Model variables + per-technology RCA analysis notebooks (see below) |
| baci.zip | 3.07 GB | Raw CEPII BACI bilateral trade (the model’s trade input) |
| Product Proximity Update.ipynb | 14 KB | Product-space proximity / relatedness update (standalone) |
Inside ML_vars.zip
ML_vars/
├── Bentley_data_cleaning.R # WDI download + cleaning (context variables)
├── wdi.xlsx · wdi_impute.csv # World Bank Development Indicators (imputed) — model covariates
├── IRENA_INSPIRE_Patents_April_2024(INSPIRE_data).csv # patent counts (feature)
├── Analysis/RCA <Tech> Analysis.ipynb # 14 per-tech notebooks (the actual model runs)
└── RCA Construction/<Tech>/ # ⚠️ ALL EMPTY directory stubs (no files shipped)
Context / covariate data present (reproducibility inputs):
- wdi_impute.csv + Bentley_data_cleaning.R: the WDI macro covariates (GDP, population, FDI, electricity use, governance, trade openness, manufacturing share, …) and how they’re built.
- IRENA_INSPIRE_Patents…csv: patent feature.
- baci.zip: raw trade (the RCA base).
Per-tech model notebooks (Analysis/) — 14 technologies, each a fully-executed notebook (RandomForest classifier on an RCA>1 target + shap.TreeExplainer for feature importance): Batteries · Biofuel · Electrolyzer · Geothermal · Heatpump · Magnets · Nuclear · Solar · Transmission · Wind · EV · plus DRI, DAC, Mass Timber (not in CVCE).
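For readers unfamiliar with the RCA>1 target, the sketch below shows one common way such a label is built. It assumes the standard Balassa definition of revealed comparative advantage (a country's export share in a product divided by the world's export share in that product); the notebooks' exact construction is not visible in this drop, so treat the formula, the function names and the toy numbers as illustrative assumptions rather than the authors' method.

```python
from collections import defaultdict
from typing import Dict, Tuple

Key = Tuple[str, str]  # (country, product)


def balassa_rca(exports: Dict[Key, float]) -> Dict[Key, float]:
    """Balassa RCA: (x_cp / X_c) / (X_p / X_world). Assumed definition."""
    country_total: Dict[str, float] = defaultdict(float)
    product_total: Dict[str, float] = defaultdict(float)
    world_total = 0.0
    for (country, product), value in exports.items():
        country_total[country] += value
        product_total[product] += value
        world_total += value

    rca: Dict[Key, float] = {}
    for (country, product), value in exports.items():
        # Skip cells whose denominators are zero instead of dividing by zero.
        if country_total[country] == 0 or product_total[product] == 0:
            continue
        country_share = value / country_total[country]
        world_share = product_total[product] / world_total
        rca[(country, product)] = country_share / world_share
    return rca


def rca_target(rca: Dict[Key, float]) -> Dict[Key, int]:
    """Binary label used as a classifier target: 1 if RCA > 1, else 0."""
    return {key: int(value > 1.0) for key, value in rca.items()}


# Toy data (hypothetical countries and values, not taken from BACI).
toy_exports = {
    ("AAA", "ev"): 30.0, ("AAA", "other"): 70.0,
    ("BBB", "ev"): 10.0, ("BBB", "other"): 90.0,
}
toy_rca = balassa_rca(toy_exports)
toy_labels = rca_target(toy_rca)
```

In the toy data, AAA exports EVs at a higher share than the world average and is labelled 1, while BBB is labelled 0.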
Reproducibility status — ⚠️ PARTIAL
The notebooks read their main inputs from hardcoded local paths on the author’s machine that are not bundled, e.g. (from RCA EV Analysis.ipynb):
ev = pd.read_csv("…/RCA Construction/EV/ev_trade.csv")  # ❌ missing (folder empty)
rca_hs17 = pd.read_csv("…/RCA Construction/EV/hs17_rca_cyh.csv")  # ❌ missing
hs_master = pd.read_csv("…/concordance/HS_Descriptions.csv")  # ❌ no concordance/ in zip
wdi = pd.read_csv("…/ML_vars/wdi_impute.csv")  # ✅ present
So: present = model code (notebooks), WDI covariates, patents, raw BACI. Missing = the intermediate “RCA Construction” per-tech panels (ev_trade.csv, hs17_rca_cyh.csv, …) and the HS concordance file. Those RCA Construction/<Tech>/ folders ship as empty stubs. → You cannot push-button rerun the model as-is; the intermediate RCA panels between raw BACI and the RF must be regenerated or requested.
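Before attempting a rerun, the presence of the four inputs named above can be checked mechanically. The following is a minimal sketch: the relative path tails come from the notebook excerpt, but the common root directory is an assumption (the author's actual prefix is elided in the source and the files may not share one parent), and `check_inputs` is a hypothetical helper, not part of the shipped material.

```python
from pathlib import Path
from typing import Dict, Iterable, Union

# Path tails as they appear in the RCA EV notebook excerpt.
# Assumption: all of them sit under a single root directory.
EXPECTED_INPUTS = (
    "RCA Construction/EV/ev_trade.csv",
    "RCA Construction/EV/hs17_rca_cyh.csv",
    "concordance/HS_Descriptions.csv",
    "ML_vars/wdi_impute.csv",
)


def check_inputs(
    root: Union[str, Path],
    relative_paths: Iterable[str] = EXPECTED_INPUTS,
) -> Dict[str, bool]:
    """Return {relative_path: True if a regular file exists under root}."""
    root = Path(root)
    return {rel: (root / rel).is_file() for rel in relative_paths}


def missing_inputs(root: Union[str, Path]) -> list:
    """List the expected inputs that are absent under root."""
    return [rel for rel, present in check_inputs(root).items() if not present]
```

Calling `missing_inputs("<your extraction root>")` on the unzipped drop should, per the findings above, report the two EV panels and the concordance file as missing.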
EVs specifically — answering “is it empty?”
- Not empty as code: there is a complete RCA EV Analysis.ipynb (22 cells, fully executed, dated 2026-05-15), RandomForest + SHAP, identical pattern to the other techs.
- Empty as data: the RCA Construction/EV/ output folder is an empty directory in the zip. The notebook writes its results there on the author’s machine (out_dir = …/RCA Construction/EV/):
  - ev_rca_by_category_year.csv: RCA by category × year (→ atlas radar axis)
  - X_rank_save[['country_code','year','predicted_comp']]: PC scores by country-year (→ pc_scores)
  - df_2024_ev[['country_code','rank','rank_label','predicted_comp']]: 2024 ranks
  - a per-HS6 SHAP table (shap_df, shap_by_cat) (→ pc_features SHAP weights)
None of these output CSVs are in the zip. They exist only inside the notebook’s run.
Implication for Task 3 (wire EVs into the atlas radar + PC scatter): the three CSVs the CVCE pipeline needs (PC scores, RCA-by-category, SHAP-by-HS6) are computed but not shipped. Cleanest path is to ask Ishana/Alon for the contents of RCA Construction/EV/ (3 small CSVs) — that is exactly the pc_scores.parquet / pc_rca.parquet / pc_features.csv material for EVs. Failing that, the values are partially recoverable by re-executing the EV notebook, but only after the missing ev_trade.csv / hs17_rca_cyh.csv / HS_Descriptions.csv inputs are obtained.
2. Alon's optimization tool.zip — CICE-V1 Interactive POC
A self-contained, email-able proof-of-concept dashboard: a country-competitiveness sensitivity tool. “If country X invested an extra ΔExport (as % of GDP) in technology T, how would its probability of being competitive — P(RCA>1) — move, and which HS categories drive it?”
web/
├── interactive_poc.html # the dashboard (template + manifest + HS6 meta inlined)
├── interactive_template.html # hand-written template (__INLINE_DATA__ substituted at build)
├── pipeline_flowchart.html + FLOWCHART_GUIDE.md # methodology flowchart + its style guide
├── README.md # methodology + build instructions
└── data/
├── manifest.json # {tech:{iso3:[years]}} availability + country names + ALPHAS grid
├── _hs_meta.js # HS6 code → description cache
└── <Tech>.json.gz # one gzipped bundle per tech (country × year payloads)
Scope: 10 technologies (Batteries, Biofuel, Electrolyzer, Geothermal, Heatpump, Magnets, Nuclear, Solar, Transmission, Wind) × ~153 countries × 21 years (2003–2023). No EVs / DRI / DAC / Timber.
What the dashboard does (methodology):
- Left panel: P(RCA>1) vs. investment expressed as ΔExport (% of GDP, with a $ axis). The blue curve is the full log-linear-form (LLF) model prediction; drag the marker to read P, ΔP, ΔExport (%GDP), ΔExport ($). An orange dotted “validity range” marks the largest investment α at which the local linearization stays within 50% of the full prediction. A static label reports P_baseline, marginal dP/dExport ($), and the validity threshold.
- Right panel: RCA Impact (%) by HS2 category, recomputed live as you drag. Bar height = Σ_{i∈C} gᵢ² / Σ_all gⱼ² × 100 (bars sum to 100%); stacked segments are individual HS6 products by their L2² share. Categories: Chemicals, Industrial Materials, Electronics, Machinery, Metals, Textiles, Other.
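The two panel computations described above can be sketched in a few lines. This is an illustration under stated assumptions, not the dashboard's code: the bar-height function follows the squared-gradient share formula as written, while the validity check reads "within 50%" as a relative error |linear − full| ≤ 0.5·|full|, which is one plausible interpretation. The function names, the HS6-to-category mapping and the toy `full` model are all hypothetical.

```python
from typing import Callable, Dict, Optional, Sequence


def category_shares(
    grads: Dict[str, float], category_of: Dict[str, str]
) -> Dict[str, float]:
    """Bar heights: 100 * sum_{i in C} g_i^2 / sum_all g_j^2 per category C."""
    total = sum(g * g for g in grads.values())
    if total == 0:
        return {}
    shares: Dict[str, float] = {}
    for hs6, g in grads.items():
        cat = category_of.get(hs6, "Other")  # unmapped codes fall into Other
        shares[cat] = shares.get(cat, 0.0) + 100.0 * g * g / total
    return shares


def validity_threshold(
    full: Callable[[float], float],
    p0: float,
    slope: float,
    alphas: Sequence[float],
    tol: float = 0.5,
) -> Optional[float]:
    """Largest alpha on the grid, scanning upward from the smallest, for which
    the linearization p0 + slope*alpha stays within tol (relative) of full(alpha).
    Stops at the first violation; returns None if even the smallest alpha fails."""
    best = None
    for alpha in sorted(alphas):
        exact = full(alpha)
        linear = p0 + slope * alpha
        if abs(linear - exact) <= tol * abs(exact):
            best = alpha
        else:
            break
    return best


# Toy inputs: gradients per HS6 code and an illustrative category mapping.
toy_grads = {"850760": 3.0, "280530": 4.0}
toy_cats = {"850760": "Electronics", "280530": "Chemicals"}
toy_shares = category_shares(toy_grads, toy_cats)

# Toy "full" model: linear part plus a quadratic term the linearization misses.
toy_alpha = validity_threshold(
    full=lambda a: 0.2 + 0.1 * a + a * a,
    p0=0.2, slope=0.1, alphas=[0.0, 0.1, 0.2, 1.0],
)
```

With these toy inputs the shares sum to 100 across categories, and the linearization is accepted up to α = 0.2 but rejected at α = 1.0.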
Per-country-year payload (in each <Tech>.json.gz): pb (baseline P), gdp, exp (export level), val, and pf (the per-HS6 gradient / prediction vector) — the inputs that drive the draggable sensitivity curve.
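To inspect a bundle outside the browser, something like the following should work. The payload keys (pb, gdp, exp, val, pf) are taken from the description above; the nesting `{iso3: {year: payload}}` is an assumption modelled on the manifest's `{tech:{iso3:[years]}}` shape and should be verified against a real `<Tech>.json.gz`. `load_bundle` and `get_payload` are hypothetical helper names.

```python
import gzip
import json
from typing import Any, Dict

# Keys listed in the inventory for each country-year payload.
PAYLOAD_KEYS = ("pb", "gdp", "exp", "val", "pf")


def load_bundle(path: str) -> Dict[str, Any]:
    """Read a gzipped JSON bundle into a Python dict."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)


def get_payload(bundle: Dict[str, Any], iso3: str, year: int) -> Dict[str, Any]:
    """Fetch one country-year payload and check the expected keys are present.

    Assumes bundle[iso3][str(year)] nesting (JSON object keys are strings).
    """
    payload = bundle[iso3][str(year)]
    missing = [key for key in PAYLOAD_KEYS if key not in payload]
    if missing:
        raise KeyError(f"payload for {iso3} {year} lacks keys: {missing}")
    return payload
```

If the real files use a different layout (for example a list of records), only `get_payload` needs to change; `load_bundle` is format-agnostic beyond gzip + JSON.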
Reproducibility: the bundles are build artifacts (gitignored upstream), regenerated by scripts/build_interactive_data.py --techs ALL --countries ALL --years ALL. That build script is not in this zip (only its outputs + the assembled HTML are). The dashboard runs only when served over HTTP (it uses fetch); Plotly.js and pako load from a CDN.
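Because the dashboard needs HTTP, one way to view it locally is to serve the extracted web/ folder with Python's standard library. This is a minimal sketch using only `http.server`; the helper names are hypothetical, and the directory you pass is wherever you unzipped the tool. An internet connection is still needed for the CDN-hosted Plotly.js and pako.

```python
import functools
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_server(directory: str, port: int = 0) -> ThreadingHTTPServer:
    """Build a localhost-only static file server rooted at `directory`.

    port=0 asks the OS for a free port; read it from server.server_address[1].
    """
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    return ThreadingHTTPServer(("127.0.0.1", port), handler)


def serve_in_background(server: ThreadingHTTPServer) -> threading.Thread:
    """Run the server on a daemon thread so the caller keeps control."""
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return thread
```

Typical use: `server = make_server("web")`, `serve_in_background(server)`, then open `http://127.0.0.1:<port>/interactive_poc.html` in a browser and call `server.shutdown()` when finished.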
Bottom line
| Question | Answer |
|---|---|
| Is there an EV model? | Yes — RCA EV Analysis.ipynb, fully run (RF + SHAP), May 15. |
| Is the EV data shipped? | No — RCA Construction/EV/ is an empty folder; output CSVs not included. |
| Can we reproduce the model from this drop? | Not fully — covariates (WDI), patents, raw BACI present, but the intermediate per-tech RCA panels + HS concordance are missing (hardcoded author-local paths). |
| Fastest route to EVs in the atlas | Request the 3 CSVs in RCA Construction/EV/ from Ishana/Alon (PC scores, RCA-by-category, SHAP-by-HS6). |
| Does Alon’s tool cover EVs? | No — 10 techs, EVs not among them. |
