# ---- paths ----
orbis_parquet_path <- "../../../data/orbis.parquet" # adjust
# ---- connect (DuckDB reads Parquet lazily) ----
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = ":memory:")Identifying Battery Manufacturing Firms in ORBIS from HS Codes
1. Motivation and scope
This document identifies firms likely involved in battery manufacturing using ORBIS firm-level data. The starting point is a pair of HS product codes for batteries, which are mapped to economic activities (NACE) using Eurostat correspondences.
Caveat (non-negotiable): ORBIS does not contain HS-coded product output. Firms are identified via activity classifications (NACE), motivated by HS–CN–CPA correspondences. The mapping is inherently many-to-many and should be interpreted as a screening, not proof of HS-level production.
2. HS → CN → CPA → NACE mapping
2.1 HS codes of interest
We start from the following HS6 codes:
- 850760 — Lithium-ion accumulators
- 850780 — Other accumulators
2.2 CN–CPA correspondence (Eurostat)
Using Eurostat CN–CPA correspondence tables (CN 2022 ↔︎ CPA 2.1), we obtained:
| CN (2022) | CPA (2.1) |
|---|---|
| 85076000 | 27.20.23 |
| 85078000 | 27.20.23 |
Thus, both HS codes map to CPA 27.20.23.
2.3 CPA → NACE anchor
CPA 27.20.23 belongs to the product family associated with:
- NACE Rev.2 = 27.20
- Manufacture of batteries and accumulators
This NACE code is used as the activity anchor in ORBIS.
3. ORBIS data and NACE encoding
3.1 Data source
The ORBIS data used here is a processed Parquet file containing firm identifiers, classifications, balance-sheet variables, and derived statistics.
It is not raw ORBIS: several variables (growth rates, moments, quantiles) appear to have been computed by a previous researcher.
3.2 Setup
Edit these paths to match your environment.
3.3 Schema inspection
schema_df <- DBI::dbGetQuery(
con,
sprintf("DESCRIBE SELECT * FROM read_parquet('%s')", orbis_parquet_path)
)
schema_df column_name column_type null key default extra
1 column00 VARCHAR YES <NA> <NA> <NA>
2 bvdid VARCHAR YES <NA> <NA> <NA>
3 CATEGORY_OF_COMPANY VARCHAR YES <NA> <NA> <NA>
4 CONSCODE VARCHAR YES <NA> <NA> <NA>
5 CLOSDATE_year VARCHAR YES <NA> <NA> <NA>
6 FIAS VARCHAR YES <NA> <NA> <NA>
7 TFAS VARCHAR YES <NA> <NA> <NA>
8 TOAS VARCHAR YES <NA> <NA> <NA>
9 DEPR VARCHAR YES <NA> <NA> <NA>
10 _40025 VARCHAR YES <NA> <NA> <NA>
11 NAME_INTERNAT VARCHAR YES <NA> <NA> <NA>
12 POSTCODE VARCHAR YES <NA> <NA> <NA>
13 CITY_INTERNAT VARCHAR YES <NA> <NA> <NA>
14 REGION_IN_COUNTRY VARCHAR YES <NA> <NA> <NA>
15 COUNTRY VARCHAR YES <NA> <NA> <NA>
16 EBTA VARCHAR YES <NA> <NA> <NA>
17 major_sector VARCHAR YES <NA> <NA> <NA>
18 nace2_main_section VARCHAR YES <NA> <NA> <NA>
19 naceccod2 VARCHAR YES <NA> <NA> <NA>
20 nacecdes2 VARCHAR YES <NA> <NA> <NA>
21 Year_diff VARCHAR YES <NA> <NA> <NA>
22 Deflator VARCHAR YES <NA> <NA> <NA>
23 FIAS_growth_log VARCHAR YES <NA> <NA> <NA>
24 TFAS_growth_log VARCHAR YES <NA> <NA> <NA>
25 geo_ind VARCHAR YES <NA> <NA> <NA>
26 Data_source VARCHAR YES <NA> <NA> <NA>
27 IndexType VARCHAR YES <NA> <NA> <NA>
28 Index VARCHAR YES <NA> <NA> <NA>
29 Time VARCHAR YES <NA> <NA> <NA>
30 mean VARCHAR YES <NA> <NA> <NA>
31 kurtosis VARCHAR YES <NA> <NA> <NA>
32 variance VARCHAR YES <NA> <NA> <NA>
33 quantile_01 VARCHAR YES <NA> <NA> <NA>
34 quantile_05 VARCHAR YES <NA> <NA> <NA>
35 quantile_10 VARCHAR YES <NA> <NA> <NA>
36 quantile_25 VARCHAR YES <NA> <NA> <NA>
37 quantile_50 VARCHAR YES <NA> <NA> <NA>
38 quantile_75 VARCHAR YES <NA> <NA> <NA>
39 quantile_90 VARCHAR YES <NA> <NA> <NA>
40 quantile_95 VARCHAR YES <NA> <NA> <NA>
41 quantile_99 VARCHAR YES <NA> <NA> <NA>
42 quantile_100 VARCHAR YES <NA> <NA> <NA>
The relevant activity fields are:
naceccod2— NACE Rev.2 code (string)nacecdes2— NACE descriptionnace2_main_section— high-level section
3.4 NACE encoding checks
We verify that 4-character NACE codes are digits-only.
nchar_dist <- DBI::dbGetQuery(
con,
sprintf(
"
SELECT
length(naceccod2) AS len,
COUNT(*) AS n
FROM read_parquet('%s')
GROUP BY 1
ORDER BY 1
",
orbis_parquet_path
)
)
nchar_dist len n
1 1 2
2 2 107
3 3 757431
4 4 22680915
digit_check <- DBI::dbGetQuery(
con,
sprintf(
"
SELECT
SUM(CASE WHEN length(naceccod2)=4 AND regexp_full_match(naceccod2, '^[0-9]{4}$')
THEN 1 ELSE 0 END) AS ok_4digit,
SUM(CASE WHEN length(naceccod2)=4 THEN 1 ELSE 0 END) AS total_4char
FROM read_parquet('%s')
",
orbis_parquet_path
)
)
digit_check ok_4digit total_4char
1 22680915 22680915
If ok_4digit == total_4char, then NACE4 is encoded as digits-only strings like "2720".
4. Extraction of battery manufacturing firms
4.1 Row-level extraction (NACE 2720)
We filter ORBIS rows with NACE Rev.2 code:
- 2720 — Manufacture of batteries and accumulators
f <- DBI::dbGetQuery(
con,
sprintf(
"
SELECT *
FROM read_parquet('%s')
WHERE naceccod2 = '2720'
",
orbis_parquet_path
)
)
dim(f)[1] 9889 42
str(f, max.level = 1)'data.frame': 9889 obs. of 42 variables:
$ column00 : chr "2478" "3250" "5911" "7897" ...
$ bvdid : chr "DZ09B0764222" "DZ98B0862237" "DZ05B0107843" "DZ09B0764222" ...
$ CATEGORY_OF_COMPANY: chr "MEDIUM SIZED COMPANY" "MEDIUM SIZED COMPANY" "MEDIUM SIZED COMPANY" "MEDIUM SIZED COMPANY" ...
$ CONSCODE : chr "U1" "U1" "U1" "U1" ...
$ CLOSDATE_year : chr "2011" "2011" "2012" "2012" ...
$ FIAS : chr "1258925.89566195" "27440.0665062956" "88965.06" "77710.06" ...
$ TFAS : chr "NA" "NA" "NA" "NA" ...
$ TOAS : chr "5245143.70604891" "81268.6767126696" "100412.85" "81807.12" ...
$ DEPR : chr "NA" "5455.81018696975" "-11594.57" "NA" ...
$ _40025 : chr "Corporate" "Corporate" "Corporate" "Corporate" ...
$ NAME_INTERNAT : chr "Sarl Bergane Energy" "Sarl Sofacc" "Sarl STE Algerienne DES Accumulateurs Unite Oran" "Sarl Bergane Energy" ...
$ POSTCODE : chr NA NA NA NA ...
$ CITY_INTERNAT : chr "Berriane" "Berriane" "Hassi Bounif" "Berriane" ...
$ REGION_IN_COUNTRY : chr "Ghardaia" "Ghardaia" "Oran" "Ghardaia" ...
$ COUNTRY : chr "Algeria" "Algeria" "Algeria" "Algeria" ...
$ EBTA : chr "NA" "23358.788139299" "-12840.47" "NA" ...
$ major_sector : chr "Industrial, Electric & Electronic Machinery" "Industrial, Electric & Electronic Machinery" "Industrial, Electric & Electronic Machinery" "Industrial, Electric & Electronic Machinery" ...
$ nace2_main_section : chr "C - Manufacturing" "C - Manufacturing" "C - Manufacturing" "C - Manufacturing" ...
$ naceccod2 : chr "2720" "2720" "2720" "2720" ...
$ nacecdes2 : chr "Manufacture of batteries and accumulators" "Manufacture of batteries and accumulators" "Manufacture of batteries and accumulators" "Manufacture of batteries and accumulators" ...
$ Year_diff : chr "1" "1" "1" "1" ...
$ Deflator : chr "93.0593225571861" "93.0593225571861" "100" "100" ...
$ FIAS_growth_log : chr "-2.78502945130906" "-0.106790099147275" "0.0544241875515256" "0.11352463757536" ...
$ TFAS_growth_log : chr "NA" "NA" "NA" "NA" ...
$ geo_ind : chr "Algeria Ghardaia Berriane 2011" "Algeria Ghardaia Berriane 2011" "Algeria Oran Hassi Bounif 2012" "Algeria Ghardaia Berriane 2012" ...
$ Data_source : chr "Google Maps API" "Google Maps API" "Google Maps API" "Google Maps API" ...
$ IndexType : chr "Index1" "Index1" "Index1" "Index1" ...
$ Index : chr "44104" "44104" "45180" "44104" ...
$ Time : chr "2011" "2011" "2012" "2012" ...
$ mean : chr "27.3400991596588" "27.3400991596588" "24.3141939574904" "28.1336912066559" ...
$ kurtosis : chr "1.64769923617101" "1.64769923617101" "1.64075687641879" "1.69208409768496" ...
$ variance : chr "76.1899159220837" "76.1899159220837" "61.3591105593609" "96.2355460172974" ...
$ quantile_01 : chr "14.7206357192994" "14.7206357192994" "11.7148934364319" "11.6192949771881" ...
$ quantile_05 : chr "16.070845413208" "16.070845413208" "13.4513666629792" "14.4646921157837" ...
$ quantile_10 : chr "16.5799991607666" "16.5799991607666" "14.5896010398865" "15.6790790557862" ...
$ quantile_25 : chr "18.7607173919678" "18.7607173919678" "17.4301400184632" "18.8800129890442" ...
$ quantile_50 : chr "26.6600170135498" "26.6600170135498" "23.2173137664795" "27.3502235412598" ...
$ quantile_75 : chr "35.7472076416016" "35.7472076416016" "31.4587111473084" "37.0483531951904" ...
$ quantile_90 : chr "39.6790298461914" "39.6790298461914" "35.2621440887451" "41.677209854126" ...
$ quantile_95 : chr "41.0148834228516" "41.0148834228516" "35.9908561706543" "43.3350086212158" ...
$ quantile_99 : chr "42.3754103088379" "42.3754103088379" "37.8746026992798" "44.5432813644409" ...
$ quantile_100 : chr "43.2498626708984" "43.2498626708984" "38.5001335144043" "45.2242965698242" ...
Note: Multiple rows per firm may exist (panel / index structure), so rows ≠ firms.
4.2 Firm-level counts
n_firms <- length(unique(f$bvdid))
n_rows <- nrow(f)
data.frame(n_rows = n_rows, n_distinct_firms = n_firms) n_rows n_distinct_firms
1 9889 2651
4.3 Quick label sanity check
label_check <- DBI::dbGetQuery(
con,
sprintf(
"
SELECT naceccod2, nacecdes2, COUNT(*) AS n
FROM read_parquet('%s')
WHERE naceccod2 = '2720'
GROUP BY 1,2
ORDER BY n DESC
",
orbis_parquet_path
)
)
label_check naceccod2 nacecdes2 n
1 2720 Manufacture of batteries and accumulators 9889
5. Coverage diagnostics
5.1 Temporal coverage
table(f$CLOSDATE_year)
2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
29 33 29 33 41 40 602 687 1020 919 601 879 983 528 628 732
2017 2018 2019
811 711 583
5.2 Country coverage
sort(table(f$COUNTRY), decreasing = TRUE)
China Germany India
6766 433 343
Italy Russian Federation Taiwan, China
299 231 228
United Kingdom France Ukraine
177 128 128
Poland Republic of Korea Japan
125 114 86
Malaysia Spain Netherlands
85 78 57
Belgium Hungary Austria
55 46 37
Sweden Czech Republic Philippines
34 33 32
Bulgaria Norway Turkey
30 24 24
Algeria Romania Serbia
23 22 22
Luxembourg Greece Portugal
21 20 19
Mexico Australia Colombia
17 15 15
Finland Singapore Slovakia
15 15 15
Morocco Brazil Slovenia
12 11 10
Croatia Denmark Estonia
7 7 7
North Macedonia Pakistan Uruguay
7 7 4
Hong Kong SAR, China Cyprus South Africa
2 1 1
Thailand
1
5.3 Location completeness
table(is.na(f$POSTCODE))
FALSE TRUE
9747 142
table(is.na(f$CITY_INTERNAT))
FALSE TRUE
9332 557
table(is.na(f$REGION_IN_COUNTRY))
FALSE TRUE
9217 672
table(is.na(f$geo_ind))
FALSE
9889
Observation: geo_ind appears fully populated; postcode/city/region are mostly available but not universal. The field geo_ind encodes a composite string (country–region–city–year) and can be parsed later for spatial aggregation.
6. Interpretation of key variables
6.1 Core economic variables (firm-level)
These are standard ORBIS-style quantities (when present and well-defined):
- FIAS — Fixed assets (capital stock proxy)
- TFAS — Tangible fixed assets
- TOAS — Total assets (firm size)
- DEPR — Depreciation (capital consumption)
- EBTA — Earnings before taxes (profitability proxy)
These are the economically meaningful firm variables to start with.
7. Methodological caveat (explicit, for reuse)
# ORBIS provides firm activity codes (NACE), not HS product output.
# Firms are screened using NACE Rev.2 = 2720 (Manufacture of batteries and accumulators),
# motivated by Eurostat HS/CN 850760–850780 → CPA 27.20.23 → NACE 27.20.
# The mapping is many-to-many: NACE activity ≠ HS production.
# Results identify likely battery manufacturers by activity classification, not definitive HS producers.8. Next steps (not executed here)
- Construct a firm-level table (one row per
bvdid) for reporting and deduplication. - Explore balance-sheet magnitudes and growth variables (after confirming units/deflation).
- Parse
geo_indfor spatial aggregation. - Only then: mapping (country map first; firm-level geocoding only if justified).