ECON 595 | Data Science for Social Scientists

Author

Oriol Vallès Codina

Published

25 February, 2025

Course Description

This course introduces graduate students to data science. Data science is an interdisciplinary field focused on extracting knowledge from typically large data sets. It encompasses preparing data for analysis, formulating data science problems, analyzing data, detecting patterns, summarizing the results, and visualizing the main findings. It may involve both quantitative and qualitative data, depending on the problem at hand. At the graduate level, training in data science and computational social science should balance core computational skills, statistical modeling, and domain-specific applications. This course will cover key computational techniques, including statistical modeling and visualization, and explore novel applications in the social sciences and heterodox economics, such as input-output analysis as networks and data-driven dynamical systems. Some of the statistical modeling topics may overlap with ECON 529 - Research Methods I: Econometrics.

Previously known as the “boring and menial” discipline of statistics, this challenging and rewarding field gained popularity after its flashy re-branding as data science in the 1990s, and in 2012 the Harvard Business Review hailed the data scientist as the “sexiest job of the 21st century.” Yet data science is a double-edged sword. On one hand, it is a tool of power that can be employed for mass surveillance, policing, or military applications. On the other hand, it can be wielded against power by those who seek to expose injustice, challenge corporate monopolies, and advocate for transparency, equality, and social change.

This course is based on the R language. The first five weeks provide an introduction to R, covering fundamental data management and wrangling skills. While no prior programming knowledge is assumed, students with previous experience in R or other languages will certainly find some aspects easier. However, beginners should not feel disadvantaged, as the course is designed to build programming skills from the ground up.

The course is designed as a hands-on, practice-oriented workshop, seeking the active engagement of students in the learning process. My presentations and teaching notes will be shared for your reference.

Course Objectives

Develop proficiency in using R for data science applications, including data wrangling, visualization, and statistical modeling.
Understand how to implement data science research designs across a variety of settings.
Acquire the skills to design and complete independent and collaborative data science projects.
Critically engage with empirical research by assessing the strengths and limitations of statistical and computational methods.
Learn to communicate analytical results clearly and appropriately in both written and oral formats.
Gain exposure to modern computational tools, including machine learning, Quarto for reproducible research, and GitHub for version control.

Evaluation

35% Mid-Term Assignment: Students will complete a mid-term assignment at the end of the first 6 weeks. A data set of the student’s choice is to be loaded, cleaned, parsed, analyzed, and visualized.
55% Final Project: At the end of the course, students must have completed a final project based on the methods learned in the last 10 weeks. The final project may take the form of a short research paper, web application, or web scraping project. The project must demonstrate the application of data science techniques to a meaningful research question.
10% Participation: Active engagement in class discussions and project presentations.

Late assignments will be penalized by 10% per day unless prior arrangements have been made.

Grading Scale

A, A– (4.0, 3.7) Excellent work
B+, B, B– (3.3, 3.0, 2.7) Work that is more than satisfactory
C+, C (2.3, 2.0) Competent work
C–, D (1.7, 1.0) Performance that is poor, but deserving of credit
F Failure to reach the standard required in the course for credit

Software

R is open-source statistical software developed by and for statisticians, and it is widely used in academia and industry. This course employs RStudio as the primary integrated development environment (IDE) for R.

Additionally, we will use Quarto for documentation and GitHub for version control. Quarto, which uses Markdown syntax, facilitates code documentation and reproducible research. GitHub, a hosting platform for Git repositories, enables collaborative coding, version control, and project management.

Plagiarism

Academic integrity is taken seriously. Any acts of plagiarism or dishonesty will result in academic probation.

AI Policy

The use of generative AI models is encouraged as an essential tool for modern data scientists. AI can assist with coding, research design, output interpretation, and complex computational tasks.

Readings

Required

Wickham, H., & Grolemund, G. (2017). R for Data Science: Visualize, Model, Transform, Tidy and Import Data. O’Reilly Media. [Rdata]
Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media. [Rdata2]
Dalgaard, P. (2008). Introductory Statistics with R. Springer. [ISR]
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2023). An Introduction to Statistical Learning. Springer. [ISL]

Optional [if interested in a particular topic]

Statistics

Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (1995). Bayesian Data Analysis. Chapman and Hall/CRC.
McElreath, R. (2018). Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Chapman and Hall/CRC.
Golan, A. (2017). Foundations of Info-Metrics: Modeling, Inference, and Imperfect Information. Oxford University Press.

Networks

Sargent, T. J., & Stachurski, J. (2024). Economic Networks: Theory and Computation. Cambridge University Press.
Jackson, M. O. (2008). Social and Economic Networks. Princeton University Press.
Klärner, A., Gamper, M., Keim-Klärner, S., Moor, I., von der Lippe, H., & Vonneilich, N. (2022). Social Networks and Health Inequalities: A New Perspective for Research. Springer Nature.

Machine Learning and Dynamical Systems

Lantz, B. (2019). Machine Learning with R: Expert Techniques for Predictive Modeling. Packt Publishing.
Brunton, S. L., & Kutz, J. N. (2022). Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control. Cambridge University Press.
Strogatz, S. H. (2014). Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chemistry, and Engineering. CRC Press.

Tentative Course Schedule

Part One: Foundations

Week 1: Introduction

The very basics of data science. Setup, software installation, R and Markdown syntax. GitHub and AI workflow.

Weeks 2-3: Programming in R

READING Rdata, ch. 19, 21

Data structures: Vectors, Data Frames, Matrices, Arrays, Lists
Control structures: Conditionals, loops, apply family of functions
Functions and modular programming
Introduction to functional programming using the purrr package
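As a rough preview of how these pieces fit together, the sketch below computes the same quantity with an explicit loop and with the apply family, using only base R. The `scores` list and the `describe_mean()` helper are invented for illustration and are not part of the course materials.

```r
# Made-up illustration data: a named list of numeric vectors
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = c(5, 5, 5, 5))

# A small reusable function with a conditional (hypothetical helper)
describe_mean <- function(x, threshold = 5) {
  m <- mean(x)
  if (m > threshold) "high" else "low"
}

# Explicit loop: preallocate the result, then fill it element by element
means_loop <- numeric(length(scores))
for (i in seq_along(scores)) {
  means_loop[i] <- mean(scores[[i]])
}
names(means_loop) <- names(scores)

# Apply family: the same computation without writing the loop by hand
means_apply <- sapply(scores, mean)

# vapply() additionally checks that each result is a single string
labels <- vapply(scores, describe_mean, character(1))

print(means_apply)
print(labels)
```

The loop and `sapply()` versions return the same named vector; `vapply()` is often preferred inside functions because it fails loudly if a result has an unexpected type or length.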

Week 4: Wrangling Data

READING Rdata, ch. 10-15

Importing datasets (Excel, CSV, RDS)
Data cleaning and transformation
Parsing data with the tidyverse pipeline
Overview of key packages: dplyr, tidyr, forcats, lubridate

Week 5: Exploring and Visualizing Data

READING ISR, ch. 4; Rdata, ch. 3

Exploratory Data Analysis (EDA)
Summary statistics (summary(), skimr::skim())
Handling missing values (na.omit(), mice)
Correlation analysis and pairwise scatter plots
Data transformations (log, standardization, etc.)
Data visualization in base R: histograms, bar plots, time series, scatter plots
Exporting figures
Data visualization using ggplot2: geom elements, aesthetics, annotations, faceting, coordinate systems, and themes
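The first few steps in this list can be sketched in base R as follows. The choice of the built-in `airquality` data set is an assumption made here for illustration (it ships with R and contains missing values); the course may use different data, and packages such as skimr, mice, and ggplot2 are deliberately left out so the example runs without extra installation.

```r
# Built-in data set with missing values in the Ozone and Solar.R columns
data(airquality)

# Summary statistics for every column, including NA counts
print(summary(airquality))

# Number of missing values per column
na_counts <- colSums(is.na(airquality))

# Listwise deletion: drop every row that has at least one missing value
aq_complete <- na.omit(airquality)

# Two common transformations: natural log and standardization (z-scores)
aq_complete$log_ozone <- log(aq_complete$Ozone)
aq_complete$temp_z <- as.numeric(scale(aq_complete$Temp))

# Pairwise correlations among the main numeric variables
cor_mat <- cor(aq_complete[, c("Ozone", "Solar.R", "Wind", "Temp")])
print(round(cor_mat, 2))
```

Note that `na.omit()` discards whole rows, which can shrink the sample considerably; imputation (for instance with mice) is an alternative when that loss matters.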

Weeks 6-7: Statistical Modeling

READING ISL, ch. 3

Introduction to regression models: linear and generalized linear models (GLMs)
Interaction effects, model interpretation, and visualization
Model selection: goodness-of-fit measures, information criteria
Bayesian statistics: introduction to Bayesian inference and entropy maximization
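A minimal sketch of the regression topics above, using base R and the built-in `mtcars` data set (an assumption for illustration, not necessarily the data used in class). It fits a linear model with and without an interaction term, fits a logistic GLM, and compares the two linear models by R-squared and AIC.

```r
# Linear model with two additive predictors
fit_add <- lm(mpg ~ wt + hp, data = mtcars)

# Same predictors plus their interaction (wt * hp expands to wt + hp + wt:hp)
fit_int <- lm(mpg ~ wt * hp, data = mtcars)

# Generalized linear model: logistic regression for a binary outcome
# (am = 1 for manual transmission, 0 for automatic)
fit_logit <- glm(am ~ wt, data = mtcars, family = binomial)

# Goodness of fit: R-squared never decreases when a term is added
r2 <- c(additive = summary(fit_add)$r.squared,
        interaction = summary(fit_int)$r.squared)

# Information criterion: AIC penalizes the extra parameter
aic_table <- AIC(fit_add, fit_int)

print(r2)
print(aic_table)
print(coef(fit_logit))
```

Because R-squared can only go up as terms are added, it cannot by itself justify the interaction; criteria such as AIC trade fit against model size.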

Mid-Term Assignment

Part Two: Methods

Weeks 8-9: Network Analysis

Introduction to graph theory and social network analysis
Measuring network centrality and connectivity
Application: Interpreting an input-output table as a network
Application: The standard commodity as a measure of network centrality
Mid-Term Assignment Due
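To give a flavor of the input-output application, the sketch below treats a small coefficient matrix as the weighted adjacency matrix of a directed network and computes two simple centrality measures in base R. The three sectors and all numbers are invented, and the convention that entry `[i, j]` is the input from sector `i` per unit of output of sector `j` is an assumption. The sketch shows plain eigenvector centrality only; how this relates to the standard commodity is left to the course material.

```r
# Hypothetical 3-sector input coefficient matrix (all numbers are made up).
# Assumed convention: A[i, j] = input from sector i per unit of output of sector j.
sectors <- c("agriculture", "industry", "services")
A <- matrix(c(0.2, 0.3, 0.1,
              0.1, 0.1, 0.4,
              0.3, 0.2, 0.2),
            nrow = 3, byrow = TRUE,
            dimnames = list(sectors, sectors))

# Weighted degree: total weight on outgoing and incoming links
out_strength <- rowSums(A)  # deliveries of each sector, summed over users
in_strength  <- colSums(A)  # inputs required per unit of each sector's output

# Eigenvector centrality: eigenvector of the eigenvalue with largest modulus
eig <- eigen(A)
lead <- which.max(Mod(eig$values))
lambda <- Re(eig$values[lead])
v <- Re(eig$vectors[, lead])

# Fix the arbitrary sign and rescale so that the scores sum to one
centrality <- abs(v) / sum(abs(v))
names(centrality) <- sectors

print(round(centrality, 3))
```

For a matrix with strictly positive entries, as here, the leading eigenvalue is real and its eigenvector has entries of a single sign, which is why taking absolute values is harmless.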

Week 10: API & Web Scraping, Text Analysis

READING Rdata2, ch. 16, 25

Web scraping using the rvest package
Regular expressions for text processing
Application: Example of literature review analysis
Citation networks and word clouds

Weeks 11-12: Machine Learning

READING ISL, ch. 2

Supervised learning: decision trees, random forests, and support vector machines
Unsupervised learning: k-means clustering, principal component analysis (PCA), and hierarchical clustering
Understanding the bias-variance trade-off, overfitting, confusion matrices, cross-validation, and bootstrapping
Application: Analyzing clusters in Eurostat data
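Cross-validation, one of the ideas listed above, can be sketched in a few lines of base R. The example uses the built-in `mtcars` data, an arbitrary seed, and a hypothetical helper `cv_rmse()`; none of these come from the course materials. It estimates out-of-sample error for a simple and a much more flexible model so the two can be compared.

```r
set.seed(595)  # arbitrary seed so the fold assignment is reproducible

# Assign each row of mtcars to one of k folds at random
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

# Hypothetical helper: k-fold cross-validated RMSE for a model formula
cv_rmse <- function(formula) {
  errs <- sapply(1:k, function(f) {
    train <- mtcars[folds != f, ]   # fit on all folds except f
    test  <- mtcars[folds == f, ]   # evaluate on the held-out fold
    fit <- lm(formula, data = train)
    pred <- predict(fit, newdata = test)
    sqrt(mean((test$mpg - pred)^2))
  })
  mean(errs)
}

# A simple model versus a high-degree polynomial in the same predictor
rmse_simple   <- cv_rmse(mpg ~ wt)
rmse_flexible <- cv_rmse(mpg ~ poly(wt, 8))

print(c(simple = rmse_simple, flexible = rmse_flexible))
```

The flexible model always fits the training folds at least as closely; whether it also predicts the held-out folds better is exactly what the cross-validated error is meant to reveal.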

Week 13: Dynamical Systems

Introduction to dynamical systems and complex models
Directed acyclic graphs (DAGs)
Agent-based modeling
Applications: Schelling model of urban segregation, Ising model of opinion polarization, and cellular automata

Weeks 14-15: Student Presentations

Students present their final projects.