ECON 595 | Data Science for Social Scientists
Course Description
This course aims at introducing data science to graduate students. Data science is an interdisciplinary field focused on extracting knowledge from typically large data sets. It encompasses preparing data for analysis, formulating data science problems, analyzing data, detecting patterns, summarizing the results, and visualizing the main findings. It may involve both quantitative and qualitative data depending on the data problem. At the graduate level, data science and computational social science should balance core computational skills, statistical modeling, and domain-specific applications. This course will cover key computational techniques, including statistical modeling and visualization, and explore novel applications in social sciences and heterodox economics, such as input-output analysis as networks and data-driven dynamical systems. Some topics in statistical modeling covered may overlap with ECON 529 - Research Methods I: Econometrics.
Previously known as the “boring and menial” discipline of statistics, this challenging and rewarding field has gained popularity under its flashy re-branding as data science in the 1990s, recently hailed as the “sexiest job of the 21st century” by the Harvard Business Review. Yet, data science is a double-edged sword: on one hand, it is a tool for power that can be employed for mass surveillance, policing, or military applications. On the other hand, data science can also be wielded against power by those who seek to expose injustice, challenge corporate monopolies, and advocate for transparency, equality, and social change.
This course is based on the R language. The first five weeks provide an introduction to R, covering fundamental data management and wrangling skills. While no prior programming knowledge is assumed, students with previous experience in R or other languages will find some aspects easier. However, beginners should not feel disadvantaged, as the course is designed to build programming skills from the ground up.
The course is designed as a hands-on, practice-oriented workshop, seeking the active engagement of students in the learning process. My presentations and teaching notes will be shared for your reference.
Course Objectives
- Develop proficiency in using R for data science applications, including data wrangling, visualization, and statistical modeling.
- Understand how to implement data science research designs across a variety of settings.
- Acquire the skills to design and complete independent and collaborative data science projects.
- Critically engage with empirical research by assessing the strengths and limitations of statistical and computational methods.
- Learn to communicate analytical results clearly and appropriately in both written and oral formats.
- Gain exposure to modern computational tools, including machine learning, Quarto for reproducible research, and GitHub for version control.
Evaluation
- 35% Mid-Term Assignment: Students will complete a mid-term assignment at the end of the first 6 weeks. A data set of the student’s choice is to be loaded, cleaned, parsed, analyzed, and visualized.
- 55% Final Project: At the end of the course, students must have completed a final project based on the methods learned in the last 10 weeks. The final project may take the form of a short research paper, web application, or web scraping project. The project must demonstrate the application of data science techniques to a meaningful research question.
- 10% Participation: Active engagement in class discussions and project presentations.
Late assignments will be penalized by 10% per day unless prior arrangements have been made.
Grading Scale
A, A– (4.0, 3.7) Excellent work
B+, B, B– (3.3, 3.0, 2.7) Work that is more than satisfactory
C+, C (2.3, 2.0) Competent work
C–, D (1.7, 1.0) Performance that is poor, but deserving of credit
F Failure to reach the standard required in the course for credit
Software
R is open-source statistical software developed by and for statisticians. As an open-source tool, it is widely used in academia and industry. This course employs RStudio as the primary integrated development environment (IDE) for R.
Additionally, we will use Quarto for documentation and GitHub for version control. Quarto, built on Markdown, facilitates code documentation and reproducible research. GitHub enables collaborative coding, version control, and project management.
Plagiarism
Academic integrity is taken seriously. Any acts of plagiarism or dishonesty will result in academic probation.
AI Policy
The use of generative AI models is encouraged as an essential tool for modern data scientists. AI can assist with coding, research design, output interpretation, and complex computational tasks.
Readings
Required
- Wickham, H., & Grolemund, G. (2017). R for Data Science: Visualize, Model, Transform, Tidy and Import Data. O’Reilly Media. [Rdata]
- Wickham, H., Cetinkaya-Rundel, M., & Grolemund, G. (2023). R for Data Science: Import, Tidy, Transform, Visualize and Model Data. O’Reilly Media. [Rdata2]
- Dalgaard, P. (2008). Introductory Statistics with R. Springer Publishing. [IS]
- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2023). An Introduction to Statistical Learning. Springer. [ISL]
Optional [if interested in a particular topic]
Statistics
- Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (1995). Bayesian data analysis. Chapman and Hall/CRC.
- McElreath, R. (2018). Statistical rethinking: A Bayesian course with examples in R and Stan. Chapman and Hall/CRC.
- Golan, A. (2017). Foundations of info-metrics: Modeling, inference, and imperfect information. Oxford University Press.
Networks
- Sargent, T. J., & Stachurski, J. (2024). Economic Networks: Theory and Computation. Cambridge University Press.
- Jackson, M. O. (2008). Social and Economic Networks. Princeton University Press.
- Klärner, A., Gamper, M., Keim-Klärner, S., Moor, I., von der Lippe, H., & Vonneilich, N. (2022). Social Networks and Health Inequalities: A New Perspective for Research. Springer Nature.
Machine Learning and Dynamical Systems
- Lantz, B. (2019). Machine learning with R: expert techniques for predictive modeling. Packt Publishing Ltd.
- Brunton, S. L., & Kutz, J. N. (2022). Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control. Cambridge University Press.
- Strogatz, S. H. (2014). Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chemistry, and Engineering. CRC Press.
Tentative Course Schedule
Part One: Foundations
Week 1: Introduction
- The very basics of data science. Setup, software installation, R and Markdown syntax. GitHub and AI workflow.
Weeks 2-3: Programming in R
READING Rdata, ch. 19, 21
- Data structures: Vectors, Data Frames, Matrices, Arrays, Lists
- Control structures: Conditionals, loops, apply family of functions
- Functions and modular programming
- Introduction to functional programming using the purrr package
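As a rough illustration of how the control structures and the apply family listed above relate, the following sketch computes the same group means three ways. The data and the helper `rescale()` are hypothetical and chosen only for illustration; the example uses base R, with the purrr equivalent shown as a comment.

```r
# Hypothetical data: three numeric vectors of different lengths.
samples <- list(a = c(1, 2, 3), b = c(10, 20), c = c(5, 5, 5, 5))

# 1. Explicit loop with a preallocated, named result vector.
means_loop <- numeric(length(samples))
names(means_loop) <- names(samples)
for (nm in names(samples)) {
  means_loop[nm] <- mean(samples[[nm]])
}

# 2. apply family: vapply() applies mean() to each element and checks
#    that every result is a single numeric value.
means_vapply <- vapply(samples, mean, numeric(1))

# With purrr installed, the equivalent call would be:
# means_purrr <- purrr::map_dbl(samples, mean)

# 3. A small reusable function (hypothetical helper), applied pairwise
#    over the data and the means with Map().
rescale <- function(x, centre) x - centre
centred <- Map(rescale, samples, means_vapply)

print(means_vapply)
print(centred$a)
```

The loop and the `vapply()` call give the same result; the functional version is shorter and states the expected return type up front.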
Week 4: Wrangling Data
READING Rdata, ch. 10-15
- Importing datasets (Excel, CSV, RDS)
- Data cleaning and transformation
- Parsing data with the tidyverse pipeline
- Overview of key packages: dplyr, tidyr, forcats, lubridate
Week 5: Exploring and Visualizing Data
READING IS, ch. 4; Rdata, ch. 3
- Exploratory Data Analysis (EDA)
- Summary statistics (summary(), skimr::skim())
- Handling missing values (na.omit(), mice)
- Correlation analysis and pairwise scatter plots
- Data transformations (log, standardization, etc.)
- Data visualization in base R: histograms, bar plots, time series, scatter plots
- Exporting figures
- Data visualization using ggplot2: geom elements, aesthetics, annotations, faceting, coordinate systems, and themes
Weeks 6-7: Statistical Modeling
READING ISL, ch. 3
- Introduction to regression models: linear and generalized linear models (GLMs)
- Interaction effects, model interpretation, and visualization
- Model selection: goodness-of-fit measures, information criteria
- Bayesian statistics: introduction to Bayesian inference and entropy maximization
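To give a flavour of the regression topics above, here is a minimal sketch that fits a linear model with and without an interaction term, fits a logistic GLM, and compares the two linear models by AIC. It uses the `mtcars` data set that ships with R as a stand-in; the choice of variables is an assumption made for illustration and is not taken from the course materials.

```r
# mtcars is a built-in data set, so no packages are needed.

# Linear model with additive effects of weight and horsepower.
fit_additive <- lm(mpg ~ wt + hp, data = mtcars)

# Same predictors plus their interaction (wt * hp expands to wt + hp + wt:hp).
fit_interact <- lm(mpg ~ wt * hp, data = mtcars)

# A generalized linear model: logistic regression of transmission type on weight.
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)

# Model selection: compare the two linear models by an information criterion
# (lower AIC is preferred) and inspect a goodness-of-fit measure.
aic_table <- AIC(fit_additive, fit_interact)
print(aic_table)
print(summary(fit_interact)$adj.r.squared)
print(coef(fit_glm))
```

Both linear models share the same response and data, so their AIC values are directly comparable.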
Mid-Term Assignment
Part Two: Methods
Weeks 8-9: Network Analysis
- Introduction to graph theory and social network analysis
- Measuring network centrality and connectivity
- Application: Interpreting an input-output table as a network
- Application: The standard commodity as a measure of network centrality
- Mid-Term Assignment Due
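The idea of reading an input-output table as a weighted directed network can be sketched as follows. The three-sector coefficient matrix is entirely made up for illustration, and eigenvector centrality is used here as one common centrality measure; this is not the course's own treatment of the standard commodity.

```r
# Hypothetical 3-sector input coefficient matrix (values invented for illustration).
# A[i, j] is the input from sector i required per unit of output of sector j,
# so each non-zero entry is a weighted edge i -> j.
sectors <- c("agriculture", "industry", "services")
A <- matrix(c(0.2, 0.3, 0.1,
              0.1, 0.2, 0.4,
              0.1, 0.1, 0.2),
            nrow = 3, byrow = TRUE, dimnames = list(sectors, sectors))

# Weighted degree: how much each sector supplies to and buys from the others.
out_strength <- rowSums(A)  # total supplied per unit of each buyer's output
in_strength  <- colSums(A)  # total inputs purchased per unit of own output

# Eigenvector centrality: the right eigenvector of A for its dominant eigenvalue.
# A sector scores highly when it supplies sectors that are themselves central.
eig <- eigen(A)
dominant <- which.max(Mod(eig$values))
lambda <- Re(eig$values[dominant])
centrality <- abs(Re(eig$vectors[, dominant]))
centrality <- centrality / sum(centrality)  # normalise to sum to one
names(centrality) <- sectors

print(out_strength)
print(round(centrality, 3))
```

Because the matrix is non-negative and every sector is connected to every other, the dominant eigenvalue is real and positive and its eigenvector can be scaled to have all-positive entries.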
Week 10: API & Web Scraping, Text Analysis
READING Rdata2, ch. 16, 25
- Web scraping using the rvest package
- Regular expressions for text processing
- Application: Example of literature review analysis
- Citation networks and word clouds
Weeks 11-12: Machine Learning
READING ISL, ch. 2
- Supervised learning: decision trees, random forests, and support vector machines
- Unsupervised learning: k-means clustering, principal component analysis (PCA), and hierarchical clustering
- Understanding the variance-bias trade-off, over-fitting, confusion matrices, cross-validation, and bootstrapping
- Application: Analyzing clusters in Eurostat data
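The unsupervised methods listed above can be sketched in a few lines of base R. The built-in `USArrests` data set is used here purely as a stand-in for the Eurostat data, and the choice of three clusters is an assumption made for illustration.

```r
# Fix the random seed so the k-means starting points are reproducible.
set.seed(595)

# Standardize the variables so that no single scale dominates the distances.
X <- scale(USArrests)

# k-means with 3 clusters (an arbitrary choice) and several random restarts.
km <- kmeans(X, centers = 3, nstart = 25)

# Principal component analysis and the share of variance per component.
pca <- prcomp(X)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Hierarchical clustering on Euclidean distances, cut into 3 groups.
hc <- hclust(dist(X), method = "complete")
hc_groups <- cutree(hc, k = 3)

# Cross-tabulate the two partitions to see how far they agree.
print(round(var_explained, 3))
print(table(kmeans = km$cluster, hierarchical = hc_groups))
```

The cluster labels themselves are arbitrary, so agreement between the two methods shows up as large counts concentrated in a few cells of the table rather than on the diagonal.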
Week 13: Dynamical Systems
- Introduction to dynamical systems and complex models
- Directed acyclic graphs (DAGs)
- Agent-based modeling
- Applications: Schelling model of urban segregation, Ising model of opinion polarization, and cellular automata
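As a small taste of the cellular automata mentioned above, the sketch below simulates a one-dimensional elementary cellular automaton in base R. The function name `step_rule()`, the grid width, the number of steps, and the choice of Rule 90 are all assumptions made for illustration.

```r
# One update step of an elementary cellular automaton with periodic boundaries.
# `cells` is an integer vector of 0s and 1s; `rule` is the Wolfram rule number.
step_rule <- function(cells, rule = 90L) {
  n <- length(cells)
  left  <- cells[c(n, seq_len(n - 1))]  # left neighbour (wraps around)
  right <- cells[c(2:n, 1)]             # right neighbour (wraps around)
  # Encode each (left, centre, right) neighbourhood as a number from 0 to 7.
  idx <- 4L * left + 2L * cells + right
  # Bit k of the rule number gives the next state for neighbourhood k.
  rule_bits <- as.integer(intToBits(rule))[1:8]
  rule_bits[idx + 1L]
}

width <- 31L
steps <- 15L

# Start from a single live cell in the middle of the grid.
cells <- integer(width)
cells[16] <- 1L

# Record every generation as a row of the history matrix.
history <- matrix(0L, nrow = steps + 1L, ncol = width)
history[1, ] <- cells
for (t in seq_len(steps)) {
  cells <- step_rule(cells)
  history[t + 1L, ] <- cells
}

# Print the space-time diagram, one generation per line.
for (t in seq_len(nrow(history))) {
  cat(ifelse(history[t, ] == 1L, "#", "."), "\n", sep = "")
}
```

Under Rule 90 each cell becomes the exclusive-or of its two neighbours, so a single live cell grows into a Sierpinski-triangle pattern: a simple local rule producing complex global structure.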
Weeks 14-15: Student Presentations
- Students present their final projects.