Project Overview
This project forecasts New York City taxi demand from historical NYC Taxi and Limousine Commission data. It applies time series analysis techniques in R, covering stationarity testing, autocorrelation analysis, and exponential smoothing (ETS) modeling.
Business Impact: Accurate taxi demand forecasting helps optimize fleet management, reduce wait times, and improve overall transportation efficiency in urban environments.
Introduction
Problem Statement: New York City's taxi industry faces challenges in matching supply with fluctuating demand. Accurate forecasting enables better resource allocation and improved customer service.
Dataset: The dataset contains 10,320 observations of taxi demand recorded at 30-minute intervals (48 per day, or 215 days in total). Each observation has a timestamp and a demand value, ranging from 8 to 39,197 rides per interval.
Methodology: Implemented in R using tidyverse for data manipulation, the forecast package for ETS modeling, and tseries for stationarity testing.
Data Preprocessing
Data Transformation
- Converted timestamp strings to datetime format
- Aggregated 30-minute interval data to daily level for trend analysis
- Converted the aggregated series to a time series object with an explicit frequency setting
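The three steps above can be sketched in base R as follows. This is a minimal illustration on synthetic data, not the project's actual script: the column names `timestamp` and `value`, the start date, and the weekly `frequency = 7` are all assumptions made for the example.

```r
# Synthetic stand-in for the raw 30-minute data.
# Column names (`timestamp`, `value`) and the start date are assumptions.
set.seed(1)
n_days <- 14
raw <- data.frame(
  timestamp = format(
    seq(as.POSIXct("2024-01-01 00:00:00", tz = "UTC"),
        by = "30 min", length.out = 48 * n_days),
    "%Y-%m-%d %H:%M:%S"
  ),
  value = rpois(48 * n_days, lambda = 15000),
  stringsAsFactors = FALSE
)

# 1. Parse the timestamp strings into date-time objects.
raw$timestamp <- as.POSIXct(raw$timestamp,
                            format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# 2. Sum the 48 half-hourly readings of each calendar day.
raw$date <- as.Date(raw$timestamp, tz = "UTC")
daily <- aggregate(value ~ date, data = raw, FUN = sum)

# 3. Wrap the daily totals in a ts object.
#    frequency = 7 (a weekly cycle) is an assumed setting.
daily_ts <- ts(daily$value, frequency = 7)

print(head(daily))
```

The same steps could be written with `lubridate::ymd_hms()` and `dplyr::group_by()` / `summarise()`; base R is used here only to keep the sketch free of dependencies.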
Key Libraries Used
# Loading essential R libraries
library(tidyverse) # Data manipulation and visualization
library(lubridate) # Date-time operations
library(forecast) # Time series forecasting
library(tseries) # Stationarity testing
library(zoo) # Time series objects
Exploratory Data Analysis
The time series visualization revealed important patterns in NYC taxi demand:
- Long-term trends with demand fluctuations over time
- Seasonal patterns with repeating high and low demand periods
- Short-term spikes potentially due to external factors like weather or events
Visualization: Created comprehensive line plots using ggplot2 to identify trends, seasonality, and anomalies in the daily aggregated data.
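A line plot of this kind might look like the sketch below. The data frame is synthetic and its column names (`date`, `value`) are hypothetical; only the plotting calls are meant to carry over.

```r
library(ggplot2)

# Synthetic daily totals standing in for the aggregated taxi data.
set.seed(2)
daily <- data.frame(
  date  = seq(as.Date("2024-01-01"), by = "day", length.out = 60),
  value = 700000 + cumsum(rnorm(60, sd = 20000))
)

# One line per series; trend, repeating patterns, and spikes
# are usually easiest to spot on an unsmoothed line plot.
p <- ggplot(daily, aes(x = date, y = value)) +
  geom_line() +
  labs(
    title = "Daily taxi demand (synthetic data)",
    x = "Date",
    y = "Rides per day"
  )

print(p)
```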
Stationarity Testing
Augmented Dickey-Fuller (ADF) Test
- Null Hypothesis: Time series is non-stationary
- Result: p-value = 0.0334
- Conclusion: Reject the null hypothesis at the 5% significance level (0.0334 < 0.05); the series is treated as stationary
KPSS Test
- Null Hypothesis: Time series is stationary
- Result: p-value > 0.1
- Conclusion: Fail to reject null hypothesis - series is stationary
Key Finding: The two tests agree. The ADF test rejects non-stationarity and the KPSS test does not reject stationarity, so the series was modeled without differencing or other transformations.
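As a rough sketch, the two tests can be run with `adf.test()` and `kpss.test()` from tseries. The series below is a simulated AR(1) process of length 215 (10,320 / 48 days), used only as a placeholder, so its p-values will not match the ones reported above.

```r
library(tseries)

# Simulated stationary AR(1) series standing in for daily demand.
set.seed(3)
y <- as.numeric(arima.sim(model = list(ar = 0.5), n = 215))

# ADF: null hypothesis = non-stationary (unit root).
adf <- adf.test(y, alternative = "stationary")

# KPSS: null hypothesis = stationary around a constant level.
kpss <- kpss.test(y, null = "Level")

sig_level <- 0.05  # assumed significance level
adf_rejects_unit_root     <- adf$p.value < sig_level
kpss_rejects_stationarity <- kpss$p.value < sig_level

cat("ADF p-value: ", adf$p.value, "\n")
cat("KPSS p-value:", kpss$p.value, "\n")
```

Note that tseries interpolates KPSS p-values from a table covering 0.01 to 0.1 and warns when the true value falls outside that range, which is why a result is reported as "p-value > 0.1" rather than as an exact number.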
Autocorrelation Analysis
ACF Plot Analysis
- Showed how current values correlate with past values at different lags
- Revealed significant autocorrelation at multiple lag periods
- Helped identify the memory structure of the time series
PACF Plot Analysis
- Measured direct correlations by removing intermediate lag effects
- Provided insights into the appropriate model order
- Supported the ETS model selection process
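The ACF and PACF values behind such plots can be computed with the base R functions `acf()` and `pacf()`. The sketch below uses a simulated series and an assumed maximum lag of 28 days; it flags lags whose correlation exceeds the usual approximate 95% white-noise bound.

```r
# Simulated AR(1) series standing in for daily demand.
set.seed(4)
y <- as.numeric(arima.sim(model = list(ar = 0.7), n = 215))

# plot = FALSE returns the values; drop it to draw the plots instead.
acf_out  <- acf(y, lag.max = 28, plot = FALSE)
pacf_out <- pacf(y, lag.max = 28, plot = FALSE)

# Approximate 95% bound under the white-noise assumption.
bound <- qnorm(0.975) / sqrt(length(y))

# acf() includes lag 0 (always 1), so drop it; pacf() starts at lag 1.
sig_acf_lags  <- which(abs(acf_out$acf[-1]) > bound)
sig_pacf_lags <- which(abs(pacf_out$acf) > bound)

cat("Significant ACF lags: ", sig_acf_lags, "\n")
cat("Significant PACF lags:", sig_pacf_lags, "\n")
```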
ETS Model Implementation
Model Selection: ETS(A,N,N)
- Error: Additive (A)
- Trend: None (N)
- Seasonality: None (N)
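A model of this form can be fitted with `ets()` from the forecast package, as in the sketch below. The data are synthetic, and passing `model = "ANN"` forces the ETS(A,N,N) form; the write-up does not state whether the original model was specified this way or chosen by the automatic selection that `ets()` performs when no model is given.

```r
library(forecast)

# Synthetic daily totals, not the project data.
set.seed(5)
y <- ts(700000 + cumsum(rnorm(215, sd = 80000)), frequency = 7)

# "ANN" = Additive error, No trend, No seasonality
# (simple exponential smoothing with additive errors).
fit <- ets(y, model = "ANN")

print(summary(fit))

# The estimated smoothing parameter.
alpha_hat <- fit$par["alpha"]
```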
Model Parameters
# ETS Model Summary
ETS(A,N,N)

  Smoothing parameters:
    alpha = 0.9999

  Initial states:
    l = 685786.821

  sigma: 79791.59
Results & Performance
Model Evaluation Metrics
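- RMSE: 79,419.6 rides per day - Root mean squared error; the typical magnitude of the prediction error
- MAPE: 8.93% - Mean absolute percentage error; forecasts deviate from actual demand by about 9% on average
- AIC: 6012.162 - Model quality indicator for comparing candidate models fitted to the same data (lower is better)
- ACF1: 0.0887 - Lag-1 autocorrelation of the residuals; a value close to zero suggests little structure is left unexplained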
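For reference, the sketch below shows how RMSE, MAPE, and ACF1 are computed from actual and fitted values. The four numbers are made up for illustration. In practice `accuracy()` from the forecast package reports these measures for a fitted model directly.

```r
# Made-up actual and fitted daily totals, for illustration only.
actual <- c(700000, 650000, 720000, 810000)
fitted <- c(690000, 700000, 650000, 720000)

res <- actual - fitted

# Root mean squared error: same units as the data (rides per day).
rmse <- sqrt(mean(res^2))

# Mean absolute percentage error: scale-free, in percent.
mape <- mean(abs(res / actual)) * 100

# Lag-1 autocorrelation of the residuals (element 1 is lag 0).
acf1 <- acf(res, lag.max = 1, plot = FALSE)$acf[2]

cat("RMSE:", rmse, "\n")
cat("MAPE:", mape, "%\n")
cat("ACF1:", acf1, "\n")
```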
30-Day Forecast
Generated a 30-day forecast using the trained ETS model, providing valuable insights for:
- Fleet management and resource allocation
- Driver scheduling optimization
- Demand anticipation for special events
- Infrastructure planning
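A 30-day forecast from a fitted ETS(A,N,N) model could be produced roughly as follows. The series is synthetic and the 80% and 95% interval levels are assumed defaults, so the numbers are not the project's results.

```r
library(forecast)

# Synthetic daily totals, not the project data.
set.seed(6)
y <- ts(700000 + 50000 * rnorm(215), frequency = 7)

fit <- ets(y, model = "ANN")

# h = 30 steps ahead, with 80% and 95% prediction intervals.
fc <- forecast(fit, h = 30, level = c(80, 95))

fc_table <- data.frame(
  day   = 1:30,
  point = as.numeric(fc$mean),
  lo95  = as.numeric(fc$lower[, 2]),
  hi95  = as.numeric(fc$upper[, 2])
)

print(head(fc_table))
```

With no trend or seasonal component, the point forecast is the same for every horizon; what changes with the horizon is the width of the prediction interval, which is the more useful quantity for planning.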
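Interpretation: The smoothing parameter (alpha = 0.9999) is close to 1, so the estimated level tracks the most recent observation almost exactly and the model reacts quickly to changes in demand. Because ETS(A,N,N) has no trend or seasonal component, its point forecasts are flat at the last estimated level, which here is approximately the last observed daily total.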
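To see what an alpha this close to 1 means in practice, the level recursion of simple exponential smoothing, l_t = alpha * y_t + (1 - alpha) * l_(t-1), can be written out by hand. The helper `ses_level()` and the five observations below are hypothetical; alpha and the initial level are taken from the model summary above.

```r
# Hypothetical helper: run the SES level recursion over a series
# and return the final level.
ses_level <- function(y, alpha, l0) {
  level <- l0
  for (obs in y) {
    level <- alpha * obs + (1 - alpha) * level
  }
  level
}

# Made-up daily totals, for illustration only.
y <- c(700000, 650000, 720000, 810000, 760000)

last_level <- ses_level(y, alpha = 0.9999, l0 = 685786.821)

# ETS(A,N,N) point forecasts are the last level at every horizon.
fc <- rep(last_level, 30)

cat("Last observation:", tail(y, 1), "\n")
cat("Final level:     ", last_level, "\n")
```

The final level lands within a few rides of the last observation, so the forecast behaves almost like a naive "tomorrow equals today" rule.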