View Code

Online Retail Business Intelligence

A comprehensive data warehouse and BI solution for UK-based online retail analysis using Python ETL and Power BI

Timeline 2 Months
Status Completed
Python Power BI PostgreSQL ETL Data Modeling Pandas Snowflake Schema OLAP

Project Overview

A comprehensive Business Intelligence solution for a UK-based online retail company, developing a data-driven solution that enables insightful analysis and informed decision-making. The project provides valuable insights into customer purchasing behavior, product performance, and sales trends across different regions and time periods.

Business Needs: Assist in deciding which country to expand into brick-and-mortar operations and identify which products to market more effectively.

Goals & Deliverables

Goals:

  • Data Warehouse Implementation: Create a robust data warehouse that facilitates efficient storage, retrieval, and analysis of transactional data
  • Enhanced Decision-Making: Provide actionable insights to stakeholders by analyzing historical sales data
  • Performance Monitoring: Track sales performance across various dimensions such as products, customers, and time

Deliverables:

  • A comprehensive data model
  • Implementation of an ETL process to populate the data warehouse
  • Interactive dashboards using Power BI for visualization

Implementation Phases

1. Data Collection & Preprocessing

  • Collected raw transactional data from two sources including invoices, product details, and customer information
  • Conducted data cleaning by removing inconsistencies, handling missing values, and standardizing formats
  • Addressed missing Customer ID values by filling them with unique identifiers

2. ETL Process Implementation

  • Extraction: Extracted data from Online_Retail and Customer Excel files
  • Transformation: Data type transformation, splitting date columns, data derivation (calculating total revenue)
  • Loading: Manual loading to PostgreSQL via CSV files

3. Data Modeling

  • Designed a Snowflake schema with central fact table (order_fact)
  • Dimension tables: customer_dim, customer_address_dim, time_dim, invoice_info_dim, product_dim, product_category_dim, product_info_dim
  • Identified key relationships for efficient querying and reporting

Snowflake Schema Diagram

Snowflake Schema Diagram

Data warehouse schema showing fact and dimension tables relationships

4. OLAP Implementation

  • Implemented using Power BI with data sourced from PostgreSQL
  • Used Import Mode following MOLAP (Multidimensional OLAP) model
  • Preloaded data into Power BI's in-memory engine for optimal performance

5. Dashboard Development

  • Built Power BI dashboards to visualize key performance indicators
  • Implemented drill-down capabilities for time analysis (year → month → day)
  • Added filtering capabilities for gender and country analysis

Technical Implementation

The ETL process was implemented using Python in Google Colab, handling data transformation and preparation:

# Key Data Processing Steps
import pandas as pd
import numpy as np

# Data extraction
customer = pd.read_excel('customers.xlsx')
online_retail = pd.read_excel('onlineretail.xlsx')

# Data cleaning and transformation
# Handling missing values, removing duplicates
# Data type conversions and standardization
# Creating derived columns (total revenue, profit calculations)

# Data modeling - Creating dimension tables
customer_dimension = joined_data[['Customer ID', 'Country', 'firstName', 'lastName', 'gender']]
product_dimension = joined_data[['StockCode', 'Description', 'Price']]
time_dimension = pd.DataFrame(index=pd.to_datetime(invoice_dates))

# Creating fact table with business metrics
order_fact['Total_Price'] = order_fact['Quantity'] * order_fact['Selling Price']
order_fact['Total_Profit'] = order_fact['Quantity'] * order_fact['Unit Profit']

Results & Impact

541,909
Transactions
4,372
Customers
100%
Data Quality
  • Successfully processed 541,909 transactions from UK-based online retailer
  • Analyzed data from 4,372 unique customers across multiple dimensions
  • Implemented complete data quality assurance with proper handling of missing values
  • Created comprehensive product categorization system with 15+ categories
  • Enabled real-time business intelligence through interactive Power BI dashboards

Challenges & Future Enhancements

Challenges Addressed:

  • Data Quality Issues: Extensive preprocessing required for missing and inconsistent data
  • Customer ID Missing Values: Solved by filling with unique identifiers
  • Data Integration: Successfully merged multiple data sources with proper mapping

Possible Enhancements:

  • Automating the ETL process for better scheduling and monitoring
  • Implementing HOLAP (Hybrid OLAP) for balance between storage efficiency and query performance
  • Enhancing dashboards with predictive analytics using machine learning models
  • Expanding data warehouse to include social media and customer feedback data for sentiment analysis