Skip to content

MrKoyee/Chocolate-Sales-Data-Analysis-Using-SQL

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

3 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

🍫 Chocolate Sales Data Analysis Using SQL

Table of Contents

πŸ“– Project Overview

This project analyzes chocolate sales data using SQL and PostgreSQL to generate business insights related to sales performance, customer behavior, product trends, and store profitability. Queries were developed and executed in Visual Studio Code through a PostgreSQL database connection.

The analysis demonstrates a range of SQL techniques, including data exploration, joins, aggregations, filtering, subqueries, and KPI development. These methods were used to uncover patterns in the data and support data-driven business decision-making.

πŸ“Š Executive Summary

This analysis examined 1 million chocolate sales transactions across multiple products, brands, stores, and customers. The study revealed strong overall revenue performance exceeding $25M, with Praline products leading in sales volume and Ferrero generating the highest profit. Pricing across products was highly standardized, while store performance varied significantly by location, with Airport stores and Toronto-based stores showing strong results.

🎯 Project Objectives

  • Analyze overall sales performance.
  • Identify top-performing products.
  • Evaluate customer purchasing behavior.
  • Assess store and location performance.
  • Measure revenue and profitability.
  • Demonstrate core SQL analysis skills.

πŸ› οΈ Tools & Technologies

Tools

  • VS Code
  • SQLTools Extension
  • SQLTools PostgreSQL Driver
  • Bash
  • Git & GitHub

Technologies

  • PostgreSQL
  • SQL
  • CSV

πŸ“‚ Data Structure

This project is built on a relational database consisting of five interconnected tables: Calendar, Customers, Products, Sales, and Stores. The schema follows a star-like structure where the Sales table serves as the central fact table, linked to multiple dimension tables.

πŸ“… Calendar Table

Column Name Description
date Transaction or calendar date
year Year extracted from date
month Month extracted from date
day Day of month
week Week number of the year
day_of_week Day index (Mon–Sun)

πŸ‘€ Customers Table

Column Name Description
customer_id Unique customer identifier
gender Customer gender
customer_type Loyalty status (loyal/non-loyal)

πŸ“¦ Products Table

Column Name Description
product_id Unique identifier for each product
product_name Name of chocolate product
brand Manufacturer/brand name
category Product category (Milk, Dark, etc.)
cocoa_percentage Cocoa content of the chocolate
weight Product weight

🧾 Sales Table

Column Name Description
sale_id Unique transaction identifier
product_id Links to Products table
customer_id Links to Customers table
store_id Links to Stores table
date Transaction date
quantity Number of units sold
unit_price Price per unit at time of sale

πŸͺ Stores Table

Column Name Description
store_id Unique identifier for store
store_name Name of the store
store_city City where store is located
store_country Country of the store

πŸ—‚οΈ Database Setup

Create Database

-- Create database (run once)
CREATE DATABASE choco_sales;

Create Tables

Five separate tables were created to store information related to dates, customers, products, sales transactions, and store locations.

  • Calendar Table: Stores date-related attributes such as year, month, day, and week.

    CREATE TABLE calendar_table (
        date DATE,
        year INT,
        month INT,
        day INT,
        week INT,
        day_of_week INT
    );
  • Customers Table: Contains customer demographic and membership information.

    CREATE TABLE customers_table (
        customer_id VARCHAR(50),
        age INT,
        gender VARCHAR(50),
        loyalty_member INT,
        join_date DATE
    );
  • Products Table: Holds product information, including brand, category, weight, and cocoa content details.

    CREATE TABLE products_table (
        product_id VARCHAR(50),
        product_name VARCHAR(50),
        brand VARCHAR(50),
        category VARCHAR(50),
        cocoa_percent INT,
        weight_g INT
    );
  • Sales Table: Serves as the fact table containing transaction-level sales data.

    CREATE TABLE sales_table (
        order_id VARCHAR(50),
        order_date DATE,
        product_id VARCHAR(50),
        store_id VARCHAR(50),
        customer_id VARCHAR(50),
        quantity INT,
        unit_price DECIMAL(10,2),
        discount DECIMAL(5,2),
        revenue DECIMAL(12,2),
        cost DECIMAL(12,2),
        profit DECIMAL(12,2)
    );
  • Stores Table: Contains information about store locations and related attributes.

    CREATE TABLE stores_table (
        store_id VARCHAR(50),
        store_name VARCHAR(50),
        city VARCHAR(50),
        country VARCHAR(50),
        store_type VARCHAR(50)
    );

Load Data (ETL Process)

The original datasets were provided in CSV format and were successfully imported into their respective PostgreSQL tables using the \copy command in the VS Code terminal.

\copy calendar_table FROM '/Users/segun/Desktop/Data Analysis/SQL/30-days/Chocolate Sales Dataset 2023 - 2024 (1)/calendar.csv' WITH (FORMAT csv, HEADER true, DELIMITER ',', ENCODING 'UTF8');

\copy customers_table FROM '/Users/segun/Desktop/Data Analysis/SQL/30-days/Chocolate Sales Dataset 2023 - 2024 (1)/customers.csv' WITH (FORMAT csv, HEADER true, DELIMITER ',', ENCODING 'UTF8');

\copy products_table FROM '/Users/segun/Desktop/Data Analysis/SQL/30-days/Chocolate Sales Dataset 2023 - 2024 (1)/products.csv' WITH (FORMAT csv, HEADER true, DELIMITER ',', ENCODING 'UTF8');

\copy sales_table FROM '/Users/segun/Desktop/Data Analysis/SQL/30-days/Chocolate Sales Dataset 2023 - 2024 (1)/sales.csv' WITH (FORMAT csv, HEADER true, DELIMITER ',', ENCODING 'UTF8');

\copy stores_table FROM '/Users/segun/Desktop/Data Analysis/SQL/30-days/Chocolate Sales Dataset 2023 - 2024 (1)/stores.csv' WITH (FORMAT csv, HEADER true, DELIMITER ',', ENCODING 'UTF8');

The \copy command is used to load data from local files into PostgreSQL tables and is executed from the client side (in this case, the terminal via VS Code). This makes it particularly useful for importing CSV files in local development environments, as it avoids server-side file path restrictions associated with the standard COPY command.


Data Validation

Data validation was conducted on each table to ensure the datasets were loaded correctly and were ready for analysis.

-- Row count verification
  • Calendar Table

    SELECT COUNT(*) 
    FROM calendar_table;

  • Customers Table

    SELECT COUNT(*) 
    FROM customers;

  • Product Table

    SELECT COUNT(*) 
    FROM products_table;

  • Sales Table

    SELECT COUNT(*) 
    FROM sales_table;

  • Stores Table

    SELECT COUNT(*) 
    FROM stores_table;

-- NULL value checks
  • Calendar Table

    SELECT *
    FROM calendar_table
    WHERE date IS NULL
    OR year IS NULL
    OR month IS NULL
    OR day IS NULL
    OR week IS NULL
    OR day_of_week IS NULL;

  • Customers Table

    SELECT * 
    FROM customers_table
    WHERE customer_id IS NULL;

  • Product Table

    SELECT *
    FROM products_table
    WHERE product_id IS NULL;

  • Sales Table

    SELECT *
    FROM sales_table
    WHERE customer_id IS NULL OR product_id IS NULL;

  • Stores Table

    SELECT * 
    FROM stores_table
    WHERE store_id IS NULL;

Data Exploration

  • Examined the structure of the datasets
  • Identified and validated relationships between tables
  • Assessed data consistency across all tables

Data Analysis

  • Revenue by Products

    SELECT
        product_name,
        SUM(revenue) AS total_revenue
    FROM products_table
    LEFT JOIN sales_table ON sales_table.product_id = products_table.product_id
    GROUP BY
        product_name
    ORDER BY revenue DESC;
  • Top 5 Products by Quantity Sold

      SELECT 
          product_name,
          SUM(quantity) AS quantity
      FROM products_table
      LEFT JOIN sales_table ON sales_table.product_id = products_table.product_id
      GROUP BY
          product_name
      ORDER BY quantity DESC
      LIMIT 5;

πŸ—ƒοΈ Data Model & Relationships

  • Products β†’ Sales (1-to-many)
  • Customers β†’ Sales (1-to-many)
  • Stores β†’ Sales (1-to-many)
  • Calendar β†’ Sales (1-to-many via date)

🧠 SQL Concepts Demonstrated

This project demonstrates the application of key SQL techniques for data analysis and business intelligence, including:

Data exploration

SELECT 

Data aggregation

 SUM()
 AVG()
 COUNT()
 MIN()
 MAX()

Grouping data

 GROUP BY 

Filtering aggregated results

  HAVING

Conditional filtering

  WHERE

Table joins

  INNER JOIN
  LEFT JOIN

Limiting results

  LIMIT

Conditional logic

    CASE
        WHEN    THEN
        ELSE
    END

Removing duplicates

    DISTINCT

Column aliasing

    AS

Subqueries for comparative analysis

  • Nested SELECT statements
      SELECT 
          product_name,
          AVG(unit_price) AS avg_unit_price
      FROM products_table p
      JOIN sales_table s 
          ON s.product_id = p.product_id
      GROUP BY product_name
      HAVING AVG(unit_price) > (
          SELECT AVG(unit_price)
          FROM sales_table
      )
      ORDER BY avg_unit_price DESC;   
  • Sorting results
    ORDER BY
  • Data validation
    COUNT(*)

πŸ“Š Business Questions with Results

This section presents twenty (20) business questions addressed using SQL queries. While the queries are documented here for reference, the actual results were executed and generated in PostgreSQL.

1. How many total orders are recorded in the sales table?

SELECT COUNT(*) AS total_orders
FROM sales_table;

Result: The sales table contains 1,000,000 total orders.


2. List all unique product categories from the products table.

SELECT DISTINCT category
FROM products_table;

Result: The dataset contains 5 unique product categories: Milk, Praline, Dark, White, and Truffle.


3. How many customers are registered in the dataset?

SELECT count(customer_id)
FROM customers_table;

Result: The dataset contains 50,000 registered customers.


4. List all stores and the cities they are located in.

SELECT 
    store_name,
    city
FROM stores_table;

Result: The dataset includes 100 stores located across 7 cities.


5. Which products have a cocoa percentage greater than 70%?

SELECT 
    product_id,
    product_name,
    cocoa_percent
FROM products_table
WHERE cocoa_percent > 70
ORDER BY
    cocoa_percent DESC;

Result: There are 76 products with a cocoa percentage greater than 70%.


6. How many products does each brand offer? List brands and their product count.

SELECT 
    brand,
    count(product_name) AS product_count
FROM products_table
GROUP BY brand
ORDER BY
    product_count DESC;
brand product_count
Cadbury 37
Ferrero 37
Lindt 35
Mars 33
Godiva 30
Hershey 28

Result: Cadbury and Ferrero had the widest product portfolios, with 37 products each.


7. Find all sales where a discount was applied.

SELECT 
    order_id, 
    product_name,
    discount
FROM sales_table
LEFT JOIN products_table ON products_table.product_id = sales_table.product_id
WHERE discount > 0.00
ORDER BY discount;

Result: Out of 1 million orders, 374,872 received discounts.


8. What is the total revenue generated across all sales?

SELECT
    SUM(revenue) AS revenue
FROM sales_table;

Result: Total revenue generated across all sales was $25,486,128.86.


9. What is the total profit made from all orders?

SELECT
    SUM(profit) AS total_profit
FROM sales_table;

Result: Total profit across all orders was $10,194,564.63.


10. What is the minimum and maximum profit from a single order?


SELECT
    MIN(profit) AS min_profit,
    MAX(profit) AS max_profit
FROM sales_table;

Result:

  • Minimum profit per order: $0.73
  • Maximum profit per order: $37.43

11. What is the total quantity sold per product category?

SELECT 
    category,
    SUM(quantity) AS total_quantity
FROM products_table
LEFT JOIN sales_table ON sales_table.product_id = products_table.product_id
GROUP BY category
ORDER BY
    total_quantity DESC;
category total_quantity
Praline 784,435
White 714,967
Dark 623,247
Truffle 462,202
Milk 385,619

Result: Praline products recorded the highest total quantity sold, while Milk products had the lowest.


12. Show the store name, city, and total revenue for each store.

SELECT 
    store_name,
    city AS store_city,
    SUM(sales_table.revenue) AS total_revenue
FROM stores_table
LEFT JOIN sales_table ON sales_table.store_id = stores_table.store_id
GROUP BY
    store_name,
    city
ORDER BY total_revenue DESC;

Result Preview: Top 5 stores by total revenue.

store_name store_city total_revenue
Chocolate Store 74 Sydney 261,393.77
Chocolate Store 33 Toronto 260,672.37
Chocolate Store 50 New York 259,526.62
Chocolate Store 85 Melbourne 259,512.15
Chocolate Store 98 New York 259,055.13

13. Which brand generates the highest total profit?

SELECT 
    brand,
    SUM(profit) AS highest_profit
FROM sales_table
LEFT JOIN products_table ON sales_table.product_id = products_table.product_id
WHERE profit > 0
GROUP BY
    brand
ORDER BY highest_profit DESC
LIMIT 1;

Result: Ferrero generated the highest total profit, amounting to $1,876,268.09.


14. Which country generates the most total revenue?

SELECT 
    country,
    SUM(revenue) AS total_revenue
FROM sales_table
LEFT JOIN stores_table ON sales_table.store_id = stores_table.store_id
GROUP BY
    country
ORDER BY
    total_revenue DESC
LIMIT 1;

Result: Canada generated the highest total revenue among all countries, with revenue of $5,085,319.05.


15. What is the total revenue generated by loyalty members vs non-loyalty members?

SELECT
    CASE
        WHEN customers_table.loyalty_member = 0 THEN 'non-loyal member'
        ELSE 'loyal member'
    END AS customer_type,
    SUM(revenue) AS total_revenue
FROM sales_table
LEFT JOIN customers_table
    ON customers_table.customer_id = sales_table.customer_id
GROUP BY
    customers_table.loyalty_member;
customer_type total_revenue
non-loyal member 12,713,088.39
loyal member 12,773,040.47

Result: Loyal members generated $59,952.08 more revenue than non-loyal members.


16. List the top 5 products by total quantity sold, including the product name and brand.

SELECT
    brand,
    product_name,
    SUM(quantity) AS total_quantity
FROM products_table
LEFT JOIN sales_table ON sales_table.product_id = products_table.product_id
GROUP BY
    product_name,
    brand
ORDER BY 
    total_quantity DESC
LIMIT 5;
brand product_name total_quantity
Mars White Chocolate 60% 74,066
Ferrero Praline Chocolate 50% 73,940
Ferrero Truffle Chocolate 50% 60,608
Cadbury Dark Chocolate 50% 60,240
Cadbury Dark Chocolate 60% 60,159

Result: Mars' White Chocolate 60% recorded the most quantity sold among all products.


17. Which store type generates the most revenue?

SELECT
    store_type,
    SUM(revenue) AS total_revenue
FROM stores_table
LEFT JOIN sales_table ON sales_table.store_id = stores_table.store_id
GROUP BY
    store_type
ORDER BY
    total_revenue DESC
LIMIT 1;

Result: Airport stores generate the biggest revenue, totaling $7,613,875.92.


18. What is the average order revenue for male vs female customers?

SELECT 
    gender,
    AVG(revenue) AS avg_revenue
FROM sales_table
LEFT JOIN customers_table ON customers_table.customer_id = sales_table.customer_id
GROUP BY gender;
gender avg_revenue
Female 25.48
Male 25.49

Result: Average order revenue was nearly identical for male and female customers.


19. Which city has the highest number of orders placed?

SELECT
    city,
    count(DISTINCT order_id) as order_count
FROM sales_table
LEFT JOIN stores_table ON sales_table.store_id = stores_table.store_id
GROUP BY
    city
ORDER BY
    order_count DESC
LIMIT 1;

Result: Toronto recorded the most number of orders, with 180,181 orders placed.


20. List the products that have a unit price greater than the average unit price across all products.

SELECT 
    product_name,
    AVG(unit_price) AS avg_unit_price
FROM products_table p
JOIN sales_table s 
    ON s.product_id = p.product_id
GROUP BY product_name
HAVING AVG(unit_price) > (
    SELECT AVG(unit_price)
    FROM sales_table
)
ORDER BY avg_unit_price DESC;

Analytical Approach: To identify premium-priced products, the query was adjusted to return products whose average unit price exceeds the overall average unit price across all products in the dataset.

product_name avg_unit_price
Milk Chocolate 70% 9.04
White Chocolate 80% 9.04
Milk Chocolate 90% 9.02
Dark Chocolate 80% 9.02
Milk Chocolate 50% 9.02
Truffle Chocolate 80% 9.01
Milk Chocolate 60% 9.01
Praline Chocolate 70% 9.01
Truffle Chocolate 70% 9.01
Truffle Chocolate 60% 9.01
Dark Chocolate 50% 9.01
Praline Chocolate 50% 9.01
Dark Chocolate 60% 9.00

Result: The products listed above have average unit prices that exceed the overall average unit price across the dataset, indicating a premium pricing position relative to other products.


πŸ’‘ Key Insights

  • The dataset contains 1 million sales orders generated by 50,000 customers across 100 stores in 7 cities.
  • Total revenue exceeded $25.4 million, generating more than $10.1 million in profit.
  • Approximately 37.5% of all orders received discounts.
  • Praline products recorded the most sales volume, while Milk products recorded the lowest.
  • Cadbury and Ferrero offered the largest product portfolios, with 37 products each.
  • Ferrero generated the highest total profit among all brands.
  • Canada recorded the largest total revenue among all countries.
  • Loyal members generated slightly more revenue than non-loyal members.
  • Airport stores generated the biggest revenue among all store types.
  • Toronto recorded the highest number of orders, while Sydney hosted the highest-revenue store.
  • Mars' White Chocolate 60% was the best-selling product by quantity sold.
  • Average order revenue was nearly identical across male and female customers.
  • Product prices were highly standardized, with only a small number of products priced above the overall average unit price.

πŸ“£ Recommendations

  • Optimize discount strategies by targeting high-impact products, since over one-third of orders already involve discounts, which may be reducing potential revenue.

  • Strengthen marketing efforts for lower-performing categories such as Milk products to improve their sales volume relative to higher-performing categories like Praline.

  • Leverage top-performing brands such as Ferrero and Cadbury by expanding their presence in high-revenue locations and promoting their best-selling products.

  • Investigate pricing strategy consistency, as product prices are highly standardized, to determine whether selective premium pricing could increase profitability.

  • Focus on improving performance in high-traffic cities such as Toronto while replicating successful strategies from top-performing stores like those in Sydney.

  • Enhance loyalty programs slightly, as loyal customers generate marginally higher revenue, indicating potential for stronger customer retention strategies.

  • Prioritize high-performing store types such as Airport locations by increasing inventory and promotional efforts in these channels.

  • Use product-level sales insights to identify and scale best-selling SKUs like Mars White Chocolate 60% across more stores and regions.

πŸ“‰ Limitations of the Analysis

  • The dataset does not include external factors such as marketing spend or seasonal effects, which may influence sales patterns.
  • Product pricing appears highly uniform, limiting deeper price elasticity analysis.
  • Customer behavior is limited to basic attributes, restricting deeper segmentation.
  • No time-based trend analysis was included in this version of the analysis.

πŸ“ˆ Future Improvements

  • Add window functions (RANK, ROW_NUMBER) for advanced ranking analysis.
  • Build customer segmentation.
  • Introduce cohort or time-series analysis using the calendar table.
  • Create interactive dashboards in Power BI.
  • Optimize SQL queries using indexing and query performance tuning.

πŸ‘¨β€πŸ’» Author

Segun Olakoyenikan | Data Analyst and Storyteller

SQL | Power BI | Advanced Excel | Python

About

A SQL-based data analysis project exploring chocolate sales performance across customers, products, stores, and time. The project uses a structured relational database to uncover trends through efficient SQL queries and aggregations.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors