- π Project Overview
- π Executive Summary
- π― Project Objectives
- π οΈ Tools & Technologies
- π Data Structure
- ποΈ Database Setup
- ποΈ Data Model & Relationships
- π§ SQL Concepts Demonstrated
- π Business Questions with Results
- π‘ Key Insights
- π£ Recommendations
- π Limitations of the Analysis
- π Future Improvements
- π¨βπ» Author
This project analyzes chocolate sales data using SQL and PostgreSQL to generate business insights related to sales performance, customer behavior, product trends, and store profitability. Queries were developed and executed in Visual Studio Code through a PostgreSQL database connection.
The analysis demonstrates a range of SQL techniques, including data exploration, joins, aggregations, filtering, subqueries, and KPI development. These methods were used to uncover patterns in the data and support data-driven business decision-making.
This analysis examined 1 million chocolate sales transactions across multiple products, brands, stores, and customers. The study revealed strong overall revenue performance exceeding $25M, with Praline products leading in sales volume and Ferrero generating the highest profit. Pricing across products was highly standardized, while store performance varied significantly by location, with Airport stores and Toronto-based stores showing strong results.
- Analyze overall sales performance.
- Identify top-performing products.
- Evaluate customer purchasing behavior.
- Assess store and location performance.
- Measure revenue and profitability.
- Demonstrate core SQL analysis skills.
- VS Code
- SQLTools Extension
- SQLTools PostgreSQL Driver
- Bash
- Git & GitHub
- PostgreSQL
- SQL
- CSV
This project is built on a relational database consisting of five interconnected tables: Calendar, Customers, Products, Sales, and Stores. The schema follows a star-like structure where the Sales table serves as the central fact table, linked to multiple dimension tables.
| Column Name | Description |
|---|---|
| date | Transaction or calendar date |
| year | Year extracted from date |
| month | Month extracted from date |
| day | Day of month |
| week | Week number of the year |
| day_of_week | Day index (MonβSun) |
| Column Name | Description |
|---|---|
| customer_id | Unique customer identifier |
| gender | Customer gender |
| customer_type | Loyalty status (loyal/non-loyal) |
| Column Name | Description |
|---|---|
| product_id | Unique identifier for each product |
| product_name | Name of chocolate product |
| brand | Manufacturer/brand name |
| category | Product category (Milk, Dark, etc.) |
| cocoa_percentage | Cocoa content of the chocolate |
| weight | Product weight |
| Column Name | Description |
|---|---|
| sale_id | Unique transaction identifier |
| product_id | Links to Products table |
| customer_id | Links to Customers table |
| store_id | Links to Stores table |
| date | Transaction date |
| quantity | Number of units sold |
| unit_price | Price per unit at time of sale |
| Column Name | Description |
|---|---|
| store_id | Unique identifier for store |
| store_name | Name of the store |
| store_city | City where store is located |
| store_country | Country of the store |
-- Create database (run once)
CREATE DATABASE choco_sales;Five separate tables were created to store information related to dates, customers, products, sales transactions, and store locations.
-
Calendar Table: Stores date-related attributes such as year, month, day, and week.
CREATE TABLE calendar_table ( date DATE, year INT, month INT, day INT, week INT, day_of_week INT );
-
Customers Table: Contains customer demographic and membership information.
CREATE TABLE customers_table ( customer_id VARCHAR(50), age INT, gender VARCHAR(50), loyalty_member INT, join_date DATE );
-
Products Table: Holds product information, including brand, category, weight, and cocoa content details.
CREATE TABLE products_table ( product_id VARCHAR(50), product_name VARCHAR(50), brand VARCHAR(50), category VARCHAR(50), cocoa_percent INT, weight_g INT );
-
Sales Table: Serves as the fact table containing transaction-level sales data.
CREATE TABLE sales_table ( order_id VARCHAR(50), order_date DATE, product_id VARCHAR(50), store_id VARCHAR(50), customer_id VARCHAR(50), quantity INT, unit_price DECIMAL(10,2), discount DECIMAL(5,2), revenue DECIMAL(12,2), cost DECIMAL(12,2), profit DECIMAL(12,2) );
-
Stores Table: Contains information about store locations and related attributes.
CREATE TABLE stores_table ( store_id VARCHAR(50), store_name VARCHAR(50), city VARCHAR(50), country VARCHAR(50), store_type VARCHAR(50) );
The original datasets were provided in CSV format and were successfully imported into their respective PostgreSQL tables using the \copy command in the VS Code terminal.
\copy calendar_table FROM '/Users/segun/Desktop/Data Analysis/SQL/30-days/Chocolate Sales Dataset 2023 - 2024 (1)/calendar.csv' WITH (FORMAT csv, HEADER true, DELIMITER ',', ENCODING 'UTF8');
\copy customers_table FROM '/Users/segun/Desktop/Data Analysis/SQL/30-days/Chocolate Sales Dataset 2023 - 2024 (1)/customers.csv' WITH (FORMAT csv, HEADER true, DELIMITER ',', ENCODING 'UTF8');
\copy products_table FROM '/Users/segun/Desktop/Data Analysis/SQL/30-days/Chocolate Sales Dataset 2023 - 2024 (1)/products.csv' WITH (FORMAT csv, HEADER true, DELIMITER ',', ENCODING 'UTF8');
\copy sales_table FROM '/Users/segun/Desktop/Data Analysis/SQL/30-days/Chocolate Sales Dataset 2023 - 2024 (1)/sales.csv' WITH (FORMAT csv, HEADER true, DELIMITER ',', ENCODING 'UTF8');
\copy stores_table FROM '/Users/segun/Desktop/Data Analysis/SQL/30-days/Chocolate Sales Dataset 2023 - 2024 (1)/stores.csv' WITH (FORMAT csv, HEADER true, DELIMITER ',', ENCODING 'UTF8');The \copy command is used to load data from local files into PostgreSQL tables and is executed from the client side (in this case, the terminal via VS Code). This makes it particularly useful for importing CSV files in local development environments, as it avoids server-side file path restrictions associated with the standard COPY command.
Data validation was conducted on each table to ensure the datasets were loaded correctly and were ready for analysis.
-- Row count verification-
Calendar Table
SELECT COUNT(*) FROM calendar_table;
-
Customers Table
SELECT COUNT(*) FROM customers;
-
Product Table
SELECT COUNT(*) FROM products_table;
-
Sales Table
SELECT COUNT(*) FROM sales_table;
-
Stores Table
SELECT COUNT(*) FROM stores_table;
-- NULL value checks-
Calendar Table
SELECT * FROM calendar_table WHERE date IS NULL OR year IS NULL OR month IS NULL OR day IS NULL OR week IS NULL OR day_of_week IS NULL;
-
Customers Table
SELECT * FROM customers_table WHERE customer_id IS NULL;
-
Product Table
SELECT * FROM products_table WHERE product_id IS NULL;
-
Sales Table
SELECT * FROM sales_table WHERE customer_id IS NULL OR product_id IS NULL;
-
Stores Table
SELECT * FROM stores_table WHERE store_id IS NULL;
- Examined the structure of the datasets
- Identified and validated relationships between tables
- Assessed data consistency across all tables
-
Revenue by Products
SELECT product_name, SUM(revenue) AS total_revenue FROM products_table LEFT JOIN sales_table ON sales_table.product_id = products_table.product_id GROUP BY product_name ORDER BY revenue DESC;
-
Top 5 Products by Quantity Sold
SELECT product_name, SUM(quantity) AS quantity FROM products_table LEFT JOIN sales_table ON sales_table.product_id = products_table.product_id GROUP BY product_name ORDER BY quantity DESC LIMIT 5;
- Products β Sales (1-to-many)
- Customers β Sales (1-to-many)
- Stores β Sales (1-to-many)
- Calendar β Sales (1-to-many via date)
This project demonstrates the application of key SQL techniques for data analysis and business intelligence, including:
SELECT SUM()
AVG()
COUNT()
MIN()
MAX() GROUP BY HAVING WHERE INNER JOIN
LEFT JOIN LIMIT CASE
WHEN THEN
ELSE
END DISTINCT AS- Nested
SELECTstatementsSELECT product_name, AVG(unit_price) AS avg_unit_price FROM products_table p JOIN sales_table s ON s.product_id = p.product_id GROUP BY product_name HAVING AVG(unit_price) > ( SELECT AVG(unit_price) FROM sales_table ) ORDER BY avg_unit_price DESC;
- Sorting results
ORDER BY - Data validation
COUNT(*)
This section presents twenty (20) business questions addressed using SQL queries. While the queries are documented here for reference, the actual results were executed and generated in PostgreSQL.
SELECT COUNT(*) AS total_orders
FROM sales_table;Result: The sales table contains 1,000,000 total orders.
SELECT DISTINCT category
FROM products_table;Result: The dataset contains 5 unique product categories: Milk, Praline, Dark, White, and Truffle.
SELECT count(customer_id)
FROM customers_table;Result: The dataset contains 50,000 registered customers.
SELECT
store_name,
city
FROM stores_table;Result: The dataset includes 100 stores located across 7 cities.
SELECT
product_id,
product_name,
cocoa_percent
FROM products_table
WHERE cocoa_percent > 70
ORDER BY
cocoa_percent DESC;Result: There are 76 products with a cocoa percentage greater than 70%.
SELECT
brand,
count(product_name) AS product_count
FROM products_table
GROUP BY brand
ORDER BY
product_count DESC;| brand | product_count |
|---|---|
| Cadbury | 37 |
| Ferrero | 37 |
| Lindt | 35 |
| Mars | 33 |
| Godiva | 30 |
| Hershey | 28 |
Result: Cadbury and Ferrero had the widest product portfolios, with 37 products each.
SELECT
order_id,
product_name,
discount
FROM sales_table
LEFT JOIN products_table ON products_table.product_id = sales_table.product_id
WHERE discount > 0.00
ORDER BY discount;Result: Out of 1 million orders, 374,872 received discounts.
SELECT
SUM(revenue) AS revenue
FROM sales_table;Result: Total revenue generated across all sales was $25,486,128.86.
SELECT
SUM(profit) AS total_profit
FROM sales_table;Result: Total profit across all orders was $10,194,564.63.
SELECT
MIN(profit) AS min_profit,
MAX(profit) AS max_profit
FROM sales_table;Result:
- Minimum profit per order:
$0.73 - Maximum profit per order:
$37.43
SELECT
category,
SUM(quantity) AS total_quantity
FROM products_table
LEFT JOIN sales_table ON sales_table.product_id = products_table.product_id
GROUP BY category
ORDER BY
total_quantity DESC;| category | total_quantity |
|---|---|
| Praline | 784,435 |
| White | 714,967 |
| Dark | 623,247 |
| Truffle | 462,202 |
| Milk | 385,619 |
Result: Praline products recorded the highest total quantity sold, while Milk products had the lowest.
SELECT
store_name,
city AS store_city,
SUM(sales_table.revenue) AS total_revenue
FROM stores_table
LEFT JOIN sales_table ON sales_table.store_id = stores_table.store_id
GROUP BY
store_name,
city
ORDER BY total_revenue DESC;Result Preview: Top 5 stores by total revenue.
| store_name | store_city | total_revenue |
|---|---|---|
| Chocolate Store 74 | Sydney | 261,393.77 |
| Chocolate Store 33 | Toronto | 260,672.37 |
| Chocolate Store 50 | New York | 259,526.62 |
| Chocolate Store 85 | Melbourne | 259,512.15 |
| Chocolate Store 98 | New York | 259,055.13 |
SELECT
brand,
SUM(profit) AS highest_profit
FROM sales_table
LEFT JOIN products_table ON sales_table.product_id = products_table.product_id
WHERE profit > 0
GROUP BY
brand
ORDER BY highest_profit DESC
LIMIT 1;Result: Ferrero generated the highest total profit, amounting to $1,876,268.09.
SELECT
country,
SUM(revenue) AS total_revenue
FROM sales_table
LEFT JOIN stores_table ON sales_table.store_id = stores_table.store_id
GROUP BY
country
ORDER BY
total_revenue DESC
LIMIT 1;Result: Canada generated the highest total revenue among all countries, with revenue of $5,085,319.05.
SELECT
CASE
WHEN customers_table.loyalty_member = 0 THEN 'non-loyal member'
ELSE 'loyal member'
END AS customer_type,
SUM(revenue) AS total_revenue
FROM sales_table
LEFT JOIN customers_table
ON customers_table.customer_id = sales_table.customer_id
GROUP BY
customers_table.loyalty_member;| customer_type | total_revenue |
|---|---|
| non-loyal member | 12,713,088.39 |
| loyal member | 12,773,040.47 |
Result: Loyal members generated $59,952.08 more revenue than non-loyal members.
SELECT
brand,
product_name,
SUM(quantity) AS total_quantity
FROM products_table
LEFT JOIN sales_table ON sales_table.product_id = products_table.product_id
GROUP BY
product_name,
brand
ORDER BY
total_quantity DESC
LIMIT 5;| brand | product_name | total_quantity |
|---|---|---|
| Mars | White Chocolate 60% | 74,066 |
| Ferrero | Praline Chocolate 50% | 73,940 |
| Ferrero | Truffle Chocolate 50% | 60,608 |
| Cadbury | Dark Chocolate 50% | 60,240 |
| Cadbury | Dark Chocolate 60% | 60,159 |
Result: Mars' White Chocolate 60% recorded the most quantity sold among all products.
SELECT
store_type,
SUM(revenue) AS total_revenue
FROM stores_table
LEFT JOIN sales_table ON sales_table.store_id = stores_table.store_id
GROUP BY
store_type
ORDER BY
total_revenue DESC
LIMIT 1;Result: Airport stores generate the biggest revenue, totaling $7,613,875.92.
SELECT
gender,
AVG(revenue) AS avg_revenue
FROM sales_table
LEFT JOIN customers_table ON customers_table.customer_id = sales_table.customer_id
GROUP BY gender;| gender | avg_revenue |
|---|---|
| Female | 25.48 |
| Male | 25.49 |
Result: Average order revenue was nearly identical for male and female customers.
SELECT
city,
count(DISTINCT order_id) as order_count
FROM sales_table
LEFT JOIN stores_table ON sales_table.store_id = stores_table.store_id
GROUP BY
city
ORDER BY
order_count DESC
LIMIT 1;Result: Toronto recorded the most number of orders, with 180,181 orders placed.
20. List the products that have a unit price greater than the average unit price across all products.
SELECT
product_name,
AVG(unit_price) AS avg_unit_price
FROM products_table p
JOIN sales_table s
ON s.product_id = p.product_id
GROUP BY product_name
HAVING AVG(unit_price) > (
SELECT AVG(unit_price)
FROM sales_table
)
ORDER BY avg_unit_price DESC;Analytical Approach: To identify premium-priced products, the query was adjusted to return products whose average unit price exceeds the overall average unit price across all products in the dataset.
| product_name | avg_unit_price |
|---|---|
| Milk Chocolate 70% | 9.04 |
| White Chocolate 80% | 9.04 |
| Milk Chocolate 90% | 9.02 |
| Dark Chocolate 80% | 9.02 |
| Milk Chocolate 50% | 9.02 |
| Truffle Chocolate 80% | 9.01 |
| Milk Chocolate 60% | 9.01 |
| Praline Chocolate 70% | 9.01 |
| Truffle Chocolate 70% | 9.01 |
| Truffle Chocolate 60% | 9.01 |
| Dark Chocolate 50% | 9.01 |
| Praline Chocolate 50% | 9.01 |
| Dark Chocolate 60% | 9.00 |
Result: The products listed above have average unit prices that exceed the overall average unit price across the dataset, indicating a premium pricing position relative to other products.
- The dataset contains 1 million sales orders generated by 50,000 customers across 100 stores in 7 cities.
- Total revenue exceeded $25.4 million, generating more than $10.1 million in profit.
- Approximately 37.5% of all orders received discounts.
- Praline products recorded the most sales volume, while Milk products recorded the lowest.
- Cadbury and Ferrero offered the largest product portfolios, with 37 products each.
- Ferrero generated the highest total profit among all brands.
- Canada recorded the largest total revenue among all countries.
- Loyal members generated slightly more revenue than non-loyal members.
- Airport stores generated the biggest revenue among all store types.
- Toronto recorded the highest number of orders, while Sydney hosted the highest-revenue store.
- Mars' White Chocolate 60% was the best-selling product by quantity sold.
- Average order revenue was nearly identical across male and female customers.
- Product prices were highly standardized, with only a small number of products priced above the overall average unit price.
-
Optimize discount strategies by targeting high-impact products, since over one-third of orders already involve discounts, which may be reducing potential revenue.
-
Strengthen marketing efforts for lower-performing categories such as Milk products to improve their sales volume relative to higher-performing categories like Praline.
-
Leverage top-performing brands such as Ferrero and Cadbury by expanding their presence in high-revenue locations and promoting their best-selling products.
-
Investigate pricing strategy consistency, as product prices are highly standardized, to determine whether selective premium pricing could increase profitability.
-
Focus on improving performance in high-traffic cities such as Toronto while replicating successful strategies from top-performing stores like those in Sydney.
-
Enhance loyalty programs slightly, as loyal customers generate marginally higher revenue, indicating potential for stronger customer retention strategies.
-
Prioritize high-performing store types such as Airport locations by increasing inventory and promotional efforts in these channels.
-
Use product-level sales insights to identify and scale best-selling SKUs like Mars White Chocolate 60% across more stores and regions.
- The dataset does not include external factors such as marketing spend or seasonal effects, which may influence sales patterns.
- Product pricing appears highly uniform, limiting deeper price elasticity analysis.
- Customer behavior is limited to basic attributes, restricting deeper segmentation.
- No time-based trend analysis was included in this version of the analysis.
- Add window functions (RANK, ROW_NUMBER) for advanced ranking analysis.
- Build customer segmentation.
- Introduce cohort or time-series analysis using the calendar table.
- Create interactive dashboards in Power BI.
- Optimize SQL queries using indexing and query performance tuning.
Segun Olakoyenikan | Data Analyst and Storyteller
SQL | Power BI | Advanced Excel | Python