Final Project 2: Exercise Python — Business Analysis Using Python

Ramadhian Ekaputra
14 min read2 days ago

--

What is Python?

Python is a versatile and powerful programming language renowned for its simplicity and readability. It is widely used across various industries and disciplines, including web development, data analysis, artificial intelligence, scientific computing, and more. Python’s syntax emphasizes clarity and productivity, making it accessible for beginners yet robust enough for complex applications. Its extensive libraries and frameworks support diverse functionalities, making Python a preferred choice for developers aiming to build scalable solutions efficiently.

Photo by Hitesh Choudhary on Unsplash

Why Python?

  • Versatility: Python is versatile, used in web development, data analysis, AI, scientific computing, and more.
  • Ease of Learning: Known for its simple and readable syntax, Python is beginner-friendly.
  • Rich Ecosystem: Python offers a vast array of libraries and frameworks, enhancing productivity and capability.
  • Community Support: It boasts a large, active community, providing ample resources and support.
  • Scalability: Python is scalable, capable of handling small scripts to large-scale applications effectively.
Photo by Rubaitul Azad on Unsplash

Google Colab Link:

Dataset Details for Python

The data used is data from Tokopedia (not real data).The explanation of the dataset is as follows:

Data sources used

#Sumber data yang digunakan
path_od = "https://raw.githubusercontent.com/dataskillsboost/FinalProjectDA11/main/order_detail.csv"
path_pd = "https://raw.githubusercontent.com/dataskillsboost/FinalProjectDA11/main/payment_detail.csv"
path_cd = "https://raw.githubusercontent.com/dataskillsboost/FinalProjectDA11/main/customer_detail.csv"
path_sd = "https://raw.githubusercontent.com/dataskillsboost/FinalProjectDA11/main/sku_detail.csv"
df_od = pd.read_csv(path_od)
df_pd = pd.read_csv(path_pd)
df_cd = pd.read_csv(path_cd)
df_sd = pd.read_csv(path_sd)

1 CASE 1: The Top 5 Products Within The “Mobiles & Tablets” Category

From this Case

Dear Data Analyst,

Akhir tahun ini, perusahaan akan memberikan hadiah bagi pelanggan yang memenangkan kompetisi Festival Akhir Tahun. Tim Marketing membutuhkan bantuan untuk menentukan perkiraan hadiah yang akan diberikan pada pemenang kompetisi nantinya. Hadiah tersebut akan diambil dari TOP 5 Produk dari Kategori Mobiles & Tablets selama tahun 2022, dengan jumlah kuantitas penjualan (valid = 1) paling tinggi.

Mohon bantuan, untuk mengirimkan data tersebut sebelum akhir bulan ini ke Tim Marketing. Atas bantuan yang diberikan, kami mengucapkan terima kasih.

Regards

Tim Marketing

Tips:

  1. Memfilter data dengan valid = 1
  2. Memfilter data dengan Category = Mobiles & Tablets
  3. Memfilter data pada transaksi selama 2022
  4. Gunakan groupby berdasarkan sku_name
  5. Gunakan sort_values untuk mengurutkan data
  6. Gunakan head untuk menampilkan top 5 produk
# Tulis kode Anda di bawah ini. Dapat menggunakan lebih dari 1 blok kode
import pandas as pd

# Memuat dataset
path_od = "https://raw.githubusercontent.com/dataskillsboost/FinalProjectDA11/main/order_detail.csv"
path_cd = "https://raw.githubusercontent.com/dataskillsboost/FinalProjectDA11/main/customer_detail.csv"
path_sd = "https://raw.githubusercontent.com/dataskillsboost/FinalProjectDA11/main/sku_detail.csv"

df_od = pd.read_csv(path_od)
df_cd = pd.read_csv(path_cd)
df_sd = pd.read_csv(path_sd)

# Mengonversi kolom order_date menjadi datetime
df_od['order_date'] = pd.to_datetime(df_od['order_date'])

# Memfilter data dengan valid = 1
df_valid_orders = df_od[df_od['is_valid'] == 1]

# Memfilter data pada transaksi selama 2022
df_valid_orders_2022 = df_valid_orders[df_valid_orders['order_date'].dt.year == 2022]

# Menggabungkan data order_detail dengan sku_detail berdasarkan sku_id
df_merged = df_valid_orders_2022.merge(df_sd, left_on='sku_id', right_on='id')

# Memfilter data dengan Category = Mobiles & Tablets
df_mobiles_tablets = df_merged[df_merged['category'] == 'Mobiles & Tablets']

# Mengelompokkan data berdasarkan sku_name dan menjumlahkan qty_ordered
df_grouped = df_mobiles_tablets.groupby('sku_name').agg({'qty_ordered': 'sum'}).reset_index()

# Mengurutkan data berdasarkan qty_ordered secara menurun
df_sorted = df_grouped.sort_values(by='qty_ordered', ascending=False)

# Menampilkan top 5 produk
top_5_products = df_sorted.head(5)

# Menampilkan hasil
print(top_5_products)

# Visualization
plt.figure(figsize=(10, 6))
plt.bar(df_top5['sku_name'], df_top5['qty_ordered'], color='skyblue')
plt.xlabel('Product Name')
plt.ylabel('Quantity Ordered')
plt.title('Top 5 Mobiles & Tablets Products by Quantity Sold in 2022')
plt.xticks(rotation=45)
plt.show()

Objective: Determine the top 5 products from the Mobiles & Tablets category with the highest quantity sold in 2022 for the upcoming year-end festival competition.

Analysis Process:

  • Filter data for valid transactions (is_valid = 1, is_net = 1).
  • Select products only from the Mobiles & Tablets category.
  • Group by SKU and calculate the total quantity sold.
  • Sort to identify the top 5 products based on quantity sold.

Outcome: Provide a list of the top 5 products in the Mobiles & Tablets category with the highest sales quantities, suitable for the year-end festival competition awards.

2 CASE 2: Sales Data Comparison Between 2021 & 2022

From this Case

Dear Data Analyst,

Menindaklanjuti meeting gabungan Tim Werehouse dan Tim Marketing, kami menemukan bahwa ketersediaan stock produk dengan Kategori Others pada akhir 2022 kemarin masih banyak.

1. Kami mohon bantuan untuk melakukan pengecekan data penjualan kategori tersebut dengan tahun 2021 secara kuantitas penjualan. Dugaan sementara kami, telah terjadi penurunan kuantitas penjualan pada 2022 dibandingkan 2021. (Mohon juga menampilkan data ke-15 kategori)

2. Apabila memang terjadi penurunan kuantitas penjualan pada kategori Others, kami mohon bantuan untuk menyediakan data TOP 20 nama produk yang mengalami penurunan paling tinggi pada 2022 jika dibanding dengan 2021. Hal ini kami gunakan sebagai bahan diskusi pada meeting selanjutnya.

Mohon bantuan untuk mengirimkan data tersebut paling lambat 4 hari dari hari ini. Atas bantuan yang diberikan, kami mengucapkan terima kasih.

Regards

Tim Warehouse

Tips:

  1. Memfilter data dengan valid = 1
  2. Memfilter data pada transaksi selama 2021
  3. Memfilter data pada transaksi selama 2022
  4. Gunakan groupby berdasarkan sku_name masing-masing tahun
  5. Menggabungkan kedua data dengan merge
  6. Lakukan pengurangan kolom qty 2022 dengan qty 2021

CASE 2.1: Evaluating Sales Quantity Decline in Others Category Compared to 2021

Data penjualan kategori ‘Others’ pada tahun 2021:
# Tulis kode Anda di bawah ini. Dapat menggunakan lebih dari 1 blok kode
import pandas as pd

# Memuat dataset
path_od = "https://raw.githubusercontent.com/dataskillsboost/FinalProjectDA11/main/order_detail.csv"
path_sd = "https://raw.githubusercontent.com/dataskillsboost/FinalProjectDA11/main/sku_detail.csv"

df_od = pd.read_csv(path_od)
df_sd = pd.read_csv(path_sd)

# Mengonversi kolom order_date menjadi datetime
df_od['order_date'] = pd.to_datetime(df_od['order_date'])

# Memfilter data dengan valid = 1
df_valid_orders = df_od[df_od['is_valid'] == 1]

# Memfilter data pada transaksi selama 2021 dan 2022
df_orders_2021 = df_valid_orders[df_valid_orders['order_date'].dt.year == 2021]
df_orders_2022 = df_valid_orders[df_valid_orders['order_date'].dt.year == 2022]

# Menggabungkan data order_detail dengan sku_detail berdasarkan sku_id
df_merged_2021 = df_orders_2021.merge(df_sd, left_on='sku_id', right_on='id')

# Memfilter data dengan Category = Others
df_others_2021 = df_merged_2021[df_merged_2021['category'] == 'Others']

# Mengelompokkan data berdasarkan category dan menjumlahkan qty_ordered
df_grouped_2021 = df_others_2021.groupby('category').agg({'qty_ordered': 'sum'}).reset_index()

# Menampilkan data ke-15 kategori
top_15_categories_2021 = df_grouped_2021.sort_values(by='qty_ordered', ascending=False).head(15)

# Menampilkan hasil
print("Data penjualan kategori 'Others' pada tahun 2021:")
print(top_15_categories_2021)

# Display the comparison
print(df_top15)

# Visualization for 'Others' category
labels = ['2021', '2022']
values = [total_qty_others_2021, total_qty_others_2022]

plt.figure(figsize=(8, 6))
plt.bar(labels, values, color=['blue', 'orange'])
plt.xlabel('Year')
plt.ylabel('Quantity Ordered')
plt.title("Quantity Ordered for 'Others' Category in 2021 and 2022")
plt.show()

# Visualization for top 15 categories
plt.figure(figsize=(14, 8))
categories = df_top15['category']
qty_2021 = df_top15['qty_ordered_2021']
qty_2022 = df_top15['qty_ordered_2022']

x = range(len(categories))

plt.bar(x, qty_2021, width=0.4, label='2021', align='center')
plt.bar(x, qty_2022, width=0.4, label='2022', align='edge')
plt.xlabel('Category')
plt.ylabel('Quantity Ordered')
plt.title('Top 15 Categories by Quantity Ordered in 2021 and 2022')
plt.xticks(x, categories, rotation=45, ha='right')
plt.legend()
plt.tight_layout()
plt.show()

Objective:

  • Goal: Evaluate the decline in sales quantities for the Others category from 2021 to 2022.
  • Initial Observation: There is a suspected decrease in sales quantities in 2022 compared to 2021.

Analysis Process:

  • Data Comparison: Compare the total sales quantities of products categorized under Others between the years 2021 and 2022.
  • Identification of Top 15 Categories: Identify the top 15 product categories within the Others segment that have experienced the most significant decline in sales.
Data penjualan kategori ‘Others’ pada tahun 2021:

Outcome:

  • Present data showing the comparative sales quantities for the Others category between 2021 and 2022.
  • Identify and list the top 15 product categories within Others that have shown the largest decline in sales. This information will be crucial for further discussion in the next meeting, guiding strategies to address the sales decline effectively.

CASE 2.2: Identifying Top 20 Products with Highest Decline in Others Category

Top 20 produk dengan penurunan kuantitas penjualan paling tinggi pada kategori ‘Others’ dari tahun 2021 ke 2022:

Top 20 Products with Highest Decline in Others Category
# Tulis kode Anda di bawah ini. Dapat menggunakan lebih dari 1 blok kode
import pandas as pd

# Memuat dataset
path_od = "https://raw.githubusercontent.com/dataskillsboost/FinalProjectDA11/main/order_detail.csv"
path_sd = "https://raw.githubusercontent.com/dataskillsboost/FinalProjectDA11/main/sku_detail.csv"

df_od = pd.read_csv(path_od)
df_sd = pd.read_csv(path_sd)

# Mengonversi kolom order_date menjadi datetime
df_od['order_date'] = pd.to_datetime(df_od['order_date'])

# Memfilter data dengan valid = 1
df_valid_orders = df_od[df_od['is_valid'] == 1]

# Memfilter data pada transaksi selama 2021 dan 2022
df_orders_2021 = df_valid_orders[df_valid_orders['order_date'].dt.year == 2021]
df_orders_2022 = df_valid_orders[df_valid_orders['order_date'].dt.year == 2022]

# Menggabungkan data order_detail dengan sku_detail berdasarkan sku_id
df_merged_2021 = df_orders_2021.merge(df_sd, left_on='sku_id', right_on='id')
df_merged_2022 = df_orders_2022.merge(df_sd, left_on='sku_id', right_on='id')

# Memfilter data dengan Category = Others
df_others_2021 = df_merged_2021[df_merged_2021['category'] == 'Others']
df_others_2022 = df_merged_2022[df_merged_2022['category'] == 'Others']

# Mengelompokkan data berdasarkan sku_name dan menjumlahkan qty_ordered
df_grouped_2021 = df_others_2021.groupby('sku_name').agg({'qty_ordered': 'sum'}).reset_index()
df_grouped_2022 = df_others_2022.groupby('sku_name').agg({'qty_ordered': 'sum'}).reset_index()

# Menggabungkan data 2021 dan 2022
df_compare = df_grouped_2021.merge(df_grouped_2022, on='sku_name', suffixes=('_2021', '_2022'), how='outer').fillna(0)

# Menghitung penurunan qty_ordered dari 2021 ke 2022
df_compare['qty_diff'] = df_compare['qty_ordered_2022'] - df_compare['qty_ordered_2021']

# Menampilkan top 20 produk dengan penurunan kuantitas penjualan paling tinggi
top_20_decline = df_compare.sort_values(by='qty_diff').head(20)

# Menampilkan hasil
print("Top 20 produk dengan penurunan kuantitas penjualan paling tinggi pada kategori 'Others' dari tahun 2021 ke 2022:")
print(top_20_decline)

# Visualization
plt.figure(figsize=(14, 8))
products = df_top20_decrease['sku_name']
qty_diff = df_top20_decrease['qty_diff']

plt.barh(products, qty_diff, color='red')
plt.xlabel('Quantity Difference (2022 - 2021)')
plt.ylabel('Product Name')
plt.title('Top 20 Products with Highest Decrease in Quantity Ordered (2022 vs 2021)')
plt.gca().invert_yaxis() # Invert y-axis to have the highest decrease at the top
plt.show()

Objective:

  • Goal: Provide data on the top 20 products within the Others category that have experienced the highest decline in sales from 2021 to 2022.
  • Purpose: Use this data as a basis for discussion in upcoming meetings to strategize and address the decline in sales.

Analysis Process:

  • Detailed Comparison: Compare sales data of individual products within the Others category between 2021 and 2022.
  • Ranking by Decline: Rank and present the top 20 products that have shown the most significant decline in sales quantities.

Outcome:

  • Provide a list of the top 20 products in the Others category with the highest decline in sales from 2021 to 2022.
  • This dataset will serve as valuable input for discussions and decision-making in subsequent meetings, helping teams focus on specific products that require attention to improve sales performance
Top 20 Products with Highest Decline in Others Category

3 CASE 3: Customer Registration Trends in 2022

From this Case

Dear Data Analyst,

Terkait ulang tahun perusahaan pada 2 bulan mendatang, Tim Digital Marketing akan memberikan informasi promo bagi pelanggan pada akhir bulan ini. Kriteria pelanggan yang akan kami butuhkan adalah mereka yang sudah melakukan check-out namun belum melakukan pembayaran (is_gross = 1) selama tahun 2022. Data yang kami butuhkan adalah ID Customer dan Registered Date.

Mohon bantuan, untuk mengirimkan data tersebut sebelum akhir bulan ini ke Tim Digital Marketing. Atas bantuan yang diberikan, kami mengucapkan terima kasih.

Regards

Tim Digital Marketing

Tips:

  1. Memfilter data dengan gross = 1
  2. Memfilter data dengan valid = 0
  3. Memfilter data dengan net = 0
  4. Memfilter data pada transaksi selama 2022
Data pelanggan yang sudah melakukan check-out namun belum melakukan pembayaran (is_gross = 1) selama tahun 2022
# Tulis kode Anda di bawah ini. Dapat menggunakan lebih dari 1 blok kode
import pandas as pd

# Memuat dataset
path_od = "https://raw.githubusercontent.com/dataskillsboost/FinalProjectDA11/main/order_detail.csv"
path_cd = "https://raw.githubusercontent.com/dataskillsboost/FinalProjectDA11/main/customer_detail.csv"

df_od = pd.read_csv(path_od)
df_cd = pd.read_csv(path_cd)

# Mengonversi kolom order_date menjadi datetime
df_od['order_date'] = pd.to_datetime(df_od['order_date'])

# Memfilter data dengan gross = 1 (check-out belum bayar)
df_gross_1 = df_od[df_od['is_gross'] == 1]

# Memfilter data dengan valid = 0 dan net = 0 (belum valid dan belum selesai)
df_pending_payments = df_gross_1[(df_gross_1['is_valid'] == 0) & (df_gross_1['is_net'] == 0)]

# Memfilter data pada transaksi selama 2022
df_pending_payments_2022 = df_pending_payments[df_pending_payments['order_date'].dt.year == 2022]

# Menggabungkan dengan data customer_detail untuk mendapatkan ID Customer dan Registered Date
df_needed_data = df_pending_payments_2022.merge(df_cd, left_on='customer_id', right_on='id')[['customer_id', 'registered_date']]

# Menghapus duplikat data ID Customer jika ada
df_needed_data = df_needed_data.drop_duplicates(subset=['customer_id'])

# Menampilkan hasil
print("Data pelanggan yang sudah melakukan check-out namun belum melakukan pembayaran (is_gross = 1) selama tahun 2022:")
print(df_needed_data)

# Visualization
import matplotlib.pyplot as plt

# Convert registered_date to datetime
df_result['registered_date'] = pd.to_datetime(df_result['registered_date'])

# Plot the distribution of registration dates
plt.figure(figsize=(10, 6))
df_result['registered_date'].dt.year.value_counts().sort_index().plot(kind='bar', color='skyblue')
plt.xlabel('Year of Registration')
plt.ylabel('Number of Customers')
plt.title('Distribution of Customer Registration Years for Unpaid Orders in 2022')
plt.xticks(rotation=45)
plt.show()

Objective: Identify customers who have checked out but have not completed the payment process during 2022.

Analysis Process:

  • Data Filtering: Filter transactions where customers have initiated the checkout (is_gross = 1) but have not yet finalized payment (is_net = 0).
  • Customer Information: Retrieve and present customer IDs and their registration dates (registered_date).

Outcome:

  • Provide a dataset containing customer IDs and registration dates for those who have pending payments. This insight enables the Digital Marketing team to target these customers with specific campaigns or reminders to complete their purchases, potentially improving conversion rates and reducing abandoned carts.

4 CASE 4: Analysing The Impact of Weekend Vs. Weekday Sales During The October-December 2022 Campaign

From this Case

Dear Data Analyst,

Pada bulan October hingga Desember 2022, kami melakukan campaign setiap hari Sabtu dan Minggu. Kami hendak menilai, apakah campaign tersebut cukup berdampak pada kenaikan penjualan (before_discount). Mohon bantuan untuk menampilkan data:

1. Rata-rata harian penjualan weekends (Sabtu dan Minggu) vs rata-rata harian penjualan weekdays (Senin-Jumat) per bulan tersebut. Apakah ada peningkatan penjualan pada masing-masing bulan tersebut.

2. Rata-rata harian penjualan weekends (Sabtu dan Minggu) vs rata-rata harian penjualan weekdays (Senin-Jumat) keseluruhan 3 bulan tersebut.

Mohon bantuan untuk mengirimkan data tersebut paling lambat minggu depan. Atas bantuan yang diberikan, kami mengucapkan terima kasih.

Regards

Tim Digital Marketing

Tips:

  1. Buatlah kolom baru untuk day, month, dan month number
  2. Memfilter data dengan valid = 1
  3. Memfilter data dengan day termasuk ‘Saturday’, ‘Sunday’ dengan menggunakan fungsi isin dan simpan dalam variable ‘weekends’
  4. Memfilter data dengan day termasuk ‘Monday’, ’Tuesday’, ’Wednesday’, ’Thusday’, ’Friday’ dengan menggunakan fungsi isin dan simpan dengan variabel ‘weekday’
  5. Memfilter data pada transaksi selama 2022 bulan Oktober sampai dengan Desember
  6. Gunakan groupby berdasarkan pada masing-masing variable
  7. Menggabungkan kedua data dengan merge
  8. Tampilkan pada grafik batang untuk melihat perbedaananya

CASE 4.1: Daily Sales Comparison between Weekends and Weekdays per Month

Monthly Average Sales (October to December 2022):
# Tulis kode Anda di bawah ini. Dapat menggunakan lebih dari 1 blok kode
import pandas as pd
import matplotlib.pyplot as plt

# Load data
path_od = "https://raw.githubusercontent.com/dataskillsboost/FinalProjectDA11/main/order_detail.csv"
df_od = pd.read_csv(path_od)

# Convert order_date to datetime
df_od['order_date'] = pd.to_datetime(df_od['order_date'])

# Create new columns for day, month, and year
df_od['day'] = df_od['order_date'].dt.day_name()
df_od['month'] = df_od['order_date'].dt.month_name()
df_od['month_number'] = df_od['order_date'].dt.month
df_od['year'] = df_od['order_date'].dt.year

# Filter data for valid transactions in 2022 during October to December
df_filtered = df_od[(df_od['is_valid'] == 1) &
(df_od['year'] == 2022) &
(df_od['month_number'].isin([10, 11, 12]))]

# Filter data for weekends and weekdays
weekends = df_filtered[df_filtered['day'].isin(['Saturday', 'Sunday'])]
weekdays = df_filtered[df_filtered['day'].isin(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'])]

# Group by month and day type and calculate mean before_discount
weekends_avg = weekends.groupby(['month', 'day'])['before_discount'].mean().reset_index()
weekdays_avg = weekdays.groupby(['month', 'day'])['before_discount'].mean().reset_index()

# Calculate daily average for weekends and weekdays for each month
weekends_avg_monthly = weekends_avg.groupby('month')['before_discount'].mean().reset_index()
weekdays_avg_monthly = weekdays_avg.groupby('month')['before_discount'].mean().reset_index()

# Merge the results for comparison
comparison_monthly = pd.merge(weekends_avg_monthly, weekdays_avg_monthly, on='month', suffixes=('_weekends', '_weekdays'))

# Calculate overall daily average for weekends and weekdays for the entire period
overall_weekends_avg = weekends['before_discount'].mean()
overall_weekdays_avg = weekdays['before_discount'].mean()

# Display the results
print("Monthly Average Sales (October to December 2022):")
print(comparison_monthly)
print("\nOverall Daily Average Sales:")
print(f"Weekends: {overall_weekends_avg}")
print(f"Weekdays: {overall_weekdays_avg}")

# Visualization
# Bar plot for monthly comparison
comparison_monthly.plot(kind='bar', x='month', figsize=(10, 6))
plt.ylabel('Average Before Discount Sales')
plt.title('Monthly Average Sales: Weekends vs Weekdays (Oct-Dec 2022)')
plt.xticks(rotation=0)
plt.show()

# Bar plot for overall comparison
labels = ['Weekends', 'Weekdays']
values = [overall_weekends_avg, overall_weekdays_avg]

plt.figure(figsize=(8, 6))
plt.bar(labels, values, color=['blue', 'green'])
plt.ylabel('Average Before Discount Sales')
plt.title('Overall Average Sales: Weekends vs Weekdays (Oct-Dec 2022)')
plt.show()

Objective:

  • Goal: Compare the average daily sales (before_discount) between weekends (Saturday and Sunday) and weekdays (Monday to Friday) for each month (October, November, December 2022).
  • Purpose: Determine if there are noticeable increases in sales on weekends compared to weekdays within each month.

Analysis Process:

  • Data Segmentation: Separate the dataset to isolate weekends (Saturday and Sunday) and weekdays (Monday to Friday) for each month.
  • Average Calculation: Compute the average daily sales (before_discount) separately for weekends and weekdays for October, November, and December 2022.
  • Comparison: Compare the average sales figures to identify any patterns or trends indicating higher sales on weekends versus weekdays within each month.

Outcome:

  • Present the comparative analysis of average daily sales for weekends and weekdays across October, November, and December 2022.
  • Provide insights into whether there are consistent trends of increased sales on weekends relative to weekdays within each month. This information will help in understanding the effectiveness of weekend campaigns on a month-by-month basis.

CASE 4.2: Overall Daily Sales Comparison between Weekends and Weekdays for 3 Months

# Tulis kode Anda di bawah ini. Dapat menggunakan lebih dari 1 blok kode
import pandas as pd
import numpy as np

# Load the datasets
path_od = "https://raw.githubusercontent.com/dataskillsboost/FinalProjectDA11/main/order_detail.csv"
path_cd = "https://raw.githubusercontent.com/dataskillsboost/FinalProjectDA11/main/customer_detail.csv"
path_sd = "https://raw.githubusercontent.com/dataskillsboost/FinalProjectDA11/main/sku_detail.csv"

df_od = pd.read_csv(path_od)
df_cd = pd.read_csv(path_cd)
df_sd = pd.read_csv(path_sd)

# Convert order_date to datetime
df_od['order_date'] = pd.to_datetime(df_od['order_date'])

# Extract year, month, and day of the week
df_od['year'] = df_od['order_date'].dt.year
df_od['month'] = df_od['order_date'].dt.month
df_od['day_of_week'] = df_od['order_date'].dt.day_name()

# Filter data for valid transactions in October, November, and December 2022
df_filtered = df_od[(df_od['year'] == 2022) &
(df_od['is_valid'] == 1) &
(df_od['month'].isin([10, 11, 12]))]

# Separate weekends and weekdays
weekends = df_filtered[df_filtered['day_of_week'].isin(['Saturday', 'Sunday'])]
weekdays = df_filtered[df_filtered['day_of_week'].isin(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'])]

# Calculate average daily sales for weekends and weekdays
weekend_sales = weekends.groupby('order_date')['before_discount'].sum().mean()
weekday_sales = weekdays.groupby('order_date')['before_discount'].sum().mean()

# Calculate percentage difference
percentage_difference = ((weekend_sales - weekday_sales) / weekday_sales) * 100

print(f"Average daily sales on weekends: {weekend_sales}")
print(f"Average daily sales on weekdays: {weekday_sales}")
print(f"Percentage difference: {percentage_difference:.2f}%")

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Calculate average daily sales for weekends and weekdays by month
weekends_monthly = weekends.groupby(['month', 'order_date'])['before_discount'].sum().groupby('month').mean()
weekdays_monthly = weekdays.groupby(['month', 'order_date'])['before_discount'].sum().groupby('month').mean()

# Combine the data into a single DataFrame for plotting
monthly_sales = pd.DataFrame({
'Month': ['October', 'November', 'December'],
'Weekend Sales': weekends_monthly.values,
'Weekday Sales': weekdays_monthly.values
})

# Set vibrant color palette
sns.set_palette("husl")

# Plotting
plt.figure(figsize=(12, 6))

# Bar plot for monthly average sales
bar_width = 0.35
index = np.arange(len(monthly_sales))

bar1 = plt.bar(index, monthly_sales['Weekend Sales'], bar_width, label='Weekends', color='dodgerblue')
bar2 = plt.bar(index + bar_width, monthly_sales['Weekday Sales'], bar_width, label='Weekdays', color='orange')

# Adding labels and title
plt.xlabel('Month', fontsize=14)
plt.ylabel('Average Daily Sales', fontsize=14)
plt.title('Average Daily Sales: Weekends vs. Weekdays (Oct-Dec 2022)', fontsize=16)
plt.xticks(index + bar_width / 2, monthly_sales['Month'], fontsize=12)
plt.yticks(fontsize=12)
plt.legend(fontsize=12)

# Add value labels on bars
def add_labels(bars):
for bar in bars:
height = bar.get_height()
plt.annotate(f'{height:.2f}',
xy=(bar.get_x() + bar.get_width() / 2, height),
xytext=(0, 3), # 3 points vertical offset
textcoords="offset points",
ha='center', va='bottom', fontsize=12, color='black')

add_labels(bar1)
add_labels(bar2)

plt.tight_layout()
plt.show()

# Overall percentage difference visualization
overall_data = pd.DataFrame({
'Period': ['Weekends', 'Weekdays'],
'Average Sales': [weekend_sales, weekday_sales]
})

plt.figure(figsize=(8, 5))
sns.barplot(x='Period', y='Average Sales', data=overall_data, palette=['dodgerblue', 'orange'])
plt.title('Overall Average Daily Sales: Weekends vs. Weekdays (Oct-Dec 2022)', fontsize=16)
plt.xlabel('', fontsize=14)
plt.ylabel('Average Daily Sales', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Add percentage difference annotation
plt.annotate(f'{percentage_difference:.2f}% difference',
xy=(0.5, max(weekend_sales, weekday_sales)),
xytext=(0, 10),
textcoords='offset points',
ha='center', va='bottom', color='red', fontsize=12)

plt.tight_layout()
plt.show()
Percentage difference: -29.93%

Objective:

  • Goal: Compare the average daily sales (before_discount) between weekends (Saturday and Sunday) and weekdays (Monday to Friday) across the entire period of October to December 2022.
  • Purpose: Assess the overall impact of weekend campaigns on sales compared to weekdays over the three months.

Analysis Process:

  • Data Segmentation: Segment the dataset to distinguish between weekends (Saturday and Sunday) and weekdays (Monday to Friday) for the entire period of October to December 2022.
  • Average Calculation: Calculate the average daily sales (before_discount) separately for weekends and weekdays over the three months.
  • Comparison: Analyze and compare the average sales figures to understand the overall impact of weekend campaigns versus weekday sales performance across the specified months.

Outcome:

  • Present a comprehensive comparison of average daily sales for weekends and weekdays from October to December 2022.
  • Visualize the comparison using a bar chart or similar visualization tool to highlight any significant differences in sales patterns between weekends and weekdays over the entire campaign period.
  • Provide insights to the Campaign team regarding the overall effectiveness of weekend campaigns in driving sales compared to weekdays, aiding in strategic planning and campaign optimization for future initiatives.

--

--