Below is a team work by my team and I at Georgia Tech for our Data Analytics for Business summer 2022 Project. The data for the project is fictional and all information provided is for our project purpose only and not to be used for any decision making.
Authors of the project are :
Vincent Amedekah https://www.linkedin.com/in/vincentamedekah
Mohamed Gamal https://www.linkedin.com/in/mohamed-gamal-hamed-2b5b60163/
John Goodman https://www.linkedin.com/in/john-p-goodman/
Elise Yang https://www.linkedin.com/in/eliseyang0625
Background
The supply chain industry is the engine that runs every economy in the world. We’ve seen firsthand the impact in the disruption of the supply chain industry when covid 19 hit the world in late 2019. These supply chain disruptions have caused manufacturing shortages in areas such as computer microchips which have impacted IT organizations and have had downstream effects in areas such as car manufacturing. These disruptions aren’t just limited to hardware and digital goods as we’ve seen shortages in many household items such as toilet paper, baby formula and common grocery items. With the massive amount of data available from point of sales to online transactions, an advanced approach for analyses and prediction of the future state of any supply chain operation will be beneficial to both businesses and customers. In this project, we analyze data from a fictional sporting goods supplier to identity what are the key variables that impact the smooth delivery of products to their order destination.
Problem Statement
The modern supply chain must evolve to meet increasing demands and supply chain managers need to plan to keep everything flowing smoothly. A combination of consumer expectations, more routes to market, international complexities and other factors creates significant challenges throughout the supply chain network. In this project, we will explore key variables from the fictional sporting goods supplier to identify how those variables can be used to explain whether an order for goods and services will be delivered to the retail store on time.
Business Justification
Customer satisfaction is a top priority of any retailer. An efficient supply chain will provide businesses insight into what to customers are ordering and the subsequent delivery time to customers. Data analysis of supply chain operations can save businesses millions of dollars by predicting late deliveries and allowing them to be proactive in providing services that maintain customer delivery satisfaction.
Hypothesis
For this project, we believe we can predict whether a shipment will be late or not based on various factors such as location, distance, time of year and shipping method.
Data Description
Dataset Citation:
Constante, Fabian; Silva, Fernando; Pereira, António (2019), “DataCo SMART SUPPLY CHAIN FOR BIG DATA ANALYSIS”, Mendeley Data, V5, doi: 10.17632/8gx2fvg2k6.5
Link: https://data.mendeley.com/datasets/8gx2fvg2k6/5 (https://data.mendeley.com/datasets/8gx2fvg2k6/5)
This dataset comes with a codebook that describes the structure, contents, and layout of our csv data file. From the codebook, we can see that we have 53 features ranging from ordering and shipping information to sales information. When we load the data we see that the dataset is roughly 180,519 rows and that the features consist of a combination of text data and numeric data such as order locations and numerical sales data. Specifically there are 24 character columns and 28 numeric columns. In loading the data and checking the structure of the data frame, we were able to verify that content of the code book matched the actual data.
Data Preparation
In exploring this dataset, we became aware that many of the city, state and country names were in Spanish. To help ease the process of feature engineering (See Feature Engineering section below), we spent time translating city, state and country names from Spanish to English. It’s believed that by having all the city, state and country names in English, it will make things easier if/when we need to create additional lititude/longitude features.
library(ggmap)
library(tmaptools)
library(RCurl)
library(jsonlite)
library(leaflet)
library(tidyverse)
library(RColorBrewer)
library(ggthemes)
library(lubridate)
library(geosphere)
library(splitTools)
library(ROCR)
library(plotROC)
library(DataExplorer)
library(WVPlots)
library(ggthemes)
library(dplyr)
library(ggplot2)
library(caret)
library(geosphere)
library(readr)
library(glmnet)
library(plotmo)
dataco <- read.csv('DataCoSupplyChainDataset.csv')
Translate the Country Names
es <- c("Japón","EE. UU.", "Estados Unidos", "Corea del Sur", "Singapur",
"Turquía", "República Democrática del Congo", "Marruecos","Alemania",
"Francia", "Países Bajos", "Reino Unido", "Panamá",
"República Dominicana", "México", "Sudán", "Costa de Marfil", "Egipto",
"Italia", "España", "Perú", "Argelia", "SudAfrica", "Ruanda",
"Nueva Zelanda", "Bangladés", "Tailandia", "Irak", "Filipinas",
"Kazajistán", "Irán", "Myanmar (Birmania)", "Uzbekistán", "Benín",
"Camerún", "Kenia", "Ucrania", "Polonia", "Rumania",
"Trinidad y Tobago", "Afganistán", "Pakistán", "Malasia", "Guatemala",
"Arabia Saudí", "Myanmar \\(Birmania\\)", "Finlandia", "Rusia", "Suecia",
"Brasil", "Irlanda", "Eslovaquia", "Noruega", "Bélgica", "Kirguistán")
en <- c("Japan", "US", "US","South Korea", "Singapore", "Turkey",
"Democratic Republic of Congo", "Morocco","Germany","France",
"Netherlands", "United Kingdom", "Panama", "Dominican Republic",
"Mexico", "Sudan", "Ivory Coast", "Egypt", "Italy", "Spain", "Peru",
"Algeria", "South Africa", "Rwanda", "New Zealand", "Bangladesh",
"Thailand", "Iraq", "Philippines", "Kazakhstan", "Iran",
"Myanmar (Burma)", "Uzbekistan", "Benin", "Cameroon", "Kenya", "Ukraine",
"Poland", "Romania", "Trinidad and Tobago", "Afghanistan", "Pakistan",
"Malaysia", "Guatemala", "Saudi Arabia", "Myanmar (Burma)", "Finland",
"Russia", "Sweden", "Brazil", "Ireland", "Slovakia", "Norway",
"Belgium", "Kyrgyzstan")
for (i in 1:length(es)){
dataco$`Customer.Country` <- dataco$`Customer.Country` %>% str_replace(es[i], en[i])
dataco$`Order.Country` <- dataco$`Order.Country` %>% str_replace(es[i], en[i])
}
Translate the state names
es <- c("Java Occidental", "Rajastán", "Tokio", "Célebes Septentrional", "Seúl",
"Australia Occidental", "Guangxi", "Singapur", "Nueva Gales del Sur",
"Sumatra Septentrional", "Territorio de la Capital Australiana",
"Australia del Sur", "Estambul", "Ulán Bator", "Gran Casablanca",
"Renania del Norte-Westfalia", "Isla de Francia", "Países del Loira",
"Inglaterra", "Alsacia-Champaña-Ardenas-Lorena",
"Provenza-Alpes-Costa Azul", "Jartum", "Sajonia-Anhalt", "Sicilia",
"Norte-Paso de Calais-Picardía", "Aquitania-Lemosín-Poitou-Charentes",
"Escocia", "Baja Sajonia", "Estocolmo", "Holanda Septentrional", "Viena",
"Alger", "Fez-Bulmán", "El Cairo", "Mar Rojo", "Cabo Oriental",
"Jerusalén", "Cataluña", "Daca", "Bagdad", "Riad", "Capital Nacional",
"Karagandá", "Java Central", "Rangún", "Ouest", "Marítima",
"Cabo Occidental", "Languedoc-Rosellón-Mediodía-Pirineos",
"Centro-Valle de Loira", "Kalimantan Oriental", "Punyab",
"Java Oriental", "Ciudad Ho Chi Minh", "Guyarat", "Yakarta",
"Nueva York", "Carolina del Norte", "Noroeste", "Hamburgo",
"Holanda Meridional", "Güeldres", "Emilia-Romaña",
"Finlandia del Sudoeste", "Bretaña", "Toscana", "Baden-Wurtemberg",
"Lombardía", "Piamonte", "Lieja", "Mindanao Septentrional",
"Dar es Salaam", "Nilo Blanco", "Limburgo", "Sofía-Ciudad", "Normandía",
"Cerdeña", "Abruzos", "Henao", "Normandía", "Bengala Occidental",
"Visayas Occidental", "Luzón Central", "Meca", "Pekín", "Célebes Suroriental",
"Territorio del Norte", "Sajonia", "Moscú", "Amberes", "Daguestán",
"Cheliábinsk", "Auvernia", "Astracán", "Mazovia", "Daguestán")
en <- c("West Java", "Rajasthan", "Tokyo", "North Sulawesi", "Seoul",
"Western Australia", "Guangzhou", "Singapore", "New South Wales",
"North Sumatra", "Australian Capital Territory", "South Australia",
"Istanbul", "Ulaanbaatar", "Grand Casablanca", "North Rhine-Westphalia",
"Île-de-France", "Pays de la Loire", "England",
"Alsace-Champagne-Ardenne-Lorraine", "Provence-Alpes-Côte d'Azur",
"Khartoum", "Saxony-Anhalt", "Sicily", "Nord-Pas de Calais-Picardy",
"Aquitaine-Limousin-Poitou-Charentes", "Scotland", "Lower Saxony",
"Stockholm", "North Holland", "Vienna", "Algiers", "Fez-Boulemane",
"Cairo", "Red Sea", "Eastern Cape", "Jerusalem", "Catalonia", "Dhaka",
"Baghdad", "Riyadh", "National Capital", "Karaganda", "Central Java",
"Yangon", "West", "Maritime", "Western Cape",
"Languedoc-Roussillon-Mid-Pyrenees", "Centre-Loire Valley",
"East Kalimantan", "Punjab", "East Java", "Ho Chi Minh City",
"Gujarat", "Jakarta", "New York", "North Carolina", "Northwest",
"Hamburg", "South Holland", "Gelderland", "Emilia-Romagna",
"Southwestern Finland", "Brittany", "Tuscany", "Baden-Württemberg",
"Lombardy", "Piedmont", "Liege", "Northern Mindanao", "Dar es Salaam",
"White Nile", "Limburg", "Sofia-City", "Normandy", "Sardinia",
"Abruzzo", "Hainaut", "Normandy", "West Bengal", "Western Visayas",
"Central Luzon", "Mecca", "Beijing", "South East Sulawesi",
"Northern Territory", "Saxony", "Moscow", "Antwerp", "Dagestan",
"Chelyabinsk", "Auvergne", "Astrakhan", "Masovia", "Dagestan")
for (i in 1:length(es)){
dataco$`Order.State` <- dataco$`Order.State` %>% str_replace(es[i], en[i])
}
Translate the City Names
es <- c("Tokio", "Seúl", "Singapur", "Ulan Bator", "Estambul", "Jartum",
"Reims", "Estocolmo", "Viena", "Marrakech", "Jerusalén", "Daca",
"Rangún", "Bagdad", "Hamburgo", "Kramators'k", "Yakarta", "K'ut'aisi",
"Brandenburgo", "Reggio nell'Emilia")
en <- c("Tokyo", "Seoul", "Singapore", "Ulaanbaatar", "Istanbul", "Khartoum",
"Rheims", "Stockholm", "Vienna", "Marrakesh", "Jerusalem", "Dhaka",
"Yangon", "Baghdad", "Hamburg", "Kramatorsk", "Jakarta", "Kutaisi",
"Brandenburg", "Reggio Emilia")
for (i in 1:length(es)){
dataco$`Customer.City` <- dataco$`Customer.City` %>% str_replace(es[i], en[i])
dataco$`Order.City` <- dataco$`Order.City` %>% str_replace(es[i], en[i])
}
Exploratory Data Analysis
First, as we are interested in late delivery risk, let’s take a look at the distribution of the delivery status in our data.
dataco %>% group_by(Delivery.Status) %>% count(name = 'counts') %>%
ggplot(data=.,aes(x=Delivery.Status,y=counts))+geom_bar(stat="identity")
We can see that we have a lot more late delivery items than any other class.
This is good for our analysis since it will reduce the need for any imbalanced data techniques when trying to predict late delivery.
Now, let’s see if there’s a relationship between market and late delivery
dataco %>% group_by(Market,Late_delivery_risk) %>%
count() %>%
mutate(Late_delivery_risk = ifelse(Late_delivery_risk ==1,'Risk','No Risk')) %>%
group_by(Market)%>% mutate(percent = n/sum(n)) %>%
ggplot(data=.,aes(x=Market,y=percent,fill = Late_delivery_risk)) +
geom_bar(position="fill", stat="identity")
Obviously, the late risk is not associated with any specific market. This means we can safely ignore this variable in our analysis.
Let’s try the same analysis for different departments to see if being late is associated with any specific department
dataco %>% group_by(Department.Name,Late_delivery_risk) %>%
count() %>% mutate(Late_delivery_risk = ifelse(Late_delivery_risk ==1,'Risk','No Risk')) %>%
group_by(Department.Name)%>% mutate(percent = n/sum(n)) %>%
ggplot(data=.,aes(x=Department.Name,y=percent,fill = Late_delivery_risk)) +
geom_bar(position="fill", stat="identity")
As we can see, not much difference exists between different departments. However, pet shops and bookshops are a bit higher in that risk.
Lets take a look at the shipping mode
dataco %>% group_by(Shipping.Mode,Late_delivery_risk) %>%
count() %>% mutate(Late_delivery_risk = ifelse(Late_delivery_risk ==1,'Risk','No Risk')) %>%
group_by(Shipping.Mode)%>% mutate(percent = n/sum(n)) %>%
ggplot(data=.,aes(x=Shipping.Mode,y=percent,fill = Late_delivery_risk)) +
geom_bar(position="fill", stat="identity")
Interesting, shipping mode seems to have a big effect on the risk of late delivery! This would be a useful variable in building our models.
Now let’s see where our orders come from:
dataco %>% select(Latitude,Longitude)%>%
leaflet( width = 900) %>%
addTiles() %>%
addMarkers(clusterOptions = markerClusterOptions(), popup = "Hi")
We can clearly see that we are mainly focused in the US. This would help us focus our analysis on the US market.
Finally, let’s take a further look at the US market since it is our focus point. We look at the top 10 deliveries to US cities and check their risk to no-risk orders ratio.
# Late delivery and order city
Late_delivery_risk <- dataco$Late_delivery_risk[dataco$Order.Region == "West of USA" |
dataco$Order.Region == "US Center" |
dataco$Order.Region == "East of USA" |
dataco$Order.Region == "South of USA"]
Order.City <- dataco$Order.City[dataco$Order.Region == "West of USA" |
dataco$Order.Region == "US Center" |
dataco$Order.Region == "East of USA" |
dataco$Order.Region == "South of USA"]
dataco4 <- data.frame(Late_delivery_risk, Order.City)
dataco4 <- dataco4 %>%
count(Late_delivery_risk, Order.City, name= "count")
dataco4 <- dataco4 %>% arrange(desc(count)) %>% slice(1:20)
ggplot(data = dataco4, aes(x = Order.City, y = count, fill = Late_delivery_risk)) +
geom_bar(stat='identity') +
labs(x = "Order City", y="Number of Orders", title="Top 10 Deliveries in the US")
This concludes our EDA. Now let’s move to building the model!
Model Building
To begin building some preliminary models we needed to identify what variables we thought would help predict late delivery status.
For our initial models, we thought we could use the following variables:
- Shipping Mode
- Order Status
- Order Item Quantity
- Late_delivery_risk (response)
In looking at the other variables in the data, many of them were just not useful for building a model to predict shipping late times. Examples would be customer names, department ids, as well as sales data.
We did not include the global market as we found nearly identical data in our EDA in the percentages, means, and variances between markets. It was determined that they would not have been a good predictor for our models.
Finally, it was hoped that product status (in stock vs stock outs) would be a good predictor for our model, but when we looked at the data, products were always stocked and there were never any stock outs.
This didn’t leave us with much to go on to build a model and because of this, we began to think about creating new features around what we knew about the data. (See Feature Engineering below)
Initial Models
To gain further insights into our data set, we began by building an intercept only model.
By fitting a model with no features, we see that the odds of being late is 1.21 with a probability of 54.83%
# Estimated log odds of being late
fit <- glm(Late_delivery_risk ~ 1, data = dataco, family = "binomial")
s <- summary(fit)
s
##
## Call:
## glm(formula = Late_delivery_risk ~ 1, family = "binomial", data = dataco)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.261 -1.261 1.096 1.096 1.096
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.193769 0.004729 40.97 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 248566 on 180518 degrees of freedom
## Residual deviance: 248566 on 180518 degrees of freedom
## AIC: 248568
##
## Number of Fisher Scoring iterations: 3
odds <- exp(s$coefficients[1])
paste("Odds of a shipment being late:", round(odds,2))
## [1] "Odds of a shipment being late: 1.21"
p <- odds/(1+odds)
paste0("Probability of a shipment being late: ", round(p*100, 2), "%")
## [1] "Probability of a shipment being late: 54.83%"
Next, we looked at Order Status and Order Item Quantity as a potential predictor for late delivery. Unfortunately with this model, none of the factors were significant.
# Estimated log odds of being late
fit <- glm(Late_delivery_risk ~ `Order.Item.Quantity` + `Order.Status`, data = dataco, family = "binomial")
s <- summary(fit)
s
##
## Call:
## glm(formula = Late_delivery_risk ~ Order.Item.Quantity + Order.Status,
## family = "binomial", data = dataco)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.316 -1.308 1.045 1.052 1.084
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.657e+01 3.949e+01 -0.419 0.675
## Order.Item.Quantity -2.989e-04 3.346e-03 -0.089 0.929
## Order.StatusCLOSED 1.683e+01 3.949e+01 0.426 0.670
## Order.StatusCOMPLETE 1.687e+01 3.949e+01 0.427 0.669
## Order.StatusON_HOLD 1.679e+01 3.949e+01 0.425 0.671
## Order.StatusPAYMENT_REVIEW 1.685e+01 3.949e+01 0.427 0.670
## Order.StatusPENDING 1.688e+01 3.949e+01 0.428 0.669
## Order.StatusPENDING_PAYMENT 1.687e+01 3.949e+01 0.427 0.669
## Order.StatusPROCESSING 1.685e+01 3.949e+01 0.427 0.670
## Order.StatusSUSPECTED_FRAUD -3.883e-06 5.456e+01 0.000 1.000
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 248566 on 180518 degrees of freedom
## Residual deviance: 235797 on 180509 degrees of freedom
## AIC: 235817
##
## Number of Fisher Scoring iterations: 15
Things were looking grim for our feature set, but we forged on and looked at Shipping Mode. Interestingly, shipping mode seemed to play a significant role in whether a package was on time or not. Logically, this makes sense.
# Estimated log odds of being late
fit <- glm(Late_delivery_risk ~ Shipping.Mode, data = dataco, family = "binomial")
s <- summary(fit)
s
##
## Call:
## glm(formula = Late_delivery_risk ~ Shipping.Mode, family = "binomial",
## data = dataco)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4748 -0.9790 0.3095 1.2507 1.3898
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.01450 0.02839 106.17 <2e-16 ***
## Shipping.ModeSame Day -3.18519 0.03493 -91.19 <2e-16 ***
## Shipping.ModeSecond Class -1.82681 0.03106 -58.81 <2e-16 ***
## Shipping.ModeStandard Class -3.50101 0.02908 -120.40 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 248566 on 180518 degrees of freedom
## Residual deviance: 205412 on 180515 degrees of freedom
## AIC: 205420
##
## Number of Fisher Scoring iterations: 5
Feature Engineering
With only shipping mode as a predictor, we felt this wasn’t enough to build a highly performing model. Because of that, we began to think how we could create new features with greater predictive power.
From the data, it was observed that the store locations were encoded as latitude and longitude coordinates instead of a store address. (Note: It was observed that Customer Address roughly corresponded to the store latitdue and longitude locations)
Knowing that the store location was encoded as lat/long we surmised we could create a distance variable that corresponded to the the distance from the store to the order destination. The thinking being that longer distances from the store may lead to late deliveries.
Using the Order Address information, we were able to use the googleLanguageR package to query Google Maps and generate lat/long data for our Order addresses.
Once we created our order location as lat/long coordinates, we were then able to create a haversine distance feature (in meters) using the R geosphere package.
Finally, during team discussions we observed that it was possible to create an order delay feature that indicated the number of days from when an order was placed to when it was shipped. Using the lubridate package we were able to encode the order date and shipping date into datetime format then subtract the two to create our order delay feature.
Create the order delay feature
Using lubridate to set order date and ship date to datetime objects then create an order delay variable.
dataco <- dataco %>% mutate(OrderDate = mdy_hm(order.date..DateOrders.))
dataco <- dataco %>% mutate(ShippingDate = mdy_hm(shipping.date..DateOrders.))
dataco$OrderDelay <- as.numeric(dataco$ShippingDate - dataco$OrderDate)
Geocoding using Google Maps API
Note: This code has been commented out as to avoid calling the Google API again.
# register_google(key = Sys.getenv("GAPI_KEY"))
# dataco <- dataco %>% mutate(OrderAddress = paste0(`Order City`, ", ", `Order State`, ", ", `Order Country`))
# dataco_oa <- dataco %>% select('OrderAddress') %>% unique()
# dim(dataco_oa)
#
# dataco <- dataco %>% mutate(CustomerAddress = paste0(`Customer City`, ", ", `Customer State`, ", ", `Customer Country`))
# dataco_ca <- dataco %>% select('CustomerAddress') %>% unique()
# dim(dataco_ca)
# oa <- geocode(location = dataco_oa$OrderAddress, output = "more", source = "google")
# df_oa_latlong <- cbind(dataco_oa, oa)
#
# ca <- geocode(location = dataco_ca$CustomerAddress, output = "more", source = "google")
# df_ca_latlong <- cbind(dataco_ca, ca)
#
# write.csv(df_oa_latlong, "OrderAddresses.csv", row.names=FALSE)
# write.csv(df_ca_latlong, "CustomerAddresses.csv", row.names=FALSE)
#
# write.csv(oa, "OA.csv", row.names=FALSE)
# write.csv(ca, "CA.csv", row.names=FALSE)
# write.csv(df, "DataCoSupplyChainDataset-JPG-Cleaned.csv", row.names=FALSE)
# write.csv(df_oa, "OrderAddresses.csv", row.names=FALSE)
# write.csv(df_ca, "CustomerAddresses.csv", row.names=FALSE)
Create the distance feature
Creating the haversine distance variable from two lat/log points using the geosphere package.
dataco$Late_delivery_risk = as.factor(dataco$Late_delivery_risk)
getHaversine <- function(lon1, lat1, lon2, lat2){
pt1 <- c(lon1, lat1)
pt2 <- c(lon2, lat2)
return(distHaversine(pt1, pt2))
}
df_oa_latlong <- read_csv("OrderAddresses.csv", show_col_types = FALSE, locale = readr::locale(encoding = "latin1"))
df_ca_latlong <- read_csv("CustomerAddresses.csv", show_col_types = FALSE, locale = readr::locale(encoding = "latin1"))
df_oa_latlong <- df_oa_latlong %>% rename(olon = lon, olat=lat, otype=type, oloctype=loctype, oaddress=address, onorth=north, osouth=south, oeast=east, owest=west)
df_ca_latlong <-df_ca_latlong %>% rename(clon = lon, clat=lat, ctype=type, cloctype=loctype, caddress=address, cnorth=north, csouth=south, ceast=east, cwest=west)
dataco <- left_join(dataco, df_ca_latlong, by ='CustomerAddress')
dataco <- left_join(dataco, df_oa_latlong, by ='OrderAddress')
dataco <- dataco %>% mutate(OrderDistance = distHaversine(cbind(Longitude, Latitude), cbind(olon, olat)))
dataco <- dataco %>% drop_na(OrderDistance)
Create Dummy Variables
Creating indicator/dummy variables and a special Risk variable for use with the caret package.
dataco <- dataco %>% mutate(Risk = factor(Late_delivery_risk, labels = make.names(levels(Late_delivery_risk))))
dataco <- dataco %>% mutate(ShipModeSD = ifelse(Shipping.Mode=="Same Day", 1,0))
dataco <- dataco %>% mutate(ShipModeFstC = ifelse(Shipping.Mode=="First Class", 1,0))
dataco <- dataco %>% mutate(ShipModeStdC = ifelse(Shipping.Mode=="Standard Class", 1,0))
Split Data into train and test sets and train models
set.seed(100)
inds <- partition(dataco$Late_delivery_risk, p = c(train = 0.8, test = 0.2))
train <- dataco[inds$train, ]
test <- dataco[inds$test, ]
Variable Selection : Lasso
set.seed(100)
fulldata = dataco %>% filter(Order.Country=='US') %>%
select(Late_delivery_risk,ShipModeSD,ShipModeFstC,ShipModeStdC,OrderDelay,OrderDistance)
Xpred = as.matrix(fulldata[,-c(1)])
Y = fulldata[,c(1)]
lasso.cv=cv.glmnet(Xpred,Y, alpha=1, nfolds=10, family ='binomial')
lasso.model = glmnet(Xpred,Y, alpha=1, nlambda = 100, family = 'binomial')
plot_glmnet(lasso.model, label = TRUE)
abline(v= log(lasso.cv$lambda.min) , col='black', lty =2)
lasso.model = glmnet(Xpred,Y, alpha=1,lambda = lasso.cv$lambda.min, family = 'binomial')
coef(lasso.model)
## 6 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -1.052930e+01
## ShipModeSD 8.436925e+00
## ShipModeFstC 5.603538e+00
## ShipModeStdC -7.783560e+00
## OrderDelay 1.666001e-01
## OrderDistance -3.167493e-08
Variable Selection: Elastic Net
set.seed(100)
fulldata = dataco %>% filter(Order.Country=='US') %>%
select(Late_delivery_risk,ShipModeSD,ShipModeFstC,ShipModeStdC,OrderDelay,OrderDistance)
Xpred = as.matrix(fulldata[,-c(1)])
Y = fulldata[,c(1)]
elastic.cv=cv.glmnet(Xpred,Y, alpha=1, nfolds=10, family ='binomial')
elastic.model = glmnet(Xpred,Y, alpha=1, nlambda = 100, family = 'binomial')
plot_glmnet(elastic.model, label = TRUE)
abline(v= log(elastic.cv$lambda.min) , col='black', lty =2)
elastic.model = glmnet(Xpred,Y, alpha=1,lambda = elastic.cv$lambda.min, family = 'binomial')
coef(elastic.model)
## 6 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -1.052930e+01
## ShipModeSD 8.436925e+00
## ShipModeFstC 5.603538e+00
## ShipModeStdC -7.783560e+00
## OrderDelay 1.666001e-01
## OrderDistance -3.167493e-08
Both Lasso and Elastic Net selected the shipping mode, OrderDelay and Order Distance as having explanatory power.
Final Model Selected
final.model <- train(Risk ~ OrderDelay + OrderDistance + ShipModeSD + ShipModeFstC + ShipModeStdC,
data = train,
method = "glm",
family=binomial(link = "logit"),
trControl = trainControl("cv", number = 10,
summaryFunction=twoClassSummary,
classProbs=T,savePredictions = T)
)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
summary(final.model)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.0826 -0.0639 0.0155 0.3151 1.2487
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.036e+01 8.951e-02 -115.737 <2e-16 ***
## OrderDelay 1.614e-01 1.237e-03 130.468 <2e-16 ***
## OrderDistance 1.787e-09 2.625e-09 0.681 0.496
## ShipModeSD 8.257e+00 7.787e-02 106.036 <2e-16 ***
## ShipModeFstC 5.577e+00 5.181e-02 107.629 <2e-16 ***
## ShipModeStdC -7.474e+00 6.632e-02 -112.688 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 155078 on 112663 degrees of freedom
## Residual deviance: 44053 on 112658 degrees of freedom
## AIC: 44065
##
## Number of Fisher Scoring iterations: 7
final.model$results
## parameter ROC Sens Spec ROCSD SensSD SpecSD
## 1 none 0.9711444 0.9474701 0.9536277 0.00173703 0.003023774 0.002261609
We find that Based on the final model, it has ROC of 0.9711416.
Creating Help functions to evaluate our models.
getAuc=function(model,threshold){
df_test <- test %>% mutate(pred_prob = predict(model, newdata = ., type = "prob")['X1']) %>%
mutate(pred_outcome = ifelse(pred_prob >= threshold,1,0))
pred <- prediction(df_test$pred_prob, df_test$Risk)
auc <- performance(pred, measure = "auc")
auc_value = [email protected][[1]]
return (auc_value)
}
PlotModelRoc= function(model,threshold){
df_test <- test %>% mutate(pred_prob = predict(model, newdata = ., type = "prob")['X1']) %>%
mutate(pred_outcome = ifelse(pred_prob >= threshold,1,0))
pred <- prediction(df_test$pred_outcome, df_test$Risk)
perf <- performance(pred, "tpr", "fpr") # tpr and fpr are true and false positive rates
plot(perf, colorize=T,main='RoC Curve ')
}
getAcc=function(model,threshold){
prob_pred <- predict(model,test)
t <- table(prob_pred, test$Risk)
acc <- (t[1,1] + t[2,2]) / sum(t)
return(acc)
}
Testing
confusion matrix for the final model – test data
prob_pred <- predict(final.model,test)
confusionMatrix(test$Risk,prob_pred,positive = 'X1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction X0 X1
## X0 12058 631
## X1 684 14794
##
## Accuracy : 0.9533
## 95% CI : (0.9508, 0.9557)
## No Information Rate : 0.5476
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9057
##
## Mcnemar's Test P-Value : 0.1516
##
## Sensitivity : 0.9591
## Specificity : 0.9463
## Pos Pred Value : 0.9558
## Neg Pred Value : 0.9503
## Prevalence : 0.5476
## Detection Rate : 0.5252
## Detection Prevalence : 0.5495
## Balanced Accuracy : 0.9527
##
## 'Positive' Class : X1
##
ROC plot for the final model – test data
PlotModelRoc(final.model,0.5)
AUC for the final model – test data
threshold = 0.5
final.acc=getAcc(final.model)
final.auc=getAuc(final.model,threshold)
df_acc = data.frame(Model=c('final.Model'),
Auc =c(final.auc))
df_acc
## Model Auc
## 1 final.Model 0.9733868
Prediction for sample data
options(scipen=999)
neworder = data.frame(OrderDelay=c(5,10,100,20),
OrderDistance =c(100,200,300,400),
ShipModeSD=c(1,0,0,0),
ShipModeFstC =c(0,1,0,0),
ShipModeStdC =c(0,0,0,1))
head(neworder)
## OrderDelay OrderDistance ShipModeSD ShipModeFstC ShipModeStdC
## 1 5 100 1 0 0
## 2 10 200 0 1 0
## 3 100 300 0 0 0
## 4 20 400 0 0 1
predict(final.model,neworder,type='prob')
## X0 X1
## 1 0.785168856 0.2148311439320
## 2 0.959648437 0.0403515633301
## 3 0.003072117 0.9969278829857
## 4 0.999999546 0.0000004540954
X0 = Probability of order being on time X1 = Probability of order being late
Conclusion
Model overview
The key takeaways from our modeling are that the final model includes the following features: Order Delay, Order Distance, and Shipping Mode. All of these are significant except Order Distance. However, we kept Order Distance in the final model because it was determined to be useful for explaining the model and also chosen by the variable selection methods. Reducing the number of predictors in the model via variable selection is particularly important in shrinking the impact of overfitting and in simplifying the model; a reduced model is simpler to interpret, however we were not able to find any of the variables dropped.
Model features and interpretations
The model features and its respective interpretations include the Default (Intercept) as Second Class Delivery, the Order Delay as the days between order date and shipping date, Order Distance as the distance between the order store and customer address, Ship Mode SD as Same Day Delivery, Ship Mode Fstc as First-Class Delivery, and the Ship Mode Stdc as the Standard Class Delivery option.
For the base case of shipping by Second Class, the odds of being late is e^-10.359. Following this, a one percent change in Order Delay leads to an 18 percent chance for a late delivery, and a one percent change in the Order Distance would lead to almost no change in the probability for late delivery.
Recommendations
Based on our modeling efforts, it was observed that shipping method and Order Delay were the two biggest factors in determining whether a package would be delayed. The distance from the store to the order destination did not seem to affect the model performance as much as we might have hoped. Because it was observed that Order Delay did affect our models, it is recommended that a program implemented to minimize the amount of time from when the order was placed to the time it was shipped. Further operational studies may be needed to determine the best method to minimize Order Delay.
Our initial recommendation would be to investigate the rate at which orders are picked and packaged as well as the overall operational ordering process. This may involve creating simulation studies of the picking and packaging process to help determine the best path forward. Even though this dataset indicated there were no stock outs, this new program ensures that products with longer ship delay times are always stocked. This may involve determining which products are most frequently ordered and making sure those products are given more attention in inventory operations.
Since shipping method also played a large role in determining whether a package would be late, a cost/benefit analysis of either a subscription-based model or a customer loyalty program that prioritizes subscribers or loyal customer shipments. As an example, loyal customers or subscribers might be given First Class or same day shipping options by default.
Our final recommendation is that the business implement a production version of our model so that the order status can be monitored in real-time. If the model predicts that a package will be late, then the risk could be mitigated by either expediting the shipment or by setting the shipment method to First Class or same day. If a late delivery is unpreventable, the marketing team could employ a customer loyalty discount program such that the customer received a discount on their next purchase.
As a final note, if the model is deployed into production, it is imperative that the model be updated periodically, on average every three to six months. Additionally, it is recommended that further work be performed around building and implementing control charts to ensure that the order delays do not exceed a certain amount.