Pricing Diamonds with the 4 C’s¶
By: Kyle W. Brown
Problem Statement¶
Pricing diamonds is notoriously difficult, especially for brokers. Two diamonds can be the same size and weight yet priced thousands of dollars apart. This poses significant challenges to setting the fair market value of diamonds.
Value Proposition¶
Give diamond brokers insight into how diamonds are priced. The objective is to provide a tool, such as a dashboard, that offers a clearer understanding of how diamonds may be priced.
Problems¶
The problems faced during this analysis include:
Determining the relationship between the 4 C’s and pricing: are there any identifiable patterns?
How are the 4 C’s distributed across the data?
How should the cut, color, and clarity categorical variables be addressed?
How accurately can the price of diamonds be predicted?
Solutions¶
There appears to be an inverse pricing pattern between the 4 C’s and the price of diamonds. When comparing the best color with the best clarity diamonds, we see that the average price ($8,307) is significantly higher than the rest of the pivot table.
There are correlations among the features, and as a whole the data is not normally distributed.
Addressed the cut, color, and clarity categorical variables with ordinal encoding of 1-5 (cut), 1-7 (color), and 1-8 (clarity), ordered from best to worst.
Based on the best performing model, price can be predicted quite accurately, with a reported score of about 99% (see the note on R² versus |R| in the Random Forest section).
Notable patterns include:
The inverse pricing pattern is first observed in the average price of diamonds by color, going from best to worst; cut and clarity show similar behavior.
The inverse pricing is due to carat size increasing from the best to the worst diamonds across cut, color, and clarity.
The worst cut, color, and clarity diamonds have the highest average prices.
The best cut, color, and clarity diamonds are among the smallest carat weights in the dataset.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
sns.set()
diamonds = "https://raw.githubusercontent.com/kyle-w-brown/diamonds-prediction/main/data/diamonds.csv"
df_diamonds = pd.read_csv(diamonds)
df_diamonds.head()
carat | cut | color | clarity | depth | table | price | x | y | z | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
df_diamonds.shape
(53940, 10)
Almost 54,000 rows in the dataset.
df_diamonds.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 carat 53940 non-null float64
1 cut 53940 non-null object
2 color 53940 non-null object
3 clarity 53940 non-null object
4 depth 53940 non-null float64
5 table 53940 non-null float64
6 price 53940 non-null int64
7 x 53940 non-null float64
8 y 53940 non-null float64
9 z 53940 non-null float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.1+ MB
Consolidating x, y, and z into volume.
df_diamonds['volume'] = round(df_diamonds['x'] * df_diamonds['y'] * df_diamonds['z'], 2)
df_diamonds.head()
carat | cut | color | clarity | depth | table | price | x | y | z | volume | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 | 38.20 |
1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 | 34.51 |
2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 | 38.08 |
3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 | 46.72 |
4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 | 51.92 |
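Since carat measures mass and volume approximates physical size, the two should track each other closely. A quick sanity check one might run (expected to show a correlation near 1):

# Correlation between mass (carat) and the derived size proxy (volume)
print(df_diamonds[['carat', 'volume']].corr())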
Data Cleansing¶
df_diamonds[['x','y','z','volume']] = df_diamonds[['x','y','z','volume']].replace(0, np.nan)  # zero dimensions are physically impossible
df_diamonds.isnull().sum()
carat 0
cut 0
color 0
clarity 0
depth 0
table 0
price 0
x 8
y 7
z 20
volume 20
dtype: int64
Removing missing data
df_diamonds.dropna(inplace=True)
df_diamonds.isnull().sum()
carat 0
cut 0
color 0
clarity 0
depth 0
table 0
price 0
x 0
y 0
z 0
volume 0
dtype: int64
Outliers¶
Removing the outliers
df_diamonds = df_diamonds[(df_diamonds["carat"] <= 5)]
df_diamonds = df_diamonds[(df_diamonds["depth"] < 75) & (df_diamonds["depth"] > 45)]
df_diamonds = df_diamonds[(df_diamonds["table"] < 75) & (df_diamonds["table"] > 45)]
df_diamonds = df_diamonds[(df_diamonds["x"] < 30) & (df_diamonds["x"] > 2)]
df_diamonds = df_diamonds[(df_diamonds["y"] < 30) & (df_diamonds["y"] > 2)]
df_diamonds = df_diamonds[(df_diamonds["z"] < 30) & (df_diamonds["z"] > 2)]
df_diamonds = df_diamonds[(df_diamonds["volume"] < 3500)]
df_diamonds.shape
(53902, 11)
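The cutoffs above were hand-picked from inspecting the data. A more systematic alternative, sketched here using the conventional 1.5 × IQR rule (an assumption, not what this notebook actually applies), filters each numeric column by its interquartile fences:

# Sketch: 1.5*IQR fences per numeric column, as an alternative to manual cutoffs
num_cols = ['carat', 'depth', 'table', 'x', 'y', 'z', 'volume']
q1 = df_diamonds[num_cols].quantile(0.25)
q3 = df_diamonds[num_cols].quantile(0.75)
iqr = q3 - q1
mask = ~((df_diamonds[num_cols] < (q1 - 1.5 * iqr)) |
         (df_diamonds[num_cols] > (q3 + 1.5 * iqr))).any(axis=1)
df_iqr_filtered = df_diamonds[mask]  # stricter than the manual rules above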
Exploration¶
df = df_diamonds.describe()
heading_properties = [('font-size', '11px')]
cell_properties = [('font-size', '11px')]
dfstyle = [dict(selector="th", props=heading_properties),\
dict(selector="td", props=cell_properties)]
df.style.set_table_styles(dfstyle)
carat | depth | table | price | x | y | z | volume | |
---|---|---|---|---|---|---|---|---|
count | 53902.000000 | 53902.000000 | 53902.000000 | 53902.000000 | 53902.000000 | 53902.000000 | 53902.000000 | 53902.000000 |
mean | 0.797555 | 61.749434 | 57.455694 | 3930.426793 | 5.731398 | 5.733239 | 3.539387 | 129.790389 |
std | 0.473433 | 1.419670 | 2.221249 | 3986.883678 | 1.119202 | 1.111083 | 0.691292 | 76.399565 |
min | 0.200000 | 50.800000 | 49.000000 | 326.000000 | 3.730000 | 3.680000 | 2.060000 | 31.710000 |
25% | 0.400000 | 61.000000 | 56.000000 | 949.000000 | 4.710000 | 4.720000 | 2.910000 | 65.190000 |
50% | 0.700000 | 61.800000 | 57.000000 | 2401.000000 | 5.700000 | 5.710000 | 3.530000 | 114.840000 |
75% | 1.040000 | 62.500000 | 59.000000 | 5322.000000 | 6.540000 | 6.540000 | 4.040000 | 170.840000 |
max | 4.500000 | 73.600000 | 73.000000 | 18823.000000 | 10.230000 | 10.160000 | 6.720000 | 698.460000 |
Looking at the data, we see that the average carat size is 0.8 and the largest carat is 4.5. The average price per diamond is almost $4,000, while the most expensive diamond is priced at $18,823.
Exploring the Categorical Variables¶
df_diamonds['cut'].unique()
array(['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'], dtype=object)
df_diamonds['clarity'].unique()
array(['SI2', 'SI1', 'VS1', 'VS2', 'VVS2', 'VVS1', 'I1', 'IF'],
dtype=object)
df_diamonds['color'].unique()
array(['E', 'I', 'J', 'H', 'F', 'G', 'D'], dtype=object)
df_diamonds.describe(include=object)
cut | color | clarity | |
---|---|---|---|
count | 53902 | 53902 | 53902 |
unique | 5 | 7 | 8 |
top | Ideal | G | SI1 |
freq | 21542 | 11281 | 13058 |
Counting the values per unique feature
df_diamonds['cut'].value_counts()
Ideal 21542
Premium 13779
Very Good 12079
Good 4902
Fair 1600
Name: cut, dtype: int64
According to this printout, the total number of diamonds decreases from best (Ideal) to worst (Fair).
df_diamonds['color'].value_counts()
G 11281
E 9791
F 9535
H 8296
D 6774
I 5419
J 2806
Name: color, dtype: int64
df_diamonds['clarity'].value_counts()
SI1 13058
VS2 12250
SI2 9183
VS1 8165
VVS2 5066
VVS1 3654
IF 1790
I1 736
Name: clarity, dtype: int64
Reordering the cut, color, and clarity categorical variables from best to worst¶
df_diamonds['cut'] = pd.Categorical(df_diamonds['cut'], ["Ideal", "Premium", "Very Good", "Good", "Fair"])
df_diamonds = df_diamonds.sort_values('cut')
df_diamonds['color'] = pd.Categorical(df_diamonds['color'], ["D", "E", "F", "G", "H", "I", "J"])
df_diamonds = df_diamonds.sort_values('color')
df_diamonds['clarity'] = pd.Categorical(df_diamonds['clarity'], ["IF", "VVS1", "VVS2", "VS1", "VS2", "SI1", "SI2", "I1"])
df_diamonds = df_diamonds.sort_values('clarity')
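Note that three successive sort_values calls leave the frame ordered by the last key (clarity) alone; if a single best-to-worst ordering across all three variables is intended, a multi-key sort does it in one call (a sketch, relying on the ordered Categoricals defined above):

# One call: primary key cut, then color, then clarity, all best to worst
df_diamonds = df_diamonds.sort_values(['cut', 'color', 'clarity'])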
Average price for cut, color, and clarity
cut_avg = round(df_diamonds.groupby('cut')['price'].mean().reset_index(), 2)
cut_avg
cut | price | |
---|---|---|
0 | Ideal | 3456.21 |
1 | Premium | 4579.13 |
2 | Very Good | 3982.12 |
3 | Good | 3926.40 |
4 | Fair | 4350.67 |
The best cut (Ideal) diamonds have the lowest average price.
color_avg = round(df_diamonds.groupby('color', as_index=False)['price'].mean(), 2)
color_avg
color | price | |
---|---|---|
0 | D | 3168.11 |
1 | E | 3077.52 |
2 | F | 3723.99 |
3 | G | 3997.05 |
4 | H | 4479.38 |
5 | I | 5089.27 |
6 | J | 5319.49 |
The worst color (J) diamonds have the highest average price.
clarity_avg = round(df_diamonds.groupby('clarity', as_index=False)['price'].mean(), 2)
clarity_avg
clarity | price | |
---|---|---|
0 | IF | 2864.84 |
1 | VVS1 | 2519.51 |
2 | VVS2 | 3283.74 |
3 | VS1 | 3839.14 |
4 | VS2 | 3923.01 |
5 | SI1 | 3993.02 |
6 | SI2 | 5059.96 |
7 | I1 | 3910.66 |
Comparing the 4 C’s with Pivot Tables¶
Comparing the cut, color, and clarity variables with price and carat in pivot tables.
Tables of Cut and Clarity¶
cut_clarity = df_diamonds.pivot_table('price', index='cut', columns='clarity')
heading_properties = [('font-size', '11px')]
cell_properties = [('font-size', '11px')]
dfstyle = [dict(selector="th", props=heading_properties),\
dict(selector="td", props=cell_properties)]
cut_clarity.style.set_table_styles(dfstyle)
clarity | IF | VVS1 | VVS2 | VS1 | VS2 | SI1 | SI2 | I1 |
---|---|---|---|---|---|---|---|---|
cut | ||||||||
Ideal | 2272.913366 | 2468.129458 | 3250.290100 | 3487.972393 | 3281.928374 | 3750.394860 | 4755.876396 | 4335.726027 |
Premium | 3856.143478 | 2810.304065 | 3795.122989 | 4485.462041 | 4546.261919 | 4448.621886 | 5539.047910 | 3958.881773 |
Very Good | 4396.216418 | 2459.441065 | 3037.765182 | 3808.267343 | 4215.403089 | 3932.391049 | 4988.688095 | 4078.226190 |
Good | 4098.323944 | 2254.774194 | 3079.108392 | 3801.445988 | 4262.236196 | 3689.533333 | 4571.627087 | 3584.694737 |
Fair | 1912.333333 | 3871.352941 | 3349.768116 | 4152.029586 | 4187.647287 | 4191.592593 | 5180.094624 | 3646.451923 |
We see that the best cut (Ideal) and the best clarity (IF) diamonds have the third-lowest average price across the entire table.
cut_clarity_ct = df_diamonds.pivot_table('carat', index='cut', columns='clarity')
heading_properties = [('font-size', '13px')]
cell_properties = [('font-size', '13px')]
dfstyle = [dict(selector="th", props=heading_properties),\
dict(selector="td", props=cell_properties)]
cut_clarity_ct.style.set_table_styles(dfstyle)
clarity | IF | VVS1 | VVS2 | VS1 | VS2 | SI1 | SI2 | I1 |
---|---|---|---|---|---|---|---|---|
cut | ||||||||
Ideal | 0.455041 | 0.495960 | 0.586213 | 0.674453 | 0.670213 | 0.801575 | 1.007901 | 1.222671 |
Premium | 0.603478 | 0.533740 | 0.654724 | 0.793308 | 0.833421 | 0.907865 | 1.143252 | 1.289212 |
Very Good | 0.618769 | 0.494588 | 0.566389 | 0.733683 | 0.811108 | 0.845978 | 1.064338 | 1.281905 |
Good | 0.616338 | 0.502312 | 0.614930 | 0.757685 | 0.850787 | 0.830397 | 1.034193 | 1.199895 |
Fair | 0.474444 | 0.664706 | 0.691594 | 0.878284 | 0.887791 | 0.962395 | 1.204688 | 1.345048 |
This table indicates that average carat increases from the best clarity to the worst clarity diamonds. What’s interesting is that we see this pattern across all cut grades.
Tables of cut and color¶
cut_color = df_diamonds.pivot_table('price', index='cut', columns='color')
heading_properties = [('font-size', '12px')]
cell_properties = [('font-size', '12px')]
dfstyle = [dict(selector="th", props=heading_properties),\
dict(selector="td", props=cell_properties)]
cut_color.style.set_table_styles(dfstyle)
color | D | E | F | G | H | I | J |
---|---|---|---|---|---|---|---|
cut | |||||||
Ideal | 2629.094566 | 2597.684008 | 3373.863755 | 3718.469070 | 3887.452152 | 4449.548541 | 4918.343017 |
Premium | 3623.767790 | 3538.914420 | 4325.099571 | 4502.207806 | 5195.375531 | 5939.557814 | 6294.591584 |
Very Good | 3470.467284 | 3216.314012 | 3778.820240 | 3872.753806 | 4535.059243 | 5255.879568 | 5103.513274 |
Good | 3405.382175 | 3423.644159 | 3498.761852 | 4105.907940 | 4276.254986 | 5078.532567 | 4574.172638 |
Fair | 4291.061350 | 3703.248869 | 3801.087097 | 4241.022581 | 5135.683168 | 4685.445714 | 4865.127119 |
An inverse pricing pattern is emerging, with the best cut and best color diamonds priced the lowest. Average price increases across almost all the features going from best to worst, when one would expect to see the opposite.
df_diamonds.pivot_table('carat', index='cut', columns='color')
color | D | E | F | G | H | I | J |
---|---|---|---|---|---|---|---|
cut | |||||||
Ideal | 0.565766 | 0.578419 | 0.655612 | 0.700447 | 0.799329 | 0.912673 | 1.063564 |
Premium | 0.721248 | 0.717745 | 0.826957 | 0.841250 | 1.014087 | 1.144163 | 1.293094 |
Very Good | 0.696424 | 0.676547 | 0.740961 | 0.766799 | 0.915902 | 1.046952 | 1.133215 |
Good | 0.744517 | 0.745134 | 0.776075 | 0.847906 | 0.914729 | 1.057222 | 1.099544 |
Fair | 0.920123 | 0.859050 | 0.901452 | 1.024355 | 1.219175 | 1.198057 | 1.310085 |
The best cut (Ideal) and best color (D) diamonds have an average price lower than the worst cut diamonds. This is because the best cut and color diamonds average roughly half a carat (0.57), while the worst cut (Fair) and worst color (J) diamonds have the highest average carat at 1.31.
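One way to control for this size effect is to compare price per carat rather than raw price. A sketch (the ppc column is illustrative, not part of the original analysis):

# Price per carat strips out most of the size effect
ppc = df_diamonds.assign(ppc=df_diamonds['price'] / df_diamonds['carat'])
ppc.pivot_table('ppc', index='cut', columns='color')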
Tables of color and clarity¶
color_clarity = df_diamonds.pivot_table('price', index='color', columns='clarity')
heading_properties = [('font-size', '11px')]
cell_properties = [('font-size', '11px')]
dfstyle = [dict(selector="th", props=heading_properties),\
dict(selector="td", props=cell_properties)]
color_clarity.style.set_table_styles(dfstyle)
clarity | IF | VVS1 | VVS2 | VS1 | VS2 | SI1 | SI2 | I1 |
---|---|---|---|---|---|---|---|---|
color | ||||||||
D | 8307.369863 | 2897.163347 | 3351.128391 | 3030.158865 | 2587.225692 | 2976.146423 | 3931.101460 | 3863.023810 |
E | 3668.506329 | 2219.820122 | 2499.674067 | 2859.463224 | 2751.081037 | 3161.838005 | 4173.826036 | 3510.465347 |
F | 2750.836364 | 2804.276567 | 3475.512821 | 3796.717742 | 3756.795093 | 3708.651480 | 4476.996259 | 3342.181818 |
G | 2558.033774 | 2866.820821 | 3845.283437 | 4130.314392 | 4412.354096 | 3775.574468 | 5014.848544 | 3545.540541 |
H | 2287.869565 | 1845.658120 | 2649.067434 | 3775.576199 | 4713.943327 | 5027.035620 | 6084.141667 | 4461.403727 |
I | 1994.937063 | 2034.861972 | 2968.232877 | 4632.805411 | 5690.505560 | 5345.414909 | 7002.649123 | 4302.184783 |
J | 3363.882353 | 4034.175676 | 5142.396947 | 4884.461255 | 5311.789041 | 5186.048000 | 6520.958246 | 4993.571429 |
The best color and best clarity diamonds have an average price significantly higher than the rest of the table. Beyond this observation, we begin to see that average prices are among the highest for SI2 diamonds.
color_clarity_ct = df_diamonds.pivot_table('carat', index='color', columns='clarity')
heading_properties = [('font-size', '13px')]
cell_properties = [('font-size', '13px')]
dfstyle = [dict(selector="th", props=heading_properties),\
dict(selector="td", props=cell_properties)]
color_clarity_ct.style.set_table_styles(dfstyle)
The trend of increasing carat size from best to worst diamonds continues. The exception is the best clarity (IF) column: the average carat of the best color (D) diamonds there is only a fraction less than the largest average carat in the IF category.
Visualization¶
Barplots¶
Barplot of cut¶
%matplotlib inline
sns.set_theme(rc = {'figure.figsize':(10,6)})
ax = sns.barplot(x="cut",
y="price",
data=df_diamonds,
palette="Blues_d",
capsize=.2)
Barplot of color¶
%matplotlib inline
sns.set_theme(rc = {'figure.figsize':(10,6)})
ax = sns.barplot(x="color",
y="price",
data=df_diamonds,
palette="Blues_d",
capsize=.2)
Barplot of clarity¶
%matplotlib inline
sns.set_theme(rc = {'figure.figsize':(10,6)})
ax = sns.barplot(x="clarity",
y="price",
data=df_diamonds,
palette="Blues_d",
capsize=.2)
Pairplot¶
ax = sns.pairplot(df_diamonds,
hue= "cut",
palette = 'viridis')
Histograms¶
df_diamonds.hist(layout=(3,3), figsize=(15,10))
plt.show()
Histogram of carat¶
Taking a closer look at the distribution of carat.
import plotly.express as px
fig = px.histogram(df_diamonds,
x="carat",
marginal="violin",
color_discrete_sequence=['rgb(115, 185, 238)'],
hover_data=df_diamonds.columns)
fig.show()
Notice that the carat values are distributed in increments?
Normalizing carat
import plotly.express as px
fig = px.histogram(df_diamonds,
x="carat",
histnorm='probability',
color_discrete_sequence=['rgb(115, 185, 238)']
)
fig.show()
Most carat values fall between 0.3 and 1.1.
Boxplots¶
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.express as px
vars = ['carat', 'depth', 'table', 'price', 'x', 'y', 'z', 'volume']
fig = make_subplots(rows=1, cols=len(vars))
for i, var in enumerate(vars):
fig.add_trace(
go.Box(y=df_diamonds[var],
name=var),
row=1, col=i+1
)
fig.update_traces(showlegend=False)
Boxplot of cut¶
import plotly.express as px
fig = px.box(data_frame = df_diamonds,
x = 'cut',
y = 'price',
color='cut',
category_orders={"cut": ["Ideal", "Premium", "Very Good", "Good", "Fair"]},
color_discrete_sequence=px.colors.sequential.Blues_r)  # built-in blues palette
fig.update_layout(showlegend=False)
fig.show()
Boxplot of color¶
import plotly.express as px
fig = px.box(data_frame = df_diamonds,
x = 'color',
y = 'price',
color='color',
category_orders={"color": ["D", "E", "F", "G", "H", "I", "J"]},
color_discrete_sequence=px.colors.sequential.Blues_r)  # built-in blues palette
fig.update_layout(showlegend=False)
fig.show()
Boxplot of clarity¶
import plotly.express as px
fig = px.box(data_frame = df_diamonds,
x = 'clarity',
y = 'price',
color='clarity',
category_orders={"clarity": ["IF", "VVS1", "VVS2", "VS1", "VS2", "SI1", "SI2", "I1"]},
color_discrete_sequence=px.colors.sequential.Blues_r)  # built-in blues palette
fig.update_layout(showlegend=False)
fig.show()
It’s notable that VS1 and VS2 have nearly identical interquartile ranges for price, considering individual diamonds may be priced thousands of dollars apart.
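That observation can be checked directly by computing the price interquartile range per clarity grade (a quick sketch, not in the original notebook):

# Interquartile range of price for each clarity grade
grp = df_diamonds.groupby('clarity')['price']
print(grp.quantile(0.75) - grp.quantile(0.25))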
Ordinal Encoding¶
Creating a rank system for cut, color, and clarity.
# Cut rank
cut_two = pd.DataFrame(df_diamonds['cut'])
df_diamonds['cut_rk']= cut_two.replace({'cut':{'Ideal' : 1, 'Premium' : 2, 'Very Good' : 3, 'Good' : 4, 'Fair' : 5}})
# Color rank
color_two = pd.DataFrame(df_diamonds['color'])
df_diamonds['color_rk']= color_two.replace({'color':{'D' : 1, 'E' : 2, 'F' : 3, 'G' : 4, 'H' : 5, 'I' : 6, 'J' : 7}})
# Clarity rank
clarity_two = pd.DataFrame(df_diamonds['clarity'])
df_diamonds['clarity_rk']= clarity_two.replace({'clarity':{'IF' : 1, 'VVS1' : 2, 'VVS2' : 3, 'VS1' : 4, 'VS2' : 5, 'SI1' : 6, 'SI2' : 7, 'I1' : 8}})
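Since cut, color, and clarity were already converted to ordered Categoricals above, the same ranks can be read straight off the category codes; a minimal equivalent sketch:

# Equivalent ranks from the ordered Categoricals (codes are 0-based, hence +1)
df_diamonds['cut_rk'] = df_diamonds['cut'].cat.codes + 1
df_diamonds['color_rk'] = df_diamonds['color'].cat.codes + 1
df_diamonds['clarity_rk'] = df_diamonds['clarity'].cat.codes + 1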
Examining the Ranks of cut, color, and clarity¶
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.express as px
vars = ['cut_rk', 'color_rk', 'clarity_rk']
fig = make_subplots(rows=1, cols=len(vars))
for i, var in enumerate(vars):
fig.add_trace(
go.Box(y=df_diamonds[var],
name=var),
row=1, col=i+1
)
fig.update_layout()
fig.update_traces(showlegend=False)
Correlation Heatmap¶
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
df_corr = df_diamonds.select_dtypes('number').corr()  # numeric columns only
fig = go.Figure()
fig.add_trace(
go.Heatmap(
x = df_corr.columns,
y = df_corr.index,
z = np.array(df_corr),
colorscale='Viridis'
)
)
Models¶
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.simplefilter("ignore", ConvergenceWarning)
Slicing the data for the numeric columns and removing the highly correlated x, y, and z.
df = df_diamonds.drop(df_diamonds.columns[[1, 2, 3, 7, 8, 9]], axis=1)
df.head()
carat | depth | table | price | volume | cut_rk | color_rk | clarity_rk | |
---|---|---|---|---|---|---|---|---|
0 | 0.23 | 61.5 | 55.0 | 326 | 38.20 | 1 | 2 | 7 |
1 | 0.21 | 59.8 | 61.0 | 326 | 34.51 | 2 | 2 | 6 |
2 | 0.23 | 56.9 | 65.0 | 327 | 38.08 | 4 | 2 | 4 |
3 | 0.29 | 62.4 | 58.0 | 334 | 46.72 | 2 | 6 | 5 |
4 | 0.31 | 63.3 | 58.0 | 335 | 51.92 | 4 | 7 | 7 |
With price reaching as high as $18,823 and carat as low as 0.2, we will need to scale the features.
from sklearn.preprocessing import StandardScaler
standard_scaler = StandardScaler()
df_diamonds_scaled = pd.DataFrame(standard_scaler.fit_transform(df),
columns=df.columns)
df_scaled = df_diamonds_scaled.head()
heading_properties = [('font-size', '13px')]
cell_properties = [('font-size', '13px')]
dfstyle = [dict(selector="th", props=heading_properties),\
dict(selector="td", props=cell_properties)]
df_scaled.style.set_table_styles(dfstyle)
carat | depth | table | price | volume | cut_rk | color_rk | clarity_rk | |
---|---|---|---|---|---|---|---|---|
0 | -1.198817 | -0.175700 | -1.105557 | -0.904080 | -1.198845 | -0.981476 | -0.936934 | 1.245681 |
1 | -1.241062 | -1.373172 | 1.595652 | -0.904080 | -1.247144 | -0.085459 | -0.936934 | 0.638536 |
2 | -1.198817 | -3.415919 | 3.396457 | -0.903829 | -1.200416 | 1.706574 | -0.936934 | -0.575752 |
3 | -1.072082 | 0.458256 | 0.245047 | -0.902073 | -1.087325 | -0.085459 | 1.414295 | 0.031392 |
4 | -1.029837 | 1.092211 | 0.245047 | -0.901822 | -1.019261 | 1.706574 | 2.002102 | 1.245681 |
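One caveat: the scaler above is fit on the full dataset, so test-set statistics leak into the transform. To keep evaluation honest, one would normally fit the scaler on the training split only; a sketch under that assumption:

# Leakage-free scaling: fit on the training split, reuse its statistics on test
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=321)
scaler = StandardScaler().fit(train_df)   # statistics from training data only
train_scaled = scaler.transform(train_df)
test_scaled = scaler.transform(test_df)   # transformed with training statistics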
Linear Regression¶
from sklearn.linear_model import LinearRegression
X = df_diamonds[['carat']]
y = df_diamonds['price']
lr = LinearRegression()
lr.fit(X, y)
y_pred = lr.predict(X)
plt.plot(X, y, 'o', color = 'k', label='training data')
plt.plot(X, y_pred, color='#42a5f5ff', label='model prediction')
plt.xlabel('Carat')
plt.ylabel('Diamond Price ($)')
plt.legend();
from sklearn import metrics
# Using scaled features
X = df_diamonds_scaled[['carat']]
y = df_diamonds_scaled['price']
lr = LinearRegression()
lr.fit(X, y)
y_pred = lr.predict(X)
print("Mean absolute error (MAE):", metrics.mean_absolute_error(y, y_pred))
print("Mean squared error (MSE):", metrics.mean_squared_error(y, y_pred))
print("Root Mean squared error (RMSE):", np.sqrt(metrics.mean_squared_error(y, y_pred)))
print("R^2:", metrics.r2_score(y, y_pred))
Mean absolute error (MAE): 0.252631690580942
Mean squared error (MSE): 0.15029971977032106
Root Mean squared error (RMSE): 0.3876850780857074
R^2: 0.8497002802296789
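These scores are computed on the same data the model was fit to, so they describe in-sample fit rather than out-of-sample accuracy. A held-out evaluation of the same single-feature model might look like this (a sketch):

# Evaluate the single-feature model on a held-out split
from sklearn.model_selection import train_test_split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=321)
lr_holdout = LinearRegression().fit(X_tr, y_tr)
print("Held-out R^2:", metrics.r2_score(y_te, lr_holdout.predict(X_te)))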
Multiple Linear Regression¶
features = ['carat', 'depth','table', 'volume', 'cut_rk', 'color_rk', 'clarity_rk']
X = df_diamonds_scaled[features]
y = df_diamonds_scaled['price']
lr_many_features = LinearRegression()
lr_many_features.fit(X, y);
print(('prediction = ' +
'{} +\n'.format(lr_many_features.intercept_) +
' +\n'.join(['{} * {}'.format(n, f) for f, n in zip(features, lr_many_features.coef_)])))
prediction = -1.0455101613761209e-16 +
0.11935791009746087 * carat +
0.0036623212292048457 * depth +
-0.0006422984576671814 * table +
0.9229942229753505 * volume +
-0.03152197675314965 * cut_rk +
-0.1359159833047961 * color_rk +
-0.21507245231196492 * clarity_rk
print('Multiple features linear model R^2 on training data set: {}'.format(lr_many_features.score(X, y)))
Multiple features linear model R^2 on training data set: 0.9059466891034228
Random Forest¶
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=321)
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(random_state=1234)  # fixed integer seed for reproducibility
model = forest.fit(X_train, y_train)
y_pred = model.predict(X_test)
from sklearn.metrics import mean_squared_error
print("RMSE: {}".format(np.sqrt(mean_squared_error((y_test),(y_pred)))))
print("R2 : {}".format(np.sqrt(metrics.r2_score((y_test),(y_pred)))))
RMSE: 0.13225034160421884
R2 : 0.9911725406670384
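A note on the score above: metrics.r2_score already returns R², so wrapping it in np.sqrt yields |R|, the absolute correlation coefficient, which reads slightly higher than the true R². The distinction:

# r2_score returns the coefficient of determination R^2 directly
r2 = metrics.r2_score(y_test, y_pred)
print("R^2:", r2)                        # goodness of fit
print("sqrt(R^2) = |R|:", np.sqrt(r2))   # what the cell above labels "R2"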
import time
start_time = time.time()
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
elapsed_time = time.time() - start_time
print(f"Elapsed time to compute the importances: {elapsed_time:.3f} seconds")
Elapsed time to compute the importances: 0.470 seconds
plt.rcParams["figure.figsize"] = (8,5.5)
feature_names = [f"feature {i}" for i in range(X.shape[1])]
forest_importances = pd.Series(importances, index=feature_names)
fig, ax = plt.subplots()
forest_importances.plot.bar(yerr=std, ax=ax)
ax.set_title("Feature importances using MDI")
ax.set_ylabel("Mean decrease in impurity")
fig.tight_layout()
The volume feature appears to have the highest importance in the Random Forest model.
from sklearn.inspection import permutation_importance
start_time = time.time()
result = permutation_importance(
forest, X_test, y_test, n_repeats=10, random_state=42, n_jobs=2
)
elapsed_time = time.time() - start_time
print(f"Elapsed time to compute the importances: {elapsed_time:.3f} seconds")
forest_importances = pd.Series(result.importances_mean, index=feature_names)
Elapsed time to compute the importances: 45.103 seconds
fig, ax = plt.subplots()
forest_importances.plot.bar(yerr=result.importances_std, ax=ax)
ax.set_title("Feature importances using permutation on full model")
ax.set_ylabel("Mean accuracy decrease")
fig.tight_layout()
plt.show()
The volume feature remains the most important under permutation feature importance which, unlike impurity-based importance, is not biased toward high-cardinality features and can be computed on a held-out test set. This corroborates the importance of volume while avoiding the limitations of the impurity-based measure.
Hyperparameter Tuning Random Forest¶
n_estimators = [int(x) for x in np.linspace(10,200,10)]
max_depth = [int(x) for x in np.linspace(10,100,10)]
min_samples_split = [2,3,4,5,10]
min_samples_leaf = [1,2,4,10,15,20]
random_grid = {'n_estimators':n_estimators,'max_depth':max_depth,
'min_samples_split':min_samples_split,'min_samples_leaf':min_samples_leaf}
random_grid
{'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
'min_samples_leaf': [1, 2, 4, 10, 15, 20],
'min_samples_split': [2, 3, 4, 5, 10],
'n_estimators': [10, 31, 52, 73, 94, 115, 136, 157, 178, 200]}
from sklearn.model_selection import RandomizedSearchCV
rf = RandomForestRegressor(random_state=1234)  # fixed integer seed for reproducibility
rf_random = RandomizedSearchCV(estimator=rf,
param_distributions=random_grid,
cv = 3)
rf_random.fit(X_train,y_train)
y_pred = rf_random.predict(X_test)
print("RMSE: {}".format(np.sqrt(mean_squared_error((y_test),(y_pred)))))
print("R2 : {}".format(np.sqrt(metrics.r2_score((y_test),(y_pred)))))
RMSE: 0.13627217704239236
R2 : 0.9909164510584496
rf_random.best_params_
{'max_depth': 30,
'min_samples_leaf': 2,
'min_samples_split': 3,
'n_estimators': 94}
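As an aside, rather than retyping the best parameters by hand, RandomizedSearchCV (with its default refit=True) already holds a model refit on the full training data; a minimal sketch:

# The fitted search object exposes the refit best model directly
best_rf = rf_random.best_estimator_
y_pred_best = best_rf.predict(X_test)

The next cell re-specifies the parameters explicitly, which makes the final model easier to reproduce in isolation.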
rf = RandomForestRegressor(max_depth = 30,
min_samples_leaf = 2,
min_samples_split = 3,
n_estimators = 94,
random_state = 1234)  # fixed integer seed for reproducibility
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)
print("RMSE: {}".format(np.sqrt(mean_squared_error((y_test),(y_pred)))))
print("R2 : {}".format(np.sqrt(metrics.r2_score((y_test),(y_pred)))))
RMSE: 0.1371532785575916
R2 : 0.9907980602434069
start_time = time.time()
importances = rf.feature_importances_
std = np.std([tree.feature_importances_ for tree in rf.estimators_], axis=0)  # trees of the tuned model
elapsed_time = time.time() - start_time
print(f"Elapsed time to compute the importances: {elapsed_time:.3f} seconds")
Elapsed time to compute the importances: 0.269 seconds
feature_names = [f"feature {i}" for i in range(X.shape[1])]
forest_importances = pd.Series(importances, index=feature_names)
fig, ax = plt.subplots()
forest_importances.plot.bar(yerr=std, ax=ax)
ax.set_title("Feature importances using MDI")
ax.set_ylabel("Mean decrease in impurity")
fig.tight_layout()
start_time = time.time()
result = permutation_importance(
rf, X_test, y_test, n_repeats=10, random_state=42, n_jobs=2
)
elapsed_time = time.time() - start_time
print(f"Elapsed time to compute the importances: {elapsed_time:.3f} seconds")
forest_importances = pd.Series(result.importances_mean, index=feature_names)
Elapsed time to compute the importances: 23.421 seconds
fig, ax = plt.subplots()
forest_importances.plot.bar(yerr=result.importances_std, ax=ax)
ax.set_title("Feature importances using permutation on full model")
ax.set_ylabel("Mean accuracy decrease")
fig.tight_layout()
plt.show()
AutoML using H2O¶
import h2o
h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
Java Version: openjdk version "11.0.15" 2022-04-19; OpenJDK Runtime Environment (build 11.0.15+10-Ubuntu-0ubuntu0.18.04.1); OpenJDK 64-Bit Server VM (build 11.0.15+10-Ubuntu-0ubuntu0.18.04.1, mixed mode)
Starting server from /usr/local/lib/python3.7/dist-packages/h2o/backend/bin/h2o.jar
Ice root: /tmp/tmpcavk4j6w
JVM stdout: /tmp/tmpcavk4j6w/h2o_unknownUser_started_from_python.out
JVM stderr: /tmp/tmpcavk4j6w/h2o_unknownUser_started_from_python.err
Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.
H2O_cluster_uptime: | 03 secs |
H2O_cluster_timezone: | Etc/UTC |
H2O_data_parsing_timezone: | UTC |
H2O_cluster_version: | 3.36.1.2 |
H2O_cluster_version_age: | 1 day |
H2O_cluster_name: | H2O_from_python_unknownUser_vyy7c7 |
H2O_cluster_total_nodes: | 1 |
H2O_cluster_free_memory: | 3.172 Gb |
H2O_cluster_total_cores: | 2 |
H2O_cluster_allowed_cores: | 2 |
H2O_cluster_status: | locked, healthy |
H2O_connection_url: | http://127.0.0.1:54321 |
H2O_connection_proxy: | {"http": null, "https": null} |
H2O_internal_security: | False |
Python_version: | 3.7.13 final |
diamonds = h2o.import_file("/content/diamonds_new.csv")
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
diamonds.describe()
Rows:53902
Cols:14
carat | cut | color | clarity | depth | table | price | x | y | z | volume | cut_rk | color_rk | clarity_rk | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
type | real | enum | enum | enum | real | real | int | real | real | real | real | int | int | int |
mins | 0.2 | 50.8 | 49.0 | 326.0 | 3.73 | 3.68 | 2.06 | 31.71 | 1.0 | 1.0 | 1.0 | |||
mean | 0.7975546361916077 | 61.74943415828728 | 57.45569366628322 | 3930.426793068903 | 5.731397907313265 | 5.733238655337458 | 3.539386850209639 | 129.79038885384568 | 2.0953767949241198 | 3.593948276501803 | 4.948295053986866 | |||
maxs | 4.5 | 73.6 | 73.0 | 18823.0 | 10.23 | 10.16 | 6.72 | 698.46 | 5.0 | 7.0 | 8.0 | |||
sigma | 0.47343323947185206 | 1.4196703364355991 | 2.221248682189408 | 3986.8836775360587 | 1.1192018720191774 | 1.111083096761398 | 0.691292416429462 | 76.39956459996696 | 1.1160610424871082 | 1.7012537757016888 | 1.6470707425751165 | |||
zeros | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |||
missing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326.0 | 3.95 | 3.98 | 2.43 | 38.2 | 1.0 | 2.0 | 7.0 |
1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326.0 | 3.89 | 3.84 | 2.31 | 34.51 | 2.0 | 2.0 | 6.0 |
2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327.0 | 4.05 | 4.07 | 2.31 | 38.08 | 4.0 | 2.0 | 4.0 |
3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334.0 | 4.2 | 4.23 | 2.63 | 46.72 | 2.0 | 6.0 | 5.0 |
4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335.0 | 4.34 | 4.35 | 2.75 | 51.92 | 4.0 | 7.0 | 7.0 |
5 | 0.24 | Very Good | J | VVS2 | 62.8 | 57.0 | 336.0 | 3.94 | 3.96 | 2.48 | 38.69 | 3.0 | 7.0 | 3.0 |
6 | 0.24 | Very Good | I | VVS1 | 62.3 | 57.0 | 336.0 | 3.95 | 3.98 | 2.47 | 38.83 | 3.0 | 6.0 | 2.0 |
7 | 0.26 | Very Good | H | SI1 | 61.9 | 55.0 | 337.0 | 4.07 | 4.11 | 2.53 | 42.32 | 3.0 | 5.0 | 6.0 |
8 | 0.22 | Fair | E | VS2 | 65.1 | 61.0 | 337.0 | 3.87 | 3.78 | 2.49 | 36.43 | 5.0 | 2.0 | 5.0 |
9 | 0.23 | Very Good | H | VS1 | 59.4 | 61.0 | 338.0 | 4.0 | 4.05 | 2.39 | 38.72 | 3.0 | 5.0 | 4.0 |
diamonds = diamonds[:, ["carat", "depth", "table", "price", "volume", "cut_rk", "color_rk", "clarity_rk"]]
print(diamonds)
carat | depth | table | price | volume | cut_rk | color_rk | clarity_rk |
---|---|---|---|---|---|---|---|
0.23 | 61.5 | 55 | 326 | 38.2 | 1 | 2 | 7 |
0.21 | 59.8 | 61 | 326 | 34.51 | 2 | 2 | 6 |
0.23 | 56.9 | 65 | 327 | 38.08 | 4 | 2 | 4 |
0.29 | 62.4 | 58 | 334 | 46.72 | 2 | 6 | 5 |
0.31 | 63.3 | 58 | 335 | 51.92 | 4 | 7 | 7 |
0.24 | 62.8 | 57 | 336 | 38.69 | 3 | 7 | 3 |
0.24 | 62.3 | 57 | 336 | 38.83 | 3 | 6 | 2 |
0.26 | 61.9 | 55 | 337 | 42.32 | 3 | 5 | 6 |
0.22 | 65.1 | 61 | 337 | 36.43 | 5 | 2 | 5 |
0.23 | 59.4 | 61 | 338 | 38.72 | 3 | 5 | 4 |
GBM Model¶
from h2o.estimators.gbm import H2OGradientBoostingEstimator
# set the predictor names and the response column name
predictors = ["carat", "depth", "table", "volume", "cut_rk", "color_rk", "clarity_rk"]
response = "price"
# split into train and validation sets
train, valid = diamonds.split_frame(ratios = [.8], seed = 1234)
# train a GBM model
diamonds_gbm = H2OGradientBoostingEstimator(distribution = "poisson", seed = 1234)
diamonds_gbm.train(x = predictors,
y = response,
training_frame = train,
validation_frame = valid)
# retrieve the model performance
perf = diamonds_gbm.model_performance(valid)
perf
gbm Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
ModelMetricsRegression: gbm
** Reported on test data. **
MSE: 297175.9346014002
RMSE: 545.1384545245365
MAE: 298.17434419493014
RMSLE: 0.12287040514285516
Mean Residual Deviance: -59961.81794308857
print('R^2:', diamonds_gbm.r2())
print('R^2 on validation data:', diamonds_gbm.r2(valid=True))
R^2: 0.9814291623069304
R^2 on validation data: 0.9812236569623808
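The distribution = "poisson" setting treats price like count data; for a continuous, right-skewed target such as price, gaussian or gamma are the more conventional H2O choices. A sketch swapping the distribution, under the same train/valid split:

# Sketch: the same GBM with a gamma distribution, often a better fit for skewed prices
diamonds_gbm_gamma = H2OGradientBoostingEstimator(distribution = "gamma", seed = 1234)
diamonds_gbm_gamma.train(x = predictors, y = response,
                         training_frame = train, validation_frame = valid)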
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
import matplotlib.cbook
warnings.filterwarnings("ignore", category = matplotlib.cbook.mplDeprecation)
diamonds_gbm.varimp_plot();
<Figure size 432x288 with 0 Axes>
AutoML Search¶
from h2o.automl import H2OAutoML
y = "price"
splits = diamonds.split_frame(ratios = [0.8], seed = 1)
train = splits[0]
test = splits[1]
aml = H2OAutoML(max_runtime_secs = 60, seed = 1, project_name = "diamonds_lb_frame")
aml.train(y = y, training_frame = train, leaderboard_frame = test)
AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%
Model Details
=============
H2OStackedEnsembleEstimator : Stacked Ensemble
Model Key: StackedEnsemble_BestOfFamily_2_AutoML_1_20220527_213545
No model summary for this model
ModelMetricsRegressionGLM: stackedensemble
** Reported on train data. **
MSE: 163376.48303073455
RMSE: 404.198568813318
MAE: 221.74431915028975
RMSLE: 0.08811328250901963
R^2: 0.9896274225811342
Mean Residual Deviance: 163376.48303073455
Null degrees of freedom: 10063
Residual degrees of freedom: 10060
Null deviance: 158518164125.60925
Residual deviance: 1644220925.2213125
AIC: 149376.76408165455
ModelMetricsRegressionGLM: stackedensemble
** Reported on validation data. **
MSE: 293111.1409579319
RMSE: 541.3973965193514
MAE: 278.7109268472651
RMSLE: 0.10113636742699597
R^2: 0.9823457864836054
Mean Residual Deviance: 293111.1409579319
Null degrees of freedom: 4424
Residual degrees of freedom: 4421
Null deviance: 73479405051.87427
Residual deviance: 1297016798.7388484
AIC: 68270.86509898382
aml2 = H2OAutoML(max_runtime_secs = 60, seed = 1, project_name = "diamonds_full_data")
aml2.train(y = y, training_frame = diamonds)
AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%
Model Details
=============
H2OStackedEnsembleEstimator : Stacked Ensemble
Model Key: StackedEnsemble_BestOfFamily_3_AutoML_2_20220527_213720
No model summary for this model
ModelMetricsRegressionGLM: stackedensemble
** Reported on train data. **
MSE: 143032.15302056298
RMSE: 378.1959188311833
MAE: 207.24916737663557
RMSLE: 0.08235619094213531
R^2: 0.9910408992276571
Mean Residual Deviance: 143032.15302056298
Null degrees of freedom: 10060
Residual degrees of freedom: 10057
Null deviance: 160625531552.78574
Residual deviance: 1439046491.539884
AIC: 147994.2487763461
ModelMetricsRegressionGLM: stackedensemble
** Reported on validation data. **
MSE: 283749.631509217
RMSE: 532.6815479338636
MAE: 272.7611077737793
RMSLE: 0.09844370443884244
R^2: 0.9828274276559079
Mean Residual Deviance: 283749.631509217
Null degrees of freedom: 5459
Residual degrees of freedom: 5456
Null deviance: 90264416715.4349
Residual deviance: 1549272988.040325
AIC: 84059.73640085015
aml.leaderboard.head()
model_id | rmse | mse | mae | rmsle | mean_residual_deviance |
---|---|---|---|---|---|
StackedEnsemble_BestOfFamily_2_AutoML_1_20220527_213545 | 532.634 | 283699 | 275.046 | 0.100773 | 283699 |
StackedEnsemble_AllModels_1_AutoML_1_20220527_213545 | 533.505 | 284627 | 273.456 | 0.0994565 | 284627 |
GBM_2_AutoML_1_20220527_213545 | 537.035 | 288407 | 280.385 | 0.103165 | 288407 |
GBM_3_AutoML_1_20220527_213545 | 539.86 | 291449 | 277.797 | 0.0996754 | 291449 |
StackedEnsemble_BestOfFamily_1_AutoML_1_20220527_213545 | 547.002 | 299211 | 278.922 | 0.104543 | 299211 |
GBM_4_AutoML_1_20220527_213545 | 551.679 | 304349 | 292.328 | 0.119531 | 304349 |
GBM_1_AutoML_1_20220527_213545 | 556.845 | 310076 | 283.87 | 0.117353 | 310076 |
XGBoost_1_AutoML_1_20220527_213545 | 567.898 | 322508 | 293.092 | 0.106398 | 322508 |
DRF_1_AutoML_1_20220527_213545 | 590.945 | 349216 | 303.673 | 0.112203 | 349216 |
XGBoost_2_AutoML_1_20220527_213545 | 618.187 | 382155 | 345.353 | nan | 382155 |
best_model_aml = h2o.get_model(aml.leaderboard[2,'model_id'])
best_model_aml.varimp_plot();
<Figure size 432x288 with 0 Axes>
print('GBM_2_AutoML_1 R^2:', best_model_aml.r2())
print('GBM_2_AutoML_1 R^2 on validation data:', best_model_aml.r2(valid=True))
GBM_2_AutoML_1 R^2: 0.9868201213956783
GBM_2_AutoML_1 R^2 on validation data: 0.9822494320805603
aml2.leaderboard.head()
model_id | rmse | mse | mae | rmsle | mean_residual_deviance |
---|---|---|---|---|---|
StackedEnsemble_BestOfFamily_3_AutoML_2_20220527_213720 | 532.682 | 283750 | 272.761 | 0.0984437 | 283750 |
StackedEnsemble_BestOfFamily_2_AutoML_2_20220527_213720 | 532.682 | 283750 | 272.761 | 0.0984437 | 283750 |
StackedEnsemble_AllModels_2_AutoML_2_20220527_213720 | 532.766 | 283839 | 270.812 | 0.096823 | 283839 |
StackedEnsemble_AllModels_1_AutoML_2_20220527_213720 | 532.766 | 283839 | 270.812 | 0.096823 | 283839 |
GBM_2_AutoML_2_20220527_213720 | 537.9 | 289337 | 278.588 | 0.101383 | 289337 |
StackedEnsemble_BestOfFamily_1_AutoML_2_20220527_213720 | 542.102 | 293874 | 276.413 | 0.0994625 | 293874 |
GBM_3_AutoML_2_20220527_213720 | 542.484 | 294289 | 276.929 | 0.0984264 | 294289 |
GBM_1_AutoML_2_20220527_213720 | 542.522 | 294330 | 279.228 | 0.104282 | 294330 |
GBM_4_AutoML_2_20220527_213720 | 555.048 | 308078 | 290.994 | 0.114923 | 308078 |
XGBoost_2_AutoML_2_20220527_213720 | 561.576 | 315368 | 287.206 | 0.101077 | 315368 |
best_model_aml2 = h2o.get_model(aml2.leaderboard[9,'model_id'])
best_model_aml2.varimp_plot();
<Figure size 432x288 with 0 Axes>
print('XGBoost_2_AutoML_2 R^2:', best_model_aml2.r2())
print('XGBoost_2_AutoML_2 R^2 on validation data:', best_model_aml2.r2(valid=True))
XGBoost_2_AutoML_2 R^2: 0.9909054467211048
XGBoost_2_AutoML_2 R^2 on validation data: 0.98091389302047
h2o.cluster().shutdown()
H2O session _sid_b971 closed.
Model Results¶
Model | r2
---|---
Linear_Regression | 84.97%
Multiple_Linear_Regression | 90.59%
XGBoost_2_AutoML_2 | 98.09%
GBM_Estimator | 98.12%
StackedEnsemble_BestOfFamily_2_AutoML_1 | 98.23%
GBM_2_AutoML_1 | 98.22%
StackedEnsemble_BestOfFamily_3_AutoML_2 | 98.28%
Random_Forest | 99.07%
(The Random_Forest figure is √R² = |R|; the corresponding R² is roughly 98.2%.)
Conclusion¶
An analysis was performed using the classic Diamonds dataset, with the objective of determining the relationship of the 4 C’s to price, identifying any patterns, and helping brokers price diamonds more accurately.
Through exploration and visualization of the data, we observed a generalized pattern of inverse pricing accompanied by non-normal distributions.
The clearest indication is that the combination of best color and best clarity diamonds is priced significantly higher, while otherwise the worst cut, color, and clarity diamonds carry the highest prices.
Of the 4 C’s, carat’s coefficient in the multiple linear regression and its standing in the variable importances compared favorably against the other three C’s.
After scaling the features, the baseline linear regression captured a modest R² of 84.97%, while the Random Forest model scored the highest at 99.07% (a √R² figure; about 98.2% as R²).
The final deliverable is a Tableau dashboard to assist brokers with visualizations and a potential pricing mechanism for the Diamonds dataset. The dashboard can be viewed here