Diamonds Dataset¶
This is the classic Diamonds dataset.
Data Description¶
This dataset contains the prices and other attributes of almost 54,000 diamonds. The variables are as follows:
Format
A data frame with 53,940 rows and 10 variables:
carat
: Weight of the diamond (0.2–5.01).
cut
: Quality of the cut (Fair, Good, Very Good, Premium, Ideal).
color
: Diamond color, from D (best) to J (worst).
clarity
: A measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)).
depth
: Total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79).
table
: Width of top of diamond relative to widest point (43–95).
price
: Price in US dollars ($326–$18,823).
x
: Length in mm (0–10.74).
y
: Width in mm (0–58.9).
z
: Depth in mm (0–31.8).
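The total depth percentage formula can be checked against the recorded `depth` column. The sketch below is only an illustration: it copies the first five rows of the dataset by hand so that it runs offline, and the helper name `depth_percentage` is hypothetical. The factor of 100 is an addition here; it turns the ratio 2 * z / (x + y) into a percentage.

```python
import pandas as pd

# First five rows of the dataset, copied by hand so the example runs offline.
sample = pd.DataFrame({
    "depth": [61.5, 59.8, 56.9, 62.4, 63.3],
    "x": [3.95, 3.89, 4.05, 4.20, 4.34],
    "y": [3.98, 3.84, 4.07, 4.23, 4.35],
    "z": [2.43, 2.31, 2.31, 2.63, 2.75],
})


def depth_percentage(df: pd.DataFrame) -> pd.Series:
    """Return 100 * 2 * z / (x + y), i.e. z relative to the mean of x and y."""
    return 100 * 2 * df["z"] / (df["x"] + df["y"])


# Recompute depth from the dimensions and compare with the recorded value.
sample["depth_calc"] = depth_percentage(sample).round(1)
print(sample[["depth", "depth_calc"]])
```

On these rows the recomputed value stays within a few tenths of a point of the recorded `depth`. Because the documented ranges for `x`, `y`, and `z` start at 0, rows with zero dimensions would yield NaN or infinite values and probably need separate handling.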
Original Diamonds Dataset¶
The Diamonds dataset was originally distributed with R, but it is now widely used in Python as well. Seaborn provides it as one of its built-in example datasets.
Loading the Data¶
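import pandas as pd
import seaborn as sns

# List the example datasets seaborn can load, then load diamonds
sns.get_dataset_names()
df_diamonds = sns.load_dataset('diamonds')

or

# Read the CSV directly from the project repository
diamonds = "https://raw.githubusercontent.com/kyle-w-brown/diamonds-prediction/main/data/diamonds.csv"
df_diamonds = pd.read_csv(diamonds)
df_diamonds.head()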
| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
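After loading, it can be useful to confirm that a frame matches the data description. The following is a minimal sketch rather than part of the original project: `check_diamonds`, `EXPECTED_LEVELS`, and `NUMERIC_COLUMNS` are hypothetical names, and the two sample rows are copied by hand so the example runs without network access.

```python
import pandas as pd

# Category levels listed in the data description.
EXPECTED_LEVELS = {
    "cut": {"Fair", "Good", "Very Good", "Premium", "Ideal"},
    "color": set("DEFGHIJ"),
    "clarity": {"I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"},
}
NUMERIC_COLUMNS = ["carat", "depth", "table", "price", "x", "y", "z"]


def check_diamonds(df: pd.DataFrame) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    # Every documented column should be present.
    for col in list(EXPECTED_LEVELS) + NUMERIC_COLUMNS:
        if col not in df.columns:
            problems.append(f"missing column: {col}")
    # Categorical columns should only contain documented levels.
    for col, levels in EXPECTED_LEVELS.items():
        if col in df.columns:
            unexpected = set(df[col].dropna().unique()) - levels
            if unexpected:
                problems.append(f"unexpected {col} values: {sorted(unexpected)}")
    # Measurement and price columns should have a numeric dtype.
    for col in NUMERIC_COLUMNS:
        if col in df.columns and not pd.api.types.is_numeric_dtype(df[col]):
            problems.append(f"{col} is not numeric")
    return problems


# Two rows copied from the dataset as a stand-in for the loaded frame.
sample = pd.DataFrame({
    "carat": [0.23, 0.21],
    "cut": ["Ideal", "Premium"],
    "color": ["E", "E"],
    "clarity": ["SI2", "SI1"],
    "depth": [61.5, 59.8],
    "table": [55.0, 61.0],
    "price": [326, 326],
    "x": [3.95, 3.89],
    "y": [3.98, 3.84],
    "z": [2.43, 2.31],
})
print(check_diamonds(sample))
```

With the real data, the same call would be `check_diamonds(df_diamonds)`. The check covers column names, category levels, and dtypes only; it does not test the documented numeric ranges.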
Updated Data¶
The columns volume, cut_rk, color_rk, and clarity_rk were added as model features. The three rank columns are ordinal encodings of the corresponding categorical columns.
volume
: The amount of space the diamond occupies (x × y × z).
cut_rk
: Ordinal encoding of cut, ranked 1–5 from best (Ideal) to worst (Fair).
color_rk
: Ordinal encoding of color, ranked 1–7 from best (D) to worst (J).
clarity_rk
: Ordinal encoding of clarity, ranked 1–8 from best (IF) to worst (I1).
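The derived columns can be reproduced from the original ones. The sketch below is one possible way to do it, not necessarily how the updated file was built: the helper `rank_map` and the `*_ORDER` lists are hypothetical names, the category orders are chosen so that the ranks match the values shown in the updated data (for example, SI2 maps to 7), and rounding `volume` to three decimals is inferred from those displayed values. The five rows are copied by hand so the example runs offline.

```python
import pandas as pd

# First five rows of the original data, copied by hand.
df = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Good", "Premium", "Good"],
    "color": ["E", "E", "E", "I", "J"],
    "clarity": ["SI2", "SI1", "VS1", "VS2", "SI2"],
    "x": [3.95, 3.89, 4.05, 4.20, 4.34],
    "y": [3.98, 3.84, 4.07, 4.23, 4.35],
    "z": [2.43, 2.31, 2.31, 2.63, 2.75],
})

# Category orders, best first.
CUT_ORDER = ["Ideal", "Premium", "Very Good", "Good", "Fair"]
COLOR_ORDER = list("DEFGHIJ")
CLARITY_ORDER = ["IF", "VVS1", "VVS2", "VS1", "VS2", "SI1", "SI2", "I1"]


def rank_map(order):
    """Map each category to its 1-based position in `order`."""
    return {label: rank for rank, label in enumerate(order, start=1)}


# volume is the product of the three dimensions in mm.
df["volume"] = (df["x"] * df["y"] * df["z"]).round(3)
# Each rank column is an ordinal encoding of its categorical column.
df["cut_rk"] = df["cut"].map(rank_map(CUT_ORDER))
df["color_rk"] = df["color"].map(rank_map(COLOR_ORDER))
df["clarity_rk"] = df["clarity"].map(rank_map(CLARITY_ORDER))

print(df[["volume", "cut_rk", "color_rk", "clarity_rk"]])
```

A plain dictionary lookup through `Series.map` keeps the encoding explicit; any label missing from the order lists would come out as NaN, which makes unexpected categories easy to spot.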
Loading Updated Data¶
import pandas as pd

# Same data with the four derived columns added
diamonds = "https://raw.githubusercontent.com/kyle-w-brown/diamonds-prediction/main/data/diamonds-new.csv"
df_diamonds = pd.read_csv(diamonds)
df_diamonds.head()
| | carat | cut | color | clarity | depth | table | price | x | y | z | volume | cut_rk | color_rk | clarity_rk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 | 38.202 | 1 | 2 | 7 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 | 34.506 | 2 | 2 | 6 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 | 38.077 | 4 | 2 | 4 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 | 46.725 | 2 | 6 | 5 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 | 51.917 | 4 | 7 | 7 |