
Commit 2bd8192

Merge branch 'master' into test_python3.11
2 parents e22d678 + 69695e1 commit 2bd8192

21 files changed: +2965 -218 lines changed

README.md

Lines changed: 1 addition & 1 deletion
@@ -283,4 +283,4 @@ With the introduction of projects like Numba, Python gained new ways to provide
 #### Citation
 Cite our code:

-`makepath/xarray-spatial, https://github.com/makepath/xarray-spatial, ©2020-2022.`
+`makepath/xarray-spatial, https://github.com/makepath/xarray-spatial, ©2020-2023.`
Lines changed: 311 additions & 0 deletions
@@ -0,0 +1,311 @@
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "housing_price_feature_engineering",
      "provenance": [],
      "collapsed_sections": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# Housing price prediction feature engineering\n",
        "\n",
        "In general, the price of a house is determined by many factors, and location always plays a paramount role in the value of the property. In this notebook, we will explore how geo-related aspects affect housing prices in the USA. We will describe each house by the information provided in ***an existing dataset***, plus some additional spatial attributes extracted from its location using xarray-spatial ***(and probably an elevation dataset, and census-parquet as well?)***.\n",
        "\n",
        "Existing features:\n",
        "- ...\n",
        "\n",
        "New features:\n",
        "- ***Slope?*** (from an elevation dataset)\n",
        "- ***Population density, ...?*** (from Census data if none is available in the existing features)\n",
        "- Distance to the nearest hospital (or grocery store / university / pharmacy)\n",
        "\n",
        "We'll first build a machine learning model and train it on all existing features. For each newly added feature, we'll retrain the model and compare the results to find out which features help enrich it."
      ],
      "metadata": {
        "id": "WEWW_CD4owNL"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Imports\n",
        "\n",
        "First, let's import all necessary libraries."
      ],
      "metadata": {
        "id": "LWdCWgs31tjo"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "VoNuwihJovVG"
      },
      "outputs": [],
      "source": [
        "import numpy as np\n",
        "import pandas as pd\n",
        "import rasterio\n",
        "\n",
        "import datashader as ds\n",
        "from datashader.transfer_functions import shade\n",
        "from datashader.transfer_functions import stack\n",
        "from datashader.transfer_functions import dynspread\n",
        "from datashader.transfer_functions import set_background\n",
        "from datashader.colors import Elevation\n",
        "\n",
        "from xrspatial import slope\n",
        "from xrspatial import proximity"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Load the existing dataset"
      ],
      "metadata": {
        "id": "eHISReTx1zmb"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# assume the data contains lat/lon coordinates with some additional values\n",
        "df = pd.DataFrame({\n",
        "    'id': [0, 1, 2, 3, 4],\n",
        "    'x': [0, 1, 2, 0, 4],\n",
        "    'y': [2, 0, 1, 3, 1],\n",
        "    'column_1': [2, 3, 4, 2, 6],\n",
        "    'price': [1, 3, 4, 3, 7]\n",
        "})"
      ],
      "metadata": {
        "id": "o492hk5B5_mM"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Build and train a housing price model\n",
        "\n",
        "We'll split the data into a train set and a test set."
      ],
      "metadata": {
        "id": "NoXJhz-7153G"
      }
    },
    {
      "cell_type": "code",
      "source": [
        ""
      ],
      "metadata": {
        "id": "VrvvjQ0v6FME"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Now let's build the model to predict housing prices."
      ],
      "metadata": {
        "id": "mIm7gBOf9Mi-"
      }
    },
    {
      "cell_type": "code",
      "source": [
        ""
      ],
      "metadata": {
        "id": "rSEWxAWh9Q7j"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "After tuning the hyperparameters, we select the best model as below."
      ],
      "metadata": {
        "id": "t7KiSyr29ROE"
      }
    },
    {
      "cell_type": "code",
      "source": [
        ""
      ],
      "metadata": {
        "id": "hmLkxoO-963c"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Prediction accuracy on the test set."
      ],
      "metadata": {
        "id": "AL8mMHzI9-Hz"
      }
    },
    {
      "cell_type": "code",
      "source": [
        ""
      ],
      "metadata": {
        "id": "m4Pg2hXv99tf"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Calculate spatial attributes\n",
        "\n",
        "As stated above, we'll calculate spatial attributes (slope, ...?) of each location and its proximity to the nearest services.\n",
        "\n",
        "**TBD**: What is the format of the additional data? Is it vector or raster?\n",
        "- If raster (preferred), load it directly as 2D xarray DataArrays.\n",
        "- If vector, load it into a pandas/geopandas DataFrame and rasterize it with Datashader.\n",
        "\n",
        "Assume that the data is in vector format."
      ],
      "metadata": {
        "id": "18o3Swi-6Fjg"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# bounding box of the raster\n",
        "xmin, xmax, ymin, ymax = (\n",
        "    df.x.min(),\n",
        "    df.x.max(),\n",
        "    df.y.min(),\n",
        "    df.y.max()\n",
        ")\n",
        "xrange = (xmin, xmax)\n",
        "yrange = (ymin, ymax)\n",
        "\n",
        "# width and height of the raster image\n",
        "W, H = 800, 600\n",
        "\n",
        "# canvas object to rasterize the houses\n",
        "cvs = ds.Canvas(plot_width=W, plot_height=H, x_range=xrange, y_range=yrange)\n",
        "raster = cvs.points(df, x='x', y='y', agg=ds.min('id'))\n",
        "\n",
        "# visualize the raster\n",
        "points_shaded = dynspread(shade(raster, cmap='salmon', min_alpha=0, span=(0, 1), how='linear'), threshold=1, max_px=5)\n",
        "set_background(points_shaded, 'black')"
      ],
      "metadata": {
        "id": "mtQu5wSd6V_Y"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Identify the location of each house in pixel space."
      ],
      "metadata": {
        "id": "6k9E72q9Ge_d"
      }
    },
    {
      "cell_type": "code",
      "source": [
        ""
      ],
      "metadata": {
        "id": "EW3-QIoQLRN_"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Calculate the new feature values."
      ],
      "metadata": {
        "id": "kmkbu2e2LrQo"
      }
    },
    {
      "cell_type": "code",
      "source": [
        ""
      ],
      "metadata": {
        "id": "NJ6Djtu1Lqsv"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Retrain the model with the new feature and compute the test accuracy."
      ],
      "metadata": {
        "id": "b15dhXjXLxMj"
      }
    },
    {
      "cell_type": "code",
      "source": [
        ""
      ],
      "metadata": {
        "id": "bVjatg3KLvvz"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Feature selection"
      ],
      "metadata": {
        "id": "YEFwSg5Q6W1Z"
      }
    },
    {
      "cell_type": "code",
      "source": [
        ""
      ],
      "metadata": {
        "id": "KQB3kmlg6Zh4"
      },
      "execution_count": null,
      "outputs": []
    }
  ]
}
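The notebook above leaves the feature-engineering cells empty for now. Below is a minimal sketch of how the slope and distance-to-hospital features could be computed with xarray-spatial and sampled back onto the housing table. The elevation and hospitals rasters here are hypothetical stand-ins (not part of the commit), and the `proximity` call assumes its default behaviour of treating non-zero cells as targets.

```python
import numpy as np
import pandas as pd
import xarray as xr
from xrspatial import slope, proximity

# --- hypothetical stand-ins, for illustration only ---
# a small synthetic elevation raster and a hospitals raster sharing one grid
ys_coord = np.linspace(0.0, 3.0, 60)
xs_coord = np.linspace(0.0, 4.0, 80)
elevation = np.random.rand(60, 80) * 100.0
hospitals = np.zeros((60, 80))
hospitals[10, 20] = 1  # a single hospital pixel
elevation_agg = xr.DataArray(elevation, coords={"y": ys_coord, "x": xs_coord}, dims=("y", "x"))
hospitals_agg = xr.DataArray(hospitals, coords={"y": ys_coord, "x": xs_coord}, dims=("y", "x"))

# the mock housing table from the notebook
df = pd.DataFrame({
    "x": [0, 1, 2, 0, 4],
    "y": [2, 0, 1, 3, 1],
    "column_1": [2, 3, 4, 2, 6],
    "price": [1, 3, 4, 3, 7],
})

# slope at every pixel, derived from the elevation raster
slope_agg = slope(elevation_agg)

# distance from every pixel to the nearest hospital pixel
# (assumes proximity's default of treating non-zero cells as targets)
hospital_dist_agg = proximity(hospitals_agg)

# sample both rasters at each house location via nearest-pixel lookup
house_x = xr.DataArray(df["x"].values, dims="house")
house_y = xr.DataArray(df["y"].values, dims="house")
df["slope"] = slope_agg.sel(x=house_x, y=house_y, method="nearest").values
df["dist_to_hospital"] = hospital_dist_agg.sel(x=house_x, y=house_y, method="nearest").values
print(df)
```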

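The retrain-and-compare step the notebook describes could then look like the following sketch, continuing from `df` above. It assumes scikit-learn, which the notebook does not import, and with only five mock rows the scores are meaningless; treat it purely as the shape of the workflow.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split


def score_features(data, feature_cols):
    """Train on the given columns and return the mean absolute error on a held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        data[feature_cols], data["price"], test_size=0.2, random_state=42
    )
    model = RandomForestRegressor(random_state=42)
    model.fit(X_train, y_train)
    return mean_absolute_error(y_test, model.predict(X_test))


baseline_mae = score_features(df, ["x", "y", "column_1"])
enriched_mae = score_features(df, ["x", "y", "column_1", "slope", "dist_to_hospital"])
print(f"baseline MAE: {baseline_mae:.3f}  with spatial features: {enriched_mae:.3f}")
```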
examples/user_guide/3_Zonal.ipynb

Lines changed: 50 additions & 1 deletion
@@ -157,6 +157,7 @@
   "outputs": [],
   "source": [
    "from xrspatial import zonal_stats\n",
+   "\n",
    "zones_agg.values = np.nan_to_num(zones_agg.values, copy=False).astype(int)\n",
    "zonal_stats(zones_agg, terrain)"
   ]
@@ -191,6 +192,54 @@
   "source": [
    "Here the zones are defined by line segments, but they can be any spatial pattern or, more specifically, any region computable as a Datashader aggregate."
   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Zonal crosstab\n",
+    "\n",
+    "The zonal crosstab function can be used to calculate cross-tabulated (categorical) statistics between two datasets. As an example, assume we have gender data (male/female) and want to see how the gender groups are distributed over the zones defined above.\n",
+    "\n",
+    "First, let's create a mock dataset for gender: 1 for male, and 2 for female."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# define a mask where 0s are water and 1s are land areas\n",
+    "mask = terrain.data.astype(bool).astype(int)\n",
+    "\n",
+    "# gender data where 0s are nodata, 1s are male, and 2s are female;\n",
+    "# assume the population covers only 10% of the area, with male and female equally likely\n",
+    "genders = np.random.choice([0, 1, 2], p=[0.9, 0.05, 0.05], size=zones_agg.shape) * mask\n",
+    "genders_agg = xr.DataArray(genders, coords=terrain.coords, dims=terrain.dims)\n",
+    "\n",
+    "# visualize the results\n",
+    "genders_shaded = shade(genders_agg, cmap=['white', 'blue', 'red'])\n",
+    "stack(genders_shaded, terrain_shaded)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now, let's calculate the cross-tabulated stats between the genders and zones datasets."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from xrspatial import zonal_crosstab\n",
+    "\n",
+    "zonal_crosstab(zones=zones_agg, values=genders_agg, zone_ids=[11, 12, 13, 14, 15, 16], cat_ids=[1, 2])"
+   ]
  }
  ],
  "metadata": {
@@ -209,7 +258,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.9.10"
+   "version": "3.9.14"
  }
 },
 "nbformat": 4,

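For readers new to `zonal_crosstab`, here is a toy, self-contained illustration of what the call in the diff above computes, assuming the same call signature and the default aggregation (cell counts per zone/category pair); the arrays here are made up for illustration.

```python
import numpy as np
import xarray as xr
from xrspatial import zonal_crosstab

# two zones (ids 1 and 2) and two categories (1 = male, 2 = female)
zones = xr.DataArray(np.array([[1, 1, 2, 2],
                               [1, 1, 2, 2]]))
values = xr.DataArray(np.array([[1, 2, 1, 1],
                                [2, 2, 1, 2]]))

# one row per zone and one column per category, tallying how many
# cells of each category fall inside each zone
zonal_crosstab(zones=zones, values=values, zone_ids=[1, 2], cat_ids=[1, 2])
```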