
Commit 2bd8192

Merge branch 'master' into test_python3.11
2 parents e22d678 + 69695e1 commit 2bd8192

21 files changed: +2965 -218 lines changed

README.md

Lines changed: 1 addition & 1 deletion
@@ -283,4 +283,4 @@ With the introduction of projects like Numba, Python gained new ways to provide
 #### Citation
 Cite our code:

-`makepath/xarray-spatial, https://github.com/makepath/xarray-spatial, ©2020-2022.`
+`makepath/xarray-spatial, https://github.com/makepath/xarray-spatial, ©2020-2023.`
Lines changed: 311 additions & 0 deletions
@@ -0,0 +1,311 @@
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "housing_price_feature_engineering",
      "provenance": [],
      "collapsed_sections": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# Housing price prediction feature engineering\n",
        "\n",
        "In general, the price of a house is determined by many factors, and location always plays a paramount role in the value of the property. In this notebook, we will explore how geo-related aspects affect housing prices in the USA. We will describe each house by the information provided in ***an existing dataset***, plus some additional spatial attributes extracted from its location using xarray-spatial ***(and probably an elevation dataset, and census-parquet as well?)***.\n",
        "\n",
        "Existing features:\n",
        "- ...\n",
        "\n",
        "New features:\n",
        "- ***Slope?*** (from an elevation dataset)\n",
        "- ***Population density, ...?*** (from Census data if none is available in the existing features)\n",
        "- Distance to the nearest hospital (or grocery store / university / pharmacy)\n",
        "\n",
        "We'll first build a machine learning model and train it on all existing features. For each newly added feature, we'll retrain the model and compare the results to find out which features help enrich it."
      ],
      "metadata": {
        "id": "WEWW_CD4owNL"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Imports\n",
        "\n",
        "First, let's import all necessary libraries."
      ],
      "metadata": {
        "id": "LWdCWgs31tjo"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "VoNuwihJovVG"
      },
      "outputs": [],
      "source": [
        "import numpy as np\n",
        "import pandas as pd\n",
        "import rasterio\n",
        "\n",
        "import datashader as ds\n",
        "from datashader.transfer_functions import shade\n",
        "from datashader.transfer_functions import stack\n",
        "from datashader.transfer_functions import dynspread\n",
        "from datashader.transfer_functions import set_background\n",
        "from datashader.colors import Elevation\n",
        "\n",
        "from xrspatial import slope\n",
        "from xrspatial import proximity"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Load the existing dataset"
      ],
      "metadata": {
        "id": "eHISReTx1zmb"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# assume the data contains lat/lon coordinates with some additional values\n",
        "df = pd.DataFrame({\n",
        "    'id': [0, 1, 2, 3, 4],\n",
        "    'x': [0, 1, 2, 0, 4],\n",
        "    'y': [2, 0, 1, 3, 1],\n",
        "    'column_1': [2, 3, 4, 2, 6],\n",
        "    'price': [1, 3, 4, 3, 7]\n",
        "})"
      ],
      "metadata": {
        "id": "o492hk5B5_mM"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Build and train a housing price model\n",
        "\n",
        "We'll split the data into a train set and a test set."
      ],
      "metadata": {
        "id": "NoXJhz-7153G"
      }
    },
    {
      "cell_type": "code",
      "source": [
        ""
      ],
      "metadata": {
        "id": "VrvvjQ0v6FME"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Now let's build the model to predict housing prices."
      ],
      "metadata": {
        "id": "mIm7gBOf9Mi-"
      }
    },
    {
      "cell_type": "code",
      "source": [
        ""
      ],
      "metadata": {
        "id": "rSEWxAWh9Q7j"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "After tuning the hyperparameters, we select the best model as below."
      ],
      "metadata": {
        "id": "t7KiSyr29ROE"
      }
    },
    {
      "cell_type": "code",
      "source": [
        ""
      ],
      "metadata": {
        "id": "hmLkxoO-963c"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Prediction accuracy on the test set."
      ],
      "metadata": {
        "id": "AL8mMHzI9-Hz"
      }
    },
    {
      "cell_type": "code",
      "source": [
        ""
      ],
      "metadata": {
        "id": "m4Pg2hXv99tf"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Calculate spatial attributes\n",
        "\n",
        "As stated above, we'll calculate spatial attributes (slope, ...?) of each location and its proximity to the nearest services.\n",
        "\n",
        "**TBD**: What is the format of the additional data? Is it vector or raster?\n",
        "- If raster (preferred), load it directly as 2D xarray DataArrays.\n",
        "- If vector, load it into a pandas/geopandas DataFrame and rasterize it with Datashader.\n",
        "\n",
        "Assume that the data is in vector format."
      ],
      "metadata": {
        "id": "18o3Swi-6Fjg"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# bounding box of the raster\n",
        "xmin, xmax, ymin, ymax = (\n",
        "    df.x.min(),\n",
        "    df.x.max(),\n",
        "    df.y.min(),\n",
        "    df.y.max()\n",
        ")\n",
        "xrange = (xmin, xmax)\n",
        "yrange = (ymin, ymax)\n",
        "\n",
        "# width and height of the raster image\n",
        "W, H = 800, 600\n",
        "\n",
        "# canvas object to rasterize the houses\n",
        "cvs = ds.Canvas(plot_width=W, plot_height=H, x_range=xrange, y_range=yrange)\n",
        "raster = cvs.points(df, x='x', y='y', agg=ds.min('id'))\n",
        "\n",
        "# visualize the raster\n",
        "points_shaded = dynspread(shade(raster, cmap='salmon', min_alpha=0, span=(0, 1), how='linear'), threshold=1, max_px=5)\n",
        "set_background(points_shaded, 'black')"
      ],
      "metadata": {
        "id": "mtQu5wSd6V_Y"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Identify the location of each house in pixel space."
      ],
      "metadata": {
        "id": "6k9E72q9Ge_d"
      }
    },
    {
      "cell_type": "code",
      "source": [
        ""
      ],
      "metadata": {
        "id": "EW3-QIoQLRN_"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Calculate the new feature values."
      ],
      "metadata": {
        "id": "kmkbu2e2LrQo"
      }
    },
    {
      "cell_type": "code",
      "source": [
        ""
      ],
      "metadata": {
        "id": "NJ6Djtu1Lqsv"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Retrain the model with the new feature and compute the test accuracy."
      ],
      "metadata": {
        "id": "b15dhXjXLxMj"
      }
    },
    {
      "cell_type": "code",
      "source": [
        ""
      ],
      "metadata": {
        "id": "bVjatg3KLvvz"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Feature selection"
      ],
      "metadata": {
        "id": "YEFwSg5Q6W1Z"
      }
    },
    {
      "cell_type": "code",
      "source": [
        ""
      ],
      "metadata": {
        "id": "KQB3kmlg6Zh4"
      },
      "execution_count": null,
      "outputs": []
    }
  ]
}
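The notebook above leaves the feature-engineering cells empty for now. Below is a minimal sketch of how the slope and distance-to-hospital features could be computed with xarray-spatial and sampled back onto the housing table. The elevation and hospitals rasters here are hypothetical stand-ins (not part of the commit), and the `proximity` call assumes its default behaviour of treating non-zero cells as targets.

```python
import numpy as np
import pandas as pd
import xarray as xr
from xrspatial import slope, proximity

# --- hypothetical stand-ins, for illustration only ---
# a small synthetic elevation raster and a hospitals raster sharing one grid
ys_coord = np.linspace(0.0, 3.0, 60)
xs_coord = np.linspace(0.0, 4.0, 80)
elevation = np.random.rand(60, 80) * 100.0
hospitals = np.zeros((60, 80))
hospitals[10, 20] = 1  # a single hospital pixel
elevation_agg = xr.DataArray(elevation, coords={"y": ys_coord, "x": xs_coord}, dims=("y", "x"))
hospitals_agg = xr.DataArray(hospitals, coords={"y": ys_coord, "x": xs_coord}, dims=("y", "x"))

# the mock housing table from the notebook
df = pd.DataFrame({
    "x": [0, 1, 2, 0, 4],
    "y": [2, 0, 1, 3, 1],
    "column_1": [2, 3, 4, 2, 6],
    "price": [1, 3, 4, 3, 7],
})

# slope at every pixel, derived from the elevation raster
slope_agg = slope(elevation_agg)

# distance from every pixel to the nearest hospital pixel
# (assumes proximity's default of treating non-zero cells as targets)
hospital_dist_agg = proximity(hospitals_agg)

# sample both rasters at each house location via nearest-pixel lookup
house_x = xr.DataArray(df["x"].values, dims="house")
house_y = xr.DataArray(df["y"].values, dims="house")
df["slope"] = slope_agg.sel(x=house_x, y=house_y, method="nearest").values
df["dist_to_hospital"] = hospital_dist_agg.sel(x=house_x, y=house_y, method="nearest").values
print(df)
```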

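The retrain-and-compare step the notebook describes could then look like the following sketch, continuing from `df` above. It assumes scikit-learn, which the notebook does not import, and with only five mock rows the scores are meaningless; treat it purely as the shape of the workflow.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split


def score_features(data, feature_cols):
    """Train on the given columns and return the mean absolute error on a held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        data[feature_cols], data["price"], test_size=0.2, random_state=42
    )
    model = RandomForestRegressor(random_state=42)
    model.fit(X_train, y_train)
    return mean_absolute_error(y_test, model.predict(X_test))


baseline_mae = score_features(df, ["x", "y", "column_1"])
enriched_mae = score_features(df, ["x", "y", "column_1", "slope", "dist_to_hospital"])
print(f"baseline MAE: {baseline_mae:.3f}  with spatial features: {enriched_mae:.3f}")
```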
examples/user_guide/3_Zonal.ipynb

Lines changed: 50 additions & 1 deletion
@@ -157,6 +157,7 @@
   "outputs": [],
   "source": [
    "from xrspatial import zonal_stats\n",
+   "\n",
    "zones_agg.values = np.nan_to_num(zones_agg.values, copy=False).astype(int)\n",
    "zonal_stats(zones_agg, terrain)"
   ]
@@ -191,6 +192,54 @@
   "source": [
    "Here the zones are defined by line segments, but they can be any spatial pattern or, more specifically, any region computable as a Datashader aggregate."
   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Zonal crosstab\n",
+    "\n",
+    "The zonal crosstab function can be used to calculate cross-tabulated (categorical) statistics between two datasets. As an example, assume we have gender data (male/female) and want to see how the gender groups are distributed over the zones defined above.\n",
+    "\n",
+    "First, let's create a mock dataset for gender: 1 for male, and 2 for female."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# define a mask where 0s are water and 1s are land areas\n",
+    "mask = terrain.data.astype(bool).astype(int)\n",
+    "\n",
+    "# gender data where 0s are nodata, 1s are male, and 2s are female;\n",
+    "# assume the population covers only 10% of the area, with male and female equally likely\n",
+    "genders = np.random.choice([0, 1, 2], p=[0.9, 0.05, 0.05], size=zones_agg.shape) * mask\n",
+    "genders_agg = xr.DataArray(genders, coords=terrain.coords, dims=terrain.dims)\n",
+    "\n",
+    "# visualize the results\n",
+    "genders_shaded = shade(genders_agg, cmap=['white', 'blue', 'red'])\n",
+    "stack(genders_shaded, terrain_shaded)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now, let's calculate the cross-tabulated stats between the genders and zones datasets."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from xrspatial import zonal_crosstab\n",
+    "\n",
+    "zonal_crosstab(zones=zones_agg, values=genders_agg, zone_ids=[11, 12, 13, 14, 15, 16], cat_ids=[1, 2])"
+   ]
  }
  ],
  "metadata": {
@@ -209,7 +258,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.9.10"
+   "version": "3.9.14"
  }
 },
 "nbformat": 4,

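For readers new to `zonal_crosstab`, here is a toy, self-contained illustration of what the call in the diff above computes, assuming the same call signature and the default aggregation (cell counts per zone/category pair); the arrays here are made up for illustration.

```python
import numpy as np
import xarray as xr
from xrspatial import zonal_crosstab

# two zones (ids 1 and 2) and two categories (1 = male, 2 = female)
zones = xr.DataArray(np.array([[1, 1, 2, 2],
                               [1, 1, 2, 2]]))
values = xr.DataArray(np.array([[1, 2, 1, 1],
                                [2, 2, 1, 2]]))

# one row per zone and one column per category, tallying how many
# cells of each category fall inside each zone
zonal_crosstab(zones=zones, values=values, zone_ids=[1, 2], cat_ids=[1, 2])
```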