diff --git a/.DS_Store b/.DS_Store deleted file mode 100644 index 5fce09cd10..0000000000 Binary files a/.DS_Store and /dev/null differ diff --git a/README.md b/README.md index cdf14c0967..2bcf367bc9 100644 --- a/README.md +++ b/README.md @@ -1,31 +1,51 @@ # Amazon SageMaker Examples -This repository contains example notebooks that show how to apply machine learning and deep learning in Amazon SageMaker(https://aws.amazon.com/amazon-ai/). +This repository contains example notebooks that show how to apply machine learning and deep learning in [Amazon SageMaker](https://aws.amazon.com/machine-learning/platforms/sagemaker). ## Examples ### Introduction to Applying Machine Learning -- [XGBoost for Direct Marketing](xgboost_direct_marketing) targets potential customers that are most likely to convert based on customer and aggregate level metrics. -- [PCA and k-means for Movie Clustering](pca_kmeans_movie_clustering) creates clusters of movies based on genre, ratings, and other characteristics. +These examples provide a gentle introduction to machine learning concepts as they are applied in practical use cases across a variety of sectors. + +- [Targeted Direct Marketing](introduction_to_applying_machine_learning/xgboost_direct_marketing) predicts which potential customers are most likely to convert based on customer and aggregate level metrics, using Amazon SageMaker's implementation of [XGBoost](https://github.com/dmlc/xgboost). +- [Predicting Customer Churn](introduction_to_applying_machine_learning/xgboost_customer_churn) uses customer interaction and service usage data to find those most likely to churn, and then walks through the cost/benefit trade-offs of providing retention incentives. This uses Amazon SageMaker's implementation of [XGBoost](https://github.com/dmlc/xgboost) to create a highly predictive model. +- [Time-series Forecasting](introduction_to_applying_machine_learning/linear_time_series_forecast) generates a forecast for topline product demand using Amazon SageMaker's Linear Learner algorithm. +- [Cancer Prediction](introduction_to_applying_machine_learning/breast_cancer_prediction) predicts breast cancer based on features derived from images, using Amazon SageMaker's Linear Learner. ### Introduction to Amazon Algorithms +These examples provide quick walkthroughs to get you up and running with Amazon SageMaker's custom-developed algorithms. Most of these algorithms can train on distributed hardware, scale incredibly well, and are faster and cheaper than popular alternatives. + +- [k-means](introduction_to_amazon_algorithms/1P_kmeans_highlevel) is our introductory example for Amazon SageMaker. It walks through the process of clustering MNIST images of handwritten digits using Amazon SageMaker k-means. +- [Factorization Machines](introduction_to_amazon_algorithms/factorization_machines_mnist) showcases Amazon SageMaker's implementation of the algorithm to predict whether a handwritten digit from the MNIST dataset is a 0 or not using a binary classifier. +- [Latent Dirichlet Allocation (LDA)](introduction_to_amazon_algorithms/lda_topic_modeling) introduces topic modeling using Amazon SageMaker Latent Dirichlet Allocation (LDA) on a synthetic dataset. +- [Linear Learner](introduction_to_amazon_algorithms/linear_learner_mnist) predicts whether a handwritten digit from the MNIST dataset is a 0 or not using a binary classifier from Amazon SageMaker Linear Learner. 
+- [Neural Topic Model (NTM)](introduction_to_amazon_algorithms/ntm_synthetic) uses Amazon SageMaker Neural Topic Model (NTM) to uncover topics in documents from a synthetic data source, where topic distributions are known. +- [Principal Components Analysis (PCA)](introduction_to_amazon_algorithms/pca_mnist) uses Amazon SageMaker PCA to calculate eigendigits from MNIST. +- [Seq2Seq](introduction_to_amazon_algorithms/seq2seq) uses the Amazon SageMaker Seq2Seq algorithm that's built on top of [Sockeye](https://github.com/awslabs/sockeye), which is a sequence-to-sequence framework for Neural Machine Translation based on MXNet. Seq2Seq implements state-of-the-art encoder-decoder architectures which can also be used for tasks like Abstractive Summarization in addition to Machine Translation. This notebook shows translation from English to German text. +- [XGBoost for regression](introduction_to_amazon_algorithms/xgboost_abalone) predicts the age of abalone ([Abalone dataset](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html)) using regression from Amazon SageMaker's implementation of [XGBoost](https://github.com/dmlc/xgboost). +- [XGBoost for multi-class classification](introduction_to_amazon_algorithms/xgboost_mnist) uses Amazon SageMaker's implementation of [XGBoost](https://github.com/dmlc/xgboost) to classify handwritten digits from the MNIST dataset as one of the ten digits using a multi-class classifier. Both single machine and distributed use-cases are presented. + ### Scientific Details of Algorithms +These examples provide more thorough mathematical treatment of a select group of algorithms. + +- [Latent Dirichlet Allocation (LDA)](scientific_details_of_algorithms/lda_topic_modeling) dives into Amazon SageMaker's spectral decomposition approach to LDA. + ### Advanced Amazon SageMaker Functionality -- [Installing the R Kernel](install_r_kernel) shows how to install the R kernel into an Amazon SageMaker Notebook Instance. -- [Bring Your Own Model for k-means](kmeans_bring_your_own_model) shows how to take a model that's been fit elsewhere and use Amazon SageMaker containers to host. -- [Bring Your Own Algorithm with R](r_bring_your_own) shows how to bring your own algorithm container to Amazon SageMaker using the R language. +- [Installing the R Kernel](advanced_functionality/install_r_kernel) shows how to install the R kernel into an Amazon SageMaker Notebook Instance. +- [Bring Your Own Model for k-means](advanced_functionality/kmeans_bring_your_own_model) shows how to take a model that's been fit elsewhere and use Amazon SageMaker Algorithms containers to host it. +- [Bring Your Own Algorithm with R](advanced_functionality/r_bring_your_own) shows how to bring your own algorithm container to Amazon SageMaker using the R language. - [Bring Your Own Tensorflow Model](sagemaker-python-sdk/tensorflow_iris_byom) shows how to bring a model trained anywhere into Amazon SageMaker ## FAQ -*Will these example work outside of Amazon SageMaker?* +*Will these examples work outside of Amazon SageMaker?* - Although most examples utilize key Amazon SageMaker functionality like distributed, managed training or real-time hosted endpoints, these notebooks can be run outside of Amazon SageMaker Notebook Instances with minimal modification (updating IAM role definition and installing the necessary libraries). -*How do I contribute my own example notebook?" 
+*How do I contribute my own example notebook?* -- Although we're extremely excited to receive contributions from the community, we're still working on the best mechanism to take in examples from and external source. Please bear will us in the short-term if pull requests take longer than expected or are closed. +- Although we're extremely excited to receive contributions from the community, we're still working on the best mechanism to take in examples from an external source. Please bear with us in the short-term if pull requests take longer than expected or are closed. diff --git a/introduction_to_amazon_algorithms/README.md b/introduction_to_amazon_algorithms/README.md index 0e4ab9ac24..89f73bd937 100644 --- a/introduction_to_amazon_algorithms/README.md +++ b/introduction_to_amazon_algorithms/README.md @@ -3,6 +3,7 @@ This directory includes introductory examples to Amazon SageMaker Algorithms that we have developed so far. It seeks to provide guidance and examples on basic functionality rather than a detailed scientific review or an implementation on complex, real-world data. Example Notebooks include: +- *1P_kmeans_highlevel*: Our introduction to Amazon SageMaker which walks through the process of clustering MNIST images of handwritten digits. - *factorization_machines_mnist*: Predicts whether a handwritten digit from the MNIST dataset is a 0 or not using a binary classifier from Amazon SageMaker Factorization Machines. - *lda_topic_modeling*: Topic modeling using Amazon SageMaker Latent Dirichlet Allocation (LDA) on a synthetic dataset. - *linear_mnist*: Predicts whether a handwritten digit from the MNIST dataset is a 0 or not using a binary classifier from Amazon SageMaker Linear Learner. diff --git a/sagemaker-python-sdk/1P_kmeans_highlevel/kmeans_mnist.ipynb b/sagemaker-python-sdk/1P_kmeans_highlevel/kmeans_mnist.ipynb index dfe550ff69..2f3ff60fa3 100644 --- a/sagemaker-python-sdk/1P_kmeans_highlevel/kmeans_mnist.ipynb +++ b/sagemaker-python-sdk/1P_kmeans_highlevel/kmeans_mnist.ipynb @@ -41,13 +41,10 @@ "\n", "### Permissions and environment variables\n", "\n", - "Here we set up the linkage and authentication to AWS services. There are three parts to this:\n", + "Here we set up the linkage and authentication to AWS services. There are two parts to this:\n", "\n", - "1. The credentials and region for the account that's running training. Upload the credentials in the normal AWS credentials file format using the jupyter upload feature.\n", - "2. The roles used to give learning and hosting access to your data. See the documentation for how to specify these.\n", - "3. The S3 bucket that you want to use for training and model data.\n", - "\n", - "_Note:_ Credentials for hosted notebooks will be automated before the final release." + "1. The role(s) used to give learning and hosting access to your data. See the documentation for how to specify these.\n", + "1. The S3 bucket name and locations that you want to use for training and model data." ] }, { @@ -82,7 +79,9 @@ "source": [ "### Data ingestion\n", "\n", - "Next, we read the dataset from the existing repository into memory, for preprocessing prior to training. This processing could be done *in situ* by Amazon Athena, Apache Spark in Amazon EMR, Amazon Redshift, etc., assuming the dataset is present in the appropriate location. Then, the next step would be to transfer the data to S3 for use in training. For small datasets, such as this one, reading into memory isn't onerous, though it would be for larger datasets." 
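For reference, a minimal sketch of the setup that the revised permissions cells above call for, assuming the SageMaker Python SDK is available; the bucket name and key prefix are placeholders, not values taken from the notebook:

```python
# Resolve the notebook's execution role and choose an S3 location for
# training data and model output (placeholder names, not from the notebook).
from sagemaker import get_execution_role

role = get_execution_role()                  # IAM role for training and hosting access
bucket = "<your-s3-bucket-name>"             # an S3 bucket you own
data_key = "kmeans_example/data/matrix.pbr"  # where converted training data could live
output_location = f"s3://{bucket}/kmeans_example/output"
```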
+ "Next, we read the dataset from the existing repository into memory, for preprocessing prior to training. In this case we'll use the MNIST dataset, which contains 70K 28 x 28 pixel images of handwritten digits. For more details, please see [here](http://yann.lecun.com/exdb/mnist/).\n", + "\n", + "This processing could be done *in situ* by Amazon Athena, Apache Spark in Amazon EMR, Amazon Redshift, etc., assuming the dataset is present in the appropriate location. Then, the next step would be to transfer the data to S3 for use in training. For small datasets, such as this one, reading into memory isn't onerous, though it would be for larger datasets." ] }, { @@ -137,7 +136,7 @@ "source": [ "## Training the K-Means model\n", "\n", - "Once we have the data preprocessed and available in the correct format for training, the next step is to actually train the model using the data. Since this data is relatively small, it isn't meant to show off the performance of the kmeans training algorithm - we will visit that in another example.\n", + "Once we have the data preprocessed and available in the correct format for training, the next step is to actually train the model using the data. Since this data is relatively small, it isn't meant to show off the performance of the k-means training algorithm. But Amazon SageMaker's k-means has been tested on, and scales well with, multi-terabyte datasets.\n", "\n", "After setting training parameters, we kick off training, and poll for status until training is completed, which in this example, takes between 7 and 11 minutes." ] @@ -174,12 +173,7 @@ "metadata": {}, "source": [ "## Set up hosting for the model\n", - "In order to set up hosting, we have to import the model from training to hosting. A common question would be, why wouldn't we automatically go from training to hosting? As we worked through examples of what customers were looking to do with hosting, we realized that the Amazon ML model of hosting was unlikely to be sufficient for all customers.\n", - "\n", - "As a result, we have introduced some flexibility with respect to model deployment, with the goal of additional model deployment targets after launch. In the short term, that introduces some complexity, but we are actively working on making that easier for customers, even before GA.\n", - "\n", - "### Import model into hosting\n", - "Next, you register the model with hosting. This allows you the flexibility of importing models trained elsewhere, as well as the choice of not importing models if the target of model creation is AWS Lambda, AWS Greengrass, Amazon Redshift, Amazon Athena, or other deployment target." + "Now, we can deploy the model we just trained behind a real-time hosted endpoint. This next step can take, on average, 7 to 11 minutes to complete." ] }, { @@ -199,7 +193,7 @@ "metadata": {}, "source": [ "## Validate the model for use\n", - "Finally, the customer can now validate the model for use. They can obtain the endpoint from the client library using the result from previous operations, and generate classifications from the trained model using that endpoint." + "Finally, we'll validate the model for use. Let's generate a classification for a single observation from the trained model using the endpoint we just created." ] }, { @@ -268,7 +262,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### (Optional) Delete the Endpoint" + "### (Optional) Delete the Endpoint\n", + "If you're ready to be done with this notebook, make sure run the cell below. 
This will remove the hosted endpoint you created and avoid any charges from a stray instance being left on." ] }, { @@ -291,13 +286,6 @@ "#import sagemaker\n", "#sagemaker.Session().delete_endpoint(kmeans_predictor.endpoint)" ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] } ], "metadata": { diff --git a/sagemaker-python-sdk/1P_kmeans_lowlevel/kmeans_mnist_lowlevel.ipynb b/sagemaker-python-sdk/1P_kmeans_lowlevel/kmeans_mnist_lowlevel.ipynb index e3114a3680..68524bcd59 100644 --- a/sagemaker-python-sdk/1P_kmeans_lowlevel/kmeans_mnist_lowlevel.ipynb +++ b/sagemaker-python-sdk/1P_kmeans_lowlevel/kmeans_mnist_lowlevel.ipynb @@ -41,13 +41,10 @@ "\n", "### Permissions and environment variables\n", "\n", - "Here we set up the linkage and authentication to AWS services. There are three parts to this:\n", + "Here we set up the linkage and authentication to AWS services. There are two parts to this:\n", "\n", - "1. The credentials and region for the account that's running training. Upload the credentials in the normal AWS credentials file format using the jupyter upload feature.\n", - "2. The roles used to give learning and hosting access to your data. See the documentation for how to specify these.\n", - "3. The S3 bucket that you want to use for training and model data.\n", - "\n", - "_Note:_ Credentials for hosted notebooks will be automated before the final release." + "1. The role(s) used to give learning and hosting access to your data. See the documentation for how to specify these.\n", + "1. The S3 bucket name and location that you want to use for training and model data." ] }, { @@ -82,7 +79,9 @@ "source": [ "### Data ingestion\n", "\n", - "Next, we read the dataset from the existing repository into memory, for preprocessing prior to training. This processing could be done *in situ* by Amazon Athena, Apache Spark in Amazon EMR, Amazon Redshift, etc., assuming the dataset is present in the appropriate location. Then, the next step would be to transfer the data to S3 for use in training. For small datasets, such as this one, reading into memory isn't onerous, though it would be for larger datasets." + "Next, we read the dataset from the existing repository into memory, for preprocessing prior to training. In this case we'll use the MNIST dataset, which contains 70K 28 x 28 pixel images of handwritten digits. For more details, please see [here](http://yann.lecun.com/exdb/mnist/).\n", + "\n", + "This processing could be done *in situ* by Amazon Athena, Apache Spark in Amazon EMR, Amazon Redshift, etc., assuming the dataset is present in the appropriate location. Then, the next step would be to transfer the data to S3 for use in training. For small datasets, such as this one, reading into memory isn't onerous, though it would be for larger datasets." ] }, { @@ -137,12 +136,9 @@ "source": [ "### Data conversion and upload\n", "\n", - "Since algorithms have particular input and output requirements, converting the dataset is also part of the process that a data scientist goes through prior to initiating training. In this particular case, the hosted implementation of k-means takes recordio-wrapped protobuf, where the data we have today is a pickle-ized numpy array on disk.\n", - "\n", - "Some of the effort involved in the protobuf format conversion is hidden in a library that is imported, below. This library will be folded into the SDK for algorithm authors to make it easier for algorithm authors to support multiple formats. 
This doesn't __prevent__ algorithm authors from requiring non-standard formats, but it encourages them to support the standard ones.\n", + "Since algorithms have particular input and output requirements, converting the dataset is also part of the process that a data scientist goes through prior to initiating training. In this particular case, the hosted implementation of k-means takes recordIO-wrapped protobuf, whereas the data we have right now is a pickle-ized numpy array on disk.\n", "\n", - "\n", - "For this dataset, conversion takes approximately one minute." + "To make this process easier, we'll use a function from the Amazon SageMaker Python SDK. For this dataset, conversion can take up to one minute." ] }, { @@ -170,7 +166,7 @@ "source": [ "## Training the K-Means model\n", "\n", - "Once we have the data preprocessed and available in the correct format for training, the next step is to actually train the model using the data. Since this data is relatively small, it isn't meant to show off the performance of the kmeans training algorithm - we will visit that in another example.\n", + "Once we have the data preprocessed and available in the correct format for training, the next step is to actually train the model using the data. Since this data is relatively small, it isn't meant to show off the performance of the k-means training algorithm. But Amazon SageMaker's k-means has been tested on, and scales well with, multi-terabyte datasets.\n", "\n", "After setting training parameters, we kick off training, and poll for status until training is completed, which in this example, takes between 7 and 11 minutes." ] }, { @@ -259,9 +255,7 @@ "metadata": {}, "source": [ "## Set up hosting for the model\n", - "In order to set up hosting, we have to import the model from training to hosting. A common question would be, why wouldn't we automatically go from training to hosting? As we worked through examples of what customers were looking to do with hosting, we realized that the Amazon ML model of hosting was unlikely to be sufficient for all customers.\n", - "\n", - "As a result, we have introduced some flexibility with respect to model deployment, with the goal of additional model deployment targets after launch. In the short term, that introduces some complexity, but we are actively working on making that easier for customers, even before GA.\n", + "In order to set up hosting, we have to import the model from training to hosting. A common question would be, why wouldn't we automatically go from training to hosting? And, in fact, the [k-means high-level example](/notebooks/sagemaker-python-sdk/1P_kmeans_highlevel/kmeans_mnist.ipynb) shows the functionality to do that. For this low-level example, though, it makes sense to show each step in the process to provide a better understanding of the flexibility available.\n", "\n", "### Import model into hosting\n", "Next, you register the model with hosting. This allows you the flexibility of importing models trained elsewhere, as well as the choice of not importing models if the target of model creation is AWS Lambda, AWS Greengrass, Amazon Redshift, Amazon Athena, or other deployment target." @@ -302,9 +296,7 @@ "metadata": {}, "source": [ "### Create endpoint configuration\n", - "At launch, we will support configuring REST endpoints in hosting with multiple models, e.g. for A/B testing purposes. 
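A possible shape for the conversion step mentioned above, assuming it relies on `write_numpy_to_dense_tensor` from `sagemaker.amazon.common`; `train_set`, `bucket`, and `data_key` are the placeholder names used in the earlier sketches:

```python
# Serialize the numpy array to recordIO-wrapped protobuf and upload it to S3.
import io

import boto3
import numpy as np
import sagemaker.amazon.common as smac

buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, train_set.astype(np.float32))
buf.seek(0)

boto3.resource("s3").Bucket(bucket).Object(data_key).upload_fileobj(buf)
data_location = f"s3://{bucket}/{data_key}"  # use this as the training data channel
```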
In order to support this, customers create an endpoint configuration, that describes the distribution of traffic across the models, whether split, shadowed, or sampled in some way.\n", - "\n", - "In addition, the endpoint configuration describes the instance type required for model deployment, and at launch will describe the autoscaling configuration." + "Now, we'll create an endpoint configuration which provides the instance type and count for model deployment." ] }, { @@ -375,7 +367,7 @@ "metadata": {}, "source": [ "## Validate the model for use\n", - "Finally, the customer can now validate the model for use. They can obtain the endpoint from the client library using the result from previous operations, and generate classifications from the trained model using that endpoint.\n" + "Finally, we'll validate the model for use. Let's generate a classification for a single observation from the trained model using the endpoint we just created." ] }, { @@ -472,7 +464,7 @@ "source": [ "### Clean up\n", "\n", - "When we're done with the endpoint, we can just delete it and the backing instances will be released." + "If you're ready to be done with this notebook, make sure to run the cell below. This will remove the hosted endpoint you created and avoid any charges from a stray instance being left on." ] }, { @@ -485,13 +477,6 @@ "\n", "# sagemaker.delete_endpoint(EndpointName=endpoint_name)" ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] } ], "metadata": {
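To make the low-level hosting path above concrete, here is a hedged sketch of the endpoint configuration, deployment, validation, and cleanup calls using boto3; the model, endpoint, and data names are illustrative, and `train_set` is assumed to be the same flattened MNIST array used for training:

```python
# Create an endpoint configuration and endpoint, invoke it once, then delete it.
import boto3

sm = boto3.client("sagemaker")
runtime = boto3.client("sagemaker-runtime")

model_name = "kmeans-mnist-model"  # assumed to exist from an earlier create_model call
endpoint_config_name = "kmeans-mnist-endpoint-config"
endpoint_name = "kmeans-mnist-endpoint"

# Endpoint configuration: the instance type and count used for model deployment.
sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": model_name,
        "InstanceType": "ml.m4.xlarge",
        "InitialInstanceCount": 1,
    }],
)

# Create the endpoint and wait until it is in service.
sm.create_endpoint(EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name)
sm.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)

# Validate: score a single observation sent as CSV.
payload = ",".join(str(v) for v in train_set[0])
response = runtime.invoke_endpoint(
    EndpointName=endpoint_name, ContentType="text/csv", Body=payload
)
print(response["Body"].read().decode())

# Clean up to avoid charges from a stray instance being left on.
sm.delete_endpoint(EndpointName=endpoint_name)
```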