---
title: "Running Stan on the GPU with OpenCL"
author: "Rok Češnovar and Jonah Gabry"
output:
  rmarkdown::html_vignette:
params:
  EVAL: !r identical(Sys.getenv("CMDSTANR_OPENCL_TESTS"), "true")
vignette: >
  %\VignetteIndexEntry{Running Stan on the GPU with OpenCL}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## Introduction

This vignette demonstrates how to use the OpenCL capabilities of CmdStan with
CmdStanR. The functionality described in this vignette requires CmdStan 2.26.1
or newer.
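
If you are unsure which version of CmdStan your CmdStanR installation is using,
one quick way to check from R (a small sketch; it assumes CmdStan is already
installed and registered with CmdStanR) is:

```
library(cmdstanr)

# prints the version of the CmdStan installation CmdStanR is set up to use,
# which should be "2.26.1" or newer for the features described in this vignette
cmdstan_version()
```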

As of version 2.26.1, users can expect speedups with OpenCL when using vectorized
probability distribution functions (functions with the `_lpdf` or `_lpmf`
suffix) and when the input variables contain at least 20,000 elements.

The actual speedup for a model will depend on the particular `lpdf/lpmf`
functions used and whether those functions are the bottlenecks of the model.
The more computationally complex the function is, the larger the expected
speedup. The biggest speedups are expected when using the specialized GLM
functions.

In order to establish the bottlenecks in your model we recommend using
[profiling](../profiling.Rmd), which was introduced in Stan version 2.26.0.
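
As a small sketch of that workflow (the file name `logistic_profiling.stan` and
the data list `data_list` are placeholders, and the model is assumed to wrap
parts of its `model` block in named `profile()` sections), the profiling results
can be read back into R after sampling:

```
library(cmdstanr)

# hypothetical model whose model block uses profile("...") sections
mod_profiling <- cmdstan_model("logistic_profiling.stan")
fit <- mod_profiling$sample(data = data_list)

# list with one data frame per chain, timing each named profile section
fit$profiles()
```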

## OpenCL runtime

OpenCL is supported on most modern CPUs and GPUs. In order to use
OpenCL in CmdStanR, an OpenCL runtime for the target device must be installed.
A guide for the most common devices is available in the CmdStan manual's
[chapter on parallelization](https://mc-stan.org/docs/2_26/cmdstan-guide/parallelization.html#opencl).

## Compiling a model with OpenCL

By default, models in CmdStanR are compiled *without* OpenCL support. Once OpenCL
support is enabled, a CmdStan model will make use of OpenCL if the functions
in the model support it. Technically no changes to a model are required to
support OpenCL since the choice of using OpenCL is handled by the compiler,
but it can still be useful to rewrite a model to be more OpenCL-friendly by
using vectorization as much as possible when using probability distributions.

Consider a simple logistic regression with parameters `alpha` and `beta`,
covariates `X`, and outcome `y`.

```
data {
  int<lower=1> k;
  int<lower=0> n;
  matrix[n, k] X;
  int y[n];
}
parameters {
  vector[k] beta;
  real alpha;
}
model {
  target += std_normal_lpdf(beta);
  target += std_normal_lpdf(alpha);
  target += bernoulli_logit_glm_lpmf(y | X, alpha, beta);
}
```

Below we generate some fake data for this model:

```{r, message=FALSE}
library(cmdstanr)

n <- 200000
k <- 20
X <- matrix(rnorm(n * k), ncol = k)
y <- rbinom(n, size = 1, prob = plogis(3 * X[,1] - 2 * X[,2] + 1))
mdata <- list(k = k, n = n, y = y, X = X)
```

In this model, most of the computation will be handled by the
`bernoulli_logit_glm_lpmf` function, so it should be possible to accelerate it
with OpenCL. Check the CmdStan and Stan Math documentation for the list of
functions that can currently be used with OpenCL support.

To build the model with OpenCL support, add
`cpp_options = list(stan_opencl = TRUE)` at the compilation step.

```{r compile-opencl, message=FALSE, results='hide'}
# Compile the model with STAN_OPENCL=TRUE
mod_cl <- cmdstan_model("opencl-files/bernoulli_logit_glm.stan",
                        cpp_options = list(stan_opencl = TRUE))
```
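
Passing `cpp_options` works on a per-model basis. If you would rather enable
OpenCL for every model compiled with this CmdStan installation, one option
(shown here only as a sketch; rebuilding CmdStan can take several minutes) is to
write the flag into CmdStan's `make/local` file and rebuild:

```
# optional: enable OpenCL for all models built with this CmdStan installation
cmdstan_make_local(cpp_options = list("STAN_OPENCL" = TRUE))
rebuild_cmdstan()
```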

## Running models with OpenCL

Running models with OpenCL requires specifying the OpenCL platform and device
on which to run the model (there can be multiple). If the system has one GPU
and no OpenCL CPU runtime, the platform and device IDs of the GPU are typically
both `0`, but the `clinfo` tool can be used to figure out for sure which devices
are available.
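
If you prefer to stay inside R, the same listing can be produced with a system
call (assuming the `clinfo` utility is installed on the machine):

```
# list the available OpenCL platforms and devices
system("clinfo -l")
```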

On an Ubuntu system with both CPU and GPU OpenCL support, `clinfo -l` outputs:

```
Platform #0: AMD Accelerated Parallel Processing
 `-- Device #0: gfx906+sram-ecc
Platform #1: Intel(R) CPU Runtime for OpenCL(TM) Applications
 `-- Device #0: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
```

On this system the GPU is platform ID 0 and device ID 0, while the CPU is
platform ID 1, device ID 0. These can be specified with the `opencl_ids`
argument when running a model. The `opencl_ids` argument is supplied as a vector
of length 2, where the first element is the platform ID and the second element
is the device ID.

```{r fit-opencl}
fit_cl <- mod_cl$sample(data = mdata, chains = 4, parallel_chains = 4,
                        opencl_ids = c(0, 0))
```

We'll also run a version without OpenCL and compare the run times.

```{r fit-cpu, message=FALSE}
# no OpenCL version
mod <- cmdstan_model("opencl-files/bernoulli_logit_glm.stan")
fit_cpu <- mod$sample(data = mdata, chains = 4, parallel_chains = 4, refresh = 0)
```

The speedup of the OpenCL model is:

```{r time-ratio, message=FALSE}
fit_cpu$time()$total / fit_cl$time()$total
```

This speedup will be determined by the particular GPU/CPU used, the input
problem sizes (data as well as parameters) and whether the model uses functions
that can be run on the GPU or other OpenCL devices.
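
For a more detailed comparison than the ratio of total times, the per-chain
warmup and sampling times of the two fits can also be inspected:

```
# per-chain warmup and sampling times for the OpenCL and CPU-only fits
fit_cl$time()$chains
fit_cpu$time()$chains
```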