How to automate an R script on Google Cloud
Cloud services enable us to automate scripts on the cloud so that we don’t have to worry about whether the computer is on or not (learn more about automating R scripts on a computer here). That’s why today we will learn how you can automate an R script on Google Cloud. You can even do it for free! Sounds good, right? Let’s get into it!
First steps on Google Cloud to automate an R script
First of all, we must have an account on Google Cloud. If you don’t have one, you can create it here. When you do, you will see that it asks you to insert a card. They say they don’t charge without your permission, but in case you don’t trust it, I recommend creating a virtual card with a low balance (it must be at least € 1) to make sure there is no problem.
To automate our R script on Google Cloud we will use three services: Cloud Build, Cloud Scheduler and App Engine.
Cloud Build + App Engine: working with Docker and yaml
Cloud Build will enable you to upload your R code as a yaml file, while App Engine is the service that will execute that code.
Basically, the yaml file is a text file with a sequence of steps that Google’s servers have to execute. Generally, these files are built around Docker images.
Docker packages code, together with everything it needs, into a container that can be executed by any server with Docker, regardless of the OS. So, a Docker container that contains R code can be executed even if R is not installed on the server.
To use these services we will use Google Cloud. Like any other service, they are not free, so depending on the automation you might end up paying. Luckily, every build minute costs USD 0.003 and the first 120 minutes of every day are free. So, unless you want to automate a very long script or a script that is executed every minute, most likely you will not need to pay. Anyway, you can see the price list here.
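As a rough illustration of that arithmetic, here is a minimal sketch in R. The function name cloud_build_daily_cost is made up for this example, and the defaults are just the figures quoted above (USD 0.003 per build minute, 120 free minutes per day), so check the price list before relying on it.

```r
# Hypothetical helper (not part of any package): estimated daily Cloud Build
# cost in USD, using the prices quoted in this post as defaults.
cloud_build_daily_cost <- function(build_minutes,
                                   price_per_minute = 0.003,
                                   free_minutes = 120) {
  # Only the minutes above the daily free allowance are billed
  billable_minutes <- pmax(0, build_minutes - free_minutes)
  billable_minutes * price_per_minute
}

# A 2-minute build triggered every hour: 48 build minutes per day
cloud_build_daily_cost(2 * 24)

# A 1-minute build triggered every minute: 1440 build minutes per day
cloud_build_daily_cost(1 * 60 * 24)
```

With these assumptions the hourly job stays inside the free allowance, while the every-minute job is billed for 1320 minutes, which is about USD 3.96 a day.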
Cloud Scheduler: automating Docker
With Cloud Scheduler we can automate the execution of the Cloud Build image. Basically with this service, you create a cronjob, but instead of creating it on your computer, you do it on the server (if you don’t know about cronjobs check out this post).
Regarding price, Cloud Scheduler is quite cheap and, in fact, its free limits are quite interesting. In Cloud Scheduler every task costs USD 0.1, and the first three tasks of your account are free. Although this might not sound like much, there are some important nuances:
- They charge by task, not by execution. So, if you create a task that is executed every day, this task will cost you USD 0.1, not USD 3 (30 executions at USD 0.1 each).
- Free limits are at account level, not project level. If you have different projects that use Cloud Scheduler and overall they add up to more than 3 tasks, every additional task will cost you USD 0.1.
Anyway, here you can check the table of prices.
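To make the two nuances concrete, here is a small sketch in R. The function cloud_scheduler_cost and the project names are invented for the example, and the defaults are the figures quoted above (USD 0.1 per task, three free tasks per account).

```r
# Hypothetical helper: Cloud Scheduler cost in USD for a whole account.
# `jobs_per_project` is the number of scheduled tasks in each project.
# How often each task runs does not appear anywhere in the calculation.
cloud_scheduler_cost <- function(jobs_per_project,
                                 price_per_job = 0.1,
                                 free_jobs = 3) {
  # The free limit applies to the account, so tasks are summed across projects
  total_jobs <- sum(jobs_per_project)
  max(0, total_jobs - free_jobs) * price_per_job
}

# Two projects with two tasks each: four tasks in total, one above the free limit
cloud_scheduler_cost(c(project_a = 2, project_b = 2))
```

Neither project exceeds three tasks on its own, yet under these assumptions the account as a whole pays for one task.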
Now that we already understand what each service does, let’s learn how to automate an R script in Google Cloud!
How to automate an R script on Google Cloud
Step 1. Enable Cloud Build & Cloud Scheduler APIs
The first thing that we need to do is to enable the Cloud Build and Cloud Scheduler APIs (App Engine is activated by default). To do so, we need to follow these steps:
- Go to the API library, either by clicking on this link or by accessing “APIs and services” and going to the library.
- Search for one of the two APIs that we want to activate.
- Access that service’s API configuration and enable the API.
Once both APIs are enabled they will show on the side menu. We can pin them if we want to access them more easily in the future.
Step 2. Download auth files and necessary info
If we want to use Google Cloud services from R, we will have to show R that we are the owners of that Google Cloud account. To do so, we need to create and download some credentials in JSON format. This file will be used by googleAuthR to carry out all operations.
In order to create these credentials we need to go to the service account key page. Once there we need to:
- Click on Create service account if we haven’t created one yet.
- We assign it a name. This will generate an ID, so I recommend setting a descriptive name so that it is easier to know what it is for.
- We assign the roles. Necessary roles are the following: Service Account User, Cloud Scheduler Admin, Cloud Run Admin, Storage Admin. It is important to grant all those roles because if you don’t it won’t work.
Once we have done this, a JSON file will automatically download to our computer. We will need to take that file to our R project.
Other things to configure
Apart from the JSON file, we need to include some variables in our .Renviron so that R can read information without it being exposed. These variables are:
- Project Id: to find it, click on the name of the project. The window that pops up will show the id of the project.
- Region: where you want to upload your Cloud Build. It should be the same region where you configure App Engine. In my case, I configured europe-west1.
- Cloud Build service account email. You can find this email on the Cloud Build configuration page.
- Auth: the path for the JSON file we have downloaded on the previous step.
GCE_AUTH_FILE="ruta_fichero_auth.json"
GCE_DEFAULT_PROJECT_ID="nombre_proyecto"
GCS_DEFAULT_BUCKET="my-bucket"
CR_REGION="europe-west1"
CR_BUILD_EMAIL=cloudrunner@nombre_servicio.iam.gserviceaccount.com
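Since .Renviron is only read when R starts, it can be worth checking that the variables actually reached the session before going further. The sketch below uses only base R. The helper missing_env_vars is a name I made up, and the variable names are the ones from the example file above.

```r
# Variable names taken from the .Renviron example in this post
required_vars <- c("GCE_AUTH_FILE", "GCE_DEFAULT_PROJECT_ID",
                   "GCS_DEFAULT_BUCKET", "CR_REGION", "CR_BUILD_EMAIL")

# Hypothetical helper: returns the names of the variables that are not set
missing_env_vars <- function(vars = required_vars) {
  values <- Sys.getenv(vars, unset = "")
  vars[values == ""]
}

missing <- missing_env_vars()
if (length(missing) > 0) {
  message("Not set in this session: ", paste(missing, collapse = ", "))
} else if (!file.exists(Sys.getenv("GCE_AUTH_FILE"))) {
  message("GCE_AUTH_FILE does not point to an existing file")
} else {
  message("All variables are set and the auth file exists")
}
```

If something shows up as missing right after editing .Renviron, restarting the R session is usually enough for the new values to be picked up.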
Step 3. Create and upload a Docker image to Cloud Build
Create the YAML file
To do this process we will use two packages. On the one hand, the package googleCloudRunner will enable us to interact with Cloud Build and Cloud Scheduler. Besides, it will also allow us to create the yaml file. On the other hand, the package googleAuthR will enable us to authenticate in Google Cloud. Both packages have been developed by Mark Edmondson.
Note: I recommend installing googleCloudRunner from GitHub instead of from CRAN. By doing so you make sure you get the latest version.
#remotes::install_github("MarkEdmondson1234/googleCloudRunner")
#install.packages("googleAuthR")
library(googleCloudRunner)
library(googleAuthR)
Once we have loaded the libraries we have to authenticate on Google Cloud. To do so, we will pass the path of the JSON file to the authentication function of googleAuthR.
After that we will create the yaml file. Basically, this yaml file tells Cloud Build what it has to do. As said before, a yaml file is a text file with commands that have to be followed. So, to create it we first have to indicate which steps need to be followed.
In our case, we will just have one step: cr_buildstep_r. This function will enable us to create a yaml object from an R file. If the file is stored locally, we will need to set the r_source parameter to local.
However, there are many other functions to create this YAML object. cr_buildstep_docker, for example, creates a YAML object from a docker container, while cr_buildstep allows you to create your own personalized YAML object.
Once we have built our YAML file, we can save it locally with the function cr_build_write. By doing so we can keep our deployments under version control.
my_yaml <- cr_build_yaml(
  steps = c(
    cr_buildstep_r("Automatizar.R", r_source = "local")
  )
)
cr_build_write(my_yaml, file = "cloudbuild.yaml")
## i 2020-05-03 18:04:02 > Writing to cloudbuild.yaml
Upload the YAML file to Cloud Build
Lastly, we need to upload that YAML file to Cloud Build. As we have already logged in to our account we don’t need to authenticate again. We can check the status of Cloud Build in the history section of Cloud Build.
However, for this to work it is necessary to activate App Engine. As said before, the location of App Engine has to be the same as the location of our Cloud Build. This is vital because changing the location of App Engine is not allowed. So, if you are in Europe I would recommend you set the location as europe-west1.
Once we have configured App Engine and Cloud Build, we can upload our YAML object. To do so, we will use the cr_build function, where we will indicate which YAML object we want to build.
itworks <- cr_build("cloudbuild.yaml", launch_browser = FALSE)
## i 2020-05-03 18:04:04 > Cloud Build started - logs:
## https://console.cloud.google.com/cloud-build/builds/a74e1552-b075-493f-a08d-15c9f02a625d?project=123456789
If the upload is correct we will see a green check. If there has been any issue, a red cross will show, indicating the step that hasn’t worked. Errors can vary, so I recommend checking the console on the right and trying to debug from there.
Anyway, if your YAML has several steps (in my case it only has one), the left side will show which step has failed. This step corresponds to the step you have created with cr_buildstep, so this might help you while debugging.
Step 4. Automate Cloud Build with Cloud Scheduler
We now have Cloud Build working. That’s great! We just need to automate its execution with Cloud Scheduler. To do so we will use the function cr_schedule.
In order for this function to work we will need to pass the name of the task and how often we want it to execute. This last point is important because the execution frequency is set as a Unix cron string, which might not be very intuitive.
To do so we will have to pass 5 different parameters separated by spaces. These parameters are: minute, hour, day of the month, month and day of the week. All of them need to be set as numbers, so they have a maximum and minimum value, as you cannot set a task to run at minute 61, for example. Besides, if we want to leave a parameter blank, we will need to put * instead.
So, if we want a task to be executed every Monday at 10:30, the cron string would be 30 10 * * 1, while if we want to execute it on the 28th of every month at 10:30 it would be 30 10 28 * *. In our case, we will schedule it to run every minute.
cr_schedule(schedule = "* * * * *", name = "nombre_tarea", httpTarget = cr_build_schedule_http(itworks))
## ==CloudScheduleJob==
## name: projects/XXXXXXXXXXXXXXXXXXXXX/locations/europe-west1/jobs/nombre_tarea
## state: ENABLED
## httpTarget.uri: https://cloudbuild.googleapis.com/v1/projects/XXXXXXXXXXXXXXXXXXXXX/builds
## httpTarget.httpMethod: POST
## userUpdateTime: 2020-05-03T16:04:04Z
## schedule: * * * * *
## timezone: Europe/Paris
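Because a mistyped cron string is easy to miss, a quick local check before scheduling can save a deployment. The following is a minimal sketch in base R. check_cron is a hypothetical helper, not a googleCloudRunner function, and it only understands * and single numbers, so lists, ranges and steps are rejected.

```r
# Hypothetical helper: TRUE if `expr` has five fields and every numeric field
# is within its allowed range. Only "*" and single numbers are supported;
# lists (1,15), ranges (1-5) and steps (*/5) are outside this sketch.
check_cron <- function(expr) {
  fields <- strsplit(trimws(expr), "\\s+")[[1]]
  limits <- list(
    minute       = c(0, 59),
    hour         = c(0, 23),
    day_of_month = c(1, 31),
    month        = c(1, 12),
    day_of_week  = c(0, 6)   # 0 is Sunday
  )
  if (length(fields) != length(limits)) return(FALSE)
  ok <- mapply(function(field, range) {
    if (field == "*") return(TRUE)
    if (!grepl("^[0-9]{1,2}$", field)) return(FALSE)
    value <- as.integer(field)
    value >= range[1] && value <= range[2]
  }, fields, limits)
  all(ok)
}

check_cron("30 10 * * 1")   # every Monday at 10:30
check_cron("30 10 28 * *")  # the 28th of every month at 10:30
check_cron("61 * * * *")    # minute 61 does not exist
```

The first two calls pass the check and the last one fails it, which matches the minimum and maximum values described earlier.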
Once the task is scheduled, we can check the console of Cloud Scheduler to see that everything is fine. There you can trigger the next Cloud Build, stop the execution, etc. Besides, you can see when the processes have been executed in Cloud Build’s history. If the execution has gone right you will see a green check.
Done! You have just learnt how to automate an R script on Google Cloud!
Summary on how to automate an R script on Google Cloud
As you can see, automating an R script on Google Cloud is much more complex than doing it locally. It might also be more expensive, depending on the task that we need to automate, but without a doubt it is much more secure and robust than automating a script on your computer.
In my opinion, knowing how to automate tasks is a must for any Data Scientist. Yes, it is true that this is more DevOps than Data Science, but this does not mean they are not compatible. In fact, automating scripts will enable you to take the models that you create as a Data Scientist into production, which is critical for companies that cannot afford both a Data Scientist and a DevOps Engineer.
So I hope that you have enjoyed the post and that you will take advantage of all this knowledge. If you have any questions, do not hesitate to reach out on LinkedIn! See you in the next post!