How to use Github Actions for Data Science

In this post I am going to explain what Github Actions is and how to use it for Data Science. Let’s get to it!

What is Github Actions

Github Actions is a service offered by Github that puts at our disposal different types of virtual machines (Linux, Windows, macOS, etc.) on which we can run automations.

Originally, Github Actions was intended to automate tasks within Github in order to make life easier for developers.

However, as data scientists, Github Actions opens a very interesting door for us to automate tasks such as data extraction or model retraining, among others, whether we work with Python, R or any other language.

In addition, as you would expect, Github Actions integrates perfectly with Github, which is one of the tools most used by data scientists as a code repository.

On the other hand, regarding pricing, Github Actions distinguishes two cases depending on the type of repository in question:

  • Public repositories: the service is free.
  • Private repositories: both the storage and the execution time included depend on the type of account you have, as listed in this link.

So far, I’m sure it sounds good to you. But what can we use Github Actions for in Data Science? Let’s see!

Github Actions Use Cases for Data Science

Here are some of the ways you can use Github Actions for Data Science:

  • Automate simple ETL processes. In my case, for example, I have automated the extraction of data from a real-time API that does not save history (link). In addition, you can also automate the creation of dashboards that interest the community, as Rami Krispin has done with his electricity dashboard (link).
  • Putting algorithms into production using MLOps: with Github Actions we can automate the retraining of models in a simple way.
  • Creation of alert systems for tools such as Slack.
  • Run code tests or model comparisons.
  • Automate the merge of a pull request.
  • Automate the build and push of Docker images.

As you can see, there are many uses for this tool. Admittedly, several of these points could be handled by automation in the Cloud. However, if your project is open source, or if it is a private project but you don’t have things in the cloud or the automation will not require large financial resources, Github Actions will, in my opinion, serve you perfectly.

With that said, let’s continue with the Github Actions tutorial to see how it works. Let’s get to it!

How Github Actions works

How to create a workflow in Github Actions

To create a workflow in Github Actions we must create a file with the .yaml (or .yml) extension inside the .github/workflows subfolder. This file gives Github the following information:

  • The name of our workflow.
  • How or when this automation should be executed.
  • The type of virtual machine to use.
  • Step by step, which commands Github should run.

Once we upload a file to this directory on Github, Github will automatically recognize it as a workflow and will execute what we have indicated, exactly when we have indicated it, without us doing anything else.
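To give you an idea of how these four pieces fit together, here is a small Python sketch that assembles them into the text of a workflow file. The function name render_workflow, the job name build and the example values are my own assumptions for illustration. The trigger and the virtual machine are explained in detail in the following sections.

```python
def render_workflow(name, cron, runner, commands):
    """Assemble the four pieces of a workflow into YAML text.

    Illustrative only: values are inserted as-is, so names or commands
    containing special YAML characters would need quoting.
    """
    lines = [
        f"name: {name}",            # 1. name of the workflow
        "on:",
        "  schedule:",
        f"    - cron: '{cron}'",    # 2. when it should run
        "jobs:",
        "  build:",
        f"    runs-on: {runner}",   # 3. type of virtual machine
        "    steps:",               # 4. commands to run, step by step
    ]
    for step_name, command in commands:
        lines.append(f"      - name: {step_name}")
        lines.append(f"        run: {command}")
    return "\n".join(lines) + "\n"


workflow_text = render_workflow(
    name="Update data",
    cron="0 * * * *",
    runner="ubuntu-latest",
    commands=[("Run script", "python update_data.py")],
)
print(workflow_text)
```

The printed text is what you would save in a file inside .github/workflows, for example under a hypothetical name such as update_data.yaml.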

Now that you understand roughly how to create a Github Actions workflow, let’s take a step-by-step look at all the components it has.

How to define the trigger of a Github Actions workflow

The first part of the .yaml file is used to define when we want our workflow to run.

In this sense, we are going to see the main triggers that are usually used when we use Github Actions for data science.

If you want to see all the triggers that Github Actions offers, you can find them all here.

Likewise, it is important to note that, although in each example we see each trigger separately, a workflow can have one or more triggers.

That said, let’s see what the main Github Actions triggers for data science are:

Launch a Github Actions workflow on a regular basis

The schedule trigger allows the workflow to run automatically at specific times: every hour, on the first day of the week or every 5 days, for example.

The execution will be indicated using cron syntax, just as it happens in the automation of Python files in the Cloud.

In case you don’t know cron syntax, it simply consists of five fields in a row, each of which refers to a unit of time:

┌───────────── minute (0 - 59)
│ ┌───────────── hour (0 - 23)
│ │ ┌───────────── day of the month (1 - 31)
│ │ │ ┌───────────── month (1 - 12)
│ │ │ │ ┌───────────── day of the week (0 - 6)
│ │ │ │ │
* * * * *

For example, if we use the cron expression 0 * * * *, the workflow will be executed every time the minute reaches 0, that is, every hour.

If you want to experiment to understand how cron works, I recommend checking out Crontab.guru.
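If it helps to see the idea in code, below is a minimal Python sketch of how a cron expression can be compared against a moment in time. It is a simplification of my own, not how Github evaluates schedules: it only understands * and plain numbers (no ranges, lists or steps), and it requires every field to match. The function name cron_matches is hypothetical.

```python
from datetime import datetime


def cron_matches(expr: str, moment: datetime) -> bool:
    """Check a moment against a simplified five-field cron expression.

    Supports only `*` and plain integers in each field.
    """
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError("A cron expression must have exactly five fields")
    # Cron counts weekdays from Sunday = 0, while isoweekday() goes
    # from Monday = 1 to Sunday = 7, hence the modulo.
    actual = [
        moment.minute,
        moment.hour,
        moment.day,
        moment.month,
        moment.isoweekday() % 7,
    ]
    return all(f == "*" or int(f) == a for f, a in zip(fields, actual))


print(cron_matches("0 * * * *", datetime(2022, 1, 1, 13, 0)))   # True
print(cron_matches("0 * * * *", datetime(2022, 1, 1, 13, 30)))  # False
```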

That said, an automated schedule workflow starts like this:

on:
  schedule:
    - cron: '0 * * * *' #Every hour

How to launch a Github Actions workflow manually

To launch a Github Actions workflow manually, we must use the workflow_dispatch trigger.

In this way, when we select our workflow within the Github Actions tab, the Run workflow option will appear, as shown in the following image:

Manual trigger in Github Actions

Clicking the green Run workflow button will run the workflow.

Workflows are usually executed automatically. However, it is useful to add a manual trigger for when you want to demonstrate how the automation works to third parties.

On the other hand, declaring the manual execution of the workflow at the code level is quite simple:

on:
  workflow_dispatch:

Launch a workflow through an HTTP request

Many times when we use Github Actions for Data Science, we want a workflow to run exactly when we decide, in response to some external event.

For example, let’s say we’re going to use Github Actions to create an MLOps process (I’ll cover this point in a future post). In MLOps, we will have a system that analyzes the predictive capacity of the model. In this way, when the predictive capacity of the model falls below a limit, the model must be retrained.

Well, assuming that retraining the model is a workflow, it should be executed when a condition is met. The easiest way to achieve this is for the retraining workflow to be triggered by an HTTP request.

To do this, we must set the trigger of our workflow as repository_dispatch. In addition, we must also include a parameter called types to which we must give a value, as shown in the following code:

on:
  repository_dispatch:
    types: [event_type]

With this in place, we can execute the workflow through an API request. For this we will need the following information:

  • github_user: name of the Github user (it comes in the URL of your account).
  • repo: name of the repository where the workflow resides (it also comes in the repo URL).
  • GITHUB_TOKEN: a personal access token for your Github account, which you can generate here. This token must have permission to access your repo and workflows.
  • event_type: value that we have passed to the types parameter.

In the following snippet I show you how to make the request in Python:

import json
import requests

# github_user, repo, GITHUB_TOKEN and event_type are the values described above
url = f'https://api.github.com/repos/{github_user}/{repo}/dispatches'

# I make the request
resp = requests.post(
    url,
    headers={'Authorization': f'token {GITHUB_TOKEN}'},
    data=json.dumps({'event_type': event_type})
)
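Going back to the MLOps example, the request can be wrapped in a small check so that it is only sent when the model degrades. The following is a minimal sketch under my own assumptions: the function names, the score and the threshold are hypothetical, and the post argument is any callable with the same signature as requests.post, which makes the logic easy to try without calling the real API.

```python
import json


def build_dispatch_request(github_user, repo, token, event_type):
    """Return the URL, headers and body for a repository_dispatch call."""
    url = f"https://api.github.com/repos/{github_user}/{repo}/dispatches"
    headers = {"Authorization": f"token {token}"}
    body = json.dumps({"event_type": event_type})
    return url, headers, body


def trigger_retraining_if_needed(score, threshold, post, **request_args):
    """Send the dispatch request only when the score falls below the threshold.

    Returns True if the request was sent and False otherwise.
    """
    if score >= threshold:
        return False
    url, headers, body = build_dispatch_request(**request_args)
    post(url, headers=headers, data=body)
    return True
```

In a real monitoring script you would pass requests.post as the post argument, together with your own github_user, repo, token and event_type values.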

With this we know how to launch Github Actions workflows. Now, we continue with the Github Actions tutorial, seeing how to choose the virtual machine where the workflow will run.

How to choose the type of virtual machine to use in Github Actions

In the following table I indicate the environment options that Github offers for each virtual machine, as well as the YAML tag that we must use to choose said virtual machine.

Environment            YAML tag
Ubuntu 20.04           ubuntu-latest
Ubuntu 18.04           ubuntu-18.04
macOS 11               macos-11
macOS 10.15            macos-latest or macos-10.15
Windows Server 2022    windows-2022
Windows Server 2019    windows-latest or windows-2019
Windows Server 2016    windows-2016

If you want to keep up to date with the available virtual machines, you can see them at this link.

Also, at the code level, the choice of the virtual machine is very simple, as shown in the following code snippet:

jobs:
  build:
    runs-on: ubuntu-latest

Now that you know how to choose the virtual machine, we continue with our Github Actions tutorial looking at how to tell Github Actions what to run. Let’s get to it!

How to tell Github Actions which commands Github should execute

When we talk about the steps that Github Actions must execute, first of all we must distinguish between two types of steps: the predefined ones and the custom ones. I explain each of them separately.

Predefined steps in Github Actions

The predefined steps are those tasks that are so widely used that Github has packaged them to make it easier for users to define the entire process. To give you an idea, they are similar to Docker images that already include certain things, such as Python, FastAPI, etc. (If you want to learn more about Docker, I recommend you take a look at this post).

In this sense, the most used predefined steps when using Github Actions for Data Science are two:

  1. Install the programming language that we are going to use. There are predefined steps to install Python, Java, Go, Node.js, .NET, etc.
  2. Give the machine where the workflow is executed access to our repository.

You can find all the predefined steps in the official Actions repo.

Here is an example of a YAML file that contains the two steps mentioned above and runs on Ubuntu:

jobs:
  build:
    runs-on: ubuntu-latest
    steps:

      - name: Check out repo
        uses: actions/checkout@v2

      - name: Configure Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.9.7' 

As you can see, in a few lines we have created an Ubuntu virtual machine, which has access to our repo and which also has Python installed in the version that we indicate.

Having seen the predefined steps, now let’s see how custom steps work in Github Actions.

Custom steps in Github Actions

As far as custom steps are concerned, we simply need to define two things, plus an optional third:

  1. The name that we are going to give to the step. Naming every step (including the predefined ones) is good practice, since the name is what allows us to know what Github is executing.
  2. The console command that must be executed, taking into account that the available shell and commands depend on the type of machine that we have chosen.
  3. Execution parameters (optional). If our code needs parameters, such as secrets stored in the repository, we can set them as runtime environment variables and either pass them to our code or have the code itself read them from the environment.

In the following snippet you will see how custom steps work in Github Actions:

 ...

      - name: Install libraries
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt

      - name: Set env variable & execute script
        env:
          URI: ${{ secrets.URI }}
        run: python update_data.py
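On the Python side, the script only has to read that variable from the environment. The following is a minimal sketch of how update_data.py could start, assuming the variable is called URI as in the workflow above. The helper get_required_env is a hypothetical name of mine, not something that Github provides.

```python
import os


def get_required_env(name: str) -> str:
    """Read an environment variable and fail early if it is missing or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# In update_data.py you would then call, for example:
# uri = get_required_env("URI")
```

Failing early with a clear message makes it easier to spot, in the workflow logs, that a variable was not passed to the step.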

Conclusion of using Github Actions for Data Science

As you can see, Github Actions is a very simple yet very powerful tool and, in some cases, free. Personally, since Github is so widely used in Data Science, I think it’s worth getting to know this tool, as it can help us a lot on a day-to-day basis.

As always, I hope you enjoyed this Github Actions tutorial. If so, and you want to be aware of new posts, I encourage you to subscribe. See you in the next one!