LHC Tunnel

LHC Tunnel

Friday, 2 March 2018

Expiry of VMs in the CERN cloud

The CERN cloud resources are used for a variety of purposes from running compute intensive workloads to long running services. The cloud also provides personal projects for each user who is registered to the service. This allows a small quota (5 VMs, 10 cores) where the user can have resources dedicated for their use such as boxes for testing. A typical case would be for the CERN IT Tools training where personal projects are used as sandboxes for trying out tools such as Puppet.

Personal projects have a number of differences compares to other projects in the cloud
  • No non-standard flavors
  • No additional quota can be requested
  • Should not be used for production services
  • VMs are deleted automatically when the person stops being a CERN user
With the number of cloud users increasing to over 3,000, there is a corresponding growth in the number of cores used by personal projects, growing by 1,200 cores in the past year. For cases like training users, there is often the case that the VMs are created and the user then does not remember to delete the resources so they consume cores which could be used for compute capacity to analyse the data from the LHC.



One possible approach would be to reduce the quota further. However, tests such as setting up a Kubernetes cluster with OpenStack Magnum often need several VMs to perform the different roles so this would limit the usefulness of personal projects. The usage of the full quota is also rare.



VM Expiration

Based on a previous service which offered resources on demand (called CVI based on Microsoft SCVMM), the approach was taken to expire personal virtual machines.
  • Users can create virtual machines up to the limit of their quota
  • Personal VMs are marked with an expiry date
  • Prior to their expiry, the user is sent several mails to inform them their VM will expire soon and how to extend it if it is still useful.
  • On expiry, the virtual machine is locked and shutdown. This helps to catch cases where people have forgotten to prolong their VMs.
  • One week later, the virtual machine is deleted, freeing up the resources.

Implementation

We use Mistral to automate several OpenStack tasks in the cloud (such as regular snapshots and project creation/deletion). This has the benefit of a clean audit log to show what steps worked/failed along with clear input/output states supporting retries and an authenticated cloud cron for scheduling.

Our OpenStack projects have some properties set when they are created. This is used to indicate additional information like the accounting codes to be charged for the usage. There are properties for indicating if the type of project such as personal and if the expiration workflow should apply. Mistral YAQL code can then select resources where expiration applies.

task(retrieve_all_projects).result.select(dict(id => $.id, name => $.name, enabled => $.enabled, type => $.get('type','none'),expire => $.get('expire','off'))).where($.type='personal').where($.enabled).where($.expire='on')

The expire_at parameter is stored as a VM property. This makes it visible for automation such as CLIs through the openstack client show server CLI.

There are several parts to the process
  • A cron trigger'd workflow which
    • Machines in error state or currently building are ignored
    • A newly created machine which does not have an expiry date set has the expiration date set according to the grace period
    • Sees if any machines are entering close to their expiry time and sends a mail to the owner
    • Checks for invalid settings of the expire_at property (such as people setting it a long way in the future or deleting the property) and restores a reasonable value if this is detected
    • If a machine has reached it's expiry date, it's locked and shutdown
    • If a machine has past it's date by the grace period, it's deleted
  • A workflow, launched by Horizon or from the CLI
    • Retrieves the expire_at value and extends it by the prolongation period
The user notification is done using a set of mail templates and a dedicated workflow (https://gitlab.cern.ch/cloud-infrastructure/mistral-workflows/blob/master/workflows/send_mail_template.yaml). This allows templates such as instance reminders to have details about the resources included, such as the example from the mail template.

The Virtual Machine {instance} from the project {project_name} in the Cloud Infrastructure Service will expire on {expire_date}.

A couple of changes to Mistral will be submitted upstream
  • Support for HTML mail bodies which allows us to have a nicer looking e-mail for notification with links included
  • Support for BCC/CC on the mail so that the OpenStack cloud administrator e-mail can also be kept on copy when there are notifications
A few minor changes to Horizon were also done (currently local patches)
  • Display expire_at value on the instance details page
  • Add a 'prolong' action so that instances can be prolonged via the web by using the properties editor to set the date of the expiry (defaulting to the current date with the expiry time). This launches the workflow for prolonging the instance.

Author

Jose Castro Leon from the CERN cloud team

References