Serverless NLP transformer model with ONNX and Azure Functions

October 10, 2020 · 5 Min Read

The missing guide to deploying transformer models


We recently read an article on the Hugging Face blog in which Julien discussed how to train a new language model from scratch. Even though the post is fantastic, we felt it lacked the details needed to actually execute such a task. In particular, the infrastructure necessary to train and deploy such a model was missing. In this blog post, we hope to provide that missing piece.



What is a transformer language model?





As with all deep neural networks, Transformers contain neurons (math functions) arranged in interconnected layers. The layers transmit signals from the input data and slowly adjust each connection's strength (its weights); that is how all AI models extract features and learn to make predictions. What makes a Transformer unique is its built-in attention mechanism: every output element is connected to every input element, and the weightings between them are calculated dynamically for each input.
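To make "dynamically calculated weightings" concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core operation of the Transformer; the toy shapes and the function name are ours, for illustration only:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Toy scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # every output attends to every input
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights = weights / weights.sum(-1, keepdims=True)   # softmax over the inputs
    return weights @ V                                   # weighted mix of the input values

# 4 tokens with 8-dimensional embeddings (arbitrary toy sizes)
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)              # self-attention: Q = K = V = x
print(out.shape)                                         # (4, 8)

Every row of the output is a weighted mix of all input rows, with the weights recomputed for each input; that is the all-to-all connectivity described above.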

For a deeper understanding of these models' inner workings, we advise you to read the original Transformer paper, "Attention Is All You Need". It details how Transformers do not need to process the input sequentially, which allows for far greater parallelization than other deep learning models such as RNNs.



Azure training infrastructure



We chose Azure because we have in-depth knowledge of and experience with Azure infrastructure. You can, however, run our code on any infrastructure that has the NVIDIA drivers and Docker installed.


We trained the model on an Azure NC6_Promo VM. You can check our code on GitHub, specifically the azure/ directory, to spin up a working VM with the GPU driver and Docker preinstalled.

The Python training code lives in src/. We package it in a Docker image so that it can be transferred quickly between our local environment and the cloud training VM.

It's that simple: it only takes two commands to go from development to training! On the development machine, we package the code from the command line with make deploy, and on the VM we start a training run with docker run ...

Sit back and enjoy the training 😁


Alternatives


We need to point out that although this approach is straightforward, there are other ways of training a transformer model, for example with Azure Machine Learning. Azure Machine Learning lets us use Jupyter notebooks and provision powerful VMs for training. It also has an excellent library for interacting with data stored in storage accounts and for versioning data.



Deploying the transformer model to production



There are many ways to deploy an AI model to production. We can choose between IaaS, CaaS, PaaS and FaaS, and the ops overhead decreases as we move from IaaS towards FaaS.

Usually, we try to start with the deployment option that carries the least ongoing ops burden. However, transformer models can be quite big.

The model weights range from roughly 200 MB to well over 2 GB. Fortunately for us, IsRoBERTa is only around 360 MB.


Optimizing for serverless deployment


After deciding on a serverless deployment strategy, we needed to optimize the model for the serverless environment. For this, we chose the ONNX Runtime. It delivers high performance even on modest hardware, and it has been specifically optimized for transformer models. These improvements, together with the cost and ops benefits, make serverless an ideal deployment target for the model!


Let's start by converting our model for ONNX serving with a single command:

python .venv/lib/python3.8/site-packages/transformers/convert_graph_to_onnx.py \
  --framework pt \
  --model neurocode/IsRoBERTa \
  model/isroberta-mask.onnx \
  --pipeline=fill-mask \
  --check-loading \
  --opset 12
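
After the export, it can be handy to open the graph with the ONNX Runtime and check which input names it expects, since we will need them again when serving. A minimal sketch, assuming onnxruntime is installed and using the output path from the command above:

import onnxruntime as ort

session = ort.InferenceSession("model/isroberta-mask.onnx")
print([i.name for i in session.get_inputs()])   # typically input_ids and attention_mask
print([o.name for o in session.get_outputs()])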

Now that we have an ONNX model, let's optimize it:

python -m onnxruntime_tools.optimizer_cli \
  --input model/isroberta-mask.onnx \
  --float16 \
  --output model/isroberta-mask-optimized-quantized.onnx \
  --model_type bert

You can see that we use the --float16 flag to leverage mixed-precision performance gains. We tried the model with and without mixed precision and did not notice degraded inference results.

The graph optimization and the quantization led to a ~3x decrease in model size!
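
If you want to verify the size reduction yourself, a quick comparison of the two files produced above (paths taken from the commands) looks like this:

import os

for path in ["model/isroberta-mask.onnx", "model/isroberta-mask-optimized-quantized.onnx"]:
    print(f"{path}: {os.path.getsize(path) / 1e6:.0f} MB")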


Now that we have an optimized ONNX version of our model, we are ready to write an API that takes in a sentence and returns the top five predictions.


Since we decided to use Azure Functions, this meant creating an Azure Functions app:

func new --name IsRoBERTa --template "HTTP trigger" --python

and writing the predict function (the full code is on GitHub):

import json
import logging

import azure.functions as func
import numpy as np
import onnxruntime
from transformers import AutoTokenizer

# Tokenizer and ONNX session are created once at import time so that warm
# invocations can reuse them (the setup shown here is a sketch; see the full code on GitHub).
fast_tokenizer = AutoTokenizer.from_pretrained("neurocode/IsRoBERTa", use_fast=True)
session = onnxruntime.InferenceSession("model/isroberta-mask-optimized-quantized.onnx")


def main(req: func.HttpRequest) -> func.HttpResponse:
    logging.info("Python HTTP trigger function processed a request.")

    # The sentence (containing the tokenizer's mask token) is passed as a query parameter.
    req_query = req.params
    sentence = req_query.get("sentence")

    result = fill_mask_onnx(sentence)

    return func.HttpResponse(json.dumps(result), mimetype="application/json")


def fill_mask_onnx(sentence: str):
    # Tokenize to NumPy arrays and feed them straight into the ONNX Runtime session.
    tokens = fast_tokenizer(sentence, return_tensors="np")
    output = session.run(None, tokens.__dict__["data"])
    token_logits = output[0]

    # Locate the masked position and pull out its logits.
    mask_token_index = np.where(tokens["input_ids"] == fast_tokenizer.mask_token_id)[1]
    mask_token_logits_onnx1 = token_logits[0, mask_token_index, :]

    # Softmax over the vocabulary to turn the logits into scores.
    score = np.exp(mask_token_logits_onnx1) / np.exp(mask_token_logits_onnx1).sum(-1, keepdims=True)

    # Keep the five highest-scoring candidate tokens.
    top_5_idx = (-score[0]).argsort()[:5]
    top_5_values = score[0][top_5_idx]

    result = []

    for token, s in zip(top_5_idx.tolist(), top_5_values.tolist()):
        result.append(f"{sentence.replace(fast_tokenizer.mask_token, fast_tokenizer.decode([token]))} (score: {s})")

    return {"result": result}
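
With the Functions host running locally (func start serves on http://localhost:7071 by default, with routes under /api/<function name>), you can exercise the endpoint with a small script. The sentence below is only a placeholder; it has to contain the tokenizer's mask token (<mask> for RoBERTa):

import requests

# Default local Azure Functions route: /api/<function name>
resp = requests.get(
    "http://localhost:7071/api/IsRoBERTa",
    params={"sentence": "Þetta er <mask>."},  # placeholder sentence containing the mask token
)
print(resp.json()["result"])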

You can try the final model out on our website.

We provide a German version as well, in case you aren't fluent in Icelandic 😉



Summary


  • We trained a custom model on Azure infrastructure
  • We optimized the model for serverless deployment using ONNX
  • We serve the prediction API with Azure Functions and reap the benefits of serverless deployments

Even though this is a good first baseline, it can be made even better by creating a deployment pipeline that automates:

  • Model versioning
  • Model prediction verification
  • Canary deployments

Let us know if you want to see this taken to the next level!
