Different types of model serving

There are a variety of ways to serve machine learning models, but the following are the three most common:

  • Microservice Model,
  • Embedded Model,
  • Compute Predictions.

We’ll go over each of these architectures in turn and then break down its benefits and drawbacks.

Although the first of these three is the most structurally flexible and the best fit for a modern virtualized environment, it is not a magic fix: it assumes that your infrastructure is capable of supporting that flexibility. It is often the most practical structure to use, but the other two methods have their own advantages over it.

Microservice Model

In this design, the model is served as a microservice that is separate from the main application. This architecture is the most adaptable in terms of system deployment, and it has both benefits and drawbacks.


Benefits:

  • Scalable: The ML model serving layer can be scaled out independently of the primary application. Because models are served by discrete microservices, adding more models means creating new containers/services, each of which can be scaled on its own.
  • Real-time model serving: The model can use online information and context when making predictions. It can also make predictions for previously unseen inputs and, depending on the model, handle out-of-vocabulary (OOV) cases that an offline computation could not.
  • Freedom in the serving layer: The layer can be built, maintained, modified, and updated independently of the primary application. It can also use a different technology stack than the primary application, which is especially important if your primary stack uses a language that isn’t widely used in the machine learning community.


Drawbacks:

  • Separate infrastructure is required to take full advantage of this flexibility. The serving layer must include monitoring, as well as online validation and evaluation metrics.
  • The model and its API together must form a high-performance, low-latency prediction system. To achieve this, engineers must either optimize the prediction layer themselves or use an off-the-shelf option.
  • Separate services must be deployed and released, which adds operational overhead. The models and services need their own monitoring, alerting, and production-readiness work.
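
As a rough illustration of this pattern, the sketch below runs a model behind its own small HTTP service and has the main application reach it over the network. It uses only the Python standard library. The `predict` function, the `/predict` route, and the JSON payload shape are all hypothetical stand-ins chosen for the example, not part of any particular serving framework.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Stand-in model (hypothetical): a fixed linear scoring function."""
    weights = [0.5, -0.25]
    return sum(w * x for w, x in zip(weights, features))


class PredictionHandler(BaseHTTPRequestHandler):
    """Handles POST /predict with a JSON body such as {"features": [1.0, 2.0]}."""

    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": predict(payload["features"])}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging in this sketch


def start_server(port=0):
    """Run the model service in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), PredictionHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call_model_service(host, port, features):
    """The main application's side: a network call instead of a local call."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request(
            "POST",
            "/predict",
            body=json.dumps({"features": features}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        return json.loads(conn.getresponse().read())["prediction"]
    finally:
        conn.close()
```

Because the application only knows the service's address and payload format, the model service could be rewritten in another language or scaled to more instances without touching the application. The price is the extra network hop and a second service to operate.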

Embedded model

In this design, the model is integrated into the primary application. The architecture is a middle ground between the other two options, since it lets applications serve predictions in real time without having to store them. However, because the serving layer is tightly coupled with the primary application, it adds to the burden of maintaining and managing both the model and the application.


Benefits:

  • No network call is needed for a prediction. The application will most likely load the model into memory and invoke it through an ordinary function call, which reduces serving latency.
  • Same technology stack: The developers who work on the main application can change the model integration using the same stack. If the machine learning library is written in the application’s language, connecting the model and the application is simple and works out of the box.
  • No model-specific infrastructure is required. The model can use the same software architecture, development process, and deployment procedure as the application. Although this adds some overhead to the application, it minimizes the operational cost of the model.

Drawbacks:

  • It doesn’t scale to many models. If your primary application needs more than one model, all of them must be included in the main program, which adds significant maintenance cost for the software engineers who develop and maintain the application.
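
The embedded pattern can be sketched as follows. The `LinearModel` class, `load_model`, and `Application` are hypothetical names invented for illustration; the point is only that the model lives in the application's memory and is called like any other function.

```python
class LinearModel:
    """Stand-in model (hypothetical): a weighted sum plus a bias."""

    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def predict(self, features):
        return sum(w * x for w, x in zip(self.weights, features)) + self.bias


def load_model():
    """Assumed loader; a real application might read weights bundled with the release."""
    return LinearModel(weights=[0.5, -0.25], bias=1.0)


class Application:
    """The primary application, with the model embedded in-process."""

    def __init__(self):
        # Loaded once at start-up and kept in memory for the process lifetime.
        self.model = load_model()

    def handle_request(self, features):
        # A plain function call: no network hop and no separate service.
        return {"score": self.model.predict(features)}
```

Every additional model would need another loader and another attribute on `Application`, which is where the maintenance cost described above comes from.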

Compute predictions

This is one of the earliest ways of serving machine learning models. A data scientist writes the model’s predictions to a SQL table, which is then loaded into production, and an application serves the predictions from the database. It’s that simple. Even though this design does not serve the model directly, it offers several benefits.


Benefits:

  • It does not need any model-specific infrastructure. A database that is well suited to production use cases may be sufficient. A cron-like job can populate and update the database, which then serves the predictions.
  • Because the predictions are precomputed, they can be served as quickly as the database allows. This can be vital for mission-critical applications, and in some circumstances it may be the preferable option.


Drawbacks:

  • All inputs must be known ahead of time. In some fields, such as search, this is not practical, because many of the incoming queries have never been seen before. Highly dynamic features or datasets with a large number of combinations also require huge amounts of storage space.
  • It’s difficult to add more input variables, since the storage and computation requirements grow very quickly with each one.
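
A minimal sketch of this pattern, using SQLite from the Python standard library as the database. The `predictions` table, its columns, and the `score` function are assumptions made up for the example; in practice the batch job would run the real model on a schedule.

```python
import sqlite3


def score(user_id):
    """Stand-in model (hypothetical): a deterministic pseudo-score per user."""
    return (user_id * 37 % 100) / 100


def run_batch_job(conn, user_ids):
    """The cron-like step: compute every prediction up front and store it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS predictions ("
        "user_id INTEGER PRIMARY KEY, score REAL NOT NULL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO predictions (user_id, score) VALUES (?, ?)",
        [(user_id, score(user_id)) for user_id in user_ids],
    )
    conn.commit()


def lookup_prediction(conn, user_id):
    """The serving step: a key lookup, with no model call at request time."""
    row = conn.execute(
        "SELECT score FROM predictions WHERE user_id = ?", (user_id,)
    ).fetchone()
    return None if row is None else row[0]
```

An input that was not part of the batch job has no row in the table, so `lookup_prediction` returns `None` for it. That is the first drawback above in miniature: the application needs a fallback for anything the batch job did not anticipate.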

Finally, treating your model as data is an ideal approach if you need model serving that can scale without a lot of growing pains.

This approach saves the model in a standardized format (for example, the one used for TensorFlow model serving) that can be read programmatically by any programming environment or framework, allowing the model to be reused across several applications.

From a technical standpoint, this is a little more difficult at first, but it has significant long-term benefits: you only have to store the model once, and then all of your applications can use the stored model.
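
To make the idea concrete, here is a small sketch in which a model is written once to a shared file and then loaded by more than one consumer. Plain JSON stands in for a standardized model format purely for illustration; it is not the format TensorFlow uses, and the `save_model` and `load_model` helpers and the file layout are hypothetical.

```python
import json


def save_model(path, weights, bias):
    """Store the model once, as data, in a format any language can parse."""
    spec = {"format_version": 1, "weights": weights, "bias": bias}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(spec, f)


def load_model(path):
    """Any application can rebuild a predictor from the stored parameters."""
    with open(path, encoding="utf-8") as f:
        spec = json.load(f)
    weights, bias = spec["weights"], spec["bias"]

    def predict(features):
        return sum(w * x for w, x in zip(weights, features)) + bias

    return predict
```

Two applications that call `load_model` on the same path get identical predictors, and updating the stored file updates the model for both of them the next time they load it.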