
Architecture


This architecture achieves the following goals:

  • Scalability, based on on-demand AWS Lambda functions
  • Cost efficiency and security (Golang is a lightweight, high-throughput language with a small memory and CPU footprint compared to Java, and has reliable libraries available). AWS Lambda is billed per request and per execution duration, priced according to the configured memory.
  • Low maintenance (the Golang Lambda function only needs to be provisioned once)
  • Reliability (CPU load spikes are offloaded from the EC2 instance to Lambda functions, which allows for predictable resource usage on the EC2 instance)

Architectural Components

  1. The Magnolia instance invokes the Lambda functions using the AWS SDK (https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/examples-lambda.html). To request the creation of image variants, the Magnolia instance sends over a reference to the customer and the corresponding image, and receives as a response a success or failure message with a mediaId and a list of variant IDs (see the sketch after this list).
  2. The Golang image service performs the on-demand processing of the image, creating the variants and storing them in the S3 bucket of the corresponding customer.
  3. AWS CloudFront sits in front of the S3 bucket and returns the images requested by the user, caching them at the same time.
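
As a minimal sketch of the invocation in step 1, here is the equivalent call using the AWS SDK for Go (the Magnolia instance itself would use the Java SDK linked above). The function name and the request/response payload shapes (customerRef, imageRef, mediaId, variantIds) are illustrative assumptions, not a fixed contract:

    package main

    import (
        "encoding/json"
        "fmt"
        "log"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/lambda"
    )

    // Hypothetical request/response shapes; the real contract would be
    // defined by the image service (e.g. via its Twirp definition).
    type variantRequest struct {
        CustomerRef string `json:"customerRef"` // identifies the customer (and thus the target bucket)
        ImageRef    string `json:"imageRef"`    // reference to the uploaded original
    }

    type variantResponse struct {
        Success    bool     `json:"success"`
        MediaID    string   `json:"mediaId"`
        VariantIDs []string `json:"variantIds"`
    }

    func main() {
        sess := session.Must(session.NewSession())
        svc := lambda.New(sess)

        payload, err := json.Marshal(variantRequest{CustomerRef: "acme", ImageRef: "uploads/house.jpg"})
        if err != nil {
            log.Fatal(err)
        }

        // Synchronous invocation; "image-service" is a placeholder function name.
        out, err := svc.Invoke(&lambda.InvokeInput{
            FunctionName: aws.String("image-service"),
            Payload:      payload,
        })
        if err != nil {
            log.Fatalf("invoke failed: %v", err)
        }

        var resp variantResponse
        if err := json.Unmarshal(out.Payload, &resp); err != nil {
            log.Fatalf("unexpected response: %v", err)
        }
        fmt.Printf("success=%v mediaId=%s variants=%v\n", resp.Success, resp.MediaID, resp.VariantIDs)
    }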

Benefits

  • The Golang image service can be bundled as a Docker image (supported by Lambda) and can be tested locally
  • Stable library support in Golang
  • CPU and memory requirements on the Magnolia instance remain predictable, which enables a high-density SaaS offering and reduces our costs
  • The Twirp framework allows both JSON and gRPC bindings; gRPC allows streaming requests/responses, with the side benefit that gRPC is more compact and performant than JSON for (de)serialization.
  • The Golang image service remains portable: it can also be spun up as a replica on a Kubernetes cluster, so it does not lock us into the AWS platform
  • The Magnolia instance can be replaced with a scalable Golang image upload service, which allows chaining user-defined image operations as Lambda functions on AWS.
  • The AWS SDK for Golang supports multipart uploads with a configurable number of goroutines (see the sketch below).
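
As a sketch of the last point, the s3manager package from the AWS SDK for Go splits an upload into parts and uploads them from a pool of goroutines; the bucket name, key, file name and part size below are placeholder values:

    package main

    import (
        "log"
        "os"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/s3/s3manager"
    )

    func main() {
        sess := session.Must(session.NewSession())

        // Each part is uploaded by its own goroutine; Concurrency caps how
        // many parts are in flight at once.
        uploader := s3manager.NewUploader(sess, func(u *s3manager.Uploader) {
            u.Concurrency = 8
            u.PartSize = 10 * 1024 * 1024 // 10 MiB parts
        })

        f, err := os.Open("large-image.tiff") // placeholder file
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        out, err := uploader.Upload(&s3manager.UploadInput{
            Bucket: aws.String("customer-assets"), // placeholder bucket
            Key:    aws.String("originals/large-image.tiff"),
            Body:   f,
        })
        if err != nil {
            log.Fatal(err)
        }
        log.Printf("uploaded to %s", out.Location)
    }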

Phases

Phase 1 (Implement Golang Image Service PoC)

Goals

  • The image service can accept a ZIP file with images
  • Generate multiple variants of the images concurrently (see the sketch after this list)
  • Store the image variants in S3
  • Return references to the image variants
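
A minimal sketch of the concurrent variant generation, using only goroutines and the standard image packages (plus golang.org/x/image/draw for scaling); the variant names and sizes are assumptions, and the S3 upload step is left out:

    package main

    import (
        "bytes"
        "fmt"
        "image"
        "image/jpeg"
        "sync"

        "golang.org/x/image/draw"
    )

    // variant describes one rendition to produce; names and sizes are placeholders.
    type variant struct {
        name          string
        width, height int
    }

    var variants = []variant{
        {"small", 320, 240},
        {"medium", 800, 600},
        {"large", 1600, 1200},
    }

    // resize scales src to the requested dimensions.
    func resize(src image.Image, w, h int) image.Image {
        dst := image.NewRGBA(image.Rect(0, 0, w, h))
        draw.CatmullRom.Scale(dst, dst.Bounds(), src, src.Bounds(), draw.Over, nil)
        return dst
    }

    // generateVariants fans out one goroutine per variant and collects the
    // encoded results; the real service would upload each result to S3 and
    // return the references to the caller.
    func generateVariants(src image.Image) (map[string][]byte, error) {
        var (
            mu       sync.Mutex
            wg       sync.WaitGroup
            firstErr error
        )
        results := make(map[string][]byte, len(variants))
        for _, v := range variants {
            wg.Add(1)
            go func(v variant) {
                defer wg.Done()
                var buf bytes.Buffer
                if err := jpeg.Encode(&buf, resize(src, v.width, v.height), nil); err != nil {
                    mu.Lock()
                    if firstErr == nil {
                        firstErr = err
                    }
                    mu.Unlock()
                    return
                }
                mu.Lock()
                results[v.name] = buf.Bytes()
                mu.Unlock()
            }(v)
        }
        wg.Wait()
        return results, firstErr
    }

    func main() {
        src := image.NewRGBA(image.Rect(0, 0, 3200, 2400)) // stand-in for a decoded upload
        results, err := generateVariants(src)
        if err != nil {
            panic(err)
        }
        for name, data := range results {
            fmt.Printf("variant %s: %d bytes\n", name, len(data))
        }
    }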

Results

Two implementations of the imaging service exist:

Resources


4 Comments

  1. In my last company, there was an image service that did more or less what this one does. It was written in a language that none of the team really knew, using some sort of library that nobody on the team ever contributed to.

    It was developed by an external dev that wasn't around anymore. Over the years the service got tied to dozens of services throughout the company.

    Because image processing is a tricky thing, and this was an insurance company with absolutely no expertise in such matters, things didn't turn out very well. The service was running in Openshift and got OOMKilled several times an hour, and nobody had the expertise to fix it. It was just how things were.

    The moral of the story:

    Unless you're Netflix and image and video processing is your key business, never try to implement an imaging service yourself.


    https://cloudinary.com maybe?

    1. We btw do have a Cloudinary external DAM connector. Actually estimating the cost/complexity trade-offs between the two would be an interesting exercise, which may help us make better decisions in the future.
      As counter arguments to what you've written, I could suggest maybe:

      • "complexity of the task": we did have an embedded imaging service for years now and despite obvious flaws that it has, it did/does work (mostly (smile)). It wouldn't cause mysterious outages that we at least wouldn't be able to explain. Let alone the library that we used - even though nobody contributed to the twelve monkey lib (or what are we using now?), we could/can rely on it being widely adopted open source solution (much like for the case of any other third-party lib that we use without contributing). The new solution is deemed as just a proper way to re-implement the current approach: use standalone process instead of just a servlet in mgnl monolith, use workers to actually invoke library etc. Pretty sure this is something we can collectively handle.
      • New language concerns - somewhat valid in my opinion; I also wouldn't unnecessarily use something non-JVM without a good reason. On the other hand, Go isn't an "esoteric" (like Lisp) or "academic" language (like Haskell); it is a language that is specifically good at handling asynchronous tasks across distributed services. Anyone who understands what that means should be able to grok such a service implementation without much effort. Fwiw, I can see us using it more and more in our solutions in the future.
      • Another positive reason for developing our own service is to create a precedent of a sidekick sub-system that functions apart from the main monolith. This should give us insights into how it would look operations-wise and act as a segue to shared ownership of the infra.

      That being said - I am all for considering a third-party system as an alternative, I just wouldn't jump to conclusions yet.

      Re: "image and video processing is your key business" - storing and managing content is after all one of the key parts of a CMS, and cropped/scaled variations do seems to be reasonable features. 


      p.s. I talked with Teresa Miyar, who's been involved in the discussions with the Cloudinary reps and is able to put us in touch with them upon request.

  2. With the above mentioned, I would try as much as possible to abstract ourselves away from the eventual implementation of the rendition generator and rather focus first on the integration of such a service with the DAM API/impl.

  3. This whole proposal was motivated by avoiding having to deal with a CDN in front of many buckets, or having to cope with many forwards/redirects that could be hard to manage with many customers. But as Ilgun Ilgun, Jesus Alonso and Rishab Dhar explained to me during this afternoon's meeting, these problems are solved through a very simple API Gateway that triggers a Lambda function which takes care of talking to all subscription buckets:

    Pasted from pd-cloud channel and written by Jesus

    FYI, the PoC I showed this morning is based on this > https://aws.amazon.com/solutions/implementations/serverless-image-handler/
    a working solution based on an API Gateway -> Lambda -> S3 integration

    So this proposal can be discarded for now:

    Some thoughts about this topic, to be discussed if you consider it appropriate... Bearing in mind the final goal of moving toward a complete multi-tenant SaaS platform, everything that is taken off the platform should be built in a multi-tenant architecture, since otherwise it will be really hard (almost impossible) to reach this goal. I think we should try to go for something like what is described below:

    • 1 multi-tenant bucket for original assets and 1 multi-tenant bucket for storing processed variants, with a folder structure that could be something like this:
    magnolia-cloud-original-assets/[public|private]/<subscription>/[small|medium|large|<other_size_or_category>]/<asset_files>
    
    magnolia-cloud-processed-assets/[public|private]/<subscription>/[small|medium|large|<other_size_or_category>]/<asset_files>
    • 1 Lambda function using the magnolia-cloud-original-assets bucket as trigger, such that any time a new image is uploaded to the bucket by any subscription, the Lambda function is triggered to create the variants and store them in their corresponding folders (subscription and variant folders) inside magnolia-cloud-processed-assets (see the sketch after this list)
    • 1 CloudFront distribution with a global domain, cdn.magnolia-cloud.com or whatever, in front of the magnolia-cloud-processed-assets bucket (and maybe magnolia-cloud-original-assets as well).
    • Every Magnolia subscription only stores the paths to its managed images in its database, and the CMS always and only delivers the links to them, so that all images are retrieved directly by the user's browser (and even fine-grain resized) requesting the CloudFront CDN directly (through the cdn.magnolia-cloud.com domain), distinguishing the following cases:
      • Public images: Return the link to the user. For example https://cdn.magnolia-cloud.com/public/swissre/medium/house.jpg
      • Private images: Return a pre-signed URL generated on the fly by Magnolia (these URLs define an expiration time) so that the requesting user, and only that user, can get the image through cdn.magnolia-cloud.com. This could be perfectly applied to all images managed from the AdminCentral.
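
    A minimal sketch of the bucket-triggered Lambda function described above, using the Go Lambda runtime (github.com/aws/aws-lambda-go); the key layout follows this proposal, while the actual download/resize/upload work is stubbed out:

        package main

        import (
            "context"
            "fmt"
            "log"
            "strings"

            "github.com/aws/aws-lambda-go/events"
            "github.com/aws/aws-lambda-go/lambda"
        )

        // handler fires for every object created in magnolia-cloud-original-assets.
        // The download/resize/upload work is stubbed out; only the key bookkeeping
        // that follows the folder layout of this proposal is shown.
        func handler(ctx context.Context, evt events.S3Event) error {
            for _, rec := range evt.Records {
                bucket := rec.S3.Bucket.Name
                key := rec.S3.Object.Key // [public|private]/<subscription>/<category>/<file>
                log.Printf("new original: s3://%s/%s", bucket, key)

                parts := strings.SplitN(key, "/", 4)
                if len(parts) != 4 {
                    continue // unexpected layout, skip
                }
                visibility, subscription, file := parts[0], parts[1], parts[3]
                for _, size := range []string{"small", "medium", "large"} {
                    target := fmt.Sprintf("%s/%s/%s/%s", visibility, subscription, size, file)
                    // Real code: download the original, create the variant and
                    // put it into the processed-assets bucket under target.
                    log.Printf("would write s3://magnolia-cloud-processed-assets/%s", target)
                }
            }
            return nil
        }

        func main() {
            lambda.Start(handler)
        }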

    Some points about this proposal to consider:

    • S3 buckets are a purely logical element, so 1 TB of data stored in one bucket has the same performance and cost as 1 TB spread across 1000 buckets.
    • With this approach it is no longer necessary to think about forwards, redirects, reverse proxies and/or API Gateways, at least for these features...
    • A CloudFront distribution is not suitable to put in front of many origin buckets, as would be the case with 1 per subscription (see service quotas https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/cloudfront-limits.html#limits-web-distributions).
    • A CloudFront distribution is a pretty complex service with many configuration possibilities, cache strategies and behaviors, and, more importantly, distributions usually require continuous improvement of their configuration (refining the cache strategies, adding new ones, adding behaviors or customising the existing ones), so IMO deploying a dedicated CloudFront distribution for each subscription can drive us directly into a situation in which it would be really, really hard to deal with all of them.