by Ben Morrow
Those of you who read this blog and have been following our cloud rendering posts will know that we process massive amounts of data to create a Light Field VR piece. This processing is expensive, both in dollars spent and in turnaround time, and naturally we are constantly looking for new rendering architectures to reduce both costs.
In this post, we describe a significant improvement to our cloud rendering pipeline that let us close 2017 with a more than 40% reduction in costs and a 30% increase in total render speed, while greatly simplifying our rendering process.
We had previously opted to use an open source distributed filesystem, GlusterFS, as a mountable read/write filesystem in the cloud. All dependent files were copied there prior to rendering, and output was written back to the same location, so that we could achieve the throughput we needed. But this created expensive bottlenecks, and scaling the GlusterFS up during a show was complicated, resulting in significant downtime as we rebalanced our data. As a result, we often ran the GlusterFS at its maximum size for the duration of a project. This became very costly on large productions, which needed a proportionately more expensive GlusterFS (for greater performance and storage) running for a longer period of time. Moreover, most of the GlusterFS bandwidth was wasted: every render node had high-bandwidth access to every file for the entire show, even though each node needed to read only a tiny fraction of the show’s data.
Enter Just in Time cloud rendering, which determines the file dependencies for each job, so that only the files that job actually needs are copied locally for high-bandwidth access. In Foundry’s compositing software, Nuke, we leveraged the callback system (designed for integration into asset management systems) to implement the new process: we intercepted every file system call from the software, checked whether the path existed in cloud storage, and copied the file to a local cache if it was not already there. This let the software grab what it needed on a per-frame, rather than per-shot, basis, and cache each file on the instance’s local disk for reuse on the next frame (cached data is of course much quicker to read and write). Implementing a local cache for post-production renders eliminated the need for a GlusterFS to handle file reads and writes.
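The caching logic can be sketched in a few lines of Python. All names here are illustrative, and a plain directory stands in for the cloud storage bucket; the production version hooks Nuke’s filename callbacks and talks to a real cloud storage client:

```python
import shutil
from pathlib import Path

def jit_resolve(path: str, cloud_root: Path, cache_root: Path) -> str:
    """Map a cloud-storage path to a local cached copy, fetching on a miss.

    Called for every file path the render requests, so each instance
    pulls down only the files its own frames actually read; cached
    files are reused on the next frame without another fetch.
    """
    relative = Path(path).relative_to(cloud_root)
    cached = cache_root / relative
    if not cached.exists():
        # One-time fetch on cache miss; here a local copy stands in
        # for a download from cloud storage.
        cached.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(cloud_root / relative, cached)
    return str(cached)
```

The renderer then reads from the returned local path, so every read after the first hits the instance’s local disk rather than shared storage.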
We then reexamined the pre-processing render stage and realized that we could combine its three stages into a single operation. While these combined jobs would run longer, reading and writing data would be simpler and quicker because we would no longer need storage for the intermediate stages. Longer jobs meant a percentage of them would be preempted (approximately 15% in a 24-hour period), but even with that rework, using cloud storage would be far more cost efficient than using a GlusterFS. We then enabled Just in Time rendering for pre-processing output so we could read from and write directly to cloud storage without a GlusterFS intermediary. This allowed us to increase the number of instances by 280%, and we anticipate an even greater improvement in the future, since reading and writing scale much more effectively on cloud storage than on a distributed filesystem.
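The consolidation can be sketched as follows. The stage bodies below are placeholders, not our actual pre-processing steps; the point is that once the stages run as one operation, intermediate results stay in memory on the render instance instead of being written to and re-read from shared storage between separate jobs:

```python
def run_preprocess(frame: bytes) -> bytes:
    """Run all three pre-processing stages as one combined operation."""

    def stage_a(data: bytes) -> bytes:  # placeholder, e.g. decode/convert
        return data.upper()

    def stage_b(data: bytes) -> bytes:  # placeholder, e.g. filter/transform
        return data[::-1]

    def stage_c(data: bytes) -> bytes:  # placeholder, e.g. encode the output
        return b"OUT:" + data

    # Previously each stage was a separate job whose output was written
    # to (and re-read from) shared storage; chaining them in one process
    # eliminates those intermediate reads and writes entirely.
    return stage_c(stage_b(stage_a(frame)))
```

Only the final result is written out, directly to cloud storage, so the only storage cost per job is its input and its output.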
At this point, we were able to completely switch our show renders to Google Cloud Storage. It provides almost unlimited data storage that grows as we add clients and files, letting us scale linearly with shifts in production without having to worry about resizing a GlusterFS. Cloud storage is also significantly cheaper than running multiple servers and keeping a persistent disk continuously available. Replacing GlusterFS storage with GCS allowed us to save almost half of our show budgets, to be more flexible, to run larger render farms resulting in faster turnaround, and to decrease Lytro-maintained infrastructure.
Eventually we expect to apply Just in Time rendering to all stages of our render operations: pre-processing, post-production, and volume generation. But even with only partial implementation, this technique has greatly simplified and sped up our cloud rendering process, with real implications for our rendering team, who were able to spend the last week of 2017 enjoying the holidays instead of monitoring renders.