Scalable Distributed Architecture for Media Transcoding
Horacio Sanson, Luis Loyola, and Daniel Pereira
SkillupJapan Corp.
Abstract. We present a highly scalable distributed media transcoding system that reduces the time required for batch transcoding of multimedia files into several output formats. To implement this system we propose a fully distributed architecture that leverages proven technologies to create a highly scalable and fault-tolerant platform. In addition, a new task-oriented parallel processing framework that improves on MapReduce is developed in order to express transcoding tasks as distributed processes and execute them on top of the distributed platform. Preliminary results show a significant reduction in the time required to transcode large batches of media files, with little effect on the quality of the output transcoded files.
1 Introduction
The proliferation of mobile devices with video playback capabilities and Internet connectivity has increased the number of distribution channels for digital content. This, in turn, has created a burden for content creators, who must support all the different combinations of devices and channels to reach a larger audience. A recent example of this is the HTTP Live Streaming protocol [1] from Apple Inc., which recommends at least six different versions of the same content, with different resolutions and bitrates, to deliver it to iOS mobile devices. To address this increasing number of transcoding tasks we implemented a distributed transcoding platform capable of reducing the time and human resources required to handle them. To this end we implemented a fully distributed and fault-tolerant architecture and developed a new task-oriented parallel processing framework that allows easy execution of transcoding tasks in a distributed fashion. The distributed architecture builds on proven technologies in distributed computing such as distributed queues and storage. This resulted in a robust platform on which we could execute parallel processes without worrying about fault-tolerance, synchronization and availability issues. On top of the platform we needed a parallel processing framework that would allow us to model complex problems as smaller sub-tasks that could be executed in parallel. Unfortunately, state-of-the-art parallel processing frameworks like MapReduce [2] and Dryad [3] lack the expressiveness required to model complex tasks such as multimedia transcoding, forcing us to develop our own task-oriented framework. This new task-oriented parallel processing framework improves over MapReduce by adding a task primitive on top of the map/reduce semantics.
In Section 2 we present related work on distributed and cloud transcoding; Sections 3 and 4 present a detailed description of our distributed platform architecture and of our improved task-oriented parallel processing framework, respectively. In Section 5 we elaborate on the model for the transcoding process using the proposed parallel processing framework and explain how it is executed on the distributed platform. Section 6 presents transcoding experiments using the system and elaborates on its performance. Finally, in Section 7, we summarize the important points of this work and give an overview of our future work.
2 Related Work
This paper is inspired by the work in [4], which presents a high-speed distributed transcoding platform. The authors focus on the optimal segmentation of media files that produces the least degradation in the transcoded file, and then propose a simple centralized architecture with a round-robin scheduler to distribute transcoding tasks among several transcoding servers. This architecture works on the assumption that all segments are equal and have the same encoding time. Unfortunately, even in a controlled environment such as ours, differences in transcoding server capacity, network conditions and media source complexity invalidate such assumptions. To mitigate the effects of such a heterogeneous environment, the authors in [5] propose a scheduling algorithm that takes into account the estimated transcoding time of each source and the capabilities of the transcoding servers. The new scheduling results in more linear scalability and better load distribution among the transcoding servers, which shows the importance of distributed scheduling. However, the proposed scheduler requires global knowledge and must also estimate the transcoding time of each task in order to schedule it properly. This estimation can be inaccurate due to the variable complexity of source materials and requires complex operations that may choke the scheduler, jeopardizing the scalability of the system. Other authors [6,7] have proposed P2P architectures; however, transcoding poses high CPU and network I/O requirements that, combined with the unreliability and low availability of P2P clients, make it difficult to envision a real implementation of such systems.
3 Architecture and Implementation
3.1 Centralized Architecture
Figure 1 shows the classic distributed transcoding architecture common in previous research [4,8]. The architecture is composed of a storage system where media files are stored, a transcoder manager that is the center of the system, and the transcoding workers, which are one or more machines dedicated to transcoding video segments. The transcoder manager contains the intelligence of the system and is composed of the following sub-systems:
Fig. 1. Centralized Transcoding Architecture: storage, a transcoder manager (splitter, media analyzer, scheduler, worker state tracking and merger) and a pool of workers in a cloud or cluster
– Splitter: The splitter is in charge of analysing the source media files on the storage and splitting them into segments. The choice of split points and segment sizes must be made carefully, since it affects the quality, load balancing and scheduling performance of the whole system. Splitting into larger segments may result in better output quality but in loads that are harder to distribute, and vice versa.
– Scheduler: The scheduler decides which segments should be sent to each available worker. If all workers and all segments are equal in capacity/requirements, then a simple First Come First Served (FCFS) scheduler is optimal. Unfortunately, in real scenarios both the segments and the workers have different requirements/capabilities, which demands more intelligent scheduling algorithms.
– Merger: A separate server or process receives the transcoded segments from the transcoding servers to create the final transcoded video.
This architecture presents obvious scalability issues on several points. Firstly, the only channel through which the workers can receive and return media segments is the transcoder manager. This limits the network capacity of the whole system to that of the manager. Secondly, the scalability of the manager to handle higher input load and more workers greatly depends on the complexity of the splitting and scheduling algorithms used. Additionally, the manager must track media segment states (encoded, waiting) and worker states (loaded, idle, failed) in order to make scheduling decisions.
3.2 Distributed Architecture
By taking advantage of recent advances in distributed storage, databases and parallel processing we decided to shift radically from the classical centralized architecture and implemented a fully distributed architecture as depicted in Figure 2. When compared to the centralized approach, we can observe that the central manager no longer exists and the sub-systems it contained are now inside the workers themselves. This way, the load that was handled before on
Fig. 2. Distributed Transcoding Architecture: a pool of workers, each containing its own media analyzer, splitter, scheduler, encoder and merger, connected through a distributed queue and distributed storage
the central manager is now distributed, along with the actual transcoding, among all workers in the system. Since scheduling and splitting decisions are made by the workers, there is no need to keep track of each worker's capabilities and state, reducing the complexity of the system even further. Transcoding tasks are submitted to the system by pushing them to a distributed queue. Each worker's scheduler polls this queue for new tasks and selects which task to execute from the available tasks in the queue. Depending on the queue structure and the way the workers' schedulers select tasks from it, different scheduling strategies can be implemented. In our architecture the distributed queue is built on top of Redis DB [9]. Redis is a scalable and fault-tolerant database that supports atomic operations, an important property for avoiding synchronization issues in distributed systems. In order to transcode the media files the workers need a way to access them. We implemented this in our architecture using GlusterFS [10], a distributed storage system with replication and striping support. With replication and striping the source files are stored as replicated stripes over several storage servers. This greatly improves the fault-tolerance and read performance of the whole system by distributing the read load among all available storage servers.
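As an illustration of how such a queue could be realized, the sketch below shows a minimal Redis-backed task queue in Ruby. The class name, key names and choice of RPOPLPUSH are our assumptions for illustration only; the paper relies on Redis and its atomic operations but does not publish the actual implementation.

require "redis"  # illustrative; the actual client library is not specified in the paper

# Minimal sketch of a distributed task queue on top of Redis (assumed design).
# RPOPLPUSH atomically moves a task from the pending list to a "working" list,
# so two workers can never reserve the same task and tasks held by a crashed
# worker remain visible for recovery.
class TaskQueue
  def initialize(redis, pending: "tasks:pending", working: "tasks:working")
    @redis   = redis
    @pending = pending
    @working = working
  end

  # Submit a serialized task (e.g. a JSON blob holding state, input and output paths).
  def push(task_json)
    @redis.lpush(@pending, task_json)
  end

  # Atomically reserve the oldest pending task.
  def reserve
    @redis.rpoplpush(@pending, @working)
  end

  # Acknowledge a finished task so it is removed from the working list.
  def ack(task_json)
    @redis.lrem(@working, 1, task_json)
  end
end

A worker would call reserve in a loop and ack once the task's output has been written to the distributed storage.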
4 Task-Oriented Parallel Processing
By leveraging known technologies in distributed queuing and storage, we implemented a distributed architecture for our transcoding platform. The main problem now resides in how to implement the worker logic that orchestrates all the different sub-systems to transcode media files in a distributed manner. Our first attempt was to apply parallel processing frameworks like MapReduce
or Dryad [2,3]. Unfortunately these frameworks are data-centric and lack the expressiveness to model complex problems such as multimedia transcoding. In order to implement the parallel transcoding we designed a task-oriented parallel framework whose main component is a task object. A task has by default a state that indicates its current position in the process workflow, an input that points to the input data on the distributed storage, and an output that points to the place in the distributed storage where it must store the result of its processing. In addition to these properties, a task may also contain pointers to two sub-tasks. These sub-tasks are the key property that allows the distribution of the processing among workers in the system and the modelling of dependencies among those processes. The processing of each task is implemented in two simple methods called map and reduce. These are similar to the MapReduce map and reduce methods but, as we will see in the next section, the execution workflow around our map and reduce methods is more elaborate and allows modelling more complex problems.
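The following sketch shows what such a task object could look like in Ruby; the attribute and class names follow the description above but are otherwise our own illustration, not the authors' code.

# Sketch of the task primitive: a state, input and output paths on the
# distributed storage, optional left/right sub-tasks, and map/reduce hooks.
class Task
  attr_accessor :state, :input, :output, :left, :right

  def initialize(input)
    @input  = input     # where the input data lives on the distributed storage
    @output = nil       # where the result of the processing must be stored
    @state  = :pending  # :pending -> :waiting -> :complete (or :failed)
    @left   = nil       # sub-tasks possibly created by map
    @right  = nil
  end

  # Subclasses either produce the output directly or split the input
  # and create two sub-tasks.
  def map
    raise NotImplementedError
  end

  # Subclasses merge the outputs of the two sub-tasks into this task's output.
  def reduce
    raise NotImplementedError
  end
end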
4.1 Worker Execution Workflow
When a worker acquires a task from the distributed queue, it follows the simple execution workflow shown in Figure 3. If the task is in the pending state it is a new task, so the worker invokes the map method on it and moves the task to the map state. If the input data is not splittable, the task result is stored in the output path and the task changes to the complete state. If, on the other hand, the input data is splittable, the map method may split it into two equal parts. This splitting pushes two new sub-tasks with those parts as inputs to the distributed queue and moves the current task to the waiting state.
Fig. 3. Worker Parallel Processing Execution Workflow: tasks move between the pending, waiting, complete and failed states through the map and reduce methods
If the reserved task is in the waiting state, it has sub-tasks it is waiting for. In this case the worker checks the state of the sub-tasks and, if these are completed, the worker invokes the reduce method with the sub-tasks' outputs as inputs. The reduce method then merges these inputs and produces the task's output.
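Putting the two cases together, a worker's main loop could be sketched as follows. The queue helper, the sleep-based polling and the minimal error handling are assumptions made for illustration, and serialization of tasks to and from the queue is elided.

# Sketch of the worker loop for the workflow in Figure 3 (assumed details).
loop do
  task = queue.reserve            # atomic pop from the distributed queue
  if task.nil?
    sleep 1                       # queue empty, poll again later
    next
  end

  begin
    case task.state
    when :pending
      task.map                    # transcode directly or create sub-tasks
      if task.left && task.right
        queue.push(task.left)     # let other workers take the two halves
        queue.push(task.right)
        task.state = :waiting
        queue.push(task)          # re-queue the parent until its children finish
      else
        task.state = :complete    # output already written to the storage
      end
    when :waiting
      if task.left.state == :complete && task.right.state == :complete
        task.reduce               # join the sub-task outputs
        task.state = :complete
      else
        queue.push(task)          # children still running, put the task back
      end
    end
  rescue StandardError
    task.state = :failed
  end
end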
5 Modelling the Distributed Transcoding Problem
With the task-oriented parallel processing framework presented in Section 4 we can now model each step of a transcoding process in a parallel fashion and execute it on top of our distributed platform. The model is easy to visualize if we think about transcoding as a group of smaller processes: demuxing, decoding, filtering, encoding and muxing. As previous research shows [4,8], the encoding process can be divided further into smaller processes using video segmentation. These processes have precedence dependencies (e.g. encoding cannot start if demuxing is not finished) but for the most part they can be executed in parallel. Based on this premise we can model the transcoding problem with three simple task classes: TranscodeTask, VideoTask and AudioTask (see Listings 1.1, 1.2 and 1.3). The process execution flow of these tasks is depicted in Figure 4.
Fig. 4. Distributed Transcoding Model Flow: TranscodeTask 1 demuxes the source; VideoTasks 1 and 2 split; AudioTask 1 and VideoTasks 3, 4 and 5 transcode; VideoTasks 2 and 1 join; TranscodeTask 1 muxes the final output
When a user wants to transcode a media file, a TranscodeTask is submitted to the distributed queue. A worker that acquires this new task sees it is in the pending state and invokes its map method. The map method of the TranscodeTask (see Listing 1.1) demuxes the input data into its audio and video components and creates two new sub-tasks: a VideoTask and an AudioTask with the corresponding video and audio components as inputs. After the map method finishes, the worker marks the TranscodeTask as waiting and pushes it back to the queue. Later during the parallel processing, when both the VideoTask and the AudioTask are completed, a worker will acquire the waiting TranscodeTask from the queue and find that its sub-tasks are completed. This prompts the worker to execute the reduce method on the TranscodeTask, which, as shown in Listing 1.1, muxes the outputs of the sub-tasks (the transcoded video and audio) back into the final transcoded file.

Listing 1.1. TranscodeTask simplified pseudocode
class TranscodeTask < Task
  def map
    # Demux the source into its audio and video elementary streams
    audiofile, videofile = demux(input)
    # Create one sub-task per stream so workers can process them in parallel
    self.left  = VideoTask.new(videofile)
    self.right = AudioTask.new(audiofile)
  end

  def reduce
    # Mux the transcoded video and audio back into the final container
    self.output = mux(left, right)
  end
end

The VideoTask map method is more complex as it can take two different paths. If the task input is too large and would take too long under the worker's current load, the worker may decide to split the input into two equal segments and create two new VideoTasks for the two segments (see VideoTask 1 in Figure 4). Any worker that acquires one of these sub-tasks can further split its segment into even smaller segments, distributing the load further among all workers (see VideoTask 2 in Figure 4). This splitting can continue until a maximum number of allowed splits is reached or until a worker decides it can handle the processing without splitting. When this splitting limit is reached, the worker proceeds to transcode the segment and marks the task associated with the segment as completed (VideoTasks 3, 4 and 5 in Figure 4).

Listing 1.2. VideoTask simplified pseudocode
class VideoTask < Task
  def map
    if split_count < max_split
      # Cut the video stream into two equal-length segments and hand each
      # half to a new sub-task so that other workers can share the load
      segment1, segment2 = cut(input)
      self.left  = VideoTask.new(segment1)
      self.right = VideoTask.new(segment2)
    else
      # Splitting limit reached: transcode this segment locally
      self.output = encode_video(input)
    end
  end

  def reduce
    # Join the two transcoded segments back into a single video stream
    self.output = join(left, right)
  end
end

As segments are transcoded, the VideoTask reduce methods are called with the transcoded outputs, backtracking all the way to the initial VideoTask. On each reduce call the transcoded segments are joined to recreate the whole transcoded video stream, which can then be muxed with the transcoded audio stream (see VideoTasks 2 and 1 at the bottom of Figure 4). The example in Listing 1.2 is very simple and uses a hard limit on the maximum number of segments a video stream can be split into. We could add more complex logic to the workers that considers the segment size (in frames or seconds), the video complexity (slow or fast action scenes), the worker's capabilities (CPU/RAM) and the network conditions (error rate, bandwidth) to decide whether to further segment the task or transcode it.

Listing 1.3. AudioTask simplified pseudocode
class AudioTask < Task
  def map
    # Audio transcoding is cheap enough to always handle in a single task
    self.output = encode_audio(input)
  end
end

For completeness we also present a simple implementation of the AudioTask class. Since audio transcoding uses fewer resources, the map method always decides to transcode the audio directly (AudioTask 1 in Figure 4). This does not mean that audio transcoding cannot be distributed. We could, for example, transcode different audio tracks (e.g. languages, formats) or channels in parallel using our system.
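To tie the model back to the platform, a client could submit a job roughly as follows; the paths, the output assignment and the queue handle are hypothetical, and serialization is again elided.

# Hypothetical job submission (illustrative names and paths).
source = "/gluster/sources/movie.mpg"            # source file on the distributed storage
task   = TranscodeTask.new(source)
task.output = "/gluster/outputs/movie_hls2.ts"   # where the final result should land
queue.push(task)                                 # any idle worker may now pick it up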
6 Evaluation
In this section we evaluate our distributed architecture and the task-oriented parallel processing framework presented in Sections 3 and 4, respectively, by performing transcoding jobs on the system using the model described in Section 5.
6.1 Sample Media
For evaluation purposes, we transcoded a batch of 45 media files into 6 different output formats (Table 1) used for HTTP live streaming to mobile devices.
Table 1. Transcoding output formats

Name  Dimensions  Bitrate       FPS
HLS0  1024x768    2 Mbps max    30
HLS1  854x640     1.5 Mbps max  30
HLS2  576x432     1 Mbps max    30
HLS3  426x320     600 Kbps max  30
HLS4  384x288     400 Kbps max  15
HLS5  384x288     200 Kbps max  15

Table 2. Sample Media Distribution

        FHD   HD    SD
Short   20%   7%    12%
Medium   0%   2%    44%
Long     0%   2%    13%
Table 2 shows the batch of movies we used. The batch contains a variety of source media files with different sizes: FHD (1440x1080), HD (1280x720) and SD (720x480), and durations: short clips (< 10 min), medium clips (≈ 20 min) and long clips (> 1 hour). The transcoding cluster consists of four transcoding servers, each with an Intel Xeon iCore7 3.2GHz CPU and 6 GB of RAM. Each server runs six instances of the transcoding worker, giving us a total of 24 transcoding workers in the cluster. Two small virtual servers running Redis in master-slave mode serve as our distributed queue, and two 24 TB storage servers run GlusterFS with replication factor 2 and four stripes per replica.
6.2 Transcoding Speed
To evaluate the speed improvement of distributed transcoding, we submitted each transcode task to the distributed system one at a time. This way we measure the transcoding time of each task when all the system resources are available to it. We repeated and measured the test for each source media and output format combination with no splitting (one segment) and with a maximum of sixteen segments. We accumulated the transcode times for each output format and plotted them against the source media duration in Figure 5. It is clear from the plot that splitting the source media and transcoding each segment in parallel reduces the time it takes to finish all tasks. We can also observe that the longer the source media duration, the larger the benefit obtained from segmenting it. To give a concrete example: a DVD source that takes 6.48 hours to transcode with no splitting takes around 28.5 minutes to transcode when split into sixteen segments and processed in parallel.
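As a rough sanity check of this example (our own back-of-the-envelope arithmetic, not a figure reported in the paper): speedup = (6.48 h × 60 min/h) / 28.5 min ≈ 13.6, which is roughly 85% of the ideal 16x for sixteen segments; the remaining gap is plausibly attributable to splitting, queuing and joining overhead.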
Fig. 5. Transcoding speed improvement of split transcoding (16 segments) vs. non-split transcoding: transcode time in minutes against source duration in minutes
6.3 Transcoding Quality
Segmentation and parallel transcoding of a video along the time axis usually results in quality discontinuity and degradation around the segment cut points [4]. One way to avoid this problem is to always split the clip at GOP boundaries of the source media [8]. In Figure 7 we plot the Structural Similarity Index Metric (SSIM) between a source file and the corresponding transcoded files with no splitting and with two, four, eight and sixteen segments. All lines overlap, which means there is no difference in quality between transcoding with no splitting and transcoding in up to sixteen separate segments. A problem we found with this approach is that for non-exact frame rates (e.g. 29.97 fps) splitting the source at exact frame boundaries can be inefficient and difficult due to limitations in the transcoding tools we used. Figure 6 shows the same similarity index test but with a source media file that has a 29.97 frame rate. We can observe some frames with a very low similarity index (< 0.80) that would indicate a large degradation of the resulting video; however, a careful frame-by-frame comparison of the source media and the transcoded files shows that this degradation was due to a shift in the frame ordering of the output. Again, this frame shifting is due to the inability of our splitting tools to correctly cut at exact frame boundaries in the source. Fortunately, at sixteen segments, subjective tests show a negligible difference in output quality and no audible audio synchronization issues.
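A minimal illustration of GOP-aligned splitting is sketched below: given the keyframe (GOP start) timestamps of the source, cut points are snapped to the keyframes nearest the ideal equal-length boundaries. How the keyframe list is extracted is tool-dependent and elided; the function is our own illustration, not the tool used in the experiments.

# Choose cut points (in seconds) aligned to GOP boundaries.
def gop_aligned_cut_points(keyframe_times, duration, segments)
  (1...segments).map do |i|
    ideal = duration * i / segments.to_f           # ideal equal-length boundary
    keyframe_times.min_by { |t| (t - ideal).abs }  # snap to the nearest GOP start
  end.uniq
end

# Example: a 600 s clip split into four segments with a keyframe every 2 s.
keyframes = (0..600).step(2).map(&:to_f)
gop_aligned_cut_points(keyframes, 600.0, 4)        # => [150.0, 300.0, 450.0]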
6.4 Task Scheduling
The system becomes overloaded when the transcoding task demand far exceeds the available transcoding resources.
Fig. 6. Distributed Transcoding SSIM (29.97 fps source): SSIM per frame number for no splitting and for two, four, eight and sixteen segments

Fig. 7. Distributed Transcoding SSIM (30 fps source): SSIM per frame number for no splitting and for two, four, eight and sixteen segments
In this case, scheduling of tasks plays an important role and affects the overall system performance and the user's perception of quality. Scheduling, whether centralized or distributed, is a known hard problem [11,12,5,13,14] and is a topic for our future research. Therefore, in this section we only evaluate two simple distributed schedulers: a First Come First Served (FCFS) scheduler and a Random scheduler, both on a 24-worker system. With the FCFS scheduler, tasks are submitted to the distributed queue and workers acquire them in the order they are submitted. With the Random scheduler, workers take a random task from the queue with uniform probability.
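For concreteness, the two polling strategies could be realized against Redis roughly as sketched below; the command choices and key names are our assumptions, since the paper does not specify how the random selection is implemented.

# FCFS: tasks live in a Redis list; popping from the tail serves the oldest
# submitted task first, atomically moving it to a working list.
def reserve_fcfs(redis)
  redis.rpoplpush("tasks:pending", "tasks:working")
end

# Random: tasks live in a Redis set; SPOP atomically removes and returns a
# uniformly random member.
def reserve_random(redis)
  redis.spop("tasks:pending:set")
end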
Fig. 8. No split FCFS Scheduler: stacked state duration times (seconds) per source media file for queuing, mapping, reducing and waiting

Fig. 9. Split (16 segments) FCFS Scheduler: stacked state duration times (seconds) per source media file for queuing, mapping, reducing and waiting
Using both schedulers we submitted the batch of 270 transcoding tasks, once with splitting (16 segments) and once without splitting, and measured the time each source file spent in each of the states: queuing, mapping, reducing and waiting. Figure 8 shows the FCFS scheduler with no splitting. The y axis shows the stacked times each media file spends in each state of the transcoding process, and the x axis corresponds to the transcoding tasks in the order they were submitted to the system. As expected from FCFS, tasks that are submitted later have larger queuing times, because later tasks must wait for all previous tasks to finish before they can be taken by a worker.
Fig. 10. No Split Random Scheduler: stacked state duration times (seconds) per source media file for queuing, mapping, reducing and waiting

Fig. 11. Split (16 segments) Random Scheduler: stacked state duration times (seconds) per source media file for queuing, mapping, reducing and waiting
Figure 8 also shows how a few heavy tasks (e.g. long mapping times) in the middle of the batch cause large delays for the following smaller tasks. The splitting case, using FCFS and depicted in Figure 9, shows no improvement in queuing times and further increases the waiting time of each transcoding task. This is expected if we consider how our parallel processing framework works: when a task creates sub-tasks, they are pushed back to the queue, resulting in longer waiting times for the parent task. As shown in Figures 10 and 11, the random scheduler does a better job of spreading the load among the available workers. This results in better overall system
throughput, but at the cost of longer completion times per task, where the completion time of a task is the sum of all its state times: queuing, mapping, reducing and waiting. The reduced performance in completion times is aggravated when we use splitting, because each split increases the number of tasks in the queue, thus increasing the number of processes competing for resources. Given the naive schedulers tested, this is an expected trade-off between throughput and per-task performance in distributed systems [15].
7 Conclusions and Future Work
We have presented and evaluated a distributed transcoding platform that can be used to speed up batch transcoding into different formats and sizes. Our results show significant performance improvements, in terms of transcoding completion times, on different input sources, with little impact on the quality of the output under normal conditions. Under overloaded conditions, the increased number of tasks generated by splitting, combined with both naive schedulers, results in larger completion times than the no-split cases. This indicates that further research is required on distributed scheduling that can better match the available worker resources to the factorial number of possible task allocations. Likewise, other video stream segmentation approaches that do not affect output quality and are independent of the source properties (e.g. FPS, codec, GOP) must be investigated.
References
1. Pantos, R., May, W.: HTTP Live Streaming - draft (September 2011) (expires: April 2, 2012)
2. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51, 107–113 (2008)
3. Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: Distributed data-parallel programs from sequential building blocks. In: European Conference on Computer Systems, EuroSys (March 2007)
4. Sambe, Y., Watanabe, S., Yu, D., Nakamura, T., Wakamiya, N.: High-speed distributed video transcoding for multiple rates and formats. IEICE Transactions on Information and Systems 88(8), 1923–1931 (2005)
5. Dongmahn, S., Jongwoo, K., Inbum, J.: Load distribution algorithm based on transcoding time estimation for distributed transcoding servers. In: International Conference on Information Science and Applications (ICISA), pp. 1–8 (April 2010)
6. Yang, C., Chen, Y., Shen, Y.: The research on a P2P transcoding system based on distributed farming computing architecture. Knowledge Engineering and Software Engineering (KESE), 55–58 (December 2009)
7. Ravindra, G., Kumar, S., Chintada, S.: Distributed media transcoding using a P2P network of set top boxes. In: Consumer Communications and Networking Conference (CCNC), pp. 1–2 (January 2009)
8. Deneke, T.: Scalable Distributed Video Transcoding Architecture. Master's thesis, Åbo Akademi University (2011)
9. Redis key-value database, http://redis.io
10. Gluster file system architecture, tech. rep., http://download.gluster.com/pub/gluster/documentation/Gluster Architecture.pdf
11. Isard, M., Prabhakaran, V., Currey, J., Wieder, U., Talwar, K., Goldberg, A.: Quincy: fair scheduling for distributed computing clusters. In: Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP 2009, pp. 261–276. ACM, New York (2009)
12. Zaharia, M., Borthakur, D., Sen Sarma, J., Elmeleegy, K., Shenker, S., Stoica, I.: Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In: Proceedings of the 5th European Conference on Computer Systems, EuroSys 2010, pp. 265–278. ACM, New York (2010)
13. Beaumont, O., Carter, L., Ferrante, J., Legrand, A., Marchal, L., Robert, Y.: Centralized versus distributed schedulers for bag-of-tasks applications. IEEE Transactions on Parallel and Distributed Systems (May 2008)
14. Ghatpande, A., Nakazato, H., Beaumont, O.: Scheduling of divisible loads on heterogeneous distributed systems. In: Ros, A. (ed.) Parallel and Distributed Computing, pp. 179–202. In-Tech (2010)
15. Raman, R., Livny, M., Solomon, M.: Matchmaking: Distributed resource management for high throughput computing. In: Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing, pp. 28–31 (1998)