Abstract
Local ad insertion in linear video is an established and lucrative practice in the United States, generating billions of dollars of revenue for video service providers. However, in most of the rest of the world, the ecosystem for local ad insertion is almost nonexistent, and deploying the components leads to a “catch-22” problem: Creating ad insertion opportunity signaling is an expensive proposition that broadcasters won’t put in place if the splicing ecosystem is not ready, and the splicing ecosystem can’t be put in place without the presence of a signaling mechanism.

In this paper, we outline a solution that can be used to bootstrap the targeted ad insertion ecosystem without widely distributed requirements across the end-to-end video delivery chain. We show how service providers can create a complete ecosystem locally, just before video is distributed to subscribers. The solution relies on a new CableLabs specification that creates a nonproprietary, flexible and complete architecture that enables both broadcast and targeted (multiscreen) ad insertion capabilities.

Introduction
The SCTE-35 specification [1] defines a signaling protocol used to insert markers that indicate the location in a video stream of source (or “national”) ads that video service providers can replace with other ads, often localized, which they monetize. These SCTE-35 signals (or cues) support a mutually beneficial ecosystem in which content originators and video service providers share the opportunities available for ad insertion, generating revenue for both. This ecosystem supports a thriving and growing $60B market of local ad insertion that ranges from broadcast to multiscreen applications.
In this ecosystem, upstream content originators insert ads into their content, and some of the ads are marked for replacement by different service providers. Ads that are not replaced are typically not marked at all, and ads that can be replaced are marked with SCTE-35 signals that are carried in-band in a PID of the transport stream carrying the video. In most cases today, these SCTE-35 cues carry only the minimal information needed to locate the beginning (and sometimes the end) of the ad slot, called the avail or ad opportunity.
Historically, this minimal information was sufficient for legacy broadcast use cases that used an ad splicer to splice a local ad over the original national ad. Because early technology used tape devices to carry the spliced ad, the SCTE-35 signal was sent sufficiently in advance of the actual splice time to allow the tape device to spin up to speed. (Interestingly, this “pre-roll” time has become useful in the more modern use of SCTE-35 to signal ad avails for multiscreen devices that don’t require a broadcast-style splicer. The extra pre-roll creates additional time for making ad decisions when every device, and hence many devices at once, may require a unique decision.)
With the transition to fragmented video delivery and adaptive bitrate (ABR) formats, the splicer was replaced by a simpler process of video fragment substitution. A video stream consisting of fragmented files can support ad insertion by replacing the fragments of the national ad with fragments of a local ad, as shown in Figure 1 below.
However, the limited data carried in the SCTE-35 cue is typically not sufficient to create such a fragment substitution ecosystem, because the video fragment boundaries must line up with the ad boundaries. To support this alignment, the SCTE-35 signal must specify the avail duration. Knowing the duration allows the transcoder that converts the input video into fragmentable video deliverable by adaptive bitrate formats to insert fragmentation points (called IDR frames) into its output at both the start and the end of the avail. Without the duration, simple fragment substitution cannot work.
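A minimal sketch of why the avail duration matters for fragment alignment, assuming a fixed fragment cadence measured in milliseconds. The function names, the 6-second cadence, and the timings are illustrative assumptions; they are not part of SCTE-35 or of any transcoder API.

```python
def fragment_boundaries(stream_end_ms, fragment_ms, forced_idr_ms=()):
    """Return sorted fragment boundary times for a stream starting at 0 ms.

    Boundaries fall on a regular cadence, plus any forced IDR points.
    """
    points = set(range(0, stream_end_ms, fragment_ms))
    points.add(stream_end_ms)
    points.update(p for p in forced_idr_ms if 0 <= p <= stream_end_ms)
    return sorted(points)


def straddling_fragments(boundaries, avail_start_ms, avail_end_ms):
    """Return fragments that overlap the avail without lying fully inside it."""
    result = []
    for start, end in zip(boundaries, boundaries[1:]):
        overlaps = start < avail_end_ms and end > avail_start_ms
        inside = start >= avail_start_ms and end <= avail_end_ms
        if overlaps and not inside:
            result.append((start, end))
    return result


# Hypothetical avail: starts at 20 s and lasts 30 s.
AVAIL_START, AVAIL_DURATION = 20_000, 30_000
AVAIL_END = AVAIL_START + AVAIL_DURATION

# Cue carries only the start time: one IDR frame is forced at the avail start.
start_only = fragment_boundaries(60_000, 6_000, forced_idr_ms=[AVAIL_START])
# Cue also carries the duration: IDR frames are forced at both avail edges.
with_duration = fragment_boundaries(
    60_000, 6_000, forced_idr_ms=[AVAIL_START, AVAIL_END]
)

print(straddling_fragments(start_only, AVAIL_START, AVAIL_END))
print(straddling_fragments(with_duration, AVAIL_START, AVAIL_END))
```

In this sketch, a cue without a duration leaves one fragment that mixes the tail of the ad with the following program, so it cannot be swapped cleanly. Once the end of the avail is also a boundary, every fragment lies either entirely inside or entirely outside the avail.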
Event Signaling and Management
To solve this problem, CableLabs, working with a number of its members, created the Event Signaling and Management (ESAM) specification, which allows SCTE-35 signals to be conditioned by a “signal processing point,” typically an SCTE-130 Placement Opportunity Information Service (POIS). The ESAM specification allows two high-level types of behavior.
First, when the transcoder encounters an SCTE-35 cue, it can confirm the validity of the signal, for example, whether the operator has the rights to insert a local ad at the ad signaled by the cue. The POIS can also delete the cue or update it with extra information, such as the avail duration. This solves the problem of ad location and source stream fragment alignment.
Second, the POIS may inject SCTE-35 cues based on a schedule. This functionality was created in order to enable program substitution (or blackout) using the same architecture that supports ad insertion. The program substitution is treated like a long ad, but one that comes at a point in time that may be driven by considerations local to the operator. This second type of functionality can be utilized in locales where SCTE-35 cues are not present.
The ESAM specification has a few other interesting applications. When conditioning SCTE-35 cues, unique identifiers can be added to the cues (along with the avail duration). These identifiers can be used to help with ad decisions that can then be made based on a stream property rather than the time of the avail. For example, when content is recorded, the avail can be specified based on its identifier, which becomes part of the cue. ESAM can also be used to specify locations of IDR frames in on-demand content or to specify the markup used to specify SCTE-35 cues in ABR formats.
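The conditioning behavior described above can be sketched as a small decision function. This is only an illustration of the logic (confirm, delete, or enrich a cue); the `Cue` class, its fields, and `condition_cue` are hypothetical names and do not reflect the actual ESAM message schema.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Cue:
    """Hypothetical, simplified stand-in for an SCTE-35 cue."""
    splice_time_ms: int
    duration_ms: Optional[int] = None
    signal_id: Optional[str] = None


def condition_cue(cue, has_rights, known_duration_ms=None, signal_id=None):
    """Return the conditioned cue, or None if the cue should be deleted.

    has_rights: whether the operator may insert a local ad at this avail.
    known_duration_ms / signal_id: extra data the POIS can add when missing.
    """
    if not has_rights:
        return None  # the POIS deletes the cue
    return replace(
        cue,
        duration_ms=cue.duration_ms if cue.duration_ms is not None else known_duration_ms,
        signal_id=cue.signal_id if cue.signal_id is not None else signal_id,
    )


incoming = Cue(splice_time_ms=120_000)
conditioned = condition_cue(
    incoming, has_rights=True, known_duration_ms=30_000, signal_id="avail-001"
)
print(conditioned)
```

The identifier added here is what later allows an ad decision to be keyed on the avail itself rather than on wall-clock time, for example when the content is played back from a recording.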

Use Cases

Ad Fingerprinting
Ad marking is managed by fingerprinting the ad content, a process that is analogous to traditional hashing. The hash of a set of data is a short, fixed-size set of data that can be used as a key to represent the original data. Because the hash is much smaller than the original data, the mapping cannot be one-to-one: distinct data sets can in principle share a hash. However, because real-world data is sparse in the space of all possible data sets, the mapping is effectively unique and is generally used as such. Given two data sets, their hashes can be compared far more quickly than the full data sets, and thus identical hashes can be used to find identical data sets.
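As a brief illustration of the hashing analogy, the sketch below uses SHA-256 from Python's standard `hashlib` module. The choice of SHA-256 and the sample payloads are assumptions made for the example; the paper does not name a particular hash function.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return a fixed-size hexadecimal key for data of any length."""
    return hashlib.sha256(data).hexdigest()


original = b"ad-payload" * 100_000   # about 1 MB of data
copy = bytes(original)               # identical content
altered = original[:-1] + b"X"       # differs in the last byte only

print(len(original), len(digest(original)))
print(digest(original) == digest(copy))
print(digest(original) == digest(altered))
```

Comparing the two 64-character keys stands in for comparing the full payloads. Note that a single changed byte produces a different key, which is exactly the property that makes a plain hash unsuitable for recognizing transcoded ads.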
This same idea is applied to ad fingerprinting. Source ads are fingerprinted, and the fingerprints are stored in a database. The streamed content is then continually fingerprinted and compared against the existing fingerprint database until a region of the stream matches the fingerprint of a previously fingerprinted source ad. That match then implies that the fingerprinted source ad appears in the stream. The fingerprint algorithm is continuous and can find the exact start time of the match.
A major difference between hashing and fingerprinting ads is that the version of the ad used to create the original fingerprint may differ from the version that needs to be recognized in the stream, because the latter may have been transcoded or manipulated in some way. The fingerprint algorithm must thus utilize features of the data that survive through such transformations, notably transcoding. The details of the fingerprint algorithms are beyond the scope of this paper (and are generally proprietary trade secrets). The fingerprinting is often just based on the audio data in the stream, but there are implementations that utilize the video stream as well.
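Since real fingerprint algorithms are proprietary, the following is only a toy sketch of the idea, under the assumption that the fingerprint is the sequence of rises and falls in audio energy between consecutive frames. That feature survives a change in volume, which stands in here for a transformation such as transcoding. All names and sample values are made up for the example.

```python
def frame_energies(samples, frame_size):
    """Sum of squared samples for each complete, non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_size])
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]


def fingerprint(samples, frame_size=4):
    """One bit per frame transition: 1 if energy rises, 0 otherwise."""
    e = frame_energies(samples, frame_size)
    return [1 if b > a else 0 for a, b in zip(e, e[1:])]


def find_match(stream_bits, ad_bits, max_errors=0):
    """Return the first offset where ad_bits matches stream_bits, else None."""
    n = len(ad_bits)
    for offset in range(len(stream_bits) - n + 1):
        window = stream_bits[offset:offset + n]
        if sum(x != y for x, y in zip(window, ad_bits)) <= max_errors:
            return offset
    return None


FRAME = 4
# Hypothetical source ad: eight frames with distinct loudness levels.
ad = [amp for level in (1, 5, 2, 7, 9, 3, 1, 4) for amp in [level] * FRAME]
# The ad as it appears in the stream: same content at half the volume.
ad_in_stream = [0.5 * s for s in ad]
lead_in = [3] * 8 + [2] * 8
stream = lead_in + ad_in_stream + [6] * 8

offset = find_match(fingerprint(stream, FRAME), fingerprint(ad, FRAME))
print(offset, offset * FRAME)
```

The matched offset is expressed in frames, so multiplying by the frame size recovers the sample position at which the ad begins. A byte-level hash of `ad_in_stream` would not match the hash of `ad`, whereas this fingerprint is unchanged by the volume scaling.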
Manual Fingerprinting
Manual fingerprinting of ads involves processing the ads independently of any content. This means all the ads that are going to be replaced need to be available separately and ahead of time for the fingerprinting process. The created fingerprints are stored in a database that associates a fingerprint with the specific ad, its duration, and other metadata. While this process is straightforward, it is not always achievable, because the ads typically originate with the content owners but must be recognized by a video service provider that has no relationship with the ad provider. When manual fingerprinting is not possible, automatic fingerprinting may be usable.
Automatic Fingerprinting
It is possible to automatically fingerprint ads by continually monitoring and fingerprinting video streams. When fingerprints for different portions of the stream repeat and have ad-like characteristics, those portions are designated as ad regions. The specific ad cannot be determined, of course, but recurring segments of the video with the same fingerprint can be recognized as ads and subsequently marked for insertion of SCTE-35 cues. Automatic fingerprinting is used to not only detect the ad region, but, crucially, also detect its duration. So while the fingerprint of the initial portion of the stream is used to rapidly detect an ad, the fingerprint of the whole ad is used to calculate the ad’s duration.
Automatic fingerprinting can be run as a service in which multiple broadcast channels are ingested and monitored for ads. Service providers can then use the service to receive updates of fingerprint databases to recognize ads in their incoming streams.
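The repetition test at the heart of automatic fingerprinting can be sketched as a simple count over observed segment fingerprints. The string fingerprints, the function name, and the threshold of three repeats are assumptions for illustration; a real system would also weigh the "ad-like characteristics" mentioned above.

```python
from collections import Counter


def candidate_ads(segment_fingerprints, min_repeats=3):
    """Return fingerprints seen at least min_repeats times in the stream."""
    counts = Counter(segment_fingerprints)
    return {fp for fp, n in counts.items() if n >= min_repeats}


# Hypothetical fingerprints observed while monitoring a channel.
observed = [
    "news-1", "spot-A", "spot-B", "news-2", "spot-A",
    "drama-1", "spot-A", "spot-B", "drama-2",
]
print(candidate_ads(observed))
```

With this threshold, only the segment seen three times is flagged. The segment seen twice is not yet recognized, which mirrors the warm-up period an automatic system needs before a new ad can be marked.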

The End-to-End Ecosystem

In the end-to-end ecosystem (shown below), an ad fingerprinting database is used to host fingerprints of ads that were either manually or automatically created. This database feeds a Fingerprint Detection process that continuously compares the linear input stream with the stored fingerprints. When a match is found, the Fingerprint Detection process connects to a POIS and signals the time in the stream that the ad occurred, as well as the ad’s duration. The POIS then uses an ESAM interface to inject an SCTE-35 signal into the incoming stream at the transcoder, both at the beginning and end of the ad, and it includes the ad’s duration in the first cue. The transcoder must then react as it would normally when encountering an SCTE-35 cue in an input stream: It must pass this cue to its output, so that downstream devices can also be aware that an ad opportunity has occurred; and it must insert an IDR frame into its transcoded output, so that downstream packagers can create a fragment boundary at the ad boundaries — both at the beginning and end of the ad.
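The hand-off from fingerprint detection to cue injection can be sketched as follows. The `InjectedCue` class and both functions are hypothetical names used only to show the shape of the data: two cues per detected ad, with the duration on the first, and an IDR frame at each cue time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InjectedCue:
    """Hypothetical, simplified cue that a POIS asks the transcoder to inject."""
    time_ms: int
    kind: str                        # "start" or "end"
    duration_ms: Optional[int] = None


def cues_for_match(ad_start_ms, ad_duration_ms):
    """Build the start cue (carrying the duration) and the end cue."""
    return [
        InjectedCue(ad_start_ms, "start", ad_duration_ms),
        InjectedCue(ad_start_ms + ad_duration_ms, "end"),
    ]


def idr_points(cues):
    """Times at which the transcoder must place an IDR frame."""
    return sorted(c.time_ms for c in cues)


cues = cues_for_match(ad_start_ms=600_000, ad_duration_ms=30_000)
print(cues)
print(idr_points(cues))
```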

The rest of the dataflow proceeds as it normally would for an ad insertion ecosystem containing SCTE-35 cues. For linear broadcast distribution, the transcoder output feeds a broadcast splicer. For adaptive bitrate distribution, the transcoder output feeds a packager with a downstream system for manifest manipulation or client-side ad insertion.

For cloud DVR systems that capture the linear stream, the included SCTE-35 signal can be used to substitute ads during playout or to excise the ads, if desired. Basically, all functionality that is available to architectures that natively carry SCTE-35 cues is now available using this architecture. This approach means that the testing, monitoring, and downstream ecosystems that have been developed for native SCTE-35 systems can now be leveraged in this architecture, and this is another major value over any type of proprietary or non-standards-based system.

Note that the transcoder must be fronted with a delay line. The ad fingerprint detection algorithm requires a few seconds of latency, both to acquire sufficient data to create the fingerprint and to compare the stream fingerprint with the set of database fingerprints. This process typically takes 3 to 5 seconds, and so a short delay line is used in front of the transcoder to ensure that when the ESAM signal is received, the splice point in the video that needs to be marked with an SCTE-35 cue has not yet passed through the transcoder.
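The timing constraint on the delay line reduces to one comparison, sketched below. The function name and the example values are assumptions; only the 3 to 5 second detection latency comes from the text.

```python
def can_mark_splice(ad_start_s, detection_latency_s, delay_line_s):
    """True if the detection arrives before the splice point reaches the transcoder."""
    detected_at = ad_start_s + detection_latency_s
    reaches_transcoder_at = ad_start_s + delay_line_s
    return detected_at <= reaches_transcoder_at


# Hypothetical 6-second delay line in front of the transcoder.
print(can_mark_splice(ad_start_s=100.0, detection_latency_s=4.0, delay_line_s=6.0))
print(can_mark_splice(ad_start_s=100.0, detection_latency_s=8.0, delay_line_s=6.0))
```

The comparison does not depend on when the ad starts, only on whether the delay line is at least as long as the detection latency. Sizing the delay line is therefore a trade-off between added end-to-end latency and the share of slow detections that can still be marked.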
Potential Issues
This ecosystem has some potential shortcomings. First, it will not differentiate between ads that have the same initial portion, such as leading silence. Second, in some cases the fingerprint recognition can take a long time, longer than the delay buffer in front of the transcoder. In this case, the ad recognition happens too late to insert the SCTE-35 cue, and the opportunity is lost. Third, the automatic fingerprinting of ads can lead to false-positive ad labels on short promotions, station identification, or repeated segments of video.
False positives can be mitigated by heuristics that restrict the ad identification (for example, to portions of the stream whose duration is close to 15 or 30 seconds). Finally, the automatic fingerprinting of ads can only work after the ads have appeared enough times to be recognized as ads. This means that the initial few occurrences of an ad would not be recognized and marked as ad opportunities.
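A duration heuristic of the kind mentioned above might look like the following sketch. The half-second tolerance and the function name are assumptions; the text only names 15 and 30 seconds as example target lengths.

```python
def plausible_ad_duration(duration_s, target_lengths_s=(15.0, 30.0), tolerance_s=0.5):
    """True if the segment length is within tolerance of a target ad length."""
    return any(abs(duration_s - length) <= tolerance_s for length in target_lengths_s)


# Hypothetical repeated segments found by automatic fingerprinting.
for seconds in (29.8, 15.2, 7.0, 22.0):
    print(seconds, plausible_ad_duration(seconds))
```

A filter like this would reject a repeated 7-second station identification while keeping a repeated segment of about 30 seconds, at the cost of missing ads with unusual lengths.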
Conclusion
This paper presents a new, standards-based ecosystem that can bootstrap a video service provider-driven local ad insertion capability without the need for content originators to mark their existing ads. The ecosystem has uses for both linear ad insertion and network DVR content playback with new ads substituted over the recorded ads. Because the same system can be used to recognize programs as well as ads, it can lead to other valuable use-case functionalities, such as calculating the exact start times of programs, which is useful for cloud DVR recordings, for example.
By eliminating the need for equipment across the whole delivery chain and by focusing only on operators, this solution can speed up the monetization of video content and help operators enter the ad insertion market without dependencies on their upstream content provider.
