How Many Views Do My Videos Get?
That’s the fundamental question I’ve been working to answer. Accurate data collection is no mean feat. I wanted to share why, how and what we’ve been working on…
The gradual eclipse of Flash by HTML5, and the proliferation of mobile devices consuming video content, had put an increasing strain on the way we captured playback data. In 2015, I decided it was no longer appropriate to keep modifying the existing system incrementally. I saw two fundamental problems:
- HTML5 and mobile playback hit our CDN logs multiple times, and it was unpredictable how those repeated hits would be interpreted in our playback data.
- Our method of data collection precluded real-time reporting.

We needed to create a next-generation analytics platform. My aim was an analytics architecture that would be powerful and simple for our subscribers, now and for many years to come.
How We Collect Playback Data
I started by thinking about how to get the data out as quickly, responsively and accurately as possible. I researched a number of databases that would allow us to perform some aggregation at the point of data collection, which would speed up querying and reporting. I selected Amazon's Aurora, which is essentially a managed, MySQL-compatible database.
To keep the query engine efficient, I tried to keep the size of the data chunks relatively small, avoiding the snowball effect we'd experienced in the past. I decided to break the data into time slices at the point of collection, so that only the data relevant to the period being reviewed needs to be manipulated. The cost is that we duplicate data, but I believe the benefit of much faster analysis far outweighs it.
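A minimal sketch of this time-slicing idea, using in-memory counters and illustrative names (not vzaar's actual schema): each play event is written into every time-slice bucket it falls into at the moment of collection, so a query for one hour, day or month only ever touches that period's rows.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical in-memory stand-ins for the hourly/daily/monthly tables.
counters = {
    "hourly": defaultdict(int),
    "daily": defaultdict(int),
    "monthly": defaultdict(int),
}

def record_play(video_id: str, ts: datetime) -> None:
    """Aggregate one play event into each time slice at the point of
    collection, duplicating the data so later queries stay small."""
    counters["hourly"][(video_id, ts.strftime("%Y-%m-%d %H:00"))] += 1
    counters["daily"][(video_id, ts.strftime("%Y-%m-%d"))] += 1
    counters["monthly"][(video_id, ts.strftime("%Y-%m"))] += 1

record_play("vid-1", datetime(2016, 3, 1, 14, 30, tzinfo=timezone.utc))
record_play("vid-1", datetime(2016, 3, 1, 14, 45, tzinfo=timezone.utc))
print(counters["hourly"][("vid-1", "2016-03-01 14:00")])  # 2
print(counters["monthly"][("vid-1", "2016-03")])          # 2
```

The trade-off in the code mirrors the one in the text: every event is stored three times, but reporting on any period is a straight key lookup rather than a scan.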
We collect each play event in hourly, daily and monthly data tables. Going forward, I would like to reduce this to collecting the data twice. In retrospect, the daily data would be better aggregated from the hourly data; by creating a separate concept of 'daily' data we have to introduce an overhead to deal with multiple time zones, which I think is neither necessary nor beneficial.
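The alternative described above — deriving daily figures from the hourly buckets rather than storing them separately — could look roughly like this. It assumes hourly buckets keyed by UTC hour (an assumption of this sketch, not stated in the post), which makes the time-zone handling a query-time concern instead of a storage-time one:

```python
from collections import defaultdict
from datetime import datetime, timedelta

def daily_from_hourly(hourly: dict, tz_offset_hours: int) -> dict:
    """Roll hourly UTC buckets up into daily totals for a given UTC offset,
    so no per-time-zone 'daily' table needs to be stored at all."""
    daily = defaultdict(int)
    for (video_id, hour_key), plays in hourly.items():
        utc_hour = datetime.strptime(hour_key, "%Y-%m-%d %H:00")
        local_day = (utc_hour + timedelta(hours=tz_offset_hours)).strftime("%Y-%m-%d")
        daily[(video_id, local_day)] += plays
    return daily

hourly = {("vid-1", "2016-03-01 23:00"): 5, ("vid-1", "2016-03-02 01:00"): 3}
# Viewed from UTC+2, the 23:00 UTC bucket falls on 2nd March locally,
# so both buckets land on the same local day.
print(dict(daily_from_hourly(hourly, 2)))  # {('vid-1', '2016-03-02'): 8}
```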
One of the curious issues I encountered was caused by the way the iframe was sending playback data. We were sending the data twice for audit purposes, and I realized that the order in which the two types of data were reported made a difference to the accuracy of the data collection. I had to dig into why this made a difference, and make sure accuracy was achieved independently of the order the data was sent. Rather unexpectedly, this was a big help in achieving extremely high accuracy.
It took more than six months to get to the point at which we were able to release our new video analytics. About two thirds of the time was spent researching alternative approaches and then doing exhaustive prototyping and testing. One of the most important and time-consuming phases was to test accuracy under very large loads, painstakingly isolating and removing anomalies and errors.
But it was worth it. As a result, I think we’ve achieved the simple and highly robust analytics engine I was hoping for.
Playback data requires, at root, a relatively simple structure: loads, plays and time. But with our new analytics architecture we can cut the data in several ways that are proving very powerful for our subscribers.
Obviously, subscribers can look at load and play counts per video, per period. But we have also released the vzaar Playability Index, which lets our clients understand how likely a video is to be played. Subscribers can also create powerful insights from things like Top Ten lists, and, of course, home in on specific time periods like holiday weekends, specific weeks or times of day.
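The post doesn't spell out the Playability Index formula, but since the underlying structure is loads, plays and time, a plays-to-loads ratio is one plausible reading. A hypothetical sketch of that reading, including a Top Ten-style ranking built from it (all names and numbers here are illustrative):

```python
def playability(loads: int, plays: int) -> float:
    """Hypothetical plays-per-load ratio; the real vzaar Playability
    Index formula is not described in this post."""
    return plays / loads if loads else 0.0

# Illustrative per-video (loads, plays) counts for one period.
videos = {"intro": (1000, 640), "demo": (500, 450), "faq": (200, 30)}

top = sorted(videos, key=lambda v: playability(*videos[v]), reverse=True)
print(top)  # ['demo', 'intro', 'faq'] — ranked by likelihood of being played
```

The point of the sketch is that once loads and plays are already aggregated per period, derived metrics like this are cheap to compute at query time.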
What’s Still To Come? The Future of Video Analytics
I’m now starting to look at engagement and multivariate analysis. This is a different level of complexity, involving a dozen or so different types of data and asking quite different questions. For playback analytics, the basic concept is a “counter” that specific events “turn”. Engagement is a multi-dimensional analytical challenge, more like a palette of colours than a counter, and it will almost certainly require a different database solution. That’s my next challenge for the second half of the year. Bring it on!