How Post Content is Stored on Tumblr

macmanx:

engineering:

We’re currently rolling out an opt-in beta for a new post editor on web which will leverage the Neue Post Format behind the scenes. It’s been a very long time coming – work on the Neue Post Format began in 2015 and was originally codenamed “Poster Child”, and it was borne out of a lot of things we learned dealing with the previous new post editor we released on web around that time. Over the years, the landscape of how people make posts on different platforms across the internet has changed dramatically. But here on Tumblr, we still want to stay true to our blogging roots, while giving access to a wide creative canvas, and the Neue Post Format reflects that work.

With literally billions (tens of billions!) of posts on Tumblr, how do we move this churning engine of content from one format to another without breaking everything? It took many phases, and releasing the new editor on the web will be one of the final pieces in place. To understand how far we’ve come and the challenges we’ve had to face, you need to know the deep dark secrets of how we store post content on Tumblr. This hellsite we all love is held together by duct tape, good intentions, and luck, and we’re constantly working to make it better!

A post is seemingly a very simple data model: it has an author, it has content, and it was posted at a certain time. Every post has a unique identifier once it’s created. In the case of reblogs, they also have the “parent” post and blog it was reblogged from (more on How Reblogs Work over here). In a standard normalized database table, these columns would look like:

  • Post identifier (a very big integer)
  • Author blog identifier (an integer pointing to the “blogs” database table)
  • Parent post identifier (if it’s a reblog)
  • Parent blog identifier (if it’s a reblog)
  • When it was posted (a timestamp of some kind)
  • Post content (more on this in a minute)

Before the Neue Post Format, posts had discrete “types”, so that’d be a column here as well. But once you have these discrete “types”, you have to determine how you want to store the content of each “type”. For photo posts, this is a set of one or more images. For video posts, this is either a reference to an uploaded video file, or it’s a URL to an external video. For text posts, it’s just text, in HTML format. So the actual value of that “post content” column can change depending on what type it is.

Here’s a simple example, note how each post type has different kinds of content:

As Tumblr grew, its capabilities grew. We added the ability to add a caption to photo, video, and audio posts. We added the ability to add a “source” to quote posts. We needed somewhere to store that new post content. Because Tumblr was growing so rapidly at the time, this needed to happen fast, so we took the easiest path available: add a new column! That first “post content” column was renamed “one”, and the new post content column was named “two”. And as Tumblr grew more, eventually we added “three”. And each column’s value could be different based on the post type.

Needless to say, eventually this made it very difficult to have consistent and easy to understand patterns for how we figure out things like… how many images are in a post? Since we added the ability to add an image in the caption, it’s possible there’s images in the “one”, “two”, or “three” columns, but each may be in a different format based on the post type. Reblogs further complicate the storage design, as a reblog copies and reformats post content from its parent post to the new post. The code to figure out how to render a post became extremely complicated and hard to change as we wanted to add more to it.

Further complicating this was the fact that most (but not all) of these post content fields leveraged either HTML or PHP’s built-in serialization logic as the literal data format. Before PHP 7, HTML parsing in PHP (which is what Tumblr uses behind the scenes) was extremely slow, so rendering a post became more of a struggle as the post’s reblog trail grew or its post content complexity increased. And HTML and PHP’s serialization logic isn’t easily portable to other languages, like Go, Scala, Objective-C, Swift, or Java, which we use in other backend services and our mobile apps.

With all this in mind, in 2015, two needs converged: the need to have a more easily understandable and portable data format shared from the database all the way up to the apps, and the need for more types of post content, decoupled from post type. The Neue Post Format was born: a JSON-based data schema for content blocks and their layout. This has afforded us the flexibility to make new types of content available faster, without needing to worry necessarily about how we’ll store it in HTML format, and has made the post content format portable from the database up to the Android app, iOS app, and the new React-based web client.

Going back to the standard, normalized database table schema for posts, we’ve now achieved the intended simplicity with a flexible JSON structure inside that “post content” column. We no longer need post types at all when storing a post. A post can have any and all of the content types within it, instead of being siloed separately with a myriad of confusing options depending on the post type. Now a post can be a video and photo post at the same time! When the new editor on the web is fully released, we can finally say that this format is the fuel powering the engine of content on Tumblr. It’ll enable us to more quickly build out block types and layouts we couldn’t before, such as polls, blog card blocks, and overlapping images/videos/text. Sky’s the limit.

– @cyle

tl;dr

“We no longer need post types at all when storing a post. A post can have any and all of the content types within it, instead of being siloed separately with a myriad of confusing options depending on the post type. Now a post can be a video and photo post at the same time!”

Coming soon! (or switch on the beta for yourself the next time you’re prompted)

Leave a Reply

Your email address will not be published. Required fields are marked *