Flowdock is a collaboration tool for technical teams. From the very beginning, we had this idea of using tags to categorize real-time discussions. It works really well for building lists (like ToDo lists, lists of competitors, notes etc.) and for categorizing links (think delicious.com for your internal discussions). It changed the way we work. Still, sometimes you just forget to tag something – which is probably why our recently announced full-text search was a very common feature request.

We store all the messages in MongoDB. Implementing tags with multikey indices was trivial, and using the same mechanism for full-text search seemed to make sense. Some characteristics of our use case are:

  1. Each flow (workspace) is separate, and there’s never a need to search through all the messages in the database.
  2. Search results are simply sorted chronologically; there is no need for a “PageRank”.
  3. In Flowdock everything happens in real time, so people expect to see their messages in search results immediately.
  4. We use MongoDB to store messages (with 3 copies of each message + backups). Having all that data in another redundant system would be cumbersome.

Search implementation

The implementation was very straightforward. Every message document would simply get a new field, _keywords, that is populated from the message contents. At this point we decided not to do stemming, but to first get feedback from our customers about what’s relevant and what’s not. This has already resulted in adding the functionality to jump back in chat history.

A message document would now look like this:

{ id: 12345,
  author: …,
  content: "Hey @Otto, should you write the blog post?",
  flow: "flowdock:developers",
  _keywords: ["hey", "otto", "should", "you", "write", "the",
    "blog", "post"],
  _tags: ["user:12"]
}
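To illustrate, the _keywords above could be produced by something as simple as the following mongo shell / JavaScript sketch. This is not our actual backend code, just a minimal example of keyword extraction without stemming:

// Naive keyword extraction: lowercase the content, split on non-word
// characters and drop empty strings. No stemming, as discussed above.
function tokenize(content) {
  return content.toLowerCase().split(/\W+/).filter(function (word) {
    return word.length > 0;
  });
}

tokenize("Hey @Otto, should you write the blog post?");
// => ["hey", "otto", "should", "you", "write", "the", "blog", "post"]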

We needed to add a new index: { flow: 1, _keywords: 1, id: -1 }. Since a compound multikey index can contain at most one array field per document, this basically means that we’re not going to be able to search by tags and keywords at the same time.
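In mongo shell syntax (the messages collection name here is just an assumption), the index and a typical search would look roughly like this:

// The new compound index: one flow, one keyword, newest messages first.
db.messages.ensureIndex({ flow: 1, _keywords: 1, id: -1 }, { background: true });

// Search for "blog" within a single flow, sorted chronologically (newest
// first). With equality conditions on flow and _keywords, the sort on id
// can use the index.
db.messages.find({ flow: "flowdock:developers", _keywords: "blog" })
           .sort({ id: -1 })
           .limit(30);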

Keyword population was quietly enabled in our backend well before we rolled out the end-user UI. We also had a Scala script to go through all the older messages and populate the field, as sketched below.
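The idea of that backfill, expressed as a mongo shell sketch rather than our actual Scala code, and reusing the assumed collection name and tokenize helper from above:

// Populate _keywords for every old message that doesn't have the field yet.
db.messages.find({ _keywords: { $exists: false } }).forEach(function (msg) {
  db.messages.update({ _id: msg._id },
                     { $set: { _keywords: tokenize(msg.content) } });
});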

Operational issues

Even before this exercise, our system was in a state where the indices were too large to fit in memory. As it turned out, the _keywords index in our use case takes roughly 2500 MB per 1 million messages. Having accumulated more than 50 million messages, that works out to well over 100 GB for this index alone, which really turned out to be a challenge.

Tweeting about our indexing process

When running the migration, we encountered a couple of challenges:

  1. MongoDB supports generating the index in the background. However, since the service was running at the same time, index generation and active users were aggressively fighting over the available resources. Adding new messages already involved updating several large indices, and background index generation was filling the memory cache with its own data, so all queries started hitting the disk and the user experience was not tolerable. To solve this we actually removed _keywords from all the messages that the backend had populated so far, and then built the (empty) index. This way it was way faster.
  2. Running the _keywords population script also made all queries hit the disk. Since it’s not possible to prioritize MongoDB queries, we ended up adding lots of sleep() calls to our script: process 500 messages, sleep 5 seconds, and so on. By default MongoDB doesn’t ensure that a change has actually been written to disk before returning from the query, which can lead to a situation where writes just stack up because the indexing script is faster than your database. Setting the Write Concern level appropriately makes the queries block until they have actually completed. If possible, it’s also good to optimize for data locality: if your data is partitioned per organization, two consecutive messages from the same organization are likely to be closer to each other in the index than two completely random messages, which in some cases can ease the disk I/O a lot. Even with these tricks we still had to stop the script every once in a while when queries started getting too slow. (Both this throttling and the index trick from the previous point are sketched below.)
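Roughly, the two workarounds above look like this in mongo shell syntax (again a sketch with the assumed collection name and tokenize helper, not our actual script):

// Trick from step 1: unset the already-populated _keywords so the index can
// be built against documents that don't have the field, then build it in the
// background.
db.messages.update({ _keywords: { $exists: true } },
                   { $unset: { _keywords: 1 } },
                   { multi: true });
db.messages.ensureIndex({ flow: 1, _keywords: 1, id: -1 }, { background: true });

// Trick from step 2: throttle the backfill and block on the write concern so
// the script can't outrun the database.
var processed = 0;
db.messages.find({ _keywords: { $exists: false } }).forEach(function (msg) {
  db.messages.update({ _id: msg._id },
                     { $set: { _keywords: tokenize(msg.content) } });
  db.getLastError(1, 10000);   // wait until the write has been acknowledged
  if (++processed % 500 === 0) {
    sleep(5000);               // process 500 messages, sleep 5 seconds
  }
});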

In the end we took a couple of shortcuts to make our lives easier:

  1. We deleted millions of messages from organizations that had stopped using Flowdock during the beta phase.
  2. We ordered SSD disks from our service provider. It turned out to be the best investment ever! Of course this isn’t doable for people using Amazon EC2 and the like, but in our case we got a HUGE performance improvement. We could now populate _keywords on a live database with minimal load.

Another solution would of course have been to simply shard until the per-server indices fit in memory. Since in our use case people mostly access their newer messages, we felt that SSDs were a more efficient solution.

Now the search is running happily, with only a couple of queries per day taking more than 100 ms. Full-text search with MongoDB might be a good fit for many use cases; just be prepared for all the operational issues that any search solution will bring you.