Searching for Search
The Flowdock team wants Flowdock to be the place that connects all of the information for your work. Part of our commitment to you, our customer, is offering the best search we can to help you find the information you need to get stuff done. We have heard a lot of feedback about issues with our current approach to search, and we hear you loud and clear. To better support searching for messages and inbox items, we’ve redesigned our search storage and indexing pipeline.
Today’s Search

Today, Flowdock tokenizes the text content of a chat message or inbox message into an array of words. So, “the quick brown fox jumps over the lazy brown dog” tokenizes into the array [“the”, “quick”, “brown”, “fox”, “jumps”, “over”, “lazy”, “dog”] (duplicate words are dropped). Any word longer than 20 characters or shorter than 2 characters is skipped. This is faster and more space efficient than a B-tree search of the content, which is what we would get if we indexed the content field. It is also simple to build and fits into the existing Flowdock architecture: we can build the keywords array in the same atomic operation that updates the content.
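A rough sketch of this keyword extraction, as we understand it from the description above (the function name and exact splitting rules are illustrative):

```typescript
// Tokenize message content into a deduplicated keyword array,
// skipping words longer than 20 or shorter than 2 characters.
function extractKeywords(content: string): string[] {
  const words = content.toLowerCase().split(/\s+/).filter(Boolean);
  const keywords: string[] = [];
  for (const word of words) {
    if (word.length < 2 || word.length > 20) continue; // length filter
    if (!keywords.includes(word)) keywords.push(word); // dedupe, keep first occurrence
  }
  return keywords;
}

console.log(extractKeywords("the quick brown fox jumps over the lazy brown dog"));
// → ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
```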
This approach has a few drawbacks. The index has no concept of common or stop words (like “the”, “and”, “a”, and “an” in English), cannot handle fuzzy or partial matches, and takes up a large amount of disk and RAM compared to an index purpose-built for text search. It also does not score terms that appear close together higher: “quick brown fox” scores the same as “lazy dog the”, because the keywords array matches the same set of terms either way.
Elasticsearch – Our Shiny New Engine

We had experience with Elasticsearch from previous projects and appreciate its scalability, easy-to-use programming model, and search grammar. Although Elasticsearch is often used for log indexing, it is also great for indexing and retrieving application data. We evaluated several vendors that host Elasticsearch and decided on Amazon Elasticsearch Service, since we already run most of our services on Amazon. We definitely did not want to stand up our own Elasticsearch cluster – as easy as it might be to get a cluster going, we view our mission as delivering value to the customer, and managing yet another data store would detract from that mission. Amazon Elasticsearch Service lets us focus on higher-level metrics of index health instead of lower-level infrastructure concerns. We’re excited that we’re now working with an index that is half the size of our existing inverted index!
Our Pipeline – Kinesis

Kinesis is a scalable message stream that lets us publish new messages and enqueue messages for re-indexing without disrupting our existing Redis infrastructure. We think of it as a big, scalable pipe that we push messages onto for further processing. Kinesis acts as a multiplexing queue – every subscriber can consume messages and manage its own high-water mark. This allows multiple consumers to read the same stream and perform different actions. For example, one consumer might be building a search index while another does natural language processing on the content to build tag arrays. Kinesis has a configurable retention policy, which we set to 7 days to maximize our ability to retry. It also allows us to partition data – we chose flow ID as our partition key – so that we get a good read and write distribution while maintaining in-order processing of messages within a given flow.
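A sketch of how a message might be published to the stream keyed by flow ID (the message shape, stream name, and helper function here are illustrative, not our production code):

```typescript
interface FlowdockMessage {
  flowId: string;    // partition key: keeps one flow's messages in order
  messageId: string;
  content: string;
}

// Build the parameters for a Kinesis PutRecord call. Using the flow ID
// as the partition key routes all of a flow's messages to the same shard,
// preserving per-flow ordering while spreading flows across shards.
function toPutRecordParams(msg: FlowdockMessage, streamName: string) {
  return {
    StreamName: streamName,
    PartitionKey: msg.flowId,
    Data: JSON.stringify(msg),
  };
}

// With the AWS SDK, this would be sent with something like:
//   new AWS.Kinesis().putRecord(toPutRecordParams(msg, "message-stream")).promise();
const params = toPutRecordParams(
  { flowId: "flow-123", messageId: "m-1", content: "hello" },
  "message-stream"
);
```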
Peanut Butter and Chocolate, or Serverless and Lambda?

We decided to use the Serverless Framework – a lightweight Node.js package for managing functions-as-a-service deployments. Serverless was super easy to get started with, and we’re definitely looking to use it further in the future. We used it to build and deploy our search indexer function: TypeScript code that takes incoming Flowdock messages, massages them to meet Elasticsearch requirements, and saves them to Elasticsearch. We deploy the same code for both indexing and reindexing; the only difference is which Kinesis stream it reads from. Serverless gives us an abstraction, so if we ever need to deploy to another cloud, we can use different Serverless plugins to create services in those providers with minimal effort.
Example of deploying the same function for multiple streams:
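A minimal sketch of what this could look like in serverless.yml (the function name, handler path, and stream ARNs are hypothetical):

```yaml
functions:
  searchIndexer:
    handler: src/indexer.handler        # same TypeScript handler for both streams
    events:
      # live indexing stream
      - stream:
          type: kinesis
          arn: arn:aws:kinesis:us-east-1:123456789012:stream/messages
      # reindexing stream: same code, different source
      - stream:
          type: kinesis
          arn: arn:aws:kinesis:us-east-1:123456789012:stream/messages-reindex
```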
Lambda is AWS’s functions-as-a-service offering. Not having to manage infrastructure lets our teams focus on crafting the indexing logic that adds value to Flowdock rather than on servers. Lambda supports Node.js, and since we are big fans of TypeScript from a previous project, it was a clear winner.
Lambda has a few interesting “gotchas”. When the Lambda function exits with an error, Lambda will retry. If the error is caused by something downstream – say, a connection issue with the Elasticsearch cluster – the retry will succeed once the connection is stable. However, if there is a bug in the code that we wrote, we have to treat the error as an alarm condition and push a code fix before the data expires from the Kinesis stream. This attention to errors is both good hygiene and ensures we keep processing messages in order, as our consumer will not move its high-water mark forward until it successfully processes the message. We’re using CircleCI, as it has built-in Serverless support, and we continuously deploy our develop branch to QA and master to production.
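The retry behavior described above can be sketched as follows. This is a simplified, hypothetical handler – the record shape, the error names checked, and the stubbed helpers are illustrative – but it shows the key move: always rethrow so Lambda retries and the high-water mark does not advance, and additionally raise an alarm when the failure looks like a bug in our own code.

```typescript
// Classify an error: transient downstream failures just retry,
// while anything else also raises an alarm. The names here are
// illustrative, not an exhaustive list.
function isTransient(err: Error): boolean {
  const transientNames = ["ConnectionError", "TimeoutError", "TooManyRequests"];
  return transientNames.includes(err.name);
}

// Hypothetical Kinesis-triggered handler.
async function handler(records: { data: string }[]): Promise<void> {
  for (const record of records) {
    try {
      await indexIntoElasticsearch(JSON.parse(record.data));
    } catch (err) {
      if (!isTransient(err as Error)) {
        alarmOnCodeError(err as Error); // page a human: the fix must ship before Kinesis retention expires
      }
      throw err; // rethrow either way so Lambda retries and ordering is preserved
    }
  }
}

// Stubs standing in for the real indexing and alerting code.
async function indexIntoElasticsearch(doc: unknown): Promise<void> { /* ... */ }
function alarmOnCodeError(err: Error): void { /* ... */ }
```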
Templates with CloudFormation

We’re using CloudFormation for infrastructure automation of components and services not covered by Serverless. Serverless uses CloudFormation under the covers, so we’re building on the same underlying technology. CloudFormation let us create simple JSON templates for Elasticsearch and Kinesis, so we can deploy to our QA environment and to production with the same configuration.
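For example, a minimal CloudFormation template for the Kinesis stream might look like this (the logical name and shard count are illustrative; 168 hours matches the 7-day retention mentioned above):

```json
{
  "Resources": {
    "MessageStream": {
      "Type": "AWS::Kinesis::Stream",
      "Properties": {
        "ShardCount": 2,
        "RetentionPeriodHours": 168
      }
    }
  }
}
```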
The Bright Future!

Now that our new search index is in place and we are adding five million new documents a week with minimal infrastructure lift, we’re excited about the possibility of cross-flow search, searching one-on-ones, and fuzzy matches. Right now, we’re shipping fuzzy search and matching on these criteria:
- Tags (#featureRequest) match the indexed value’s keyword field using a term query. This effectively matches the unanalyzed query text against an unanalyzed keyword.
- Term + keyword match for queryable fields like content, body text, and thread title. This is an exact match of the given query.
- Common match filter – this treats any term that appears in more than 0.1% of the index as common, and uses a match query to analyze the query text and compare it to the analyzed text in the index.
- Fuzzy match of the given query against the queryable fields. Elasticsearch and Lucene use Levenshtein distance to determine how close a fuzzy match is, and we tune parameters to limit false positives.
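Put together, a search request combining these kinds of clauses might look roughly like this in the Elasticsearch query DSL (the field names, query text, 0.001 cutoff, and fuzziness setting are illustrative, not our exact production query):

```json
{
  "query": {
    "bool": {
      "should": [
        { "term":   { "tags": "#featureRequest" } },
        { "common": { "content": { "query": "quick brown fox", "cutoff_frequency": 0.001 } } },
        { "match":  { "content": { "query": "quick brown fox", "fuzziness": "AUTO" } } }
      ]
    }
  }
}
```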
We are really excited about what we’re shipping now and what we’ll be shipping in the near future as this enhanced index gives us the ability to keep delivering awesome search features to our customers.