As a SaaS application, Lithium Social Web needs to present a consistent view of the engagement console across multiple browsers to customer service agents authenticated against the application. It must handle asynchronous client updates for the conversation workflow, and it uses Redis to manage this functionality. Redis is a popular key-value store, well known for its performance and small memory footprint, which makes it ideally suited for use in Lithium Social Web.
One of the ways we use Redis is for server event propagation to UI clients in conjunction with long polling. We chose this approach over WebSockets primarily due to our earlier infrastructure limitations — specifically, Elastic Load Balancing (ELB) on Amazon Web Services (AWS) did not have sufficient support for WebSockets.
We developed an alternative solution that uses a "message delivery" model, managing a "mailbox" in Redis for each active client session. For each relevant action in the conversation management workflow, the server generates an event that is published ("delivered") to all active mailboxes in Redis, from where it is eventually propagated to the associated clients. Events are tracked using strictly increasing id values, which allows a client and the server to synchronize using the "last event id" that the server sent to that specific client. Clients listen for events via long polling, passing the id of the most recent event they have received. The poll request returns when the client's mailbox in Redis contains one or more events that are newer (i.e. have higher id values) than the most recent id processed by that client.
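The mailbox flow described above can be sketched in a few lines of Python. A plain dict stands in for Redis here, and the function names and key layout are illustrative rather than our actual implementation; the roughly equivalent Redis commands are noted in comments.

```python
# Sketch of the mailbox model. A dict stands in for Redis; the
# equivalent Redis commands are noted in comments. Names are
# hypothetical, not the production API.

from itertools import count

event_ids = count(1)  # strictly increasing ids; in Redis: INCR event:next_id
mailboxes = {}        # one mailbox per session, e.g. key "mailbox:<session>"

def open_mailbox(session_id):
    """Create a mailbox when a client session becomes active."""
    mailboxes[session_id] = []

def deliver(event):
    """Publish an event to every active mailbox."""
    event_id = next(event_ids)
    for box in mailboxes.values():   # in Redis: ZADD mailbox:<s> <id> <event>
        box.append((event_id, event))
    return event_id

def poll(session_id, last_seen_id):
    """The long-poll check: return events newer than last_seen_id."""
    # in Redis: ZRANGEBYSCORE mailbox:<s> (<last_seen_id> +inf
    return [(i, e) for (i, e) in mailboxes[session_id] if i > last_seen_id]
```

Each agent's browser would call poll in a loop, with the server holding the request open until the returned list is non-empty.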
The above process repeats until a client is no longer active, at which point the corresponding mailbox in Redis is also discarded. On the flip side, when a new client starts, it receives a snapshot of the current UI view and then begins the long-poll process, using its own event counter for asynchronous UI updates.
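One plausible way to implement this lifecycle is to give each mailbox a TTL that every poll refreshes, so the mailboxes of clients that stop polling simply expire. The sketch below again uses a dict in place of Redis; the TTL value and all names are assumptions for illustration.

```python
# Hypothetical mailbox lifecycle: snapshot on start, TTL-based discard.
import time

MAILBOX_TTL = 60  # seconds; an assumed value

mailboxes = {}    # session_id -> [expires_at, events]

def start_client(session_id, snapshot_fn):
    """New client: take the current UI snapshot, then open a mailbox."""
    snapshot = snapshot_fn()
    mailboxes[session_id] = [time.time() + MAILBOX_TTL, []]
    return snapshot

def touch(session_id):
    """Each poll refreshes the TTL; in Redis: EXPIRE mailbox:<s> 60."""
    mailboxes[session_id][0] = time.time() + MAILBOX_TTL

def reap():
    """Discard mailboxes whose clients stopped polling.
    Redis does this automatically for keys with a TTL."""
    now = time.time()
    for sid in [s for s, (exp, _) in mailboxes.items() if exp <= now]:
        del mailboxes[sid]
```

In Redis itself, setting EXPIRE on the mailbox key lets the server reclaim the memory without any explicit cleanup job.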
But there is a catch… While the basic approach has worked well for our needs, our implementation choices yielded some useful lessons, especially about matching the data model design to the expected usage patterns.
The most important lesson was to avoid key lookups using a wildcard key search (with the KEYS command). While this is generally not recommended because of its slow (linear) performance characteristics, our limited usage for identifying active client mailboxes for event delivery was not expected to become a bottleneck. Unfortunately, an unintended namespace overlap with keys used for session management, combined with an unrelated bug that left "stale" session keys behind, meant that far more keys were included in each scan than expected. This resulted in intermittent connection failures from other components of the application that attempted to communicate with the Redis server while it was in the midst of a key scan.
Separately, the scalability of this event management model is becoming an important factor as we start to support an increasing number of concurrent clients. Since Redis keeps its entire dataset in memory, there is an effective upper bound on the number of events that can be stored in each client's mailbox. All else being equal, as the number of clients increases, the number of events that can be stored per client decreases. Alternatively, the number of events saved per client can be held constant at the expense of a larger memory footprint for the Redis instance. Another approach may be to partition data across multiple Redis instances and effectively gain a larger memory footprint, although this may not be suitable for all managed keys due to the restrictions partitioning can impose on key management.
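That per-client bound can be enforced by capping each mailbox and trimming the oldest events on delivery; in Redis this might be an LPUSH followed by LTRIM, or ZREMRANGEBYRANK on a sorted set. A sketch with a deliberately tiny, hypothetical cap:

```python
# Bounded mailbox sketch. A deque with maxlen stands in for a trimmed
# Redis structure: appending beyond maxlen silently drops the oldest
# entry, mirroring LPUSH + LTRIM (or ZADD + ZREMRANGEBYRANK).
from collections import deque

MAX_EVENTS = 3  # hypothetical cap, kept tiny for illustration

mailbox = deque(maxlen=MAX_EVENTS)

def deliver(event_id, event):
    """Add an event; the oldest event is trimmed once the cap is hit."""
    mailbox.append((event_id, event))
```

A client that falls more than MAX_EVENTS behind would then need to re-fetch a full snapshot, since the events it missed have been trimmed away.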
To mitigate the performance impact of key scanning, we have separated our namespaces so that mailbox keys do not overlap with any other keys used by the application. We are also evaluating whether the model should be updated to use alternative data structures that are better suited, and more performant, for identifying subsets of managed keys. In addition, an upcoming Redis release is expected to include a SCAN command that may provide a significant improvement over the KEYS command for key scans, even with our current model.
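One candidate for such an alternative structure is an explicit registry of active mailboxes kept in a Redis SET, so event delivery reads the registry (SMEMBERS) instead of scanning the whole keyspace (KEYS mailbox:*). A sketch, where the registry name and key layout are assumptions:

```python
# Registry-of-mailboxes sketch. A Python set stands in for a Redis SET
# kept under its own namespace, e.g. "registry:mailboxes"; the
# equivalent Redis commands are noted in comments.

active = set()

def register(session_id):
    # in Redis: SADD registry:mailboxes mailbox:<session_id>
    active.add(f"mailbox:{session_id}")

def unregister(session_id):
    # in Redis: SREM registry:mailboxes mailbox:<session_id>
    active.discard(f"mailbox:{session_id}")

def active_mailboxes():
    # in Redis: SMEMBERS registry:mailboxes
    return sorted(active)
```

The appeal of this shape is that membership updates are O(1) and listing costs O(number of active mailboxes), independent of how large the rest of the keyspace grows.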
For scalability related to the number of concurrent clients, we are also investigating moving to WebSockets for event propagation. WebSockets would obviate the need to store and manage events in Redis, because events would be delivered to clients immediately over the persistent connections that WebSockets create. We expect this to provide several benefits, including reduced resource requirements in Redis, but further analysis and performance measurements are necessary before we make a final decision.
Redis continues to be an important component of Lithium Social Web, and we look forward to sharing more details on our experiences with Redis and other core technologies in upcoming posts.
Sheetal Kakkad is a Senior Software Engineer at Lithium, with a Ph.D. in Computer Science from the University of Texas at Austin. Before Lithium, he worked on building enterprise software at companies such as Sun Microsystems/Oracle and Motorola, as well as smaller startups.