Algolia’s Fury Road to a Worldwide API

August 14, 2015

by Julien Lemoine, CTO @algolia

The most frequent questions we answer for developers and devops are about our architecture and how we achieve such high availability.

Some of them are very skeptical about high availability with bare metal servers, while others are skeptical about how we distribute data worldwide.

However, the question I prefer is “How is it possible for a startup to build an infrastructure like this?” It is true that our current architecture is impressive for a young company:

  • Our high-end dedicated machines are hosted in 13 worldwide regions with 25 data centers
  • Our master-master setup replicates our search engine on at least 3 different machines
  • We process over 6 billion queries per month
  • We receive and handle over 20 billion write operations per month

Just like Rome wasn’t built in a day, neither was our infrastructure. This series of posts will explore the 15 instrumental steps we took when building our infrastructure. I will even discuss our outages and bugs so you can understand how we used them to improve our architecture.

The Cloud versus Bare metal debate

Before diving into the details of our architectural journey, I would like to address a choice that had big consequences for the rest of the infrastructure. We needed to decide whether to use a cloud-based infrastructure or bare metal machines, a hot topic that is regularly debated in technical discussions.

Cloud infrastructure is a great solution for most use cases, especially in the early stages. It has been instrumental in improving the high availability of many services.

Solutions that host their database across multiple availability zones (AZs), or that run several instances in different AZs while storing all of their state in that multi-AZ database, are perfect examples of this. This is a standard setup used by many engineers and can be deployed in a few minutes.

Bare metal infrastructures require you to understand and design the small details in order to build the high availability yourself. This is a do-it-yourself approach that only makes sense for a small set of use cases.

We often encounter deployments using bare metal machines in a single datacenter. This does not make sense, as it is less fault tolerant than a quick deployment on a cloud provider: the datacenter is a single point of failure (SPoF).

Bare metal hardware remains an interesting option for businesses closely tied to hardware, which happens to be our case. By choosing a bare metal infrastructure, we were able to purchase higher performance hardware than that offered by cloud providers. On top of the performance gain, the cost was significantly lower. We chose this option early on because we were fully aware we would need to build the high availability ourselves!

Step 1

March 2013

High availability was designed, not implemented!

At this time, we had the first private beta of our search-as-a-service API running. We could only measure our performance; we had not yet developed the high availability part of the product.

We were highly confident our market would be worldwide so we launched a single machine in two different locations, Canada/East and Europe/West, with the following specifications:

  • 32GB of memory
  • Xeon E3–1245 v2 (4 cores, 8 threads, 3.4GHz-3.8GHz)
  • 2x Intel SSD 320 series, 120GB each, in RAID-0

Each machine hosted a different set of users depending on their location. During our private beta, the focus was 100% on performance, which is why clock speed was a major factor in our decisions (for CPUs of the same generation, clock speed is directly related to the speed of a search query). From the beginning, we have run indexing in a separate process with a nice level of five. All search queries were processed directly inside nginx, which we kept at a nice level of zero (the lower the nice level of a process, the more CPU time it gets). This setup allowed us to handle traffic spikes efficiently by giving search the highest CPU priority. It worked very well compared to approaches used by other engines.
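
A minimal sketch of that priority split, assuming a Unix system and Python; the binary path and process names are hypothetical, not our actual tooling:

```python
import os
import subprocess

# Hypothetical sketch: launch the indexing worker at nice 5 so that search,
# served by nginx at nice 0, wins the CPU during traffic spikes.
def start_indexing_worker(binary="/usr/local/bin/indexer"):
    return subprocess.Popen(["nice", "-n", "5", binary])

def lower_own_priority(increment=5):
    # Alternative: a forked indexing worker can renice itself.
    os.nice(increment)
```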

We were very surprised when one of our first beta testers replaced their previous solution with ours in production because they were so happy with the performance and relevance. As you might imagine, we were very stressed out about this. Since high availability was not implemented, we were worried about potential downtime affecting them and explained the product was not ready for production! The customer told us the risk versus reward was acceptable to them because they could roll back to their previous provider if needed. On a side note, this story helped us secure our first round of funding before the launch of the product. It ended up being our first proof of market fit. Better yet, we can call it our problem-solution fit! We can’t thank that customer enough :)

Step 2

June 2013

Implementation of high availability in our architecture

After three months of development and a huge amount of testing (the monkey testing approach was really fun!), we introduced high availability support in our beta.

You can read more about it in our architecture post.

The idea was to have a cluster of three identical machines instead of one, where each machine is a perfect replica with all of the data and able to act as a master. This means that each one is capable of accepting write operations from API users. Each write operation triggers a consensus to make sure all machines have all the jobs and apply them in the same order.
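
As a rough illustration of that invariant (a sketch only, not our actual consensus implementation), a write is acknowledged only once a majority of replicas has applied it at the agreed position in the operation log:

```python
from dataclasses import dataclass, field

@dataclass
class Replica:
    applied: list = field(default_factory=list)

    def apply(self, seq: int, operation: dict):
        # Every replica applies operations strictly in the agreed order.
        assert seq == len(self.applied), "out-of-order apply is forbidden"
        self.applied.append(operation)

def commit(replicas: list, seq: int, operation: dict) -> bool:
    # A write operation succeeds only if a majority of replicas acknowledge it.
    acks = 0
    for replica in replicas:
        try:
            replica.apply(seq, operation)
            acks += 1
        except Exception:
            pass  # unreachable or out-of-sync replica
    return acks >= len(replicas) // 2 + 1
```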

We used the preliminary results from the first beta to design our new hardware setup. We discovered the following:

  • 32GB of memory was not enough: indexing was using up to 10GB when receiving big indexing jobs from several users, which left only 22GB to cache disk I/O
  • Disk space was too low for high availability, since machines needed to keep several jobs on disk in order to handle a node failure
  • Having more memory required us to move to the Xeon E5 series (the E3 can only address 32GB of memory). As clock speed was important, we decided to go for the Xeon E5 1600 series, which offered a very good clock speed and is able to address a lot more memory than the Xeon E3.

With these discoveries, our setup evolved into three machines with the following specs:

  • 64GB of memory
  • Xeon E5–1650 (6 cores, 12 threads, 3.2GHz to 3.8GHz)
  • 2x Intel SSD 320 series, 300GB each, in RAID-0

At this point we were able to tolerate hardware failures! However, we were still far from what cloud providers offer with multiple availability zones: all our machines were in the same datacenter with a single provider, and we had no knowledge of how they were physically allocated.

At the same time, we investigated whether load balancing and failure detection between machines should be handled with hardware or software. We tested several approaches and found that all hardware load balancers would make the use of several providers close to impossible. We ended up implementing a basic retry strategy in our API clients. Each API client was developed to be able to access three different machines. Three different DNS records represented each user: USERID-1.algolia.io, USERID-2.algolia.io, and USERID-3.algolia.io. Our first implementation was to randomly select one of the records and then retry with a different one in case of failure.
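
A minimal sketch of that first retry strategy, assuming a Python client built on the requests library; the hostname pattern is the one described above, everything else is illustrative:

```python
import random

import requests  # assuming the standard 'requests' HTTP library

def query_with_retry(user_id: str, path: str, payload: dict, timeout=2.0):
    # Shuffle the three records so load is spread and a dead host is simply
    # skipped on the next attempt.
    hosts = [f"{user_id}-{i}.algolia.io" for i in (1, 2, 3)]
    random.shuffle(hosts)
    last_error = None
    for host in hosts:
        try:
            response = requests.post(f"https://{host}{path}", json=payload, timeout=timeout)
            response.raise_for_status()
            return response.json()
        except requests.RequestException as error:
            last_error = error  # try the next host
    raise last_error
```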

Step 3

August 2013

Official launch of the service

During the summer, we increased the number of API clients to 10 (JS, Ruby, Python, PHP, Objective-C, Java, C#, Node.js…). We decided to avoid using auto code generation and developed the API clients manually.

While it was more work, we needed to make sure the networking code was solid: HTTPS keep-alive, using TLS correctly, implementing our retry strategy with the right timeouts, and so on.

We officially launched the service at the end of August 2013 with our two locations (Europe/West and Canada/East). Each location contained a cluster of three identical hosts with the following specs:

  • 128GB of RAM
  • E5–2687W (8 cores, 16 threads, 3.1GHz to 3.8GHz)
  • 2x Intel S3500 series, 300GB each, in RAID-0

Compared to the previous configuration, the main changes were increasing the memory size and using better SSDs. These two changes were based on the observation that the SSD was the bottleneck during indexing and that there was not enough memory to cache all of the users’ data. For the CPU upgrade, it was more a question of oversizing to ensure we would have enough resources.

At this point, the next big item for us to focus on was implementing availability zones for our deployments. We needed to run our three machines on different network equipment and power units. Fortunately, our provider was transparent about their infrastructure and where our machines were allocated. It wasn’t perfect, but we were able to implement a solution similar to those of the different cloud providers. We suspect the cloud providers do something similar to what we implemented, but we have not found any detailed documentation on this topic!

Step 4

January 2014

Deployment Is A Big Risk For High Availability

At this time, one of our major concerns was making sure our development stayed agile without sacrificing stability.

For this reason, we developed a test suite of 6,000+ unit tests and 200+ non-regression tests to ensure code changes did not introduce new bugs. Unfortunately, it wasn’t enough. A new feature that passed all our tests introduced a bug in production that caused eight minutes of indexing downtime. Thankfully, because our architecture was designed to keep search queries separate from indexing, no search queries were impacted.

This issue helped us identify several problems in our high availability setup.

Rollbacks need to be fast, so we developed the ability to perform a rollback with a single command.

Deployment scripts need to perform sanity checks and automatically roll back if there are errors, so we added both to our deployment procedure.

We shouldn’t deploy to all of our production clusters just because the tests passed. After this issue, we put in place a deployment sequence for clusters: we start with our test clusters, followed by the community clusters (free customers), and finally the paying-customer clusters.

Today, our deployment is a bit different when there is a new feature to test. We select a set of clusters to test on and deploy using the following steps (a sketch of the rollout follows the list):

  1. Deploy to the first machine in all clusters of the selected set.
  2. Monitor for 24 hours then deploy to the second machine in all clusters of the selected set.
  3. Monitor for 24 hours then deploy to the third machine in all clusters of the selected set.
  4. After a few days, we deploy to all our production clusters.
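
An illustrative skeleton of that rollout in Python; deploy(), healthy(), and rollback() are hypothetical stubs standing in for our real tooling:

```python
import time

def deploy(machine, version):
    print(f"deploying {version} to {machine}")   # placeholder

def healthy(cluster) -> bool:
    return True                                  # placeholder health check

def rollback(clusters, version):
    print(f"rolling back {version}")             # single-command rollback path

def staged_rollout(clusters, version, soak_hours=24):
    for machine_index in (0, 1, 2):              # one machine per cluster per wave
        for cluster in clusters:
            deploy(cluster["machines"][machine_index], version)
        time.sleep(soak_hours * 3600)            # monitoring window between waves
        if not all(healthy(c) for c in clusters):
            rollback(clusters, version)
            raise RuntimeError("rollout aborted")
    # After a few days of soaking, the version is promoted to all clusters.
```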

Using this approach, we were able to detect within a few hours bugs that were nearly impossible to find with unit tests. This detection was possible because of the billions of queries we handle per month.

Step 5

March 2014

Manage High Load Of Write Operations

While our clusters in Canada/East and Europe/West were handling our growth very well, we moved to solving a new problem: latency.

The latency from Asia to our clusters was too high for acceptable performance. We decided to test the market first by deploying machines on AWS. We didn’t like having to make this choice, since the performance of search queries was 15% lower than on our E5–2687W CPUs, even when using the best CPU offered by AWS (the E5–2680 at that time)! We went this route in order to reduce the time to launch the experiment, but made sure not to introduce any dependencies on AWS. This would allow us to easily migrate to another provider if the tests were successful. On paper, it all looked good. We ran into one issue when we discovered their virtualization did not support AVX instructions, which was an important feature for us. We had to generate binaries without the AVX instruction set (which again reduced the performance of search queries a bit).

While we were enjoying our recent launch in Singapore we started to receive some complaints from European customers about an increased latency in their search queries. We quickly identified it was correlated with a big spike of indexing operations, which was weird since we were confident our architectural split between indexing and search CPU usage was working correctly. After investigating, we found that the consensus algorithm for distributed coherency across the cluster that handled write operations had a potential bottleneck. When the bottleneck occurred, it would block the HTTP server threads. This would then lead to search queries needing to wait for a thread. We had accidentally created this issue by design!

We fixed this issue by implementing a queue before our consensus. The queue allows us to receive a spike of write operations, batch them, and then send them to the consensus. The consensus is no longer in the critical path of the HTTP server, so write operations can no longer freeze the HTTP server threads. During normal operation, the queue does nothing except pass jobs to the consensus; when we have spikes, it buffers and batches them before calling the consensus. Since we made this change, we have not had any cluster freezes.
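
A minimal sketch of that decoupling in Python, with a hypothetical submit_to_consensus() standing in for the real consensus call; it shows the idea, not our actual implementation:

```python
import queue
import threading

write_queue: "queue.Queue[dict]" = queue.Queue()

def handle_write_request(job: dict):
    # Called from an HTTP server thread: it only enqueues and never blocks
    # on the consensus, so search queries always find a free thread.
    write_queue.put(job)

def submit_to_consensus(batch):
    print(f"replicating a batch of {len(batch)} jobs")  # placeholder

def consensus_worker(max_batch=100):
    while True:
        batch = [write_queue.get()]              # block until at least one job
        while len(batch) < max_batch:
            try:
                batch.append(write_queue.get_nowait())
            except queue.Empty:
                break
        submit_to_consensus(batch)               # spikes arrive as batches

threading.Thread(target=consensus_worker, daemon=True).start()
```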

Step 6

April 2014

Network High Availability Is Close To Impossible With One Data Center

In the beginning of April 2014, we started receiving complaints from our US-East customers that used a cluster in Canada/East.

Our users in US-West were not impacted. We immediately suspected a network issue between Canada and US-East and discovered our provider was reporting the same issue. A car accident had broken a pipe containing 120 fibers between Montreal and New York, and all the traffic was being rerouted along a different path through Chicago. Unfortunately, the bandwidth there was not good enough to avoid packet loss. This outage impacted not only our provider but all traffic between Canada and US-East. It is the type of accident we often forget can break the Internet, but it can still happen. There was nothing we could do that day to help our customers except reach out to each one and notify them of the current situation.

It was on this day we kicked off a big infrastructure discussion. We needed to improve our high availability by relying on more providers, datacenters, and network providers. More importantly, we needed to have our infrastructure in the US. By serving the US from Canada, we were able to cut our costs but our high availability suffered. We needed to have our infrastructure truly distributed. We couldn’t continue having several availability zones on a single provider!

Step 7

July 2014

First Deployment On Two Data Centers

We had discovered our deployment in a single datacenter was not enough.

We decided to start with one of our biggest customers by deploying machines in two different datacenters (with more than 100 kilometers between them). The two datacenters used the same provider (same Autonomous System) and already offered a level of high availability a bit above a cloud provider’s multiple availability zones, because the two locations were completely different in terms of network links and power units.

Based on our past experience, we took this time to design the next version of our hardware with the next generation of CPUs and SSDs:

  • Xeon E5–1650v2 (6 cores, 12 threads, 3.5GHz-3.9GHz)
  • 128GB of RAM
  • Intel SSD S3500 (2x480GB) or S3700 (2x400GB); we use the Intel S3700 for machines with a high volume of writes per day, as they are more durable

The changes we made were mostly around the CPU. We were never using 100% of the CPU on our E5–2687W. The E5–1650v2 was the next generation CPU and provided a higher clock speed. The result was that our service gained close to 15% in speed compared to the previous CPU. Milliseconds matter!

Step 8

September 2014

Presence In The US!

After spending a few months in discussions with several providers, we launched the service in US-East (Virginia) and US-West (California) in September 2014 with a new provider.

This first step was a great way to improve our search experience in the US thanks to better latency and bandwidth. The improvement to high availability was not big, but it did help us be more resilient to the problems we had previously experienced between Canada and US-East. We used the same approach as with previous locations: different availability zones within one provider (different network equipment and power units).

Step 9

October 2014

Automation Via Chef

With a significant increase in the number of machines we needed to manage, we migrated our management to Chef.

Previously, we managed our machines using shell scripts, but fixing security issues like Heartbleed (the OpenSSL vulnerability) was time consuming compared to doing it with Chef’s automation.

The automation provided by Chef was very useful when configuring hundreds of machines, but it had its drawbacks. A few months after migrating to Chef, a typo in a cookbook caused some production servers to crash. Luckily, we were able to discover the problem early on. It didn’t impact much of production thanks to the non-aggressive schedule of the cron job that launched the Chef client, which gave us enough time to react and fix the issue.

We got really lucky with this issue, because the error could have broken production. To prevent this type of problem from happening again, we decided to split our production cookbooks into two versions:

  • The first version, the stable one, was deployed to the 1st and 2nd machine of all clusters
  • The second version, the production one, was deployed on the 3rd machine of all clusters

When any modification is made to our cookbook, we first apply the modification to the production version. After a few days of testing, we apply the modification to the stable cookbook version. High availability needs to be applied at all levels!

Step 10

November 2014

DNS Is A SPoF In The Architecture

Over time, we started to receive more and more feedback from users stating our service was intermittently slow, especially in Asia.

After investigating, we found that our use of the .io TLD was the cause of this issue. It turned out the anycast network of the .io TLD had fewer locations than the main TLDs (.net, .com, and .org) and its DNS servers were overloaded. Users would sometimes encounter timeouts during DNS resolution.

We discussed this issue with CDN providers and they all told us the best practice was to use the .net TLD. We needed to migrate from .io to .net! While preparing for the release of our Distributed Search Network, we decided to move to another DNS provider. This new provider offered the linked domains feature, which meant they would handle the synchronization between algolia.io and algolia.net for us so we could maintain backwards compatibility easily.

We performed extensive testing from different locations for this migration and felt comfortable that everything was working well. Unfortunately, after we completed the migration, we discovered several problems that impacted some of our users. You can read a detailed report in this blog post. DNS is far more complex than we realized: there are many different behaviors among DNS providers, and our tests did not cover all of them (e.g., IPv6-first resolution, different behaviors regarding TTLs, etc.).

This issue also changed our minds about identifying SPoFs. Having only one DNS provider is a SPoF, and the migration of this piece was highly critical since there was no fallback. We started working on a plan to remove every SPoF from our architecture.

Step 11

February 2015

Launch Of Our Synchronized Worldwide Infrastructure

During this month, we accomplished the vision we had been working towards since April 2014, “worldwide expansion to better serve our users”.

This network was composed of 12 different locations: US-East (Virginia), US-West (California), Australia, Brazil, Canada, France, Germany, Hong Kong, India, Japan, Russia and Singapore. On top of all that, we also launched our “Distributed Search” feature. With this feature, in a few clicks, you are able to set the locations in our network where your data should be automatically duplicated. You get to use the same API as before and queries are automatically routed from the end-user’s browser or mobile application to the closest location.

This was a huge step in reducing latency for end users and improving our high availability via worldwide distribution of searches. Serving international users from one location leads to a very different quality of service because of distance and saturation of Internet links. We now provide our users a way to solve that! On top of reducing latency, this improved the high availability of our search infrastructure because it was no longer dependent on a single location. It’s worldwide!

Some people compare our DSN (Distributed Search Network) to a CDN (Content Delivery Network), but it is very different. We don’t store a cache of your frequent queries on each edge. We store a complete replica of all your data. We can respond to any query from the edge location itself, not just the most popular. This means when you select three POPs (US-East, Germany, Singapore), users in Europe will go to the Germany location, users in Asia will go to Singapore, and users in America will go to US-East. All POPs will respond to queries without having to communicate to another edge. This makes a huge difference in terms of user experience and high availability.

In order to support this change, we updated the retry logic in our API clients to target the APPID-dsn.algolia.net hostname first, which is routed to the closest location using DNS based on geoip. If the closest host is down, the DNS record is updated in less than one minute to remove it and return the next closest host. This is why we use a low TTL of one minute on each record. If the host is down and DNS has not yet been updated, we redirect traffic to the master region via a retry on APPID-1.algolia.net, APPID-2.algolia.net, and APPID-3.algolia.net in our official API clients. This approach is the best balance we have found between high performance and high availability: we only suffer a degradation of performance for one minute in case of failure, but the API remains up and running!
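
The resulting host ordering can be sketched as follows; the retry loop itself is the same as in the earlier client sketch, only the host list changes:

```python
import random

def retry_hosts(app_id: str):
    # The geo-routed DSN record comes first (1-minute TTL, resolves to the
    # closest healthy location); the three machines of the main cluster act
    # as a fallback while DNS converges.
    fallbacks = [f"{app_id}-{i}.algolia.net" for i in (1, 2, 3)]
    random.shuffle(fallbacks)
    return [f"{app_id}-dsn.algolia.net"] + fallbacks
```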

Step 12

March 2015

Better High Availability Per Location

The Distributed Search Network option was a game changer for high availability, but it only applies to search and to international users.

In order to improve the high availability of the main region, we spread our US clusters across two completely independent providers:

  • Two different datacenters in close locations (e.g., COPT-6 in Manassas and Equinix DC-5 in Ashburn: 24 miles apart, 1ms of latency)
  • Three different machines, just like before (two in the first datacenter in different availability zones and one in the second datacenter)
  • Two different Autonomous Systems (so two totally different network providers)

These changes improved our ability to react to certain problems we faced, such as the saturation of a peering link between one provider and AWS. Having different providers gives us the option to reroute traffic through other providers. This is a big step in terms of high availability improvements within one location.

Step 13

April 2015

The Random File Corruption Headache

April 2015 was a black month for our production team. We started observing random file corruptions on our production machines due to a bug in the TRIM implementation of some of our SSDs.

You can read about the problem in detail in this blog post. It took us approximately one month to track down and properly identify the issue. During this time, we suspected everything, starting with our own software! This was also a big test for our architecture. Having random file corruptions on disk is not something easily handled, nor is it easy to inform users that they need to reindex their data because our disks were corrupted. Luckily, we never lost any of our customers’ data.

There are two important factors in our architecture that allowed us to face this situation without losing any data:

  • We store three replicas of the data. The probability of the same data being randomly corrupted on all three machines is almost zero (and it didn’t happen).
  • More importantly, we did not replicate the result of indexing. Instead, we duplicated the operation received from the user and applied it on each machine. This design makes the probability of several machines having the same corruption very low, since one affected machine cannot “contaminate” the others.

We never considered this scenario when we designed our architecture! Even though we tried to think about all types of network and hardware failures, we never imagined the consequences of a kernel bug or, even worse, a firmware bug. The fact that our machines were designed to be independent is the reason we were able to minimize the impact of the problem. I highly recommend this kind of independence to any system that needs high availability.

Step 14

May 2015

Introducing Several DNS Providers

We decided to use NSONE as a DNS provider because of their great DNS API.

We were able to configure how we wanted to route queries for each of our users via their API.

We also really liked their support for edns-client-subnet, because it was key to having better accuracy for geo-routing. These features made NSONE a great provider, but we couldn’t put ourselves at risk by having only one DNS provider in our architecture.

The challenge was to introduce a second DNS provider without losing all the powerful features of NSONE. This meant having two DNS providers on the same domain was not an option.

We decided to use our API clients’ retry strategy to introduce the second DNS provider. All API clients first try to contact APPID-dsn.algolia.net, and if there is any problem, they retry on a different domain, TLD, and provider. We decided to use AWS Route 53 as our second provider. If there is any problem, the API client retries randomly on APPID-1.algolianet.com, APPID-2.algolianet.com, and APPID-3.algolianet.com, which target the three machines of the main cluster. This approach allowed us to keep all of the interesting geo-routing features of NSONE on the algolia.net domain while introducing a second provider, for better high availability, on the algolianet.com domain.
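
Extending the earlier host-ordering sketch, the fallback list now spans the two domains, and therefore the two DNS providers:

```python
import random

def retry_hosts_two_providers(app_id: str):
    # algolia.net is served by NSONE (geo-routed DSN record first), while
    # algolianet.com is served by AWS Route 53, so a full outage of one DNS
    # provider still leaves a reachable path to the main cluster.
    fallbacks = [f"{app_id}-{i}.algolianet.com" for i in (1, 2, 3)]
    random.shuffle(fallbacks)
    return [f"{app_id}-dsn.algolia.net"] + fallbacks
```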

Step 15

July 2015

Three Completely Independent Providers Per Cluster

You might now consider that with the infrastructure we have developed, we are fully resilient to any problem… but the reality is different.

Even services hosted by a cloud provider across multiple availability zones can have outages. We see two big reasons:

  • A link or router starts producing packet loss. We have seen this multiple times between US-East and US-West, including on a border ISP router of a big cloud provider while they were not even aware of it.
  • Route leaks. These particularly impact the big players that a large portion of the Internet relies on.

Our improved setup in the US now allows us to build clusters spanning multiple datacenters, multiple Autonomous Systems, and multiple upstream providers. That said, in order to accept indexing operations, we need to have a majority of machines up, which means two machines out of three. When using two providers, if one of them goes down we could potentially lose the indexing service, although search would still be available. That’s why we decided to go even further and host clusters across three completely independent providers (different datacenters in locations close to each other, relying on different upstream providers and Autonomous Systems). This setup allows us to provide high availability of both search and indexing via an ultra-redundant infrastructure. And we provide all of that with the same easy-to-use API!
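
The quorum arithmetic behind that decision can be sketched as follows:

```python
def can_index(machines_up: int, cluster_size: int = 3) -> bool:
    # Write operations need a majority of the cluster so the consensus can
    # make progress: 2 machines out of 3.
    return machines_up > cluster_size // 2

def can_search(machines_up: int) -> bool:
    # Search only needs one reachable replica, since each machine holds a
    # complete copy of the data.
    return machines_up >= 1

# Two providers hosting 2 + 1 machines: losing the first provider leaves
# 1 machine of 3, so indexing stops. Three providers hosting 1 + 1 + 1:
# losing any single provider still leaves 2 of 3, so indexing keeps running.
```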

Building A Highly Available Architecture Takes Time

I hope our journey was inspiring and provided useful details about how we started and where we are today. If you are a startup, don’t worry if you don’t have the perfect infrastructure in the beginning. This is to be expected! But you should think about your architecture and how to scale it early on. I would even recommend doing it before your beta!

Having the architecture designed early was our secret weapon. It allowed us to focus on execution as soon as our market fit was there. Even today, we have some scalability/availability features that were designed very early on but are still not implemented. They will be coming in the near future for sure :)

This story was originally published on highscalability.com as three separate articles.
