Our systems are built with some principles in mind:
Here’s a very basic overview of our architecture. I'll start with a very high level overview and break it down as we go. I’ll also talk about how our principles are expressed in our system.
At the highest level we have 3 major parts - The frontend, data storage and the backend. The frontendis everything a customer sees and works with. This includes the main site where they set up their policies. The data storage is where we store the information we find, and the backend consists of the various workers that collect the data.
By having a clean split between the backend and the frontend, we are able to make changes independently. This makes our systems resilient - if I push buggy code into the backend systems, the frontend keeps on working fine, which keeps our customers worry-free.
Our backend is a series of workers: crawling, parsing/analysis, link following and final analysis. The initial crawlers gather a web page or SERP (depending on the policy). Then the initial parsing/analysis workers inspect what we found. They look for affiliate information, interesting links and anything else that we should look into more. A selection of those links are passed to the link followers to see where they go. Once we've followed all the interesting links found on the page, we will pass the page onto the final analysis stage. In final analysis we look at how a policy is set up and see if we should alert a customer to what we found.
Having individual worker types for each stage provides a great number of benefits. We know exactly who's doing what (transparency), which makes both scaling and debugging easier. It's easy to scale - just add the worker type you need. If there are problems, you can follow the chain and find out exactly where it's breaking down.
For example: if we're successfully doing the initial crawling but are not doing link following, then the problem must be in the initial parsing/analysis.
We use SQS and queues all throughout our system. Between each worker there is a queue. Each message on the queue represents one job that our backend system needs to do. Workers update the job as it moves through the system.
For example, a customer might create a policy for monitoring discount tire ads. Once they've set it up, our system will generate a job which basically says 'Find all ads for "discount tires" on Google in Canada'. A crawler worker will see that job and attempt to do the search. If it's successful, it will delete the message off the first queue and add a new one to the downstream link parsing/analysis queue.
Adding in more detail, our system looks more like this:
Using a queue/worker system is one of the reasons our systems are so great.
The resiliency of this kind of system is amazing. If someone, say yours truly, pushes bad code and causes all of the link followers to break - we don't lose work! Unsuccessful jobs will not get deleted from the upstream queue. That means once I've fixed my bug, the workers will start work again.
This resiliency has a bunch of benefits. For one, we don't have to make sure every change is perfect before it gets pushed to production. We try to do our best, but if bad code goes out it's not the end of the world. This means we don't have to have a long and painful test cycle, we can iterate fast and it lowers the overall stress of pushing to production.
We also get really good insights into the health of the system by looking at queue dynamics. If queues are growing (i.e., deletes are lower than sends), we know that somethings is up and we should investigate. We can know about and fix problems before the customer sees them.
We use S3 extensively for webpage captures - including the raw HTML. We use DynamoDB's fast NoSQL storage for keeping track of links, redirects and other data that's interesting to our customers. Dynamo's ability to scale helps allows us to have many many workers all updating different data, all at the same time. We also lean heavily on Redis for caching and performance.
So our system looks more like this:
Basically our workers will save data to the data store, then update the job with the key for finding the data. Instead of storing a bunch of data in the database, we can store the keys for the other datastores. This keeps the DB lean and fast, but allows us to retrieve the information when a customer needs it.
On the worker side, instead of saving the html inside of the job, we'll save the html to an S3 bucket and then update the job with the S3 key for looking up the html. The parsers/initial analysis then can pull the page from S3, look for interesting links then pass those to the link followers.
Autoscaling and Spot Fleets:
I keep talking about workers, but what do I mean really? We use EC2 instances for all the servers, each server having a number of worker processes based on the machine type. Basically if we have a big beefy m4.4xlarge we can have say 100 workers, but a much smaller m3.medium might have only 10.
Our system automatically scales up and down the number of instances based on the amount of work in each stage. In our system that's the same as the number of messages in the queue. If it's above a certain threshold we'll automatically add instances, below and we'll remove instances. This means our system can automatically grow and shrink in response to customer usage.
This is a more accurate representation of how our system looks:
One nice benefit of using auto-scaling is that our system becomes more resilient to (poor) code performance. If I push poorly performing code our workers will slow down. The system notices less work is being done and automatically scales up by adding more instances and thus more workers. This gives me time to fix the performance while the system keeps on working.
We don't have to performance test every change, just pay attention to how our system dynamics change. This feels quite liberating to me!
It is also really easy to track down exactly what change I pushed that made our system slow down - we can look at when we started to scale up and the commit history. This transparency saves so much time in tracking down many performance issues.
So far I haven’t given the frontend enough credit. We don't just have Webservers, we have several systems that keep the frontend working strong.
We use Route53 and ELBs (Elastic Load Balancers) to get the customer's http request to our webservers. Our webservers are in an autoscaling group themselves. If a server goes down, we will automatically spin up a new one and it will automatically register with the ELB and start serving requests.
We also have both a Celery and what we call a 'cron host'. The Celery machine takes care of certain tasks, such as emailing customer's summaries, CSVs, etc. Our cron host is an instance that runs a bunch of scripts using cron. One of the most important one is the job scheduler: this puts jobs on the queue for our backend systems to work!
Adding in the front end details our system looks like this:
Wow! This looks really complex (and even this simplifies some important areas of our system - more posts to come), but underneath are our core principles: Scalability, Resilience, Transparency. These principles make our systems great.
If you're interested in working with and learning these technologies - Find out more! We're hiring!