GitHub is the home of the world’s software developers, with more than 100 million developers and 420 million total repositories across the platform. To keep everything running smoothly and securely, GitHub collects a tremendous amount of data through an in-house pipeline made up of several components. But even though it was built for fault tolerance and scalability, the ongoing growth of GitHub led the company to reevaluate the pipeline to ensure it meets both current and future demands.
“We had a scalability problem, currently, we collect about 700 terabytes a day of data, which is heavily used for detecting malicious behavior against our infrastructure and for troubleshooting. This internal system was limiting our growth.”
—Stephan Miehe, GitHub Senior Director of Platform Security
GitHub worked with its parent company, Microsoft, to find a solution. To process the event stream at scale, the GitHub team built a function app that runs in Azure Functions Flex Consumption, a plan recently released for public preview. Flex Consumption delivers fast and large scale-out features on a serverless model and supports long function execution times, private networking, instance size selection, and concurrency control.
In a recent test, GitHub sustained 1.6 million events per second using one Flex Consumption app triggered from a network-restricted event hub.
“What really matters to us is that the app scales up and down based on demand. Azure Functions Flex Consumption is very appealing to us because of how it dynamically scales based on the number of messages that are queued up in Azure Event Hubs.”
—Stephan Miehe, GitHub Senior Director of Platform Security
A look back
GitHub’s problem lay in an internal messaging app orchestrating the flow between the telemetry producers and consumers. The app was originally deployed using Java-based binaries and Azure Event Hubs. But as it began handling up to 460 gigabytes (GB) of events per day, the app was reaching its design limits, and its availability began to degrade.
For best performance, each consumer of the old platform required its own environment and time-consuming manual tuning. In addition, the Java codebase was prone to breakage and hard to troubleshoot, and those environments were getting expensive to maintain as the compute overhead grew.
“We couldn’t accept the risk and scalability challenges of the current solution,“ Miehe says. He and his team began to weigh the alternatives. “We were already using Azure Event Hubs, so it made sense to explore other Azure services. Given the simple nature of our need—HTTP POST request—we wanted something serverless that carries minimal overhead.”
Familiar with serverless code development, the team focused on similar Azure-native solutions and arrived at Azure Functions.
“Both platforms are well known for being good for simple data crunching at large scale, but we don’t want to migrate to another product in six months because we’ve reached a ceiling.”
—Stephan Miehe, GitHub Senior Director of Platform Security
A function app can automatically scale the queue based on the amount of logging traffic. The question was how much it could scale. At the time GitHub began working with the Azure Functions team, the Flex Consumption plan had just entered private preview. Based on a new underlying architecture, Flex Consumption supports up to 1,000 partitions and provides a faster target-based scaling experience. The product team built a proof of concept that scaled to more than double the legacy platform’s largest topic at the time, showing that Flex Consumption could handle the pipeline.
“Azure Functions Flex Consumption gives us a serverless solution with 100% of the capacity we need now, plus all the headroom we need as we grow.”
—Stephan Miehe, GitHub Senior Director of Platform Security
Making a good solution great
GitHub joined the private preview and worked closely with the Azure Functions product team to see what else Flex Consumption could do. The new function app is written in Python to consume events from Event Hubs. It consolidates large batches of messages into one large message and sends it on to the consumers for processing.
Finding the right number for each batch took some experimentation, as every function execution has at least a small percentage of overhead. At peak usage times, the platform will process more than 1 million events per second. Knowing this, the GitHub team needed to find the sweet spot in function execution. Too high a number and there’s not enough memory to process the batch. Too small a number and it takes too many executions to process the batch and slows performance.
The right number proved to be 5,000 messages per batch. “Our execution times are already incredibly low—in the 100–200 millisecond range,” Miehe reports.
This solution has built-in flexibility. The team can vary the number of messages per batch for different use cases and can trust that the target-based scaling capabilities will scale out to the ideal number of instances. In this scaling model, Azure Functions determines the number of unprocessed messages on the event hub and then immediately scales to an appropriate instance count based on the batch size and partition count. At the upper bound, the function app scales up to one instance per event hub partition, which can work out to be 1,000 instances for very large event hub deployments.
“If other customers want to do something similar and trigger a function app from Event Hubs, they need to be very deliberate in the number of partitions to use based on the size of their workload, if you don’t have enough, you’ll constrain consumption.”
—Stephan Miehe, GitHub Senior Director of Platform Security
Azure Functions supports several event sources in addition to Event Hubs, including Apache Kafka, Azure Cosmos DB, Azure Service Bus queues and topics, and Azure Queue Storage.
Reaching behind the virtual network
The function as a service model frees developers from the overhead of managing many infrastructure-related tasks. But even serverless code can be constrained by the limitations of the networks where it runs. Flex Consumption addresses the issue with improved virtual network (VNet) support. Function apps can be secured behind a VNet and can reach other services secured behind a VNet—without degrading performance.
As an early adopter of Flex Consumption, GitHub benefited from improvements being made behind the scenes to the Azure Functions platform. Flex Consumption runs on Legion, a newly architected, internal platform as a service (PaaS) backbone that improves network capabilities and performance for high-demand scenarios. For example, Legion is capable of injecting compute into an existing VNet in milliseconds—when a function app scales up, each new compute instance that is allocated starts up and is ready for execution, including outbound VNet connectivity, within 624 milliseconds (ms) at the 50 percentile and 1,022 ms at the 90 percentile. That’s how GitHub’s messaging processing app can reach Event Hubs secured behind a virtual network without incurring significant delays. In the past 18 months, the Azure Functions platform has reduced cold start latency by approximately 53% across all regions and for all supported languages and platforms.
Working through challenges
This project pushed the boundaries for both the GitHub and Azure Functions engineering teams. Together, they worked through several challenges to achieve this level of throughput:
- In the first test run, GitHub had so many messages pending for processing that it caused an integer overflow in the Azure Functions scaling logic, which was immediately fixed.
- In the second run, throughput was severely limited due to a lack of connection pooling. The team rewrote the function code to correctly reuse connections from one execution to the next.
- At about 800,000 events per second, the system appeared to be throttled at the network level, but the cause was unclear. After weeks of investigation, the Azure Functions team found a bug in the receive buffer configuration in the Azure SDK Advanced Message Queuing Protocol (AMQP) transport implementation. This was promptly fixed by the Azure SDK team and allowed GitHub to push beyond 1 million events per second.
Best practices in meeting a throughput milestone
With more power comes more responsibility, and Miehe acknowledges that Flex Consumption gave his team “a lot of knobs to turn,” as he put it. “There’s a balance between flexibility and the effort you have to put in to set it up right.”
To that end, he recommends testing early and often, a familiar part of the GitHub pull request culture. The following best practices helped GitHub meet its milestones:
- Batch it if you can: Receiving messages in batches boosts performance. Processing thousands of event hub messages in a single function execution significantly improves the system throughput.
- Experiment with batch size: Miehe’s team tested batches as large as 100,000 events and as small as 100 before landing on 5,000 as the max batch size for fastest execution.
- Automate your pipelines: GitHub uses Terraform to build the function app and the Event Hubs instances. Provisioning both components together reduces the amount of manual intervention needed to manage the ingestion pipeline. Plus, Miehe’s team could iterate incredibly quickly in response to feedback from the product team.
The GitHub team continues to run the new platform in parallel with the legacy solution while it monitors performance and determines a cutover date.
“We’ve been running them side by side deliberately to find where the ceiling is,” Miehe explains.
The team was delighted. As Miehe says, “We’re pleased with the results and will soon be sunsetting all the operational overhead of the old solution.“
Explore solutions with Azure Functions
The post GitHub scales on demand with Azure Functions appeared first on Microsoft Azure Blog.
Remember to like to check out our Windows Server content as well.