How Retraced Manages Infrastructure
At Retraced, Oracle Cloud Infrastructure (OCI) is the heart of our infrastructure: our compute, database, and files all run there. While Oracle offers wonderful managed services, particularly the Autonomous Transaction Processing database, we would not want to give up Cloudflare's expertise in edge management, whether it's DNS management, automated TLS/SSL management, WAF, or DDoS protection. Similarly, Microsoft Entra ID (formerly Azure AD) excels as a centralized IAM tool, not just for setting up SSO across different platforms but also for mobile device management.
We embraced a poly cloud approach to leverage the best solutions for each use case. However, this approach brings its own set of complexities in provisioning, maintaining, and upgrading infrastructure and software components.
Here’s a breakdown of the various clouds we’ve adopted:
Provider | Use cases |
---|---|
Cloudflare | Networking & Edge: DNS, rate limiting; Security: Zero Trust, WAF & DDoS protection; Workers for UI/frontend, caching and some event-driven services |
Oracle Cloud Infrastructure | Compute: Oracle Kubernetes Engine; Database: Autonomous Transaction Processing, Autonomous Data Warehouse; Object storage: file management; Container registry: Oracle container image registry |
Azure | IAM: Entra ID; Artificial Intelligence: Azure OpenAI Service |
AWS | Message queue: RabbitMQ; Email services: Simple Email Service |
Yes, we are one of the few companies using Oracle Cloud Infrastructure. We chose it for its exceptional Autonomous Transaction Processing database, and we run our compute and object storage there as well so that they sit as close to the database as possible.
In addition to these, we utilize various SaaS providers, including development and automation tools such as HashiCorp Terraform Cloud, GitHub Enterprise, and Doppler.
High-level overview of our architecture
Managing our poly cloud infrastructure with a small team dedicated to infrastructure operations relies on several key factors:
🦾 Managed Services
We strongly believe in the principle of buy before build, and we avoid reinventing the wheel unless absolutely necessary. All the services we have adopted are fully managed or, in the best case, autonomous, which minimizes maintenance overhead and gives us peace of mind regarding availability. Downtime for maintenance is rare.
📜 Infrastructure as Code
We adopted Terraform to provision, deprovision, and upgrade our infrastructure. All modules and workflows are stored in a dedicated repository and orchestrated through GitHub Actions. Terraform Cloud stores our state and handles state locking, so concurrent runs cannot overwrite each other's changes.
A new microservice is provisioned through the following workflow:
📂 A new branch is created → 📝 New module/resource added in the code → 🔃 Pull request is created → ⚙️ Pull request triggers the plan workflow → 👀 Reviewer reviews both the changes and the plan → 🚀 Merge triggers the Terraform deployment pipeline
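The gating in this workflow (plan on a pull request, deploy only on merge) can be sketched as a small decision function. This is an illustration in Python, not the actual GitHub Actions configuration: `pull_request` and `push` are GitHub's standard event names, while the function name `terraform_command`, the `main` default branch, and the exact Terraform arguments are assumptions.

```python
from typing import List, Optional


def terraform_command(
    event_name: str, ref: str, default_branch: str = "main"
) -> Optional[List[str]]:
    """Return the Terraform CLI arguments to run for a GitHub event, or None.

    Hypothetical helper: the branch name and arguments are assumptions.
    """
    if event_name == "pull_request":
        # A pull request only produces a plan for the reviewer to inspect.
        return ["terraform", "plan", "-input=false"]
    if event_name == "push" and ref == f"refs/heads/{default_branch}":
        # A merge arrives as a push to the default branch and triggers the deployment.
        return ["terraform", "apply", "-input=false", "-auto-approve"]
    # Pushes to feature branches and all other events change nothing.
    return None


if __name__ == "__main__":
    print(terraform_command("pull_request", "refs/pull/12/merge"))
    print(terraform_command("push", "refs/heads/main"))
    print(terraform_command("push", "refs/heads/feature/new-service"))
```

Keeping `apply` behind the merge means the plan a reviewer approved is the only path to a change in real infrastructure.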
We've taken this a step further by enabling developer self-service. Developers can create a microservice through GitHub, which sets up everything from a repository and Doppler project to the Kubernetes deployment. They can also provision an Azure OpenAI service without logging into the Azure portal. (We will discuss the self-service workflows in another article.)
🙌🏾 Keeping it Simple
Counterintuitive as it may seem, simplicity is key to managing complex infrastructure. We focus on implementing the simplest solutions that meet our needs, even if it means not using the latest technology or tools. For example, we use a simple `kubectl patch` command for releases instead of more complex tools like Helm or Argo CD, given our straightforward microservices structure. This strategy helps minimize technical debt and keeps our system manageable.
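To make the `kubectl patch` approach concrete, here is a minimal sketch of how a release step could assemble such a command. It is an assumption about what a patch-based release might look like, not the exact command used at Retraced: the function name, the deployment, container, and namespace names, and the image reference are all hypothetical. It relies on a strategic merge patch, which merges the container list by `name`, so only the image of the named container changes.

```python
import json
import shlex
from typing import List


def build_release_command(
    deployment: str, container: str, image: str, namespace: str = "default"
) -> List[str]:
    """Build a `kubectl patch` argv that points one container at a new image.

    Hypothetical helper: all names passed in are illustrative.
    """
    # Strategic merge patch: containers are matched by name, other fields stay untouched.
    patch = {
        "spec": {
            "template": {
                "spec": {"containers": [{"name": container, "image": image}]}
            }
        }
    }
    return [
        "kubectl", "patch", "deployment", deployment,
        "--namespace", namespace,
        "--type", "strategic",
        "--patch", json.dumps(patch),
    ]


if __name__ == "__main__":
    # Example values only; print the command instead of running it.
    cmd = build_release_command(
        "orders", "orders", "registry.example.com/orders:1.4.2", "production"
    )
    print(shlex.join(cmd))
```

Because the patch changes the pod template, Kubernetes performs a regular rolling update of the deployment, which is why no extra release tooling is needed for a simple service.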
📝 Documentation and Communication
Documenting all changes is crucial for effective infrastructure management. We primarily document within the code itself, for example in our Terraform modules. When code documentation isn't possible, we use Notion and track all changes in our Significant Changes database. This practice gives us a clear record of the actions taken, which helps with troubleshooting when issues arise. We communicate all changes to our technology department and, where necessary, to the entire organization, fostering transparency and collaboration.
These strategies not only help us run our operations smoothly but also keep our infrastructure and software components up-to-date.