
Enhancing Spark Development with SafeGraph's Configuration

Robert Kiss

1/14/2026

General

Learn how SafeGraph built a centralized config management system with Spring Cloud Config, GitHub & AWS SSM—and used it to streamline Spark on EMR on EKS.

Configuration management sounds a bit boring on the surface, but in real-world engineering teams it can quietly make or break reliability, security, and developer productivity.

At SafeGraph—a geospatial data company managing precise data on over 41 million places globally—the lack of a proper configuration management system used to cause real production issues: wrong data shipped to customers, broken services after endpoint changes, and secrets floating around in plain text.

To be honest, this is a story a lot of teams can recognize. Config starts simple, grows organically, and at some point becomes a tangled mess.

In this article, we’ll walk through how SafeGraph:

  • Went from scattered, hard-coded configurations to a centralized, versioned system
  • Built a configuration management platform with Spring Cloud Config, GitHub, and AWS SSM Parameter Store
  • Used that system to dramatically simplify Spark job configuration on EMR on EKS
  • Improved both developer experience and operational control without over-engineering the solution

If you’re wrestling with configs across Kubernetes, Databricks, Lambda, Airflow, or Spark, this breakdown should give you some concrete ideas to borrow.

What Is Configuration Management (and Why It Actually Matters)

Before diving into SafeGraph’s journey, it helps to clarify what we mean by configuration management in a modern data and platform engineering context.

Configuration management is the process of organizing, versioning, testing, and releasing your application configuration across different environments.

That includes all the small but critical bits like:

  • Database endpoints and credentials
  • Third-party API keys and tokens
  • Caching and performance settings
  • Feature flags or logic toggles
  • S3 bucket names and paths
  • Service endpoints used across multiple apps

On paper, these look like “simple values.” In practice, they:

  • Change more often than code in some systems
  • Can break production instantly when wrong
  • Are needed consistently across multiple applications and environments

When configuration is scattered, implicit, or hard-coded, teams pay the price in outages, wasted debugging time, and slow rollouts. SafeGraph learned this the hard way.

The Hidden Cost of Scattered Configurations

In the early days, SafeGraph didn’t really have a formal configuration management system. And that’s very common. Things worked… until they didn’t.

Configs were spread across:

  • Kubernetes ConfigMaps and environment variables
  • AWS Lambda environment variables
  • Hard-coded values inside application code
  • Airflow’s variable store
  • Databricks cluster configuration as environment variables

On a whiteboard that doesn’t sound terrible. But a few concrete problems quickly emerged:

1. No auditability
There was no reliable way to answer basic questions:
- Who changed this configuration?
- What exactly changed?
- When did it change, and why?

2. No proper review or approval flow
Updating configs often bypassed code review. People would tweak an environment variable or Airflow variable directly. That meant:
- Risky changes in production
- No pull request history
- No systematic checks or tests before applying modifications

3. Hard to share configs across applications
Common settings (like a shared API endpoint or bucket path) were duplicated:
- Slightly different values in different systems
- No clear source of truth

4. Security issues: secrets in plain text
Some credentials were hard-coded or stored in plain text. Obviously not ideal from any security or compliance perspective, especially as the company and data scale grew.

None of these problems are especially exotic. The important part is how they compound over time, especially in a data-heavy environment.

Two Painful Incidents That Forced a Rethink

Two real incidents really highlighted why SafeGraph needed a better configuration management system.

1. Customer configuration table catastrophe

A migration script had a bug that accidentally reset all rows in a critical customer configuration table to a default value.

Why was this table so important? It determined:

  • Which products each customer should receive
  • What data sets needed to be delivered each month

Because of the bug:

  • All customer configs were wiped to defaults
  • SafeGraph shipped the wrong datasets to all customers

Restoring the old state was painful:

  • Restore an RDS snapshot to a new cluster
  • Export the data to a file
  • Import it back into the main RDS cluster
  • Then re-run data deliveries for customers

The core problem wasn’t only the migration bug—it was that configurations weren’t versioned and managed like code, so rolling back config mistakes was heavy and slow.

2. Hard-coded API endpoint breaks services

In another case, SafeGraph needed to update a widely used API endpoint referenced across multiple services and codebases.

But because the endpoint was hard-coded in many places:

  • Engineers had to open multiple pull requests across repos
  • The change required significant manual effort and coordination
  • Even with all that, they still missed a few usages of the old endpoint

So when the team finally deprecated the old endpoint:

  • Some applications still referenced it
  • Those apps broke in production

Again, the root issue was the pattern of hard-coded configuration. It made safe, global changes extremely difficult and eroded development velocity.

At that point, it was pretty clear that SafeGraph needed a more robust, centralized approach to configuration.

Designing a Centralized Configuration Management System

In response, SafeGraph built a new configuration management system centered around a few core principles:

  • Centralize configuration storage
  • Version and review every change via Git
  • Separate secrets from standard configuration
  • Serve configurations in a consistent way to multiple clients
  • Avoid making engineers deal with low-level API details

The result was an architecture based on Spring Cloud Config, GitHub, and AWS SSM Parameter Store, with client libraries to smooth developer experience.

Core Architecture: Spring Cloud Config + GitHub

At the heart of the new system is Spring Cloud Config Server, an open-source configuration service that:

  • Exposes configs over an HTTP API
  • Supports Git as a storage backend
  • Allows clients to request configs by application, environment, and version

In practice, this meant:

1. Configurations live in a single GitHub repository
- Every application gets its own directory
- Each directory contains one or more YAML configuration files

2. Git becomes the source of truth
- All changes go through pull requests
- Code review, approvals, and CI checks can apply to configs
- History and blame (“who changed what, when?”) are built-in

3. Config resolution is environment-aware
For each service, SafeGraph keeps:
- A base configuration file
- One or more environment-specific config files (e.g. `dev`, `staging`, `prod`)

When a client requests configuration for a given application and environment, the config server:

  • Loads the base config
  • Merges it with the environment-specific config
  • Returns the merged result

This layering has a few really nice benefits:

  • Common defaults live in one place (the base file)
  • Environment-specific overrides (like different endpoints or resource limits) are clean and explicit
  • Teams don’t have to duplicate entire config files per environment

To be honest, that base + override pattern is simple but incredibly effective in reducing duplication and config drift across environments.
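The merge step itself can be sketched in a few lines of Python. This is a conceptual illustration of the layering, not SafeGraph's actual implementation, and the config keys are made up:

```python
def merge_configs(base: dict, override: dict) -> dict:
    """Recursively layer an environment-specific override onto a base config.

    Values in `override` win; nested dicts are merged key by key, which is
    roughly how profile-specific settings get layered over shared defaults.
    """
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = value
    return merged

base = {"db": {"host": "localhost", "pool_size": 10}, "feature_x": False}
prod = {"db": {"host": "prod-db.internal"}, "feature_x": True}

print(merge_configs(base, prod))
# {'db': {'host': 'prod-db.internal', 'pool_size': 10}, 'feature_x': True}
```

Note that `pool_size` survives from the base file even though the `prod` override only touches `db.host` — that's exactly the drift-reduction benefit described above.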

Hiding the HTTP Layer: Python and Scala Client Libraries

Even with a robust config server, you don’t really want every engineer hand-rolling HTTP calls to fetch configs.

To make the system easy to adopt, SafeGraph wrote client libraries in:

  • Python
  • Scala

These libraries:

  • Handle talking to the Spring Cloud Config Server
  • Accept simple parameters like application name, environment, and version
  • Deserialize the YAML into typed config objects (or structured dictionaries)

The result is that application code can stay relatively clean. Instead of sprinkling HTTP logic and config parsing everywhere, engineers can do something conceptually like:

```python
config = sg_config.load(app="my-service", env="prod")
```

and get a ready-to-use configuration object. The awkward bits stay under the hood.
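Under the hood, such a wrapper mainly needs to call the config server's `/{application}/{profile}` endpoint and fold the returned property sources together. A minimal sketch, assuming a reachable server address and the standard Spring Cloud Config JSON response shape (property sources listed most specific first):

```python
import json
from urllib.request import urlopen

CONFIG_SERVER = "http://config-server:8888"  # placeholder address, not SafeGraph's

def merge_property_sources(payload: dict) -> dict:
    """Fold Spring Cloud Config property sources into one flat dict.

    Sources arrive most specific first, so we apply them in reverse
    order and let the most specific value win.
    """
    merged: dict = {}
    for source in reversed(payload["propertySources"]):
        merged.update(source["source"])
    return merged

def load_config(app: str, env: str) -> dict:
    """Fetch and merge configuration for one application/environment pair."""
    with urlopen(f"{CONFIG_SERVER}/{app}/{env}") as resp:
        return merge_property_sources(json.load(resp))
```

A real client library would add caching, retries, and typed config objects on top, but the core is this small.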

Secure Secret Management with AWS SSM Parameter Store

Plain configuration values are one thing. Secrets are a different category entirely.

SafeGraph explicitly did not want to store secrets as plain text in the Git-backed configuration files. Although Spring Cloud Config supports some encryption capabilities, its access control story wasn’t strong enough for their needs.

So they split responsibilities:

  • Non-sensitive configuration → stored in GitHub via Spring Cloud Config
  • Secrets and credentials → stored in AWS SSM Parameter Store

Referencing Secrets Indirectly in Config Files

In the YAML configuration files, secrets are represented indirectly by their SSM Parameter Store path, rather than by the secret value itself.

For example, a config file might contain something like:

```yaml
database_password: SG_SECRET:/safegraph/prod/db/password
```

(This is a conceptual example; the exact path or casing may differ.)

Here’s how it works end-to-end:

1. The configuration key/value in YAML uses a special prefix (e.g. `SG_SECRET`).
2. The client libraries (Python/Scala) detect keys with this prefix.
3. When the app loads configuration, the library:
- Calls AWS SSM Parameter Store with the referenced path
- Fetches the actual secret value
- Transparently replaces the placeholder with the real secret in the returned config object

From the user’s perspective, they just use `config.database_password` as a normal value. They don’t have to manually integrate with SSM in every service.
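That resolution step can be sketched as a small recursive walk over the loaded config. The `fetch_secret` callable here is a stand-in for the SSM lookup — in practice it would wrap boto3's `get_parameter` call with decryption enabled:

```python
SECRET_PREFIX = "SG_SECRET:"

def resolve_secrets(config: dict, fetch_secret) -> dict:
    """Replace SG_SECRET references in a config dict with real values.

    `fetch_secret` is any callable mapping a parameter path to its value;
    in production it might wrap
    ssm.get_parameter(Name=path, WithDecryption=True).
    """
    resolved = {}
    for key, value in config.items():
        if isinstance(value, dict):
            resolved[key] = resolve_secrets(value, fetch_secret)
        elif isinstance(value, str) and value.startswith(SECRET_PREFIX):
            resolved[key] = fetch_secret(value[len(SECRET_PREFIX):])
        else:
            resolved[key] = value
    return resolved

config = {
    "database_password": "SG_SECRET:/safegraph/prod/db/password",
    "db_host": "prod-db",
}
fake_store = {"/safegraph/prod/db/password": "hunter2"}

print(resolve_secrets(config, fake_store.get))
# {'database_password': 'hunter2', 'db_host': 'prod-db'}
```

Injecting the fetch function also makes the library easy to test without touching AWS at all.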

Why This Separation Works Well in Practice

This approach brings a few practical benefits:

1. No secrets in Git
- Even if someone browses the config repository, they only see secret references, not the values.

2. Fine-grained access control with IAM
Because secrets live in AWS SSM:
- You can manage access via IAM roles and policies
- Different services can be allowed to read only the parameters they need

3. Centralized secret rotation
Rotating a password or token becomes much easier:
- Update the value in SSM
- Apps read the new value on restart or next load, without any change to the YAML files

4. Keeps configuration readable and structured
You still see where secrets are used in your config hierarchy without exposing the sensitive values.

In my experience, this split of “config in Git, secrets in a managed store” is a solid default pattern for most teams building a configuration management strategy.

Using Config Management to Simplify Spark on EMR on EKS

SafeGraph is fundamentally a data company, so Apache Spark plays a pretty important role in their infrastructure. They run hundreds to thousands of Spark applications daily.

These jobs run on EMR on EKS, and engineers:

  • Build, test, and iterate on Spark applications regularly
  • Launch jobs from different contexts: Airflow, services, or locally via a Spark CLI

This is where a good configuration system really starts paying off—not just for reliability and security, but for developer experience too.

The Problem: Verbose EMR on EKS Job Parameters

Launching a Spark job on EMR on EKS via its API requires a long list of parameters, including:

  • Cluster or virtual cluster IDs
  • Job runtime configuration
  • Docker image
  • Executor and driver resource settings
  • Logging/config paths
  • Environment variables

Previously, each engineer or service that needed to submit a job had to:

  • Construct these large parameter dictionaries or JSON blobs manually
  • Duplicate the same boilerplate configuration over and over

This resulted in:

  • Cluttered code whenever a job was launched
  • Inconsistent defaults between different teams and codebases
  • A higher barrier to entry for new engineers trying to run Spark jobs

This is a classic case where configuration management can drastically reduce cognitive load.

The Solution: Centralized Defaults + Minimal Job Parameters

SafeGraph used the new config management system to centralize most of the EMR on EKS job configuration.

The idea was pretty simple, but powerful:

1. Define a set of default job parameters in the centralized config repo
These included things like:
- Base EMR on EKS parameters
- Recommended Spark defaults
- Shared Docker image IDs
- Logging and monitoring configuration

2. Tooling uses these defaults as a base template
When a job is launched, SafeGraph’s internal tooling:
- Loads the base configuration from the config service
- Merges in any job-specific overrides provided by the engineer
- Calls EMR on EKS with the combined set of parameters

3. Engineers only specify what’s unique to their job
Instead of passing a long, verbose set of parameters, they just provide:
- Job name
- Application code or entrypoint
- Input/output paths or core business parameters

In the example SafeGraph shared, this approach eliminated roughly 90% of the boilerplate configuration engineers previously had to write when submitting jobs.

Conceptually, it looks like this:

  • Before: 30–40 lines of parameters per job submission
  • After: A short, focused set of job-specific fields
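In code, that merge might look like the sketch below. The default values and helper name are illustrative, not SafeGraph's actual tooling; the request fields follow the shape of boto3's `emr-containers` `start_job_run` API:

```python
# Hypothetical platform defaults — in the real system these would be
# loaded from the centralized config service, not hard-coded here.
PLATFORM_DEFAULTS = {
    "virtualClusterId": "vc-1234567890abcdef",
    "executionRoleArn": "arn:aws:iam::123456789012:role/spark-job-role",
    "releaseLabel": "emr-6.10.0-latest",
}

def build_job_request(name: str, entry_point: str, overrides=None) -> dict:
    """Combine platform defaults with the few fields unique to this job."""
    request = {
        **PLATFORM_DEFAULTS,
        "name": name,
        "jobDriver": {"sparkSubmitJobDriver": {"entryPoint": entry_point}},
    }
    request.update(overrides or {})
    return request

request = build_job_request(
    "daily-places-build",
    "s3://my-bucket/jobs/places.py",
)
# Submission would then be a single call, e.g.:
# boto3.client("emr-containers").start_job_run(**request)
```

The engineer writes three lines; everything else comes from the config service and stays consistent across teams.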

This change:

  • Made Spark development and iteration significantly faster
  • Reduced configuration errors and confusion
  • Standardized how jobs are launched across the company

Operational Control: One Config File to Rule the Spark Platform

The benefits weren’t just on the developer side. The platform team also gained much better operational control over Spark.

Using the centralized config system, they maintain a dedicated configuration file for the Spark platform itself. From that single file, they can control things like:

  • The Docker image ID used for Spark jobs
  • Which availability zones to use for running jobs
  • Default driver and executor resource parameters
  • Other platform-wide tuning settings

The key point is: these changes are now centralized and config-driven.

If the platform team wants to:

  • Update the base Spark Docker image
  • Change default resource allocations
  • Adjust availability zones for cost or reliability reasons

They can do it by editing a single config file, going through a standard pull request, and rolling it out—without touching any user code.

As a result:

  • Bug fixes can be rolled out more frequently
  • New platform features or tuning changes reach all jobs consistently
  • Users don’t have to update their code just because the platform evolves

In my view, this is where configuration management really shines: decoupling platform evolution from application changes.

Practical Takeaways for Your Own Configuration Management Journey

You might not be SafeGraph, or you might not be running Spark on EMR on EKS. Still, there are several general lessons here that apply to most engineering teams.

You don’t need to copy the exact stack (Spring Cloud Config, AWS SSM, etc.), but the principles are widely useful.

1. Treat Configuration Like Code

Configs change production behavior just as much as code does.

So, where possible:

  • Put configuration under version control (Git)
  • Use pull requests for reviews and approvals
  • Keep a consistent history of changes
  • Consider adding basic validation or checks for critical config

This alone will help with auditability and safer rollouts.

2. Centralize, but Allow Environment-Specific Overrides

The base + environment-specific layering pattern is simple but very effective.

Aim for:

  • A base configuration shared across all environments
  • Overrides for `dev`, `staging`, `prod` where values genuinely differ

This minimizes duplication and makes it much easier to see what’s different between environments at a glance.

3. Separate Configuration from Secrets

Even if you’re not using AWS SSM, try to:

  • Keep secrets out of Git
  • Use a dedicated secrets manager or parameter store
  • Reference secrets indirectly from configuration files

Then let your libraries or service layer resolve those secrets at runtime. Developers shouldn’t have to deal with raw secret values more than absolutely necessary.

4. Build Thin Client Libraries to Reduce Friction

If using a central config service, provide language-specific client libraries (even if they’re small wrappers):

  • Hide HTTP details
  • Offer a simple `load_config(app, env)` style API
  • Handle secret resolution transparently

The easier it is to consume configuration, the more consistently your teams will use the system.

5. Use Configuration to Improve Developer Experience

Configuration management isn’t just about avoiding outages.

You can also use it to:

  • Define shared defaults for complex APIs (like EMR on EKS)
  • Reduce boilerplate in job submission and service wiring
  • Standardize best practices across different teams

The SafeGraph Spark example is a good pattern: centralized defaults + minimal overrides.

Even outside Spark, you can apply this to things like:

  • API client configuration
  • Background job frameworks
  • Messaging systems (Kafka, SQS, etc.)

SafeGraph’s configuration management journey is a good illustration of how something that often starts as an afterthought can become a core piece of platform engineering.

They moved from:

  • Scattered, hard-coded configs and plain-text secrets
  • Painful incidents affecting customer deliveries and service reliability

To a system built on:

  • Spring Cloud Config Server with GitHub as a backend
  • Centralized, versioned configuration per app and environment
  • Secure secret storage in AWS SSM Parameter Store
  • Tooling that simplifies Spark on EMR on EKS and centralizes platform control

The biggest lesson, in my opinion, is that configuration management isn’t just “ops plumbing.” Done well, it directly improves:

  • Reliability – fewer surprises from hidden or diverging configs
  • Security – secrets handled properly with managed access control
  • Developer productivity – less boilerplate, clearer patterns, faster iteration

If your own setup feels a bit chaotic right now—with configs spread across Kubernetes, CI variables, code, and random dashboards—this might be a good moment to pause and design a more deliberate system.

You don’t have to adopt the exact same tools, but you can start by:

  • Centralizing configs in Git
  • Introducing a config service or at least a shared library
  • Moving secrets into a proper secrets manager
  • Gradually refactoring hard-coded values into managed configuration

Over time, those small steps add up. And, as SafeGraph’s experience shows, they can unlock a much smoother path for both platform teams and application developers.

If you’re considering a new configuration management approach and want to explore patterns like this in more depth, start by mapping where your configs live today. That inventory alone is often a surprisingly eye-opening first step.
