Installing Airflow with Docker Compose on a Mac Mini
1. What is Airflow?
Apache Airflow is an open-source tool for creating and running workflows, providing powerful automation and scheduling capabilities for data pipelines. It allows you to define and execute tasks using DAGs (Directed Acyclic Graphs) and monitor their status easily through a web UI.
- My favorite feature is the ability to share logs and code.
- The scheduling function is similar to Cron, but I particularly like that I can check execution results and logs directly from the web.
- The web UI makes it convenient to view and share code.
- Airflow is known for its powerful backfill functionality, but I have never actually used it.
- At work I didn't use backfill; instead, I used task clear to re-run past tasks. I looped over past dates, so setting the execution_date for each run was necessary (a sketch of that loop follows below).
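For reference, that loop was essentially a shell wrapper around the task clear command. Below is a minimal sketch with placeholder dates and DAG ID, not the exact script I used:

for d in 2024-02-01 2024-02-02 2024-02-03; do
  # clearing the task instances for a date makes the scheduler re-run them
  airflow tasks clear my_dag_id --start-date "$d" --end-date "$d" --yes
done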
2. Setting Up the Mac Mini Environment
To run Airflow with Docker Compose on a Mac Mini, I first need to set up the required environment.
2.1 Required Software Installation
To run Airflow on a Mac, you need the following:
- Docker and Docker Compose (installable via Homebrew)
- Python 3 (needed for Airflow configuration)
- I use pyenv to manage Python versions.
- Since Airflow runs in Docker, installing Python separately is not necessary.
- However, having the Airflow package installed locally makes writing DAGs more convenient.
If these are not installed, you can install them using Homebrew:
brew install --cask docker
brew install python
After installing Docker, start Docker Desktop and verify that it runs correctly.
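To confirm the installation before moving on, the standard version and info commands are enough:

docker --version
docker-compose --version
docker info   # an error here usually means Docker Desktop is not running yet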
3. Installing and Running Airflow with Docker Compose
3.1 Downloading the Official Airflow Docker Compose File
Retrieve the Docker Compose file from the official Airflow GitHub repository:
mkdir airflow && cd airflow
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.10.3/docker-compose.yaml'
For the latest version:
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml'
- Note: The above commands may not be exact, as I had previously downloaded the Docker Compose file and backed it up on Google Cloud. I can't confirm if this is the exact method I used—ChatGPT suggested this.
- Additionally, I run the following services as standalone installations rather than using Docker:
- PostgreSQL
- Redis
Thus, I configure the following environment variables in my docker-compose.yaml file:
environment: &airflow-common-env
AIRFLOW__CORE__EXECUTOR: CeleryExecutor
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://id:password!@localhost/airflow_db?sslmode=disable
AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://id:password!@localhost/airflow_db
AIRFLOW__CELERY__BROKER_URL: redis://:@localhost:6379/0
AIRFLOW__CORE__FERNET_KEY: ""
AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: "true"
AIRFLOW__CORE__LOAD_EXAMPLES: "false"
AIRFLOW__API__AUTH_BACKENDS: "airflow.api.auth.backend.basic_auth,airflow.api.auth.backend.session"
# yamllint disable rule:line-length
# Use simple http server on scheduler for health checks
# See https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/check-health.html#scheduler-health-check-server
# yamllint enable rule:line-length
AIRFLOW__SCHEDULER__ENABLE_HEALTH_CHECK: "true"
# WARNING: Use _PIP_ADDITIONAL_REQUIREMENTS option ONLY for a quick checks
# for other purpose (development, test and especially production usage) build/extend Airflow image.
_PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-}
Since I use standalone PostgreSQL and Redis, I comment out the corresponding sections in docker-compose.yaml:
services:
# postgres:
# #image: postgres:13
# image: busybox
# # environment:
# # POSTGRES_USER: airflow
# # POSTGRES_PASSWORD: airflow
# # POSTGRES_DB: airflow
# # volumes:
# # - postgres-db-volume:/var/lib/postgresql/data
# healthcheck:
# test: ["CMD", "pg_isready", "-h", "192.168.0.1", "-U", "admin"]
# interval: 10s
# retries: 5
# start_period: 5s
# # restart: always
# command: ["sleep", "infinity"]
# redis:
# # Redis is limited to 7.2-bookworm due to licencing change
# # https://redis.io/blog/redis-adopts-dual-source-available-licensing/
# # image: redis:7.2-bookworm
# # image: redis:7.2-bookworm
# image: busybox
# # expose:
# # - 6379
# healthcheck:
# test: ["CMD", "redis-cli", "-h", "192.168.0.2", "ping"]
# interval: 10s
# timeout: 30s
# retries: 50
# start_period: 30s
# # restart: always
# command: ["sleep", "infinity"]
airflow-webserver:
3.2 Setting Up Environment Variables
Create a .env file for Airflow with the required configurations:
echo -e "AIRFLOW_UID=$(id -u)\nAIRFLOW_GID=0" > .env
- On Mac, setting AIRFLOW_GID=0 helps avoid permission issues (as suggested by ChatGPT).
- However, I did not include GID in my .env file. Instead, I structured my Airflow project as follows:
AIRFLOW_PROJ_DIR=/path/to/airflow/project/airflow
- File structure:
├── .env
└── docker-compose.yaml
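Put together, a .env along these lines should be enough (the UID value and project path are illustrative; id -u prints your actual UID):

# .env (example values)
AIRFLOW_UID=501
AIRFLOW_PROJ_DIR=/path/to/airflow/project/airflow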
3.3 Creating the Required Directory Structure
Create the necessary folders for Airflow:
mkdir -p dags logs plugins config
- Folder structure:
- The config folder stays empty; no files are needed in it.
.
├── config
├── dags
├── logs
└── plugins
3.4 Running Docker Containers
Start the Airflow containers using:
docker-compose up -d
Once all services (Webserver, Scheduler, etc.) are running, you can access the Airflow UI.
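With the stock compose file, the metadata database is initialized once by the airflow-init service. A few optional sanity checks from the CLI:

docker-compose up airflow-init                # one-time metadata DB initialization
docker-compose ps                             # services should show "running" or "healthy"
docker-compose logs -f airflow-webserver      # tail webserver logs if the UI is unreachable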
3.5 Verifying Web UI Access
Open http://localhost:8080 in your browser to check if the Airflow UI is running.
- Username: airflow
- Password: airflow
4. Testing a Simple DAG
4.1 Enabling Default Example DAGs
In this setup, Airflow's example DAGs are disabled because of the environment variable in docker-compose.yaml. You can enable them by switching that variable (or load_examples in airflow.cfg) and recreating the containers:
AIRFLOW__CORE__LOAD_EXAMPLES: "true"
- Initially, I enabled example DAGs for testing, but they generated excessive logs, so I disabled them.
- Log management in Airflow was particularly challenging for me due to the large volume of logs generated.
However, I kept example DAGs disabled by leaving the following in docker-compose.yaml:
AIRFLOW__CORE__LOAD_EXAMPLES: "false"
4.2 Running an Example DAG
Navigate to the DAGs page in the Airflow UI and run the example_bash_operator DAG to verify that everything is working correctly.
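The same DAG can also be unpaused and triggered from the CLI instead of the UI (using the airflow-cli helper covered in the notes at the end; the commands themselves are standard Airflow CLI):

docker-compose run --rm airflow-cli airflow dags unpause example_bash_operator
docker-compose run --rm airflow-cli airflow dags trigger example_bash_operator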
5. Optimizing and Troubleshooting Airflow on Mac
5.1 Limiting Container Resources
To prevent Docker from consuming excessive resources on Mac, adjust the CPU and RAM limits in Preferences > Resources in Docker Desktop.
- Also, be mindful of Disk Usage. On macOS, Docker storage is classified under "System Data," and its size increases proportionally with usage.
- The main issue isn't simply losing disk space, but that Docker's storage grows unpredictably, which makes the system hard to keep under control.
- I ran into system crashes twice when disk usage reached nearly 100%; in both cases, restarting the machine resolved the issue.
- Instead of adjusting settings in Docker, I created a DAG that periodically removes old logs (a sketch follows the config example below).
- Configure log retention settings in airflow.cfg.
- Alternatively, adjust the logging level via docker-compose.yml by modifying AIRFLOW__LOGGING__LOGGING_LEVEL.
ChatGPT’s Recommendations for Log Management
base_log_folder = /path/to/logs
logging_level = INFO
log_retention_days = 7
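The cleanup DAG mentioned above can be as small as a single BashOperator. Below is a minimal sketch, not the exact DAG I run; the log path and 7-day retention are assumptions:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="cleanup_airflow_logs",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    BashOperator(
        task_id="delete_old_logs",
        # /opt/airflow/logs is where the official image writes logs by default
        bash_command="find /opt/airflow/logs -type f -mtime +7 -delete",
    )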
5.2 Resolving Port Conflicts
By default, Airflow uses port 8080. If this port is already in use by another process, you’ll need to change it. Modify docker-compose.yaml as follows:
airflow-webserver:
ports:
- "9090:8080"
After this, you can access the UI at http://localhost:9090.
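To see which process is holding port 8080 in the first place, lsof works on macOS:

lsof -i :8080   # shows the PID and command bound to port 8080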
5.3 Fixing Volume Permission Issues
On macOS, volume mounting may cause permission issues. Check the volumes section in docker-compose.yaml and, if necessary, adjust permissions using the chmod command.
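For example, to make the mounted folders writable for the container user without resorting to chmod 777 (run from the project directory; adjust paths to your layout):

chmod -R u+rwX,g+rwX dags logs plugins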
5.4 Running Multiple DAGs Concurrently and Optimization
- To execute multiple DAGs simultaneously, increase the max_active_runs_per_dag value in airflow.cfg.
- If certain DAGs depend on each other, use TriggerDagRunOperator to enforce sequential execution.
- Prevent system overload by appropriately setting concurrency and dag_concurrency:
concurrency = 8
dag_concurrency = 4
This configuration allows up to 4 concurrent tasks per DAG, with a total of 8 tasks running simultaneously.
6. Conclusion
This guide covered setting up and running Airflow using Docker Compose on a Mac Mini. I tested basic DAG execution and addressed common issues that may arise in a macOS environment.
Airflow enables the creation of complex data pipelines. Future topics to explore include integrating external data sources, adding custom operators, and using the Kubernetes Executor.
Additional Notes: Backfill
Airflow’s Backfill feature is used to retroactively execute DAG runs for missed periods, often necessary when adding or modifying DAGs.
🔹 Understanding Backfill
- Airflow executes DAGs based on execution_date.
- If DAG runs were missed or a new DAG needs to process historical data, backfill can be used.
- Backfill runs DAGs for past dates according to their schedule, ensuring that missing task executions are completed.
🔹 Running Backfill
To manually trigger backfill for a specific DAG over a past period, use the following Airflow CLI command:
airflow dags backfill -s 2024-02-01 -e 2024-02-10 my_dag_id
- -s 2024-02-01: start date
- -e 2024-02-10: end date
- my_dag_id: the ID of the DAG to run
This command runs my_dag_id from February 1 to February 10, 2024.
🔹 Things to Consider When Using Backfill
- If catchup=False in the DAG definition, past runs won't be backfilled automatically. To enable backfill, set catchup=True (see "Check catchup Setting" below).
- If processing a large backfill job, optimize execution by adjusting parallelism and max_active_runs_per_dag in airflow.cfg (see "Optimizing Parallel Execution" below).
- Consider resource usage:
- Backfill runs multiple historical DAG executions simultaneously, which increases CPU/memory usage.
- Adjust scheduler and worker settings accordingly.
Optimizing Parallel Execution
parallelism = 10
max_active_runs_per_dag = 5
Check catchup Setting
from datetime import datetime
from airflow import DAG

default_args = {"start_date": datetime(2024, 1, 1)}  # example args; adjust as needed

dag = DAG(
    'my_dag',
    default_args=default_args,
    schedule_interval='@daily',
    catchup=True  # allow catch-up runs for past dates
)
🔹 When to Use Backfill
✅ Running a new DAG on historical data
✅ Applying DAG modifications retroactively to past data
✅ Re-executing DAG runs for periods when they failed or weren’t triggered
Additional Notes: Executors
Airflow’s Executor determines how tasks are executed. The main types of Executors are:
- Since this setup is for a home server, LocalExecutor or StandaloneExecutor might be sufficient. However, CeleryExecutor was used here for testing (a LocalExecutor switch is sketched after this list).
- SequentialExecutor
- Executes one task at a time
- Default executor with SQLite
- Recommended only for small test environments
- LocalExecutor
- Allows parallel task execution
- Runs on a single machine using multiprocessing
- Suitable for development or small production setups
- CeleryExecutor
- Distributes tasks across multiple worker nodes
- Uses a message broker (Redis, RabbitMQ, etc.)
- Ideal for large-scale distributed environments
- KubernetesExecutor
- Runs each task in an isolated Kubernetes Pod
- Provides strong resource isolation and scalability
- Best for cloud environments
- DaskExecutor
- Uses Dask for distributed execution
- Supports dynamic scaling and parallel processing
- StandaloneExecutor (Airflow 2.7+)
- Similar to LocalExecutor but with a simpler setup
- Easily runs with airflow standalone
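If I ever drop Celery for a lighter setup, the main change is the executor variable in the compose environment block (a sketch; the redis, airflow-worker, and flower services would then be unnecessary):

environment: &airflow-common-env
  AIRFLOW__CORE__EXECUTOR: LocalExecutor
  # the AIRFLOW__CELERY__* settings are no longer needed with LocalExecutor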
Additional Notes: Flower
What is Flower?
Flower is a web-based monitoring tool for Celery tasks. If using CeleryExecutor in Airflow, Flower allows tracking worker and task statuses.
Flower’s Key Features
- Monitor currently running Celery tasks
- Check the status of individual workers
- Retry or terminate tasks
- View execution logs and queue status
Running Flower in Airflow
If using CeleryExecutor, start Flower UI with:
airflow celery flower
However, in docker-compose.yml, Flower is configured to start with:
# You can enable flower by adding "--profile flower" option e.g. docker-compose --profile flower up
# or by explicitly targeted on the command line e.g. docker-compose up flower.
Once running, access Flower UI at http://localhost:5555.
- I haven't tried running it.
Additional Notes: Using Airflow CLI with Docker Compose
- In the current docker-compose.yaml, the airflow-cli service is assigned the profile debug.
- To enable it, use: docker-compose --profile debug up
- Simply running docker-compose up will not start airflow-cli unless the debug profile is explicitly included.
- To execute Airflow commands (airflow dags list, etc.) without bringing the service up permanently, run:
docker-compose run --rm airflow-cli airflow dags list
- This starts the airflow-cli container, executes the command, and then shuts it down.
- For an interactive shell inside the container:
docker-compose run --rm airflow-cli bash
Additional Notes: Resolving Disk Usage Issues
- To free up disk space, periodically clean up Docker volumes using:
docker system prune -a --volumes
- To automatically clear old Airflow logs, schedule a cron job:
find /path/to/airflow/logs -type f -mtime +7 -delete
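For example, a crontab entry that runs the cleanup every night at 03:00 (the log path is a placeholder):

0 3 * * * find /path/to/airflow/logs -type f -mtime +7 -delete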
Automating Container Restarts
- To ensure Airflow containers restart automatically after a system reboot on macOS, use launchctl or cron to run:
docker-compose up -d
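For the launchctl route, a minimal LaunchAgent along these lines should work (the label, project path, and docker-compose location are placeholders; Docker Desktop must also be set to start at login). Save it as ~/Library/LaunchAgents/com.example.airflow-compose.plist and load it with launchctl load:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.example.airflow-compose</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/bin/docker-compose</string>
        <string>up</string>
        <string>-d</string>
    </array>
    <key>WorkingDirectory</key>
    <string>/path/to/airflow/project/airflow</string>
    <key>RunAtLoad</key>
    <true/>
</dict>
</plist>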