118 lines
4.1 KiB
Plaintext
118 lines
4.1 KiB
Plaintext
---
|
||
title: Async Queries via Celery
|
||
hide_title: true
|
||
sidebar_position: 8
|
||
version: 1
|
||
---
|
||
|
||
## Async Queries via Celery
|
||
|
||
### Celery
|
||
|
||
On large analytic databases, it’s common to run queries that execute for minutes or hours. To enable
|
||
support for long running queries that execute beyond the typical web request’s timeout (30-60
|
||
seconds), it is necessary to configure an asynchronous backend for Superset which consists of:
|
||
|
||
- one or many Superset workers (which is implemented as a Celery worker), and can be started with
|
||
the `celery worker` command, run `celery worker --help` to view the related options.
|
||
- a celery broker (message queue) for which we recommend using Redis or RabbitMQ
|
||
- a results backend that defines where the worker will persist the query results
|
||
|
||
Configuring Celery requires defining a `CELERY_CONFIG` in your `superset_config.py`. Both the worker
|
||
and web server processes should have the same configuration.
|
||
|
||
```python
|
||
class CeleryConfig(object):
|
||
BROKER_URL = 'redis://localhost:6379/0'
|
||
CELERY_IMPORTS = (
|
||
'superset.sql_lab',
|
||
'superset.tasks',
|
||
)
|
||
CELERY_RESULT_BACKEND = 'redis://localhost:6379/0'
|
||
CELERYD_LOG_LEVEL = 'DEBUG'
|
||
CELERYD_PREFETCH_MULTIPLIER = 10
|
||
CELERY_ACKS_LATE = True
|
||
CELERY_ANNOTATIONS = {
|
||
'sql_lab.get_sql_results': {
|
||
'rate_limit': '100/s',
|
||
},
|
||
'email_reports.send': {
|
||
'rate_limit': '1/s',
|
||
'time_limit': 120,
|
||
'soft_time_limit': 150,
|
||
'ignore_result': True,
|
||
},
|
||
}
|
||
CELERYBEAT_SCHEDULE = {
|
||
'email_reports.schedule_hourly': {
|
||
'task': 'email_reports.schedule_hourly',
|
||
'schedule': crontab(minute=1, hour='*'),
|
||
},
|
||
}
|
||
|
||
CELERY_CONFIG = CeleryConfig
|
||
```
|
||
|
||
To start a Celery worker to leverage the configuration, run the following command:
|
||
|
||
```
|
||
celery --app=superset.tasks.celery_app:app worker --pool=prefork -O fair -c 4
|
||
```
|
||
|
||
To start a job which schedules periodic background jobs, run the following command:
|
||
|
||
```
|
||
celery --app=superset.tasks.celery_app:app beat
|
||
```
|
||
|
||
To setup a result backend, you need to pass an instance of a derivative of from
|
||
cachelib.base.BaseCache to the RESULTS_BACKEND configuration key in your superset_config.py. You can
|
||
use Memcached, Redis, S3 (https://pypi.python.org/pypi/s3werkzeugcache), memory or the file system
|
||
(in a single server-type setup or for testing), or to write your own caching interface. Your
|
||
`superset_config.py` may look something like:
|
||
|
||
```python
|
||
# On S3
|
||
from s3cache.s3cache import S3Cache
|
||
S3_CACHE_BUCKET = 'foobar-superset'
|
||
S3_CACHE_KEY_PREFIX = 'sql_lab_result'
|
||
RESULTS_BACKEND = S3Cache(S3_CACHE_BUCKET, S3_CACHE_KEY_PREFIX)
|
||
|
||
# On Redis
|
||
from cachelib.redis import RedisCache
|
||
RESULTS_BACKEND = RedisCache(
|
||
host='localhost', port=6379, key_prefix='superset_results')
|
||
```
|
||
|
||
For performance gains, [MessagePack](https://github.com/msgpack/msgpack-python) and
|
||
[PyArrow](https://arrow.apache.org/docs/python/) are now used for results serialization. This can be
|
||
disabled by setting `RESULTS_BACKEND_USE_MSGPACK = False` in your `superset_config.py`, should any
|
||
issues arise. Please clear your existing results cache store when upgrading an existing environment.
|
||
|
||
**Important Notes**
|
||
|
||
- It is important that all the worker nodes and web servers in the Superset cluster _share a common
|
||
metadata database_. This means that SQLite will not work in this context since it has limited
|
||
support for concurrency and typically lives on the local file system.
|
||
|
||
- There should _only be one instance of celery beat running_ in your entire setup. If not,
|
||
background jobs can get scheduled multiple times resulting in weird behaviors like duplicate
|
||
delivery of reports, higher than expected load / traffic etc.
|
||
|
||
- SQL Lab will _only run your queries asynchronously if_ you enable **Asynchronous Query Execution**
|
||
in your database settings (Sources > Databases > Edit record).
|
||
|
||
### Celery Flower
|
||
|
||
Flower is a web based tool for monitoring the Celery cluster which you can install from pip:
|
||
|
||
```python
|
||
pip install flower
|
||
```
|
||
|
||
You can run flower using:
|
||
|
||
```
|
||
celery --app=superset.tasks.celery_app:app flower
|
||
```
|