I am trying to create a DAG which uses the DatabricksRunNowOperator to run pyspark. However I'm unable to figure out how I can access the airflow config inside the pyspark script.
parity_check_run = DatabricksRunNowOperator(
task_id='my_task',
databricks_conn_id='databricks_default',
job_id='1837',
spark_submit_params=["file.py", "pre-defined-param"],
dag=dag,
)
I've tried accessing it via kwargs but that doesn't seem to be working.
CodePudding user response:
You can use the notebook_params argument as seen in the documentation .
e.g:
job_id=42
notebook_params = {
"dry-run": "true",
"oldest-time-to-consider": "1457570074236"
}
notebook_run = DatabricksRunNowOperator(
job_id=job_id,
notebook_params=notebook_params,
)
Then you can access the value via dbutils.widgets.get("oldest-time-to-consider") in the PySpark code.
CodePudding user response:
The DatabricksRunNowOperator supports different ways of providing parameters to the existing jobs, depending on how job is defined (doc):
notebook_paramsif you use notebooks - it's a dictionary of the widget name -> value. You can fetch parameters using thedbutils.widgets.getpython_params- list of parameters that will be passed to Python task - you can fetch them viasys.argvjar_params- list of parameters that will be passed to Jar task. You can get them as usual for Java/Scala programspark_submit_params- list of parameters that will be passed to thespark-submit
