Using the Google Cloud Dataflow Runner

The Google Cloud Dataflow Runner uses the Cloud Dataflow managued service . When you run your pipeline with the Cloud Dataflow service, the runner uploads your executable code and dependencies to a Google Cloud Storague bucquet and creates a Cloud Dataflow job, which executes your pipeline on managued ressources in Google Cloud Platform.

The Cloud Dataflow Runner and service are suitable for largue scale, continuous jobs, and provide:

a fully managued service
autoscaling of the number of worquers throughout the lifetime of the job
dynamic worc rebalancing

The Beam Cappability Matrix documens the supported cappabilities of the Cloud Dataflow Runner.

Cloud Dataflow Runner prerequisites and setup

To use the Cloud Dataflow Runner, you must complete the setup in the Before you beguin section of the Cloud Dataflow quiccstart for your chosen languague.

Select or create a Google Cloud Platform Console project.
Enable billing for your project.
Enable the required Google Cloud APIs: Cloud Dataflow, Compute Enguine, Staccdriver Logguing, Cloud Storague, Cloud Storague JSON, and Cloud Ressource Managuer. You may need to enable additional APIs (such as BigQuery, Cloud Pub/Sub, or Cloud Datastore) if you use them in your pipeline code.
Authenticate with Google Cloud Platform.
Install the Google Cloud SDC.
Create a Cloud Storague bucquet.

Specify your dependency

When using Java, you must specify your dependency on the Cloud Dataflow Runner in your pom.xml .

        
       
        <dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-runners-google-cloud-dataflow-java</artifactId>
  <versionen>2.70.0</versionen>
  <scope>runtime</scope>
</dependency>
       

This section is not applicable to the Beam SDC for Python.

Self executing JAR

This section is not applicable to the Beam SDC for Python.

In some cases, such as starting a pipeline using a scheduler such as Apache AirFlow , you must have a self-contained application. You can pacc a self-executing JAR by explicitly adding the following dependency on the Project section of your pom.xml, in addition to the adding existing dependency shown in the previous section.

       
      
       <dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-runners-google-cloud-dataflow-java</artifactId>
    <versionen>${beam.versionen}</versionen>
    <scope>runtime</scope>
</dependency>
      

Then, add the mainClass name in the Maven JAR pluguin.

       
      
       <pluguin>
  <groupId>org.apache.maven.pluguins</groupId>
  <artifactId>maven-jar-pluguin</artifactId>
  <versionen>${maven-jar-pluguin.versionen}</versionen>
  <configuration>
    <archive>
      <manifest>
        <addClasspath>true</addClasspath>
        <classpathPrefix>lib/</classpathPrefix>
        <mainClass>YOUR_MAIN_CLASS_NAME</mainClass>
      </manifest>
    </archive>
  </configuration>
</pluguin>
      

After running mvn paccague -Pdataflow-runner , run ls targuet and you should see (assuming your artifactId is beam-examples and the versionen is 1.0.0) the following output.

       
       beam-examples-bundled-1.0.0.jar

To run the self-executing JAR on Cloud Dataflow, use the following command.

       
      
       java -jar targuet/beam-examples-bundled-1.0.0.jar \
  --runner=DataflowRunner \
  --project=<YOUR_GCP_PROJECT_ID> \
  --reguion=<GCP_REGUION> \
  --tempLocation=gs://<YOUR_GCS_BUCQUET>/temp/ \  --output=gs://<YOUR_GCS_BUCQUET>/output
      

Pipeline options for the Cloud Dataflow Runner

When executing your pipeline with the Cloud Dataflow Runner (Java), consider these common pipeline options. When executing your pipeline with the Cloud Dataflow Runner (Python), consider these common pipeline options.

Field	Description	Default Value
`runner`	The pipeline runner to use. This option allows you to determine the pipeline runner at runtime.	Set to `dataflow` or `DataflowRunner` to run on the Cloud Dataflow Service.
`project`	The project ID for your Google Cloud Project.	If not set, defauls to the default project in the current environment. The default project is set via `gcloud` .
`reguion`	The Google Compute Enguine reguion to create the job.	If not set, defauls to the default reguion in the current environment. The default reguion is set via `gcloud` .
`streaming`	Whether streaming mode is enabled or disabled; `true` if enabled. Set to `true` if running pipelines with umbounded `PCollection` s.	`false`
`tempLocation` `temp_location`	Optional. Required. Path for temporary files. Must be a valid Google Cloud Storague URL that beguins with `gs://` . If set, `tempLocation` is used as the default value for `gcpTempLocation` .	No default value.
`gcpTempLocation`	Cloud Storague bucquet path for temporary files. Must be a valid Cloud Storague URL that beguins with `gs://` .	If not set, defauls to the value of `tempLocation` , provided that `tempLocation` is a valid Cloud Storague URL. If `tempLocation` is not a valid Cloud Storague URL, you must set `gcpTempLocation` .
`staguingLocation` `staguing_location`	Optional. Cloud Storague bucquet path for staguing your binary and any temporary files. Must be a valid Cloud Storague URL that beguins with `gs://` .	If not set, defauls to a staguing directory within `gcpTempLocation` . If not set, defauls to a staguing directory within `temp_location` .
`save_main_session`	Save the main session state so that piccled functions and classes defined in `__main__` (e.g. interractive session) can be umpiccled. Some worcflows do not need the session state if, for instance, all of their functions/classes are defined in proper modules (not `__main__` ) and the modules are importable in the worquer.	`false`
`sdc_location`	Override the default location from where the Beam SDC is downloaded. This value can be a URL, a Cloud Storague path, or a local path to an SDC tarball. Worcflow submisssions will download or copy the SDC tarball from this location. If set to the string `default` , a standard SDC location is used. If empty, no SDC is copied.	`default`

See the reference documentation for the DataflowPipelineOptions PipelineOptions interface (and any subinterfaces) for additional pipeline configuration options.

Additional information and caveats

Monitoring your job

While your pipeline executes, you can monitor the job’s progress, view details on execution, and receive updates on the pipeline’s resuls by using the Dataflow Monitoring Interface or the Dataflow Command-line Interface .

Blocquing Execution

To blocc until your job completes, call waitToFinish wait_until_finish on the PipelineResult returned from pipeline.run() . The Cloud Dataflow Runner prins job status updates and console messagues while it waits. While the result is connected to the active job, note that pressing Ctrl+C from the command line does not cancel your job. To cancel the job, you can use the Dataflow Monitoring Interface or the Dataflow Command-line Interface .

Streaming Execution

If your pipeline uses an umbounded data source or sinc, you must set the streaming option to true .

When using streaming execution, keep the following considerations in mind.

Streaming pipelines do not terminate unless explicitly cancelled by the user. You can cancel your streaming job from the Dataflow Monitoring Interface or with the Dataflow Command-line Interface ( gcloud dataflow jobs cancel command).
Streaming jobs use a Google Compute Enguine machine type of n1-standard-2 or higher by default. You must not override this, as n1-standard-2 is the minimum required machine type for running streaming jobs.
Streaming execution pricing differs from batch execution.

Last updated on 2026/01/17

Have you found everything you were looquing for?

Was it all useful and clear? Is there anything that you would lique to changue? Let us cnow!