Building Batch Data Pipelines on GCP

Start Date: 03/08/2020

Course Type: Common Course

Course Link: https://www.coursera.org/learn/batch-data-pipelines-gcp

About Course

Data pipelines typically follow one of the Extract-Load (EL), Extract-Load-Transform (ELT) or Extract-Transform-Load (ETL) paradigms. This course describes which paradigm to use, and when, for batch data. It also covers several Google Cloud Platform technologies for data transformation, including BigQuery, executing Spark on Cloud Dataproc, pipeline graphs in Cloud Data Fusion, and serverless data processing with Cloud Dataflow. Learners get hands-on experience building data pipeline components on Google Cloud Platform using Qwiklabs.
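
To make the ETL paradigm concrete, here is a minimal batch pipeline sketch written with Apache Beam, the Python SDK that Cloud Dataflow executes. The bucket paths and the two-column CSV schema are hypothetical placeholders, not from the course; it runs locally on the DirectRunner, or on Google Cloud if the Dataflow runner flags are passed.

    # ETL sketch with Apache Beam: read from Cloud Storage, transform, write back.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def parse_line(line):
        # Transform: split a "name,score" CSV line into a dict (assumed schema).
        name, score = line.split(",")
        return {"name": name, "score": int(score)}

    def run():
        # Defaults to the local DirectRunner; add --runner=DataflowRunner plus
        # --project, --region and --temp_location to run on Cloud Dataflow.
        opts = PipelineOptions()
        with beam.Pipeline(options=opts) as p:
            (p
             | "Extract" >> beam.io.ReadFromText(
                   "gs://my-bucket/raw/scores.csv", skip_header_lines=1)
             | "Transform" >> beam.Map(parse_line)
             | "Format" >> beam.Map(lambda r: "%s,%d" % (r["name"], r["score"]))
             | "Load" >> beam.io.WriteToText("gs://my-bucket/out/scores"))

    if __name__ == "__main__":
        run()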

Course Syllabus

Executing Spark on Cloud Dataproc
Summary
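
The "Executing Spark on Cloud Dataproc" module centres on running Spark jobs against data in Cloud Storage. A minimal PySpark sketch of that kind of batch transform might look like the following, where the bucket paths and column names are hypothetical placeholders; such a script could be submitted with gcloud dataproc jobs submit pyspark.

    # Batch transform in PySpark, suitable for running on a Dataproc cluster.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("daily-revenue").getOrCreate()

    # Extract: Dataproc clusters read Cloud Storage natively via the GCS connector.
    orders = spark.read.option("header", True).csv("gs://my-bucket/raw/orders.csv")

    # Transform: keep completed orders and total the revenue per day.
    daily = (orders
             .filter(F.col("status") == "COMPLETE")
             .withColumn("amount", F.col("amount").cast("double"))
             .groupBy("order_date")
             .agg(F.sum("amount").alias("revenue")))

    # Load: write the result back to Cloud Storage as Parquet.
    daily.write.mode("overwrite").parquet("gs://my-bucket/curated/daily_revenue")

    spark.stop()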

Course Introduction

Building Batch Data Pipelines on GCP is part of a Coursera series on data engineering with Google Cloud Platform (GCP). The course reviews the main paradigms for batch data pipelines, Extract-Load (EL), Extract-Load-Transform (ELT) and Extract-Transform-Load (ETL), and explains when each is appropriate. It then covers the GCP services used to implement them: data transformation in BigQuery, executing Spark on Cloud Dataproc, building pipeline graphs in Cloud Data Fusion, and serverless data processing with Cloud Dataflow. Hands-on labs in Qwiklabs give learners practice building each pipeline component. After completing this course, you will be able to:

1. Choose between EL, ELT and ETL for a given batch workload
2. Execute and optimize Spark jobs on Cloud Dataproc
3. Build and manage pipeline graphs in Cloud Data Fusion
4. Write serverless batch-processing pipelines with Cloud Dataflow
5. Carry out data transformations in BigQuery

A sketch of the ELT pattern these objectives build toward follows below.
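
As a minimal sketch of that ELT pattern, assuming the google-cloud-bigquery Python client, the snippet below loads raw CSV data from Cloud Storage into BigQuery and then transforms it in place with SQL. The project, dataset, table and bucket names are hypothetical placeholders, not taken from the course.

    # ELT sketch: load raw data into BigQuery first, then transform with SQL in place.
    from google.cloud import bigquery

    client = bigquery.Client()

    # Extract + Load: ingest a CSV file from Cloud Storage into a raw table.
    load_job = client.load_table_from_uri(
        "gs://my-bucket/raw/events.csv",          # hypothetical bucket
        "my-project.my_dataset.raw_events",       # hypothetical table
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            autodetect=True,
            skip_leading_rows=1,
        ),
    )
    load_job.result()  # block until the load finishes

    # Transform: run SQL inside BigQuery to produce a cleaned, aggregated table.
    sql = """
    CREATE OR REPLACE TABLE my_dataset.daily_events AS
    SELECT DATE(event_time) AS day, COUNT(*) AS events
    FROM my_dataset.raw_events
    GROUP BY day
    """
    client.query(sql).result()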

Related Wiki Topics

Article Examples
CMS Pipelines Data on CMS is structured in logical records rather than a stream of bytes. For textual data a line of text corresponds to a logical record. In "CMS Pipelines" the data is passed between the stages as logical records.
GCP Applied Technologies GCP Applied Technologies was established as a subsidiary of W.R. Grace & Co. in Columbia, Maryland in 2015. Its parent company spun off GCP Applied Technologies on January 28, 2016.
Transnet Pipelines Transnet Pipelines uses a telecontrol system to monitor its pipeline. The telecontrol system is by Siemens Systems and "allows for automatic leak detection and batch tracking". The system operates with a 4-second delay between an event in the pipeline and on-screen display in Durban.
Batch processing Batch processing dates to the late 19th century, in the processing of data stored on decks of punch card by unit record equipment, specifically the tabulating machine by Herman Hollerith, used for the 1890 United States Census. This was the earliest use of a machine-readable medium for data, rather than for control (as in Jacquard looms; today "control" corresponds to "code"), and thus the earliest processing of machine-read data was batch processing. Each card stored a separate record of data with different fields: cards were processed by the machine one by one, all in the same way, as a batch. Batch processing continued to be the dominant processing mode on mainframe computers from the earliest days of electronic computing in the 1950s.
Enbridge Pipelines Enbridge Pipelines is a collection of four different systems of natural gas pipelines, all owned by Enbridge. They include the Enbridge Pipelines (AlaTenn) system, the Enbridge Pipelines (MidLa) system, the Enbridge Offshore Pipelines (UTOS) system, and the Enbridge Pipelines (KPC) system.
Batch file OS/2's batch file interpreter also supports an EXTPROC command. This passes the batch file to the program named on the EXTPROC line as a data file. The named program can be a script file; this is similar to the #! mechanism.
CMS Pipelines "CMS Pipelines" provides a CMS command, PIPE. The argument string to the PIPE command is the pipeline specification. PIPE selects programs to run and chains them together in a pipeline to pump data through.
Batch file A batch file is a kind of script file in DOS, OS/2 and Microsoft Windows. It consists of a series of commands to be executed by the command-line interpreter, stored in a plain text file. A batch file may contain any command the interpreter accepts interactively and use constructs that enable conditional branching and looping within the batch file, such as "if", "for", "goto" and labels. The term "batch" is from batch processing, meaning "non-interactive execution", though a batch file may not process a "batch" of multiple data.
Holistic Data Management "Data processing" – This user interface module provides the functionalities for defining and managing interface configurations and batch runtime engines on data object relationships. The interface configurations include scheduler, transmission mode, multi-batch transmission, user-defined DOR API and reporting. Data Processing interface configuration requires a data-mapping design defined within a data network scope.
Batch processing In a bank, for example, so-called "end-of-day (EOD)" jobs include interest calculation, generation of reports and data sets for other systems, printing statements, and payment processing. This coincides with the concept of cutover, where transactions and data for a particular day's batch activity are cut off, and any further data is deferred to the following day's batch activity (this is the reason for messages like "deposits after 3 PM will be processed the next day").
Batch processing Many early computer systems offered only batch processing, so jobs could be run any time within a 24-hour day. With the advent of transaction processing the online applications might only be required from 9:00 a.m. to 5:00 p.m., leaving two shifts available for batch work, in this case the batch window would be sixteen hours. The problem is not usually that the computer system is incapable of supporting concurrent online and batch work, but that the batch systems usually require access to data in a consistent state, free from online updates until the batch processing is complete.
Batch file The following command in a batch file will delete all the data in the current directory (folder) - without first asking for confirmation:
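On DOS and Windows that command is typically the following, where the /Q switch suppresses the confirmation prompt on the *.* wildcard:

    del /Q *.*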
Continuous analytics Continuous analytics is a data science process that abandons ETLs and complex batch data pipelines in favor of cloud-native and microservices paradigms. Continuous data processing enables realtime interactions and immediate insights with fewer resources.
TC PipeLines TC PipeLines, LP is a publicly traded master limited partnership. TransCanada Corporation owns 24.65% of the outstanding units and controls the general partner. TC PipeLines, LP manages and owns natural gas pipelines in the United States, including 46.45% of Great Lakes Gas Transmission Limited Partnership, 50% of Northern Border Pipeline Company, 100% of Gas Transmission Northwest, and 100% of Tuscarora Gas Transmission Company. TC PipeLines, LP is based in Calgary, Alberta.
Data masking Data masking is tightly coupled with building test data. Two major types of data masking are static and on-the-fly data masking.
Clinical data management Standard operating procedures (SOPs) describe the processes to be followed in conducting data management activities and support the obligation to comply with applicable laws and guidelines (e.g. ICH GCP and 21 CFR Part 11).
Batch processing Batch processing is the execution of a series of jobs in a program on a computer without manual intervention (non-interactive). Strictly speaking, it is a processing mode: the execution of a series of programs each on a set or "batch" of inputs, rather than a "single" input (which would instead be a custom "job"). However, this distinction has largely been lost, and the series of steps in a batch process are often called a "job" or "batch job".
Batch processing Batch processing is also used for efficient bulk database updates and automated transaction processing, as contrasted to interactive online transaction processing (OLTP) applications. The extract, transform, load (ETL) step in populating data warehouses is inherently a batch process in most implementations.
Batch processing Where batch processing remains in use, the outputs of separate stages (and the input for the subsequent stage) are typically stored as files. This is often done for ease of development and debugging, as it allows intermediate data to be reused or inspected. For example, to process data using two programs, prog1 and prog2, one might get initial data from a file infile and store the ultimate result in a file outfile. Via batch processing, one can use an intermediate file, tmpfile, and run the steps in sequence (Unix syntax):
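With those placeholder names, the two steps run in sequence as:

    prog1 < infile > tmpfile
    prog2 < tmpfile > outfile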
Transnet Pipelines Transnet Pipelines, a subsidiary of Transnet, is the principal operator of South Africa's fuel pipeline network and is responsible for petroleum storage and pipeline maintenance. Transnet Pipelines transports petrol, diesel fuel, jet fuel, crude oil and natural gas (methane-rich gas). Total throughput exceeds 16 billion litres per year.