Big Data Integration and Processing

Start Date: 09/13/2020

Course Type: Common Course

Course Link:

About Course

At the end of the course, you will be able to:
* Retrieve data from example database and big data management systems
* Describe the connections between data management operations and the big data processing patterns needed to utilize them in large-scale analytical applications
* Identify when a big data problem needs data integration
* Execute simple big data integration and processing on Hadoop and Spark platforms

This course is for those new to data science. Completion of Intro to Big Data is recommended. No prior programming experience is needed, although the ability to install applications and utilize a virtual machine is necessary to complete the hands-on assignments. Refer to the specialization technical requirements for complete hardware and software specifications.

Hardware Requirements: (A) Quad Core Processor (VT-x or AMD-V support recommended), 64-bit; (B) 8 GB RAM; (C) 20 GB disk free. How to find your hardware information: (Windows): Open System by clicking the Start button, right-clicking Computer, and then clicking Properties; (Mac): Open Overview by clicking on the Apple menu and clicking "About This Mac." Most computers with 8 GB RAM purchased in the last 3 years will meet the minimum requirements. You will need a high-speed internet connection because you will be downloading files up to 4 GB in size.

Software Requirements: This course relies on several open-source software tools, including Apache Hadoop. All required software can be downloaded and installed free of charge (except for data charges from your internet provider). Software requirements include: Windows 7+, Mac OS X 10.10+, Ubuntu 14.04+, or CentOS 6+; VirtualBox 5+.

Course Syllabus

Welcome to the third course in the Big Data Specialization. This week you will be introduced to basic concepts in big data integration and processing. You will be guided through installing the Cloudera VM, downloading the data sets to be used for this course, and learning how to run the Jupyter server.

Course Introduction

In this course, you will learn how to use big data integration and processing on a computer. You will learn how data is often either "lazy" or incompletely processed by many modern processors, and you will use tools to help you deal with both "missing" and "over-counted" data. This course also covers how modern data-intensive applications (smartphones, big data servers) are often implemented in the cloud, using multiple processors, memory, and disk space to provide more performance.

You will need a computer with a fast, stable Internet connection and a solid CPU and GPU. This will vary depending on your computer, software, and hardware, but generally speaking you will need a Core i5 or i7 processor, a 16 GB SD card, a 16 GB USB flash drive, and a good set of HD video editing tools. You will also need a free 16 GB USB flash drive with a clean copy of your Windows 7 or 8.1 operating system, and a copy of the HD movie "Stitcher" app. You will need about 8 GB of free disk space.

This course is aimed at anyone interested in data science, but it also applies to anyone working in data science or data acquisition.

Understanding Data & Cache Machine vs

Course Tag

Big Data, MongoDB, Splunk, Apache Spark

Related Wiki Topic

Article Example
Big data Big data requires a set of techniques and technologies with new forms of integration to reveal insights from datasets that are diverse, complex, and of a massive scale.
Data processing Data processing is, generally, "the collection and manipulation of items of data to produce meaningful information."
Big data Multidimensional big data can also be represented as tensors, which can be more efficiently handled by tensor-based computation, such as multilinear subspace learning. Additional technologies being applied to big data include massively parallel-processing (MPP) databases, search-based applications, data mining, distributed file systems, distributed databases, cloud-based infrastructure (applications, storage and computing resources) and the Internet.
Holistic Data Management "Data mapping", "data validation", "data integration", "data processing"
Data processing The term "data processing" has mostly been subsumed by the newer and somewhat more general term "information technology" (IT). The term "data processing" is presently considered sometimes to have a negative connotation, suggesting use of older technologies. As an example, during 1996 the "Data Processing Management Association" (DPMA) changed its name to the "Association of Information Technology Professionals". Nevertheless, the terms are approximately synonymous.
Core data integration Because it is difficult to promptly roll out a centrally managed data integration solution that anticipates and meets all data integration requirements across an organization, IT engineers and even business users create edge data integration, using technology that may be incompatible with that used at the core. In contrast to a core data integration, an edge data integration is not centrally planned and is generally completed with a smaller budget and a tighter deadline.
Data processing For science or engineering, the terms "data processing" and "information systems" are considered too broad, and the more specialized term data analysis is typically used.
Big data 2012 studies showed that a multiple-layer architecture is one option to address the issues that big data presents. A distributed parallel architecture distributes data across multiple servers; these parallel execution environments can dramatically improve data processing speeds. This type of architecture inserts data into a parallel DBMS, which implements the use of MapReduce and Hadoop frameworks. This type of framework looks to make the processing power transparent to the end user by using a front-end application server.
Data processing The term Data processing (DP) has also been used previously to refer to a department within an organization responsible for the operation of data processing applications.
Ontology-based data integration Ontology-based data integration involves the use of one or more ontologies to effectively combine data or information from multiple heterogeneous sources. It is one of the multiple data integration approaches and may be classified as Global-As-View (GAV). The effectiveness of ontology-based data integration is closely tied to the consistency and expressivity of the ontology used in the integration process.
Core data integration Core data integration is the use of data integration technology for a significant, centrally planned and managed IT initiative within a company. Examples of core data integration initiatives could include:
Big data Big data sets come with algorithmic challenges that previously did not exist. Hence, there is a need to fundamentally change how data is processed.
Data processing Data processing may involve various processes, including:
Big data Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them. Challenges include capture, storage, analysis, data curation, search, sharing, transfer, visualization, querying, updating and information privacy. The term "big data" often refers simply to the use of predictive analytics, user behavior analytics, or certain other advanced data analytics methods that extract value from data, and seldom to a particular size of data set. "There is little doubt that the quantities of data now available are indeed large, but that’s not the most relevant characteristic of this new data ecosystem."
Data pre-processing If there is much irrelevant and redundant information present or noisy and unreliable data, then knowledge discovery during the training phase is more difficult. Data preparation and filtering steps can take a considerable amount of processing time. Data pre-processing includes cleaning, instance selection, normalization, transformation, feature extraction and selection, etc. The product of data pre-processing is the final training set. Kotsiantis et al. (2006) present a well-known algorithm for each step of data pre-processing.
Big data Ulf-Dietrich Reips and Uwe Matzat wrote in 2014 that big data had become a "fad" in scientific research. Researcher Danah Boyd has raised concerns about the use of big data in science neglecting principles such as choosing a representative sample by being too concerned about actually handling the huge amounts of data. This approach may bias results in one way or another. Integration across heterogeneous data resources (some that might be considered big data and others not) presents formidable logistical as well as analytical challenges, but many researchers argue that such integrations are likely to represent the most promising new frontiers in science.
Edge data integration It has been claimed that edge data integration does not typically require large budgets and centrally managed technologies, which is in contrast to a core data integration.
Data processing (disambiguation) Data processing most often refers to Electronic data processing, computer processes that convert data into information or knowledge.
Big data The practitioners of big data analytics processes are generally hostile to slower shared storage, preferring direct-attached storage (DAS) in its various forms, from solid state drive (SSD) to high capacity SATA disk buried inside parallel processing nodes. The perception of shared storage architectures, storage area network (SAN) and network-attached storage (NAS), is that they are relatively slow, complex, and expensive. These qualities are not consistent with big data analytics systems that thrive on system performance, commodity infrastructure, and low cost.
Data integration Data integration involves combining data residing in different sources and providing users with a unified view of these data. This process becomes significant in a variety of situations, which include both commercial (when two similar companies need to merge their databases) and scientific (combining research results from different bioinformatics repositories, for example) domains. Data integration appears with increasing frequency as the volume and the need to share existing data explodes. It has become the focus of extensive theoretical work, and numerous open problems remain unsolved.
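
The idea of combining data residing in different sources into a unified view can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two sources (`crm_rows` and `billing_docs`), their field names, the `Customer` record, and the `integrate` function are hypothetical and do not come from the course materials or any particular tool.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Customer:
    """Unified (global) view of a customer, independent of any one source."""
    customer_id: int
    name: str
    email: Optional[str] = None
    city: Optional[str] = None


# Hypothetical source A: rows shaped like a relational table.
crm_rows = [
    {"id": 1, "full_name": "Ada Lovelace", "email": "ada@example.com"},
    {"id": 2, "full_name": "Alan Turing", "email": "alan@example.com"},
]

# Hypothetical source B: nested documents with string keys.
billing_docs = [
    {"cust": "2", "name": "Alan Turing", "address": {"city": "London"}},
    {"cust": "3", "name": "Grace Hopper", "address": {"city": "New York"}},
]


def integrate(crm_rows, billing_docs):
    """Map both sources onto the unified schema and merge on customer id."""
    unified = {}

    # Source A already has integer ids; only the field names differ.
    for row in crm_rows:
        unified[row["id"]] = Customer(
            customer_id=row["id"],
            name=row["full_name"],
            email=row.get("email"),
        )

    # Source B stores ids as strings and nests the city inside "address".
    for doc in billing_docs:
        cid = int(doc["cust"])
        city = doc.get("address", {}).get("city")
        if cid in unified:
            # Record known from source A: enrich it with the city.
            unified[cid] = replace(unified[cid], city=city)
        else:
            # Record only present in source B.
            unified[cid] = Customer(customer_id=cid, name=doc["name"], city=city)

    return [unified[cid] for cid in sorted(unified)]


if __name__ == "__main__":
    for customer in integrate(crm_rows, billing_docs):
        print(customer)
```

In this sketch, a customer that appears in both sources ends up as a single record carrying the email from one source and the city from the other, while customers known to only one source keep `None` for the fields that source does not provide. Real integration work also has to handle conflicting values and records with no shared key, which this sketch leaves out.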