Data Manipulation at Scale: Systems and Algorithms

Start Date: 07/05/2020

Course Type: Common Course

Course Link:

About Course

Data analysis has replaced data acquisition as the bottleneck to evidence-based decision making: we are drowning in data. Extracting knowledge from large, heterogeneous, and noisy datasets requires not only powerful computing resources, but also the programming abstractions to use them effectively. The abstractions that emerged in the last decade blend ideas from parallel databases, distributed systems, and programming languages to create a new class of scalable data analytics platforms that form the foundation for data science at realistic scales.

In this course, you will learn the landscape of relevant systems, the principles on which they rely, their tradeoffs, and how to evaluate their utility against your requirements. You will learn how practical systems were derived from the frontier of research in computer science and what systems are coming on the horizon. Cloud computing, SQL and NoSQL databases, MapReduce and the ecosystem it spawned, Spark and its contemporaries, and specialized systems for graphs and arrays will be covered. You will also learn the history and context of data science, the skills, challenges, and methodologies the term implies, and how to structure a data science project.

Learning Goals: At the end of this course, you will be able to:
1. Describe common patterns, challenges, and approaches associated with data science projects, and what makes them different from projects in related fields.
2. Identify and use the programming models associated with scalable data manipulation, including relational algebra, MapReduce, and other data flow models.
3. Use database technology adapted for large-scale analytics, including the concepts driving parallel databases, parallel query processing, and in-database analytics.
4. Evaluate key-value stores and NoSQL systems, and describe their tradeoffs with comparable systems, the details of important examples in the space, and future trends.
5. “Think” in MapReduce to effectively write algorithms for systems including Hadoop and Spark. You will understand their limitations, design details, their relationship to databases, and their associated ecosystem of algorithms, extensions, and languages, and you will write programs in Spark.
6. Describe the landscape of specialized Big Data systems for graphs, arrays, and streams.
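
To make the idea of "thinking" in MapReduce a little more concrete, here is a minimal single-machine sketch of the classic word-count pattern in plain Python. It is not taken from the course materials and does not use Hadoop or Spark; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration. A real framework would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on a list of strings."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big systems", "data at scale"]))
```

The point of the sketch is the shape of the computation: each step works on independent pieces (one document, one key), which is what lets a distributed system spread the work out.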

Course Syllabus

Understand the terminology and recurring principles associated with data science, and understand the structure of data science projects and emerging methodologies to approach them. Why does this emerging field exist? How does it relate to other fields? How does this course distinguish itself? What do data science projects look like, and how should they be approached? What are some examples of data science projects?


Course Introduction

This course introduces the basic tools used to manipulate data at various scales, from small bits and bytes to terabytes and petabytes. We'll start with a description of what data is, what makes data special, and how modern algorithms work. We'll cover a variety of techniques for manipulating data, including how to deal with large tables, how to deal with large graphs, and how to deal with large floats. One of the best parts of this course is learning how to use existing tools to deal with really hard problems, so we don't need to go out of our way to download any special software or use special equipment. The course is mostly self-contained, and assumes that you already know how to use common databases for data manipulation. If you do not know how to use a database properly, this course will walk you through how to fix the problem. We will use freely available data management tools, and show how to find and fix problems in popular software.

Week 1: Getting Started
Week 2: Listening and Understanding
Week 3: Working with Data
Week 4: Working with Other People

Data Collection for Business Processes: This course provides an introduction to process data collection and analysis, focusing on the business process as a whole. The emphasis in this part of the course is primarily on how process data analysis and data collection can help you improve process efficiency and business process management. We

Course Tag

Relational Algebra, Python Programming, MapReduce, SQL

Related Wiki Topic

Article Example
Bit manipulation Bit manipulation is the act of algorithmically manipulating bits or other pieces of data shorter than a word. Computer programming tasks that require bit manipulation include low-level device control, error detection and correction algorithms, data compression, encryption algorithms, and optimization. For most other tasks, modern programming languages allow the programmer to work directly with abstractions instead of bits that represent those abstractions. Source code that does bit manipulation makes use of the bitwise operations: AND, OR, XOR, NOT, and bit shifts.
Dictionary of Algorithms and Data Structures The Dictionary of Algorithms and Data Structures is a dictionary style reference for many of the algorithms, algorithmic techniques, archetypal problems, and data structures found in the field of computer science. The dictionary is maintained by Paul E. Black and Vreda Pieterse, and is hosted by the Software and Systems Division, Information Technology Laboratory, a part of the National Institute of Standards and Technology. The new host is the FASTAR research group. It was created in September 1998.
Data manipulation language Data manipulation languages are divided into two types, procedural programming and declarative programming.
List of terms relating to algorithms and data structures It defines a large number of terms relating to algorithms and data structures. For algorithms and data structures not necessarily mentioned here, see list of algorithms and list of data structures.
Telephone and Data Systems Telephone and Data Systems, Inc.'s current executive officers are:
Telephone and Data Systems Telephone and Data Systems, Inc.'s current board of directors are:
Ultra-large-scale systems Kevin Sullivan has stated that the US healthcare system is "clearly an ultra-large-scale system" and that building national scale cyber-infrastructure for healthcare "demands not just a rigorous, modern software and systems engineering effort, but an approach at the cutting edge of our understanding of information processing systems and their development and deployment in complex socio-technical environments".
Data manipulation language Data manipulation language comprises the SQL data change statements, which modify stored data but not the schema or database objects. Manipulation of persistent database objects, e.g., tables or stored procedures, via the SQL schema statements, rather than the data stored within them, is considered to be part of a separate data definition language. In SQL these two categories are similar in their detailed syntax, data types, expressions etc., but distinct in their overall function.
Data model Managing large quantities of structured and unstructured data is a primary function of information systems. Data models describe the structure, manipulation and integrity aspects of the data stored in data management systems such as relational databases. They typically do not describe unstructured data, such as word processing documents, email messages, pictures, digital audio, and video.
Hitachi Data Systems Hitachi Data Systems high-end and mid-range modular storage systems were complemented by software for storage management, content management, business continuity, replication, data protection, and IT operations.
Ultra-large-scale systems The term ultra-large-scale system was introduced in a 2006 report from the Software Engineering Institute at Carnegie Mellon University authored by Linda Northrop and colleagues. The report explained that software intensive systems are reaching unprecedented scales (by measures including lines of code; numbers of users and stakeholders; purposes the system is put to; amounts of data stored, accessed, manipulated, and refined; numbers of connections and interdependencies among components; and numbers of hardware elements). When systems become ultra-large-scale, traditional approaches to engineering and management will no longer be adequate. The report argues that the problem is no longer of engineering systems or system of systems, but of engineering "socio-technical ecosystems".
Hitachi Data Systems Hitachi Data Systems (HDS) is a company that provides modular mid-range and high-end computer data storage systems, software and services. It is a wholly owned subsidiary of Hitachi Ltd. and part of the Hitachi Information Systems & Telecommunications Division.
Ultra-large-scale systems Other domains said to be seeing the rise of ultra-large-scale systems include government, transport systems (for example air traffic control systems), energy distribution systems (for example smart grids) and large enterprises.
Data manipulation language A data manipulation language (DML) is a family of syntax elements similar to a computer programming language used for selecting, inserting, deleting and updating data in a database. Performing read-only queries of data is sometimes also considered a component of DML.
Library of Efficient Data types and Algorithms The Library of Efficient Data types and Algorithms (LEDA) is a proprietarily-licensed software library providing C++ implementations of a broad variety of algorithms for graph theory and computational geometry. It was originally developed by the Max Planck Institute for Informatics in Saarbrücken. Since 2001, LEDA has been further developed and distributed by Algorithmic Solutions Software GmbH.
Large-scale Complex IT Systems The UK Large-Scale Complex IT Systems (LSCITS) Initiative is a research and graduate education programme focusing on the problems of developing large-scale, complex IT systems (also referred to as Ultra-large-scale systems or ULSS). The initiative is funded by the EPSRC, with more than ten million pounds of funding awarded between 2006 and 2013.
Telephone and Data Systems Telephone and Data Systems, Inc. is a Chicago-based telecommunications service company providing wireless, local- and long-distance telephone, broadband and video services to approximately 7 million customers in 36 states through its business units TDS Telecom and U.S. Cellular.
Data manipulation language Data manipulation languages were initially only used within computer programs, but with the advent of SQL have come to be used interactively by database administrators.
Ultra-large-scale systems Ultra-large-scale systems hold the characteristics of systems of systems (systems that have: operationally independent sub-systems; managerially independent components and sub-systems; evolutionary development; emergent behavior; and geographic distribution). But in addition to these, the Northrop report argues that a ULSS will:
Bit manipulation "Bit twiddling" and "bit bashing" are often used interchangeably with bit manipulation, but sometimes exclusively refer to clever or non-obvious ways or uses of bit manipulation, or tedious or challenging low-level device control data manipulation tasks.