Command Line Tools for Genomic Data Science

Start Date: 11/05/2018

Course Type: Common Course

Course Link:

About Course

Introduces the commands that you need to manage and analyze directories, files, and large sets of genomic data. This is the fourth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

Course Introduction

Introduces the commands that you need to manage and analyze directories, files, and large sets of genomic data.

Compression of Genomic Re-Sequencing Data High-throughput sequencing technologies have led to a dramatic decline in genome sequencing costs and to an astonishingly rapid accumulation of genomic data. These technologies are enabling ambitious genome sequencing endeavours, such as the 1000 Genomes Project and the 1001 (Arabidopsis thaliana) Genomes Project. The storage and transfer of the tremendous amount of genomic data have become a mainstream problem, motivating the development of high-performance compression tools designed specifically for genomic data. A recent surge of interest in the development of novel algorithms and tools for storing and managing genomic re-sequencing data emphasizes the growing demand for efficient methods for genomic data compression.
Oracle Data Guard Oracle provides both graphical user interface (GUI) and command-line (CLI) tools for managing Data Guard configurations.
Compression of Genomic Re-Sequencing Data While standard data compression tools (e.g., zip and rar) are being used to compress sequence data (e.g., GenBank flat files), this approach has been criticized as extravagant because genomic sequences often contain repetitive content (e.g., microsatellite sequences) and many sequences exhibit high levels of similarity (e.g., multiple genome sequences from the same species). Additionally, the statistical and information-theoretic properties of genomic sequences can potentially be exploited for compressing sequencing data.
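The effect of repetitive content on compressibility can be illustrated with a small Python sketch. It is not one of the specialized genomic tools mentioned above; it only uses the general-purpose zlib module from the standard library, and both input sequences are made-up assumptions (a microsatellite-like repeat and a pseudo-random sequence over A, C, G, T).

```python
import random
import zlib


def compression_ratio(sequence: str) -> float:
    """Return the original size divided by the zlib-compressed size."""
    raw = sequence.encode("ascii")
    return len(raw) / len(zlib.compress(raw, 9))


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical inputs, not real genomic data:
# a microsatellite-like repeat and a pseudo-random sequence of equal length.
repetitive = "CAG" * 10_000
random_dna = "".join(rng.choice("ACGT") for _ in range(30_000))

ratio_repetitive = compression_ratio(repetitive)
ratio_random = compression_ratio(random_dna)

print(f"repetitive sequence: {ratio_repetitive:.1f}x")
print(f"random sequence:     {ratio_random:.1f}x")
```

The repeat compresses far better than the random sequence, whose ratio stays near the roughly fourfold gain that comes from a four-letter alphabet stored one byte per base.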
The Genomic HyperBrowser The Genomic HyperBrowser is a web-based system for statistical analysis of genomic annotation data.
Open science data In 2015 the World Data System of the International Council for Science adopted a new set of Data Sharing Principles to embody the spirit of 'open science'. These Principles are in line with data policies of national and international initiatives and they express core ethical commitments operationalized in the WDS Certification of trusted data repositories and service.
Command-line interface The command line provides an interface between programs as well as between a program and the user. In this sense, a command line is an alternative to a dialog box. Editors and databases present a command line, in which alternate command processors might run. On the other hand, one might have options on the command line that open a dialog box. The latest version of 'Take Command' has this feature. dBase used a dialog box to construct command lines, which could be further edited before use.
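As a rough illustration of a command line serving both a user and a calling program, the following Python sketch defines a parser with the standard argparse module. The program name seqstats, the file name reads.fa, and all options are hypothetical and chosen only for this example; the --interactive flag stands in for an option that would open a dialog instead of running unattended.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Define the command-line interface of a hypothetical 'seqstats' tool."""
    parser = argparse.ArgumentParser(
        prog="seqstats",
        description="Summarize sequences in a FASTA file (illustrative only).",
    )
    parser.add_argument("fasta", help="path to a FASTA file")
    parser.add_argument(
        "--min-length",
        type=int,
        default=0,
        help="ignore sequences shorter than this many bases",
    )
    parser.add_argument(
        "--interactive",
        action="store_true",
        help="ask for settings in a dialog instead of using the options",
    )
    return parser


# Parse an explicit argument list, as another program would supply it.
args = build_parser().parse_args(["--min-length", "50", "reads.fa"])
print(args.fasta, args.min_length, args.interactive)
```

The same argument list could be typed by a user at a shell prompt or assembled by another program, which is the sense in which the command line sits between programs as well as in front of the user.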
Data science The term "data science" (originally used interchangeably with "datalogy") has existed for over thirty years and was used initially as a substitute for computer science by Peter Naur in 1960. In 1974, Naur published "Concise Survey of Computer Methods", which freely used the term data science in its survey of the contemporary data processing methods that are used in a wide range of applications.
Data science In 2013, the IEEE Task Force on Data Science and Advanced Analytics was launched, and the first international conference, the IEEE International Conference on Data Science and Advanced Analytics, was launched in 2014. In 2014, the American Statistical Association section on Statistical Learning and Data Mining renamed its journal to "Statistical Analysis and Data Mining: The ASA Data Science Journal" and in 2016 changed its section name to "Statistical Learning and Data Science". In 2015, the International Journal on Data Science and Analytics was launched by Springer to publish original work on data science and big data analytics. In 2013, the first "European Conference on Data Analysis (ECDA)" was organised in Luxembourg, establishing the European Association for Data Science (EuADS) in August 2015. In September 2015, the Gesellschaft für Klassifikation (GfKl) added "Data Science Society" to the name of the society at the third ECDA conference at the University of Essex, Colchester, UK.
Compression of Genomic Re-Sequencing Data The compression ratio of currently available genomic data compression tools ranges between 65-fold and 1,200-fold for human genomes. Very close variants or revisions of the same genome can be compressed very efficiently (for example, a compression ratio of 18,133 was reported for two revisions of the same A. thaliana genome, which are 99.999% identical). However, such compression is not indicative of the typical compression ratio for different genomes (individuals) of the same organism. The most common encoding scheme amongst these tools is Huffman coding, which is used for lossless data compression.
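Huffman coding itself can be sketched in a few lines of Python. This is a minimal textbook version applied to a made-up DNA string, not the implementation used by any of the tools referred to above; the function names and the sample sequence are assumptions for illustration.

```python
import heapq
from collections import Counter


def huffman_codes(text: str) -> dict[str, str]:
    """Build a prefix-free bit-string code for each symbol in text."""
    counts = Counter(text)
    if len(counts) == 1:  # a single distinct symbol still needs one bit
        return {next(iter(counts)): "0"}
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far}).
    # The unique tie_breaker keeps the dicts from ever being compared.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # two lightest subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(text: str, codes: dict[str, str]) -> str:
    """Concatenate the code of every symbol into one bit string."""
    return "".join(codes[s] for s in text)


def decode(bits: str, codes: dict[str, str]) -> str:
    """Recover the text by matching prefix-free codes bit by bit."""
    inverse = {c: s for s, c in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


# Hypothetical sequence with a skewed base composition.
sequence = "A" * 8 + "C" * 4 + "G" * 2 + "T" * 2
codes = huffman_codes(sequence)
bits = encode(sequence, codes)
print(codes)
print(len(bits), "bits instead of", 8 * len(sequence))
```

Frequent symbols receive shorter codes, and because no code is a prefix of another, decoding recovers the input exactly, which is what makes the scheme lossless.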
Data science C. F. Jeff Wu initiated the modern, non-computer-science usage of the term "data science" and advocated that statistics be renamed data science and statisticians data scientists.
ACE (genomic file format) The ACE file format is a specification for storing data about genomic contigs.
Comparative genomic hybridization The arrayMap web site offers access to pre-processed copy number profiles and clinical annotations from tens of thousands of genomic arrays, as well as online visualisation tools and programmatic data access.
Data science In April 2002, the International Council for Science: Committee on Data for Science and Technology (CODATA) started the "Data Science Journal", a publication focused on issues such as the description of data systems, their publication on the internet, applications and legal issues. Shortly thereafter, in January 2003, Columbia University began publishing "The Journal of Data Science", which provided a platform for all data workers to present their views and exchange ideas. The journal was largely devoted to the application of statistical methods and quantitative research. In 2005, the National Science Board published "Long-lived Digital Data Collections: Enabling Research and Education in the 21st Century", defining data scientists as "the information and computer scientists, database and software engineers and programmers, disciplinary experts, curators and expert annotators, librarians, archivists, and others, who are crucial to the successful management of a digital data collection" whose primary activity is to "conduct creative inquiry and analysis."
Genomic counseling Genomic counseling is the process by which a person gets informed about his or her genome. In contrast to genetic counseling, which focuses on Mendelian diseases and typically involves person-to-person communication with a medical genetics expert, genomic counseling is not limited to currently clinically relevant information and includes other genomic information that is of interest for the informed person, such as increased risk for complex disease (for example diabetes or obesity), genetically determined non-disease related traits (for example baldness), or genetic genealogy data. Given the less sensitive nature of this information, genomic advice can be given impersonally, for example over the internet (virtual genomic counseling).
Genomic convergence Genomic convergence is a multifactor approach used in genetic research that combines different kinds of genetic data analysis to identify and prioritize susceptibility genes for a complex disease.
Take Command (command line interpreter) Take Command is a command-line interpreter for the Microsoft Windows line of operating systems. Its advantages over the regular command shell are analogous to those of 4DOS over the COMMAND.COM supplied with MS-DOS.
Comparison of data modeling tools This article is a comparison of data modeling tools which are notable, including standalone, conventional data modeling tools and modeling tools supporting data modeling as part of a larger modeling environment.
Space Telescope Science Data Analysis System The Space Telescope Science Data Analysis System (STSDAS) is an IRAF-based suite of astronomical software for reducing and analyzing astronomical data. It contains general purpose tools and packages for processing data from the Hubble Space Telescope. STSDAS is produced by Space Telescope Science Institute (STScI). The STSDAS software is in the public domain and the source code is available.
Data science Turing Award winner Jim Gray imagined data science as a "fourth paradigm" of science (empirical, theoretical, computational and now data-driven) and asserted that "everything about science is changing because of the impact of information technology" and the data deluge.
Data science In 2001, William S. Cleveland introduced data science as an independent discipline, extending the field of statistics to incorporate "advances in computing with data" in his article "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics," which was published in Volume 69, No. 1, of the April 2001 edition of the International Statistical Review / Revue Internationale de Statistique. In his report, Cleveland established six technical areas which he believed to encompass the field of data science: multidisciplinary investigations, models and methods for data, computing with data, pedagogy, tool evaluation, and theory.