Genome Assembly Programming Challenge

Start Date: 07/05/2020

Course Type: Common Course

Course Link: https://www.coursera.org/learn/assembling-genomes

About Course

In Spring 2011, thousands of people in Germany were hospitalized with a deadly disease that started as food poisoning with bloody diarrhea and often led to kidney failure. It was the beginning of the deadliest outbreak in recent history, caused by a mysterious bacterial strain that we will refer to as E. coli X. Soon, German officials linked the outbreak to a restaurant in Lübeck, where nearly 20% of the patrons had developed bloody diarrhea in a single week. At this point, biologists knew that they were facing a previously unknown pathogen and that traditional methods would not suffice: computational biologists would be needed to assemble and analyze the genome of the newly emerged pathogen. To investigate the evolutionary origin and pathogenic potential of the outbreak strain, researchers started a crowdsourced research program. They released bacterial DNA sequencing data from one of the patients, which elicited a burst of analyses carried out by computational biologists on four continents. They even used GitHub for the project: https://github.com/ehec-outbreak-crowdsourced/BGI-data-analysis/wiki

The 2011 German outbreak represented an early example of epidemiologists collaborating with computational biologists to stop an outbreak. In this Genome Assembly Programming Challenge, you will follow in the footsteps of the bioinformaticians investigating the outbreak by developing a program to assemble the genome of E. coli X from millions of overlapping substrings of the E. coli X genome.

Do you have technical problems? Write to us: coursera@hse.ru

Course Syllabus

In April 2011, hundreds of people in Germany were hospitalized with a deadly disease that often started as food poisoning with bloody diarrhea. It was the beginning of the deadliest outbreak in recent history, caused by a mysterious bacterial strain that we will refer to as E. coli X. Within a few months, the outbreak had infected thousands and killed 53 people. To prevent the further spread of the outbreak, computational biologists all over the world had to answer the question “What is the genome sequence of E. coli X?” in order to figure out what new genes it acquired to become pathogenic. The 2011 German outbreak represented an early example of epidemiologists collaborating with computational biologists to stop an outbreak. In this Genome Assembly Programming Challenge, you will follow in the footsteps of the bioinformaticians investigating the outbreak by developing a program to assemble the genome of the deadly E. coli X strain. However, before you embark on building a program for assembling the E. coli X strain, we have to explain some genomic concepts and warm you up by having you solve a simpler problem of assembling a small virus.

Course Introduction

Genome Assembly Programming Challenge: This coding challenge is aimed at exploring how the genome assembly algorithm used in assembly programming is implemented, and how it reaches the correct place for all the genes to be put together. We have to take into account all the genes that are expressed in the organism, and all the genes that are "lost" or expressed in a defective state. We will also consider the various steps involved in the actual assembly and precompilation of the genome, and how these steps influence the final assembly produced by the computer. You will need to have a basic knowledge of assembly and a working knowledge of Python. You will first use the assembly utilities to compile and run programs, and then use the "make" tool to manually compile the final assembly file to run in a simulator. We will also use the "gene-set-list" command-line argument to list all the genes that are expressed in a given tissue. We will also use the "gene-set-test" command-line argument to see which genes are expressed in a given tissue. You will then use the "gene-print" command-line argument to see the genes that are expressed in a given tissue, and also how they are linked together to form functional genomes. Finally, you will use the "gene-set-test-recruiter" and "gene-set-test-candidate" command-line arguments to manually recruit candidates to your gene(s) using in your

Course Tag

Data Structure, Algorithms, Algorithm Design, String (Computer Science)

Related Wiki Topic

Article Example
Hybrid genome assembly National Center for Biotechnology Information: Genome Assembly
Hybrid genome assembly Virtual Poster: Hybrid Genome Assembly of a Nocturnal Lemur
Hybrid genome assembly Comparing the assembly constructed using the hybrid approach to the assembly created using the traditional reference genome approach showed that, with the availability of a reference genome, it is more beneficial to utilize a hybrid de novo assembly strategy as it preserves more genome sequences.
Hybrid genome assembly Genome assembly is normally done by one of two methods: assembly using a reference genome as a scaffold, or "de novo" assembly. The scaffolding approach can be useful if the genome of a similar organism has been previously sequenced. This process involves assembling the genome of interest by comparing it to a known genome or scaffold. "De novo" genome assembly is used when the genome to be assembled is not similar to any other organisms whose genomes have been previously sequenced. This process is carried out by assembling single reads into contiguous sequences (contigs) which are then extended in the 3’ and 5’ directions by overlapping other sequences. The latter is preferred because it allows for the conservation of more sequences.
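The contig-extension idea described above can be illustrated with a toy greedy merger. This is only a minimal sketch under simplifying assumptions (error-free reads, a single strand, no repeats); the function names `overlap` and `greedy_assemble` and the `min_len` threshold are hypothetical, and production assemblers use far more elaborate strategies.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the longest match is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to extend
        # Extend contig i in the 3' direction with the non-overlapping part of j.
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


print(greedy_assemble(["ATGCGT", "GCGTAC", "TACGGA"]))
```

With these three toy reads the merger produces a single contig. Reads that share no overlap of at least `min_len` bases are left as separate contigs.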
Hybrid genome assembly One hybrid approach to genome assembly involves supplementing short, accurate second-generation sequencing data (e.g. from IonTorrent, Illumina or Roche 454) with long, less accurate third-generation sequencing data (e.g. from PacBio RS) to resolve complex repeated DNA segments. The main limitation of single-molecule third-generation sequencing that prevents it from being used alone is its relatively low accuracy, which causes inherent errors in the sequenced DNA. Using solely second-generation sequencing technologies for genome assembly can miss or lead to the incomplete assembly of important aspects of the genome. Supplementation of third-generation reads with short, high-accuracy second-generation sequences can overcome these inherent errors and complete crucial details of the genome. This approach has been used to sequence the genomes of some bacterial species including a strain of "Vibrio cholerae". Algorithms specific for this type of hybrid genome assembly have been developed, such as the PacBio corrected Reads (PBcR) algorithm.
Hybrid genome assembly Although next generation sequencing technology is now capable of producing millions of reads, the assembly of these reads can cause a bottleneck in the entire genome assembly process. As such, extensive research is being done to develop new techniques and algorithms to streamline the genome assembly process and make it a more computationally efficient process and to increase the accuracy of the process as a whole.
Hybrid genome assembly There are inherent challenges when utilizing sequence reads from various technologies to assemble a sequenced genome; data coming from different sequencers can have different characteristics. An example of this can be seen when using the overlap-layout-consensus (OLC) method of genome assembly, which can be difficult when using reads of substantially different lengths. Currently, this challenge is being overcome by using multiple genome assembly programs. An example of this can be seen in Goldberg et al., where the authors paired 454 reads with Sanger reads. The 454 reads were first assembled using the Newbler assembler (which is optimized to use short reads), generating pseudo reads that were then paired with the longer Sanger reads and assembled using the Celera assembler.
Hybrid genome assembly Using multiple sequencing technologies to facilitate genome assembly may become a thing of the past as the quality of long sequencing reads (hundreds or thousands of base pairs) approaches and exceeds the quality of current second-generation sequencing reads. The computational difficulties that are encountered during genome assembly may also recede as computational efficiency and performance increase. In the meantime, more efficient sequencing algorithms and assembly programs are needed to develop more effective assembly approaches that can incorporate sequencing reads from multiple technologies in tandem.
Hybrid genome assembly This study employs a hybrid genome assembly approach that only uses sequencing reads generated using SOLiD sequencing (a second-generation sequencing technology). The genome of "C. pseudotuberculosis" was assembled twice: once using a classical reference genome approach, and once using a hybrid approach. The hybrid approach consisted of three consecutive steps: first, contigs were generated de novo; second, the contigs were ordered and concatenated into supercontigs; and third, the gaps between contigs were closed using an iterative approach. The initial de novo assembly of contigs was achieved in parallel using Velvet, which assembles contigs by manipulating De Bruijn graphs, and Edena, which is an OLC-based assembler.
Hybrid genome assembly The authors of this paper present Cerulean, a hybrid genome assembly program that differs from traditional hybrid assembly approaches. Normally, hybrid assembly involves mapping short high-quality reads to long low-quality reads, but this still introduces errors in the assembled genomes. This process is also computationally expensive and requires a large amount of running time, even for relatively small bacterial genomes.
Hybrid genome assembly In bioinformatics, hybrid genome assembly refers to utilizing various sequencing technologies to achieve the task of assembling a genome from fragmented, sequenced DNA resulting from shotgun sequencing. Genome assembly presents one of the most challenging tasks in genome sequencing as most modern DNA sequencing technologies can only produce reads that are, on average, 25-300 base pairs in length. This is orders of magnitude smaller than the average size of a genome (the genome of the octoploid plant "Paris japonica" is 149 billion base pairs). This assembly is computationally difficult and has some inherent challenges, one of these challenges being that genomes often contain complex tandem repeats of sequences that can be thousands of base pairs in length. These repeats can be too long for second-generation sequencing reads to bridge, and, as such, determining the location of each repeat in the genome can be difficult. Resolving these tandem repeats can be accomplished by utilizing long third-generation sequencing reads, such as those obtained using the PacBio RS DNA sequencer. These sequences are, on average, 10,000-15,000 base pairs in length and are long enough to span most repeated regions. Using a hybrid approach to this process can increase the fidelity of assembling tandem repeats by being able to accurately place them along a linear scaffold and make the process more computationally efficient.
Hybrid genome assembly Hybrid genome assembly can also be accomplished using the Eulerian path approach. In this approach, the length of the assembled sequences does not matter as once a k-mer spectrum has been constructed, the lengths of the reads are irrelevant.
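To make the k-mer spectrum point concrete, here is a small sketch of the Eulerian path idea: reads of different lengths are reduced to a set of k-mers, a de Bruijn graph is built on the (k-1)-mers, and a sequence is spelled by walking every edge once. The function names are hypothetical, and the sketch assumes error-free reads, a non-empty spectrum, and a graph that actually contains an Eulerian path; real assemblers must also deal with repeats, sequencing errors and both strands.

```python
from collections import defaultdict


def kmer_spectrum(reads, k):
    """Set of distinct k-mers; the original read lengths no longer matter."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}


def eulerian_path(kmers):
    """Spell a sequence from an Eulerian path in the de Bruijn graph."""
    graph = defaultdict(list)    # (k-1)-mer -> list of successor (k-1)-mers
    balance = defaultdict(int)   # out-degree minus in-degree per node
    for kmer in sorted(kmers):
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix].append(suffix)
        balance[prefix] += 1
        balance[suffix] -= 1
    # Start at the node with one surplus outgoing edge, or anywhere on a cycle.
    start = next((n for n, b in balance.items() if b == 1), next(iter(graph)))
    # Hierholzer's algorithm with an explicit stack.
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ATGCGTAC", "GTACGGA", "CGTACG"]  # toy reads of lengths 8, 7 and 6
print(eulerian_path(kmer_spectrum(reads, 4)))
```

The three toy reads have different lengths, yet after the spectrum is built only the eight distinct 4-mers remain, which is the property the paragraph above describes.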
Hybrid genome assembly Many of the current limitations in genomic research revolve around the ability to produce large amounts of high-quality sequencing data and to assemble entire genomes of organisms of interest. Developing more effective hybrid genome assembly strategies is the next step in advancing sequence assembly technology, and these strategies are expected to become more effective as more powerful technologies emerge.
Hybrid genome assembly This method was tested by assembling the genome of an "Escherichia coli" strain. First, short reads were assembled using the ABySS assembler. These reads were then mapped to the long reads using BLASR. The results from the ABySS assembly were used to create the assembly graph, which was used to generate scaffolds using the filtered BLASR data.
Reference genome A simple way to measure genome length is to count the number of base pairs in the assembly.
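As a small illustration of counting base pairs, the sketch below sums the sequence lengths in FASTA-formatted text. The helper name `assembly_length` and the toy records are assumptions made for the example; the count includes any ambiguous bases such as N that appear in the sequences.

```python
def assembly_length(fasta_text):
    """Total number of bases across all records in FASTA-formatted text."""
    total = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if line and not line.startswith(">"):  # skip blank and header lines
            total += len(line)
    return total


toy_assembly = ">contig1\nATGC\nGT\n>contig2\nAAA\n"
print(assembly_length(toy_assembly))
```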
Genome Reference Consortium As of June 2015, the major assembly releases for human, mouse and zebrafish are GRCh38, GRCm38 and GRCz10, respectively. Major assembly releases do not follow a fixed cycle; however, there are "minor" assembly updates in the form of genome patches, which either correct errors in the assembly or add alternate loci. These assemblies are represented in various genome browsers and databases, including Ensembl, the NCBI databases and the UCSC Genome Browser.
Hybrid genome assembly The "de novo" assembly of DNA sequences is a very computationally challenging process and can fall into the NP-hard class of problems if the Hamiltonian-cycle approach is used. This is because millions of sequences must be assembled to reconstruct a genome. Within genomes, there are often tandem repeats of DNA segments that can be thousands of base pairs in length, which can cause problems during assembly.
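A brute-force sketch shows why the Hamiltonian formulation scales so badly. Every ordering of the reads corresponds to one path that visits each read exactly once in the overlap graph, so the search space grows factorially with the number of reads. The function names are hypothetical and the example assumes error-free reads; it is meant only to illustrate the combinatorial explosion, not to be used on real data.

```python
from itertools import permutations


def overlap_len(a, b):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def shortest_superstring(reads):
    """Try every ordering of the reads and keep the shortest merged string."""
    best = None
    for order in permutations(reads):  # n! orderings for n reads
        merged = order[0]
        for read in order[1:]:
            merged += read[overlap_len(merged, read):]
        if best is None or len(merged) < len(best):
            best = merged
    return best


print(shortest_superstring(["GCGTAC", "TACGGA", "ATGCGT"]))
```

Three reads give only 6 orderings, but 20 reads already give about 2.4 x 10^18, which is why practical assemblers avoid exhaustive search.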
Genome Whereas a genome sequence lists the order of every DNA base in a genome, a genome map identifies the landmarks. A genome map is less detailed than a genome sequence and aids in navigating around the genome. The Human Genome Project was organized to map and to sequence the human genome. A fundamental step in the project was the release of a detailed genomic map by Jean Weissenbach and his team at the Genoscope in Paris.
Hybrid genome assembly This study employed two different methods for hybrid genome assembly: a scaffolding approach that supplemented currently available sequenced contigs with PacBio reads, as well as an error correction approach to improve the assembly of bacterial genomes. The first approach in this study started with high-quality contigs constructed from sequencing reads from second-generation (Illumina and 454) technology. These contigs were supplemented by aligning them to PacBio long reads to achieve linear scaffolds that were gap-filled using PacBio long reads. These scaffolds were then supplemented again, but using PacBio strobe reads (multiple subreads from a single contiguous fragment of DNA) to achieve a final, high-quality assembly. This approach was used to sequence the genome of a strain of "Vibrio cholerae" that was responsible for a cholera outbreak in Haiti.
Hybrid genome assembly This study also shows that using a lower coverage of corrected long reads is similar to using a higher coverage of shorter reads; 13x PBcR data (corrected using 50x Illumina data) was comparable to an assembly constructed using 100x paired-end Illumina reads. The N50 for the corrected PBcR data was also longer than the Illumina data (4.65 Mbp compared to 3.32 Mbp for the Illumina reads). A similar trend was seen in the sequencing of the "Escherichia coli" JM221 genome: a 25x PBcR assembly had an N50 triple that of a 50x 454 assembly.
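For readers unfamiliar with the N50 statistic quoted above, here is a minimal sketch of the usual definition: the length L such that contigs of length at least L together cover at least half of the total assembly. The function name `n50` and the contig lengths are made up for illustration and are not taken from the study.

```python
def n50(lengths):
    """Contig length at which the running total, taken from the longest
    contig downwards, first reaches half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


print(n50([80, 70, 50, 40, 30, 20]))
```

In the toy example the total is 290, and the two longest contigs (80 + 70 = 150) already cover more than half, so the N50 is 70. A larger N50 generally indicates a more contiguous assembly.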