DNA sequencing: Decoding the genetic blueprint
DNA sequencing is a technique that determines the precise order of nucleotides - adenine (A), thymine (T), cytosine (C), and guanine (G) - in a DNA strand. This sequence constitutes the genetic blueprint of an organism, encoding instructions for the development, function, and reproduction of all living beings.
The specific sequence of the nucleotide bases encodes genetic information, which directs the production of proteins in each cell. These proteins determine cell identity, organismal traits, and overall biological function, and aberrations in this process may lead to disease. Variations in the arrangement of the bases give rise to genetic diversity both within a species and between different species. By decoding this blueprint, scientists can gain insights into genetic variations, evolutionary relationships, and the molecular basis of diseases.
Advances in DNA sequencing have underpinned and revolutionized fields such as genomics, personalized medicine, forensic science, and biotechnology, enabling more accurate diagnoses, targeted therapies, and a deeper understanding of life’s complexity.
Understanding the basics of DNA
DNA is a double-stranded polynucleotide chain that carries the genetic information necessary for the development and functioning of an organism. It is composed of nucleotides, which are considered the building blocks of DNA.
Chemically, DNA consists of two strands, each made up of nucleotides, which comprise a deoxyribose sugar, a phosphate group, and a nitrogenous base. Each sugar molecule is attached to one of four nitrogenous bases: adenine, cytosine, guanine, or thymine. Hydrogen bonds between the bases hold the two strands of DNA together. Cytosine pairs with guanine through three hydrogen bonds, while adenine pairs with thymine through two hydrogen bonds.
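The pairing rules above can be illustrated with a minimal Python sketch. The function names are illustrative choices rather than part of any library, and the sketch assumes an uppercase sequence containing only A, T, C, and G.

```python
# Watson-Crick pairing: A pairs with T (two hydrogen bonds),
# C pairs with G (three hydrogen bonds).
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}
H_BONDS = {"A": 2, "T": 2, "C": 3, "G": 3}


def complement(strand):
    """Return the base-by-base complement of a DNA strand."""
    return "".join(PAIR[base] for base in strand)


def reverse_complement(strand):
    """Return the partner strand read in its own 5'-to-3' direction."""
    return complement(strand)[::-1]


def hydrogen_bonds(strand):
    """Count the hydrogen bonds holding this strand to its partner."""
    return sum(H_BONDS[base] for base in strand)


strand = "ATGC"
print(complement(strand))          # TACG
print(reverse_complement(strand))  # GCAT
print(hydrogen_bonds(strand))      # 10
```

Because C-G pairs contribute three bonds each, a GC-rich stretch is held together by more hydrogen bonds than an AT-rich stretch of the same length.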
The DNA strand is formed as nucleotides link together into chains. This linkage occurs between the phosphate group of one nucleotide and the sugar molecule of the next, creating an alternating sugar-phosphate backbone. These connections ensure the stability and structure of the DNA double helix.
DNA is responsible for passing hereditary information from parents to offspring. It carries the genetic instructions that determine the traits and characteristics of living organisms, ensuring the continuity of genetic information across generations.
The "DNA-RNA-protein" pathway, which entails the transcription of genetic information from DNA into RNA and subsequent translation into proteins, is a fundamental tenet of molecular biology. An organism’s structure, function, and, ultimately, its cellular organization are all determined by these proteins, which act as its building blocks. Beyond heredity, DNA plays a vital role in protein synthesis. It contains specific sequences, or genes, responsible for the expression of proteins. The process of converting DNA into protein occurs in two main steps:
- Transcription, which converts DNA to RNA.
- Translation, which converts RNA to protein.
These processes are essential for gene expression and the functioning of all living cells, making DNA the cornerstone of biological life.
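The two steps above can be sketched in a few lines of Python. This is a simplified illustration, not a full model of gene expression: it assumes the input is the coding (sense) strand, so transcription reduces to replacing T with U, and it uses the standard genetic code. The names `transcribe` and `translate` are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in UCAG order; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand):
    """DNA -> RNA (assumes the coding strand is given, so T becomes U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """RNA -> protein, reading codons from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCCTAA")
print(mrna)             # AUGGCCUAA
print(translate(mrna))  # MA
```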
Genome sequencing
The genome is the complete set of genes or genetic material (DNA and RNA) present in an organism. Genome sequencing involves determining the entire DNA (or RNA) sequence of an organism, providing a comprehensive view of its genetic information.
Genome sequencing is now performed using automated DNA sequencing techniques and computer software. The process involves multiple stages, including:
- Sample collection and preparation: The initial stages involve collecting DNA samples from various biological sources (such as blood, saliva, or tissue) and, depending on requirements, amplifying and fragmenting the sample before sequencing. Amplification (via PCR) is generally required for second-generation sequencing (NGS) but not for third-generation sequencing (long-read sequencing such as nanopore).
- Clone preparation: The organism’s DNA is fragmented and introduced into cloning vectors to create a library of DNA clones. It is an important step in older approaches like clone-by-clone sequencing.
- Sequencing: Once the DNA sample is prepared, it is sequenced using NGS or the Sanger method. The sequencing platform then detects nucleotide incorporation and converts the signals into raw sequence data.
- DNA sequence collection: Each clone is sequenced, and a large amount of sequencing data is generated.
- Data analysis: A continuous, comprehensive sequence of the genome is produced by aligning and piecing together the overlapping sequences from many clones using specialized tools.
- Data repository: The assembled data can then be deposited into public repositories, facilitating open access for researchers worldwide.
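The data analysis stage, in which overlapping sequences are pieced together, can be illustrated with a toy greedy assembler. This is a sketch under strong assumptions (error-free fragments, all on the same strand, no repeats); production assemblers use far more sophisticated graph-based methods. The function names and the `min_len` parameter are illustrative.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: leave separate contigs
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


print(greedy_assemble(["ATGCGT", "CGTACC", "ACCTGA"]))  # ['ATGCGTACCTGA']
```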
Whole genome sequencing (WGS) allows the mapping of all the genes, regulatory regions, and other elements of DNA that contribute to the different characteristics of organisms. The Human Genome Project was initiated in 1990, and in 2003, a genome sequence encompassing more than 90% of the human genome was generated. The first complete sequence of a human genome was reported in 2022.
Key methods of DNA sequencing
DNA sequencing technologies have evolved through three generations, each bringing significant advancements in throughput, cost-efficiency, and technological capabilities.
First-generation sequencing: Sanger sequencing
Initially, DNA sequencing was achieved by methods like Sanger sequencing and focused on sequencing short fragments of DNA. Sanger sequencing, which utilizes the chain termination method, was developed by Frederick Sanger in 1977. This technique is based on the principle that DNA synthesis terminates at specific points when synthetic dideoxynucleotides (ddNTPs), namely ddCTP, ddGTP, ddATP, and ddTTP, are incorporated into the growing DNA strand.
As ddNTPs lack a 3’ hydroxyl group in the deoxyribose sugar, further elongation of the DNA is prevented, and fragments of varying lengths are produced. These fragments are then separated based on size using gel electrophoresis or capillary electrophoresis. The sequence is then assembled base-by-base based on the difference in fragment lengths at which the chain was terminated.
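The read-out logic described above can be sketched as a small simulation. It is an idealized model: it assumes exactly one terminated fragment per position and a perfect size separation, and the function names are illustrative.

```python
import random


def chain_terminated_fragments(new_strand):
    """One fragment per position: (fragment length, labelled terminal base)."""
    return [
        (length, new_strand[length - 1])
        for length in range(1, len(new_strand) + 1)
    ]


def read_by_size(fragments):
    """Sort fragments by length (as electrophoresis does) and read the
    terminal base of each fragment in order."""
    return "".join(base for _, base in sorted(fragments))


fragments = chain_terminated_fragments("GATTACA")
random.Random(0).shuffle(fragments)  # fragments start as an unordered mixture
print(read_by_size(fragments))       # GATTACA
```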
Sanger sequencing is extremely precise and remains the gold standard for smaller-scale sequencing operations, such as sequencing individual genes. However, it is labor-intensive, time-consuming, and more expensive at scale than modern approaches such as next-generation sequencing (NGS), which makes it less feasible for sequencing entire genomes or large sets of genetic data.
Capillary electrophoresis (CE) is a technique used to separate components of a chemical mixture within a narrow capillary tube under the influence of an electric field. CE separates DNA fragments based on size. When DNA is inserted in a capillary tube and an electric field is applied, the fragments flow at varying speeds. The smaller fragments move faster than the larger ones, which can then be precisely separated within the capillaries using a gel-like matrix.
CE is commonly used in Sanger sequencing to resolve the chain-terminated DNA fragments. It is also widely employed to measure fragment length, with applications such as short tandem repeat profiling, which is used in forensic science and paternity testing [1].
Second and third-generation sequencing: Next-generation sequencing (NGS)
The second generation of DNA sequencing brought significant advances, marked by increased throughput and drastically reduced costs and turnaround times. This generation enabled the sequencing of whole genomes and transcriptomes, which are collections of all RNA transcripts transcribed by a single cell or a population of cells at a given point in time. These advances made large-scale sequencing projects more feasible and efficient, allowing researchers to explore genetic data on a much larger scale.
The third generation of DNA sequencing introduced single-molecule sequencing, which does not require prior amplification of DNA. This generation continues to push technological boundaries, offering real-time sequencing with greater precision and speed. The ability to sequence DNA molecules directly and in real-time has opened novel possibilities for genetic research, providing more detailed and accurate insights into the genetic makeup of organisms.
The advancements in the second and third generations are collectively referred to as next-generation sequencing (NGS) and have revolutionized genomics research by making complex sequencing tasks faster, more affordable, and more accurate.
NGS is a set of high-throughput technologies used to rapidly sequence either long reads or short reads, allowing sequencing of as much or as little of the genome as desired. Unlike Sanger sequencing, which processes one DNA fragment at a time, NGS sequences millions of DNA fragments at once.
Sanger sequencing has a read depth of just 1 (or 2 in the case of bidirectional sequencing). Its individual reads are highly accurate and usually about 800 base pairs long, and a run requires 7 hours to complete. NGS technologies such as sequencing by synthesis, on the other hand, take anywhere from 56 hours to 14 days, depending on the platform and sequencing depth, while nanopore sequencing takes roughly 0.5 to 4 hours per run.
NGS sequences millions of DNA fragments in parallel, and its coverage depends on the chosen sequencing depth, which can be high. NGS reads are aligned into a consensus sequence, and errors, including those from PCR amplification, are largely removed through statistical analyses. As a result, the accuracy of current NGS technologies exceeds 99%.
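Two ideas in this comparison, depth of coverage and consensus building, can be made concrete with a short sketch. Mean depth is estimated here with the common approximation (number of reads × read length ÷ genome size), and the consensus is a simple per-position majority vote over reads that are assumed to be already aligned and of equal length. The function names and example numbers are illustrative, not taken from any particular platform.

```python
from collections import Counter


def mean_depth(n_reads, read_length, genome_size):
    """Approximate average number of reads covering each base."""
    return n_reads * read_length / genome_size


def consensus(aligned_reads):
    """Majority vote at each column of equal-length, pre-aligned reads."""
    return "".join(
        Counter(column).most_common(1)[0][0]
        for column in zip(*aligned_reads)
    )


# Hypothetical run: one million 150-base reads over a 5-megabase genome.
print(mean_depth(1_000_000, 150, 5_000_000))  # 30.0

# The third read carries a single error (C instead of G) that is outvoted.
print(consensus(["ACGTAC", "ACGTAC", "ACCTAC"]))  # ACGTAC
```

With a depth of 1 or 2, as in Sanger sequencing, no such vote is possible, which is why the accuracy of each individual read matters more there.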
Sanger sequencing is primarily used for targeted sequencing of small regions of DNA, whereas NGS is widely used for genomic analyses such as whole genome sequencing, exome sequencing, and RNA sequencing due to its ability to sequence large amounts of data quickly and cost-effectively.
The NGS technique consists of multiple steps:
- DNA extraction.
- Library preparation (including DNA fragmentation and tagging adapters).
- Sequencing.
- Data analysis (aligning sequences to a reference genome and finding differences).
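The final step, aligning a read to a reference and finding differences, can be sketched with a deliberately naive approach: slide the read along the reference without gaps, keep the position with the fewest mismatches, and report those mismatches. Real aligners use indexed, gap-aware algorithms; the function names and sequences below are illustrative.

```python
def best_ungapped_alignment(reference, read):
    """Return (position, mismatches) of the best gap-free placement."""
    best_pos, best_mismatches = 0, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for r, q in zip(window, read) if r != q)
        if mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return best_pos, best_mismatches


def call_mismatches(reference, read):
    """List (reference position, reference base, read base) differences."""
    pos, _ = best_ungapped_alignment(reference, read)
    return [
        (pos + i, reference[pos + i], base)
        for i, base in enumerate(read)
        if reference[pos + i] != base
    ]


reference = "ACGTTAGCCATGGA"
read = "TAGCGATG"
print(best_ungapped_alignment(reference, read))  # (4, 1)
print(call_mismatches(reference, read))          # [(8, 'C', 'G')]
```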
Long-read sequencing techniques: Third-generation long-read sequencing technologies (such as nanopore sequencing, which involves threading DNA molecules through nanopores and detecting changes in electrical current to determine the sequence) enable reading longer stretches of DNA in a single pass. This method is beneficial for sequencing complex portions of the genome, particularly repeated sequences.
Data analysis and interpretation in bioinformatics
After sequencing is complete, the data can be processed and evaluated using bioinformatics techniques. Bioinformatics analysis comprises quality control, genomic alignment, variant calling, and functional annotation. The goal of interpretation is to detect genetic alterations, understand their significance in the organism’s body, and associate them with diseases or phenotypic features.
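As an illustration of the quality control step, the sketch below decodes per-base quality characters and filters reads by mean quality. It assumes the reads are stored in FASTQ format with Phred+33 encoding, where a score Q corresponds to an error probability of 10^(-Q/10); the threshold of 20 is an arbitrary example value, and the function names are illustrative.

```python
def phred_scores(quality_string, offset=33):
    """Decode FASTQ quality characters into Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def passes_qc(quality_string, min_mean_q=20):
    """Keep a read only if its mean Phred score reaches the threshold."""
    scores = phred_scores(quality_string)
    return sum(scores) / len(scores) >= min_mean_q


print(phred_scores("II5!"))   # [40, 40, 20, 0]
print(error_probability(20))  # 0.01
print(passes_qc("II5!"))      # True (mean score is 25)
```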
The sequence information can also be applied to fields like species identification (comparing the DNA of unknown organisms with sequences of known species and determining the species based on distinctive variances in their DNA sequence), pharmacogenomics (the effect of a patient’s genome on their response to medicines), ancestry analysis (tracing genetic lineage and population history), and forensic investigations (analyzing DNA for criminal or identity verification).
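Species identification by sequence comparison can be sketched as a nearest-match lookup. The example assumes short, pre-aligned marker sequences of equal length; the species labels and sequences are made up for illustration, and real pipelines search curated reference databases with alignment tools rather than comparing strings position by position.

```python
def percent_identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def identify(unknown, references):
    """Return (name, identity) of the most similar reference sequence."""
    return max(
        ((name, percent_identity(unknown, seq)) for name, seq in references.items()),
        key=lambda item: item[1],
    )


# Hypothetical reference sequences, not real marker data.
references = {
    "species_A": "ACGTACGTAC",
    "species_B": "ACGTTCGAAC",
}
print(identify("ACGTACGTAA", references))  # ('species_A', 0.9)
```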
Importance and applications of DNA sequencing
DNA sequencing has become indispensable in science and medicine, providing critical insights into genetic information and driving advancements in research, diagnostics, and therapeutic development.
- DNA sequencing has enabled the identification of various genetic disorders and the development of strategies to treat them. One of the strategies is gene therapy, which replaces defective genes associated with certain genetic disorders. This technology also aids in designing drugs that target specific genes responsible for disease progression.
- It has also contributed to the development of genetically modified organisms (GMOs).
- Evolutionary relationships among animal species can be studied through DNA sequencing. By examining DNA sequences, scientists can reconstruct the evolutionary history, or phylogeny, of different animal species, improving our understanding of how these species are related.
- DNA sequencing has a wide range of applications in fields such as biotechnology, forensics, virology, and biological systematics, providing solutions to real-world challenges. For example, rapid sequencing of the SARS-CoV-2 genome during the COVID-19 pandemic enabled scientists to track mutations, develop diagnostic tests, and design vaccines.
- It enhances agriculture practices by facilitating the effective breeding of plants and animals, reducing the likelihood of disease outbreaks, and improving yield quality.
Genomics and population genetics
- DNA sequencing enables a comprehensive investigation of the genetic sequence of an organism, resulting in a greater understanding of the organism’s genome structure and composition. It contributes to the identification of genes responsible for traits, diseases, and evolutionary processes.
- Population genetics uses sequencing to investigate genetic differences within and between populations, revealing how species change and adapt over time.
Personalized medicine and pharmacogenomics
- DNA sequencing is used to support the development of personalized medicines by determining drug efficacy based on a patient’s genetic profile.
- It is widely used in the diagnosis of genetic defects, for detecting mutations in cancer-related genes, and for developing individualized treatment programs. It plays a significant role in pharmacogenomics, which is the use of DNA sequencing to study how genetic variants affect pharmacological reactions, resulting in more effective and less harmful treatments.
Evolutionary biology and phylogenetics
- DNA sequencing is used to compare the genetic sequences of various species, allowing scientists to track evolutionary links and comprehend genetic variations across time.
- Phylogenetics, a field of evolutionary biology, uses sequencing data to create genetic trees that demonstrate species' shared ancestry and divergence.
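Distance-based tree building starts from pairwise differences between sequences. The sketch below computes the proportion of differing sites (the p-distance) for every pair of pre-aligned sequences and reports the closest pair, which a distance-based method would typically group first. The taxon names and sequences are hypothetical, and full tree construction is beyond this example.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of sites that differ between two aligned sequences."""
    return sum(1 for x, y in zip(a, b) if x != y) / len(a)


def pairwise_distances(sequences):
    """Map each unordered pair of names to its p-distance."""
    return {
        (n1, n2): p_distance(sequences[n1], sequences[n2])
        for n1, n2 in combinations(sequences, 2)
    }


# Hypothetical aligned sequences for three taxa.
sequences = {
    "taxon_A": "ACGTACGTAC",
    "taxon_B": "ACGTACGTTC",
    "taxon_C": "TCGAACGGTC",
}
distances = pairwise_distances(sequences)
closest_pair = min(distances, key=distances.get)
print(distances)
print(closest_pair)  # ('taxon_A', 'taxon_B')
```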
Forensics and criminal investigations
- DNA sequencing plays a vital role in forensic science, enabling accurate identification in criminal investigations. It helps investigate crimes by analyzing DNA samples discovered at crime scenes and comparing them with suspect profiles. Additionally, DNA sequencing is widely used in paternity testing and other methods of determining parentage.
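Profile comparison of this kind is often based on short tandem repeats, where individuals differ in how many times a short motif is repeated at a given locus. The sketch below counts the longest run of a motif in a sequence and compares two profiles locus by locus. The locus names, motif, and sequences are invented for illustration, and the sketch ignores practical issues such as two alleles per locus, mixtures, and match statistics.

```python
import re


def max_tandem_repeats(sequence, motif):
    """Longest run of back-to-back copies of `motif` in `sequence`."""
    runs = re.findall(f"(?:{re.escape(motif)})+", sequence)
    return max((len(run) // len(motif) for run in runs), default=0)


def profiles_match(profile_a, profile_b):
    """True if both profiles have the same repeat count at every shared locus."""
    shared = profile_a.keys() & profile_b.keys()
    return bool(shared) and all(profile_a[l] == profile_b[l] for l in shared)


evidence_locus = "TT" + "GATA" * 4 + "CC"  # hypothetical sequence at one locus
print(max_tandem_repeats(evidence_locus, "GATA"))  # 4

# Hypothetical single-allele profiles: locus name -> repeat count.
evidence = {"locus_1": 4, "locus_2": 9}
suspect = {"locus_1": 4, "locus_2": 9}
print(profiles_match(evidence, suspect))  # True
```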
Agricultural and environmental applications
- DNA sequencing enhances agricultural efficiency and sustainability by improving breeding programs for both crops and livestock. It helps identify desirable traits, such as disease resistance or increased productivity, by analyzing genetic diversity.
- In environmental science, it is used to investigate biodiversity, monitor endangered species, and study microbial populations that may interact with ecosystems. Furthermore, it ensures food safety by detecting genetically modified organisms in edible products.
Challenges in DNA sequencing
- Technical challenges in accuracy and read length: DNA sequencing faces the challenge of achieving both high accuracy and long read lengths. While longer reads are better suited to sequencing complex genomic regions, they are also more prone to errors than shorter ones. A higher read depth, obtained by sequencing the same region multiple times, enables comparative error correction and enhances accuracy. This is especially useful for short-read sequencing, which requires less complicated error correction algorithms.
- Cost and accessibility issues: The cost of sequencing has fallen significantly over the years (from approximately $450 million for the final human genome draft in 2003 to under $1,500 by late 2015 and around $600 currently). However, the cost of sequencing tools and library preparation for NGS has remained constant, posing a hurdle to the widespread adoption of these newer DNA sequencing technologies. Making sequencing more cost-effective and accessible is a significant challenge for the industry.
- Data management and bioinformatics challenges: DNA sequencing generates large amounts of data, which can cause storage and administration issues, as well as analytical hurdles. The cost of high-performance computing and/or cloud computing resources also serves as another challenge, particularly for smaller institutions and businesses.
- Ethical considerations and privacy concerns: The increasing adoption of DNA sequencing raises ethical concerns, including the possibility of genetic discrimination and privacy violations. While there are guidelines to ensure the responsible and ethical use of genetic information, harmonizing these guidelines across countries remains challenging, which affects international collaboration.
Future trends in DNA sequencing
New techniques, such as CRISPR-based sequencing, are enhancing DNA sequencing by enabling precise genome editing and analysis of specific regions. Additionally, advances in NGS have improved accuracy, throughput, cost-effectiveness, and scalability, along with more efficient data analysis.
Third-generation sequencing technologies, such as nanopore sequencing, offer long-read capabilities, shorter turnaround times, and portability. These features have the potential to transform clinical diagnostics, microbiome research, and real-time disease monitoring. Artificial intelligence (AI) and machine learning (ML) are also advancing the interpretation of sequencing data. These technologies can improve the accuracy of variant detection, speed up data processing, and reveal previously unavailable insights. Collectively, these technologies will shape the future of personalized medicine and genetic research.
FAQs
What are the advantages of de novo sequencing over other methods?
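In DNA sequencing, de novo sequencing means assembling a genome directly from the sequencing reads without a reference genome. Its main advantage is that it does not depend on an existing reference, which makes it suitable for organisms that have not been sequenced before and for regions that differ substantially from known references.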
What are the privacy concerns related to DNA sequencing?
DNA sequencing poses serious privacy concerns since it may expose sensitive medical information. Genetic databases can be used to reveal identities or connect personal health information to public profiles. Individuals can be identified using bioinformatics systems, which raises the risk of data breaches. Third parties may use genomic data to derive health information, influencing insurance, employment, or legal results.
How can individuals access DNA sequencing services?
Individuals can get DNA sequencing services from direct-to-consumer companies that provide at-home kits. Additionally, healthcare providers or genetic counselors may prescribe sequencing for medical purposes, which is often done through specialist clinics or laboratories. Some research facilities offer sequencing services for specialized investigations.
References
- Karger, B. L., & Guttman, A. DNA sequencing by CE. Electrophoresis, 30 Suppl 1(Suppl 1), S196–S202. (2009). https://doi.org/10.1002/elps.200900218