Protein sequencing: Methods and their role in biotechnology

Protein sequencing is the process of determining the amino acid composition of a protein and the specific order of its amino acids.

Protein sequencing is important for understanding protein functions, interactions, and their roles in biological processes. It has revolutionized biotechnology by enabling the design of engineered proteins, biopharmaceuticals, and enzymes tailored for specific applications. It also plays a vital role in personalized medicine and drug development by identifying protein structures important for therapeutic innovations.

Protein sequencing enables the comprehensive characterization of proteins within the proteome, facilitating the analysis of subcellular complexes and novel gene products.

Basics of proteins and their sequencing

Proteins are fundamental biochemical molecules, essential for nearly every process that occurs within cells. Amino acids are the building blocks of protein structure and play a pivotal role in determining protein function and interaction.

Protein sequencing identifies the composition and precise order of these amino acids within a protein. It plays an important role in understanding protein structure and function, revealing how proteins carry out vital activities such as catalyzing metabolic reactions, replicating DNA, and responding to stimuli.

Structure and function of proteins

Proteins are composed of long chains of amino acid residues arranged in specific sequences and linked by peptide bonds. These chains fold into unique three-dimensional shapes that define their function, and local regions often adopt regular conformations such as α-helices and β-sheets.

Protein structure is organized into four levels: primary, secondary, tertiary, and quaternary.

In their natural environment, proteins fold spontaneously into a tertiary structure, also known as the native structure. This folding is driven by inter-residue interactions, including hydrogen bonds, hydrophobic interactions, ionic bonds, and van der Waals forces.

Proteins perform essential functions within the organism, such as transporting molecules, providing structure to cells, responding to stimuli, facilitating cell signaling, regulating gene expression, supporting immune responses, and catalyzing metabolic reactions.

They are categorized based on their roles, such as enzymes that accelerate chemical reactions (e.g., amylase) and structural proteins like keratin that provide support. Other types include transport proteins (aquaporins), motor proteins (actin and myosin), regulatory proteins (insulin), and immune regulating proteins (interleukin-2).

Significance of amino acids in protein sequencing

Amino acids, the building blocks of proteins, consist of a carboxyl group (-COOH), a basic amino group (-NH2), and a unique R group (side chain), all attached to the α-carbon. The R group distinguishes each of the 20 standard amino acids and influences the properties of proteins, including their structure and function.

Determining the precise order of amino acids is the central goal of protein sequencing, because the sequence governs how the protein folds and which interactions stabilize its structure. This sequence, or primary structure, lays the foundation for higher-order structures like α-helices, β-sheets, and complex protein assemblies. Accurate sequencing is essential in biology, medicine, and biotechnology to uncover protein functions, identify mutations, and develop targeted therapies.

Methods of protein sequencing

Protein sequencing methods fall broadly into two groups: those that determine the N-terminal sequence of a protein and those that identify its entire amino acid sequence. Mass spectrometry (MS) and Edman degradation are the fundamental techniques in proteomics for protein sequencing and characterization. While MS excels in high-throughput analysis of complex samples, Edman degradation provides unmatched precision for determining N-terminal sequences of short proteins or peptides.

Mass Spectrometry (MS) for protein sequencing

MS is a versatile and indispensable tool in proteomics and is widely used for protein sequencing and identification, enabling the detailed analysis of proteins through various advanced techniques and methodologies.

Principles of mass spectrometry for protein sequencing

MS is a powerful analytical technique that identifies and quantifies proteins by measuring the mass-to-charge (m/z) ratio of ionized molecules. The process begins with the ionization of proteins or peptides, followed by their separation in a mass analyzer according to m/z, and detection by an ion detector. The resulting mass spectrum records the intensity of each detected ion as a function of its m/z.
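To make the m/z idea concrete, the following is a minimal sketch of how the expected m/z of a protonated peptide could be computed from its sequence. The function names and the example peptide are illustrative assumptions, not part of any specific instrument software; the residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in daltons (standard reference values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056    # added once per peptide (free N- and C-termini)
PROTON = 1.007276   # mass of each charge-carrying proton


def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified linear peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def mz(sequence, charge):
    """m/z of the peptide carrying `charge` protons (positive-ion mode)."""
    return (peptide_mass(sequence) + charge * PROTON) / charge


if __name__ == "__main__":
    for z in (1, 2, 3):
        print(f"PEPTIDE, charge {z}+: m/z = {mz('PEPTIDE', z):.4f}")
```

The same peptide appears at a different m/z for each charge state, which is why the analyzer separates ions by m/z rather than by mass alone.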

Advanced techniques like tandem mass spectrometry (MS/MS) enable further fragmentation of ions, providing detailed peptide or protein sequence information. This information is matched against databases to predict complete protein sequences, making MS indispensable in proteomics.

Types of mass spectrometry used in protein analysis

MS encompasses diverse techniques that enable detailed analysis of proteins, from identifying their molecular mass to mapping post-translational modifications (PTMs) and sequencing peptide fragments.

Matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF)
Liquid chromatography-tandem mass spectrometry (LC-MS/MS)
MS/MS

Advantages of mass spectrometry

Limitations of mass spectrometry

Edman degradation for protein sequencing

Edman degradation is a precise sequencing method that determines the amino acid sequence of peptides up to about 50 residues long by removing and identifying one residue at a time, starting from the N-terminus and proceeding toward the C-terminus [1]. It is ideal for short sequences but is limited by low throughput, time intensity, and inability to analyze blocked or complex proteins.

Principles and process of Edman degradation

Edman degradation identifies the N-terminal amino acid sequence of proteins or peptides through sequential chemical reactions and identification steps. The process begins with the reaction of the N-terminal amino group with phenylisothiocyanate (PITC) under mildly alkaline conditions to form a phenylthiocarbamoyl (PTC) derivative of the protein.

This derivative is then treated with anhydrous trifluoroacetic acid, which cleaves off the first amino acid as an anilinothiazolinone (ATZ) derivative and leaves the rest of the chain intact. The ATZ derivative is converted into a stable phenylthiohydantoin (PTH)-amino acid, which is identified using high-performance liquid chromatography (HPLC). By repeating this cycle, the sequence of amino acids is determined step by step from the protein's N-terminus.
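The cyclic nature of the method can be sketched as a simple loop. This is a conceptual model only: each iteration stands in for one full round of coupling, cleavage, conversion, and HPLC identification, and the function name is an illustrative assumption.

```python
def edman_degradation(peptide, max_cycles=None):
    """Yield (cycle, released_residue, remaining_peptide) for each cycle.

    Each cycle models removal and identification of the current
    N-terminal residue; the shortened chain enters the next cycle.
    """
    remaining = peptide
    cycle = 0
    while remaining and (max_cycles is None or cycle < max_cycles):
        cycle += 1
        released, remaining = remaining[0], remaining[1:]
        yield cycle, released, remaining


if __name__ == "__main__":
    for cycle, residue, rest in edman_degradation("GASPV"):
        print(f"cycle {cycle}: identified {residue}, remaining chain: {rest or '(none)'}")
```

Reading the released residues in cycle order reconstructs the sequence from the N-terminus onward.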

Edman degradation is especially useful for analyzing proteins without database information, identifying mutations, or studying proteins with unknown sequences.

Advantages and limitations

Edman sequencing offers high accuracy in determining the N-terminal amino acid sequence, making it ideal for single-chain peptides and short sequences. It directly identifies amino acids without relying on databases, providing precise results even for novel or modified proteins.

The method is inefficient for analyzing long or complex protein sequences due to its low throughput and time-intensive process (approximately 45 minutes per residue). Additionally, it cannot sequence proteins with blocked N-terminal groups or handle simultaneous analysis of multiple proteins effectively.
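A rough back-of-the-envelope sketch shows why long sequences are impractical. The run time uses the 45 minutes per residue quoted above; the per-cycle efficiency of 0.98 is a purely hypothetical value chosen for illustration, not a measured figure.

```python
MINUTES_PER_CYCLE = 45  # per-residue figure quoted in the text


def edman_run_estimate(n_residues, cycle_efficiency=0.98):
    """Return (hours, in_phase_fraction) for sequencing n_residues.

    cycle_efficiency is an assumed fraction of chains that react
    correctly in each cycle; in_phase_fraction is the share of chains
    still in step with the expected residue after the final cycle.
    """
    hours = n_residues * MINUTES_PER_CYCLE / 60
    in_phase_fraction = cycle_efficiency ** n_residues
    return hours, in_phase_fraction


if __name__ == "__main__":
    for n in (10, 30, 50):
        hours, fraction = edman_run_estimate(n)
        print(f"{n} residues: about {hours:.1f} h, {fraction:.0%} of chains in phase")
```

Under these assumptions, 50 residues take 37.5 hours, and the usable signal shrinks with every cycle because small losses compound.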

Comparison of MS and Edman degradation

When comparing MS and Edman degradation, their strengths and limitations vary significantly across sample requirements, precision, sensitivity, suitability for protein complexity, cost, accessibility, and technical constraints.

Sample requirements, precision, and sensitivity

MS requires smaller sample amounts and is highly sensitive, capable of detecting proteins at attomole levels. It provides exceptional precision for identifying and quantifying proteins, including PTMs. In contrast, Edman degradation requires larger and highly purified samples, making it less suitable for analyzing complex mixtures but offering unmatched precision for short, linear sequences.

Suitability for complex vs. simple protein sequences

MS is ideal for analyzing complex protein mixtures, providing comprehensive data on molecular mass, modifications, and sequences through high-throughput capabilities. Its ability to handle large-scale proteomics studies makes it invaluable for complex samples. Edman degradation, however, excels in sequencing simple proteins or peptides with unblocked N-termini but struggles with long sequences (more than about 50-70 amino acids) and with blocked or highly complex proteins.

Cost, accessibility, and technical limitations

MS is cost-effective on a per-analysis basis, with relatively low reagent and instrument costs per analysis compared to Edman degradation. However, the initial setup and maintenance of MS instruments can still be prohibitively expensive for smaller facilities. Edman degradation is significantly more expensive per analysis and requires specialized equipment not commonly available in all laboratories. In terms of accessibility, MS is more commonly found in research labs due to its broader utility across applications and its compatibility with automated systems.

Technically, MS is limited by complex spectrum interpretation and its reliance on database matches, which can hinder de novo sequencing. Edman degradation provides accurate N-terminal sequencing but is limited to unmodified peptides with free N-terminus and has difficulty with peptides longer than 50-70 residues. Additionally, Edman degradation is a slower process, taking hours to a day per peptide, whereas MS can analyze thousands of samples in a day.

Advances in protein sequencing techniques

Next-generation protein sequencing (NGPS), single-molecule analysis, and AI-powered tools are advancing proteomics by allowing more accurate decoding of protein structures, PTMs, and cellular complexity.

AQC (amino acid and protein detection reagent, ab145409) is designed for precise amino acid and protein sequence analysis using HPLC with fluorescence detection.

Emerging NGPS technologies

NGPS and single-molecule protein sequencing are transformative biotechnological advancements that enable precise protein analysis, revolutionizing therapeutics, diagnostics, and proteomics by providing unparalleled insights into cellular and molecular complexity.

NGPS and its impact on biotechnology

NGPS is transforming biotechnology by enabling direct analysis of protein sequences and PTMs, which nucleotide sequencing cannot capture. Utilizing advancements in MS and bioinformatics, NGPS provides precise de novo sequencing, particularly valuable when nucleic acid material is scarce or degraded. This protein sequencing technology complements traditional DNA sequencing, enriching discovery pipelines in therapeutics, diagnostics, and reagent development [2].

NGPS accelerates drug discovery by characterizing proteins at the amino acid level and identifying novel therapeutic antibodies for patentable biologics. In diagnostics, NGPS allows the sequencing of complex polyclonal antibodies, aiding in the creation of stable, specific reagents. Its ability to decode hypervariable regions, such as antibody complementarity-determining regions (CDRs), offers important insights for disease monitoring and therapeutic development.

Single-molecule protein sequencing

Single-molecule protein sequencing represents a groundbreaking advance in proteomics, enabling the analysis of individual protein molecules with unprecedented precision. Unlike traditional methods that require bulk sample processing, this approach identifies amino acid sequences and PTMs directly [3].

Emerging technologies, such as nanopores, fluorosequencing, and MS adaptations, make it possible to analyze proteins at the single-molecule level, revealing insights inaccessible through transcriptomics. Nanopore-based methods detect changes in ionic current as individual amino acids pass through a pore, while fluorosequencing leverages fluorescent probes to identify amino acids in sequential cycles. MS adaptations allow precise single-molecule ionization and fragmentation, pushing the boundaries of sensitivity and resolution.

This innovation holds transformative potential for understanding cellular heterogeneity, mapping proteoforms, and improving diagnostics by detecting proteins in minute or complex biological samples.

Advances in high-resolution MS

Advances in high-resolution MS have significantly enhanced pharmaceutical analysis by improving resolution, speed, mobility, and accuracy. High-resolution MS enables precise compound identification through accurate mass measurements and isotope patterns, offering greater confidence in analyte characterization. Its ability to analyze complex molecules and provide detailed data supports drug discovery, development, and quality control processes. With its superior performance compared to low-resolution systems, high-resolution MS is transforming laboratory workflows.

MS/MS and de novo sequencing

MS/MS is a key analytical method in proteomics, enabling the fragmentation of precursor ions and the analysis of resulting fragments to determine peptide and protein sequences. De novo sequencing is used to interpret MS/MS data without a reference database, essential for studying proteins from unknown genomes, novel splice variants, or antibodies. MS/MS workflows include bottom-up proteomics, which analyzes digested peptides, and top-down proteomics, which focuses on intact proteins, revealing PTMs and isoforms.
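In bottom-up workflows, the digestion step can be simulated in silico. The sketch below assumes trypsin as the protease (the text does not name one) and applies the commonly used rule that trypsin cleaves after lysine (K) or arginine (R) unless the next residue is proline (P). The function name and example sequence are illustrative.

```python
def tryptic_digest(sequence):
    """Split a protein sequence into peptides using a simplified trypsin rule.

    Assumption: cleavage occurs C-terminal to K or R, except when the
    following residue is P. Missed cleavages are not modeled.
    """
    peptides = []
    start = 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Keep the C-terminal peptide that does not end in K or R.
        peptides.append(sequence[start:])
    return peptides


if __name__ == "__main__":
    print(tryptic_digest("MKWVTFRPSLKAG"))
```

Each resulting peptide is then analyzed by MS/MS, and the identified peptides are assembled back into protein-level information.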

Software, AI, and machine learning tools in de novo sequencing

Software tools and machine learning algorithms play an important role in de novo sequencing, enabling precise analysis of MS/MS data. Some tools utilize advanced algorithms to decode peptide sequences directly from MS/MS data. AI-driven platforms leverage probabilistic modeling and learning techniques to improve sequence prediction and accuracy.
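The core idea behind de novo interpretation can be shown with a deliberately simplified sketch: the mass difference between consecutive fragment peaks is matched to a residue mass. It assumes a complete, noise-free series of singly charged b-ions, which real spectra rarely provide; the function name, the tolerance, and the reduced mass table are illustrative assumptions.

```python
# Monoisotopic residue masses (Da); a reduced table for illustration.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "L": 113.08406, "I": 113.08406,
}
PROTON = 1.007276


def read_b_ion_ladder(b_ions, tolerance=0.01):
    """Infer residues from mass gaps between consecutive b-ion peaks.

    Returns one entry per peak: the matching residue, several residues
    joined by '/' if they cannot be told apart by mass, or '?' if no
    residue in the table fits the gap.
    """
    residues = []
    previous = PROTON  # an "empty" b-ion carries only the proton
    for peak in sorted(b_ions):
        gap = peak - previous
        matches = [aa for aa, mass in RESIDUE_MASS.items()
                   if abs(mass - gap) <= tolerance]
        residues.append("/".join(matches) if matches else "?")
        previous = peak
    return residues


if __name__ == "__main__":
    peaks = [58.0287, 129.0658, 216.0979, 329.1819]
    print(read_b_ion_ladder(peaks))
```

Leucine and isoleucine have identical masses, so this approach reports them as an ambiguous pair. Production tools also have to cope with missing peaks, noise, and mixed ion types, which is where probabilistic and machine learning models come in.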

Applications of protein sequencing in biotechnology

Protein sequencing underpins advancements in drug discovery, genetic engineering, diagnostics, bioinformatics, and structural biology by enabling precise protein analysis, uncovering biomarkers, guiding biological design, and facilitating the study of molecular functions and structures.

Drug discovery and development

Protein sequencing plays a vital role in drug discovery and development by revealing the primary structure of target proteins, enabling the design of drugs that interact specifically with these molecules. It facilitates the identification of therapeutic proteins, such as monoclonal antibodies, and supports the development of biologics and small-molecule drugs for targeted treatment of diseases like cancer.

Genetic engineering and synthetic biology

Protein sequencing plays an important role in genetic engineering and synthetic biology by determining precise amino acid sequences, which guide the creation and refinement of biological parts and systems. This information enables researchers to design new biological functionalities and develop predictive models for more efficient bioengineering.

Clinical diagnostics and biomarker discovery

Protein sequencing enables the identification of novel biomarkers by analyzing protein variants and proteoforms, aiding early disease detection and progression monitoring for conditions like cancer and Alzheimer's disease. This approach improves clinical diagnostics by uncovering disease-specific protein signatures, paving the way for reliable diagnostic tests and targeted therapies.

Bioinformatics and structural biology

Protein sequencing in bioinformatics enables the analysis of vast protein data, helping to decode their functions, interactions, and evolutionary significance. In structural biology, protein sequencing provides the foundation for understanding three-dimensional structures, aiding in the study of molecular mechanisms and drug design.

Integration of protein sequencing with other technologies

Advances in omics technologies and computational tools have transformed our ability to analyze and interpret complex biological data, paving the way for groundbreaking discoveries and applications.

Combining protein sequencing with genomics and transcriptomics

Combining protein sequencing with genomics and transcriptomics provides a comprehensive approach to understanding complex biological systems and improving diagnostic accuracy for genetic disorders. For instance, in Mendelian diseases, this integration enables the identification of pathogenic genetic variants, validation of their functional impacts through RNA expression and protein analysis, and discovery of novel disease-associated genes, significantly enhancing the resolution of unsolved cases.

High-throughput sequencing and bioinformatics

High-throughput sequencing, combined with bioinformatics, revolutionizes the analysis of biological systems by enabling the rapid and large-scale sequencing of DNA, RNA, and proteins. Bioinformatics processes this extensive data using advanced algorithms, databases, and computational tools to predict functional roles, identify patterns, and integrate experimental findings, facilitating applications in genomics, proteomics, and systems biology.

Challenges and future directions in protein sequencing

Current challenges in protein sequencing encompass technical limitations, the complexity of data handling, and issues with protein modifications and variability, while future advancements focus on improving accuracy, efficiency, and scalability through innovative technologies and methodologies.

Current challenges in protein sequencing

Protein sequencing and analysis are fraught with challenges stemming from both technical limitations and the inherent complexity of proteins and their modifications.

Technical limitations and handling data complexity

Protein sequencing faces several technical barriers, including the need for highly sensitive tools to accurately detect and sequence low-abundance proteins within complex biological samples. The immense volume of data generated during high-throughput sequencing requires sophisticated bioinformatics pipelines to process, interpret, and integrate findings. Additionally, ensuring reproducibility and consistency across experiments remains a challenge due to variations in sequencing platforms and protocols.

Issues with PTMs and protein variability

PTMs, such as phosphorylation, acetylation, and glycosylation, significantly expand protein diversity but also complicate their analysis. These modifications are often transient and occur in low stoichiometry, making their detection difficult. Furthermore, protein variability stemming from alternative splicing, isoform expression, and environmental factors adds another layer of complexity to comprehensive protein profiling.
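In MS data, a PTM typically shows up as a characteristic mass shift relative to the unmodified peptide. The following sketch matches an observed shift against a few well-known monoisotopic values; the function name and tolerance are assumptions, and a single HexNAc unit is used here as just one example of glycosylation, which in practice spans many different glycan masses.

```python
# Monoisotopic mass shifts in daltons for a few common modifications.
PTM_SHIFT = {
    "phosphorylation": 79.96633,
    "acetylation": 42.01057,
    "glycosylation (single HexNAc)": 203.07937,
}


def explain_mass_shift(theoretical_mass, observed_mass, tolerance=0.01):
    """Return the names of PTMs whose mass shift explains the difference.

    An empty list means no single modification in the table fits.
    """
    shift = observed_mass - theoretical_mass
    return [name for name, delta in PTM_SHIFT.items()
            if abs(shift - delta) <= tolerance]


if __name__ == "__main__":
    unmodified = 799.35994  # neutral mass of the example peptide PEPTIDE
    print(explain_mass_shift(unmodified, 879.32627))
    print(explain_mass_shift(unmodified, 841.37051))
```

A mass shift alone does not reveal which residue carries the modification; localizing it requires fragment-level evidence, and low-stoichiometry modifications may not be detected at all.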

Future improvements in sequencing methods

Future advancements in protein sequencing methods aim to address current challenges in accuracy, efficiency, and scalability. Improvements in nanopore technology, such as thinner membranes and better electrophoretic controls, could enhance amino acid resolution and throughput. Recognition tunneling may benefit from advancements in probe sensitivity and integration with exopeptidase microreactors for real-time sequencing of peptide fragments.

Enhanced labeling techniques and engineered enzymes could increase the efficiency of Edman degradation, reducing sequencing errors and reaction time. ClpXP-based methods could see gains through improved protein tagging and microfluidic integration for parallel processing. Innovations in fluorosequencing, particularly for identifying low-abundance proteins, could expand its clinical and diagnostic applications. Finally, microfluidic platforms offer the potential for high-throughput, single-molecule analysis, facilitating the study of heterogeneous and scarce protein samples.

FAQs

What are the key differences between DNA sequencing and protein sequencing?

DNA sequencing identifies the precise order of nucleotide bases (A, T, C, G) in a DNA molecule, leveraging methods like PCR for amplification and high-throughput sequencing technologies for parallel processing. In contrast, protein sequencing determines the linear arrangement of amino acids in a protein, which is more complex due to post-translational modifications (PTMs), diverse chemical properties of amino acids, and the lack of an amplification mechanism akin to PCR.

What are the ethical considerations in protein sequencing and biotechnology research?

Ethical considerations in protein sequencing and biotechnology research include protecting individual privacy, ensuring informed consent, and addressing the potential misuse of sensitive data, such as identifiable health information. Additionally, issues like equitable access to benefits, responsible handling of incidental findings, and balancing innovation with regulatory compliance highlight the need for ethical frameworks and international collaboration.

What are the common applications for protein sequencing in industry and research?

Protein sequencing is widely used in drug development, recombinant protein synthesis, and functional genomics to understand protein structure, function, and interactions. It also aids in diagnosing genetic diseases, studying protein folding, analyzing evolutionary relationships, and characterizing proteins for industrial and therapeutic applications.

References

  1. Dewangan, Y., Berdimurodov, E., Verma, D. K. Amino acids: Classification, synthesis methods, reactions, and determination. Handbook of biomolecules. 3-23 (2023). https://doi.org/10.1016/B978-0-323-91684-4.00015-3
  2. Bennett, H. M., Stephenson, W., Rose, C. M., et al. Single-cell proteomics enabled by next-generation sequencing or mass spectrometry. Nature Methods 20, 363-374 (2023).
  3. Alfaro, J. A., Bohländer, P., Dai, M., et al. The emerging landscape of single-molecule protein sequencing technologies. Nature Methods 18, 604-617 (2021).