1. Introduction
Metagenomic sequencing has emerged as a revolutionary tool in microbial diagnostics because it can profile all DNA present in a sample without prior knowledge of which organisms might be present. This capability is especially valuable in clinical settings where patients present with infections that are difficult to diagnose through traditional culture-based or targeted molecular methods. However, one pervasive practical challenge stands in the way of realizing the full potential of metagenomics: the overwhelming abundance of host DNA in clinical specimens. In human-derived samples such as blood, cerebrospinal fluid, sputum, synovial fluid, or tissue biopsies, human DNA often outnumbers microbial DNA by orders of magnitude (Chiu & Miller, 2019). Because the human genome (~3.2 gigabases) is roughly a thousand times larger than the average bacterial genome (~3.6 megabases), even small numbers of host cells can dominate sequencing data (Chiu & Miller, 2019). The result is wasted sequencing capacity, higher costs, reduced sensitivity for pathogen detection, and an increased computational burden during data analysis.
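To make the genome-size argument concrete, the following minimal sketch estimates what fraction of the DNA in a sample (and hence, approximately, of the sequencing reads) derives from the host for a given ratio of host to bacterial cells. The two genome sizes come from the text above; the diploid assumption, the function name, and the example cell ratio are illustrative assumptions, and the calculation ignores extraction bias and extracellular DNA.

```python
HUMAN_GENOME_BP = 3.2e9      # haploid human genome size quoted in the text
BACTERIAL_GENOME_BP = 3.6e6  # average bacterial genome size quoted in the text


def host_dna_fraction(host_cells, bacterial_cells, host_ploidy=2):
    """Fraction of total DNA (in base pairs) contributed by host cells.

    Assumes each host cell carries `host_ploidy` genome copies and each
    bacterial cell carries one; both are simplifying assumptions.
    """
    host_bp = host_cells * host_ploidy * HUMAN_GENOME_BP
    microbial_bp = bacterial_cells * BACTERIAL_GENOME_BP
    total_bp = host_bp + microbial_bp
    if total_bp == 0:
        raise ValueError("at least one cell is required")
    return host_bp / total_bp


if __name__ == "__main__":
    # Hypothetical sample: 1 human cell for every 100 bacterial cells
    fraction = host_dna_fraction(host_cells=1, bacterial_cells=100)
    print(f"host DNA fraction: {fraction:.3f}")
```

Under these assumptions, even a sample in which bacteria outnumber human cells 100 to 1 would still be roughly 95% human DNA by base pairs.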
Researchers conducting systematic reviews and meta-analyses of the metagenomic sequencing literature consistently identify host DNA as a critical bottleneck. Sequencing reads dominated by host DNA can obscure low-abundance pathogen signals, making it difficult, and sometimes impossible, to detect clinically relevant organisms, especially in samples with low microbial biomass (Simner, Miller, & Carroll, 2018). This review synthesizes over a decade of research on why host depletion matters, the strategies available to achieve it, and the limitations that must still be overcome to improve clinical metagenomic workflows.
Host DNA inundation is not merely a technical nuisance; it is central to the sensitivity and reliability of metagenomic diagnostics. Even with deep sequencing, if the majority of reads derive from the patient's genome, the effective coverage of microbial genomes shrinks dramatically (Wilson et al., 2019). In severe cases where microbial content is low, such as early infection or after antibiotic treatment, pathogen DNA can be buried amidst billions of human reads (Hasan et al., 2016). Computational filtering can remove human reads after sequencing, but this is inherently inefficient: for every microbial read recovered, sequencing capacity has already been spent on host reads that add no diagnostic value, driving up both cost and turnaround time (Simner et al., 2018).
To counter this problem at its source, a range of laboratory techniques has been developed. These methods aim to enrich microbial DNA or deplete host DNA before sequencing, and in some cases before DNA extraction.
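The dilution of microbial coverage by host reads described above can be estimated with simple arithmetic. The sketch below is illustrative only: it assumes that every non-host read comes from a single microbial genome and that reads are uniformly distributed, and the function names and example numbers are hypothetical rather than taken from any cited study.

```python
def microbial_coverage(total_reads, read_length_bp, host_fraction, genome_size_bp):
    """Expected mean depth on one microbial genome.

    Simplifying assumption: all non-host reads derive from that genome.
    """
    if not 0.0 <= host_fraction <= 1.0:
        raise ValueError("host_fraction must be between 0 and 1")
    microbial_reads = total_reads * (1.0 - host_fraction)
    return microbial_reads * read_length_bp / genome_size_bp


def reads_per_microbial_read(host_fraction):
    """Total reads that must be sequenced to obtain one microbial read."""
    if not 0.0 <= host_fraction < 1.0:
        raise ValueError("host_fraction must be in [0, 1)")
    return 1.0 / (1.0 - host_fraction)


if __name__ == "__main__":
    # Hypothetical run: 10 million reads of 150 bp against a 3.6 Mb genome
    for host_fraction in (0.99, 0.90):
        depth = microbial_coverage(10_000_000, 150, host_fraction, 3.6e6)
        cost = reads_per_microbial_read(host_fraction)
        print(f"host={host_fraction:.2f}  depth={depth:.1f}x  "
              f"reads per microbial read={cost:.0f}")
```

With these illustrative numbers, a run that is 99% host yields only about 4x depth on the microbial genome, whereas lowering the host fraction to 90% raises the depth roughly tenfold for the same sequencing effort.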
Physical separation techniques exploit fundamental differences between host cells and microbes. For example, filtration through pores sized at 0.2–0.45 µm can remove larger host cells while allowing smaller bacteria and viruses to pass through (Yang et al., 2018). Differential centrifugation further enriches microbial particles based on size and density. While useful, these approaches are imperfect. Some microbes, particularly those that are larger or form aggregates, can be lost along with host cells, and extracellular DNA released from lysed host cells is not removed by size-based methods alone (Horz et al., 2010).
Structural differences in cell walls and membranes enable more selective depletion strategies. Host cells, with relatively fragile plasma membranes, can be lysed using reagents such as saponin or by osmotic shock, whereas most microbes, protected by rigid cell walls, remain intact (Hasan et al., 2016; Fittipaldi, Nocker, & Codony, 2012). Once host cells are lysed, nucleases such as Benzonase or DNase I digest the liberated host DNA into fragments too small to be sequenced effectively (Hasan et al., 2016).
An alternative is propidium monoazide (PMA), a membrane-impermeable DNA intercalator. Upon photoactivation, PMA covalently modifies DNA that is not protected by an intact membrane, including free host DNA, preventing its amplification and sequencing (Fittipaldi et al., 2012). This approach enriches for DNA still contained within intact microbial cells, improving downstream detection sensitivity.
A fundamental biological difference between eukaryotic and prokaryotic DNA lies in methylation. In human DNA, CpG sites are frequently methylated to regulate gene expression. Commercial kits exploit this by using methyl-CpG binding domain (MBD) proteins or other affinity reagents to selectively bind and remove methylated vertebrate DNA from the pool (Bird, 1986; Yeoh, 2021). Because microbial genomes have distinct methylation patterns, this method can enrich for non-host DNA.
Restriction enzymes that recognize methylation motifs can also be deployed. For example, DpnI selectively cleaves adenine-methylated GATC sequences, which are prevalent in many bacterial genomes, while largely ignoring human DNA (Di Cenzo & Finan, 2017). In this manner, microbial DNA can be distinguished and separated from host sequences after extraction.
CRISPR/Cas systems, renowned for their gene‑editing capabilities, can be repurposed for host DNA depletion. Guide RNAs can be designed to target abundant human repetitive elements — such as Alu sequences — which constitute a large fraction of the human genome (Carpenter et al., 2018). Cas nucleases then cleave these targeted host sequences, allowing them to be preferentially degraded or filtered out before sequencing.
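As a rough illustration of how candidate guide targets might be located in a repetitive element, the sketch below scans a sequence for 20-nucleotide protospacers followed by an NGG motif. This assumes a Cas9-style nuclease with an NGG PAM, which the text above does not specify; the function name and the toy sequence are hypothetical, only the forward strand is searched, and real guide design would also need to account for off-target effects on microbial genomes.

```python
import re


def find_ngg_sites(sequence, guide_length=20):
    """Return (start, protospacer, pam) for each forward-strand site.

    A site is `guide_length` bases immediately followed by an NGG PAM.
    A lookahead is used so that overlapping sites are all reported.
    """
    sequence = sequence.upper()
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % guide_length)
    return [(m.start(), m.group(1), m.group(2))
            for m in pattern.finditer(sequence)]


if __name__ == "__main__":
    # Toy sequence for illustration only; not a real Alu element
    toy_repeat = "GGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCA"
    for start, protospacer, pam in find_ngg_sites(toy_repeat):
        print(start, protospacer, pam)
```

In practice, sites that recur across many copies of a repeat family would be the most attractive targets, because a small number of guides could then cleave a large share of host fragments.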
Nanopore sequencing platforms have enabled a real-time approach known as selective sequencing. As DNA strands pass through nanopores, their electrical current signatures can be mapped to reference genomes in real time. If a read matches human DNA, software such as Readfish or UNCALLED can signal the sequencer to reverse the voltage across that pore and eject the strand, conserving sequencing capacity for non-host DNA (Loose, Malla, & Stout, 2016; Charalampous et al., 2019).
Microfluidic technologies partition individual DNA molecules into ultra-small droplets, enabling highly controlled whole-genome amplification (WGA) for extremely low biomass samples. These droplets reduce amplification bias and decrease the risk of contamination compared to standard macroscale methods (Anscombe et al., 2018; Abate et al., 2013).
Despite these advances, every depletion strategy has limitations. Selective lysis protocols, for example, can inadvertently destroy sensitive microbes along with host cells. Organisms such as Mycoplasma or certain parasites that lack robust cell walls may be lost, biasing the detected community (Hasan et al., 2016).
Multiple rounds of lysis, centrifugation, or enzymatic treatment inevitably result in some loss of DNA. In already low biomass clinical samples, such as vitreous fluid or cerebrospinal fluid, this loss can be catastrophic, reducing the available microbial DNA to levels below detection (Nelson et al., 2019).
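The cumulative effect of sample loss across processing steps can be illustrated with a short calculation: if each step retains only a fraction of the microbial DNA, the overall recovery is the product of those fractions. The sketch below is a simplification that treats steps as independent; the function names and per-step yields are hypothetical and not drawn from the cited studies.

```python
from math import prod


def cumulative_recovery(step_yields):
    """Overall fraction of DNA retained after all steps (product of yields)."""
    for step_yield in step_yields:
        if not 0.0 <= step_yield <= 1.0:
            raise ValueError("each yield must be between 0 and 1")
    return prod(step_yields)


def remaining_copies(input_copies, step_yields):
    """Expected genome copies left after processing."""
    return input_copies * cumulative_recovery(step_yields)


if __name__ == "__main__":
    # Hypothetical yields for lysis, centrifugation, and nuclease treatment
    yields = [0.8, 0.7, 0.9]
    print(f"overall recovery: {cumulative_recovery(yields):.3f}")
    print(f"copies left from 1000: {remaining_copies(1000, yields):.0f}")
```

With these illustrative yields, about half of the starting material is lost after three steps, which matters most when the sample began with only a few hundred genome copies.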
Not all microbes exhibit methylation patterns that differ cleanly from those of humans. Some bacteria and fungi share methylation features with their hosts, making them difficult to enrich through methylation-based methods (Fong et al., 2020).
Host DNA depletion is a foundational component of clinical metagenomic sequencing. Without it, the power of metagenomics to detect pathogens directly from patient specimens is severely limited. Through strategies spanning physical separation, selective lysis, enzymatic degradation, methylation-based differentiation, and emerging real-time sequencing technologies, researchers have developed a sophisticated toolbox to tackle this challenge. Yet each method carries its own limitations, and no single approach works universally across all sample types or clinical scenarios. As both sequencing technologies and analytical methods advance, integrating multiple host depletion strategies, tailored to specific clinical samples and pathogens, is likely to improve diagnostic sensitivity, reduce bias, and lower costs. Continued collaboration between clinicians, microbiologists, and engineers will be essential to refine these strategies and unlock the full potential of metagenomic sequencing for patient care.