Tag: consciousness

  • Building Brains from Polymers: The Quiet Revolution in Organic Neuromorphic Computing

    In Altered Carbon the author wrote,

    For all that we have done, as a civilization, as individuals, the universe is not stable, and nor is any single thing within it. Stars consume themselves, the universe itself rushes apart, and we ourselves are composed of matter in constant flux. Colonies of cells in temporary alliance, replicating and decaying and housed within, an incandescent cloud of electrical impulse and precariously stacked carbon code memory. This is reality, this is self knowledge, and the perception of it will, of course, make you dizzy.

    Of all the colonies of cells, neurons are the strangest. My first love in all of mathematical biology was the Hodgkin-Huxley model.

    Every time you read a book, add a pinch of smoked paprika to your sauce, or yell at your kids, your brain performs a bit of computational magic. It processes vast streams of sensory data and makes split-second decisions, all on roughly 20 watts of power. You need less power than a dim light bulb at the end of a seedy alley in a Batman movie.

    Modern AI, by contrast, can demand megawatts. Training a single large language model can consume more electricity than a small town uses in a year. Yeah, you are directly causing global warming by brainstorming with your Claude on how to get away with murder, or escape to Timbuktu.

    This staggering mismatch has forced engineers to ask a basic question: what if we stopped trying to simulate the brain on conventional hardware and instead built hardware that works the way the human brain actually does?

    This is where the dream of neuromorphic computing comes in. And in the last three years, a wave of breakthroughs, many originating from labs in China, has brought us closer to answering that question than I had realized. So I thought I would write about it and spread my ignorance.

    The von Neumann Bottleneck: Why silicon and your grey matter differ

    Every conventional computer, from your phone to a data center, is built on the von Neumann architecture proposed in 1945. Its defining feature is a rigid separation between the processor (which computes) and memory (which stores). Data must constantly shuttle back and forth between the two over a shared bus. Obviously this is a mighty fine design. We all know that von Neumann was an alien whose mathematical super-intelligence makes your brain and mine look like a toaster next to a Voyager. Nevertheless, the von Neumann architecture creates an inherent bottleneck between data flow and data processing.

    Your brain has no such bottleneck. Its roughly $10^{12}$ neurons and $10^{15}$ synapses perform processing and storage in the same physical location. When a synapse strengthens through repeated communication, a process called long-term potentiation, it is simultaneously computing and remembering. There is no bus: computation is memory.

    This architectural difference has profound consequences. The brain operates at frequencies around 10 Hz — millions of times slower than a modern CPU clock — yet outperforms supercomputers on tasks like real-time sensory integration. It achieves this through massive parallelism and extreme energy efficiency. A single synaptic event consumes on the order of femtojoules.

    Neuromorphic engineering aims to close this gap by building physical hardware that mimics these principles.

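    A summary table may be useful to anchor it:

    | Compute system                 | Energy per synaptic event (or equivalent operation) |
    |--------------------------------|------------------------------------------------------|
    | Biological synapse             | ~1–10 femtojoules ($10^{-15}$ J)                     |
    | Best OECT artificial synapse   | ~femtojoules to picojoules                           |
    | GPU (floating-point multiply)  | ~picojoules to nanojoules                            |
    | Full LLM inference per token   | ~joules (whole system)                               |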
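
    As a rough sanity check on the biological row, the round numbers quoted earlier (about 20 watts, about $10^{15}$ synapses, activity around 10 Hz) can be combined in a back-of-the-envelope estimate. This is only a sketch: it assumes every synapse fires at the same 10 Hz rate and that the entire power budget goes to synaptic events, neither of which is literally true.

```python
# Back-of-the-envelope estimate using the rough figures quoted in the text.
BRAIN_POWER_W = 20.0   # ~20 watts for the whole brain
NUM_SYNAPSES = 1e15    # ~10^15 synapses
MEAN_RATE_HZ = 10.0    # ~10 Hz, assumed here to apply to every synapse

events_per_second = NUM_SYNAPSES * MEAN_RATE_HZ
energy_per_event_j = BRAIN_POWER_W / events_per_second
energy_per_event_fj = energy_per_event_j * 1e15  # joules -> femtojoules

print(f"{energy_per_event_fj:.1f} fJ per synaptic event")
```

    Under these assumptions the estimate lands at a few femtojoules per event, inside the range given for the biological synapse.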

    The first serious attempts at neuromorphic hardware used conventional silicon. Projects like IBM’s TrueNorth, Intel’s Loihi, Stanford’s Neurogrid, and the University of Manchester’s SpiNNaker demonstrated that spiking neural networks could be implemented in CMOS circuits, achieving orders-of-magnitude improvements in energy efficiency per event compared to traditional processors.

    But silicon neuromorphic chips have a fundamental problem: emulating a single synapse or neuron in CMOS typically requires more than ten transistors. This makes large-scale integration expensive, energy-hungry, and difficult to scale. Moreover, silicon is rigid, brittle, and biologically incompatible. Wouldn’t it be great if we could create direct interfaces with living tissue?

    Organic & Polymeric Electronics

    Can we make chips out of plastic? If we could, such flexible electronics could mimic real biological neurons. This has been the dream, and recent advances in n-type polymers bring it much closer to reality [1].

    Organic mixed ionic-electronic conductors (OMIECs) are polymers that conduct both electrons and ions simultaneously. This dual conductivity is the key property that makes them uniquely suited for neuromorphic hardware. Biological neurons and synapses operate through ion flow: sodium, potassium, calcium, and chloride ions crossing membranes through voltage-gated channels. OMIECs can speak the same electrochemical language.

    [Figure: the ionic conductance story of a neuronal spike, roughly]

    The organic electrochemical transistor (OECT) — a three-terminal device built from these materials — has emerged as the workhorse of organic neuromorphic electronics. A typical OECT consists of a polymer channel (the source-drain path), an electrolyte, and a gate electrode. When a voltage is applied to the gate, ions from the electrolyte migrate into the polymer channel, doping or de-doping it and changing its conductivity. The channel itself acts as an artificial synaptic cleft; the gate electrode mimics the presynaptic membrane; the drain collects the postsynaptic current.

    The physics of this process can be described as follows. In the Bernards-Malliaras model, the drain current $I_{DS}$ in the linear regime follows [2]:

    $$I_{DS} = \mu C^* \frac{W T}{L} \left( V_{TH} - V_{GS} + \frac{V_{DS}}{2} \right) V_{DS}$$

    where $\mu$ is the carrier mobility, $C^*$ is the volumetric capacitance of the channel material, $W$, $T$, and $L$ are the channel width, thickness, and length, and $V_{TH}$ is the threshold voltage. It is the equivalent of a neuron's firing threshold.

    $V_{GS}$ is the gate-to-source voltage. It controls ion injection. This is the input knob: a positive $V_{GS}$ pushes cations into the channel, de-doping it. In neuronal terms, this is the presynaptic signal, the equivalent of neurotransmitter release!

    $V_{DS}$ is the drain-to-source voltage. It drives the current along the channel, pulling carriers from source to drain, and acts as the resting membrane potential.
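
    To get a feel for the linear-regime expression, here is a minimal sketch that evaluates it directly. The function name and every numerical value below (geometry, mobility, capacitance, voltages) are illustrative assumptions, chosen only so that $\mu C^*$ comes out at 50 in the units used later in this post; they do not describe any specific device.

```python
def oect_drain_current(v_gs, v_ds, mu, c_star, width, thickness, length, v_th):
    """Linear-regime Bernards-Malliaras current:
    I_DS = mu * C* * (W * T / L) * (V_TH - V_GS + V_DS / 2) * V_DS
    Units assumed: mu in cm^2/(V s), c_star in F/cm^3, lengths in cm, voltages in V.
    """
    geometry = width * thickness / length
    return mu * c_star * geometry * (v_th - v_gs + v_ds / 2.0) * v_ds


# Hypothetical p-type, depletion-mode channel: W = 100 um, T = 100 nm, L = 10 um.
params = dict(mu=1.25, c_star=40.0, width=1e-2, thickness=1e-5, length=1e-3, v_th=0.5)

i_rest = oect_drain_current(v_gs=0.0, v_ds=-0.1, **params)   # no gate input
i_gated = oect_drain_current(v_gs=0.3, v_ds=-0.1, **params)  # positive gate voltage

print(f"I_DS at V_GS = 0.0 V: {i_rest:.3e} A")
print(f"I_DS at V_GS = 0.3 V: {i_gated:.3e} A")
```

    With these made-up numbers, the positive gate voltage shrinks the magnitude of the drain current, which is the de-doping behavior described above. The expression is only meant for small $V_{DS}$; it says nothing about saturation.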

    In contrast, a regular MOSFET has a gate insulator which is a thin oxide. Charges pile up at the surface of the semiconductor, right at the oxide interface. The gate controls a 2D sheet of charge. This is interfacial doping.

    An OECT is fundamentally different. There’s no insulator. The gate touches an electrolyte (salt water, a gel, a hydrogel), which touches the polymer channel directly. When you apply a gate voltage, ions physically migrate into the bulk of the polymer, doping or de-doping the entire volume. This is volumetric doping, which gives OECTs their hallmark properties: exceptionally high transconductance, low operating voltages (often below 1V), and a natural capacity for analog, multi-level conductance states.

    A key figure of merit for OECT channel materials is the product $\mu C^*$, which captures the combined electronic and ionic performance. It’s the carrier mobility times the volumetric capacitance.

    In neuron terms, $\mu C^*$ measures how responsive the material is:

    • Can ions get in easily? ($C^*$): analogous to how many ion channels a membrane has
    • Can the electrical consequence propagate quickly? ($\mu$): analogous to how well-myelinated the axon is

    And here is the recent advancement no one is talking about. Early polymeric materials like PEDOT:PSS achieved $\mu C^*$ values around 50 $\mathrm{F\,cm^{-1}\,V^{-1}\,s^{-1}}$. Recent glycolated polythiophenes like p(g3T2-Te) have pushed this to 483 $\mathrm{F\,cm^{-1}\,V^{-1}\,s^{-1}}$, a nearly tenfold improvement in under a decade.

    Adapted from Xiang et al. [2]

    Artificial synapses

    The most fundamental requirement for any neuromorphic device is synaptic plasticity. This is the ability to strengthen or weaken a connection based on activity, the physical basis of learning and memory.

    In biological synapses, plasticity comes in two flavors. Short-term plasticity (STP) lasts milliseconds to seconds and reflects transient neurotransmitter dynamics. Long-term plasticity (LTP) persists for hours or longer and involves structural changes at the synapse — gene expression, protein synthesis, physical remodeling of dendritic spines.

    OECT-based artificial synapses replicate both. STP arises naturally from the slow kinetics of ion injection and extraction: apply a voltage pulse to the gate, and ions dope the channel, changing its conductance. Remove the pulse, and the ions gradually diffuse back. The conductance decays to baseline over seconds. This is volatile memory, analogous to a fleeting sensory impression.
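
    The volatile behavior described above can be caricatured with a toy model: each gate pulse bumps the channel conductance, which then relaxes exponentially back to baseline. This is a sketch under loud assumptions. The single time constant `tau`, the fixed increment `dg`, and all numerical values are hypothetical, and real ion kinetics are not a clean single exponential.

```python
import math


def simulate_stp(pulse_times, t_end, dt=0.001, tau=0.5, g_base=1.0, dg=0.2):
    """Toy volatile synapse. Each pulse adds dg to the conductance, which
    then decays toward g_base with time constant tau (seconds).
    Returns the conductance sampled every dt seconds."""
    n_steps = int(round(t_end / dt))
    pulse_steps = {int(round(t / dt)) for t in pulse_times}
    decay = math.exp(-dt / tau)
    g = g_base
    trace = []
    for step in range(n_steps + 1):
        if step in pulse_steps:
            g += dg                      # ions injected by the gate pulse
        trace.append(g)
        g = g_base + (g - g_base) * decay  # ions diffuse back out
    return trace


# Two pulses 100 ms apart, then silence.
trace = simulate_stp(pulse_times=[0.1, 0.2], t_end=5.0)
print(f"peak after first pulse:  {trace[100]:.3f}")
print(f"peak after second pulse: {trace[200]:.3f}")
print(f"conductance at t = 5 s:  {trace[-1]:.3f}")
```

    The second pulse lands before the first has fully decayed, so its peak is higher (a crude form of paired-pulse facilitation), and after a few seconds the conductance is back at baseline: nothing is remembered.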

    Achieving LTP is harder and more interesting. Several strategies have emerged:

    • Electropolymerization: In a fascinating device reported by Gerasimov et al. [3], an OECT built from the polymer ETE-S exhibited evolvable synaptic behavior. They showed that low gate voltages produced conventional STP through reversible electrochemical doping. Interestingly, sustained moderate voltages instead triggered electropolymerization of additional monomer within the channel, permanently altering its conductance. The synapse literally grew new conductive pathways during operation, which is a strong parallel to biological synaptogenesis!
    • Ion-blocking layers: Inserting physical barriers (metal-organic frameworks, lipid bilayers, nanofibrous polymer coatings) between the channel and electrolyte slows ion diffusion and traps charges in place after programming. A PEDOT:PSS/PEI architecture achieved non-volatile retention exceeding 25 hours, with the electrostatic potential barrier between oxidation states preventing spontaneous discharge [4].
    • Crystallinity engineering [5]: Wang et al. built a vertical OECT whose channel has both crystalline and amorphous regions. Low gate voltages dope only the disordered amorphous zones: ions slip in easily but slip out just as fast. That is volatile behavior, giving STP. Crank the voltage higher and ions force their way into the tightly packed crystalline domains, where they get stuck. That is non-volatile behavior, giving LTP. One device, two memory modes, selected purely by how hard you push. It stores 1,024 distinct conductance states and holds them for over 10,000 seconds: a 10-bit analog memory in a single transistor!

    Artificial Neurons: Making Plastics Spike

    Building artificial neurons from organic materials is not easy. We need to replicate the complex dynamics of biological action potentials!

    The dominant circuit architecture is the leaky integrate-and-fire (LIF) model, implemented using complementary OECTs arranged as an Axon-Hillock (A-H) circuit. A membrane capacitance integrates input current. When the accumulated voltage crosses a threshold, a complementary inverter (built from paired p-type and n-type OECTs) fires a sharp output spike. A reset transistor then discharges the capacitor, restoring the resting state. The cycle repeats.
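
    The integrate, fire, and reset cycle is easy to see in a few lines of simulation. The sketch below is the abstract LIF model only, not the OECT Axon-Hillock circuit itself, and the capacitance, leak resistance, threshold, and input currents are hypothetical values picked to give firing rates of a few tens of hertz.

```python
def simulate_lif(i_in, t_end=1.0, dt=1e-4, c_mem=1e-6, r_leak=1e5,
                 v_th=0.5, v_reset=0.0):
    """Leaky integrate-and-fire neuron with forward-Euler integration.
    Membrane equation: C dV/dt = I_in - V / R.
    When V reaches v_th, a spike is recorded and V is reset."""
    v = v_reset
    spike_times = []
    for step in range(int(round(t_end / dt))):
        v += dt * (i_in - v / r_leak) / c_mem   # integrate input, leak charge
        if v >= v_th:                           # threshold crossed: fire
            spike_times.append((step + 1) * dt)
            v = v_reset                         # reset transistor discharges C
    return spike_times


quiet = simulate_lif(4e-6)    # steady-state voltage stays below threshold
slow = simulate_lif(10e-6)
fast = simulate_lif(20e-6)

print(f"4 uA  -> {len(quiet)} spikes in 1 s")
print(f"10 uA -> {len(slow)} spikes in 1 s")
print(f"20 uA -> {len(fast)} spikes in 1 s")
```

    A weak input never reaches threshold because the leak wins, and stronger inputs fire faster. That monotonic current-to-frequency mapping is what lets a sensor signal be encoded as a spike rate.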

    Early OECT neurons based on this design operated at frequencies below 2.4 Hz, which covers only the slowest biological neurons. The breakthrough came with polymer engineering.

    In a 2025 study published in PNAS, Yao et al. [6] introduced Homo-gDPPTz, a new n-type OMIEC designed to closely match the performance of its p-type complement (gDPP-g2T) in vertical OECT complementary inverters. The resulting organic electrochemical neuron achieved a tunable firing frequency spanning 0.13 to 147.1 Hz, a range that covers the spectrum from slow vasoconstrictor neurons (below 1 Hz) to fast-spiking cortical neurons (over 100 Hz). This is more than 50 times broader than any previous OECT neuron circuit. The device footprint was less than 37 $\mathrm{mm^2}$, and energy consumption was just 4.7 nanojoules per spike.

    The neuron was then integrated with pressure and strain sensors to build a complete neuromorphic perception system. Mechanical stimuli from a conductive foam sensor or a printed strain sensor were converted into input currents, which the OECT neuron transformed into frequency-modulated spike trains. These spikes were fed into an artificial synapse (also a vOECT), which modulated its postsynaptic current based on spike frequency, demonstrating the full sensing-encoding-processing loop of biological neural perception. Simulations of a spiking neural network built on these device parameters achieved 96% accuracy on handwritten digit classification.

    An even more biologically faithful architecture was recently built using the conductance-based organic electrochemical neuron (c-OECN). It emulates sodium and potassium ion channels, reproducing 15 of 20 known neuronal features, including depolarization, repolarization, hyperpolarization, and threshold-dependent firing [7]. This design was coupled to a $\mathrm{Na^+}$ sensor and used to trigger vagus nerve stimulation in mice, a fascinating demonstration of closed-loop physiological regulation driven by an organic artificial neuron!

    The Chinese Contribution: Printing Brains on Plastic

    Just as Chinese researchers have made rapid progress in efficient and open-source LLMs, they have also made astounding progress in polymeric materials and engineering of neuromorphic computing.

    Regionally controlled ion doping [8]. Li, Zhang et al. demonstrated an elegant strategy for building computing and memory on the same chip using identical materials. By controlling the thickness of inkjet-printed hydrogel electrolytes (thick, multi-layer for computing units; thin, single-layer for memory units), they created two functionally distinct OECT types from the same PEDOT:PSS channel material. The ion-rich OECT (thick electrolyte) features rapid ion transport, sub-millisecond response, and high transconductance, serving as a computing neuron. The ion-deficient OECT (thin electrolyte) has slow ion dynamics, wide hysteresis, and 300-second state retention, serving as a memory element.

    Wow! Integrating volatile computing elements and non-volatile memory elements typically requires different materials, different fabrication processes, and complex circuit architectures. But this work shows that the spatial distribution of the electrolyte alone is sufficient.

    Stretchable neuromorphic chips: Dai et al. [9] created the first intrinsically stretchable neuromorphic device using the polymer p(gT2), an organo-hydrogel electrolyte, and vertically grown gold nanowire electrodes embedded in PDMS. The device maintained over 800 distinct conductance states, switching endurance exceeding $10^8$ cycles, and state retention over $10^4$ seconds, all while being stretched to 100% strain. A 3-by-3 prototype array performed vector-matrix multiplication (the fundamental operation of neural networks) on skin, and simulated classification of ECG signals into five cardiac categories maintained approximately 90% accuracy even when the hardware was physically deformed during inference. This is the first demonstration that neuromorphic computation and mechanical stretchability can coexist without performance compromise, a prerequisite for truly wearable brain-machine interfaces.
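
    Why does an array of programmable conductances give you vector-matrix multiplication for free? Ohm's law does the multiplications and Kirchhoff's current law does the additions. The sketch below shows the idea for an idealized 3-by-3 crossbar; the conductance and voltage values are made up, and wire resistance, device non-linearity, and noise are all ignored.

```python
def crossbar_vmm(voltages, conductances):
    """Idealized crossbar read-out.
    conductances[i][j] is the device joining input row i to output column j.
    Each device passes I = G * V (Ohm's law); currents meeting on a column
    wire simply add (Kirchhoff's current law)."""
    n_cols = len(conductances[0])
    return [
        sum(v * row[j] for v, row in zip(voltages, conductances))
        for j in range(n_cols)
    ]


# Hypothetical synaptic weights stored as conductances, in siemens.
G = [
    [1.0e-6, 2.0e-6, 0.5e-6],
    [0.5e-6, 1.0e-6, 2.0e-6],
    [2.0e-6, 0.5e-6, 1.0e-6],
]
V = [0.1, 0.2, 0.0]  # input voltages on the three rows, in volts

currents = crossbar_vmm(V, G)
for j, current in enumerate(currents):
    print(f"column {j}: {current:.2e} A")
```

    The whole product is read out in one step as three column currents. The weights never move: they sit in the devices where the computation happens.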

    Why Organic polymers? The Case for Wetware

    There are so many advantages of organic neuromorphic electronics over silicon alternatives:

    Biocompatibility. OECTs operate in aqueous electrolytes at sub-volt potentials, conditions compatible with living cells and tissues. Silicon neuromorphic chips require high voltages and rigid encapsulation that damages biological interfaces. Organic devices can directly interface with neurons, epithelial cells, and biological fluids. A PEDOT:PSS OECT has been used to record dopamine release from PC-12 cells in real time, with the neurotransmitter itself modulating the synaptic weight: a true biohybrid synapse [10].

    Multimodal sensing. Because the gate of an OECT can be functionalized with enzymes, antibodies, aptamers, or chemically sensitive materials, the same device can respond to electrical, chemical, mechanical, and optical inputs. A single OECT has been shown to simultaneously process pressure, light, and neurotransmitter signals, performing multisensory integration in hardware — something that requires complex multi-chip architectures in silicon.

    Energy efficiency. OECT synapses operate at femtojoule to picojoule energy per event, comparable to biological synapses.

    Fabrication. Organic semiconductors can be printed using inkjet, screen printing, or aerosol jet deposition on flexible plastic substrates at room temperature. No cleanroom required. No vacuum deposition. No billion-dollar fab. This matters enormously for the envisioned applications: disposable biosensors, on-skin health monitors, and large-area flexible electronics.

    The hard problems that fascinate me

    Despite the rapid progress, organic neuromorphic electronics face real challenges.

    Speed. The ionic transport that gives OECTs their neuromorphic properties is inherently slower than electronic switching. The fastest OECT synapses operate at around 200 nanoseconds; the fastest neurons reach approximately 500 Hz. This is sufficient for biological interfaces but too slow for the megahertz-and-above clock rates needed for general computing. Organic neuromorphic hardware is not going to replace GPUs, and maybe it need not. There are so many amazing niche applications where speed requirements align with biological timescales.

    Stability. Organic materials degrade. PEDOT:PSS is sensitive to humidity. Many OMIECs swell excessively in aqueous environments. Long-term drift in conductance states undermines the reliability of analog memory. Cross-linking, encapsulation, and materials engineering (hydrophobic side chains, ladder polymers) are improving stability, but shelf lives of months, not decades, remain the norm. Is that bad? Well, not if they are cheap to manufacture and replace.

    Scale. The largest demonstrated organic neuromorphic arrays are still small. Achieving the thousands or millions of devices needed for practical neural networks requires advances in high-resolution patterning, device-to-device uniformity, and interconnect engineering. The inkjet printing approach is promising for this reason.

    Framework. The neuromorphic devices being built don’t map cleanly onto existing machine learning frameworks. The non-linearity and asymmetry of analog weight updates, the stochastic variability between devices, and the different timescales of volatile and non-volatile states all require rethinking algorithms from the ground up.
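
    To make the non-linearity and asymmetry concrete, here is a minimal sketch of a soft-bound update rule, one common way to caricature analog weight updates. The rule itself, the function name, and the rate constants are assumptions made for illustration; they are not a model of any device discussed above.

```python
def apply_pulse(g, direction, g_min=0.0, g_max=1.0, alpha_up=0.10, alpha_down=0.25):
    """Soft-bound conductance update.
    Non-linear: the step shrinks as g approaches the bound it is moving toward.
    Asymmetric: potentiation and depression use different rates."""
    if direction > 0:
        return g + alpha_up * (g_max - g)      # potentiating pulse
    return g - alpha_down * (g - g_min)        # depressing pulse


g_start = 0.5
g = g_start
for _ in range(5):
    g = apply_pulse(g, +1)
g_after_up = g
for _ in range(5):
    g = apply_pulse(g, -1)

print(f"start:               {g_start:.3f}")
print(f"after 5 up pulses:   {g_after_up:.3f}")
print(f"after 5 down pulses: {g:.3f}")
```

    Five pulses up followed by five pulses down do not bring the weight back to where it started. Gradient descent quietly assumes that equal and opposite updates cancel, so a training algorithm running on such hardware has to know about, or compensate for, the update curve of the devices.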

    So where are we heading?

    In the near-term future we don’t expect polymers to replace silicon. However, there is potentially a more transformative future of intelligent interfaces between the digital and biological worlds!

    Imagine a patch on your skin that continuously monitors your ECG, classifies arrhythmias in real time using on-device neuromorphic inference, and communicates only abnormalities — consuming nanowatts, conforming to your body’s movements, never needing to stream raw data to the cloud. Imagine it tells you about your stress level, biomarkers for cardiovascular health, all unobtrusively.

    Imagine neural implants that record brain activity and process it locally — filtering noise, detecting seizure precursors, triggering responsive neurostimulation — all using devices that speak the same ionic language as the neurons they interface with!

    Imagine artificial skin for prosthetic limbs that senses pressure, texture, and temperature, converts these stimuli into frequency-coded spike trains, and transmits them to peripheral nerves in a format the nervous system can natively interpret.

    These applications don’t require faster-than-silicon speed. They require biocompatibility, energy efficiency, mechanical flexibility, and the ability to process sensory information the way biology does — all areas where organic neuromorphic electronics have a structural advantage.

    The brain didn’t evolve to maximize clock speed. It evolved to survive in a noisy, unpredictable, energy-constrained environment by integrating sensing, processing, and memory into a single adaptive substrate. For the first time, we’re building electronic systems on the same design principles — using materials that bend, stretch, and operate in the wet, salty, ion-rich environment of the living body.

    The future of polymer electronics isn’t about replacing silicon. It is perhaps about performing wonders silicon can’t.

    I lay still for a while, picking up the scattered garments of my mind and trying to assemble some kind of reasonable outfit from them.

    ~Altered Carbon

    References

    1. Li, Qifan, et al. A water-processable n-type polymeric ink with conductivities exceeding 1,000 S cm⁻¹. Matter (2026).
    2. Xiang, K., Song, J., Liu, H., Chen, J. & Yan, F. Organic Electrochemical Transistors for Neuromorphic Devices and Applications. Adv. Mater. 38, e15532 (2026).
    3. Gerasimov, Jennifer Y., et al. An evolvable organic electrochemical transistor for neuromorphic applications. Advanced Science 6.7 (2019): 1801339.
    4. Van De Burgt, Yoeri, et al. A non-volatile organic electrochemical device as a low-voltage artificial synapse for neuromorphic computing. Nature Materials 16.4 (2017): 414-418.
    5. Wang, Shijie, et al. An organic electrochemical transistor for multi-modal sensing, memory and processing. Nature Electronics 6.4 (2023): 281-291.
    6. Yao, Y., Pankow, R. M., Huang, W. et al. An organic electrochemical neuron for a neuromorphic perception system. PNAS 122, e2414879122 (2025).
    7. Harikesh, Padinhare Cholakkal, et al. Ion-tunable antiambipolarity in mixed ion–electron conducting polymers enables biorealistic organic electrochemical neurons. Nature Materials 22.2 (2023): 242-248.
    8. Li, M., Zhang, W. et al. Regionally controlled ion-doping of organic electrochemical transistors for computing-memory co-integrated neuromorphic systems. NPJ Flex. Electronics 10, 11 (2026).
    9. Dai, S. et al. Intrinsically stretchable neuromorphic devices for on-body processing of health data with artificial intelligence. Matter 5, 3375-3390 (2022).
    10. Keene, Scott T., et al. A biohybrid synapse with neurotransmitter-mediated plasticity. Nature Materials 19.9 (2020): 969-973.

  • What if we could predict real-world properties of polymeric molecules from their chemical sequences alone?

    Polymeric molecules are all around us and in us. It is hardly surprising that a large fraction of life’s information-carrying molecules are polymeric, from DNA and RNA to proteins, lipids, and peptides.

    During my PhD I fell in love with polymers. (I had started my PhD work in Quantum Information but quickly switched to soft-matter physics.) I worked on the Vulcanization Transition, a second-order phase transition in which a random melt of polymers, like natural rubber, can be chemically cross-linked to form random solids. I later became fascinated by gels, glassy solids, and the deep connections of their physics to percolation theory, random-resistor networks, and the jamming transition.

    Over the years, I met another fascinating class of polymers, oligonucleotides: bits of RNA, double-stranded or single-stranded (shRNA & siRNA), and eventually bits of DNA (antisense oligonucleotides, or ASOs). Oligonucleotides are informational drugs. They carry the genetic information they are destined to modulate.

    We all witnessed the impact of another such informational medicine during Covid-19: the synthetic mRNA polymer that creates the right fragment of a protein to vaccinate. If you think about it, three of the four medicine modalities are polymers: peptides, antibodies, and nucleic acids. Small molecules are the only exception; they carry nebulous information lacking focus and interact with almost everything.

    Yet, we understand so much and so little about polymers! When I cofounded Creyon, my dream was to engineer one kind of polymer really well: oligonucleotides. These are bits of nucleic acids that are chemically modified to make them drug-like (functionalization), that can be sent to a cell or tissue and precisely control gene expression! (Isn’t it marvelous that the A, C, G, T code, a quad instead of a bit, could do that? It could manipulate the very information in genes that I need to even see this screen?) These functionalizations (chemical modifications of the base, linker, or sugar unit of the nucleic acid) could fundamentally change their biological, physical, and biochemical properties. They could make these polymers more or less viscous, soluble, serum-stable, immunotoxic, or bioavailable, sometimes modulating pharmacology measures across four orders of magnitude with a single modification on the same base sequence! We were engineering these molecules and manipulating the information in the informational drug across several axes. We learned how to make the information allele-selective and well-tolerated, with higher affinity, a higher on-rate or activity, and so on.

    Lately, I have expanded the scope of that lifelong dream of controlling information flow. The scope is not just human biology and disease, but what more sequences can do and how well we can create them. Obviously, a lot has changed in the last two years! As a society we have marveled at what AI can do when fed a large corpus of textual sequences. Who knew LLMs could get this good at writing not just text sequences but logical sequences of code?

    Polymers are just chemical sequences.

    Well, one challenge is data. Where are all the data to learn the properties of molecules? Molecules inhabit a very special world. Unlike textual sequences where correlations are hard to quantify but easy to sense, correlations in molecules follow the laws of quantum physics, easy to validate and quantify, but hard to sense by intuition.

    The bad news is we need to create these physics-faithful datasets. But the good news is that the correlations are nearsighted, a property Walter Kohn called quantum nearsightedness.

    We started dreaming that we should be able to predict physical properties, like conformations, free energy, etc., purely from the chemical sequence of polymers. As with any dream, you need good partners in crime: David Pekker and Todd Martinez.

    We tried to take the simplest first step.

    Can we predict the thermal ensemble of polymer conformations from their sequence alone?

    Well, we asked ourselves, what is a realistic system that will stress-test this unreasonable dream? We tinkered with some internal data, but settled on a large, freely available dataset of MD trajectories of peptides (the mdCATH dataset). The trajectory sampling in this data is almost certainly not ergodic, but hey, beggars can’t be choosers, right? Do you have 1-5 million GPU hours to spare? If it were fully ergodic, we would have gotten very close to computing the free energy of peptides directly from sequence. Wild, right?

    What we discovered, once we figured out a few critical things about how to include the physics the right way in Diffusion Transformers, is that we were able to predict the conformation ensemble as a function of temperature. We did some other work internally to convince ourselves we could do this for other systems too, for example for concentration dependence.

    So why care?

    Turns out, properties of polymers are driven by their conformations and free energy. Ask any peptide chemist and she will tell you that controlling the degrees of freedom (by macrocyclization) is what you do once you have a lead molecule to stare at and a glass of wine to place some educated chemical bets. Ask a nucleic acid chemist, and she will tell you that a blessed hairpin structure is the reason that an aptamer is a molecular beacon.

    But here is the inconvenient truth. Oligo-length (meaning ~10-100 monomer long) polymers (peptides included) are very often highly flexible, and it makes no sense to anchor your expectations of their properties on a single low-energy conformation. Larger proteins are probably a bit different; some of them are folded by chaperones, and it makes sense to use an AlphaFold/ESM/SimpleFold predicted single or closely related structure.
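
    A few lines of arithmetic show why a single low-energy conformation can mislead. The sketch below computes Boltzmann populations for a handful of conformers. The five energies are invented for illustration (they are not from mdCATH or from our model), and treating each conformer as a single state with no entropy of its own is a deliberate simplification.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def boltzmann_populations(energies_kcal_mol, temperature_k):
    """Equilibrium population of each conformer, p_i = exp(-E_i / kT) / Z."""
    beta = 1.0 / (K_B * temperature_k)
    e_min = min(energies_kcal_mol)  # shift energies for numerical stability
    weights = [math.exp(-beta * (e - e_min)) for e in energies_kcal_mol]
    z = sum(weights)
    return [w / z for w in weights]


# Hypothetical flexible oligomer: five conformers within 1 kcal/mol of each other.
energies = [0.0, 0.3, 0.5, 0.8, 1.0]

p300 = boltzmann_populations(energies, 300.0)
p400 = boltzmann_populations(energies, 400.0)

print("populations at 300 K:", [round(p, 3) for p in p300])
print("populations at 400 K:", [round(p, 3) for p in p400])
```

    With energy gaps comparable to the thermal energy, the lowest-energy conformer holds less than half of the population at room temperature, and its share shrinks further on heating. Any property computed from that one structure ignores most of the ensemble.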

    So what next? Well, if we can predict physical properties from sequences, I think an analogy is worth entertaining:

    If LLMs understand text, and we are increasingly fascinated by teaching LLMs the physical (Newtonian) world, what does it take for a molecular AI model like ours to understand the quantum world of molecules? How much data? What kind of “sensors” are analogous to the sensors and cameras of physical AI?

    And most importantly, where are the limits of molecular engineering? Will you laugh at me if we dream about predicting viscosity? Conductivity? If we engineer the perfect conductive polymer using such generative tools? The perfect tissue-targeting molecule? The perfect precision medicine, ready to be printed?

    Read the paper and criticize. We are just getting warmed up! Send your comments!

  • Tensor Product Attention: Curiosities abound

    A recent paper, Tensor Product Attention Is All You Need [1], grabbed my attention. Over the last year, I have been exploring ways to reinterpret the attention mechanism, mainly for my own edification. What correlations does a transformer really capture? And unsurprisingly, I have been leaning on intuition from the physics of correlated systems.

    First, the attention mechanism is often written in a mathematically confusing and redundant way in the machine learning literature. The notation is often obfuscated by implementation quirks of matrix multiplications on GPUs. So let’s set up the notation, and simplify.

    In the notes below, I will ignore position encoding. RoPE or learnable additive position encodings do not change the foundational mathematical intuitions I am trying to convey here — it is a distraction.

    I use $\ell$ for the layer index and $h$ for the head index.

    The key quantity is the residual stream, $X^\ell$. This matrix is transformed by the attention and MLP blocks. The embedding dimension $d_\textrm{model}$ is the size of the vector space in which tokens are embedded.

    We need a few other matrices to really explain what’s going on.

    Note that in ML/ AI papers the Query, Value and Key matrices are always written separately, but in essence, we are low-rank decomposing (as product of rectangular matrices) two matrices, 𝐖QK,h,𝐖OV,h\mathbf{W}_{QK}^{\ell,h} \, \, , \mathbf{W}_{OV}^{\ell,h}. This will be clear when we write attention is terms of these matrices — 

    Attn(𝐱i)=h=1Hj=1n[softmaxj(𝐱i𝐖QK,h𝐱jdhead)]𝐖OV,h𝐱j\begin{aligned} \text{Attn}^{\ell}(\mathbf{x}_i) = \sum_{h=1}^{H} \sum_{j=1}^{n} \left[ \text{softmax}_j \left( \frac{\mathbf{x}_i^\top \mathbf{W}_{\text{QK}}^{\ell,h} \mathbf{x}_{j}}{\sqrt{d_{\text{head}}}} \right) \right] \mathbf{W}_{\text{OV}}^{\ell,h} \mathbf{x}_j \end{aligned}

    The attention operator $\text{Attn}^\ell$ at layer $\ell$ is a sum over individual attention heads $h$, with $H$ heads in total. Note that here I call the operator the net function that returns a vector of the same size as $\mathbf{x}_i$; one can choose to add this back to the residual $X^\ell$. Some architectures do so, others send it through the MLP operator. There are a lot of different transformer architectures out there in the various LLMs, and for the purpose of this discussion, it’s unimportant. Moreover, the papers have a bewildering range of definitions of which part is called attention, which is why I bored you with setting up notation. You are welcome.

    Note that the number of heads and the head dimension are chosen such that we always have $d_{\text{model}} \times d_{\text{model}}$ matrices in the above expression.

    The only correlation between tokens explored in a transformer is pairwise. The MLP operator acts on the per-token embedding $\mathbf{x}_i$ and does not mix $\mathbf{x}_i$ and $\mathbf{x}_j$. In the attention operator, the $\text{softmax}_j$ term is a normalized weight, and every token embedding $\mathbf{x}_j$ in the context window is multiplied by a linear transformation matrix and summed with this weight. It is really quite simple.
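    To make the formula concrete, here is a minimal NumPy sketch of one attention layer written directly in terms of the merged matrices. The function name `attention_qk_ov`, the toy shapes and the random weights are my own illustrative assumptions; like the expression above, it has no causal mask and no position encoding.

```python
import numpy as np

def attention_qk_ov(X, W_QK, W_OV, d_head):
    """X: (n, d_model); W_QK, W_OV: (H, d_model, d_model). Returns (n, d_model)."""
    # scores[h, i, j] = x_i^T W_QK[h] x_j / sqrt(d_head)
    scores = np.einsum("ia,hab,jb->hij", X, W_QK, X) / np.sqrt(d_head)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over j
    # out[i] = sum_h sum_j weights[h, i, j] * (W_OV[h] @ x_j)
    return np.einsum("hij,hab,jb->ia", weights, W_OV, X)

rng = np.random.default_rng(0)
n, d_model, H, d_head = 5, 8, 2, 4
X = rng.normal(size=(n, d_model))

# The usual separate projections (toy random weights)
W_Q = rng.normal(size=(H, d_head, d_model))
W_K = rng.normal(size=(H, d_head, d_model))
W_V = rng.normal(size=(H, d_head, d_model))
W_O = rng.normal(size=(H, d_model, d_head))

# Merged d_model x d_model matrices, each of rank at most d_head
W_QK = np.einsum("hka,hkb->hab", W_Q, W_K)   # W_Q^T W_K
W_OV = np.einsum("hak,hkb->hab", W_O, W_V)   # W_O W_V

out = attention_qk_ov(X, W_QK, W_OV, d_head)
print(out.shape)   # (5, 8)
```

    The merged matrices are never formed in practice, since storing the rectangular factors is cheaper, but writing attention this way makes the pairwise structure explicit.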

    Well, one may wonder — why only pairwise correlations? And, why only the above functional form for pairwise correlations?

    A digression: for physicists like me, any time we see pairwise correlations, we think about the Potts model, a generalization of the perhaps better known Ising model. In the q-state Potts model the “spins” are unit vectors that point in q symmetric directions of a hypertetrahedron in q-1 dimensions, see [2]. In the classical Potts model these vectors interact only if their “spins” (states) are the same.

    Can we draw an analogy with the Potts model? Yes, of course! Well, a paper [3] already did a version of it, with a Potts model where the interactions are not restricted to the same “spins” but mix “spins”. It’s an enticing direction to study the dynamics of transformers using such mappings.

    OK, end of digression.

    The Memory Bottleneck in Modern Transformers

    Large language models face a critical scalability challenge: the Key-Value (KV) cache. During autoregressive generation, standard Multi-Head Attention (MHA) stores keys and values for all previously generated tokens, consuming memory that grows linearly with sequence length:

    $$\text{Memory}_{\text{MHA}} \sim n \times H \times d_\text{head}$$

    See the table to recall the notation. For a model with $H = 32$ and $d_\text{head} = 128$ processing an $n = 10^5$ token context, this amounts to over 800 MB each for the keys and for the values at 16-bit precision, so roughly 1.6 GB for the KV cache of a single layer!
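    The arithmetic is easy to check. This is a small sketch, assuming 2 bytes per cached value (16-bit precision); the helper name `kv_cache_bytes` is mine.

```python
def kv_cache_bytes(n_tokens, n_heads, d_head, bytes_per_value=2, n_layers=1):
    """Bytes needed to cache both keys and values for n_tokens tokens."""
    values_per_token = 2 * n_heads * d_head   # keys + values
    return n_tokens * values_per_token * bytes_per_value * n_layers

per_layer = kv_cache_bytes(n_tokens=10**5, n_heads=32, d_head=128)
print(per_layer / 1e9)       # 1.6384  (GB per layer, keys + values)
print(per_layer / 2 / 1e6)   # 819.2   (MB per layer, keys alone or values alone)
```

    Multiply by the number of layers (the `n_layers` argument) to get the footprint of a full model.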

    The fundamental question is whether we must store the full $H \times d_\text{head}$ representation for each token, or whether a more compact factorized representation can capture the essential structure with minimal information loss.

    Tensor Decompositions: A Primer

    Before diving into Tensor Product Attention (TPA), we need to understand the landscape of tensor decomposition methods. A tensor is simply a multi-dimensional array: scalars are zeroth-order tensors, vectors are first-order, matrices are second-order, and so on.

    CP Decomposition (CANDECOMP/PARAFAC)

    The most common Tensor Decomposition is probably the CP decomposition.

    Definition (CP Decomposition): A third-order tensor $\mathcal{X} \in \mathbb{R}^{I \times J \times K}$ has a rank-$R$ CP decomposition if it can be written as

    $$\mathcal{X} = \sum_{r=1}^{R} \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r$$

    where $\mathbf{a}_r \in \mathbb{R}^I$, $\mathbf{b}_r \in \mathbb{R}^J$, $\mathbf{c}_r \in \mathbb{R}^K$ and $\circ$ denotes the outer product.

    Equivalently, element-wise, for indices $i, j, k$:

    $$\mathcal{X}_{ijk} = \sum_{r=1}^{R} a_{ir} b_{jr} c_{kr}$$

    The CP decomposition represents a tensor as a sum of rank-1 tensors (outer products of vectors). This is the natural generalization of matrix SVD to higher orders, though unlike SVD, computing the optimal CP decomposition is NP-hard. Yeah, sucks, right?
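    As a quick illustration, the element-wise formula is a single `einsum`. This sketch only goes in the easy direction, from factors to tensor, with toy sizes and random factors of my choosing; it does not attempt the hard problem of finding the factors of a given tensor.

```python
import numpy as np

def cp_reconstruct(A, B, C):
    """A: (I, R), B: (J, R), C: (K, R) -> tensor of shape (I, J, K)."""
    return np.einsum("ir,jr,kr->ijk", A, B, C)

rng = np.random.default_rng(0)
I, J, K, R = 4, 5, 6, 3
A = rng.normal(size=(I, R))
B = rng.normal(size=(J, R))
C = rng.normal(size=(K, R))
X = cp_reconstruct(A, B, C)

# The same tensor as an explicit sum of R rank-1 outer products
X_sum = sum(
    np.multiply.outer(np.multiply.outer(A[:, r], B[:, r]), C[:, r])
    for r in range(R)
)
print(np.allclose(X, X_sum))                                      # True
print(X.size, "entries vs", R * (I + J + K), "factor parameters")  # 120 entries vs 45 factor parameters
```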

    Tucker Decomposition

    Another popular tensor decomposition method is the Tucker Decomposition.

    Definition (Tucker Decomposition): A Tucker decomposition factorizes a tensor into a core tensor $\mathcal{G} \in \mathbb{R}^{R_1 \times R_2 \times R_3}$ and factor matrices along each mode:

    $$\mathcal{X} = \mathcal{G} \times_1 \mathbf{A} \times_2 \mathbf{B} \times_3 \mathbf{C}$$

    where $\mathbf{A} \in \mathbb{R}^{I \times R_1}$, $\mathbf{B} \in \mathbb{R}^{J \times R_2}$, $\mathbf{C} \in \mathbb{R}^{K \times R_3}$ and $\times_n$ denotes the mode-$n$ product.

    More directly, the decomposition is

    $$\mathcal{X}_{pqr} = \sum_{i=1}^{R_1} \sum_{j=1}^{R_2} \sum_{k=1}^{R_3} \mathcal{G}_{ijk}\, \mathbf{A}_{pi}\, \mathbf{B}_{qj}\, \mathbf{C}_{rk}$$

    The Tucker decomposition generalizes CP by allowing a dense core tensor. Note that the ranks $R_1, R_2, R_3$ are bounded by the corresponding tensor dimensions $I, J, K$; a common choice is $R_1 = R_2 = R_3 = \min(I, J, K)$. When the core tensor $\mathcal{G}$ is super-diagonal (non-zero only when all indices are equal), Tucker reduces to CP.
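    The reduction to CP is easy to verify numerically. In this sketch (toy shapes and random factors are my assumptions) a super-diagonal core of ones reproduces the CP sum exactly.

```python
import numpy as np

def tucker_reconstruct(G, A, B, C):
    """G: (R1, R2, R3); A: (I, R1), B: (J, R2), C: (K, R3) -> (I, J, K)."""
    return np.einsum("ijk,pi,qj,rk->pqr", G, A, B, C)

rng = np.random.default_rng(1)
I, J, K, R = 4, 5, 6, 3
A = rng.normal(size=(I, R))
B = rng.normal(size=(J, R))
C = rng.normal(size=(K, R))

# Super-diagonal core: G[r, r, r] = 1 and zero everywhere else
G = np.zeros((R, R, R))
idx = np.arange(R)
G[idx, idx, idx] = 1.0

X_tucker = tucker_reconstruct(G, A, B, C)
X_cp = np.einsum("ir,jr,kr->ijk", A, B, C)
print(np.allclose(X_tucker, X_cp))   # True
```

    A dense core instead lets every triple of factor columns interact, at the cost of $R_1 R_2 R_3$ extra parameters.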

    Tensor Train Decomposition

    The tensor decomposition most familiar to physicists is probably the tensor train decomposition.

    Definition (Tensor Train): A tensor train (TT) or Matrix Product State (MPS) represents an order-$d$ tensor (one with $d$ indices) as a product of matrices:

    $$\mathcal{X}_{i_1, i_2, \ldots, i_d} = \mathbf{G}^{[1]}_{i_1} \mathbf{G}^{[2]}_{i_2} \cdots \mathbf{G}^{[d]}_{i_d}$$

    where $\mathbf{G}^{[k]}_{i_k} \in \mathbb{R}^{r_{k-1} \times r_k}$ with $r_0 = r_d = 1$. The parameters $\{r_1, \ldots, r_k, \ldots, r_{d-1}\}$ are called bond dimensions or TT-ranks.

    This is the same structure used to represent quantum many-body states in physics.
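    Here is a minimal sketch of how one entry of a tensor train is evaluated, and how the dense tensor is recovered by contracting the bonds. The core layout `(r_{k-1}, n_k, r_k)`, the helper names and the toy sizes are my own conventions.

```python
import numpy as np

def tt_entry(cores, index):
    """Evaluate one tensor entry as a product of matrices.

    cores[k] has shape (r_{k-1}, n_k, r_k) with r_0 = r_d = 1.
    """
    M = np.ones((1, 1))
    for core, i in zip(cores, index):
        M = M @ core[:, i, :]          # select the matrix G^{[k]}_{i_k} and multiply
    return M[0, 0]

def tt_full(cores):
    """Contract all bond indices to recover the dense tensor."""
    full = cores[0]
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([-1], [0]))
    return full[0, ..., 0]             # drop the trivial boundary bonds

rng = np.random.default_rng(0)
dims, ranks = (3, 4, 5), (1, 2, 3, 1)
cores = [rng.normal(size=(ranks[k], dims[k], ranks[k + 1])) for k in range(len(dims))]

X = tt_full(cores)
print(X.shape)                                              # (3, 4, 5)
print(np.isclose(X[1, 2, 3], tt_entry(cores, (1, 2, 3))))   # True
print(sum(c.size for c in cores), "TT parameters vs", X.size, "dense entries")
```

    For a fixed bond dimension the parameter count grows linearly with the number of indices, whereas the dense tensor grows exponentially.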

    Tensor Product Attention: The Core Claim

    Now we arrive at the key contribution of the TPA paper. Instead of storing full query, key, and value matrices, TPA represents them using contextual low-rank factorizations.

    Standard Multi-head Attention

    For token $i$ with embedding $\mathbf{x}_i$, layer $\ell$ and head $h \in \{1, \dots, H\}$:

    $$\begin{aligned} \mathbf{q}_i^{\ell,h} &= \mathbf{W}_Q^{\ell,h} \mathbf{x}_i \in \mathbb{R}^{d_{\text{head}}} \\ \mathbf{k}_i^{\ell,h} &= \mathbf{W}_K^{\ell,h} \mathbf{x}_i \in \mathbb{R}^{d_{\text{head}}} \\ \mathbf{v}_i^{\ell,h} &= \mathbf{W}_V^{\ell,h} \mathbf{x}_i \in \mathbb{R}^{d_{\text{head}}} \end{aligned}$$

    We can stack all the heads into matrices. Note that these matrices are no longer just weights, but weights multiplied by embeddings (the layer index is suppressed from here on):

    $$\mathbf{Q}_i = \begin{bmatrix} \mathbf{q}_i^1 \\ \mathbf{q}_i^2 \\ \vdots \\ \mathbf{q}_i^H \end{bmatrix} \in \mathbb{R}^{H \times d_{\text{head}}}$$

    TPA

    TPA factorizes the stacked query/key/value matrices as rank-$R$ sums of outer products:

    $$\mathbf{Q}_i = \frac{1}{R_Q} \sum_{r=1}^{R_Q} \mathbf{a}^Q_r(\mathbf{x}_i) \otimes \mathbf{b}^Q_r(\mathbf{x}_i) \in \mathbb{R}^{H \times d_{\text{head}}}$$

    Note that the dimensions work out; for clarity:

    $$\begin{aligned} \mathbf{x}_i &\in \mathbb{R}^{d_{\text{model}}} \quad \text{(input)} \\ \mathbf{W}^{a,Q}_r \mathbf{x}_i = \mathbf{a}^Q_r(\mathbf{x}_i) &\in \mathbb{R}^{H} \quad \text{(head factor)} \\ \mathbf{W}^{b,Q}_r \mathbf{x}_i = \mathbf{b}^Q_r(\mathbf{x}_i) &\in \mathbb{R}^{d_{\text{head}}} \quad \text{(feature factor)} \\ \mathbf{a}^Q_r \otimes \mathbf{b}^Q_r &\in \mathbb{R}^{H \times d_{\text{head}}} \quad \text{(outer product)} \\ \frac{1}{R_Q}\sum_{r=1}^{R_Q} \mathbf{a}^Q_r \otimes \mathbf{b}^Q_r = \mathbf{Q}_i &\in \mathbb{R}^{H \times d_{\text{head}}} \quad \checkmark \end{aligned}$$

    So for standard MHA, each head independently projects the input,

    $$\mathbf{q}_i^h = \mathbf{W}_Q^h \mathbf{x}_i$$

    whereas for TPA, all heads share $R_Q$ feature vectors, weighted differently per head,

    $$\mathbf{q}_i^h = \frac{1}{R_Q} \sum_{r=1}^{R_Q} \underbrace{[\mathbf{a}^Q_r(\mathbf{x}_i)]_h}_{\text{head-specific weight}} \cdot \underbrace{\mathbf{b}^Q_r(\mathbf{x}_i)}_{\text{shared feature vector}}$$

    The Key Idea: Instead of $H$ independent $d_\text{head}$-dimensional vectors (one per head), TPA uses:

    • $R_Q$ shared feature vectors $\mathbf{b}^Q_r \in \mathbb{R}^{d_{\text{head}}}$
    • $R_Q$ weight vectors $\mathbf{a}^Q_r \in \mathbb{R}^H$, with one scalar per head determining how much each head uses each feature

    where $R_Q \ll H$, which leads to parameter efficiency. Obviously, we have similar things going on for $\mathbf{K}_i$ and $\mathbf{V}_i$.
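    A minimal NumPy sketch of the query factorization for a single token follows. The function names, the toy shapes and the random weights are my illustrative assumptions; this is not the paper's reference implementation.

```python
import numpy as np

def tpa_factors(x, W_a, W_b):
    """x: (d_model,); W_a: (R, H, d_model); W_b: (R, d_head, d_model)."""
    a = W_a @ x   # (R, H): head factors a_r(x)
    b = W_b @ x   # (R, d_head): feature factors b_r(x)
    return a, b

def tpa_stack(a, b):
    """Q = (1/R) * sum_r a_r (outer) b_r, with shape (H, d_head)."""
    return np.einsum("rh,rd->hd", a, b) / a.shape[0]

rng = np.random.default_rng(0)
d_model, H, d_head, R = 16, 4, 8, 2
x = rng.normal(size=d_model)
W_a = rng.normal(size=(R, H, d_model))
W_b = rng.normal(size=(R, d_head, d_model))

a, b = tpa_factors(x, W_a, W_b)
Q = tpa_stack(a, b)

# Row h of Q is a head-specific mixture of the R shared feature vectors
h = 1
q_h = sum(a[r, h] * b[r] for r in range(R)) / R
print(np.allclose(Q[h], q_h))        # True
print(np.linalg.matrix_rank(Q))      # at most R
```

    The rank check makes the constraint visible: however many heads there are, the stacked query matrix for one token never has rank above $R_Q$.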

    Parameter counts

    For MHA, the total number of parameters for queries only (similar for keys and values) is $H \times d_\text{head} \times d_\text{model} = d^2_\text{model}$.

    For TPA we have:

    • Head factors: $R_Q$ matrices of size $H \times d_\text{model}$
    • Feature factors: $R_Q$ matrices of size $d_\text{head} \times d_\text{model}$
    • Total parameters: $R_Q (H + d_\text{head}) d_\text{model}$

    Example with values typical of the paper, $H = 32$, $d_{\text{head}} = 128$, $d_{\text{model}} = 4096$, $R_Q = 6$:

    • MHA: $32 \times 128 \times 4096 = 16{,}777{,}216$ parameters
    • TPA: $6 \times 4096 \times (32 + 128) = 3{,}932{,}160$ parameters
    • TPA uses ~23% of MHA’s parameters
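    The counts above can be reproduced in a few lines (helper names are mine; only the query projection is counted, as in the text).

```python
def mha_query_params(n_heads, d_head, d_model):
    return n_heads * d_head * d_model

def tpa_query_params(rank, n_heads, d_head, d_model):
    return rank * (n_heads + d_head) * d_model

mha = mha_query_params(n_heads=32, d_head=128, d_model=4096)
tpa = tpa_query_params(rank=6, n_heads=32, d_head=128, d_model=4096)
print(mha)                  # 16777216
print(tpa)                  # 3932160
print(f"{tpa / mha:.1%}")   # 23.4%
```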

    Note: Unlike LoRA, which factorizes weights, TPA factorizes activations. This means the factorization is contextual: it depends on the input token $\mathbf{x}_i$. It’s a very interesting idea for capturing input-dependent structure while maintaining compression!

    Memory Reduction

    The major advantage claimed by the paper is the memory saving in KV cache. My interest in this paper is beyond this, to study other forms of attention, but it’s useful to note the memory arguments.

    For standard MHA we have:

    • Store $\mathbf{K}_i \in \mathbb{R}^{H \times d_{\text{head}}}$ and $\mathbf{V}_i \in \mathbb{R}^{H \times d_{\text{head}}}$
    • Total: $2 \times H \times d_{\text{head}} = 2 d_{\text{model}}$ values per token

    TPA stores only the factors:

    • Store $\{\mathbf{a}^K_r(\mathbf{x}_i)\}_{r=1}^{R_K}$ and $\{\mathbf{b}^K_r(\mathbf{x}_i)\}_{r=1}^{R_K}$ for keys
    • Store $\{\mathbf{a}^V_r(\mathbf{x}_i)\}_{r=1}^{R_V}$ and $\{\mathbf{b}^V_r(\mathbf{x}_i)\}_{r=1}^{R_V}$ for values
    • Total: $(R_K + R_V)(H + d_{\text{head}})$ values per token

    The compression ratio is

    $$\rho = \frac{(R_K + R_V)(H + d_{\text{head}})}{2H \, d_{\text{head}}}$$

    Concrete example: $H = 32$, $d_{\text{head}} = 128$, $R_K = R_V = 1$:

    • TPA cache $= 2 \times (32 + 128) = 320$ values per token
    • MHA cache $= 2 \times 32 \times 128 = 8192$ values per token

    so TPA leads to a $96\%$ memory reduction! For a context window of 100,000 tokens at 16-bit precision, MHA needs 1.6 GB of memory whereas TPA needs 64 MB (both per layer).
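    To see how the ratio $\rho$ behaves as the ranks grow, here is a small sketch (helper names are mine). It also computes the break-even point, i.e. the value of $R_K + R_V$ at which the factors would take as much room as the full keys and values.

```python
def mha_cache_values(n_heads, d_head):
    """Cached values per token for standard MHA (keys + values)."""
    return 2 * n_heads * d_head

def tpa_cache_values(rank_k, rank_v, n_heads, d_head):
    """Cached values per token when only the TPA factors are stored."""
    return (rank_k + rank_v) * (n_heads + d_head)

H, d_head = 32, 128
for rank in (1, 2, 4):
    rho = tpa_cache_values(rank, rank, H, d_head) / mha_cache_values(H, d_head)
    print(f"R_K = R_V = {rank}: rho = {rho:.3f}")

# rho = 1 when R_K + R_V = 2 * H * d_head / (H + d_head)
break_even = mha_cache_values(H, d_head) / (H + d_head)
print(break_even)   # 51.2
```

    So for these head sizes the cache saving survives for any total rank $R_K + R_V$ up to 51.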

    Connection to MPS

    Another way to look at TPA is to recast it as an MPS. Per head, instead of the term $\mathbf{x}_i^\top \mathbf{W}_{\text{QK}}^{\ell,h} \mathbf{x}_j$ in MHA, for TPA we have

    $$\begin{aligned} (\mathbf{q}_i^h)^\top \mathbf{k}_j^h &= \left(\frac{1}{R_Q} \sum_{r=1}^{R_Q} [\mathbf{a}^Q_r]_h \, \mathbf{b}^Q_r\right)^\top \left(\frac{1}{R_K} \sum_{s=1}^{R_K} [\mathbf{a}^K_s]_h \, \mathbf{b}^K_s\right) \\ &= \frac{1}{R_Q R_K} \sum_{r=1}^{R_Q} \sum_{s=1}^{R_K} \left([\mathbf{a}^Q_r]_h \, \mathbf{b}^Q_r\right)^\top \left([\mathbf{a}^K_s]_h \, \mathbf{b}^K_s\right) \\ &= \frac{1}{R_Q R_K} \sum_{r=1}^{R_Q} \sum_{s=1}^{R_K} \underbrace{[\mathbf{a}^Q_r(\mathbf{x}_i)]_h \, [\mathbf{a}^K_s(\mathbf{x}_j)]_h}_{\text{head-space mixing}} \; \underbrace{(\mathbf{b}^Q_r(\mathbf{x}_i))^\top \mathbf{b}^K_s(\mathbf{x}_j)}_{\text{feature-space contraction}} \end{aligned}$$

    Now we are getting somewhere, right? That’s a very different take on the attention matrix capturing token-token correlations!

    • Rank indices $(r,s)$ play the role of bond indices in MPS
    • $\sum_{r=1}^{R_Q} \sum_{s=1}^{R_K}$ is the bond contraction
    • Low ranks $R_Q, R_K$ correspond to a low bond dimension and higher efficiency, while a high bond dimension gives more expressiveness
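    The identity is easy to check numerically. In this sketch the factors for token $i$ and token $j$ are random stand-ins (my assumption) rather than projections of real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
H, d_head, R_Q, R_K = 4, 8, 3, 2

# Factors for the query token i and the key token j
aQ, bQ = rng.normal(size=(R_Q, H)), rng.normal(size=(R_Q, d_head))
aK, bK = rng.normal(size=(R_K, H)), rng.normal(size=(R_K, d_head))

# Direct route: build the stacked Q_i and K_j, then take per-head dot products
Q = np.einsum("rh,rd->hd", aQ, bQ) / R_Q
K = np.einsum("sh,sd->hd", aK, bK) / R_K
direct = np.einsum("hd,hd->h", Q, K)

# Factorized route: contract over the bond indices (r, s)
head_mix = np.einsum("rh,sh->hrs", aQ, aK)   # head-space mixing
feat = bQ @ bK.T                             # feature-space contraction, shape (R_Q, R_K)
factored = np.einsum("hrs,rs->h", head_mix, feat) / (R_Q * R_K)

print(np.allclose(direct, factored))   # True
```

    Note that `feat` does not depend on the head index: the feature-space contraction is computed once and reused by every head, with only the cheap head-space weights differing.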

    Copy Tensor

    We can look at the above expression in terms of copy tensors in Tensor Networks. A copy tensor [4] allows for reusing information. For a vector $\mathbf{a} \in \mathbb{R}^d$, the copy operation is represented by a diagonal tensor, $\mathcal{C}_{ij} = \delta_{ij}$, the Kronecker delta. In other words, a copy tensor allows a single input to be reused in multiple tensor contractions.

    Note what’s happening in TPA! The same input vector $\mathbf{x}_i$ is used $2 R_Q$ times for Query, and so on for Key and Value:

    $$\begin{aligned} \mathbf{x}_i &\xrightarrow{\mathbf{W}^{a,Q}_1} \mathbf{a}^Q_1(\mathbf{x}_i) \in \mathbb{R}^H \\ \mathbf{x}_i &\xrightarrow{\mathbf{W}^{b,Q}_1} \mathbf{b}^Q_1(\mathbf{x}_i) \in \mathbb{R}^{d_{\text{head}}} \\ &\;\;\vdots \\ \mathbf{x}_i &\xrightarrow{\mathbf{W}^{a,Q}_{R_Q}} \mathbf{a}^Q_{R_Q}(\mathbf{x}_i) \in \mathbb{R}^H \\ \mathbf{x}_i &\xrightarrow{\mathbf{W}^{b,Q}_{R_Q}} \mathbf{b}^Q_{R_Q}(\mathbf{x}_i) \in \mathbb{R}^{d_{\text{head}}} \end{aligned}$$

    Instead of computing $H$ independent projections (standard MHA), TPA computes $2 R_Q$ projections and cleverly recombines them. When $R_Q \ll H$, this architecture is much more efficient while maintaining the expressiveness of a tensor network (outer product).

    A few other things…

    • The paper shows that TPA is compatible with RoPE embedding. RoPE only acts on the $\mathbf{b}$ vectors. The keys are pre-rotated and stored, so no rotation is needed during decoding. Only the current query needs to be rotated. Neat!
    • Remarkably, standard attention mechanisms are non-contextual variants of TPA! They show that both GQA (Grouped Query Attention) and MQA (Multi-Query Attention) are simply a poor man’s version of TPA with $\mathbf{a}$ being independent of $\mathbf{x}_i$!
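    To illustrate the last point, here is a sketch of how I read the construction: for MQA the head factor is a fixed vector of ones, and for GQA the head factors are fixed group-membership masks. The specific masks, the rescaling that absorbs the $1/R_K$ prefactor, and all variable names are my own assumptions; check the paper for the exact statement.

```python
import numpy as np

rng = np.random.default_rng(0)
H, d_head, d_model, G = 8, 4, 16, 2
x = rng.normal(size=d_model)

# MQA: a single key vector shared by all H heads
W_K = rng.normal(size=(d_head, d_model))
k = W_K @ x
K_mqa = np.tile(k, (H, 1))                  # (H, d_head)
a = np.ones(H)                              # fixed head factor, independent of x
K_tpa = np.outer(a, k)                      # rank-1 TPA form with R_K = 1
print(np.allclose(K_mqa, K_tpa))            # True

# GQA: G groups of heads, one key vector per group
W_K_g = rng.normal(size=(G, d_head, d_model))
k_g = W_K_g @ x                                     # (G, d_head)
group_of_head = np.repeat(np.arange(G), H // G)     # [0, 0, 0, 0, 1, 1, 1, 1]
K_gqa = k_g[group_of_head]                          # (H, d_head)
a_g = (np.arange(G)[:, None] == group_of_head[None, :]).astype(float)  # (G, H) fixed masks
K_tpa_g = np.einsum("gh,gd->hd", a_g, G * k_g) / G  # rank-G TPA form; the factor G absorbs 1/R_K
print(np.allclose(K_gqa, K_tpa_g))                  # True
```

    In both cases only the feature factors depend on the token; TPA goes one step further by letting the head factors depend on it too.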

    I loved the paper. The key lessons:

    1. Structure matters: Exploiting low-rank structure in attention patterns enables massive compression
    2. Contextual factorization: Factorizing activations (not weights) is a very interesting concept
    3. Model performance and memory needs: As several other recent works also suggest, the belief that a larger context window requires either larger models or a compromise on the expressivity of the correlations captured in attention may be incorrect

    As we push toward longer contexts and larger models, principled compression techniques like TPA are a fruitful area of research. The tensor network perspective suggests we’ve only begun to explore the space of possible architectures!

    References

    1. Zhang, Yifan, et al. “Tensor product attention is all you need.” arXiv preprint arXiv:2501.06425 (2025). ↩︎
    2. Wu, Fa-Yueh. “The Potts Model.” Reviews of modern physics 54.1 (1982): 235. ↩︎
    3. Rende, Riccardo, et al. “Mapping of attention mechanisms to a generalized Potts Model.” Physical Review Research 6.2 (2024): 023057. ↩︎
    4. Glasser, Ivan, Nicola Pancotti, and J. Ignacio Cirac. “From probabilistic graphical models to generalized tensor networks for supervised learning.” IEEE Access 8 (2020): 68169-68182. ↩︎

  • Carrier-free mRNA delivery with Aptamers: Nucleic acid is all you need

    Carrier-free mRNA delivery with Aptamers: Nucleic acid is all you need

    Folks who, like me, have been dreaming happily over the past decade that nucleic acids are the right substrate for engineering medicines: here is one more piece of evidence that we might just be right in our obsession with this marvelous polymer of life!

    Here is how I evangelized my obsession amongst colleagues.

    Silicon Valley is silicon valley and not germanium valley — germanium just wasn’t the right substrate though the first transistor was made of germanium after all, see here for the first paper and here for a lovely history of the transistor.

    Aren’t you glad? — Germanium Valley just doesn’t quite have the right euphony, does it?

    Nucleic acids are the right substrate for genetic and gene-centric medicines and I don’t think either small molecules or proteins are. Those are the germanium of genetic medicines — they may work but the sooner you use silicon the sooner we will solve all human diseases. Yeah, I am opinionated!

    Circularized RNA + cell-type targeting aptamer

    A fascinating paper quietly appeared on bioRxiv [1] about a month or so back. It’s a collaboration amongst multiple groups in China, with Weihong Tan as the PI.

    They report first-in-human testing of a very curious idea that I had toyed with for a while as a high-risk, high-reward R&D project. They created aptamer-embedded circular RNAs (Apt-circRNAs). What’s wild is that they took the concept straight into a Phase 1 human trial, when it would otherwise still be a marvelous proof of concept tested ex vivo (in blood) or in vivo (in humanized rodents).

    They got human clinical data. Wow!

    The study combines two distinct and established ideas in nucleic acids—

    • Circularizing of synthetic mRNA to enhance stability (the payload)
    • Use of aptamers as a targeting molecule for cell-type specific delivery

    This is a totally crazy pace of testing out platform ideas. For those of you who do not work in the field: why is this significant?

    Current mRNA vaccines like those for COVID-19 rely on LNPs for delivery, which can sometimes be immunogenic and predominantly accumulate in the liver. The Apt-circRNA platform is clever: the RNA molecule itself contains targeting information (receptor-targeting aptamers) to achieve cell-type-specific delivery, eliminating the need for synthetic carriers like LNPs to gift wrap the RNA.

    The Three-Module Design

    The Apt-circRNA platform elegantly integrates three functional modules into a single RNA molecule—

    Targeting Module

    The authors embedded dendritic cell (DC)-specific aptamers at precise locations within the circular RNA scaffold. They tested aptamers against three targets: nucleolin (nuc), transferrin receptor (waz), and DEC-205, also called CD-205 (min2).

    • Recall that DEC-205 is a cell-surface receptor (and endocytic receptor) highly expressed in immature dendritic cells (DCs).
    • Transferrin receptor (TfR) is highly expressed in mature DCs and is crucial for iron uptake.
    • Nucleolin is also a cell-surface receptor in endothelial cells and DCs. It can internalize from the cell surface to the nucleus.

    They used the Waz aptamer sequence for TfR. Waz was created by Matt Levy, whom I had hired at Creyon Bio and who has led the aptamer team and created a diversity of cell-type-specific aptamers since. We know this aptamer well! [2][3][4]

    Waz aptamer sequence from Ref. 3

    The Waz aptamer has chemical modifications as far as I remember—2’F modified C/Us and probably 2’OMe for some positions. The study uses native RNA—so there are no chemical modifications on the aptamer sequence. It’s an important distinction to keep in mind.

    The DEC-205 aptamer min2 was also discovered by Matt’s group [5]. The sequence from Fig. 1 of Ref. 4 is:

    Min2 aptamer Ref 4

    The waz aptamer showed superior binding to both murine and human DCs. Through some optimization, the study determined that a bispecific combination of 5 waz and 4 min2 aptamers yielded optimal antigen presentation. Interesting! That’s a lot of aptamers decorating the RNA!

    Stable expression framework

    The circular RNA architecture provides inherent nuclease resistance by eliminating free 5’/3’ termini. The study demonstrates that the Apt-circRNA maintains structural integrity for over 24 hours, dramatically outperforming N1-methylpseudouridine-modified linear mRNA (m1Ψ-mRNA). This confirms older work showing that circularization of mRNA helps extend half-life [6][7]. The construct also remains stable across pH 4.0-8.0, which is critical for endosomal trafficking.

    Antigen-Encoding Region

    An Internal Ribosome Entry Site (IRES) enables cap-independent translation, while codon-optimized sequences encode tumor-specific antigens. The modular design permits flexible incorporation of diverse antigens, demonstrated with ovalbumin peptides ranging from 8 to 386 amino acids.

    The Clever Circularization Strategy

    Here’s where the molecular engineering gets quite ingenious! The team adapted permuted intron-exon (PIE) ribozyme systems from two sources: Anabaena pre-tRNA and T4 bacteriophage td intron. The key innovation: they engineered the aptamer’s stem-loop structure to serve as the circularization site without mutating the aptamer sequence itself!

    The process works by introducing a cleavage site within the aptamer’s loop region, then engineering the group I intron’s P1 and P10 guide sequences to complement sequences flanking the aptamer cleavage site. This enables precise, ribozyme-catalyzed splicing at the predefined loop site, generating Apt-circRNA products free of residual intron sequences.

    Cute!

    What’s the bio-distribution of Apt-circRNA?

    PET Imaging reveals precise lymph node targeting!

    They used radio-labeled Apt-circRNA and positron emission tomography (PET) to track spatial and temporal distribution.

    PET imaging showed predominant renal accumulation with no notable off-target accumulation in liver, spleen, heart, or other major organs. This specificity is striking and addresses a major concern with LNP-based systems, which accumulate significantly in liver and spleen. Is the renal accumulation perhaps owing to renal clearance of such a relatively low-molecular-weight payload?

    They also ran a Cy5-labelled study: at 6 hours post-injection, Cy5-labelled Apt-circRNA had preferentially accumulated in dendritic cells compared to B cells and macrophages. Apt-circRNA was efficiently internalized by DCs at the injection site!

    This contrasts sharply with LNP-circRNA, which primarily remained as intact nanoparticles near the injection site before uptake by both B cells and DCs within lymph nodes. This is consistent with the expectation that aptamer-mediated recognition enables direct DC internalization and lymph node trafficking.

    The study also looked at immuno-stimulatory responses. Worth a read.

    Surprisingly, Luminex assays revealed that Apt-circRNA elicited lower systemic levels of reactogenicity-associated chemokines and cytokines than LNP-circRNA, with the exception of IL-12. Apt-circRNA also demonstrated reduced cytotoxicity in BMDCs (murine bone-marrow-derived DCs) compared to LNP-circRNA. I would have expected the opposite: recall that native RNA can provoke an innate immune response through pathways that sniff out cytosolic RNA.

    First-in-Human Clinical Trial

    The authors initiated a Phase I clinical trial at Zhejiang Xiaoshan Hospital with remarkable speed, testing Apt-circRNA-KR2 in healthy volunteers. Here KR2 refers to the RNA payload expressing the two mutations (G12D and G12V) most common in the KRAS gene. They want to elicit a T-cell response in KRAS-mutant cancer.

    G12D and G12V are two of the most common single amino acid substitutions at codon 12 of the KRAS gene. G12D indicates a substitution of the normal amino acid glycine (G) with aspartic acid (D) at position 12, while G12V indicates a substitution with valine (V).
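    For readers outside the field, the notation reads as reference residue, 1-based position, new residue. A tiny sketch makes this concrete; the helper name is mine, and the wild-type KRAS N-terminal fragment below is written from memory as an illustrative assumption, so verify it against a sequence database before relying on it.

```python
import re

def apply_substitution(protein, mutation):
    """Apply a substitution such as 'G12D' (reference residue, 1-based position, new residue)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"cannot parse mutation {mutation!r}")
    ref, pos, alt = match.group(1), int(match.group(2)), match.group(3)
    if protein[pos - 1] != ref:
        raise ValueError(f"expected {ref} at position {pos}, found {protein[pos - 1]}")
    return protein[:pos - 1] + alt + protein[pos:]

# Assumed wild-type N-terminus of human KRAS (first 22 residues), for illustration only
KRAS_N_TERM = "MTEYKLVVVGAGGVGKSALTIQ"

print(apply_substitution(KRAS_N_TERM, "G12D"))   # MTEYKLVVVGADGVGKSALTIQ
print(apply_substitution(KRAS_N_TERM, "G12V"))   # MTEYKLVVVGAVGVGKSALTIQ
```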

    The trial enrolled 12 healthy volunteers total. Though a small cohort, it’s very encouraging!

    • Single-dose escalation cohort: 9 volunteers received 50, 100, or 250 μg doses (n=3/group)
    • Multi-dose cohort: 3 HLA-A*02:01 or HLA-A*11:01-positive volunteers received 250 μg on days 1, 7, and 13
    • Only 1 of 12 participants experienced any adverse event—transient flu-like symptoms resolving within 12 hours
    • Zero injection-site reactions (0/12)
    • Zero grade ≥2 adverse events (0/12)
    • All hematologic parameters, immune cell subsets, and cytokines remained within normal ranges through 180 days

    The safety profile is quite surprising and compares very favorably to LNP-mRNA vaccines, which commonly cause injection-site reactions, fever and systemic symptoms.

    High notes

    Unlike previous ‘naked mRNA’ approaches that lack stability and targeting, it’s a clever idea to encode the aptamer within the RNA sequence itself. However, I am not convinced this is necessary, and it prohibits chemical modifications that could stabilize the aptamer and increase its affinity and half-life. One could conjugate the aptamer to the circular RNA and modularize the system further.

    The dual aptamer strategy is definitely very interesting. The combination of waz (TfR-targeting) and min2 (DEC-205-targeting) creates a bispecific design that enhances both binding affinity and functional outcomes. Are these serving different purposes in directing the payload to endocytic compartments?

    Manufacturing could be quite scalable, as is! The in vitro transcription and ribozyme-mediated circularization can be performed at scale without the complex formulation processes required for LNPs. The >80% circularization efficiency is commercially viable. Also, perhaps this advantage is a counterpoint to the first point I made about chemically modified aptamers, for which the aptamer would have to be separately synthesized and somehow conjugated to the RNA at specific sites that do not disrupt expression. Moreover, with multiple aptamers decorating the RNA, it’s messy. Efficiency and purity of the product would be a challenge.

    Cellular uptake was still a bit low, and I suspect this is because the aptamers were not really optimized. They took existing aptamer sequences and slapped them on. Flow cytometry data showed that even in draining lymph nodes only a fraction of DCs take up Apt-circRNA. This may necessitate higher doses or more frequent administration compared to LNP-formulated mRNA, partially offsetting the manufacturing advantages. But I strongly believe this is a solvable engineering problem.

    Also, the naked RNA formulation and size lead to rapid renal clearance. One could incorporate sugar or lipid modifications into the construct, but then the modules need to be separate: the circularized RNA and the aptamer (chemically modified and, say, with added ligands like PEG attached, or albumin-binding aptamers). Totally doable!

    The bigger picture

    For decades, the field has focused on engineering increasingly sophisticated nanocarriers, optimizing lipid chemistry, surface modifications and so on. What is thought-provoking here is the question: why engineer a carrier when you can engineer the RNA itself?

    It’s a tantalizing possibility—Nucleic acid is all you need.

    References

    1. https://www.biorxiv.org/content/10.1101/2025.09.28.679023v1 ↩︎
    2. https://patents.google.com/patent/US9439973B2/en ↩︎
    3. https://www.cell.com/molecular-therapy-family/nucleic-acids/fulltext/S2162-2531(17)30047-1 ↩︎
    4. https://www.nature.com/articles/s41467-018-04691-x ↩︎
    5. https://www.cell.com/molecular-therapy-family/molecular-therapy/fulltext/S1525-0016(16)30728-6 ↩︎
    6. https://www.nature.com/articles/s41467-018-05096-6 ↩︎
    7. https://www.nature.com/articles/s41587-022-01393-0 ↩︎