top of page

Nelson Advisors Big Questions in HealthTech Series: Is the LLM an interface or the decision maker?

  • Writer: Nelson Advisors
    Nelson Advisors
  • 1 minute ago
  • 13 min read
Nelson Advisors Big Questions in HealthTech Series: Is the LLM an interface or the decision-maker?
Nelson Advisors Big Questions in HealthTech Series: Is the LLM an interface or the decision-maker?


The structural evolution of large language models has shattered the primitive paradigm of the conversational chatbot, prompting a fundamental re-evaluation of system design. Modern software engineering faces a critical dichotomy: should the language model serve as an intuitive cognitive interface, a translation layer interpreting human intent into structured machine directives, or should it function as a sovereign, system-level decision-maker, capable of autonomous planning and resource allocation?


Resolving this question requires dissecting the mathematical limits of autoregressive transformers, examining the structural parallels between generative systems and operating system kernels, and analysing emerging neuro-symbolic frameworks.


This structural debate coincides with the emergence of what industry architects term "Software 3.0". In traditional software paradigms, the application's logic is entirely deterministic and hard-coded by developers who map every execution pathway. In contrast, the AI-native software model reorganises this hierarchy by positioning the language model as a universal controller. In this architectural pattern, traditional application boundaries dissolve; individual applications are recast as a unified registry of plug-ins orchestrated by a central model.


Rather than requiring human operators to manually coordinate tasks across disconnected software silos, the model-as-controller receives unstructured natural language, reasons over intent and composes sequential actions across diverse plugin APIs. Yet, this flexibility introduces severe logical non-determinism, forcing system designers to rigorously map the operational boundaries where a model excels as a semantic interface and where it fails as an unconstrained decision-maker.


The LLM as Cognitive Interface: Semantic Translation, Intent Parsing and AutoFormalisation


To evaluate the role of the model as an interface, system engineers must analyse how it bridges the semantic gap between high-dimensional human intent and low-level, deterministic computing protocols. Autoregressive language models excel at processing language because human communication is built on low-dimensional, compressible patterns. Rather than acting as precise databases, these networks function as vast, non-veridical memories that probabilistically reconstruct outputs token by token, a process characterised as approximate retrieval. This probabilistic quality makes them exceptionally flexible as interfaces; they parse human commands, analyse user intent and translate unstructured requirements into domain-specific languages, SQL queries, or API parameters.


This translation paradigm is highly effective in infrastructure orchestration and data analytics. For instance, in cluster scheduling, Kubernetes schedulers have been successfully augmented with language-model-based intent analysers. This architecture leverages the model strictly as a translation layer that interprets unstructured natural language annotations representing soft-affinity preferences (e.g., placing workloads near specific data sources or on lightly loaded nodes) and parses them into low-level scheduling directives, achieving an empirical accuracy greater than 95%. Similarly, in product analytics, systems maintain a deterministic data ingestion pipeline while employing the language model exclusively at query-time to translate natural language questions into ClickHouse SQL. By isolating the model from the database write-path and enforcing strict schema validation on the generated queries, the database remains a stable source of truth while benefiting from an intuitive, conversational interface.


Beyond query translation, the model-as-interface plays a critical role in "AutoFormalisation". This refers to the automated translation of informal natural language descriptions into formalised symbolic representations, such as first-order logic, mathematical proofs, or Planning Domain Definition Language (PDDL) models, which can then be processed by deterministic symbolic solvers.


This hybrid design allows systems to achieve "epistemic humility". As demonstrated by policy and development research systems like AVA (built on World Bank reports), the integration of structured retrieval pipelines with language-model interfaces allows the system to enforce strict citation verifiability and "reasoned abstention", the capability of the interface to decline answering queries when the underlying grounded data is insufficient, preventing the hallucinations common in unconstrained generative models.


System-Level Implementations: The Large Language Model as an Operating System Kernel


While the translation layer paradigm frames the model as an interface, an alternative system-level framework likens the Large Language Model to an operating system kernel. Pioneered by researchers and formalised in Artificial Intelligent Operating System (AIOS) literature, this perspective treats the model as a core computational processor rather than a simple text-generation utility. Within this system architecture, traditional hardware abstractions find direct cognitive equivalents.


Traditional Operating System Component

AIOS / LLM Operating System Equivalent

System-Level Operational Role

Central Processing Unit (CPU) / Kernel

Large Language Model Core

Executes core cognitive operations, processes intent, and arbitrates system actions

Random-Access Memory (RAM)

Context Window

Volatile working memory; handles immediate context selection and active data processing

Hard Disk Storage / File System

External Storage / Retrieval-Augmented Vector Stores

Persistent long-term storage of documents, logs, and historical context

Peripheral Devices

Hardware Tools / Actuators

Connects the system to the physical world (e.g., robotic arms, sensors, on-vehicle cameras)

Programming Libraries / APIs

Software Tools / SDK Plug-ins

Extends core capabilities to execute arithmetic, code generation, or database writes

User Commands / Shell Executables

Natural Language Prompts

Initiates system actions and configures operating environments

FIFO Schedulers / Thread Queues

Reasoning Loops / System Calls

Coordinates concurrent agent executions, optimises prompt pipelines, and prevents CUDA crashes


In this AIOS framework, resource management and scheduling diverge fundamentally from traditional computing. Instead of managing raw hardware clock cycles and memory addresses, the AIOS kernel manages context windows and tool tokens, orchestrating execution through a continuous reasoning loop rather than a FIFO queue.


Standard agent orchestrators (such as early implementations of Autogen or Langchain) run directly on host-level environments and execute model API calls via brute-force trial-and-error. Under concurrent execution workloads, this approach causes GPU memory saturation, triggering CUDA exceptions that force expensive tensor deallocations and multiple retry cycles.


The AIOS kernel layer resolves this resource bottleneck by isolating agent applications from direct system-level resources. The kernel decomposes incoming agent requests into standardised system calls (syscalls) which are systematically scheduled across separate storage, memory, and tool managers.


By orchestrating syscall execution across a unified interface, the scheduler prevents concurrent agent requests from flooding the model, resulting in up to a x2.1times increase in execution speed for serving concurrent agents. This architecture enables a new AIOS-Agent ecosystem, where specialised Agent Applications (AAPs), such as trip planners, financial advisors, or medical consulting agents, are deployed as OS-native applications that leverage the intelligent scheduling and tool-execution capabilities of the underlying kernel.


The Physical Realisation: Dedicated Coprocessor Hardware for Local Inference


The conceptualisation of the model as an operating system kernel is driving a corresponding shift in physical computer architecture: the emergence of dedicated neural processing units (NPUs) acting as specialised language model coprocessors. Hardware designers are developing application-specific integrated circuits (ASICs) optimised to run specific open-weight models locally.


A prime example is Rockchip's RK182X coprocessor, which incorporates ultra-high-bandwidth memory to run models like Qwen 2.5 7B at $50 { tokens per second (TPS)}for token decoding and up to $800 { TPS} for prompt processing. To understand the efficiency of these hardware designs, one must analyse the mathematical divergence between the two primary phases of inference: Prompt Processing (PP) and Token Generation (TG).


Operational Phase

Computational Bottleneck

Hardware Resource Dependency

Local Hardware Optimization Mechanics

Prompt Processing (PP)

Compute-Limited

Parallel Tensor Core Execution (FLOPs)

Loads the model weights into cache once; processes all context tokens in parallel across highly parallelised, low-precision tensor cores

Token Generation (TG)

Memory-Bandwidth Limited

Memory Access Speed and Bus Bandwidth (GB/s)

Must sequentially load every single layer of network weights from RAM to generate a single token, bottlenecked by system memory bus speeds


Traditional NPUs typically utilise the host system's slow main memory, rendering them ineffective during the memory-bandwidth-bound token generation phase. Coprocessors like the RK182X bypass this constraint by pairing the NPU with dedicated, high-speed on-chip memory of ultra high-bandwidth LPDDR), allowing Q4-quantised models to load weights almost instantaneously near the arithmetic logic units (ALUs).


By offloading matrix multiplications to low-power ASICs that consume up to 90% less power than standard GPU setups, devices can maintain large context windows and execute background reasoning routines locally on edge devices without thermal throttling or remote server dependency.


Mathematical and Empirical Limits of Autonomous Decision-Making


Despite the appeal of the "model as operating system" paradigm, treating autoregressive language models as sovereign, autonomous decision-makers introduces severe mathematical and logical failures. By definition, autoregressive transformers predict the next token sequentially, running in constant computational time per step.


However, mathematical reasoning, planning, and logical verification are often NP-hard or semi-decidable problems that require variable-time combinatorial search. A system constrained to constant-time state transitions cannot perform principled logical deduction; it can only simulate reasoning by retrieving and recombining patterns from its training corpora.


Because of this limitation, the performance boundaries of these systems are highly irregular. While a model may answer complex, Olympiad-level questions that resemble patterns in its pre-training data, it can simultaneously fail at basic arithmetic operations.


Rigorous evaluations on classical planning benchmarks, such as the blocks-world problems in the International Planning Competition (IPC), highlight this deficit. When evaluated in an autonomous planning mode on platforms like PlanBench, even advanced models like GPT-4 only generate valid, executable plans approximately 12% of the time.


Evaluation Benchmark

Primary Reasoning Modality Tested

Model Performance Characteristics

Underling Computational Failure

PlanBench (IPC Blocks-World)

Combinatorial Search and Subgoal Sequencing

GPT-4 achieves ~12% autonomous plan correctness; performance drops near zero under term obfuscation

Relies on approximate retrieval of memorised plan structures rather than active logical sequencing

ACPBench Hard

Generative Action, Change, and Multi-Step Planning

Frontier models, including early deliberate reasoning models, score below 65% across most tasks

Struggle to resolve causal action-precondition dependencies in a generative format

AIME 2024

Advanced Mathematical and Deductive Reasoning

Standard GPT-4o scores ~12% pass@1; deliberate reasoning models (o1) achieve up to 74% pass@1

Standard models fail on complex multi-step chains; reasoning models succeed by scaling inference-time search tokens

OPT-BENCH

Continuous Optimization vs. Discrete Combinatorial Reasoning

Strong on continuous inductive tuning (ML hyperparameter optimization); poor on NP-hard discrete search

Lacks internal execution verification models to navigate brittle, discrete search spaces


The reliance of autoregressive models on memorised structures is further demonstrated by domain obfuscation tests. When standard planning terms (e.g., "stack," "unstack," "block") are replaced with random, non-semantic strings, a change that does not affect deterministic symbolic solvers, the model's planning accuracy collapses entirely.


Furthermore, the common architectural assumption that models can self-correct through iterative evaluation loops is flawed. Because autoregressive models cannot reliably verify their own solutions, iterative self-critique often degrades plan quality. Lacking an internal logical model, the system frequently abandons correct intermediate solutions and replaces them with incorrect alternatives, leading to cascading errors.


Yann LeCun's Alternative: World Models and Non-Generative Architectures


This computational deficit forms the basis of Yann LeCun's critique of the generative AI paradigm. LeCun argues that language is a low-dimensional, highly compressed representation of human intelligence, whereas the real world is continuous, noisy, high-dimensional, and sensory-rich.


Autoregressive models operate purely in this discrete textual space, making next-word predictions without developing an internal, causal model of physical reality. While a model can statistically associate terms like "glass" and "shatter," it lacks the fundamental physical common sense possessed by a house cat, which can predict the gravitational and mechanical consequences of its actions in the physical world.


To achieve true System-2 planning and reasoning, LeCun advocates for objective-driven AI built on Joint Embedding Predictive Architectures (JEPA) rather than generative models. Instead of predicting raw pixels or words, JEPA-style architectures learn to predict abstract, high-level representations of the world, filtering out irrelevant noise (such as the movement of leaves on a tree) to focus on predictable, causally relevant information.


These models serve as predictive world models, allowing an AI agent to simulate the outcomes of its actions internally and optimise plans before executing them in the physical environment.


This paradigm enables hierarchical planning, which is the capability to plan actions at varying levels of abstraction. When planning a trip from New York to Paris, a human does not plan the precise sequence of muscle movements required to walk; instead, the trip is decomposed into higher-level logical chunks (e.g., drive to the airport, board the plane, land in Paris).


Modern Vision-Language World Models (VLWM) like Virgo attempt to realize this by learning an action policy (representing reactive System-1 behavior) alongside a predictive dynamics model (representing reflective System-2 behavior). By compressing sensory video data into structured abstractions (e.g., a "Tree of Captions") and utilising self-supervised critics to evaluate hypothetical future states, these models allow agents to perform internal trial-and-error to find cost-minimising action plans, bringing a level of physical grounding and logical consistency that pure language models cannot achieve.


Hybrid Architectures: Neuro-Symbolic Synthesis and Deliberate Inference


To balance the flexibility of language models with the logical precision required for enterprise execution, system architects categorise application capabilities along a spectrum of six distinct levels of autonomy.

This classification maps the boundary where human-defined, deterministic constraints end and non-deterministic model actions begin.


Autonomy Level

Core Technical Abstraction

Control Flow Architecture

Primary Security & Operational Risk

Level 1: Code

Traditional Software

Explicit, hard-coded, deterministic logic written by developers

Low; standard software testing and compiler constraints apply

Level 2: LLM Call

Isolated Call

Model outputs a single prediction for a predefined, isolated step

Minimal; easily validated by standard text extraction rules

Level 3: Chain

Fixed Pipeline

Output of step $n$ feeds directly as input to step $n+1$ in a static sequence

Low; data-flow boundaries are predictable and sandboxed

Level 4: Router

Acyclic Selection

Model evaluates context and selects from a predefined set of acyclic pathways

Moderate; requires strict validation of the selected execution path

Level 5: State Machine

Cyclic Workflow

Model dynamically decides the next execution step in a workflow that includes loops

High; workflows can loop indefinitely, requiring runtime execution budgets

Level 6: Autonomous

Unconstrained Agent

Model independently determines goals, selects tools, and executes actions

Severe; requires persistent sandboxing and emergency kill-switches


To safely deploy systems at Levels 5 and 6, modern software engineering rejects simple pipelines in favour of neuro-symbolic designs that couple the semantic capabilities of models with external, deterministic verification engines. Three paradigms illustrate this integration:


The LLM-Modulo Framework


This framework establishes a tight, bi-directional loop where the model functions as an approximate proposal generator or domain translator. The actual validation of plans is managed by a bank of external critics. Hard Critics, such as classical PDDL planners or verification tools like VAL, evaluate plans for causal correctness, physical feasibility, and resource constraints. Soft Critics, often driven by separate, specialized vision-language models, evaluate abstract qualities such as style, conformance, and user preferences.


If a hard critic identifies a logical error, it generates precise symbolic feedback. The model consumes this feedback and generates a refined candidate plan. Crucially, expert humans are excluded from this inner planning loop, interacting only in the outer loop to define domain specifications and preference models, preventing cognitive fatigue and the "Clever Hans" effect.

Agentic Fast-Slow Planning (AFSP)


Inspired by dual-process cognition, this framework decouples perception, reasoning, and control across distinct timescales to ensure physical safety in real-time systems.


The system is split into two bridges:


  • Perception2Decision: A local, edge-based vision-language model topology detector compresses raw physical inputs into compact egocentric topology graphs, which are then transmitted to a cloud-based language model to generate high-level symbolic driving directives. This reduces bandwidth and latency while maintaining operational interpretability.


  • Decision2Trajectory: High-level symbolic directives are converted into physically feasible paths using a classical search algorithm that embeds soft costs derived from the model's directives into geometric trajectory optimization. An online Agentic Refinement Module monitors execution and dynamically tunes hyper-parameters using feedback and memory, ensuring the system adapts to environmental changes without bypassing classical safety bounds.


Deliberate Inference Reasoning Models


Models such as OpenAI o1, DeepSeek-R1, and QwQ attempt to internalise System-2 deliberate thought during inference through test-time compute scaling. Instead of predicting the next token instantly, these models are trained via reinforcement learning algorithms, such as Group Relative Policy Optimisation (GRPO), to generate extensive internal scratchpads before committing to a final, visible answer.

GRPO reduces training overhead by dropping the traditional value-function model, estimating baseline rewards across a group of sampled answers to penalise logical inconsistencies and reward correct, verifiable solutions.


Through this reinforcement loop, reasoning models naturally learn to explore alternative pathways, check intermediate steps, and backtrack when an error is detected. While this approach significantly improves performance on complex mathematical and coding benchmarks, it introduces an "overthinking" efficiency bottleneck, where models consume excessive tokens and compute resources solving simple tasks that could be resolved instantly with minimal token generation.


Security Implications of the Autonomy Shift: Vulnerability Analysis and Threat Mitigation


As systems transition from using models as cognitive interfaces to deploying them as active decision-makers with tool-execution privileges, the primary threat vector shifts from content moderation to critical system-level vulnerabilities.


In traditional computing, code injection exploits deterministic parsing bugs to execute compiled binaries. In agentic computing, prompt injection exploits the model's inability to structurally separate trusted instruction sets from untrusted data inputs.

In agentic systems, prompts serve as non-deterministic programs written in natural language. Direct prompt injections occur when a user actively inputs commands to override safety bounds. Far more dangerous, however, are indirect prompt injections, where malicious instructions are hidden inside external resources parsed by the agent at runtime, such as emails, PDF documents, or web pages.


When an agent is granted tool access, a successful indirect prompt injection can hijack its goals, manipulating it into performing unauthorised API calls, executing production database writes, or exfiltrating sensitive company files.


This vulnerability is highly evident in modern, high-privilege agentic coding environments and editors like Cursor. Designed with system-level access to execute terminal commands, edit local files, and interact with external systems, these agents can be targeted by poisoning external development resources, such as configuration files, repository documentation, or Model Context Protocol (MCP) server definitions.

When the agentic editor processes these poisoned resources during standard development routines, the embedded instructions hijack its execution flow. This converts the agent into an attacker's terminal, enabling remote command execution, code injection into production repositories, and local machine compromise without any human clicking a link or running an exploit binary.


To defend agentic environments against these vulnerabilities, security teams must deploy multi-layered defense frameworks like MAESTRO. First, developers must enforce the principle of least-privilege, ensuring that agent credentials are sandboxed and restricted to read-only access where possible.


Second, systems must implement cryptographic prompt signing workflows, where trusted instructions are signed with a corporate cryptographic key. When instructions pass through multiple agents, the execution kernel verifies the signatures; if untrusted user data attempts to append system-level directives, the signature validation fails, and the execution is terminated.


Finally, designers must place human-in-the-loop approval gates before any high-stakes, irreversible tool executions are processed, establishing an administrative barrier against runaway or hallucinatory actions.


Conclusions and Strategic Recommendations


The technical evidence establishes that the Large Language Model is fundamentally suited to serve as a cognitive interface and approximate proposal engine, rather than a sovereign, unconstrained decision-maker. While operating-system-level integrations and test-time compute scaling have expanded the capability of these models to navigate complex environments, their underlying probabilistic nature prevents them from providing the absolute logical guarantees required for high-risk system automation. When deployed as interfaces, language models act as powerful cognitive orthotics, translating human intent into actionable configurations and helping to bridge complex semantic gaps.


However, the sovereign decision-making authority must reside in model-based, symbolic, or human-controlled verification layers. Modern hybrid frameworks like the LLM-Modulo and Agentic Fast-Slow architectures demonstrate that the future of resilient system design does not rely on scaling models indefinitely. Instead, it lies in the structured integration of neural and symbolic components, leveraging the semantic versatility of language models to map intent, while relying on formal verification systems to execute and validate actions. By enforcing these architectural boundaries, engineers can safely build intelligent, highly adaptive systems that preserve security, predictability, and logical soundness in production environments.


Nelson Advisors > European MedTech and HealthTech Investment Banking

 

Nelson Advisors specialise in Mergers and Acquisitions, Partnerships and Investments for Digital Health, HealthTech, Health IT, Consumer HealthTech, Healthcare Cybersecurity, Healthcare AI companies. www.nelsonadvisors.co.uk


Nelson Advisors regularly publish Thought Leadership articles covering market insights, trends, analysis & predictions @ https://www.healthcare.digital 

 

Nelson Advisors publish Europe’s leading HealthTech and MedTech M&A Newsletter every week, subscribe today! https://lnkd.in/e5hTp_xb 

 

Nelson Advisors pride ourselves on our DNA as ‘Founders advising Founders.’ We partner with entrepreneurs, boards and investors to maximise shareholder value and investment returns. www.nelsonadvisors.co.uk



Nelson Advisors LLP

 

Hale House, 76-78 Portland Place, Marylebone, London, W1B 1NT




Meet Nelson Advisors @ 2026 Events

 

Digital Health Rewired > March 2026 > Birmingham, UK 

 

NHS ConfedExpo  > June 2026 > Manchester, UK 

 

HLTH Europe > June 2026, Amsterdam, Netherlands

 

HIMSS AI in Healthcare > July 2026, New York, USA

 

Bits & Pretzels > September 2026, Munich, Germany  

 

World Health Summit 2026 > October 2026, Berlin, Germany

 

HealthInvestor Healthcare Summit > October 2026, London, UK 


HLTH USA 2026 > October 2026, USA

 

Barclays Health Elevate > October 2026, London, UK 

 

Web Summit 2026 > November 2026, Lisbon, Portugal  

 

MEDICA 2026 > November 2026, Düsseldorf, Germany

 

Venture Capital World Summit > December 2026 Toronto, Canada


Nelson Advisors specialise in Mergers and Acquisitions, Partnerships and Investments for Digital Health, HealthTech, Health IT, Consumer HealthTech, Healthcare Cybersecurity, Healthcare AI companies. www.nelsonadvisors.co.uk
Nelson Advisors specialise in Mergers and Acquisitions, Partnerships and Investments for Digital Health, HealthTech, Health IT, Consumer HealthTech, Healthcare Cybersecurity, Healthcare AI companies. www.nelsonadvisors.co.uk

Comments


Commenting on this post isn't available anymore. Contact the site owner for more info.
bottom of page