Reinforcement Learning for Dynamic Wealth Optimization and Asset-Liability Management

Using continuous-action TD3 reinforcement learning agents to evaluate optimal lifetime investment and consumption paths under stochastic lifecycle constraints.

Business Case

Merton’s portfolio framework faces steep scalability barriers when real-world life events, such as children, health shocks, promotions, or inheritances, are introduced. Evaluating these path-dependent events traditionally requires backward induction from death to youth, with dense Monte Carlo simulations at each age step. This triggers the 'Curse of Dimensionality' and motivates alternative methods capable of managing long-duration (30-to-60-year) pension liabilities.

Outcome

To address these limitations, we explored a novel approach leveraging a Twin Delayed DDPG (TD3) reinforcement learning framework. Instead of relying on rigid state grids or backward induction loops, our model treats lifecycle asset allocation as a continuous Markov Decision Process (MDP). Tested across distinct demographic cohorts, our Proof of Concept (PoC) demonstrated stable policy convergence, showing that the agent can autonomously optimize dynamic investment and consumption decisions while matching structural liabilities.

Detailed Report

Introduction to Lifetime Portfolio Optimization

The allocation of capital over a human lifecycle represents a foundational problem in quantitative economics, originally formalized by Robert Merton in 1969. In its purest form, the Merton portfolio problem seeks to find the mathematical policy where an individual or institutional asset manager continuously divides wealth between risky and risk-free assets to maximize expected lifetime utility.

However, a steep divergence occurs between textbook economic abstractions and empirical reality. Traditional models assume smooth, continuous transitions. Real life is inherently discontinuous. For an institutional pension fund or an individual investor, wealth trajectories are continually disrupted by non-linear shocks: the financial impact of children, abrupt health crises, career promotions, or the injection of capital via inheritance.

When these path-dependent life events and long-duration liabilities are introduced, classical closed-form solutions are no longer available. Solving such systems with traditional numerical methods is computationally expensive: it requires nested Monte Carlo simulations at every discrete age node, with values propagated backward from death to youth. Our proof of concept explores how deep reinforcement learning offers a flexible, scalable alternative.

The Classical Merton Problem Framework

To understand the boundaries of traditional solutions, we begin with the continuous-time stochastic control framework. Let $X_{t}$ represent the total wealth of the agent at time $t$. The wealth dynamics are driven by a stochastic differential equation (SDE):

$$d X_{t} = \left[ r_{t} X_{t} + π_{t} (μ_{t} - r_{t}) - c_{t} \right] d t + π_{t} σ_{t} \, d B_{t}$$

Where:

$r_{t}$ : The risk-free interest rate.
$μ_{t}$ : The expected return of the risky asset.
$σ_{t}$ : The volatility of the risky asset.
$π_{t}$ : The absolute capital allocated to the risky asset at time $t$ .
$c_{t}$ : The consumption rate.
$B_{t}$ : A standard Brownian motion capturing market uncertainty, with increment $d B_{t}$.
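The wealth dynamics above can be simulated with a simple Euler-Maruyama scheme. The sketch below is illustrative only: it assumes constant $r$, $μ$, $σ$, $π$, and $c$, uses monthly time steps, and the function name `simulate_wealth` is hypothetical rather than part of the pipeline described in this report.

```python
import math
import random


def simulate_wealth(x0, r, mu, sigma, pi, c, years, steps_per_year=12, seed=0):
    """Euler-Maruyama path of dX = [r X + pi (mu - r) - c] dt + pi sigma dB.

    pi is the absolute amount held in the risky asset and c is the
    consumption rate. Both are held constant here purely for illustration.
    """
    rng = random.Random(seed)
    dt = 1.0 / steps_per_year
    x = x0
    path = [x]
    for _ in range(years * steps_per_year):
        # Brownian increment over one step: normal with variance dt.
        d_b = rng.gauss(0.0, math.sqrt(dt))
        drift = r * x + pi * (mu - r) - c
        x += drift * dt + pi * sigma * d_b
        path.append(x)
    return path


if __name__ == "__main__":
    path = simulate_wealth(100.0, r=0.02, mu=0.06, sigma=0.2, pi=50.0, c=2.0, years=5)
    print(f"terminal wealth on this path: {path[-1]:.2f}")
```

With `pi=0` and `c=0` the path reduces to deterministic compounding at the risk-free rate, which is a convenient sanity check for the discretization.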

The objective is to maximize the expected discounted utility of consumption over a finite horizon $T$ (representing lifespan), plus the terminal utility of wealth (the bequest or inheritance function):

$$\max_{\{π_{t}, c_{t}\}_{t = 0}^{T}} \mathbb{E} \left[ \int_{0}^{T} e^{- ρ t} U (c_{t}) \, d t + e^{- ρ T} B (X_{T}) \right]$$

Where $ρ$ represents the subjective discount rate (the rate of time preference).

Utility & Risk Aversion Formulations

The behavioral characteristics of the investor are dictated by their utility function, $U (c)$ . In our exploratory pipeline, we structured the environment to support the two most prominent risk profiles in mathematical finance:

Constant Relative Risk Aversion (CRRA): Assumes that the investor’s absolute risk tolerance grows proportionally with wealth, so that risk aversion relative to wealth stays constant.

$$U (c) = \frac{c^{1 - γ}}{1 - γ}$$

Where $γ > 0$ ($γ \neq 1$) is the coefficient of relative risk aversion.

Constant Absolute Risk Aversion (CARA): Assumes that absolute risk tolerance is independent of the level of wealth.

$$U (c) = - \frac{1}{α} e^{- α c}$$

Where $α > 0$ represents the coefficient of absolute risk aversion.
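As a minimal sketch, the two utility functions can be written directly from their formulas. The function names and the input checks are assumptions made for illustration, not the report's actual code.

```python
import math


def crra_utility(c, gamma):
    """CRRA utility U(c) = c^(1 - gamma) / (1 - gamma), for gamma > 0, gamma != 1."""
    if c <= 0:
        raise ValueError("consumption must be positive for CRRA utility")
    if gamma <= 0 or gamma == 1.0:
        raise ValueError("gamma must be positive and different from 1")
    return c ** (1.0 - gamma) / (1.0 - gamma)


def cara_utility(c, alpha):
    """CARA utility U(c) = -(1 / alpha) * exp(-alpha * c), for alpha > 0."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return -math.exp(-alpha * c) / alpha


if __name__ == "__main__":
    print(crra_utility(4.0, 0.5))
    print(cara_utility(0.0, 2.0))
```

Both functions are increasing and concave in `c`, which is what makes them usable as a per-period reward signal later on.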

The Computational Bottleneck: The Curse of Dimensionality

Under a simple asset structure with zero background income, the value function can be derived analytically by solving the Hamilton-Jacobi-Bellman (HJB) partial differential equation. However, the moment we introduce real-world frictions, analytical tractability breaks completely. The Traditional Dynamic Programming Flow looks like this:

Death (Age 85) ---> Age 84 ---> Age 83 ... ---> Youth (Age 25)
   |                |            |
   +-- (Monte Carlo Simulation at each discrete state grid node)

Path Dependency: Parameters like a promotion or a chronic illness shift the agent’s baseline income drift permanently. This requires adding extra state dimensions to track history.
Grid Explosion: If we discretize the problem onto a grid to run backward induction, each added state variable multiplies the number of nodes, so the grid grows exponentially in the number of variables. If wealth requires $100$ grid points, adding 4 binary life parameters (e.g., child status, health state, promotion level, inheritance tracking) scales the node evaluations per age step to $100 \times 2^{4} = 1{,}600$ nodes.
Nested Monte Carlo Loops: Because expectations cannot be computed analytically across non-linear shocks, every single node in that grid requires a nested Monte Carlo simulation to evaluate the transition probabilities to the next age step.
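The arithmetic behind the grid explosion can be made explicit with a small helper. This is a sketch under the simplifying assumption that every life parameter is binary and that each node needs the same number of nested Monte Carlo paths. The function names are hypothetical, and the 45-year horizon and 10,000 paths are the example settings used elsewhere in this report.

```python
def nodes_per_age_step(wealth_points, binary_flags):
    """Number of grid nodes at one age: wealth grid times 2^flags."""
    return wealth_points * 2 ** binary_flags


def total_path_evaluations(years, wealth_points, binary_flags, mc_paths):
    """Total simulated paths for backward induction over the whole horizon."""
    return years * nodes_per_age_step(wealth_points, binary_flags) * mc_paths


if __name__ == "__main__":
    for flags in range(5):
        total = total_path_evaluations(45, 100, flags, 10_000)
        print(f"{flags} binary flags: {nodes_per_age_step(100, flags):>5} nodes/step, "
              f"{total:,} path evaluations")
```

Each additional binary flag doubles the total, so four flags multiply the baseline cost by 16.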

This computational barrier is what prompted our research into model-free deep reinforcement learning, specifically TD3, which avoids grid-based discretization by learning from simulated lifecycle trajectories over continuous state and action spaces.

[Interactive calculator: cost of backward induction on a state grid. Settings shown: time horizon 45 years, 100 wealth grid points, M = 10,000 nested Monte Carlo paths. Optional state frictions, each doubling the node count: children, health shock, career promotion, inheritance. Output shown for a 1D state vector: 100 nodes per age step and 45,000,000 total path evaluations. Status label: high latency / exponential grid regime. Note: nested backward propagation requires solving M matrix updates sequentially at each age step.]

Deep Reinforcement Learning Framework (TD3)

To navigate the high-dimensional, path-dependent nature of real-world lifecycles, our framework reframes asset allocation and consumption modeling as a model-free, continuous Markov Decision Process (MDP). Instead of discretizing states onto a rigid grid, a Deep Reinforcement Learning (DRL) agent interacts with a continuous simulation environment, observing transitions and gathering experiences to optimize its strategy natively.

+-------------------------------------------------------------+
|                         ENVIRONMENT                         |
|   Market Dynamics (SDE)  +  Socio-Economic Life Events      |
+-------------------------------------------------------------+
          ^                                         |
          | Portfolio Allocations                   | State Vector
          | & Consumption Rates (A_t)               | (S_t)
          |                                         v
+-------------------------------------------------------------+
|                         TD3 AGENT                           |
|      Actor Network   =======>   Clipped Twin Critics        |
+-------------------------------------------------------------+

Formulating the Markov Decision Process

The environment is governed by a time-step horizon representing an individual’s financial year ( $t = 1, 2, \dots, T$ ). At each step, the interaction is parameterized by a tuple $(S_{t}, A_{t}, R_{t}, S_{t + 1})$ :

The State Space ( $S_{t} \in \mathbb{R}^{d}$ ): A mixed continuous and categorical vector capturing the current financial and demographic status of the cohort:

$$S_{t} = [X_{t}, t, I_{t}, E, \text{Flags}_{t}]^{\top}$$

$X_{t}$ : Current wealth accumulation.
$t$ : Current age of the individual.
$I_{t}$ : Dynamic labor income or pension cash flow.
$E$ : Categorical educational attainment baseline (e.g., high school vs. university).
$Flags_{t}$ : Binary status indicators tracking active life events (e.g., presence of children, active health shocks, promotion status, inherited capital tracking).
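One possible way to assemble such a state vector is sketched below. The field names, the integer coding of education, and the choice of four flags are assumptions taken from the examples in the list above. A production environment would likely normalize wealth, age, and income and might one-hot encode the categorical field.

```python
from dataclasses import dataclass


@dataclass
class LifecycleState:
    """Illustrative container for S_t = [X_t, t, I_t, E, Flags_t]."""
    wealth: float        # X_t
    age: int             # t
    income: float        # I_t
    education: int       # E, hypothetical coding: 0 = high school, 1 = university
    has_children: bool   # binary life-event flags
    health_shock: bool
    promoted: bool
    inherited: bool

    def to_vector(self):
        """Flatten the state into a list of floats for a neural network input."""
        return [
            float(self.wealth),
            float(self.age),
            float(self.income),
            float(self.education),
            float(self.has_children),
            float(self.health_shock),
            float(self.promoted),
            float(self.inherited),
        ]


if __name__ == "__main__":
    state = LifecycleState(50_000.0, 30, 40_000.0, 1, True, False, False, False)
    print(state.to_vector())
```

Under these assumptions the state dimension is $d = 8$ and stays fixed as more life events become active, in contrast to the grid approach where every flag doubles the node count.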

The Action Space ( $A_{t} \in \mathbb{R}^{2}$ ): A continuous control vector containing the agent’s decisions for that period:

$$A_{t} = [π_{t}, c_{t}]^{\top}$$

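$π_{t} \in [0, 1.5]$ : The portfolio allocation weight in the risky asset, expressed as a fraction of current wealth $X_{t}$ (allowing for up to $50\%$ leverage). This differs from the continuous-time SDE above, where $π_{t}$ denotes an absolute amount.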
$c_{t} \in (0, X_{t}]$ : The continuous consumption rate for the active period.
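Actor networks commonly emit values in $[-1, 1]$ (for example through a tanh output layer), which then have to be mapped onto the feasible action ranges. The sketch below shows one such mapping. The tanh convention, the function name, and the small lower bound on the consumption fraction are assumptions, not details confirmed by this report.

```python
def scale_action(raw_pi, raw_c, wealth, max_weight=1.5, min_consumption_frac=1e-6):
    """Map raw actor outputs in [-1, 1] to a feasible (pi_t, c_t).

    pi_t lands in [0, max_weight] and c_t in (0, wealth].
    """
    if wealth <= 0:
        raise ValueError("wealth must be positive to define a consumption range")
    # Guard against outputs slightly outside [-1, 1].
    raw_pi = max(-1.0, min(1.0, raw_pi))
    raw_c = max(-1.0, min(1.0, raw_c))
    pi = 0.5 * (raw_pi + 1.0) * max_weight
    # Keep the consumption fraction strictly positive so that U(c) is defined.
    frac = max(min_consumption_frac, 0.5 * (raw_c + 1.0))
    return pi, frac * wealth


if __name__ == "__main__":
    print(scale_action(0.0, 0.0, wealth=100.0))
```

Scaling consumption by current wealth enforces the state-dependent upper bound $c_{t} \le X_{t}$ without having to clip after the fact.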

The Reward Function ( $R_{t}$ ): The feedback mechanism designed to maximize the mathematical utility of consumption while penalizing financial insolvency or failure to match liabilities:

$$R_{t} = U (c_{t}) - ψ \cdot \mathbb{1} (X_{t} < 0) - ω \cdot \max (0, L_{t} - X_{t})$$

Where $U (c_{t})$ is the chosen utility function (CRRA or CARA), $\mathbb{1} (\cdot)$ is an indicator function that triggers a severe penalty $ψ$ on bankruptcy, and $ω$ is an institutional penalty weight applied to the shortfall against the structural liabilities $L_{t}$.
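The reward formula translates almost line by line into code. In the sketch below the utility value is passed in precomputed, and the default values for $ψ$ and $ω$ are arbitrary placeholders chosen for illustration, since the report does not state the values used.

```python
def lifecycle_reward(utility, wealth, liability, psi=100.0, omega=1.0):
    """R_t = U(c_t) - psi * 1(X_t < 0) - omega * max(0, L_t - X_t)."""
    bankruptcy_penalty = psi if wealth < 0 else 0.0
    shortfall_penalty = omega * max(0.0, liability - wealth)
    return utility - bankruptcy_penalty - shortfall_penalty


if __name__ == "__main__":
    # Solvent and fully funded: the reward is just the utility of consumption.
    print(lifecycle_reward(utility=1.0, wealth=50.0, liability=30.0))
    # Solvent but underfunded by 10 units.
    print(lifecycle_reward(utility=1.0, wealth=20.0, liability=30.0, omega=0.5))
```

Because utility values and currency shortfalls live on very different scales, the relative size of `psi` and `omega` would need tuning in practice.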

The Twin Delayed DDPG (TD3) Architecture

Standard continuous-action algorithms such as Deep Deterministic Policy Gradient (DDPG) frequently fail in highly volatile financial environments. DDPG suffers from overestimation bias: its critic network tends to overvalue the expected future reward of specific asset allocations, which leads to suboptimal policies and can cause training to diverge.

To ensure stable policy learning across 30-to-60-year horizons, we deployed a Twin Delayed DDPG (TD3) architecture. TD3 introduces three critical algorithmic modifications to stabilize value function approximation:

1. Clipped Double-Q Learning

The agent maintains two independent critic networks, $Q_{ϕ_{1}} (s, a)$ and $Q_{ϕ_{2}} (s, a)$ , alongside their corresponding target networks. When calculating the target value $y_{t}$ for the Bellman backup update, the architecture selects the minimum estimated value between the two critics:

$$y_{t} = R_{t} + β \min_{i = 1, 2} Q_{ϕ_{i}, \text{targ}} (S_{t + 1}, \tilde{a}_{t + 1})$$

Where $β$ is the per-step discount factor. Taking the minimum counters the overestimation bias by favoring the more conservative of the two estimates of future utility.
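For a single transition, the clipped double-Q target is a one-line computation once the two target critics have been evaluated. The sketch below takes those two values as plain numbers. The `done` flag, which drops the bootstrap term at the final age step, is a common convention assumed here and does not appear in the formula above.

```python
def td3_target(reward, next_q1, next_q2, beta=0.99, done=False):
    """Clipped double-Q target: y = R + beta * min(Q1_targ, Q2_targ).

    next_q1 and next_q2 are the two target critics evaluated at the next
    state and the smoothed target action.
    """
    if done:
        # No future value beyond the terminal step (assumed convention).
        return reward
    return reward + beta * min(next_q1, next_q2)


if __name__ == "__main__":
    # The more pessimistic critic (8.0) determines the target.
    print(td3_target(reward=1.0, next_q1=10.0, next_q2=8.0, beta=0.5))
```

In a batched implementation the same expression is applied elementwise across the replay batch.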

2. Target Policy Smoothing

Financial markets are noisy, meaning highly similar state vectors can yield radically different rewards. To prevent the policy from overfitting to narrow, high-yielding training paths, TD3 adds a small, clipped noise vector to the target action:

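$$\tilde{a}_{t + 1} = \text{clip} (π_{θ, \text{targ}} (S_{t + 1}) + ϵ, a_{\min}, a_{\max}), \quad ϵ \sim \text{clip} (\mathcal{N} (0, σ^{2}), - c, c)$$

Here $π_{θ, \text{targ}}$ is the target actor network, and $σ$ and $c$ are the standard deviation and clipping bound of the smoothing noise, not the asset volatility and consumption rate defined earlier.

The added noise forces the critic networks to smooth their value surface over a local neighborhood of actions, so that small shifts in portfolio weights do not cause erratic swings in estimated future utility.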
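A minimal sketch of target policy smoothing for a two-dimensional action is shown below. The noise scale `sigma=0.2` and clip bound `noise_clip=0.5` are illustrative values, the function names are hypothetical, and the same noise scale is applied to every action dimension, which implicitly assumes the actions are expressed in comparable (for example normalized) units.

```python
import random


def clip(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def smoothed_target_action(target_action, bounds, sigma=0.2, noise_clip=0.5, rng=random):
    """Add clipped Gaussian noise to each action component, then clip to bounds."""
    smoothed = []
    for a, (lo, hi) in zip(target_action, bounds):
        eps = clip(rng.gauss(0.0, sigma), -noise_clip, noise_clip)
        smoothed.append(clip(a + eps, lo, hi))
    return smoothed


if __name__ == "__main__":
    rng = random.Random(42)
    # Example in normalized units: risky weight in [0, 1.5], consumption fraction in [0, 1].
    print(smoothed_target_action([0.9, 0.3], [(0.0, 1.5), (0.0, 1.0)], rng=rng))
```

The outer clip keeps the perturbed action feasible, so the critics are never trained on targets evaluated outside the action space.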

3. Delayed Policy & Target Updates

In an asset allocation framework, updating the actor (policy) network before the critic networks have learned a reasonably accurate value landscape leads to unstable training. TD3 addresses this by updating the actor network $π_{θ}$ and all target networks less frequently than the critics (e.g., one policy update for every two critic updates). The delay means the actor is guided by lower-variance, more reliable value gradients.
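The update schedule itself is simple to express. The sketch below counts how often each component is updated for a given delay, and includes a soft (Polyak) target update as one common way to move target parameters toward the online ones. The soft update and the value of `tau` are assumptions, since the report does not specify how the target networks are refreshed.

```python
def soft_update(target_params, online_params, tau=0.005):
    """Move each target parameter a fraction tau toward its online counterpart."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


def count_updates(total_steps, policy_delay=2):
    """Return (critic_updates, delayed_updates) after total_steps training steps."""
    critic_updates = 0
    delayed_updates = 0
    for step in range(1, total_steps + 1):
        critic_updates += 1            # critics are updated on every step
        if step % policy_delay == 0:
            delayed_updates += 1       # actor and target networks only every policy_delay steps
    return critic_updates, delayed_updates


if __name__ == "__main__":
    print(count_updates(10_000, policy_delay=2))
    print(soft_update([0.0, 2.0], [1.0, 2.0], tau=0.5))
```

With `policy_delay=2` the actor receives half as many gradient steps as the critics, matching the one-to-two ratio mentioned above.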

To illustrate what we wanted to achieve, the simple graph below shows the convergence of the actor toward the analytically solved Merton lifecycle portfolio problem. In each period (one year, or another time step of choice) the agent observes the market and decides how much wealth to allocate to stocks, bonds, or consumption. At the end of the period the agent receives a reward given by the utility function, so a good decision yields high utility and a poor one low utility. After enough training epochs the agent’s policy converges toward the optimal Merton solution, which indicates that it has learned investment decisions that maximize its utility.

[Figure: snapshot of TD3 training at one epoch (epoch counter 0 / 10,000), showing the agent converging toward the analytical Merton portfolio solution. Plot 1: Dynamic Index Returns vs Portfolio Value Growth (series: stocks growth, portfolio growth). Plot 2: Weight Allocation Strategy ($π_{t}$) vs Analytical Merton Target (series: optimal Merton target, TD3 policy weight). Plot 3: Analytical Utility Surface $V (W, t)$ (Merton expected value surface). Horizon runs from $t = 0$ (youth) to $t = 35$ (horizon target); status at epoch 0: initial state, poor allocation performance.]
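For the special case of CRRA utility with constant $μ$, $r$, and $σ$, the classical Merton result gives a constant optimal risky fraction of wealth, $(μ - r) / (γ σ^{2})$. The sketch below computes that benchmark and a simple mean absolute gap between a sequence of learned weights and the target. It is offered as an illustration of the kind of comparison shown in Plot 2. The function names and parameter values are hypothetical, and the benchmark actually used in the PoC may differ.

```python
def merton_fraction(mu, r, sigma, gamma):
    """Classical Merton risky-asset fraction for CRRA utility: (mu - r) / (gamma * sigma^2)."""
    if sigma <= 0 or gamma <= 0:
        raise ValueError("sigma and gamma must be positive")
    return (mu - r) / (gamma * sigma ** 2)


def mean_abs_gap(policy_weights, target):
    """Average absolute distance between learned weights and the analytical target."""
    if not policy_weights:
        raise ValueError("policy_weights must not be empty")
    return sum(abs(w - target) for w in policy_weights) / len(policy_weights)


if __name__ == "__main__":
    target = merton_fraction(mu=0.08, r=0.02, sigma=0.2, gamma=3.0)
    learned = [0.10, 0.35, 0.48, 0.51]  # made-up weights, not results from this report
    print(f"Merton target: {target:.3f}, mean gap: {mean_abs_gap(learned, target):.3f}")
```

A shrinking gap over training epochs would correspond to the TD3 policy weight approaching the Merton target line.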

License

All original content by Alexander Thorne is licensed under a Creative Commons Attribution 4.0 International License.
© 2026 Helionox GmbH.