We trained two well-known reinforcement learning algorithms, Deep Q-Networks (DQN) and Deep Deterministic Policy Gradient (DDPG), in several simulated environments: geometric Brownian motion (GBM), the Heston stochastic volatility (SV) model, and an SV model with jumps in the underlying. These training environments were meant to be as diverse as possible in terms of price and volatility dynamics, so that the agents would generalize well to real market data. We then compared their performance with that of the same algorithms trained in a more specific setting, i.e., in the same type of environment and for the same type of options as those in the test set. One would expect agents trained on a more specific task to do much better than those trained to deal with more “general” situations, but this turned out not to be the case. Finally, we investigated how well our trained agents can hedge real-life options or, in other words, how well they can transfer the knowledge acquired on simulated data to the real hedging environment.

In our setting, we treat the hedging task as a utility maximization problem, in which agents seek an optimal hedging strategy according to some utility function. Such a utility function can correspond to maximizing the agent’s expected wealth (or P/L) at the end of the option’s life, minimizing the hedging error or its variance, or a combination of these; we use such a combination, in the form of a mean-variance utility.

Using a P/L-optimizing utility function, we can account for the transaction costs involved in hedging (actual trading fees, the bid-ask spread in the underlying, or market impact), and these costs can severely affect hedging decisions. On the one hand, an option trader wants to hedge as frequently as possible, since only then can a nearly perfect hedge be achieved; on the other hand, frequent hedging can lead to unacceptable transaction costs.

We train our reinforcement learning algorithms to delta-hedge European call options. A “state” at each point in time is characterized by the underlying asset price, its volatility, the number of units of the underlying in the hedging portfolio, the value of the option and its delta. We specify the reward as the increment in P/L, penalized by its variance (which corresponds to our mean-variance optimization problem). Trading costs are calculated from the tick size, the number of units of the underlying bought or sold, and a parameter reflecting the market friction for a particular underlying (which can be more or less liquid, so a different friction parameter can be used for each). Finally, the action is naturally the amount of the underlying held in the hedging portfolio until the next re-hedging date. For the mathematical details of all these choices and definitions, we refer the reader to the full paper (Giurca and Borovkova (2021)).
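To make the reward and cost definitions above more concrete, here is a minimal Python sketch. It is not the formulation from the full paper: the cost function (a linear term plus a small quadratic term, scaled by tick size and a friction parameter), the quadratic variance penalty, the short-call position and all names and default values are assumptions chosen for illustration.

```python
def trading_cost(units_traded: float, tick_size: float = 0.1,
                 friction: float = 1.0) -> float:
    """Hypothetical cost of trading `units_traded` units of the underlying.

    Assumed form: friction * tick_size * (|n| + 0.01 * n^2), i.e. a
    spread-like linear term plus a small market-impact term.
    """
    n = abs(units_traded)
    return friction * tick_size * (n + 0.01 * n ** 2)


def step_reward(holding: float, price_change: float,
                option_value_change: float, units_traded: float,
                risk_aversion: float = 0.1, tick_size: float = 0.1,
                friction: float = 1.0) -> float:
    """One-step reward: P/L increment penalized by its squared size.

    Assumes the agent is short one call and long `holding` units of the
    underlying, so the P/L increment is the stock gain minus the change
    in option value minus trading costs.
    """
    pnl = (holding * price_change
           - option_value_change
           - trading_cost(units_traded, tick_size, friction))
    # Mean-variance style penalty: the squared increment acts as a
    # per-step proxy for the variance of P/L.
    return pnl - 0.5 * risk_aversion * pnl ** 2
```

With these assumed defaults, a perfectly offset step (stock gain equal to the option value change, no trade) yields a reward of zero, while any residual P/L is rewarded linearly and penalized quadratically.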

Note that the above framework is suitable not just for (a portfolio of) European call options: it can be extended to puts, to options with more exotic payoffs, and to other option types (path-dependent options, American options, or options on dividend-paying assets).

The evolution of the market (i.e., of the underlying asset) is simulated by a combination of the Black-Scholes model, the Heston stochastic volatility model and the Bates model of stochastic volatility with jumps. With transfer learning in mind, we construct a rich training set that exposes the agents to sufficiently varied training environments to enable them to hedge options in a real market environment. We do this by including episodes generated by a mixture of different market simulators in our training set. With a training set based solely on realizations of one model, the learned optimal policies are specific to that environment; in our setting, the agents' strategies are (hopefully) robust to an unknown testing environment. Instead of assuming a single model that best reflects the asset price dynamics of real market data, we use several: Black-Scholes, Heston and Bates.
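The mixed training set described above can be sketched as follows, assuming NumPy is available. This is an illustration only: the simple Euler discretization, the truncation of variance at zero, the uniform choice of model per episode and every parameter value are assumptions, not the calibrated setup of the paper. For brevity, the Bates branch also omits the usual drift compensation for jumps.

```python
import numpy as np


def simulate_path(model, rng, s0=100.0, n_steps=126, dt=1 / 252, mu=0.05,
                  v0=0.04, kappa=2.0, theta=0.04, xi=0.3, rho=-0.7,
                  jump_intensity=0.5, jump_mean=-0.05, jump_std=0.1):
    """Simulate one path under 'gbm', 'heston' or 'bates' (Euler scheme)."""
    prices = np.empty(n_steps + 1)
    variances = np.empty(n_steps + 1)
    prices[0], variances[0] = s0, v0
    for t in range(n_steps):
        v = variances[t]
        z1 = rng.standard_normal()
        # Second shock, correlated with the price shock.
        z2 = rho * z1 + np.sqrt(1.0 - rho ** 2) * rng.standard_normal()
        log_ret = (mu - 0.5 * v) * dt + np.sqrt(v * dt) * z1
        if model == "bates":
            # Compound Poisson jumps with normally distributed log sizes.
            n_jumps = rng.poisson(jump_intensity * dt)
            log_ret += rng.normal(jump_mean, jump_std, size=n_jumps).sum()
        prices[t + 1] = prices[t] * np.exp(log_ret)
        if model == "gbm":
            variances[t + 1] = v0  # constant volatility
        else:
            new_v = v + kappa * (theta - v) * dt + xi * np.sqrt(v * dt) * z2
            variances[t + 1] = max(new_v, 0.0)  # truncate at zero
    return prices, variances


def make_training_set(n_episodes, seed=0,
                      models=("gbm", "heston", "bates")):
    """Draw a simulator at random for each episode (assumed uniform mix)."""
    rng = np.random.default_rng(seed)
    episodes = []
    for _ in range(n_episodes):
        model = models[rng.integers(len(models))]
        prices, variances = simulate_path(model, rng)
        episodes.append((model, prices, variances))
    return episodes
```

Calling `make_training_set(1000)` would return a list of `(model, prices, variances)` tuples from which an agent's episodes could be built; in practice the mixing proportions and model parameters would come from calibration rather than the placeholders used here.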

The performance of the reinforcement learning agents is compared with two benchmark hedging strategies: discrete-time Black-Scholes delta hedging and Wilmott delta hedging (a variant of the Black-Scholes hedging strategy that balances transaction costs against the quality of discrete delta hedging, as measured by Gamma).
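As a rough illustration of the two benchmarks, the sketch below computes the Black-Scholes delta and Gamma of a European call and then applies a no-trade band around the delta. The band half-width uses the commonly quoted Whalley-Wilmott asymptotic form, (3/2 · e^(−rτ) · λ · S · Γ² / γ)^(1/3); whether the paper uses exactly this variant is an assumption, as are the function names and the default values of the cost rate λ and the risk aversion γ.

```python
import math


def _norm_cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def _norm_pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def bs_delta_gamma(s, k, tau, sigma, r=0.0):
    """Black-Scholes delta and Gamma of a European call (no dividends)."""
    d1 = ((math.log(s / k) + (r + 0.5 * sigma ** 2) * tau)
          / (sigma * math.sqrt(tau)))
    delta = _norm_cdf(d1)
    gamma = _norm_pdf(d1) / (s * sigma * math.sqrt(tau))
    return delta, gamma


def band_hedge(current_holding, s, k, tau, sigma, r=0.0,
               cost_rate=0.001, risk_aversion=1.0):
    """Target holding under an assumed Whalley-Wilmott style no-trade band.

    If the current holding lies inside [delta - h, delta + h], do nothing;
    otherwise trade only as far as the nearest edge of the band.
    """
    delta, gamma = bs_delta_gamma(s, k, tau, sigma, r)
    half_width = (1.5 * math.exp(-r * tau) * cost_rate * s * gamma ** 2
                  / risk_aversion) ** (1.0 / 3.0)
    lower, upper = delta - half_width, delta + half_width
    return min(max(current_holding, lower), upper)
```

Plain discrete-time delta hedging corresponds to always moving the holding to `delta` at each re-hedging date; the band version trades less often when Gamma is small, which is where the cost saving comes from.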

For the simulated-to-real knowledge transfer analysis, we apply our trained algorithms to a rich set of daily S&P500 options data from 2019. We chose not to include data from 2020, a year of abnormal market conditions due to the Covid-19 pandemic, which caused large fluctuations in stock prices and volatility. We test our algorithms on options that are close to at-the-money and have up to 6 months until maturity. For all the models considered, the S&P500 price evolution is simulated using model parameters calibrated on market data from 2009 to 2018.