AI × Robotics

Reinforcement Learning for Robot Control

Parallel Robots · 45 min read · By Grover Aruquipa

Classical control of parallel robots relies on precise kinematic and dynamic models. But what if the robot could learn to control itself? Reinforcement Learning (RL) offers a fundamentally different paradigm: instead of deriving equations, an agent learns optimal actions through trial and error, guided by rewards. This tutorial is your starting point for RL-based control, with a focus on parallel manipulators.

What You Will Learn

The core concepts of Reinforcement Learning (states, actions, rewards, policies), how to formulate a parallel robot control problem as an MDP, the key RL algorithms (PPO, SAC, TD3) used in robotics, and a complete worked example of training an RL agent to control a Stewart platform for trajectory tracking. You will also interact with a live simulation of the trained agent.

1. Why RL for Parallel Robots?

Parallel robots are notoriously difficult to control with classical methods. Their closed-loop kinematics create coupled, highly nonlinear dynamics. The inverse dynamics require solving complex equations in real time. And near singularities, model-based controllers can become unstable. RL offers an alternative.

| Aspect | Classical Control | RL-Based Control |
|---|---|---|
| Model requirement | Full dynamic model needed | Model-free (learns from interaction) |
| Singularity handling | Requires explicit avoidance logic | Learns to avoid through negative reward |
| Adaptability | Re-derive for each platform | Re-train (same algorithm, different env) |
| Friction & backlash | Hard to model accurately | Learns to compensate implicitly |
| Optimality | Depends on tuning (PID gains) | Optimizes for defined reward function |
| Computation at runtime | Dynamics solved each step | Simple neural network forward pass |

The Core Idea

An RL agent observes the robot’s state (joint positions, velocities, tracking error), takes an action (actuator forces/velocities), receives a reward (how well it tracked the target), and gradually learns a policy that maximizes cumulative reward.
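
The loop above can be sketched in a few lines of Python. Everything here is a stand-in (random policy, toy drift dynamics, made-up reward) so that only the control flow matters, not the physics:

```python
import numpy as np

rng = np.random.default_rng(0)

def policy(state):
    # Stand-in policy: random commands for 6 actuators in [-1, 1].
    return rng.uniform(-1.0, 1.0, size=6)

def env_step(state, action):
    # Stand-in dynamics: the state drifts with the action; the reward
    # penalizes squared "error" (first 6 entries) and control effort.
    next_state = state + 0.01 * np.concatenate([action, action])
    error = next_state[:6]
    reward = -np.sum(error**2) - 0.1 * np.sum(action**2)
    return next_state, reward

state = np.zeros(12)
total_reward = 0.0
for t in range(100):
    action = policy(state)                   # observe s_t, pick a_t
    state, reward = env_step(state, action)  # environment returns s_{t+1}, r_t
    total_reward += reward                   # objective: cumulative reward
```

Learning replaces the random `policy` with one that improves `total_reward` over many such episodes.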

2. RL Fundamentals

Before applying RL to robots, you need to understand the basic framework. RL is grounded in the theory of Markov Decision Processes (MDPs).

2.1 The Agent-Environment Loop

At each time step $t$, the agent observes the current state $s_t$, selects an action $a_t$ according to its policy $\pi(a|s)$, and the environment transitions to a new state $s_{t+1}$ and returns a reward $r_t$.

[Diagram: the agent (policy π(a|s), a neural network) sends actions to the environment (a Stewart platform physics simulation), which returns the next state and a reward r = -‖error‖² - λ‖a‖².]

Figure 1. The RL agent-environment interaction loop applied to Stewart platform control. At each timestep, the agent selects actuator commands, and the environment returns the new platform state and a reward signal.

2.2 Key Definitions

Markov Decision Process (MDP)

An MDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:

- $\mathcal{S}$: the set of possible states
- $\mathcal{A}$: the set of available actions
- $P(s' \mid s, a)$: the transition probability of reaching state $s'$ after taking action $a$ in state $s$
- $R(s, a)$: the reward received for taking action $a$ in state $s$
- $\gamma \in [0, 1)$: the discount factor weighting future rewards against immediate ones

2.3 The Objective

The agent’s goal is to find a policy $\pi^*$ that maximizes the expected discounted return:

$$\pi^* = \arg\max_\pi \; \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t \, r_t \right]$$

In plain language: the agent learns to act in a way that accumulates as much reward as possible over time, with a slight preference for sooner rewards (controlled by $\gamma$).
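
A quick numerical check of this objective: with a constant reward of 1 per step, the infinite-horizon return is $1/(1-\gamma)$, and a long finite rollout approaches that bound.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Compute sum over t of gamma^t * r_t for a finite reward sequence."""
    return sum(g * r for g, r in zip(gamma ** np.arange(len(rewards)), rewards))

# Constant reward of 1 per step: the return approaches 1 / (1 - 0.99) = 100
ret = discounted_return(np.ones(1000), gamma=0.99)
```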

2.4 Policy, Value Function, and Q-Function

Three central concepts in RL:

| Concept | Symbol | Meaning |
|---|---|---|
| Policy | $\pi(a \mid s)$ | Probability of taking action $a$ in state $s$ |
| Value function | $V^\pi(s)$ | Expected return starting from state $s$, following $\pi$ |
| Q-function | $Q^\pi(s,a)$ | Expected return after taking action $a$ in state $s$, then following $\pi$ |
$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t r_t \,\middle|\, s_0 = s\right], \qquad Q^\pi(s,a) = \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a\right]$$

Intuition

Think of $V(s)$ as the agent asking: “How good is my current situation?” And $Q(s,a)$ as: “How good is it to do this specific action right now?” The policy then picks the action with the highest Q-value (or samples from a distribution for exploration).

3. Formulating the Control Problem as an MDP

The most critical step in applying RL to robotics is designing the MDP. A poor formulation will prevent the agent from learning, regardless of the algorithm. Here we formulate trajectory tracking for a 6-DOF Stewart platform.

3.1 State Space $\mathcal{S}$

The state must contain all information the agent needs to make good decisions. For a Stewart platform tracking a trajectory:

$$s_t = \begin{bmatrix} \mathbf{p}_t \\ \dot{\mathbf{p}}_t \\ \boldsymbol{\Phi}_t \\ \dot{\boldsymbol{\Phi}}_t \\ \mathbf{e}_t \\ \mathbf{p}_{\text{target},t} \end{bmatrix} \in \mathbb{R}^{24}$$

where:

- $\mathbf{p}_t \in \mathbb{R}^3$: platform position
- $\dot{\mathbf{p}}_t \in \mathbb{R}^3$: platform linear velocity
- $\boldsymbol{\Phi}_t \in \mathbb{R}^3$: platform orientation (Euler angles)
- $\dot{\boldsymbol{\Phi}}_t \in \mathbb{R}^3$: platform angular velocity
- $\mathbf{e}_t \in \mathbb{R}^6$: pose tracking error (position and orientation)
- $\mathbf{p}_{\text{target},t} \in \mathbb{R}^6$: target pose on the reference trajectory

Design Tip

Always include the tracking error explicitly in the state. Although the agent could compute it from $\mathbf{p}_t$ and $\mathbf{p}_{\text{target}}$, providing it directly accelerates learning significantly.
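
As a concrete sketch, the 24-dimensional state above can be assembled like this (the helper name and argument layout are illustrative), with the tracking error computed and appended explicitly as the tip recommends:

```python
import numpy as np

def build_state(p, p_dot, phi, phi_dot, p_target):
    """Assemble the 24-dim state from Section 3.1.

    p, p_dot     : platform position / linear velocity, shape (3,)
    phi, phi_dot : orientation (Euler angles) / angular velocity, shape (3,)
    p_target     : target pose [position; orientation], shape (6,)
    """
    pose = np.concatenate([p, phi])  # current 6-dim pose
    e = p_target - pose              # explicit tracking error (design tip)
    return np.concatenate([p, p_dot, phi, phi_dot, e, p_target])

# Home pose with a unit-offset target: the error equals the target pose
s = build_state(np.zeros(3), np.zeros(3), np.zeros(3), np.zeros(3), np.ones(6))
```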

3.2 Action Space $\mathcal{A}$

The action is the control command sent to the actuators. For the Stewart platform with 6 prismatic actuators, we typically use:

$$a_t = [\Delta\rho_1, \Delta\rho_2, \dots, \Delta\rho_6]^T \in [-1, 1]^6$$

Actions are normalized to $[-1, 1]$ and then scaled to the actual actuator velocity or force range. This normalization is essential for stable neural network training.
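
A minimal sketch of the normalization-to-hardware mapping (the velocity limits shown are illustrative, not specifications of any particular platform):

```python
import numpy as np

def scale_action(a_norm, low, high):
    """Map a normalized action in [-1, 1]^6 to the actuator range [low, high]."""
    a_norm = np.clip(a_norm, -1.0, 1.0)  # guard against out-of-range outputs
    return low + 0.5 * (a_norm + 1.0) * (high - low)

# Example: prismatic actuator velocity limits of +/- 0.05 m/s
low, high = -0.05 * np.ones(6), 0.05 * np.ones(6)
v = scale_action(np.zeros(6), low, high)  # zero action maps to zero velocity
```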

3.3 Reward Function $R(s, a)$

The reward function is the most important design choice. It defines what “good” means. A well-designed reward for trajectory tracking:

$$r_t = -\alpha \|\mathbf{e}_{\text{pos},t}\|^2 - \beta \|\mathbf{e}_{\text{ori},t}\|^2 - \lambda \|a_t\|^2 + r_{\text{bonus}}$$

where:

- $\mathbf{e}_{\text{pos},t}$ and $\mathbf{e}_{\text{ori},t}$: position and orientation tracking errors
- $\alpha, \beta > 0$: weights balancing position versus orientation accuracy
- $\lambda > 0$: action penalty weight discouraging large, jerky commands
- $r_{\text{bonus}}$: a small positive bonus when the error drops below a tolerance (the worked example uses $+5$ when $\|\mathbf{e}\| < 0.01$)

Reward Shaping Pitfall

If the reward is too sparse (e.g., +1 only at the exact target), the agent will almost never receive positive feedback and learning stalls. Use dense rewards (continuous penalties proportional to error) for robot control. Also: always add a small action penalty $\lambda\|a\|^2$ to prevent actuator chatter (rapid oscillations).
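
Putting the pieces together, the reward might be implemented as follows; $\alpha = 10$ and $\lambda = 0.1$ match the hyperparameter table in Section 5.2, the $+5$ bonus below $\|\mathbf{e}\| < 0.01$ matches the worked example, and $\beta$ is an illustrative choice:

```python
import numpy as np

def reward(e_pos, e_ori, action, alpha=10.0, beta=1.0, lam=0.1,
           tol=0.01, bonus=5.0):
    """Dense tracking reward from Section 3.3 (beta is illustrative)."""
    r = (-alpha * np.sum(e_pos**2)       # position error penalty (dense)
         - beta * np.sum(e_ori**2)       # orientation error penalty (dense)
         - lam * np.sum(action**2))      # action penalty against chatter
    if np.linalg.norm(np.concatenate([e_pos, e_ori])) < tol:
        r += bonus                       # sparse bonus for precise tracking
    return r

r = reward(np.zeros(3), np.zeros(3), np.zeros(6))  # perfect tracking, zero effort
```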

3.4 Episode Structure

Reset

Platform starts at a random pose near the home position. A random target trajectory is generated (e.g., circular, sinusoidal, or point-to-point).

Step loop (500–2000 steps at 100 Hz)

At each step: observe $s_t$, compute $a_t = \pi(s_t)$, apply to actuators, simulate physics, compute $r_t$, observe $s_{t+1}$.

Termination

Episode ends if: max steps reached, platform leaves workspace, or actuator limits are violated. Early termination with a large negative reward teaches the agent to stay safe.
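
The reset/step/termination structure can be sketched as a minimal environment class. The dynamics and limits here are placeholders; a real environment would integrate the platform's forward dynamics:

```python
import numpy as np

class StewartTrackingEnv:
    """Minimal episode skeleton (reset / step / termination)."""

    def __init__(self, max_steps=1000, workspace_radius=0.2):
        self.max_steps = max_steps
        self.workspace_radius = workspace_radius
        self.rng = np.random.default_rng(0)

    def reset(self):
        self.t = 0
        # Random pose near the home position, as described above
        self.pose = 0.01 * self.rng.standard_normal(6)
        return self.pose.copy()

    def step(self, action):
        self.t += 1
        self.pose += 0.001 * np.clip(action, -1, 1)  # placeholder dynamics
        out = np.linalg.norm(self.pose[:3]) > self.workspace_radius
        terminated = out                    # safety termination
        truncated = self.t >= self.max_steps  # time limit reached
        reward = -np.sum(self.pose**2) - 100.0 * out  # big penalty on exit
        return self.pose.copy(), reward, terminated, truncated

env = StewartTrackingEnv(max_steps=50)
obs = env.reset()
for _ in range(60):
    obs, r, term, trunc = env.step(np.zeros(6))
    if term or trunc:
        break
```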

4. Key Algorithms for Robot Control

Not all RL algorithms work well for continuous-action robot control. Here are the three most successful families for this task.

4.1 Proximal Policy Optimization (PPO)

PPO is an on-policy algorithm that updates the policy using a clipped surrogate objective. It is the most widely used algorithm in sim-to-real robotics.

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t, \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]$$

where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ is the probability ratio and $\hat{A}_t$ is the advantage estimate. The clipping prevents destructively large policy updates.
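
The clipped surrogate is simple to compute directly. A NumPy sketch (negated so it reads as a loss to minimize; this is the objective only, not PPO's full training loop):

```python
import numpy as np

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    """Clipped surrogate objective, negated into a loss."""
    ratio = np.exp(log_probs_new - log_probs_old)        # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))      # pessimistic bound

# With identical old/new policies the ratio is 1 and clipping is inactive:
loss = ppo_clip_loss(np.zeros(4), np.zeros(4), np.array([1.0, -1.0, 2.0, 0.0]))
```

With a ratio of 2 and a positive advantage, the clip caps the objective at $1+\epsilon$ times the advantage, which is exactly what prevents destructively large updates.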

4.2 Soft Actor-Critic (SAC)

SAC is an off-policy algorithm that maximizes both reward and entropy (exploration). Excellent for continuous control:

$$\pi^* = \arg\max_\pi \; \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t \left(r_t + \alpha \mathcal{H}(\pi(\cdot|s_t))\right)\right]$$

The entropy term $\mathcal{H}$ encourages the agent to explore diverse actions, which prevents premature convergence to suboptimal policies.
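
To see what the entropy term rewards, consider the differential entropy of a diagonal Gaussian policy: wider action distributions score higher, so the $\alpha \mathcal{H}$ bonus pays the agent for keeping its action distribution broad. A minimal numerical illustration (not SAC's actual update):

```python
import numpy as np

def gaussian_entropy(log_std):
    """Entropy of a diagonal Gaussian: sum over dims of 0.5*log(2*pi*e*sigma^2)."""
    return np.sum(0.5 * np.log(2 * np.pi * np.e) + log_std)

# A wide policy (sigma = 1.0) has higher entropy than a narrow one (sigma = 0.1),
# so the entropy bonus resists premature collapse to a deterministic policy.
h_narrow = gaussian_entropy(np.log(0.1) * np.ones(6))
h_wide = gaussian_entropy(np.log(1.0) * np.ones(6))
```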

4.3 Twin Delayed DDPG (TD3)

TD3 is a refined version of DDPG that addresses overestimation bias using twin Q-networks and delayed policy updates.

| Algorithm | Type | Pros | Best For |
|---|---|---|---|
| PPO | On-policy | Stable, easy to tune, parallelizable | Sim-to-real, safety-critical |
| SAC | Off-policy | Sample efficient, robust exploration | Complex continuous control |
| TD3 | Off-policy | Deterministic policy, low variance | Precise tracking tasks |

Recommendation for Parallel Robots

Start with SAC for development (fast iteration due to sample efficiency), then switch to PPO for sim-to-real transfer (more stable, easier to constrain outputs). Use domain randomization in either case.

5. Worked Example: Stewart Platform Tracking

Let’s walk through a complete example. We train an RL agent (SAC) to make a Stewart platform track a circular trajectory in the $xy$-plane while maintaining constant height and orientation.

5.1 Environment Setup

[Diagram: training pipeline. The SAC agent (MLP [256, 256]) outputs normalized actions; action scaling maps [-1, 1]⁶ to actuator forces (Δρ × F_max); the Stewart 6-UPS simulator (dt = 0.01 s) runs forward dynamics with collision checking and returns the 24-dimensional state (p, v, Φ, ω, e) and the reward -‖e_pos‖² - 0.1‖a‖², plus +5 if ‖e‖ < 0.01.]

Figure 2. The complete RL training pipeline for Stewart platform trajectory tracking.

5.2 Hyperparameters

| Parameter | Value | Rationale |
|---|---|---|
| Algorithm | SAC | Off-policy, good for continuous 6D action |
| Policy network | MLP [256, 256] | Sufficient for smooth control policies |
| Learning rate | $3 \times 10^{-4}$ | Standard for SAC |
| Discount $\gamma$ | 0.99 | Long horizon (10 s episodes) |
| Batch size | 256 | Stable gradient estimates |
| Replay buffer | $10^6$ | Store many transitions for off-policy updates |
| $\alpha$ (position penalty) | 10.0 | Primary objective |
| $\lambda$ (action penalty) | 0.1 | Smooth actuator commands |
| Episode length | 1000 steps (10 s) | One full circle trajectory |
| Total training steps | $5 \times 10^5$ | ~500 episodes |

5.3 Pseudocode

# SAC Training Loop for Stewart Platform Control
import gymnasium as gym
from stable_baselines3 import SAC

# 1. Create the Stewart Platform environment
env = gym.make(
    "StewartPlatform-v1",
    base_radius=0.70,
    platform_radius=0.35,
    trajectory="circular",
    max_steps=1000,
)

# 2. Initialize SAC agent
model = SAC(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    gamma=0.99,
    batch_size=256,
    buffer_size=1_000_000,
    policy_kwargs=dict(net_arch=[256, 256]),
    verbose=1,
)

# 3. Train
model.learn(total_timesteps=500_000)

# 4. Evaluate
obs, _ = env.reset()
for step in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break
print(f"Mean tracking error: {info['mean_error']:.4f} m")

5.4 Training Progress

The chart below shows a simulated training curve for this problem. The three phases are typical of RL for robot control:

  1. Exploration (0–50k steps): Agent takes random actions, reward is very negative
  2. Rapid learning (50k–250k): Agent discovers the tracking strategy, reward climbs quickly
  3. Fine-tuning (250k–500k): Agent refines precision, small gains in tracking accuracy

6. Interactive: Watch the Agent Learn

Below is a live simulation of an RL agent controlling a Stewart platform. On the left, the 3D Stewart platform tracks a circular trajectory (gold ring). On the right, the training reward curve shows the agent’s progress.

Click Train to start training from scratch, Step to advance one episode, or use the Speed slider to control the simulation. Watch how the platform’s motion improves as training progresses!

[Interactive widget: "RL Agent: Stewart Platform Tracking" shows the platform following the circular trajectory alongside the training-progress reward curve, with episode, tracking-error, reward, and step counters and a speed control.]

What to Observe

Early in training the platform moves erratically and the tracking error stays large. As the reward curve climbs, the motion begins to follow the gold ring, and by late training the platform tracks smoothly with a small, steady error. Most of the improvement happens during the rapid-learning phase described in Section 5.4.

7. Challenges & Practical Tips

7.1 Reward Engineering

The reward function can make or break your RL controller.

| Problem | Symptom | Solution |
|---|---|---|
| Sparse reward | No learning progress | Dense error-based penalty |
| No action penalty | Chattering / oscillation | Add $\lambda \lVert a \rVert^2$ or $\lambda \lVert \Delta a \rVert^2$ |
| Reward hacking | Agent finds a loophole | Add constraints, clip rewards |
| Scale mismatch | Orientation ignored | Normalize and balance $\alpha, \beta$ |

7.2 Observation Normalization

Critical for Convergence

Always normalize observations to roughly $[-1, 1]$ or zero mean with unit variance. Neural networks are sensitive to input scales. Use a running mean/std normalizer (available in most RL libraries). For the Stewart platform: positions in meters and velocities in m/s have very different scales—normalization is essential.
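
A running normalizer can be sketched in a few lines (a Welford-style streaming update; most RL libraries ship an equivalent as an environment wrapper):

```python
import numpy as np

class RunningNormalizer:
    """Online per-dimension mean/variance tracker for observation scaling."""

    def __init__(self, dim, eps=1e-8):
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)
        self.count = 0
        self.eps = eps

    def update(self, x):
        # Welford-style streaming update of mean and variance
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.var += (delta * (x - self.mean) - self.var) / self.count

    def normalize(self, x):
        return (x - self.mean) / np.sqrt(self.var + self.eps)

# Observations on very different scales (e.g. meters vs m/s) end up ~N(0, 1)
norm = RunningNormalizer(2)
rng = np.random.default_rng(0)
for _ in range(5000):
    norm.update(rng.normal(loc=[5.0, -3.0], scale=[2.0, 0.5]))
z = norm.normalize(np.array([5.0, -3.0]))  # near the running mean, so near zero
```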

7.3 Action Smoothing

Raw RL policies can produce jerky commands. Two techniques to smooth actions:

- Low-pass filtering: pass each command through a first-order filter so the applied action changes gradually between steps.
- Action-rate penalty: penalize $\|\Delta a_t\|^2 = \|a_t - a_{t-1}\|^2$ in the reward, discouraging step-to-step changes rather than magnitudes.

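One common choice, a first-order low-pass filter on the commands, can be sketched as follows (the smoothing constant $\tau$ is illustrative):

```python
import numpy as np

class ActionFilter:
    """First-order low-pass filter: a_smooth = (1 - tau) * prev + tau * a_raw."""

    def __init__(self, dim, tau=0.2):
        self.prev = np.zeros(dim)
        self.tau = tau

    def __call__(self, a_raw):
        self.prev = (1.0 - self.tau) * self.prev + self.tau * np.asarray(a_raw)
        return self.prev

# A step command is applied gradually rather than all at once
f = ActionFilter(6, tau=0.2)
step1 = f(np.ones(6))  # first filtered output after a unit step
step2 = f(np.ones(6))  # moves further toward the raw command
```

Smaller $\tau$ gives smoother but laggier commands; the filter state should be reset at each episode start.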
7.4 Curriculum Learning

Start with easy tasks and gradually increase difficulty:

Phase 1: Point stabilization

Agent learns to hold the platform at a fixed position.

Phase 2: Slow trajectory

Simple circular path at low speed (0.01 m/s).

Phase 3: Full speed + noise

Complex trajectories with disturbances and observation noise.

7.5 Domain Randomization

To make the policy robust for sim-to-real transfer, randomize during training:

- Dynamic parameters: platform and payload mass, inertia, joint friction
- Actuator properties: force limits, gains, and response delays
- Sensor noise: additive noise on measured positions and velocities
- Initial conditions and target trajectories

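In code, domain randomization usually means resampling physical parameters at every episode reset. A hypothetical sketch (all names and ranges are illustrative, not tuned values):

```python
import numpy as np

def sample_randomized_params(rng, nominal_mass=12.0, nominal_friction=0.3):
    """Sample one episode's physics parameters (hypothetical ranges)."""
    return {
        "platform_mass": nominal_mass * rng.uniform(0.8, 1.2),       # +/- 20%
        "joint_friction": nominal_friction * rng.uniform(0.5, 1.5),  # +/- 50%
        "actuator_gain": rng.uniform(0.9, 1.1),                      # +/- 10%
        "obs_noise_std": rng.uniform(0.0, 0.005),                    # up to 5 mm
        "action_delay_steps": rng.integers(0, 3),                    # 0-2 steps
    }

rng = np.random.default_rng(42)
params = sample_randomized_params(rng)  # apply these to the simulator at reset
```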
8. Beyond: Sim-to-Real Transfer

The ultimate goal is to deploy the learned policy on a real robot. The gap between simulation and reality (the sim-to-real gap) is the biggest challenge in RL-based robot control.

[Diagram: SIMULATION (train policy, domain randomization, 500k+ steps) → TRANSFER (policy deployment, system identification) → REAL ROBOT (execute policy, fine-tune online, safety constraints), with optional real-world fine-tuning data fed back into simulation.]

Figure 3. The sim-to-real pipeline. A policy trained in simulation with domain randomization is deployed on the real Stewart platform, with optional online fine-tuning.

Key Strategies for Sim-to-Real

- Domain randomization: train across a distribution of physical parameters so the real robot behaves like just another sample from training.
- System identification: measure the real platform's parameters and center the simulation's distributions on them.
- Safety constraints: limit action magnitudes and enforce workspace bounds when first executing the policy on hardware.
- Online fine-tuning: optionally continue training with real-world data after deployment.

The Future

RL for parallel robot control is an active research area. Current trends include model-based RL (learning a dynamics model for faster training), multi-task learning (one policy for many trajectories), and safe RL (guaranteeing constraint satisfaction during exploration).

Summary

Key Takeaways

- RL replaces explicit dynamic models with a policy learned from interaction, which is attractive for parallel robots whose closed-loop kinematics make model-based control difficult.
- The MDP formulation (informative state, normalized actions, dense reward with an action penalty) matters more than the choice of algorithm.
- PPO, SAC, and TD3 are the workhorse algorithms for continuous-action robot control: SAC for fast development iteration, PPO for stable sim-to-real transfer.
- Observation normalization, action smoothing, curriculum learning, and domain randomization are the practical techniques that make training converge and transfer.

Quick Check

Q1. In the MDP formulation for a Stewart platform, the action space typically consists of:

- The target position of the platform
- Normalized actuator force/velocity commands for 6 legs
- Joint angles of the base

Q2. Why is a dense reward function preferred over a sparse one for robot control?

- Dense rewards are computationally cheaper
- Dense rewards provide continuous feedback, enabling faster learning
- Sparse rewards cause the robot to move too fast

Q3. The action penalty term $\lambda\|a\|^2$ in the reward function serves to:

- Increase the exploration rate
- Make the robot move faster
- Prevent actuator chattering and encourage smooth, energy-efficient commands

Q4. Domain randomization helps with:

- Bridging the sim-to-real gap by making the policy robust to parameter variations
- Reducing the training time
- Eliminating the need for a simulator