Classical control of parallel robots relies on precise kinematic and dynamic models. But what if the robot could learn to control itself? Reinforcement Learning (RL) offers a fundamentally different paradigm: instead of deriving equations, an agent learns optimal actions through trial and error, guided by rewards. This tutorial is a starting point for RL-based control, with a focus on parallel manipulators.
What You Will Learn
The core concepts of Reinforcement Learning (states, actions, rewards, policies), how to formulate a parallel robot control problem as an MDP, the key RL algorithms (PPO, SAC, TD3) used in robotics, and a complete worked example of training an RL agent to control a Stewart platform for trajectory tracking. You will also interact with a live simulation of the trained agent.
1. Why RL for Parallel Robots?
Parallel robots are notoriously difficult to control with classical methods. Their closed-loop kinematics create coupled, highly nonlinear dynamics. The inverse dynamics require solving complex equations in real time. And near singularities, model-based controllers can become unstable. RL offers an alternative.
| Aspect | Classical Control | RL-Based Control |
|---|---|---|
| Model requirement | Full dynamic model needed | Model-free (learns from interaction) |
| Singularity handling | Requires explicit avoidance logic | Learns to avoid through negative reward |
| Adaptability | Re-derive for each platform | Re-train (same algorithm, different env) |
| Friction & backlash | Hard to model accurately | Learns to compensate implicitly |
| Optimality | Depends on tuning (PID gains) | Optimizes for defined reward function |
| Computation at runtime | Dynamics solved each step | Simple neural network forward pass |
The Core Idea
An RL agent observes the robot’s state (joint positions, velocities, tracking error), takes an action (actuator forces/velocities), receives a reward (how well it tracked the target), and gradually learns a policy that maximizes cumulative reward.
2. RL Fundamentals
Before applying RL to robots, you need to understand the basic framework. RL is grounded in the theory of Markov Decision Processes (MDPs).
2.1 The Agent-Environment Loop
At each time step $t$, the agent observes the current state $s_t$, selects an action $a_t$ according to its policy $\pi(a|s)$, and the environment transitions to a new state $s_{t+1}$ and returns a reward $r_t$.
Figure 1. The RL agent-environment interaction loop applied to Stewart platform control. At each timestep, the agent selects actuator commands, and the environment returns the new platform state and a reward signal.
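The loop in Figure 1 can be sketched in a few lines. This is a minimal, self-contained sketch: `ToyEnv` and `random_policy` are illustrative stand-ins (a real Stewart-platform environment would run physics in `step`, and a trained policy would replace the random one).

```python
import numpy as np

class ToyEnv:
    """Minimal stand-in environment with a reset/step interface.

    A real Stewart-platform env would integrate the dynamics here; this
    toy version just drifts a 6-D pose and rewards small tracking error.
    """
    def reset(self):
        self.state = np.random.uniform(-1, 1, size=6)   # pose-error proxy
        return self.state

    def step(self, action):
        self.state = self.state + 0.1 * action          # simplistic dynamics
        reward = -float(np.sum(self.state ** 2))        # dense error penalty
        done = np.linalg.norm(self.state) > 10.0        # left the "workspace"
        return self.state, reward, done

def random_policy(state):
    """Placeholder for pi(a|s); a learned policy would go here."""
    return np.random.uniform(-1, 1, size=state.shape)

env = ToyEnv()
state = env.reset()
rewards = []
for t in range(100):                       # one episode of the loop in Fig. 1
    action = random_policy(state)          # a_t ~ pi(.|s_t)
    state, reward, done = env.step(action) # environment returns s_{t+1}, r_t
    rewards.append(reward)
    if done:
        break
```

The agent never sees the dynamics equations; it only observes the `(state, reward)` pairs the environment returns.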
2.2 Key Definitions
Markov Decision Process (MDP)
An MDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:
- $\mathcal{S}$: State space — what the agent observes (positions, velocities, errors)
- $\mathcal{A}$: Action space — what the agent can do (actuator forces or velocities)
- $P(s'|s,a)$: Transition dynamics — how the environment evolves (physics)
- $R(s,a,s')$: Reward function — scalar feedback signal
- $\gamma \in [0,1)$: Discount factor — how much the agent values future vs. immediate reward
2.3 The Objective
The agent’s goal is to find a policy $\pi^*$ that maximizes the expected discounted return:

$$J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right], \qquad \pi^* = \arg\max_{\pi} J(\pi)$$
In plain language: the agent learns to act in a way that accumulates as much reward as possible over time, with a slight preference for sooner rewards (controlled by $\gamma$).
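The discounting is easy to see numerically. A short sketch computing the return of an episode, using the standard backward recursion:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G = sum_t gamma^t * r_t for one episode (backward pass)."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# With gamma < 1, a reward now is worth more than the same reward later:
early = discounted_return([1.0, 0.0, 0.0])   # reward at t=0
late = discounted_return([0.0, 0.0, 1.0])    # same reward at t=2, discounted
```

Here `early` is 1.0 while `late` is $0.99^2 \approx 0.98$: the agent prefers sooner rewards, exactly as the text describes.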
2.4 Policy, Value Function, and Q-Function
Three central concepts in RL:
| Concept | Symbol | Meaning |
|---|---|---|
| Policy | $\pi(a|s)$ | Probability of taking action $a$ in state $s$ |
| Value function | $V^\pi(s)$ | Expected return starting from state $s$, following $\pi$ |
| Q-function | $Q^\pi(s,a)$ | Expected return after taking action $a$ in state $s$, then following $\pi$ |
Intuition
Think of $V(s)$ as the agent asking: “How good is my current situation?” And $Q(s,a)$ as: “How good is it to do this specific action right now?” The policy then picks the action with the highest Q-value (or samples from a distribution for exploration).
3. Formulating the Control Problem as an MDP
The most critical step in applying RL to robotics is designing the MDP. A poor formulation will prevent the agent from learning, regardless of the algorithm. Here we formulate trajectory tracking for a 6-DOF Stewart platform.
3.1 State Space $\mathcal{S}$
The state must contain all information the agent needs to make good decisions. For a Stewart platform tracking a trajectory:

$$s_t = \left[\mathbf{p}_t,\ \dot{\mathbf{p}}_t,\ \boldsymbol{\Phi}_t,\ \dot{\boldsymbol{\Phi}}_t,\ \mathbf{e}_t,\ \mathbf{p}_{\text{target},t}\right] \in \mathbb{R}^{24}$$

where:
- $\mathbf{p}_t \in \mathbb{R}^3$: current platform position $(x, y, z)$
- $\dot{\mathbf{p}}_t \in \mathbb{R}^3$: linear velocity
- $\boldsymbol{\Phi}_t \in \mathbb{R}^3$: current orientation (roll, pitch, yaw)
- $\dot{\boldsymbol{\Phi}}_t \in \mathbb{R}^3$: angular velocity
- $\mathbf{e}_t \in \mathbb{R}^6$: tracking error between the target pose and the current pose $[\mathbf{p}_t, \boldsymbol{\Phi}_t]$ (position + orientation)
- $\mathbf{p}_{\text{target},t} \in \mathbb{R}^6$: target pose
Design Tip
Always include the tracking error explicitly in the state. Although the agent could compute it from $\mathbf{p}_t$ and $\mathbf{p}_{\text{target}}$, providing it directly accelerates learning significantly.
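A sketch of assembling this state vector, with the explicit error the tip recommends. All function and argument names are illustrative, not from a specific library:

```python
import numpy as np

def build_observation(p, p_dot, phi, phi_dot, target_pose):
    """Assemble the 24-D state vector of Section 3.1.

    p, phi: current position and orientation (3-D each);
    p_dot, phi_dot: their velocities; target_pose: 6-D target.
    """
    pose = np.concatenate([p, phi])        # current 6-D pose
    e = target_pose - pose                 # explicit tracking error
    return np.concatenate([p, p_dot, phi, phi_dot, e, target_pose])

obs = build_observation(
    p=np.zeros(3), p_dot=np.zeros(3),
    phi=np.zeros(3), phi_dot=np.zeros(3),
    target_pose=np.ones(6),
)
```

The concatenation yields $3+3+3+3+6+6 = 24$ dimensions, matching the observation size used in the worked example of Section 5.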
3.2 Action Space $\mathcal{A}$
The action is the control command sent to the actuators. For the Stewart platform with 6 prismatic actuators, we typically use the leg velocity (or force) commands:

$$a_t = \left[\dot{\rho}_1, \dots, \dot{\rho}_6\right] \in [-1, 1]^6$$

Actions are normalized to $[-1, 1]$ and then scaled to the actual actuator velocity or force range. This normalization is essential for stable neural network training.
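The scaling step is a one-liner. A sketch, where `v_max` is an illustrative actuator velocity limit (not a value from the text):

```python
import numpy as np

def scale_action(a_norm, v_max=0.5):
    """Map a policy output in [-1, 1]^6 to actuator velocities in
    [-v_max, v_max] m/s. v_max is an illustrative limit."""
    a_norm = np.clip(a_norm, -1.0, 1.0)   # guard against out-of-range outputs
    return a_norm * v_max
```

The clip is a cheap safety net: some policy parameterizations (e.g. Gaussian without a squashing function) can emit values outside $[-1, 1]$.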
3.3 Reward Function $R(s, a)$
The reward function is the most important design choice. It defines what “good” means. A well-designed reward for trajectory tracking:

$$r_t = -\alpha \|\mathbf{e}_{\text{pos}}\|^2 - \beta \|\mathbf{e}_{\text{ori}}\|^2 - \lambda \|a\|^2 + r_{\text{bonus}}$$

where:
- $\alpha \|\mathbf{e}_{\text{pos}}\|^2$: penalizes position tracking error
- $\beta \|\mathbf{e}_{\text{ori}}\|^2$: penalizes orientation error
- $\lambda \|a\|^2$: penalizes large actuator commands (energy efficiency)
- $r_{\text{bonus}}$: bonus for reaching the target within tolerance
Reward Shaping Pitfall
If the reward is too sparse (e.g., +1 only at the exact target), the agent will almost never receive positive feedback and learning stalls. Use dense rewards (continuous penalties proportional to error) for robot control. Also: always add a small action penalty $\lambda\|a\|^2$ to prevent actuator chatter (rapid oscillations).
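Putting the reward terms together as code. The weights $\alpha = 10$ and $\lambda = 0.1$ follow the hyperparameter table in Section 5.2; $\beta$, the tolerance, and the bonus magnitude are illustrative choices:

```python
import numpy as np

def reward(e_pos, e_ori, action, alpha=10.0, beta=1.0, lam=0.1, tol=0.01):
    """Dense tracking reward of Section 3.3.

    alpha, lam match Section 5.2; beta, tol, and the +5 bonus are
    illustrative values.
    """
    r = -alpha * np.sum(e_pos ** 2) \
        - beta * np.sum(e_ori ** 2) \
        - lam * np.sum(action ** 2)       # action penalty against chatter
    if np.linalg.norm(e_pos) < tol:       # bonus for reaching the target
        r += 5.0
    return float(r)
```

Note the reward is dense: it is informative for every state, not just at the target, which is exactly what the pitfall above calls for.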
3.4 Episode Structure
Reset
Platform starts at a random pose near the home position. A random target trajectory is generated (e.g., circular, sinusoidal, or point-to-point).
Step loop (500–2000 steps at 100 Hz)
At each step: observe $s_t$, compute $a_t = \pi(s_t)$, apply to actuators, simulate physics, compute $r_t$, observe $s_{t+1}$.
Termination
Episode ends if: max steps reached, platform leaves workspace, or actuator limits are violated. Early termination with a large negative reward teaches the agent to stay safe.
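The termination rules can be collected into one check. A sketch, where the workspace radius and the actuator stroke limits `rho_min`/`rho_max` are illustrative numbers, not values from the text:

```python
import numpy as np

WORKSPACE_RADIUS = 0.3   # illustrative workspace bound, metres
MAX_STEPS = 1000

def check_termination(pose, actuator_lengths, step,
                      rho_min=0.5, rho_max=0.9):
    """Return (done, penalty) implementing the termination rules above.

    rho_min/rho_max are illustrative actuator stroke limits in metres.
    """
    if step >= MAX_STEPS:                              # max steps reached
        return True, 0.0
    if np.linalg.norm(pose[:3]) > WORKSPACE_RADIUS:    # left the workspace
        return True, -100.0                            # large negative reward
    if np.any(actuator_lengths < rho_min) or np.any(actuator_lengths > rho_max):
        return True, -100.0                            # actuator limit hit
    return False, 0.0
```

Only the unsafe terminations carry the large negative reward; running out the clock is not penalized, so the agent is not punished for surviving a full episode.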
4. Key Algorithms for Robot Control
Not all RL algorithms work well for continuous-action robot control. Here are the three most successful families for this task.
4.1 Proximal Policy Optimization (PPO)
PPO is an on-policy algorithm that updates the policy using a clipped surrogate objective. It is among the most widely used algorithms in sim-to-real robotics.

$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \text{clip}\left(r_t(\theta),\ 1-\epsilon,\ 1+\epsilon\right)\hat{A}_t\right)\right]$$

where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ is the probability ratio and $\hat{A}_t$ is the advantage estimate. The clipping prevents destructively large policy updates.
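The clipped objective is compact enough to write out directly. A NumPy sketch of the per-batch surrogate, with $\epsilon = 0.2$ as in the original PPO paper (gradients omitted; this only evaluates the objective):

```python
import numpy as np

def ppo_clip_objective(log_prob_new, log_prob_old, advantages, eps=0.2):
    """Evaluate the clipped surrogate L^CLIP over a batch."""
    ratio = np.exp(log_prob_new - log_prob_old)             # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return float(np.minimum(unclipped, clipped).mean())     # maximize this
```

When the ratio doubles but the advantage is positive, the clip caps the objective at $1.2 \hat{A}_t$, so the update gets no extra credit for moving the policy further than $\epsilon$ allows.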
4.2 Soft Actor-Critic (SAC)
SAC is an off-policy algorithm that maximizes both reward and entropy (exploration), which makes it excellent for continuous control:

$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[r(s_t, a_t) + \alpha_{\mathcal{H}}\, \mathcal{H}\left(\pi(\cdot|s_t)\right)\right]$$

The entropy term $\mathcal{H}$ (weighted by a temperature $\alpha_{\mathcal{H}}$) encourages the agent to explore diverse actions, which prevents premature convergence to suboptimal policies.
4.3 Twin Delayed DDPG (TD3)
TD3 is a refined version of DDPG that addresses overestimation bias using twin Q-networks and delayed policy updates.
| Algorithm | Type | Pros | Best For |
|---|---|---|---|
| PPO | On-policy | Stable, easy to tune, parallelizable | Sim-to-real, safety-critical |
| SAC | Off-policy | Sample efficient, robust exploration | Complex continuous control |
| TD3 | Off-policy | Deterministic policy, low variance | Precise tracking tasks |
Recommendation for Parallel Robots
Start with SAC for development (fast iteration due to sample efficiency), then switch to PPO for sim-to-real transfer (more stable, easier to constrain outputs). Use domain randomization in either case.
5. Worked Example: Stewart Platform Tracking
Let’s walk through a complete example. We train an RL agent (SAC) to make a Stewart platform track a circular trajectory in the $xz$-plane while maintaining constant height and orientation.
5.1 Environment Setup
Figure 2. The complete RL training pipeline for Stewart platform trajectory tracking: a 24-dimensional observation feeds an MLP [256, 256] policy, whose normalized output is scaled to the actuator range ($\Delta\rho \times F_{\max}$); physics is stepped at dt = 0.01 s with collision checks, and a +5 bonus is granted when $\|\mathbf{e}\| < 0.01$.
5.2 Hyperparameters
| Parameter | Value | Rationale |
|---|---|---|
| Algorithm | SAC | Off-policy, good for continuous 6D action |
| Policy network | MLP [256, 256] | Sufficient for smooth control policies |
| Learning rate | $3 \times 10^{-4}$ | Standard for SAC |
| Discount $\gamma$ | 0.99 | Long horizon (10s episodes) |
| Batch size | 256 | Stable gradient estimates |
| Replay buffer | $10^6$ | Store many transitions for off-policy |
| $\alpha$ (position penalty) | 10.0 | Primary objective |
| $\lambda$ (action penalty) | 0.1 | Smooth actuator commands |
| Episode length | 1000 steps (10s) | One full circle trajectory |
| Total training steps | $5 \times 10^5$ | ~500 episodes |
5.3 Pseudocode
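A structural sketch of the training loop described in Sections 5.1 and 5.2. The SAC networks and the physics are replaced by runnable placeholder stubs (marked in comments) so the skeleton is self-contained; a real implementation would slot the actor into `policy`, the critic/actor/temperature updates into `update_networks`, and the simulator into `env_step`:

```python
import random
from collections import deque

import numpy as np

buffer = deque(maxlen=1_000_000)   # replay buffer size from Section 5.2
BATCH = 256                        # batch size from Section 5.2

def policy(state):
    """Stand-in for the SAC actor: random 6-D action in [-1, 1]."""
    return np.random.uniform(-1, 1, size=6)

def update_networks(batch):
    """Stand-in for the SAC critic/actor/temperature gradient steps."""
    pass

def env_reset():
    """Stand-in reset: random 24-D observation near the home pose."""
    return np.random.uniform(-0.05, 0.05, size=24)

def env_step(state, action):
    """Stand-in physics step; a real env would integrate the dynamics."""
    next_state = state + 0.01 * np.random.randn(24)
    reward = -float(np.sum(next_state[12:18] ** 2))  # error slice of s_t
    return next_state, reward, False

for episode in range(2):           # shortened; the real run is ~500 episodes
    state = env_reset()
    for t in range(300):           # shortened; 1000 steps = 10 s at 100 Hz
        action = policy(state)
        next_state, reward, done = env_step(state, action)
        buffer.append((state, action, reward, next_state, done))
        if len(buffer) >= BATCH:   # off-policy: update from replayed data
            update_networks(random.sample(list(buffer), BATCH))
        state = next_state
        if done:
            break
```

Because SAC is off-policy, every transition is stored and reused many times, which is the source of the sample efficiency noted in Section 4.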
5.4 Training Progress
The chart below shows a simulated training curve for this problem. The three phases are typical of RL for robot control:
- Exploration (0–50k steps): Agent takes random actions, reward is very negative
- Rapid learning (50k–250k): Agent discovers the tracking strategy, reward climbs quickly
- Fine-tuning (250k–500k): Agent refines precision, small gains in tracking accuracy
6. Interactive: Watch the Agent Learn
Below is a live simulation of an RL agent controlling a Stewart platform. On the left, the 3D Stewart platform tracks a circular trajectory (gold ring). On the right, the training reward curve shows the agent’s progress.
Click Train to start training from scratch, Step to advance one episode, or use the Speed slider to control the simulation. Watch how the platform’s motion improves as training progresses!
What to Observe
- Early episodes: The platform moves erratically, overshooting and oscillating
- Mid training: The platform begins following the trajectory but with visible lag
- Late training: Smooth, accurate tracking with minimal overshoot
- The gold ring is the target trajectory; the red dot is the platform center
- Watch the reward curve on the right climb as the agent improves
7. Challenges & Practical Tips
7.1 Reward Engineering
The reward function can make or break your RL controller.
| Problem | Symptom | Solution |
|---|---|---|
| Sparse reward | No learning progress | Dense error-based penalty |
| No action penalty | Chattering / oscillation | Add $\lambda\|a\|^2$ or $\lambda\|\Delta a\|^2$ |
| Reward hacking | Agent finds loophole | Add constraints, clip rewards |
| Scale mismatch | Orientation ignored | Normalize and balance $\alpha, \beta$ |
7.2 Observation Normalization
Critical for Convergence
Always normalize observations to roughly $[-1, 1]$ or zero mean with unit variance. Neural networks are sensitive to input scales. Use a running mean/std normalizer (available in most RL libraries). For the Stewart platform: positions in meters and velocities in m/s have very different scales—normalization is essential.
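A minimal running normalizer of the kind the callout recommends, using Welford's online algorithm for numerically stable mean/variance updates (RL libraries ship equivalents; this sketch just shows the mechanics):

```python
import numpy as np

class RunningNormalizer:
    """Online mean/variance observation normalizer (Welford's algorithm)."""
    def __init__(self, dim):
        self.n = 0
        self.mean = np.zeros(dim)
        self.m2 = np.zeros(dim)     # sum of squared deviations

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def normalize(self, x, eps=1e-8):
        var = self.m2 / max(self.n - 1, 1)
        return (x - self.mean) / np.sqrt(var + eps)
```

In practice the normalizer statistics are updated during training and then frozen for evaluation and deployment, so the policy always sees inputs on the scale it was trained with.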
7.3 Action Smoothing
Raw RL policies can produce jerky commands. Two techniques to smooth actions:
- Action rate penalty: Add $\lambda_{\Delta a}\|a_t - a_{t-1}\|^2$ to the reward
- Low-pass filter: Apply exponential smoothing: $a_{\text{applied}} = \alpha \cdot a_{\text{policy}} + (1-\alpha) \cdot a_{\text{prev}}$
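The low-pass filter is one line; note that its smoothing coefficient (written $\alpha$ above) is unrelated to the reward weight $\alpha$ of Section 3.3. A sketch with an illustrative coefficient of 0.3:

```python
def smooth_action(a_policy, a_prev, alpha=0.3):
    """Exponential smoothing of Section 7.3.

    alpha=0.3 is an illustrative trade-off: smaller values give smoother
    but laggier commands; alpha=1.0 disables the filter.
    """
    return alpha * a_policy + (1.0 - alpha) * a_prev
```

The filter state `a_prev` must be reset at each episode boundary, otherwise the first command of a new episode is blended with the last command of the previous one.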
7.4 Curriculum Learning
Start with easy tasks and gradually increase difficulty:
Phase 1: Point stabilization
Agent learns to hold the platform at a fixed position.
Phase 2: Slow trajectory
Simple circular path at low speed (0.01 m/s).
Phase 3: Full speed + noise
Complex trajectories with disturbances and observation noise.
7.5 Domain Randomization
To make the policy robust for sim-to-real transfer, randomize during training:
- Mass & inertia: ±20% variation
- Friction: Random joint friction coefficients
- Sensor noise: Gaussian noise on position/velocity readings
- Actuator delay: Random 1–3 step delay in action execution
- External forces: Random perturbations during trajectory
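The randomization list above can be implemented as one sampling function called at each episode reset. The ±20% mass range and 1–3 step delay follow the list; `nominal_mass` and the remaining scales are illustrative:

```python
import numpy as np

def randomize_domain(rng, nominal_mass=2.0):
    """Sample one set of randomized physics parameters per episode.

    nominal_mass and the noise scales are illustrative values.
    """
    return {
        "mass": nominal_mass * rng.uniform(0.8, 1.2),   # +/- 20 % variation
        "friction": rng.uniform(0.0, 0.1),              # joint friction coeff.
        "obs_noise_std": 0.001,                         # sensor noise, metres
        "action_delay": int(rng.integers(1, 4)),        # 1-3 step delay
        "push_force": rng.normal(0.0, 1.0, size=3),     # external perturbation
    }
```

Resampling at every reset means the policy never trains against a single fixed simulator, which is what forces it to become robust to the real platform's unknown parameters.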
8. Beyond: Sim-to-Real Transfer
The ultimate goal is to deploy the learned policy on a real robot. The gap between simulation and reality (the sim-to-real gap) is the biggest challenge in RL-based robot control.
Figure 3. The sim-to-real pipeline. A policy trained in simulation with domain randomization is deployed on the real Stewart platform, with optional online fine-tuning.
Key Strategies for Sim-to-Real
- Domain randomization: Vary physics parameters during training so the policy is robust to real-world uncertainty
- System identification: Measure real robot parameters and calibrate the simulator
- Action filtering: Apply low-pass filters to prevent high-frequency commands that could damage hardware
- Safety layers: Add workspace limits and force constraints that override the RL policy when necessary
The Future
RL for parallel robot control is an active research area. Current trends include model-based RL (learning a dynamics model for faster training), multi-task learning (one policy for many trajectories), and safe RL (guaranteeing constraint satisfaction during exploration).
Summary
Key Takeaways
- RL learns control policies from interaction, without requiring explicit dynamic models
- The MDP formulation (state, action, reward) is the most critical design step
- Use dense rewards with error penalty + action penalty for robot control
- SAC and PPO are the go-to algorithms for continuous robot control
- Include tracking error in the state and normalize all observations
- Domain randomization is essential for sim-to-real transfer
- Curriculum learning accelerates training: start easy, increase difficulty
- Always add safety constraints as a layer on top of the RL policy
Quick Check
Q1. In the MDP formulation for a Stewart platform, the action space typically consists of:
Q2. Why is a dense reward function preferred over a sparse one for robot control?
Q3. The action penalty term $\lambda\|a\|^2$ in the reward function serves to:
Q4. Domain randomization helps with: