AI × Robotics

Reinforcement Learning for Robot Control

Parallel Robots · 45 min read · By Grover Aruquipa

Classical control of parallel robots relies on precise kinematic and dynamic models. But what if the robot could learn to control itself? Reinforcement Learning (RL) offers a fundamentally different paradigm: instead of deriving equations, an agent learns optimal actions through trial and error, guided by rewards. This tutorial is your starting point for RL-based control, with a focus on parallel manipulators.

What You Will Learn

The core concepts of Reinforcement Learning (states, actions, rewards, policies), how to formulate a parallel robot control problem as an MDP, the key RL algorithms (PPO, SAC, TD3) used in robotics, and a complete worked example of training an RL agent to control a Stewart platform for trajectory tracking. You will also interact with a live simulation of the trained agent.

1. Why RL for Parallel Robots?

Parallel robots are notoriously difficult to control with classical methods. Their closed-loop kinematics create coupled, highly nonlinear dynamics. The inverse dynamics require solving complex equations in real time. And near singularities, model-based controllers can become unstable. RL offers an alternative.

| Aspect | Classical Control | RL-Based Control |
|---|---|---|
| Model requirement | Full dynamic model needed | Model-free (learns from interaction) |
| Singularity handling | Requires explicit avoidance logic | Learns to avoid through negative reward |
| Adaptability | Re-derive for each platform | Re-train (same algorithm, different env) |
| Friction & backlash | Hard to model accurately | Learns to compensate implicitly |
| Optimality | Depends on tuning (PID gains) | Optimizes for defined reward function |
| Computation at runtime | Dynamics solved each step | Simple neural network forward pass |

The Core Idea

An RL agent observes the robot’s state (joint positions, velocities, tracking error), takes an action (actuator forces/velocities), receives a reward (how well it tracked the target), and gradually learns a policy that maximizes cumulative reward.
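
The loop above can be sketched in a few lines of Python. Everything here is a stand-in (random policy, toy drift dynamics, made-up reward) so that only the control flow matters, not the physics:

```python
import numpy as np

rng = np.random.default_rng(0)

def policy(state):
    # Stand-in policy: random commands for 6 actuators in [-1, 1].
    return rng.uniform(-1.0, 1.0, size=6)

def env_step(state, action):
    # Stand-in dynamics: the state drifts with the action; the reward
    # penalizes squared "error" (first 6 entries) and control effort.
    next_state = state + 0.01 * np.concatenate([action, action])
    error = next_state[:6]
    reward = -np.sum(error**2) - 0.1 * np.sum(action**2)
    return next_state, reward

state = np.zeros(12)
total_reward = 0.0
for t in range(100):
    action = policy(state)                   # observe s_t, pick a_t
    state, reward = env_step(state, action)  # environment returns s_{t+1}, r_t
    total_reward += reward                   # objective: cumulative reward
```

Learning replaces the random `policy` with one that improves `total_reward` over many such episodes.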

2. RL Fundamentals

Before applying RL to robots, you need to understand the basic framework. RL is grounded in the theory of Markov Decision Processes (MDPs).

2.1 The Agent-Environment Loop

At each time step $t$, the agent observes the current state $s_t$, selects an action $a_t$ according to its policy $\pi(a|s)$, and the environment transitions to a new state $s_{t+1}$ and returns a reward $r_t$.

[Diagram: the agent (policy π(a|s), a neural network) sends actions to the environment (a Stewart platform physics simulation), which returns the next state and a reward r = -‖error‖² - λ‖a‖².]

Figure 1. The RL agent-environment interaction loop applied to Stewart platform control. At each timestep, the agent selects actuator commands, and the environment returns the new platform state and a reward signal.

2.2 Key Definitions

Markov Decision Process (MDP)

An MDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:

- $\mathcal{S}$: the set of possible states
- $\mathcal{A}$: the set of available actions
- $P(s' \mid s, a)$: the transition probability of reaching state $s'$ after taking action $a$ in state $s$
- $R(s, a)$: the reward received for taking action $a$ in state $s$
- $\gamma \in [0, 1)$: the discount factor weighting future rewards against immediate ones

2.3 The Objective

The agent’s goal is to find a policy $\pi^*$ that maximizes the expected discounted return:

$$\pi^* = \arg\max_\pi \; \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t \, r_t \right]$$

In plain language: the agent learns to act in a way that accumulates as much reward as possible over time, with a slight preference for sooner rewards (controlled by $\gamma$).
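
A quick numerical check of this objective: with a constant reward of 1 per step, the infinite-horizon return is $1/(1-\gamma)$, and a long finite rollout approaches that bound.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Compute sum over t of gamma^t * r_t for a finite reward sequence."""
    return sum(g * r for g, r in zip(gamma ** np.arange(len(rewards)), rewards))

# Constant reward of 1 per step: the return approaches 1 / (1 - 0.99) = 100
ret = discounted_return(np.ones(1000), gamma=0.99)
```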

2.4 Policy, Value Function, and Q-Function

Three central concepts in RL:

| Concept | Symbol | Meaning |
|---|---|---|
| Policy | $\pi(a \mid s)$ | Probability of taking action $a$ in state $s$ |
| Value function | $V^\pi(s)$ | Expected return starting from state $s$, following $\pi$ |
| Q-function | $Q^\pi(s,a)$ | Expected return after taking action $a$ in state $s$, then following $\pi$ |
$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t r_t \,\middle|\, s_0 = s\right], \qquad Q^\pi(s,a) = \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a\right]$$

Intuition

Think of $V(s)$ as the agent asking: “How good is my current situation?” And $Q(s,a)$ as: “How good is it to do this specific action right now?” The policy then picks the action with the highest Q-value (or samples from a distribution for exploration).

3. Formulating the Control Problem as an MDP

The most critical step in applying RL to robotics is designing the MDP. A poor formulation will prevent the agent from learning, regardless of the algorithm. Here we formulate trajectory tracking for a 6-DOF Stewart platform.

3.1 State Space $\mathcal{S}$

The state must contain all information the agent needs to make good decisions. For a Stewart platform tracking a trajectory:

$$s_t = \begin{bmatrix} \mathbf{p}_t \\ \dot{\mathbf{p}}_t \\ \boldsymbol{\Phi}_t \\ \dot{\boldsymbol{\Phi}}_t \\ \mathbf{e}_t \\ \mathbf{p}_{\text{target},t} \end{bmatrix} \in \mathbb{R}^{24}$$

where:

- $\mathbf{p}_t \in \mathbb{R}^3$: platform position
- $\dot{\mathbf{p}}_t \in \mathbb{R}^3$: platform linear velocity
- $\boldsymbol{\Phi}_t \in \mathbb{R}^3$: platform orientation (Euler angles)
- $\dot{\boldsymbol{\Phi}}_t \in \mathbb{R}^3$: platform angular velocity
- $\mathbf{e}_t \in \mathbb{R}^6$: pose tracking error (position and orientation)
- $\mathbf{p}_{\text{target},t} \in \mathbb{R}^6$: target pose on the reference trajectory

Design Tip

Always include the tracking error explicitly in the state. Although the agent could compute it from $\mathbf{p}_t$ and $\mathbf{p}_{\text{target}}$, providing it directly accelerates learning significantly.
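
As a concrete sketch, the 24-dimensional state above can be assembled like this (the helper name and argument layout are illustrative), with the tracking error computed and appended explicitly as the tip recommends:

```python
import numpy as np

def build_state(p, p_dot, phi, phi_dot, p_target):
    """Assemble the 24-dim state from Section 3.1.

    p, p_dot     : platform position / linear velocity, shape (3,)
    phi, phi_dot : orientation (Euler angles) / angular velocity, shape (3,)
    p_target     : target pose [position; orientation], shape (6,)
    """
    pose = np.concatenate([p, phi])  # current 6-dim pose
    e = p_target - pose              # explicit tracking error (design tip)
    return np.concatenate([p, p_dot, phi, phi_dot, e, p_target])

# Home pose with a unit-offset target: the error equals the target pose
s = build_state(np.zeros(3), np.zeros(3), np.zeros(3), np.zeros(3), np.ones(6))
```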

3.2 Action Space $\mathcal{A}$

The action is the control command sent to the actuators. For the Stewart platform with 6 prismatic actuators, we typically use:

$$a_t = [\Delta\rho_1, \Delta\rho_2, \dots, \Delta\rho_6]^T \in [-1, 1]^6$$

Actions are normalized to $[-1, 1]$ and then scaled to the actual actuator velocity or force range. This normalization is essential for stable neural network training.
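
A minimal sketch of the normalization-to-hardware mapping (the velocity limits shown are illustrative, not specifications of any particular platform):

```python
import numpy as np

def scale_action(a_norm, low, high):
    """Map a normalized action in [-1, 1]^6 to the actuator range [low, high]."""
    a_norm = np.clip(a_norm, -1.0, 1.0)  # guard against out-of-range outputs
    return low + 0.5 * (a_norm + 1.0) * (high - low)

# Example: prismatic actuator velocity limits of +/- 0.05 m/s
low, high = -0.05 * np.ones(6), 0.05 * np.ones(6)
v = scale_action(np.zeros(6), low, high)  # zero action maps to zero velocity
```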

3.3 Reward Function $R(s, a)$

The reward function is the most important design choice. It defines what “good” means. A well-designed reward for trajectory tracking:

$$r_t = -\alpha \|\mathbf{e}_{\text{pos},t}\|^2 - \beta \|\mathbf{e}_{\text{ori},t}\|^2 - \lambda \|a_t\|^2 + r_{\text{bonus}}$$

where:

- $\mathbf{e}_{\text{pos},t}$ and $\mathbf{e}_{\text{ori},t}$: position and orientation tracking errors
- $\alpha, \beta > 0$: weights balancing position versus orientation accuracy
- $\lambda > 0$: action penalty weight discouraging large, jerky commands
- $r_{\text{bonus}}$: a small positive bonus when the error drops below a tolerance (the worked example uses $+5$ when $\|\mathbf{e}\| < 0.01$)

Reward Shaping Pitfall

If the reward is too sparse (e.g., +1 only at the exact target), the agent will almost never receive positive feedback and learning stalls. Use dense rewards (continuous penalties proportional to error) for robot control. Also: always add a small action penalty $\lambda\|a\|^2$ to prevent actuator chatter (rapid oscillations).
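
Putting the pieces together, the reward might be implemented as follows; $\alpha = 10$ and $\lambda = 0.1$ match the hyperparameter table in Section 5.2, the $+5$ bonus below $\|\mathbf{e}\| < 0.01$ matches the worked example, and $\beta$ is an illustrative choice:

```python
import numpy as np

def reward(e_pos, e_ori, action, alpha=10.0, beta=1.0, lam=0.1,
           tol=0.01, bonus=5.0):
    """Dense tracking reward from Section 3.3 (beta is illustrative)."""
    r = (-alpha * np.sum(e_pos**2)       # position error penalty (dense)
         - beta * np.sum(e_ori**2)       # orientation error penalty (dense)
         - lam * np.sum(action**2))      # action penalty against chatter
    if np.linalg.norm(np.concatenate([e_pos, e_ori])) < tol:
        r += bonus                       # sparse bonus for precise tracking
    return r

r = reward(np.zeros(3), np.zeros(3), np.zeros(6))  # perfect tracking, zero effort
```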

3.4 Episode Structure

Reset

Platform starts at a random pose near the home position. A random target trajectory is generated (e.g., circular, sinusoidal, or point-to-point).

Step loop (500–2000 steps at 100 Hz)

At each step: observe $s_t$, compute $a_t = \pi(s_t)$, apply to actuators, simulate physics, compute $r_t$, observe $s_{t+1}$.

Termination

Episode ends if: max steps reached, platform leaves workspace, or actuator limits are violated. Early termination with a large negative reward teaches the agent to stay safe.
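
The reset/step/termination structure can be sketched as a minimal environment class. The dynamics and limits here are placeholders; a real environment would integrate the platform's forward dynamics:

```python
import numpy as np

class StewartTrackingEnv:
    """Minimal episode skeleton (reset / step / termination)."""

    def __init__(self, max_steps=1000, workspace_radius=0.2):
        self.max_steps = max_steps
        self.workspace_radius = workspace_radius
        self.rng = np.random.default_rng(0)

    def reset(self):
        self.t = 0
        # Random pose near the home position, as described above
        self.pose = 0.01 * self.rng.standard_normal(6)
        return self.pose.copy()

    def step(self, action):
        self.t += 1
        self.pose += 0.001 * np.clip(action, -1, 1)  # placeholder dynamics
        out = np.linalg.norm(self.pose[:3]) > self.workspace_radius
        terminated = out                    # safety termination
        truncated = self.t >= self.max_steps  # time limit reached
        reward = -np.sum(self.pose**2) - 100.0 * out  # big penalty on exit
        return self.pose.copy(), reward, terminated, truncated

env = StewartTrackingEnv(max_steps=50)
obs = env.reset()
for _ in range(60):
    obs, r, term, trunc = env.step(np.zeros(6))
    if term or trunc:
        break
```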

4. Key Algorithms for Robot Control

Not all RL algorithms work well for continuous-action robot control. Here are the three most successful families for this task.

4.1 Proximal Policy Optimization (PPO)

PPO is an on-policy algorithm that updates the policy using a clipped surrogate objective. It is the most widely used algorithm in sim-to-real robotics.

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t, \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]$$

where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ is the probability ratio and $\hat{A}_t$ is the advantage estimate. The clipping prevents destructively large policy updates.
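
The clipped surrogate is simple to compute directly. A NumPy sketch (negated so it reads as a loss to minimize; this is the objective only, not PPO's full training loop):

```python
import numpy as np

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    """Clipped surrogate objective, negated into a loss."""
    ratio = np.exp(log_probs_new - log_probs_old)        # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))      # pessimistic bound

# With identical old/new policies the ratio is 1 and clipping is inactive:
loss = ppo_clip_loss(np.zeros(4), np.zeros(4), np.array([1.0, -1.0, 2.0, 0.0]))
```

With a ratio of 2 and a positive advantage, the clip caps the objective at $1+\epsilon$ times the advantage, which is exactly what prevents destructively large updates.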

4.2 Soft Actor-Critic (SAC)

SAC is an off-policy algorithm that maximizes both reward and entropy (exploration). Excellent for continuous control:

$$\pi^* = \arg\max_\pi \; \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t \left(r_t + \alpha \mathcal{H}(\pi(\cdot|s_t))\right)\right]$$

The entropy term $\mathcal{H}$ encourages the agent to explore diverse actions, which prevents premature convergence to suboptimal policies.
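
To see what the entropy term rewards, consider the differential entropy of a diagonal Gaussian policy: wider action distributions score higher, so the $\alpha \mathcal{H}$ bonus pays the agent for keeping its action distribution broad. A minimal numerical illustration (not SAC's actual update):

```python
import numpy as np

def gaussian_entropy(log_std):
    """Entropy of a diagonal Gaussian: sum over dims of 0.5*log(2*pi*e*sigma^2)."""
    return np.sum(0.5 * np.log(2 * np.pi * np.e) + log_std)

# A wide policy (sigma = 1.0) has higher entropy than a narrow one (sigma = 0.1),
# so the entropy bonus resists premature collapse to a deterministic policy.
h_narrow = gaussian_entropy(np.log(0.1) * np.ones(6))
h_wide = gaussian_entropy(np.log(1.0) * np.ones(6))
```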

4.3 Twin Delayed DDPG (TD3)

TD3 is a refined version of DDPG that addresses overestimation bias using twin Q-networks and delayed policy updates.

| Algorithm | Type | Pros | Best For |
|---|---|---|---|
| PPO | On-policy | Stable, easy to tune, parallelizable | Sim-to-real, safety-critical |
| SAC | Off-policy | Sample efficient, robust exploration | Complex continuous control |
| TD3 | Off-policy | Deterministic policy, low variance | Precise tracking tasks |

Recommendation for Parallel Robots

Start with SAC for development (fast iteration due to sample efficiency), then switch to PPO for sim-to-real transfer (more stable, easier to constrain outputs). Use domain randomization in either case.

5. Worked Example: Stewart Platform Tracking

Let’s walk through a complete example. We train an RL agent (SAC) to make a Stewart platform track a circular trajectory in the $xy$-plane while maintaining constant height and orientation.

5.1 Environment Setup

[Diagram: training pipeline. The SAC agent (MLP [256, 256]) outputs normalized actions; action scaling maps [-1, 1]⁶ to actuator forces (Δρ × F_max); the Stewart 6-UPS simulator (dt = 0.01 s) runs forward dynamics with collision checking and returns the 24-dimensional state (p, v, Φ, ω, e) and the reward -‖e_pos‖² - 0.1‖a‖², plus +5 if ‖e‖ < 0.01.]

Figure 2. The complete RL training pipeline for Stewart platform trajectory tracking.

5.2 Hyperparameters

| Parameter | Value | Rationale |
|---|---|---|
| Algorithm | SAC | Off-policy, good for continuous 6D action |
| Policy network | MLP [256, 256] | Sufficient for smooth control policies |
| Learning rate | $3 \times 10^{-4}$ | Standard for SAC |
| Discount $\gamma$ | 0.99 | Long horizon (10 s episodes) |
| Batch size | 256 | Stable gradient estimates |
| Replay buffer | $10^6$ | Store many transitions for off-policy updates |
| $\alpha$ (position penalty) | 10.0 | Primary objective |
| $\lambda$ (action penalty) | 0.1 | Smooth actuator commands |
| Episode length | 1000 steps (10 s) | One full circle trajectory |
| Total training steps | $5 \times 10^5$ | ~500 episodes |

5.3 Pseudocode

# SAC Training Loop for Stewart Platform Control
import gymnasium as gym
from stable_baselines3 import SAC

# 1. Create the Stewart Platform environment
env = gym.make(
    "StewartPlatform-v1",
    base_radius=0.70,
    platform_radius=0.35,
    trajectory="circular",
    max_steps=1000,
)

# 2. Initialize SAC agent
model = SAC(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    gamma=0.99,
    batch_size=256,
    buffer_size=1_000_000,
    policy_kwargs=dict(net_arch=[256, 256]),
    verbose=1,
)

# 3. Train
model.learn(total_timesteps=500_000)

# 4. Evaluate
obs, _ = env.reset()
for step in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break
print(f"Mean tracking error: {info['mean_error']:.4f} m")

5.4 Training Progress

The chart below shows a simulated training curve for this problem. The three phases are typical of RL for robot control:

  1. Exploration (0–50k steps): Agent takes random actions, reward is very negative
  2. Rapid learning (50k–250k): Agent discovers the tracking strategy, reward climbs quickly
  3. Fine-tuning (250k–500k): Agent refines precision, small gains in tracking accuracy

6. Interactive: Watch the Agent Learn

Below is a live simulation of an RL agent controlling a Stewart platform. On the left, the 3D Stewart platform tracks a circular trajectory (gold ring). On the right, the training reward curve shows the agent’s progress.

Click Train to start training from scratch, Step to advance one episode, or use the Speed slider to control the simulation. Watch how the platform’s motion improves as training progresses!

[Interactive widget: "RL Agent: Stewart Platform Tracking" shows the platform following the circular trajectory alongside the training-progress reward curve, with episode, tracking-error, reward, and step counters and a speed control.]

What to Observe

Early in training the platform moves erratically and the tracking error stays large. As the reward curve climbs, the motion begins to follow the gold ring, and by late training the platform tracks smoothly with a small, steady error. Most of the improvement happens during the rapid-learning phase described in Section 5.4.

7. Challenges & Practical Tips

7.1 Reward Engineering

The reward function can make or break your RL controller.

| Problem | Symptom | Solution |
|---|---|---|
| Sparse reward | No learning progress | Dense error-based penalty |
| No action penalty | Chattering / oscillation | Add $\lambda \lVert a \rVert^2$ or $\lambda \lVert \Delta a \rVert^2$ |
| Reward hacking | Agent finds a loophole | Add constraints, clip rewards |
| Scale mismatch | Orientation ignored | Normalize and balance $\alpha, \beta$ |

7.2 Observation Normalization

Critical for Convergence

Always normalize observations to roughly $[-1, 1]$ or zero mean with unit variance. Neural networks are sensitive to input scales. Use a running mean/std normalizer (available in most RL libraries). For the Stewart platform: positions in meters and velocities in m/s have very different scales—normalization is essential.
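
A running normalizer can be sketched in a few lines (a Welford-style streaming update; most RL libraries ship an equivalent as an environment wrapper):

```python
import numpy as np

class RunningNormalizer:
    """Online per-dimension mean/variance tracker for observation scaling."""

    def __init__(self, dim, eps=1e-8):
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)
        self.count = 0
        self.eps = eps

    def update(self, x):
        # Welford-style streaming update of mean and variance
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.var += (delta * (x - self.mean) - self.var) / self.count

    def normalize(self, x):
        return (x - self.mean) / np.sqrt(self.var + self.eps)

# Observations on very different scales (e.g. meters vs m/s) end up ~N(0, 1)
norm = RunningNormalizer(2)
rng = np.random.default_rng(0)
for _ in range(5000):
    norm.update(rng.normal(loc=[5.0, -3.0], scale=[2.0, 0.5]))
z = norm.normalize(np.array([5.0, -3.0]))  # near the running mean, so near zero
```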

7.3 Action Smoothing

Raw RL policies can produce jerky commands. Two techniques to smooth actions:

- Low-pass filtering: pass each command through a first-order filter so the applied action changes gradually between steps.
- Action-rate penalty: penalize $\|\Delta a_t\|^2 = \|a_t - a_{t-1}\|^2$ in the reward, discouraging step-to-step changes rather than magnitudes.

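One common choice, a first-order low-pass filter on the commands, can be sketched as follows (the smoothing constant $\tau$ is illustrative):

```python
import numpy as np

class ActionFilter:
    """First-order low-pass filter: a_smooth = (1 - tau) * prev + tau * a_raw."""

    def __init__(self, dim, tau=0.2):
        self.prev = np.zeros(dim)
        self.tau = tau

    def __call__(self, a_raw):
        self.prev = (1.0 - self.tau) * self.prev + self.tau * np.asarray(a_raw)
        return self.prev

# A step command is applied gradually rather than all at once
f = ActionFilter(6, tau=0.2)
step1 = f(np.ones(6))  # first filtered output after a unit step
step2 = f(np.ones(6))  # moves further toward the raw command
```

Smaller $\tau$ gives smoother but laggier commands; the filter state should be reset at each episode start.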
7.4 Curriculum Learning

Start with easy tasks and gradually increase difficulty:

Phase 1: Point stabilization

Agent learns to hold the platform at a fixed position.

Phase 2: Slow trajectory

Simple circular path at low speed (0.01 m/s).

Phase 3: Full speed + noise

Complex trajectories with disturbances and observation noise.

7.5 Domain Randomization

To make the policy robust for sim-to-real transfer, randomize during training:

- Dynamic parameters: platform and payload mass, inertia, joint friction
- Actuator properties: force limits, gains, and response delays
- Sensor noise: additive noise on measured positions and velocities
- Initial conditions and target trajectories

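In code, domain randomization usually means resampling physical parameters at every episode reset. A hypothetical sketch (all names and ranges are illustrative, not tuned values):

```python
import numpy as np

def sample_randomized_params(rng, nominal_mass=12.0, nominal_friction=0.3):
    """Sample one episode's physics parameters (hypothetical ranges)."""
    return {
        "platform_mass": nominal_mass * rng.uniform(0.8, 1.2),       # +/- 20%
        "joint_friction": nominal_friction * rng.uniform(0.5, 1.5),  # +/- 50%
        "actuator_gain": rng.uniform(0.9, 1.1),                      # +/- 10%
        "obs_noise_std": rng.uniform(0.0, 0.005),                    # up to 5 mm
        "action_delay_steps": rng.integers(0, 3),                    # 0-2 steps
    }

rng = np.random.default_rng(42)
params = sample_randomized_params(rng)  # apply these to the simulator at reset
```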
8. Beyond: Sim-to-Real Transfer

The ultimate goal is to deploy the learned policy on a real robot. The gap between simulation and reality (the sim-to-real gap) is the biggest challenge in RL-based robot control.

[Diagram: SIMULATION (train policy, domain randomization, 500k+ steps) → TRANSFER (policy deployment, system identification) → REAL ROBOT (execute policy, fine-tune online, safety constraints), with optional real-world fine-tuning data fed back into simulation.]

Figure 3. The sim-to-real pipeline. A policy trained in simulation with domain randomization is deployed on the real Stewart platform, with optional online fine-tuning.

Key Strategies for Sim-to-Real

- Domain randomization: train across a distribution of physical parameters so the real robot behaves like just another sample from training.
- System identification: measure the real platform's parameters and center the simulation's distributions on them.
- Safety constraints: limit action magnitudes and enforce workspace bounds when first executing the policy on hardware.
- Online fine-tuning: optionally continue training with real-world data after deployment.

The Future

RL for parallel robot control is an active research area. Current trends include model-based RL (learning a dynamics model for faster training), multi-task learning (one policy for many trajectories), and safe RL (guaranteeing constraint satisfaction during exploration).

Summary

Key Takeaways

- RL replaces explicit dynamic models with a policy learned from interaction, which is attractive for parallel robots whose closed-loop kinematics make model-based control difficult.
- The MDP formulation (informative state, normalized actions, dense reward with an action penalty) matters more than the choice of algorithm.
- PPO, SAC, and TD3 are the workhorse algorithms for continuous-action robot control: SAC for fast development iteration, PPO for stable sim-to-real transfer.
- Observation normalization, action smoothing, curriculum learning, and domain randomization are the practical techniques that make training converge and transfer.

Quick Check

Q1. In the MDP formulation for a Stewart platform, the action space typically consists of:

- The target position of the platform
- Normalized actuator force/velocity commands for 6 legs
- Joint angles of the base

Q2. Why is a dense reward function preferred over a sparse one for robot control?

- Dense rewards are computationally cheaper
- Dense rewards provide continuous feedback, enabling faster learning
- Sparse rewards cause the robot to move too fast

Q3. The action penalty term $\lambda\|a\|^2$ in the reward function serves to:

- Increase the exploration rate
- Make the robot move faster
- Prevent actuator chattering and encourage smooth, energy-efficient commands

Q4. Domain randomization helps with:

- Bridging the sim-to-real gap by making the policy robust to parameter variations
- Reducing the training time
- Eliminating the need for a simulator