In Episode 9 we learned the rule "never put heavy processing inside an ISR." But what do you do when you absolutely must transfer a large amount of data?

The answer is DMA (Direct Memory Access).

DMA is hardware that moves data without involving the CPU. The transfer completes while the CPU is free to do other things.


📖 Previous Article

Episode 9: Interrupt Design Anti-Patterns — "Deliberately Break It" to Learn ISR Design

📍 Series Top Page

Full 13-Part Series: The Embedded World Beyond Pointers


✅ What you'll be able to do after this article

  • Explain why DMA exists from a bus-architecture perspective
  • Read STM32's DMA stream/channel mapping table and configure it
  • DMA-ify a UART TX and send data with the CPU free
  • Implement a transfer-complete callback and safely start the next transfer
  • Explain why a buffer-reuse timing mistake corrupts data

Table of Contents

  1. What Is DMA?
  2. Why Do We Need DMA?
  3. STM32’s DMA Architecture
  4. DMA-ifying UART TX (Practice)
  5. Throughput and Latency
  6. Bus Contention
  7. Common Pitfalls
  8. Summary

🔌 What Is DMA?

DMA Is a “Dedicated Data-Transfer Hardware Unit, Separate from the CPU”

DMA (Direct Memory Access) is a hardware module that exists independently from the CPU. It is not a memory feature, nor a CPU feature — it is an independent data-transfer engine connected to the bus.

Let’s first see what happens without DMA.

Data Transfer Without DMA

Consider the operation “send a string stored in RAM over UART.”

【Normal Transfer: The CPU Does Everything】

RAM (string buffer)
  │
  │ LDR instruction (CPU reads 1 byte from RAM)
  ▼
CPU (holds data in a register)
  │
  │ STR instruction (CPU writes 1 byte to UART DR register)
  ▼
UART DR register → physically transmitted

Sending 1 byte requires one “LDR → STR” CPU instruction pair. To send 100 bytes, the CPU repeats this 100 times. During that time the CPU can do nothing else.
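This byte-by-byte loop can be modeled on a desktop machine. In the sketch below, `fake_dr`, `sent`, and `cpu_copy_to_uart` are illustrative names (not real registers or HAL calls): a plain variable stands in for the one-byte data register that every STR targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint8_t fake_dr;     /* stands in for USART2->DR: one byte,
                               same address for every STR           */
static uint8_t sent[100];   /* what the "wire" saw                  */
static size_t  sent_len;

static void fake_uart_write(uint8_t b)
{
    fake_dr = b;                /* the STR to the data register */
    sent[sent_len++] = fake_dr; /* hardware shifts the byte out */
}

/* CPU-driven transfer: one load and one store per byte; the CPU
   is stuck in this loop for the whole transmission */
static void cpu_copy_to_uart(const uint8_t *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        fake_uart_write(buf[i]);
}
```

The point is not the copy itself but who performs it: here it is the CPU, one iteration per byte, with no time left over for anything else.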

With DMA

【DMA Transfer: CPU Just Places the Order】

CPU: "DMA, take 100 bytes from this address and send them to UART DR"
  │
  │ (CPU immediately moves to the next task)
  ▼
DMA controller (runs independently)
  │
  ├─ RAM → DMA → UART DR (byte 1)
  ├─ RAM → DMA → UART DR (byte 2)
  ├─     …… (CPU not involved)
  └─ RAM → DMA → UART DR (byte 100) Done!
           │
           └─ Transfer-complete interrupt notifies CPU

The CPU only “places the order,” and is then free to run.


Do All Microcontrollers Have DMA?

No. The presence or absence of DMA varies significantly by microcontroller.

| Microcontroller | DMA |
| --- | --- |
| Arduino UNO (AVR ATmega328P) | ❌ None |
| Arduino Mega (AVR ATmega2560) | ❌ None |
| STM32F401RE | ✅ DMA1 + DMA2 (8 streams each) |
| ESP32 | ✅ Yes (select peripherals) |
| Renesas RA / RX | ✅ Yes |
| PIC32 | ✅ Yes |

One reason transmission on the Arduino UNO ends up blocking is that it has no DMA. 32-bit microcontrollers like the STM32 ship with DMA as standard.


What Is a “Peripheral”?

The word “peripheral” that appears in DMA explanations is a collective term for all hardware modules other than the CPU and memory.

| Peripheral | Role |
| --- | --- |
| UART / USART | Serial communication (talking to a PC) |
| SPI | High-speed communication with sensors, SD cards |
| I2C | Communication with sensors and displays |
| ADC | Converts analog voltage to digital value |
| DAC | Converts digital value to analog voltage |
| TIM | Timers and PWM generation |

All of these are hardware operated by reading and writing to specific registers. DMA automatically shuttles data between a buffer in RAM and these peripheral registers.


What “Without CPU Involvement” Really Means

Let’s look at what it means for the CPU to be “involved,” at the hardware level.

Inside a microcontroller, the CPU, DMA, RAM, and peripherals are all connected by the AHB bus (Advanced High-performance Bus) — a shared highway.

```mermaid
graph LR
    subgraph masters["Bus Masters (either can move data on the bus)"]
        CPU["CPU<br>Cortex-M4"]
        DMA["DMA1"]
    end
    BUS(["AHB Bus Matrix"])
    SRAM["SRAM"]
    UART["UART2<br>Peripheral"]
    CPU --> BUS
    DMA --> BUS
    BUS --> SRAM
    BUS --> UART
```

Both the CPU and DMA are bus masters connected to the same bus. Either can access RAM or peripherals.

“Without CPU involvement” means “instead of the CPU occupying the bus to repeat LDR/STR, the DMA uses the bus to move data.” The CPU doesn’t need to use the bus during that time — it can focus on computation and control logic.

💡 DMA and the CPU Are 'Colleagues'

DMA is not a “subordinate” of the CPU — it’s an independent hardware module that shares the same bus. Rather than “DMA works while the CPU sleeps,” the more accurate picture is: “the CPU and DMA are doing different jobs in parallel.”

  • CPU → control logic, computation, decision-making
  • DMA → simple data transport

This division of labor dramatically improves the efficiency of an embedded system.

DMA’s Three Elements

| Element | What it is | Example (UART TX) |
| --- | --- | --- |
| Source | Where to read from | RAM buffer |
| Destination | Where to write to | USART2->DR register |
| Count | How many to transfer | Number of bytes to send |

Each also has an address-increment setting:

| | Source | Destination |
| --- | --- | --- |
| UART TX | ✅ Increment (advance to next byte) | ❌ Fixed (DR register is always the same address) |
| ADC → RAM | ❌ Fixed (DR register) | ✅ Increment (advance to next array element) |
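The increment settings can be illustrated with a toy transfer loop. `dma_model` is a made-up helper, not a HAL function; its `src_inc`/`dst_inc` flags play the roles of the MINC (memory increment) and PINC (peripheral increment) bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy DMA beat loop: when an increment flag is 0, the same
   address is used for every beat (a peripheral data register);
   when 1, the address advances through a RAM buffer. */
static void dma_model(const uint8_t *src, int src_inc,
                      uint8_t *dst, int dst_inc, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[dst_inc ? i : 0] = src[src_inc ? i : 0];
}
```

UART TX corresponds to `dma_model(buf, 1, &dr, 0, n)`: the source walks through the buffer while every write lands on the same one-byte "register". ADC-to-RAM is the mirror image: a fixed source register, an incrementing destination array.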

⚡ Why Do We Need DMA?

Interrupts Still Leave the CPU Occupied

In interrupt-based UART transmission (Episodes 8–9), every time UART finishes one byte an interrupt fires, and the CPU writes the next byte to the DR register — over and over.

Byte 1 done → interrupt → CPU writes byte 2 to DR
Byte 2 done → interrupt → CPU writes byte 3 to DR
    …… (repeats for every byte)

Sending 100 bytes triggers 100 interrupts, and the CPU runs an ISR each time.

⚠️ Image not found: without_dma.png

With DMA, Only One Interrupt — at Completion

CPU tells DMA: "send N bytes from this buffer to UART DR"
  ↓
DMA autonomously transfers one byte at a time (CPU can do other work)
  ↓
Transfer complete → DMA fires one interrupt
  ↓
CPU "prepares the next data" — that's all

⚠️ Image not found: with_dma.png

📌 The DMA Idea: Make the CPU Idle

DMA is a mechanism for making the CPU idle. Because the CPU doesn’t have to babysit the transfer, that time is free for “work that truly needs the CPU” — reading sensors, running control algorithms, updating displays.

Interrupt → “Call the CPU on an event-driven basis”
DMA → “Move data without calling the CPU at all”

These two are not opposites — they’re used together.


🏗 STM32’s DMA Architecture

DMA1 and DMA2

The STM32F401RE has DMA1 (8 streams) and DMA2 (8 streams).

  • DMA1: Handles APB1 peripherals (USART2, I2C1, SPI2, etc.)
  • DMA2: Handles APB2 peripherals (USART1, SPI1, ADC1, etc.) + memory-to-memory transfers

Streams and Channels

Each DMA has 8 streams (Stream0–7); each stream selects one of 8 channels (Channel0–7). The channel determines which peripheral’s DMA request is accepted.

| DMA | Stream | Channel | Peripheral |
| --- | --- | --- | --- |
| DMA1 | Stream6 | Channel4 | USART2_TX ← used in this episode |
| DMA1 | Stream5 | Channel4 | USART2_RX |
| DMA1 | Stream0 | Channel1 | I2C1_RX |
| DMA2 | Stream7 | Channel4 | USART1_TX |
⚠️ Always check the mapping in the data sheet

The stream/channel mapping differs between microcontrollers. For the STM32F401, verify using the RM0368 (Reference Manual), “Table: DMA1 request mapping” in the DMA chapter. Configuring with CubeMX automatically enters the correct values.

DMA Transfer Modes

| Mode | Description | Use case |
| --- | --- | --- |
| Normal | Transfer a set count, then stop | UART TX (one-shot) |
| Circular | Reuse buffer continuously | ADC sampling, audio |
| Memory-to-Memory | RAM-to-RAM copy | memcpy replacement (DMA2 only; DMA1 does not support this) |

🛠 DMA-ifying UART TX (Practice)

CubeMX Configuration

  1. USART2 → Mode: Asynchronous (Baud: 115200)
  2. DMA Settings tab → Add → USART2_TX
    • Direction: Memory To Peripheral
    • Mode: Normal
    • Increment: Memory ✅ / Peripheral ❌
  3. NVIC Settings → DMA1 stream6 global interrupt → Enable ✅

⚠️ Image not found: cubemx_dma.png

Generated Code

Key parts of the initialization code CubeMX generates:

/* Excerpt from MX_DMA_Init() in main.c */
hdma_usart2_tx.Instance                 = DMA1_Stream6;
hdma_usart2_tx.Init.Channel             = DMA_CHANNEL_4;
hdma_usart2_tx.Init.Direction           = DMA_MEMORY_TO_PERIPH;
hdma_usart2_tx.Init.PeriphInc           = DMA_PINC_DISABLE;   /* DR fixed */
hdma_usart2_tx.Init.MemInc              = DMA_MINC_ENABLE;    /* advance through buffer */
hdma_usart2_tx.Init.PeriphDataAlignment = DMA_PDATAALIGN_BYTE;
hdma_usart2_tx.Init.MemDataAlignment    = DMA_MDATAALIGN_BYTE;
hdma_usart2_tx.Init.Mode                = DMA_NORMAL;
hdma_usart2_tx.Init.Priority            = DMA_PRIORITY_LOW;

Implementation: DMA Transmission

/* Global buffer (must remain in RAM throughout the DMA transfer) */
static uint8_t g_tx_buf[64];
volatile bool g_tx_busy = false;   /* <stdbool.h> */

/* Start a DMA transmission */
void uart_send_dma(const char *str)
{
    uint16_t len = strlen(str);
    if (len > sizeof(g_tx_buf)) len = sizeof(g_tx_buf);

    /* Wait if a transfer is already in progress (or handle as error) */
    while (g_tx_busy);   /* ← polling — see note below */

    memcpy(g_tx_buf, str, len);
    g_tx_busy = true;

    HAL_UART_Transmit_DMA(&huart2, g_tx_buf, len);
    /* ← CPU returns here immediately; DMA handles the rest */
}

/* Transfer-complete callback (called from the DMA interrupt) */
void HAL_UART_TxCpltCallback(UART_HandleTypeDef *huart)
{
    if (huart->Instance == USART2)
    {
        g_tx_busy = false;   /* allow the next transmission */
    }
}
/* while(1) in main */
while (1)
{
    char msg[32];
    snprintf(msg, sizeof(msg), "tick=%lu\r\n", g_tim2_tick);

    uart_send_dma(msg);   /* CPU returns immediately */

    /* CPU is free to do other work */
    HAL_GPIO_TogglePin(GPIOA, GPIO_PIN_5);
    HAL_Delay(500);
}
💡 while(g_tx_busy) is polling

while (g_tx_busy) stalls the CPU until the previous transfer finishes — polling. When transmission frequency is high, this defeats the non-blocking benefit of DMA. In production code, consider returning an error or skipping the send when busy to preserve non-blocking behaviour.

/* Non-blocking design */
if (g_tx_busy) return HAL_BUSY;   /* let the caller decide */
✅ HAL_UART_Transmit_DMA() Is Non-Blocking

HAL_UART_Transmit() is blocking (waits until transmission completes); HAL_UART_Transmit_DMA() is non-blocking. Control returns to the caller the instant the function is called, while DMA continues the transfer in the background.

This is why it is critical not to reuse the buffer before the transfer completes (details in the pitfalls section).

Verification: Measuring with DWT CYCCNT

/* Measure CPU time from DMA launch to callback */
uint32_t t_start = DWT->CYCCNT;
HAL_UART_Transmit_DMA(&huart2, g_tx_buf, len);
uint32_t t_launch = DWT->CYCCNT - t_start;
/* t_launch ≈ a few tens to a few hundred cycles (just the "launch" cost) */

/* For comparison: blocking version */
t_start = DWT->CYCCNT;
HAL_UART_Transmit(&huart2, g_tx_buf, len, HAL_MAX_DELAY);
uint32_t t_blocking = DWT->CYCCNT - t_start;
/* t_blocking ≈ bytes × 86.8µs × 84 cycles/µs (CPU occupied the whole time) */

Comparison example for a 64-byte transmission:

| Method | CPU time occupied |
| --- | --- |
| HAL_UART_Transmit() (blocking) | ≈ 467,000 cycles (≈ 5.6 ms) |
| HAL_UART_Transmit_DMA() (DMA) | ≈ 200 cycles (≈ 2.4 µs) |

The blocking CPU occupancy scales linearly with the byte count:

$$
t_{\text{blocking}} \approx N_{\text{bytes}} \times \frac{10\,\text{bits}}{f_{\text{baud}}} \times f_{\text{CPU}}
$$

With DMA, this cost collapses to a fixed launch overhead of roughly 200 cycles, independent of the byte count.
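As a host-runnable sanity check of this formula, the helper below (`blocking_cycles` is an illustrative name) assumes 10 bits per UART frame (start + 8 data + stop), 115200 baud, and the F401's 84 MHz core clock.

```c
#include <assert.h>
#include <stdint.h>

/* t_blocking = N_bytes x (10 bits / f_baud) x f_cpu */
static uint32_t blocking_cycles(uint32_t n_bytes,
                                uint32_t baud, uint32_t f_cpu_hz)
{
    /* widen before multiplying to avoid 32-bit overflow */
    return (uint32_t)((uint64_t)n_bytes * 10u * f_cpu_hz / baud);
}
```

For 64 bytes this evaluates to about 467,000 cycles (≈ 5.6 ms) of CPU time locked up in the blocking call, against a launch cost of a couple hundred cycles that does not grow with the byte count when DMA is used.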


📊 Throughput and Latency

Let’s clarify why DMA feels “faster.”

Throughput (Transfer Volume per Unit Time)

Whether or not you use DMA, the physical transfer speed of UART does not change. 115200 bps is 115200 bps.

What DMA improves is not throughput — it’s the time the CPU has available for other work.

Latency (Responsiveness)

With DMA, the “waiting time” for a UART transmission is nearly zero from the CPU’s perspective.

【Blocking】
CPU locked to transmission ─────────────────────────── freed
               ↑TX start                             ↑TX complete

【DMA】
CPU issues start ─ freed
              ↑                DMA transfer complete
                               ↑callback via interrupt

This difference is the “latency improvement.” The CPU can move on to the next task immediately.


🔀 Bus Contention

AHB Bus Matrix

The STM32F4 uses an AHB (Advanced High-performance Bus) Matrix structure. The CPU and DMA exist as independent bus masters and share the same bus.

```mermaid
graph LR
    CPU["CPU<br>Cortex-M4"]
    DMA["DMA1"]
    BUS(["AHB Bus Matrix"])
    SRAM["SRAM"]
    Flash["Flash"]
    APB["APB1/APB2<br>Peripherals"]
    CPU --> BUS
    DMA --> BUS
    BUS --> SRAM
    BUS --> Flash
    BUS --> APB
```

When Contention Occurs

If the CPU and DMA try to access the same memory simultaneously, one of them must wait.

CPU: about to read g_tx_buf (in SRAM)
DMA: currently reading data from g_tx_buf (in SRAM)
  → Bus arbitration fires → one party waits 1 cycle

Bus arbitration is the hardware mechanism that referees competing simultaneous access attempts — the traffic controller of the bus. This situation is called bus contention.

💡 The AHB Bus Matrix is a Multi-Layer Structure

The STM32F4 AHB Bus Matrix is not a single shared road — it is a multi-lane interchange. When the CPU is fetching instructions from Flash and DMA is simultaneously accessing SRAM, they travel on separate bus paths and do not conflict.

Contention only occurs when both the CPU and DMA access SRAM at the same time. If, say, DMA is reading the UART TX buffer from SRAM while the CPU also tries to read or write SRAM, bus arbitration fires. For most UART workloads the overlap is rare and brief.

💡 Bus Contention Cannot Be Eliminated

Bus contention is unavoidable, but its impact is typically small (a few to a few dozen cycle stalls). For slow, periodic transfers like UART DMA, it’s almost never a problem.

Contention becomes significant when DMA is running high-bandwidth, continuous transfers (high-speed ADC, audio, cameras). In those cases, separating transfer buffers across internal SRAM and CCM (Core-Coupled Memory) may be needed.


⚠️ Common Pitfalls

Pitfall 1: Overwriting the Buffer Before the Transfer Completes

/* ❌ Overwriting the buffer while DMA is still reading it */
uint8_t tx_buf[64];

HAL_UART_Transmit_DMA(&huart2, tx_buf, 32);
/* While DMA is still reading tx_buf… */
memset(tx_buf, 0, 64);   /* ← Corrupts the buffer! */

DMA is still reading the old buffer, but the CPU overwrites it. The transmitted data is corrupted.

/* ✅ Confirm completion via callback before the next operation */
volatile uint8_t g_tx_done = 1;

void HAL_UART_TxCpltCallback(UART_HandleTypeDef *huart)
{
    g_tx_done = 1;
}

/* Check the done flag before transmitting */
while (!g_tx_done);   /* wait for the previous transfer to finish */
g_tx_done = 0;
memcpy(tx_buf, new_data, len);
HAL_UART_Transmit_DMA(&huart2, tx_buf, len);

Pitfall 2: Using a Local (Stack) Variable as a DMA Buffer

/* ❌ Using a stack-local variable as a DMA buffer */
void send_message(void)
{
    char buf[32] = "Hello DMA!\r\n";
    HAL_UART_Transmit_DMA(&huart2, (uint8_t*)buf, 12);
    /* ← When the function returns, the stack is freed.
          DMA is still reading that "former stack" area! */
}

DMA buffers must be global variables or static local variables.

/* ✅ static keeps it alive beyond the function scope */
void send_message(void)
{
    static char buf[32];
    snprintf(buf, sizeof(buf), "Hello DMA!\r\n");
    HAL_UART_Transmit_DMA(&huart2, (uint8_t*)buf, 12);
}

Pitfall 3: Misusing volatile on DMA Buffers

You might want to mark a DMA-written buffer (e.g. a receive buffer) as volatile, but when using HAL you need to be careful.

/* Tricky example: volatile uint8_t as a DMA buffer */
volatile uint8_t rx_buf[64];

/* Passing a volatile pointer to HAL_UART_Receive_DMA() may
   produce a compiler warning (HAL doesn't expect volatile) */

Because the HAL APIs are declared with plain uint8_t * parameters, the safer approach on the application side is to omit volatile on the buffer itself and instead control access timing through the completion callbacks.

Pitfall 4: D-Cache and Coherency Problems

This problem applies to any MCU that has a D-Cache (data cache) — not just the STM32F7/H7. MCUs based on Cortex-M7 or higher, such as NXP i.MX RT, Renesas RZ, and the high-performance lines from various vendors, are all equally affected.

Because DMA accesses RAM directly without going through the cache, the CPU’s cached view of memory and the actual RAM contents can go out of sync (coherency problem). The STM32F401 has no D-Cache, so this never occurs in this series — but any MCU with D-Cache requires explicit handling when using DMA.

RX Direction (DMA → RAM → CPU) Failure

DMA writes received data to rx_buf in RAM (bypasses cache)
   ↓
CPU tries to read rx_buf
   → Cache still holds stale data (zeros or data from previous transfer)
   → CPU reads from cache, not RAM ← corrupted

Symptom: Buffer reads as all zeros, or unchanged from the previous transfer. Inserting a printf “fixes” it (the memory access flushes the cache, making it non-reproducible).

TX Direction (CPU → RAM → DMA) Failure

CPU writes new data to tx_buf
   → Write-back cache: data is in cache but not yet written to RAM
   ↓
DMA reads tx_buf directly from RAM → sends stale data ← corrupted

Symptom: Transmitted data is the same as the previous transfer, or is garbage.

Fix: SCB_CleanDCache / SCB_InvalidateDCache

/* ---- After RX: invalidate cache so CPU reads fresh data from RAM ---- */
void HAL_UART_RxCpltCallback(UART_HandleTypeDef *huart)
{
    SCB_InvalidateDCache_by_Addr((uint32_t*)rx_buf, sizeof(rx_buf));
    process_received_data(rx_buf);   /* data is now up to date */
}

/* ---- Before TX: flush cache to RAM so DMA reads the latest data ---- */
void uart_send_dma(uint8_t *data, uint16_t len)
{
    memcpy(tx_buf, data, len);
    SCB_CleanDCache_by_Addr((uint32_t*)tx_buf, sizeof(tx_buf));
    HAL_UART_Transmit_DMA(&huart2, tx_buf, len);   /* DMA now reads fresh data */
}
⚠️ 32-byte alignment is mandatory

The Cortex-M7 L1 data cache has a cache line size of 32 bytes. All cache operations are performed in this 32-byte unit — which is exactly why SCB_InvalidateDCache_by_Addr / SCB_CleanDCache_by_Addr only work on 32-byte-aligned addresses and 32-byte-multiple lengths.

If the buffer address is not 32-byte-aligned, the cache of adjacent variables gets invalidated too, causing corruption in a completely unrelated location.

/* DMA buffers must be 32-byte aligned */
__attribute__((aligned(32))) uint8_t rx_buf[64];
__attribute__((aligned(32))) uint8_t tx_buf[64];
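A small host-side check makes the alignment rule concrete. `dma_cache_safe` is an illustrative helper, not a CMSIS function; `_Alignas` is the C11 spelling of the GCC/Clang `__attribute__((aligned(32)))`.

```c
#include <assert.h>
#include <stdint.h>

/* 32-byte-aligned DMA buffers (one D-Cache line on Cortex-M7) */
_Alignas(32) static uint8_t rx_buf_demo[64];
_Alignas(32) static uint8_t tx_buf_demo[64];

/* Returns 1 when a buffer can be handed to the cache-maintenance
   calls as-is: 32-byte-aligned start, length a multiple of 32 */
static int dma_cache_safe(const void *p, uint32_t len)
{
    return ((uintptr_t)p % 32u == 0u) && (len % 32u == 0u);
}
```

A check like this can live in an assert at the call site of `SCB_CleanDCache_by_Addr` / `SCB_InvalidateDCache_by_Addr`, catching misaligned buffers during development instead of as mystery corruption in the field.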
| Function | Behavior | When to call |
| --- | --- | --- |
| SCB_CleanDCache_by_Addr() | Flush cache → RAM (write-back) | Before DMA TX (CPU writes → DMA reads) |
| SCB_InvalidateDCache_by_Addr() | Invalidate cache, force re-read from RAM | After DMA RX (DMA writes → CPU reads) |
| Define non-cached MPU region | Place DMA buffers outside the cached address space | Root-cause fix when many buffers are involved |

Summary

What we learned in this episode:

| Concept | Detail |
| --- | --- |
| Role of DMA | Transfers data between RAM and peripherals without CPU involvement |
| DMA1 Stream6 Ch4 | Handles USART2_TX on the NUCLEO-F401RE |
| HAL_UART_Transmit_DMA() | Non-blocking. CPU regains control immediately |
| TxCpltCallback | Called after transfer completes. Permit the next transmission here |
| Bus contention | Occurs when CPU and DMA access the same memory simultaneously. Minor impact for slow transfers |
| Buffer management | Never overwrite a buffer mid-transfer. Place buffers in global or static storage |
| D-Cache coherency | On any MCU with D-Cache: call SCB_CleanDCache before TX and SCB_InvalidateDCache after RX. 32-byte alignment mandatory |
📌 The Essence of DMA: Giving the CPU a Hierarchy of Jobs

Once you master DMA, you can assign priority to the CPU’s work.

  • Work only the CPU can do: control logic, decision-making, state transitions
  • Work you can hand off to DMA: simple data transport

The moment you shift from “the CPU carries bytes one at a time” to “DMA handles everything, the CPU does its real job only” — your embedded system design moves up a level.

Next up is “Linker Scripts and Map Files (Looking Under the Hood).” We’ll learn where .text, .data, and .bss are placed, and how to track RAM consumption with a map file.


Next Episode

🚀 Episode 11: Linker Scripts and Map Files — Looking Under the Hood

Where are .text, .data, and .bss placed? How to read a linker script, track RAM/Flash usage from a map file, and understand stack and heap placement.


FAQ

Q. Does using DMA make UART transmission “faster”?

The physical transfer speed (bps) does not change. What improves is the time the CPU has available for other processing. The time previously locked up in transmission approaches zero, so the system’s overall responsiveness improves.

Q. Can I call HAL_UART_Transmit_DMA() back-to-back?

Calling it before the previous transfer completes returns an error (HAL_BUSY). The correct pattern is to set a flag in HAL_UART_TxCpltCallback() and check for completion before each new transmission.

Q. When do you use Circular mode?

When you need to stream data continuously — ADC sampling, audio output, and similar cases. In Circular mode, when the transfer completes, DMA automatically wraps back to the start, and the HalfCpltCallback (half complete) and CpltCallback (full complete) enable double-buffered processing.
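The double-buffer pattern can be sketched as a host-runnable model. `dma_fill` and `cpu_consume` are illustrative names standing in for the DMA engine and the half/full-complete callbacks; the CPU always processes the half that DMA is not currently writing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BUF_LEN 8              /* circular DMA buffer (illustrative) */

static uint8_t buf[BUF_LEN];   /* the circular DMA buffer         */
static uint8_t out[64];        /* stream assembled by the CPU     */
static size_t  out_len;
static uint8_t next_sample;    /* what the "peripheral" produces  */

/* DMA side: fills one half of the buffer with incoming samples */
static void dma_fill(size_t start, size_t len)
{
    for (size_t i = 0; i < len; i++)
        buf[start + i] = next_sample++;
}

/* CPU side: runs in the Half/Cplt callback, consumes the half
   that DMA has just finished writing */
static void cpu_consume(size_t start, size_t len)
{
    for (size_t i = 0; i < len; i++)
        out[out_len++] = buf[start + i];
}

/* One wrap = HalfCpltCallback on the first half, then
   CpltCallback on the second half; DMA never pauses */
static void run_circular(int wraps)
{
    for (int i = 0; i < wraps; i++) {
        dma_fill(0, BUF_LEN / 2);
        cpu_consume(0, BUF_LEN / 2);            /* half complete */
        dma_fill(BUF_LEN / 2, BUF_LEN / 2);
        cpu_consume(BUF_LEN / 2, BUF_LEN / 2);  /* full complete */
    }
}
```

On real hardware the two `cpu_consume` calls would live in `HAL_UART_RxHalfCpltCallback` and `HAL_UART_RxCpltCallback` (or the ADC equivalents), while the DMA hardware does the filling.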

Q. Can DMA be used with peripherals other than UART?

Yes. Almost all peripherals support DMA — SPI, I2C, ADC, DAC, timers, and more. In particular, running SPI (high speed) and ADC (high-frequency sampling) without DMA is impractical in most real applications.


💡 Next Step: DMA Reception and Circular Mode

Once you’ve mastered DMA-ified UART transmission, DMA reception is waiting. HAL_UART_Receive_DMA() enables continuous, CPU-free reception.

Go further with Circular mode — it automatically wraps the DMA back to the start of the buffer, enabling high-speed ADC sampling, audio processing, and any scenario where data flows non-stop. DMA is far more than just UART transmission.