Understanding AXI Protocol and Cache Coherency

The AXI protocol and cache coherency are used in almost every complex SoC these days, so every engineer should understand how they work.

The AXI protocol is burst-based and defines the following independent transaction channels:
• read address
• read data
• write address
• write data
• write response.

An address channel carries control information that describes the nature of the data to be transferred.

The data is transferred between master and slave using either:
• A write data channel to transfer data from the master to the slave. In a write transaction, the slave uses the write response channel to signal the completion of the transfer to the master.
• A read data channel to transfer data from the slave to the master.
The AXI protocol:
• permits address information to be issued ahead of the actual data transfer
• supports multiple outstanding transactions
• supports out-of-order completion of transactions

1. AXI has 1 read address channel, 1 write address channel, 1 read data channel, 1 write data channel, and 1 write response channel. Altogether, it has 5 parallel channels.
Whereas AHB has 1 address channel, 1 read data channel, and 1 write data channel.
2. AXI has native support for multiple outstanding transactions.
3. AXI supports transaction IDs. The user may issue multiple outstanding transactions per transaction ID.
4. The user can insert a pipeline register anywhere in the path of any of the 5 channels, which helps with timing closure and helps achieve a higher operating frequency.
5. The length of the burst is always known right at the start. This is supported by the AxLEN bits. In AHB, the burst length may be unknown at the start.
6. Write strobes are supported.
7. AXI3 supports locked transfers; AXI4 does not.

A master initiates a transaction and, without waiting for it to complete (for the response to arrive), initiates another transaction. The first transaction is then an outstanding transaction. AXI supports multiple outstanding transactions, so an AXI master doesn't have to wait for a transaction to complete before initiating a new one, which boosts performance.

A READ operation doesn't have a separate response channel because both the read data and the read response flow from slave to master. With every beat, the slave sends a read response along with the data on the read data channel.

As per the spec, the data bus width can be 8, 16, 32, 64, 128, 256, 512, or 1024 bits. So the minimum is 8 bits and the maximum is 1024 bits.

Write data channel information is always treated as buffered so that the master can perform write transactions without slave acknowledgment of previous write transactions.

Write Response and Read data channels.

Transaction - The complete set of required operations on the AXI bus.
Burst - The payload data that is transferred in a transaction.
Beats - The individual data transfers that make up a burst.

No, because early burst termination is not supported.

Each AXI channel transfers information in only one direction, and the architecture does not require any fixed relationship between the channels.

This means a register stage can be inserted at almost any point in any channel, at the cost of an additional cycle of latency.

An Interconnect is a component with more than one interface that connects one or more master components to one or more slave components.

The characteristics of a transaction (read/write), such as burst length, burst size, burst type, and atomic characteristics, are called the control information.

It manages the transactions between the MASTER and SLAVE, for example routing, providing responses, and buffering.

Shared address and data buses, shared address buses and multiple data buses, multilayer with multiple addresses and data buses.

If the AXI slave component is taking more time in responding back to the master for the completion of the transfer, then such components are said to have high initial access latency.

In hardware, inputs and outputs are assigned to individual channels, whereas a bus is just a pathway from one place to another.

Data interleaving increases the throughput.

The connection between two components.

The AxSIZE signal denotes how many bytes are transferred in a single beat of the burst (2^AxSIZE bytes).

AxLEN denotes how many transfers there are in a burst (AxLEN + 1 transfers).

The strobe signal is used to indicate which bytes of the write data bus are valid for each transfer of data. No, it's only used in a write operation.

This signal indicates the last transfer in a write/read burst. Yes, Write data and read data channels are used to send this signal.

1. The source uses the VALID signal to indicate when valid information is available.
2. The VALID signal must remain asserted, meaning set to high, until the destination accepts the information.
3. The destination indicates when it can accept information using the READY signal. The READY signal goes from the channel destination to the channel source.
4. This mechanism is not an asynchronous handshake and requires the rising edge of the clock for the handshake to complete.
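
The four handshake rules above can be sketched as a small cycle-by-cycle model. This is only an illustrative Python sketch, not RTL: the names ChannelCycle, count_transfers, and valid_is_stable are hypothetical, and each list entry stands for the signal values sampled at one rising clock edge.

```python
from dataclasses import dataclass


@dataclass
class ChannelCycle:
    valid: bool  # driven by the source
    ready: bool  # driven by the destination


def count_transfers(cycles):
    """Return the cycle indices at which a transfer completes.

    A transfer completes only on a clock edge where VALID and READY
    are both HIGH.
    """
    return [i for i, c in enumerate(cycles) if c.valid and c.ready]


def valid_is_stable(cycles):
    """Check that VALID, once HIGH, stays HIGH until the handshake."""
    pending = False  # True while VALID is HIGH but not yet accepted
    for c in cycles:
        if pending and not c.valid:
            return False  # VALID dropped before READY was seen
        pending = c.valid and not c.ready
    return True


# Hypothetical trace: the source waits one cycle for READY.
trace = [
    ChannelCycle(valid=False, ready=False),
    ChannelCycle(valid=True, ready=False),
    ChannelCycle(valid=True, ready=True),
    ChannelCycle(valid=False, ready=True),
]
print(count_transfers(trace), valid_is_stable(trace))
```

A trace in which VALID falls before READY has been seen HIGH would make valid_is_stable return False, which is the kind of check a protocol assertion would perform.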

No, the source must not wait for READY before asserting VALID, although the destination may wait for VALID before asserting READY. After a successful handshake, VALID must be deasserted as per spec unless the source has new information to send.

VALID should go high when the initiator has valid information to send. It should go low when there is no valid information, which includes the cycle after a successful handshake if nothing further is ready to send.

Write response is generated after the completion of a write transaction.

There are certain dependencies on how handshaking signals should be asserted. If it's violated handshaking will not occur and the process will be stalled. It's called a deadlock scenario.

For example, a deadlock condition can occur if the master waits for AWREADY before driving WVALID while the slave is waiting for WVALID before asserting AWREADY.

For read transfers, the information and the response flow are from slave to master. But for a write transaction, the information and the response are in different directions.
So individual responses for each transfer will involve more clock cycles and unnecessary traffic because of the two-way flow between master and slave.
So it is better to have a single response for a write transaction compared to a response for each transfer in a read transaction.

By ensuring proper channel handshaking dependencies as per the protocol, we can ensure data integrity.

NO, because data handshaking happens at least one CLK cycle after the address handshaking.

With respect to write operation, WLAST indicates it's the last transfer in write burst. So if WLAST is not provided by MASTER, the slave will not know whether the transfer is completed or not. So it will not be able to assert any response signal.

No, but it's based on the user's requirement.

The granularity of mapping in AXI is 4KB. That means the smallest "block" of addresses that can be assigned to a given slave/peripheral is 4KB, and all allocations are multiples of 4KB. So when you cross a 4KB boundary, you are potentially going from slave A's address space to slave B's address space.

Discarding read data that is not required can result in lost data when accessing a read-sensitive device such as a FIFO.

When accessing such a device, a master must use a burst length that exactly matches the size of the required data transfer.

After initiating a transaction, the master must have status information for that particular transaction. Sometimes the address to which a transaction is issued will not be available, either because the address does not exist or because it is not accessible due to security restrictions. Sometimes the slave may not accept the data.

In these conditions the master must be aware of the status so that it can act accordingly. So response signals are important.

OKAY, EXOKAY (exclusive okay), SLVERR (slave error), and DECERR (decode error).

When both the source and destination happen to indicate in the single rising edge, that they can transfer the address, data, or control information.

In this case, the transfer occurs at the rising clock edge when the assertion of both VALID and READY can be recognized. This means the transfer occurs at the next rising edge.

When AWREADY is HIGH, the slave must be able to accept any valid address that is presented to it. A default AWREADY state of LOW forces the transfer to take at least two cycles, one to assert AWVALID and another to assert AWREADY.

It is incorrect. The slave must wait for both ARVALID and ARREADY to be asserted before it asserts RVALID to indicate that valid data is available.

Address and data are two independent channels. Address and control information is transferred on the address channel, and the slave uses it to configure itself to receive the data.

As this information is generated by the master, the master can assert the VALID signal. This also avoids deadlock conditions.

1. For wrapping bursts, the burst length must be 2, 4, 8, or 16.
2. A burst must not cross a 4KB address boundary.
3. Early termination of bursts is not supported.
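
As a rough illustration, the first two rules can be checked in software before a burst is issued. The Python sketch below is an assumption-laden model, not part of any AXI library: check_burst and its string burst-type argument are hypothetical names. Only INCR bursts are tested against the 4KB rule here, because a FIXED burst stays at one address and a legal WRAP burst stays inside an aligned block of at most 16 x 128 = 2048 bytes.

```python
def check_burst(addr, axlen, axsize, burst):
    """Return a list of rule violations for one burst (empty list = legal).

    addr   : start address (AxADDR)
    axlen  : AxLEN encoding, burst length is axlen + 1
    axsize : AxSIZE encoding, bytes per transfer is 2 ** axsize
    burst  : "FIXED", "INCR", or "WRAP"
    """
    num_bytes = 1 << axsize
    length = axlen + 1
    errors = []

    if burst == "WRAP":
        if length not in (2, 4, 8, 16):
            errors.append("WRAP length must be 2, 4, 8, or 16")
        if addr % num_bytes != 0:
            errors.append("WRAP start address must be aligned to the transfer size")

    if burst == "INCR":
        aligned = (addr // num_bytes) * num_bytes
        last_byte = aligned + length * num_bytes - 1
        # Two addresses are in the same 4KB region when they share
        # the same value of address // 4096.
        if addr // 4096 != last_byte // 4096:
            errors.append("burst crosses a 4KB boundary")

    return errors


# 4 beats of 4 bytes ending exactly at 0x0FFF: legal.
print(check_burst(0x0FF0, axlen=3, axsize=2, burst="INCR"))
# Same burst started 8 bytes later spills into the next 4KB region.
print(check_burst(0x0FF8, axlen=3, axsize=2, burst="INCR"))
```

A master (or a verification sequence) that gets a non-empty list back would typically split the request into two bursts at the 4KB boundary.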

Burst type. The burst type and the size information, determine how the address for each transfer within the burst is calculated.

FIXED, INCR, WRAP are the burst types supported in AXI

An interconnect component, to indicate that there is no slave at the transaction address.

For the second transfer, convert the unaligned address to aligned and then continue the transaction.

The transaction starts at the given start address. When the address reaches Wrap_Boundary + (Number_Bytes × Burst_Length), it wraps back to the wrap boundary and continues until all AxLEN + 1 transfers are done.

Transaction will not take place as unaligned addresses are not supported in the wrap burst.

Valid should be deasserted if handshaking is completed and the addresses are not coming in each and every clock cycle. Ready need not be deasserted.

INCR burst is used in sequential memory and FIXED is used in FIFO.

The byte lane of the highest addressed byte of a transfer is the upper byte lane and the lowest addressed byte of a transfer is the lower byte lane.

Burst_length = AxLEN+1 so 5.

True. As per the AXI4 specification, only INCR bursts support a burst length of 1-256 transfers. For FIXED and WRAP bursts it's 1-16 transfers, and for a WRAP burst it must be 2, 4, 8, or 16.

If (start address % transfer size == 0), the address is aligned; otherwise, the address is unaligned. Here the transfer size is the number of bytes per transfer.
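
The alignment test above can be written as a one-line check. This is a minimal Python sketch; is_aligned is a hypothetical helper name, and the transfer size is derived from the AxSIZE encoding as 2 ** AxSIZE bytes.

```python
def is_aligned(start_address, axsize):
    """True when start_address is a multiple of the transfer size."""
    number_bytes = 1 << axsize  # bytes per transfer
    return start_address % number_bytes == 0


print(is_aligned(0x1004, axsize=2))  # 4-byte transfers at 0x1004
print(is_aligned(0x1002, axsize=2))  # 4-byte transfers at 0x1002
```

Note that the same address can be aligned for one transfer size and unaligned for another: 0x1002 is aligned for 2-byte transfers but not for 4-byte transfers.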

No, it will use either the upper byte lane or the lower byte lane. We cannot mix both lanes in a transfer.

If a transfer is narrower than its data bus, it's called a narrow transfer. When a master generates a transfer that is narrower than its data bus, the address and control information determine which byte lanes the transfer uses:
• in incrementing or wrapping bursts, different byte lanes are used on each beat of the burst.
• in a fixed burst, the same byte lanes are used on each beat.
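
To make the byte-lane behaviour concrete, here is a small Python sketch that computes which lanes (and therefore which WSTRB bits) each beat of a narrow INCR burst uses. It is a simplified model that assumes an aligned start address; byte_lanes, wstrb_mask, and incr_strobes are hypothetical helper names.

```python
def byte_lanes(address, axsize, data_bus_bytes):
    """Return (lower_lane, upper_lane) for an aligned transfer at `address`."""
    number_bytes = 1 << axsize
    lower = address % data_bus_bytes
    upper = lower + number_bytes - 1
    return lower, upper


def wstrb_mask(address, axsize, data_bus_bytes):
    """Build the write strobe value with one bit set per active byte lane."""
    lower, upper = byte_lanes(address, axsize, data_bus_bytes)
    mask = 0
    for lane in range(lower, upper + 1):
        mask |= 1 << lane
    return mask


def incr_strobes(start_address, axlen, axsize, data_bus_bytes):
    """Strobe value for every beat of an aligned INCR burst."""
    number_bytes = 1 << axsize
    return [
        wstrb_mask(start_address + n * number_bytes, axsize, data_bus_bytes)
        for n in range(axlen + 1)
    ]


# 8-bit transfers on a 32-bit (4-byte) bus: a different lane on each beat.
print([bin(m) for m in incr_strobes(0x0, axlen=3, axsize=0, data_bus_bytes=4)])
```

For a FIXED burst the address does not change, so calling wstrb_mask with the same address on every beat gives the same lanes each time, which matches the second bullet above.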

WSTRB can take any value, but it's recommended that the strobes are either driven low or held at their previous values.

These equations determine addresses of transfers within a burst:
• Start_Address = AxADDR
• Number_Bytes = 2 ^ AxSIZE
• Burst_Length = AxLEN + 1
• Aligned_Address = (INT(Start_Address / Number_Bytes)) × Number_Bytes.

This equation determines the address of the first transfer in a burst:
• Address_1 = Start_Address.
For an INCR burst, and for a WRAP burst for which the address has not wrapped, this equation determines the address of any transfer after the first transfer in a burst:
• Address_N = Aligned_Address + (N – 1) × Number_Bytes.

For a WRAP burst, the Wrap_Boundary variable defines the wrapping boundary:
• Wrap_Boundary = (INT(Start_Address / (Number_Bytes × Burst_Length)))× (Number_Bytes × Burst_Length).
For a WRAP burst, if Address_N = Wrap_Boundary + (Number_Bytes × Burst_Length), then:
• use this equation for the current transfer:
— Address_N = Wrap_Boundary
• use this equation for any subsequent transfers:
— Address_N = Start_Address + ((N – 1) × Number_Bytes) – (Number_Bytes × Burst_Length).
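
The equations above translate almost directly into code. The following Python sketch is one possible rendering, with variable names chosen to mirror the equations; the function name burst_addresses and the string burst-type argument are assumptions made for illustration.

```python
def burst_addresses(start_address, axlen, axsize, burst):
    """Return the address of every transfer in a burst.

    burst is "FIXED", "INCR", or "WRAP".
    """
    number_bytes = 1 << axsize          # Number_Bytes = 2 ^ AxSIZE
    burst_length = axlen + 1            # Burst_Length = AxLEN + 1
    aligned_address = (start_address // number_bytes) * number_bytes

    if burst == "FIXED":
        # Every transfer uses the same address.
        return [start_address] * burst_length

    addresses = [start_address]         # Address_1 = Start_Address

    if burst == "INCR":
        for n in range(2, burst_length + 1):
            addresses.append(aligned_address + (n - 1) * number_bytes)
        return addresses

    if burst == "WRAP":
        total_bytes = number_bytes * burst_length
        wrap_boundary = (start_address // total_bytes) * total_bytes
        wrapped = False
        for n in range(2, burst_length + 1):
            if wrapped:
                address = start_address + (n - 1) * number_bytes - total_bytes
            else:
                address = aligned_address + (n - 1) * number_bytes
                if address == wrap_boundary + total_bytes:
                    address = wrap_boundary   # the transfer that wraps
                    wrapped = True
            addresses.append(address)
        return addresses

    raise ValueError(f"unknown burst type: {burst}")


# 8 transfers of 4 bytes, WRAP burst starting at 0x18.
print([hex(a) for a in burst_addresses(0x18, axlen=7, axsize=2, burst="WRAP")])
```

For an unaligned INCR start address such as 0x1002 with 4-byte transfers, only the first transfer uses the unaligned address; the rest are computed from Aligned_Address.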

The transactions which are yet to be completed are called outstanding transactions.

For example: Let us say we have 10 writes initiated from the Master component. Out of 10, only 3 of them have received an OKAY response from slaves. In such a case, the remaining 7 writes whose responses are yet to be received are called outstanding transactions.
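
The 10-writes example can be modeled with a simple counter per transaction ID, which is roughly what a scoreboard in a testbench might keep. This Python sketch is illustrative only; OutstandingTracker and the ID assignment (i % 4) are made up for the example.

```python
from collections import Counter


class OutstandingTracker:
    """Counts transactions that were issued but not yet responded to."""

    def __init__(self):
        self._pending = Counter()  # transaction ID -> outstanding count

    def issue(self, txn_id):
        self._pending[txn_id] += 1

    def complete(self, txn_id):
        if self._pending[txn_id] == 0:
            raise ValueError(f"response for ID {txn_id} with nothing outstanding")
        self._pending[txn_id] -= 1

    @property
    def outstanding(self):
        return sum(self._pending.values())


tracker = OutstandingTracker()
for i in range(10):              # master issues 10 writes
    tracker.issue(txn_id=i % 4)
for txn_id in (0, 1, 2):         # only 3 responses have come back
    tracker.complete(txn_id)
print(tracker.outstanding)
```

A response that arrives for an ID with nothing outstanding raises an error, which is a useful sanity check when verifying a slave or interconnect.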

If the AXI slave component is taking more time (in terms of clock cycles) in responding back to the master for the completion of the transfer then such components are said to be having high initial access latency.

The AXI protocol provides a signal called WSTRB that indicates which byte lanes of the data bus carry valid data.

The responses from the slave can be sent out of order. The slave is not required to complete transactions in the order in which it received them, although transactions that use the same ID must still complete in order. The exception here is the first transaction; this facility applies to every transaction except the first.

No. Addresses (read/write) are generated only by the AXI master. It is the read data and write response channels that are owned by the AXI slave.
The slave will only be sending read data, read responses, and write responses.

The statement means that WRITE and READ bursts have associated WLAST and RLAST signals, respectively, which indicate whether the last item within a transaction has been transferred.

The AWSIZE signal denotes how many bytes are transferred in a single transfer of the burst. The maximum value is 128 bytes.

5 channels
Each channel will have a valid & ready signal.
Write operation has both data and address channels
AWADDR: write address
AWVALID: Write address valid: source is Master
AWREADY: Write address ready : source is Slave
WDATA : Write data
WVALID: VALID write data : Source is Master
WREADY: write ready : Source is slave

READ operation has both data and address channels
ARADDR: READ address
ARVALID: READ address valid: source is Master
ARREADY: READ address ready: source is Slave
RDATA : READ data: slave
RVALID: VALID READ data: Source is Slave
RREADY: READ ready: Source is Master

WRITE response channel:
Owned by slave
BVALID: Source is a slave
BREADY: Source is Master

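The max allowable AWSIZE is 128 bytes and the max allowable length is 16 (the AXI3 limit; AXI4 allows up to 256 transfers for INCR bursts, but a burst still must not cross a 4KB boundary).
So, it is the product of 128*16 = 2048 bytes.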

Let us take the following system scenario:

L2 cache memory is in the path between the processor and interconnect.
Any transfer that can access the cache will check the cache contents
(called cache lookup) before potentially accessing the downstream memory,
which in this case is DDR memory.

INCR is the simplest burst type, starting at a lower address and stepping up sequentially in memory to a higher address. These bursts can also be used to perform a cache linefill, but the problem with this burst type is that you might need to complete the whole cache linefill before the data you want is stored in the cache and made available to the processor. This is where the WRAP burst has an advantage.

A WRAP burst fetches the important data first (which the processor actually
wants) and then completes the cache line fill around that important data.

In system-level terminology, this important data which the processor actually
wants from the particular access location of the cache is called "critical word".

As an example, if we had an 8-word cache line, and the processor wanted to
read data from address 0x18 (the 7th entry on a cache line if that data was
cached), an INCR burst would need to fetch data for
0x00, 0x04, 0x08, 0x0C, 0x10, and 0x14 before finally getting the 0x18 data the
processor wants (at which point the processor is no longer stalled), and then
the final 0x1C cache line entry is filled.

Instead, if we use a WRAP burst, this burst can start at 0x18 (so the processor
is no longer stalled), and the cache line then fills up around this "critical word", with accesses to 0x1C, 0x00, 0x04, 0x08, 0x0C, 0x10 and 0x14.

There will still be 8 memory accesses to perform the cache linefill but in most
cases the WRAP burst type will stall the requesting processor for fewer cycles than the INCR burst type.
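
The difference in stall time can be illustrated by listing the order in which each burst type returns the words of the cache line. The Python sketch below assumes a 32-byte line of 4-byte words and a word-aligned critical address; linefill_order and beats_until_critical are hypothetical helper names, and real stall time also depends on memory latency, which is not modeled here.

```python
def linefill_order(critical_addr, burst, line_bytes=32, word_bytes=4):
    """Order in which the words of one cache line are fetched."""
    line_base = (critical_addr // line_bytes) * line_bytes
    words = line_bytes // word_bytes
    if burst == "INCR":
        # Always starts at the bottom of the line.
        return [line_base + i * word_bytes for i in range(words)]
    if burst == "WRAP":
        # Starts at the critical word and wraps around the line.
        first = (critical_addr - line_base) // word_bytes
        return [line_base + ((first + i) % words) * word_bytes for i in range(words)]
    raise ValueError(f"unsupported burst type: {burst}")


def beats_until_critical(critical_addr, burst):
    """Number of data beats received before the processor has its word."""
    return linefill_order(critical_addr, burst).index(critical_addr) + 1


for burst in ("INCR", "WRAP"):
    order = [hex(a) for a in linefill_order(0x18, burst)]
    print(burst, order, beats_until_critical(0x18, burst))
```

Both burst types still perform 8 accesses; only the position of the critical word in the sequence changes.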

NO. Early burst termination is NOT supported in AXI. AXI Master can disable
writing by deasserting all the write strobes but it must complete the
remaining transfers of the burst. Discarding READ data that is NOT required
can result in lost data when accessing a READ sensitive device like FIFO.

Cache coherency is a mechanism that keeps the copies of the same data held in different caches consistent with one another, using additional extensions provided by the AMBA 4 ACE (AXI Coherency Extensions) protocol.
L1 cache is specific to each core.
L2 cache is specific to the processor subsystem.
Example: Each core has its own L1 cache, and all the cores in a subsystem share one L2 cache.

System s/w will decide which address is cacheable & which address is non-cacheable.
Accordingly, the processor will generate the signal AWCACHE in such a way that the address will be cached.

The processor will go and create an entry in the cache and will fetch the data & put it into the cache.

If the address is not present in the cache, then the processor will go and create an entry in the cache and will fetch the data & put it into the cache. During this process, there is a chance that L1 and L2 may go out of sync.

For example, there is an address 'h1000 present in the DDR memory, L1 and L2. In a case where the L1 cache address got updated and L2 is NOT updated, there should be a mechanism to make them in sync. Such a mechanism is called cache coherency.

Prefetching refers to retrieving & storing data into buffer memory (cache) before the processor requires the data. When the processor wants to process the data, it is readily available and can be processed within a short period of time.

Had there not been a cache memory, the processor has to download the data directly from the memory address, hence there could be a delay.

Cache prefetching is a speed-up technique used by the processors where instructions/data are fetched before they are needed.

AWCACHE[1]:- For writes this means that number of writes can be merged together.
ARCACHE[1]:- For reads, this means that the location can be prefetched or can be fetched just once for multiple read transactions.

System s/w will decide which address is cacheable & which address is non-cacheable. Accordingly, the processor will generate the necessary
attributes over the signals AWCACHE/ARCACHE to provide support to system-level caches about the transaction types.

RA: if high, it means that if the transfer is read and if it misses in the cache then it could be allocated.
WA: if high, it means that if the transfer is write and if it misses in the cache then it could be allocated.
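
As a quick illustration, the four AxCACHE bits can be decoded into named attributes. This Python sketch assumes the AXI3-style bit positions that match the RA/WA naming used here (bit 0 bufferable, bit 1 cacheable/modifiable, bit 2 RA, bit 3 WA); decode_axcache is a hypothetical helper, and AXI4 renames some of these bits, so check the specification for the exact encoding in use.

```python
def decode_axcache(value):
    """Split a 4-bit AxCACHE value into its individual attribute bits."""
    if not 0 <= value <= 0xF:
        raise ValueError("AxCACHE is a 4-bit field")
    return {
        "bufferable": bool(value & 0b0001),      # AxCACHE[0]
        "modifiable": bool(value & 0b0010),      # AxCACHE[1]
        "read_allocate": bool(value & 0b0100),   # AxCACHE[2], RA
        "write_allocate": bool(value & 0b1000),  # AxCACHE[3], WA
    }


print(decode_axcache(0b0110))
```

A cache model could use read_allocate and write_allocate from this dictionary to decide whether a miss should create a new cache entry.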

In the form of different variants of accesses.
a. privileged
b. normal
c. secure
d. nonsecure

EX access fails. If a master doesn't complete the write portion of an exclusive operation, a subsequent EX-RD changes the address that is
being monitored for exclusivity.

EX Fails. In such a case, to overcome the memory overriding problem, the slave reserves some memory resource for M1 virtually as indicated by EX-RD request earlier from M1. This is the fundamental advantage of exclusive access in AXI.

AXI slave will start monitoring the ADDRS on which EXREAD operation has been initiated and also the ARID provided by the master until either a write occurs to that location or until another EX READ with the same ARID value resets the EX ACCESS monitoring logic in the slave to a different address.
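
The monitoring behaviour described above can be sketched as a tiny model of the slave's exclusive access monitor. This is a simplified Python illustration, not a complete implementation: ExclusiveMonitor is a hypothetical class, it tracks a single address per ID, and it ignores access size and attributes. The response strings follow the protocol's convention that a successful exclusive write returns EXOKAY and a failed one returns OKAY.

```python
class ExclusiveMonitor:
    """Tracks one monitored address per transaction ID."""

    def __init__(self):
        self._monitored = {}  # ID -> address being monitored

    def exclusive_read(self, axid, address):
        # A new exclusive read with the same ID moves the monitor
        # to the new address.
        self._monitored[axid] = address

    def write(self, address):
        # Any write to a monitored location clears the monitors on it.
        self._monitored = {
            i: a for i, a in self._monitored.items() if a != address
        }

    def exclusive_write(self, axid, address):
        if self._monitored.get(axid) == address:
            self.write(address)   # the location is updated
            return "EXOKAY"       # exclusive access succeeded
        return "OKAY"             # exclusive access failed, no update


monitor = ExclusiveMonitor()
monitor.exclusive_read(axid=1, address=0x1000)   # M1 starts exclusive access
monitor.write(0x1000)                            # another master writes first
print(monitor.exclusive_write(axid=1, address=0x1000))
```

In this scenario the exclusive write from M1 fails because another write reached the location between M1's exclusive read and exclusive write, so M1 has to retry the whole read-modify-write sequence.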

The length of the burst must be 2, 4, 8, or 16. Unaligned transfers are not supported.

IN WRITE: there is just one response given for the entire burst but not for each and every individual data item within the burst.

FOR READ: the slave can provide different responses for different transfers within a burst.

For example: in a burst of 16 read transfers, the slave might return an OKAY response for 15 of them and a SLVERR response for the 16th item.

In a multi-master system, the interconnect (IC) appends additional information to the
ID tag to ensure that ID tags from all the masters are unique. The ID tag is
similar to a master number, with the extension that each master can
implement multiple virtual masters within the same port by supplying an ID tag to indicate the virtual master number.
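
One common way to picture this is the interconnect placing the master port number in extra upper bits of the ID. The Python sketch below shows that idea only; the bit layout, the 4-bit master ID width, and the names extend_id and split_id are assumptions for illustration, since the actual scheme is up to the interconnect implementation.

```python
def extend_id(master_index, axid, id_width):
    """Prefix the master's own ID with its port number."""
    if not 0 <= axid < (1 << id_width):
        raise ValueError("axid does not fit in id_width bits")
    return (master_index << id_width) | axid


def split_id(extended_id, id_width):
    """Recover (master_index, axid) so a response can be routed back."""
    return extended_id >> id_width, extended_id & ((1 << id_width) - 1)


# Two masters both use ID 0x5; the slave sees two distinct IDs.
id_from_m0 = extend_id(master_index=0, axid=0x5, id_width=4)
id_from_m2 = extend_id(master_index=2, axid=0x5, id_width=4)
print(hex(id_from_m0), hex(id_from_m2), split_id(id_from_m2, id_width=4))
```

On the response path, the interconnect would use the upper bits to route the response to the right master and strip them before handing the original ID back.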

Write data is treated as buffered so that the master can perform write
transactions without slave acknowledgment of previous writes.

WLAST is asserted for the last data item of the burst by the AXI Master.

Yes. Unless both the ARVALID & ARREADY signals are seen HIGH, RVALID cannot be driven HIGH. It is a READ transaction. Unless the master drives the address for fetching the data, the READ transaction cannot be performed. Unless there is a valid read address, there cannot be a READ.

Transfer size * burst_length

Burst length: 4 (AWLEN = 3)
AWSIZE is 4 bytes: 32 bits

Address boundary is: 4 * 4 = 16 bytes
0x0 - 0xF

By default, the WSTRB member of the master transaction is random, so all of its bits take random values when randomized. If the user wants to set all the bits to '1', they can add a constraint during randomization as follows:

// assumes burst_size holds the AxSIZE encoding, so each beat has (1 << burst_size) strobe bits
foreach (wstrb[i])
  wstrb[i] == (1 << (1 << this.burst_size)) - 1;

All AXI and Cache Coherency concepts here were guided by one of the most experienced Verification Engineers in the industry, Mr. Rahul Bhardwaj, who has more than 15 years of experience in the ASIC Verification domain and has worked on different products in the VLSI industry.

Thank you so much, Mr. Rahul Bhardwaj, for providing such valuable information. I know you have the busiest schedule, yet you still give your time to help engineers.

I will come back with new blog posts soon. Till then, keep on learning and keep on growing. See ya, take care :)

About the author

The Art of Verification

Hi, I’m Hardik, and welcome to The Art of Verification.

I’m a Verification Engineer who loves to crack complex designs and here to help others commit to mastering Verification Skills through self-learning, System Verilog, UVM, and most important to develop that thought process that every verification engineer should have.

I’ve made it my mission to give back and serve others beyond myself.

I will NEVER settle for less than I can be, do, give, or create.

15 Comments

  • Hello Hardik,

    I would like to add one more point to the below mentioned question.

    Q: What will happen if the address is not present in the Cache?

    Ans:
    For any given data, the Processor sends its request to the Cache memory.

    If the data is found in Cache, it can be loaded quickly into the CPU. If it is not resident in Cache, the request is forwarded to the next lower level of the hierarchy, and this process begins again.
    If the data is found at this level, the whole block in which the data resides is transferred into the Cache.

    If the data is not found at this level, the request is forwarded to the next lower level, and so on.

  • Hi,
    can anyone explain how to write a test case for outstanding and out of order transactions in AXI

    • Hi, can any one give the information about how boot code works in soc? And how reset is handling in soc?

      Regards,
      Raushan

  • Thank you Hardik for such an amazing content.
    I was searching for AXI interview question with answer. And finally my search end at your website.
    Kudos to you and look forward to more such content.
    Thank you so much.

    Just one observation.

    I found your website more readable in mobile rather than laptop.
    I am using Microsoft edge browser there it was looking like plan text with no colouring.

  • Hi,
    Thank you for this valuable content.
    A question – AXI4 allows the write data to be sent before the write address and control information.
    Can you please suggest when it might be useful, and elaborate on this subject?

  • Hi,

    Thank you. Excellent post!!

    I would like know the advantage of unaligned transfer.
    Why is unaligned transfer used?