Wednesday, 6 July 2016

Cortex-A8 Architecture



Cortex-A8 Pipeline Diagram
Cortex-A8Pipeline.png


Cortex-A8 Control Diagram
·       High quality branch prediction results in fewer replays and lower power
·       Branch prediction maintains 95% accuracy across a wide range of code
CortexA8-ControlFlow.png
·       Dynamic branch predictor components
    o   512-entry, 2-way BTB
    o   4K-entry GHB indexed by branch history and PC
    o   8-entry return stack
·       Branch resolution
    o   All branches are resolved in a single stage
    o   Maintains speculative and non-speculative versions of the branch history and return stack
Cortex-A8 Instruction Decode
CortexA8-instDecode.png
·       Instruction decode highlights
    o   4-entry pending queue reduces fetch stalls and increases pairing opportunities
    o   Replay queue holds instructions for reissue after a memory system stall
    o   Scoreboard predicts register availability using static scheduling techniques
    o   Cross-checks in D3 allow issue of dependent instruction pairs
Cortex-A8 Instruction Issuing
·       Supported dual-issue pairs
    o   Two ALU instructions
    o   One ALU instruction and one load/store instruction
    o   One multiply/MAC instruction with one ALU, load/store, or NEON data processing instruction
    o   Two NEON data processing instructions
    o   One NEON data processing instruction with one load/store or ALU instruction
·       Some instructions will only issue to pipeline 0
    o   Multiply/MAC instructions
    o   Load/store multiple and other multi-cycle instructions
Cortex-A8 Instruction Execution
CortexA8-InstExec.png
·       Execution pipeline highlights
    o   2 symmetric ALU pipelines: Shift/ALU/SAT
    o   Load/store pipe used by instructions in either pipeline
    o   Multiply instructions are tied to pipe 0
    o   All key forwarding paths supported
    o   Static scheduling allows for extensive clock gating
Cortex-A8 Memory System
·       Harvard Level 1 caches – both 16 KB, 4-way set associative
    o   Single-cycle load-use penalty
    o   Virtually indexed, physically tagged (VIPT)
    o   Level 1 data cache is blocking
        -   Non-NEON read misses in the cache cause replay of subsequent instructions
        -   Reduces complexity in later pipeline stages
        -   Good for power and clock frequency
    o   NEON data is not allocated to L1 (but will read/update L1 if necessary)
·       Integrated 256 KB unified Level 2 cache, 8-way set associative
    o   Dedicated low-latency, high-bandwidth interface to the Level 1 cache
    o   Line length of 16 words (64 bytes); see the alignment sketch after this list
    o   Physically indexed, physically tagged (PIPT)
    o   Minimum latency of 8 cycles
    o   Streams to the NEON processing unit; up to 16 GB/s bandwidth
    o   128-bit data streaming from both L1D$ and L2$
·       64-bit AMBA AXI interconnect to external memory
    o   Supports multiple outstanding memory transactions to minimize memory latency
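The 64-byte line length noted above can be exploited directly in data layout. Below is a minimal C sketch, assuming a GCC/Clang-style aligned attribute; the structure and names are illustrative, not part of any ARM API. Each slot is padded and aligned so that frequently written items do not share a cache line.

#include <stdint.h>

#define CACHE_LINE_BYTES 64   /* 16 words, matching the quoted L1/L2 line length */

/* Hypothetical per-engine counter slot, padded and aligned so that each
 * slot occupies exactly one cache line and no two slots share a line. */
struct counter_slot {
    uint32_t value;
    uint8_t  pad[CACHE_LINE_BYTES - sizeof(uint32_t)];
} __attribute__((aligned(CACHE_LINE_BYTES)));

static struct counter_slot counters[4];   /* one 64-byte line per slot */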





Cortex-A8 Control Coprocessor
·       The processor does not have an external coprocessor interface, but it does implement two internal coprocessors, CP14 and CP15
·       CP14, also known as the debug coprocessor, is used for various debug functions
·       CP15, also known as the system control coprocessor, is used to control and provide status information for the functions implemented in the processor
CortexA8-coProc.png
Cortex-A8 CP15 Register Groups
Function                        CP15 Registers
System Configuration            c0
System Control                  c1
Translation Base Control        c2
Domain Access Control           c3
Faults                          c5/c6
Cache Operations                c7
TLB Operations                  c8/c10
Performance Monitor             c9
L2 Control                      c9
Pre-load Engine                 c11
Interrupts                      c12
Process ID                      c13
Memory Arrays                   c15
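As a concrete illustration of how software reaches these registers, here is a minimal C sketch using GCC-style inline assembly on an ARMv7 target; it reads the c0 Main ID register and the c1 System Control register. The helper names are mine, and the code assumes a privileged execution context, since CP15 accesses fault in user mode.

#include <stdint.h>

/* c0: Main ID Register (MIDR) - identifies the processor */
static inline uint32_t read_midr(void)
{
    uint32_t val;
    __asm__ volatile("mrc p15, 0, %0, c0, c0, 0" : "=r"(val));
    return val;
}

/* c1: System Control Register (SCTLR) - MMU, cache, and branch prediction enables */
static inline uint32_t read_sctlr(void)
{
    uint32_t val;
    __asm__ volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(val));
    return val;
}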
Cortex-A8 Performance Monitor Unit
·       Controlled through CP15 registers (c9)
·       Four counters
·       Counted events include:
    o   Cache misses
    o   TLB misses
    o   Branch mispredictions
    o   Exceptions
    o   External events
    o   Others
·       Interrupt output on counter overflow
Register                        Description
Performance Monitor Control     Controls the operation of the count registers
Count Enable Set                Enables PMU count registers
Count Enable Clear              Disables PMU count registers
Overflow Flag Status            Reads/clears the PMU counter overflow flags
Software Increment              Increments a PMU count register under software control
Performance Counter Selection   Selects a PMU counter
Cycle Count                     Reads/writes the PMU cycle count register
Event Selection                 Selects the event for the PMU to count
Performance Monitor Count       Reads/writes the 4 PMU event count registers
User Enable                     Allows user mode to access the PMU
Interrupt Enable Set            Enables overflow interrupts
Interrupt Enable Clear          Disables overflow interrupts
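To make the register roles above concrete, here is a minimal C sketch, assuming the ARMv7-style CP15 c9 PMU encodings and a privileged context (or that the User Enable register has been set), that starts and reads the cycle counter. The function names are illustrative.

#include <stdint.h>

/* Enable the counters and reset/start the cycle counter. */
static inline void pmu_start_cycle_counter(void)
{
    uint32_t pmcr;
    /* Performance Monitor Control register (c9, c12, 0): set E (enable) and C (cycle counter reset) */
    __asm__ volatile("mrc p15, 0, %0, c9, c12, 0" : "=r"(pmcr));
    pmcr |= (1u << 0) | (1u << 2);
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 0" : : "r"(pmcr));
    /* Count Enable Set register (c9, c12, 1): bit 31 enables the cycle counter */
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 1" : : "r"(1u << 31));
}

/* Read the Cycle Count register (c9, c13, 0). */
static inline uint32_t pmu_read_cycle_counter(void)
{
    uint32_t ccnt;
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(ccnt));
    return ccnt;
}

Typical use is to read the counter before and after a region of interest and subtract the two values; the four event counters are programmed analogously through the counter selection and event selection registers.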
Cortex-A8 L2 Preload Engine
·       The PLE is not the same as the Direct Memory Access (DMA) engine used in previous ARM processor families, but it has a similar programming interface
·       Moves cache lines to/from L2
·       Two channels
·       Maximum number of cache lines is limited by the cache way size (16 KB in OMAP3)
·       Set of registers to control the PLE (through CP15) in secure privileged mode
·       Transfers only dirty data from L2
·       Supports the ability to lock data to a specific L2 cache way
·       Generates output signals
    o   nDMAIRQ
    o   nDMASIRQ
    o   nDMAEXTERRIRQ
·       Different from PLD, which is a single-cycle hint instruction that preloads a line of data into L2; unlike in the ARMv6 architecture, the PLD instruction does not preload data into the L1 cache (see the prefetch sketch below)
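For completeness, here is a minimal C sketch of software prefetching around a streaming loop. __builtin_prefetch is a GCC/Clang builtin that typically lowers to a PLD instruction on ARM targets, so on Cortex-A8 the touched lines land in L2; the 16-element (one 64-byte line) prefetch distance is an illustrative choice, not a tuned value.

#include <stddef.h>
#include <stdint.h>

int32_t sum_with_prefetch(const int32_t *data, size_t n)
{
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Hint the line one cache line ahead (16 x 4-byte elements = 64 bytes) */
        if (i + 16 < n)
            __builtin_prefetch(&data[i + 16], 0, 0);
        sum += data[i];
    }
    return sum;
}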

