Wednesday, 6 July 2016

Cortex-A8 Architecture



Cortex-A8 Pipeline Diagram
Cortex-A8Pipeline.png


Cortex-A8 Control Diagram
·       High quality branch prediction results in fewer replays and lower power
·       Branch prediction maintains 95% accuracy across a wide range of code
CortexA8-ControlFlow.png
·       Dynamic branch predictor components
    o   512-entry, 2-way BTB
    o   4K-entry GHB indexed by branch history and PC
    o   8-entry return stack
·       Branch resolution
    o   All branches are resolved in a single stage
    o   Maintains speculative and non-speculative versions of the branch history and return stack
Cortex-A8 Instruction Decode
CortexA8-instDecode.png
·       Instruction decode highlights
    o   4-entry pending queue reduces fetch stalls and increases pairing opportunities
    o   Replay queue holds instructions for reissue after a memory system stall
    o   Scoreboard predicts register availability using static scheduling techniques
    o   Cross-checks in D3 allow issue of dependent instruction pairs
Cortex-A8 Instruction Issuing
·       Supported dual-issue pairs
    o   Two ALU instructions
    o   One ALU instruction and one load/store instruction
    o   One multiply/MAC instruction with one ALU, load/store, or NEON data processing instruction
    o   Two NEON data processing instructions
    o   One NEON data processing instruction with one load/store or ALU instruction
·       Some instructions will only issue to pipeline 0
    o   Multiply/MAC instructions
    o   Load/store multiple and other multi-cycle instructions
Cortex-A8 Instruction Execution
CortexA8-InstExec.png
·       Execution pipeline highlights
    o   2 symmetric ALU pipelines: Shift/ALU/SAT
    o   Load/store pipe used by instructions in either pipeline
    o   Multiply instructions are tied to pipe 0
    o   All key forwarding paths supported
    o   Static scheduling allows for extensive clock gating
Cortex-A8 Memory System
·       Harvard Level 1 caches – both 16 KB, 4-way set associative
    o   Single-cycle load-use penalty
    o   Virtually indexed, physically tagged (VIPT)
    o   Level 1 data cache is blocking
        -   Non-NEON read misses in the cache cause replay of subsequent instructions
        -   Reduces complexity in later pipeline stages
        -   Good for power and clock frequency
    o   NEON data is not allocated to L1 (but will read/update L1 if necessary)
·       Integrated 256 KB unified Level 2 cache, 8-way set associative
    o   Dedicated low-latency, high-bandwidth interface to the Level 1 cache
    o   Line length of 16 words (64 bytes); see the alignment sketch after this list
    o   Physically indexed, physically tagged (PIPT)
    o   Minimum latency of 8 cycles
    o   Streams to the NEON processing unit; up to 16 GB/s bandwidth
    o   128-bit data streaming from both L1D$ and L2$
·       64-bit AMBA AXI interconnect to external memory
    o   Supports multiple outstanding memory transactions to minimize memory latency
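The 64-byte line length noted above can be exploited directly in data layout. Below is a minimal C sketch, assuming a GCC/Clang-style aligned attribute; the structure and names are illustrative, not part of any ARM API. Each slot is padded and aligned so that frequently written items do not share a cache line.

#include <stdint.h>

#define CACHE_LINE_BYTES 64   /* 16 words, matching the quoted L1/L2 line length */

/* Hypothetical per-engine counter slot, padded and aligned so that each
 * slot occupies exactly one cache line and no two slots share a line. */
struct counter_slot {
    uint32_t value;
    uint8_t  pad[CACHE_LINE_BYTES - sizeof(uint32_t)];
} __attribute__((aligned(CACHE_LINE_BYTES)));

static struct counter_slot counters[4];   /* one 64-byte line per slot */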





Cortex-A8 Control Coprocessor
·       The processor does not have an external coprocessor interface, but it does implement two internal coprocessors, CP14 and CP15
·       CP14, also known as the debug coprocessor, is used for various debug functions
·       CP15, also known as the system control coprocessor, is used to control and provide status information for the functions implemented in the processor
CortexA8-coProc.png
Cortex-A8 CP15 Register Groups
Function                        CP15 Registers
System Configuration            c0
System Control                  c1
Translation Base Control        c2
Domain Access Control           c3
Faults                          c5/c6
Cache Operations                c7
TLB Operations                  c8/c10
Performance Monitor             c9
L2 Control                      c9
Pre-load Engine                 c11
Interrupts                      c12
Process ID                      c13
Memory Arrays                   c15
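As a concrete illustration of how software reaches these registers, here is a minimal C sketch using GCC-style inline assembly on an ARMv7 target; it reads the c0 Main ID register and the c1 System Control register. The helper names are mine, and the code assumes a privileged execution context, since CP15 accesses fault in user mode.

#include <stdint.h>

/* c0: Main ID Register (MIDR) - identifies the processor */
static inline uint32_t read_midr(void)
{
    uint32_t val;
    __asm__ volatile("mrc p15, 0, %0, c0, c0, 0" : "=r"(val));
    return val;
}

/* c1: System Control Register (SCTLR) - MMU, cache, and branch prediction enables */
static inline uint32_t read_sctlr(void)
{
    uint32_t val;
    __asm__ volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(val));
    return val;
}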
Cortex-A8 Performance Monitor Unit
·       Controlled through CP15 registers (c9)
·       Four counters
·       Counted events include:
    o   Cache misses
    o   TLB misses
    o   Branch mispredictions
    o   Exceptions
    o   External events
    o   Others
·       Interrupt output on counter overflow
Register                        Description
Performance Monitor Control     Controls the operation of the count registers
Count Enable Set                Enables PMU count registers
Count Enable Clear              Disables PMU count registers
Overflow Flag Status            Reads/clears the PMU counter overflow flags
Software Increment              Increments a PMU count register under software control
Performance Counter Selection   Selects a PMU counter
Cycle Count                     Reads/writes the PMU cycle count register
Event Selection                 Selects the event for the PMU to count
Performance Monitor Count       Reads/writes the 4 PMU event count registers
User Enable                     Allows user mode to access the PMU
Interrupt Enable Set            Enables overflow interrupts
Interrupt Enable Clear          Disables overflow interrupts
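To make the register roles above concrete, here is a minimal C sketch, assuming the ARMv7-style CP15 c9 PMU encodings and a privileged context (or that the User Enable register has been set), that starts and reads the cycle counter. The function names are illustrative.

#include <stdint.h>

/* Enable the counters and reset/start the cycle counter. */
static inline void pmu_start_cycle_counter(void)
{
    uint32_t pmcr;
    /* Performance Monitor Control register (c9, c12, 0): set E (enable) and C (cycle counter reset) */
    __asm__ volatile("mrc p15, 0, %0, c9, c12, 0" : "=r"(pmcr));
    pmcr |= (1u << 0) | (1u << 2);
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 0" : : "r"(pmcr));
    /* Count Enable Set register (c9, c12, 1): bit 31 enables the cycle counter */
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 1" : : "r"(1u << 31));
}

/* Read the Cycle Count register (c9, c13, 0). */
static inline uint32_t pmu_read_cycle_counter(void)
{
    uint32_t ccnt;
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(ccnt));
    return ccnt;
}

Typical use is to read the counter before and after a region of interest and subtract the two values; the four event counters are programmed analogously through the counter selection and event selection registers.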
Cortex-A8 L2 Preload Engine
·       The PLE is not the same as the Direct Memory Access (DMA) engine used in previous ARM processor families, but it has a similar programming interface
·       Moves cache lines to/from L2
·       Two channels
·       Maximum number of cache lines is limited by the cache way size (16 KB in OMAP3)
·       Set of registers to control the PLE (through CP15) in secure privileged mode
·       Transfers only dirty data from L2
·       Supports the ability to lock data to a specific L2 cache way
·       Generates output signals
    o   nDMAIRQ
    o   nDMASIRQ
    o   nDMAEXTERRIRQ
·       Different from PLD, which is a single-cycle hint instruction that preloads a line of data into L2; unlike in the ARMv6 architecture, the PLD instruction does not preload data into the L1 cache (see the prefetch sketch below)
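For completeness, here is a minimal C sketch of software prefetching around a streaming loop. __builtin_prefetch is a GCC/Clang builtin that typically lowers to a PLD instruction on ARM targets, so on Cortex-A8 the touched lines land in L2; the 16-element (one 64-byte line) prefetch distance is an illustrative choice, not a tuned value.

#include <stddef.h>
#include <stdint.h>

int32_t sum_with_prefetch(const int32_t *data, size_t n)
{
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Hint the line one cache line ahead (16 x 4-byte elements = 64 bytes) */
        if (i + 16 < n)
            __builtin_prefetch(&data[i + 16], 0, 0);
        sum += data[i];
    }
    return sum;
}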

