APR20/D
Application Optimization for the DSP56300/DSP56600 Digital Signal Processors
Motorola’s High-Performance DSP Technology
TABLE OF CONTENTS
SECTION 1 INTRODUCTION
1.1 DSP56300 CORE FAMILY
1.2 DSP56600 CORE FAMILY
1.3 ENHANCEMENTS OVER THE DSP56000
1.3.1 Instruction Set Enhancements
1.3.2 Architectural Enhancements
1.4 APPLICATION NOTE STRUCTURE
1.4.1 DSP56300 and DSP56600 Features Description and Use
1.4.2 Optimizing the Code for Best Performance
1.4.
SECTION 4 USING THE DMA
4.1 INTRODUCTION
4.2 CONSERVING CORE MIPS BY WORKING IN PARALLEL
4.3 USING SLOW, LOW-COST MEMORIES
4.4 SERVICING A PERIPHERAL
4.5 DATA TRANSFER OPTIMIZATION HINTS
SECTION 5 INSTRUCTION CACHE AND MEMORY FEATURES
5.1 THE INSTRUCTION CACHE
6.3 STACK EXTENSION DELAYS
6.3.1 Stack Extension Full/Empty Cases
6.3.2 Avoiding Stack Extension Delays
6.4 PROGRAM FLOW-CONTROL PIPELINE INTERLOCKS
6.4.1 What are the Program Flow-Control Pipeline Interlocks?
6.4.1.1 MOVE to the Status Register (SR)
7.4.1 Dual Data Spaces
7.4.2 Using the TFR instructions
7.4.3 Clearing Registers
APPENDIX A SAVING POWER
A.1 LOW POWER MODES
A.1.1 Wait Standby Mode
A.1.2 Stop Standby Mode
A.1.3 Low-Power Clock Divider
A.
LIST OF FIGURES
Figure 2-1 The Fast Normalization Operation for the DSP56300
Figure 2-2 48 × 48-bit Multiplication with 48 Bits of the Result Kept
Figure 3-1 State of the Stack When IRQA Is Serviced
Figure 4-1 DMA Addressing Modes for SCI Transmitters
Figure 5-1 DSP56302 Memory Maps
LIST OF TABLES
Table 1-1 New Instructions in DSP56300 and DSP56600
Table 2-1 Parallel Move Instructions
Table 2-2 Registers Used in Parallel XY Moves
Table 2-3 Registers used in Long Addressing
Table 2-4 Data Operations Using Multi-shift
Table 2-5 Bit manipulation instructions
Section 1 INTRODUCTION

The DSP56300 and DSP56600 are the new high-performance 24-bit and 16-bit cores in Motorola’s family of Digital Signal Processors. Both are based on the same pipeline structure, which can execute an instruction on every clock cycle. At the same time, these cores maintain a Harvard architecture and a programming model similar to those of the older 24-bit DSP56000 core.
• Position Independent Code (PIC) instruction-set support
• Unique DSP addressing modes
• On-chip memory-expandable hardware stack
• Nested hardware DO loops
• Fast auto-return interrupts
• On-chip instruction cache
• On-chip concurrent six-channel DMA controller
• On-chip Phase Lock Loop (PLL)
• On-Chip Emulation (OnCE) module
• Program address tracing support
• JTAG port compatible with the IEEE 1149.
The first DSP chips that use the DSP56600 core are the DSP56602 and the DSP56603. The main differences between these derivatives are the size of the on-chip memory and the types of on-chip peripherals.

1.3 ENHANCEMENTS OVER THE DSP56000

The DSP56300 and the DSP56600 include many architectural enhancements over the older-generation 24-bit DSP family, the DSP56000. The following tables briefly describe these enhancements.

1.3.
Table 1-1 New Instructions in DSP56300 and DSP56600

Opcodes      Description                             Exist in DSP56300?  Exist in DSP56600?
MAC (uu)     Unsigned MAC                            √                   √
DMAC         Double-Precision MAC                    √                   √
PLOCK        Lock Cache Sector                       √
PUNLOCK      Unlock Cache Sector                     √
PFLUSH       Flush Cache Sectors                     √
PFLUSHUN     Flush Unlocked Cache Sectors            √
PFREE        Free all locked sectors                 √
LRA          Load Relative Address                   √                   √
BSR / BScc   Branch Subroutine always/conditionally  √                   √
BRA / Bcc    Branch Target
1.3.2 Architectural Enhancements

The programmer’s model of the new DSP cores was also enhanced by the following:
• An instruction cache controller was added to the DSP56300. A Burst mode can be used to lower the off-chip traffic if external DRAMs are used.
• A six-channel DMA controller was added to the DSP56300.
• A true barrel shifter (56-bit in DSP56300 and 40-bit in DSP56600) was added to support multibit operations.
1.4 APPLICATION NOTE STRUCTURE

This document has three main component parts:
• DSP56300 and DSP56600 features description and use
• Optimizing the code for best performance
• Appendices

1.4.1 DSP56300 and DSP56600 Features Description and Use

The first five sections in this application note describe all the architectural and instruction set enhancements in the new DSP cores and how they can be used to optimize applications.
• Section 4: Using the DMA
  – How to reduce core MIPS by using the DMA
  – How to service peripherals using the DMA
  – How to use slow, inexpensive memory chips without losing performance
  – How to handle complex data structures by using the DMA
• Section 5: Instruction Cache and Other Memory Features
  – Basic instruction cache tutorial
  – Data organization for efficient sector allocation
  – Sector locking for critical loops
  – Flushing the cache after task swit
Introduction Application Note Structure 1.4.
Section 2 DATA OPERATIONS

2.1 USING THE DUAL DATA PATHS

The DSP56300/DSP56600 core can execute a new instruction every clock cycle. This performance can be used efficiently only if data can be fed to the core and its results moved out of it at a sufficient rate.
Data Operations Using the Dual Data Paths There are two ways to generate the operand addresses for parallel moves: • XY addressing—Two address registers are used independently, one generating an operand address for the X memory and the other for the Y memory. The FIR example above is of this kind. The address registers must be of different “banks”, meaning that if an address register R0–3 is used for one data field, an address register R4–7 should be used for the other data field.
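The bank restriction on XY addressing can be sketched as a small check. This is an illustrative Python model rather than DSP code; the function names `bank` and `valid_xy_pair` are hypothetical and exist only for this sketch.

```python
def bank(reg_index: int) -> int:
    """Return 0 for address registers R0-R3 and 1 for R4-R7."""
    if not 0 <= reg_index <= 7:
        raise ValueError("address register index must be in 0..7")
    return reg_index // 4


def valid_xy_pair(x_reg: int, y_reg: int) -> bool:
    """True if the two address registers come from different banks,
    as the XY addressing rule in the text requires."""
    return bank(x_reg) != bank(y_reg)
```

For example, R0 paired with R4 passes the check, while R1 paired with R3 does not, because both belong to the R0–3 bank.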
Table 2-1 Parallel Move Instructions (Continued)
Instruction  Mnemonic  Relevant Opcode variants
Arithmetic Shift Accumulator Right ASR Single bit, non-immediate
Clear Accumulator CLR
Compare CMP
Compare Magnitude CMPM
Logical Exclusive OR EOR
Logical Shift Left LSL
Logical Shift Right LSR
Multiply and Accumulate MAC Signed Multiply and Accumulate and Round MACR
Transfer by Signed Value MAX
Transfer by Magnitude MAXM
Signed Multiply MPY
Signed Multiply
Table 2-1 Parallel Move Instructions (Continued)
Instruction  Mnemonic  Relevant Opcode variants
Transfer Data ALU Register TFR
Test Accumulators TST

Parallel moves are also restricted in the registers they can use as source and destination: only a subset of the Data ALU registers is available. The register options available for XY Addressing are listed in Table 2-2.
Table 2-3 Registers used in Long Addressing

Assembler  X Field  Y Field  Shifting/Limiting  Sign extension   Zero fill
Syntax                       if source          if destination   if destination
A10        A1       A0       no                 no               no
B10        B1       B0       no                 no               no
X          X1       X0       no                 no               no
Y          Y1       Y0       no                 no               no
A          A1       A0       yes                A2               no
B          B1       B0       yes                B2               no
AB         A1       B1       yes                A2,B2            A0,B0
BA         B1       A1       yes                A2,B2            A0,B0

Keeping those restrictions in mind, writing a critical data processing loop efficiently should b
Data Operations 16-bit Arithmetic Mode (DSP56300 Only) 2.2 16-BIT ARITHMETIC MODE (DSP56300 ONLY) The 16-bit Arithmetic mode causes the Data ALU to use only 16 bits of the 24-bit data in transfers and calculations, allowing use of the DSP56300 as a 16-bit data processor. The 16-bit data is right aligned in the memory, but left aligned in data registers (in order to comply with the fractional numerical representation convention).
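The alignment rule for 16-bit Arithmetic mode can be illustrated with a short Python model. It is a sketch under stated assumptions: the upper 8 bits of the memory word are ignored and the low 8 bits of the 24-bit register portion are zero-filled, neither of which is spelled out in the text above. All function names are hypothetical.

```python
WORD_MASK = 0xFFFFFF  # one 24-bit word


def mem_to_reg_16(mem_word: int) -> int:
    """Right-aligned 16-bit data in memory becomes left-aligned in a register.
    Assumption: the low 8 register bits are zero-filled."""
    return ((mem_word & 0xFFFF) << 8) & WORD_MASK


def reg_to_mem_16(reg_word: int) -> int:
    """Left-aligned 16-bit data in a register becomes right-aligned in memory."""
    return (reg_word >> 8) & 0xFFFF


def as_fraction(value: int, bits: int) -> float:
    """Interpret a two's-complement pattern of the given width as a fraction in [-1, 1)."""
    sign = 1 << (bits - 1)
    signed = (value ^ sign) - sign
    return signed / sign
```

Left alignment keeps the fractional value unchanged: the 16-bit pattern $4000 and the 24-bit pattern $400000 both represent 0.5 in this model.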
2.3 THE MAX INSTRUCTION

MAX is a new instruction in the DSP56300 and DSP56600 instruction set that can be used to enhance performance in critical data operation loops. For example, max a,b compares the two accumulators and places the larger value in the destination accumulator (accumulator B). The MAXM instruction does the same, except that it transfers the operand with the larger absolute value to the destination.
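The behavior described for MAX and MAXM can be sketched in Python as follows. This is a model of the description above, not of the hardware: the function names are hypothetical, and the handling of ties (the source operand wins) is an assumption that the text does not state.

```python
def max_ab(a: int, b: int) -> int:
    """Model of 'max a,b': return the new value of the destination accumulator B."""
    return a if a >= b else b


def maxm_ab(a: int, b: int) -> int:
    """Model of 'maxm a,b': B receives the operand with the larger magnitude.
    Assumption: the signed operand itself is kept, and ties favor A."""
    return a if abs(a) >= abs(b) else b
```

In a peak-search loop, one call per sample would replace a compare followed by a conditional transfer.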
Data Operations Using the barrel shifter 2.4 USING THE BARREL SHIFTER The DSP56300/DSP56600 includes a true barrel shifter that can be used for multi-bit data shifts. The instructions that use the barrel shifter are listed in Table 2-4.
will normalize A, so that in the DSP56300 its leading one or zero will be shifted to Bit 46 in the accumulator. If |A| > 1 (meaning that it spilled to the extension A2), then CLB returns a positive number (between 1 and 8). If |A| < 1, CLB returns a zero or a negative number (between –47 and 0). The two cases in Figure 2-1 exemplify the normalization operation for the DSP56300.
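The relationship between the CLB result and the normalizing shift can be sketched with a Python model of a 56-bit accumulator. It is built only from the ranges quoted above (results from –47 to 8, leading bit moved to Bit 46). The names `clb_model` and `normalize` are hypothetical, and the result for an all-zero accumulator is an assumption; the Family Manual should be consulted for the exact definition.

```python
ACC_BITS = 56    # accumulator width in the DSP56300
TARGET_BIT = 46  # bit position the leading one or zero is normalized to


def clb_model(acc: int) -> int:
    """Return the shift count that brings the leading bit of `acc` to bit 46.
    Positive results mean a right shift, negative results a left shift."""
    if not -(1 << (ACC_BITS - 1)) <= acc < (1 << (ACC_BITS - 1)):
        raise ValueError("value does not fit in a 56-bit accumulator")
    if acc == 0:
        return 0  # assumption: the text does not cover the all-zero case
    # For negative values the leading *zero* matters, so invert the pattern.
    pattern = acc if acc > 0 else ~acc
    leading = pattern.bit_length() - 1  # -1 when acc == -1 (all ones)
    return leading - TARGET_BIT


def normalize(acc: int) -> int:
    """Apply the shift suggested by clb_model (arithmetic shift)."""
    count = clb_model(acc)
    return acc >> count if count > 0 else acc << -count
```

A value that has spilled into the extension (leading one at bit 54) yields 8, and an accumulator holding –1 (all ones) yields –47, matching the two ends of the range given in the text.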
Data Operations BIt manipulation instructions move move move do move normf normf move _ENDLOOP move 2.
Data Operations Double precision arithmetic is specified by its width (in bits) and its starting position (in bits, relative to the LSB of the accumulator). The width and position values could be prepared using the MERGE instruction, which merges data from two data registers in the appropriate positions for future use as a control operand for EXTRACT and INSERT. The EXTRACT instruction extracts the specified field, right-aligns it, and sign-extends it in the destination accumulator.
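The EXTRACT operation described above can be modeled in Python. This sketch takes the width and offset as separate arguments and does not attempt to reproduce the control-operand encoding that MERGE prepares, which is not given here; the function name is hypothetical.

```python
ACC_BITS = 56  # accumulator width assumed for the DSP56300


def extract_field(acc: int, width: int, offset: int) -> int:
    """Extract `width` bits starting `offset` bits above the LSB,
    right-align the field, and sign-extend it."""
    if width <= 0 or offset < 0 or width + offset > ACC_BITS:
        raise ValueError("field must lie inside the accumulator")
    field = (acc >> offset) & ((1 << width) - 1)  # right-aligned raw field
    sign_bit = 1 << (width - 1)
    return (field ^ sign_bit) - sign_bit          # two's-complement sign extension
```

For instance, a 4-bit field holding binary 1011 is returned as –5, while a field holding 0101 is returned as 5.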
Data Operations Double precision arithmetic 23 × 0 23 0 X1 23 X0 0 23 0 Y1 Y0 47 0 X0(u) • Y0(u) 47 0 Y1(s) • X0(u) + 47 0 X1(s) • Y0(u) 47 0 X1(s) • Y1(s) 47 0 Result accumulator AA0832 Figure 2-2 48 × 48-bit Multiplication with 48 Bits of the Result Kept. The (U) means an unsigned operand, and the (S) a signed operand. The following four instructions perform the operation in full: ;48x48 bit multiplication with 48 bit result.
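As a cross-check of the arithmetic in Figure 2-2, the four partial products and the two 24-bit right shifts can be modeled in Python. This is an integer sketch, not the instruction sequence itself: it assumes unlimited accumulator precision and ignores the fractional left shift that the Data ALU applies, and the function names are hypothetical.

```python
MASK24 = (1 << 24) - 1


def split48(value: int):
    """Split a signed 48-bit integer into (signed upper 24 bits, unsigned lower 24 bits)."""
    return value >> 24, value & MASK24


def mul48_upper(x: int, y: int) -> int:
    """Accumulate the four partial products of Figure 2-2, shifting the
    running sum right by 24 bits between stages."""
    x1, x0 = split48(x)
    y1, y0 = split48(y)
    acc = x0 * y0                # X0(u) * Y0(u)
    acc = (acc >> 24) + y1 * x0  # shift right 24 bits, add Y1(s) * X0(u)
    acc += x1 * y0               # add X1(s) * Y0(u)
    acc = (acc >> 24) + x1 * y1  # shift right 24 bits, add X1(s) * Y1(s)
    return acc
```

In this integer model the result equals the full 96-bit product shifted right by 48 bits, so the bits discarded by the staged shifts never influence the part of the result that is kept.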
Data Operations Using Less Straight-Forward Instructions The features that help in this case are: • The ability to specify combinations of signed and unsigned operands • The 24-bit right arithmetic shifting inherent in the DMAC instruction Using these instruction combinations, and others, enables the programmer to build other multi-register arithmetic operations. The user is referred to Appendix A of the DSP56300 and DSP56600 Family Manuals for the full documentation of the various instruction options. 2.
Data Operations Using Less Straight-Forward Instructions ;determine partial 6th term mpy -x1,y0,b rnd b move b,x1 ;determine 5th term and add its contribution mpy -#$5000,y1,b ;b = 0 - (swTemp4 x ;TERMS_MULTIPLIER) add b,a ;determine 6th term and add its contribution macr #$7000,x1,a ;swSqrtOut is contained in a rts In this example, the ADDR and MPYR instructions replace a few instructions in the original code causing some reduction in total cycle count: sqroot ;determine 2nd term and add contribution
Section 3 PROGRAM CONTROL

3.1 HARDWARE LOOPS

Hardware looping is one of the strongest features of the DSP56300/DSP56600 core families. Loop counter management and end-of-loop testing are done by hardware in parallel with instruction execution, saving the execution time of the control software that would otherwise be needed. This lets the user obtain more performance in critical loops, and also brings program writing closer to high-level languages.
A common programming technique is known as “loop unrolling”, in which a high-level loop is replaced by the inner loop code, repeated N times, thus saving the time needed to decrement the counter, test for the end of the loop, and jump back to the top.
Program Control The Hardware Stack Note: The BRKcc instruction has the same functionality as the C language “break”, (i.e., terminating the loop and resuming execution after the end of the loop). A similar instruction is the ENDDO instruction, which exits the loop after finishing the current loop iteration. ENDDO is not a conditional instruction, therefore normal use generally includes testing a condition and skipping the ENDDO instruction accordingly.
The current stack location is pointed to by the SP register. A single stack location can store two words, referred to as occupying the “high” and “low” halves of the stack location. The current stack locations pointed to by SP (top of stack) are named SSH and SSL, respectively. A single “push” or “pop” activity can access the SSH and SSL concurrently. Stack activities are triggered implicitly by the execution of specialized instructions or the fulfillment of certain conditions.
Table 3-1 Implicit Stack Activity (Continued)

Activity: exit DO loop immediately
Triggered by Instruction or Condition: BRKcc (condition true)
Implicit Stack Actions Taken: PC := LA + 1; SR := SSL; SP := SP – 1; LA := SSH, LC := SSL; SP := SP – 1

The next example shows loop and subroutine nesting. Figure 3-1 shows the state of the stack at the time the fast interrupt is executing (label I_IRQA, that enters execution when PC = $000529).
;example of loop and subroutine nesting.
;interrupt definitions: fast interrupt from IRQA_
        org     p:I_IRQA
        bset    #5,x:(r0)
        nop
        ...
;program area
;after jsr execution, sp == 1,
;execution continues at _SUB1
        jsr     _SUB1
        ...
        ...
        ...
_SUB1
        do      #6,_LOOP1           ;after instruction, sp == 3
        ...
        do      forever,_LOOP2      ;after instruction, sp == 5
        btst    #0,x:(r0)
        brkcs                       ;if condition true, resume at
                                    ;_LOOP2,and sp == 3.
        move    a0,x:(r1)+          ;<---- irqA occurs here.
        move
        move
_LOOP2
3.3 USING THE STACK EXTENSION

The hardware stack can be extended into the data memory (X or Y), and its depth can be set by the user according to need. After initialization, the stack extension works automatically without any user overhead, giving the same functionality as the hardware stack. The registers participating in stack extension operation are listed in Table 3-2.
Program Control Using the Stack Extension following formula, which takes into account that each increment in SZ corresponds to two memory locations: SZ = available memory extension size ⁄ 2 + 14 For example, if the memory extension space available is 1024 words, SZ should be set to 1024/2 + 14 = 526. SZ should be set to an even number since stack extension transfers are done in pairs. SC is a 5-bit register that stores the number of entries in the hardware stack.
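The SZ formula above can be written as a small helper for use in a build script or configuration check. This is a Python sketch; the names `sz_for_extension` and `sz_is_even` are hypothetical, and the constant 14 is taken directly from the formula in the text.

```python
HARDWARE_TERM = 14  # constant term from the SZ formula in the text


def sz_for_extension(extension_words: int) -> int:
    """SZ = available memory extension size / 2 + 14.
    Each increment of SZ corresponds to two memory locations."""
    if extension_words < 0 or extension_words % 2:
        raise ValueError("extension size must be a non-negative even word count")
    return extension_words // 2 + HARDWARE_TERM


def sz_is_even(sz: int) -> bool:
    """The text asks for an even SZ because transfers are done in pairs."""
    return sz % 2 == 0
```

With 1024 words available the helper returns 526, the value worked out in the text. An extension of 1022 words would give an odd SZ, which the evenness rule advises against.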
Program Control Using the Stack Extension Table 3-3. The use of SP bits for stack status when the stack extension is disabled, instead of OMR for both cases, is for code compatibility with the 56K family. The user’s stack error interrupt routine should test the SEN bit (Stack Extension Enable) in OMR to know what register to consult for stack status information. Table 3-3 Stack Status Information Stack Status Bit Extension Info.
Program Control Task Switching with the Stack Extension 3.4 TASK SWITCHING WITH THE STACK EXTENSION A multi-tasking operating system using the stack extension should ensure stack coherence when switching from one task to the other. Here is a possible task switching scenario: 1. During the execution of the task “T1”, a “time-out” interrupt occurs indicating the need to replace the active task with task “T2”.
5. In order to activate the new task T2, the Operating System dispatcher should first restore the task T2 programming model:

        move    #T2_task_reg_area,r7        ;Load pointer.
        move    x:(r7)+,x0                  ;Restore registers...
        ....
        move    r7,n0                       ;save pointer
        move    x:(r7)+,r7                  ;Restore r7 w/ T2 data
        move    x:(r7)+,x:OS_r7_temp        ;Keep r7.
        move    n0,r7                       ;restore pointer
        move    x:(r7)+,n0                  ;Restore n0
        move    x:(r7)+,n1                  ;Restore n1
        ....
        move    x:(r7)+,lc
        move    x:(r7)+,la

6.
specify that the instruction will update the CCR (according to the result and only if it is executed), by writing “.U” at the end of the condition attribute. For example:

        add     x0,a    IFne.U

The full set of condition mnemonics may be used, which helps program clarity and flexibility. The condition table can be found in Appendix A of the DSP56300 and DSP56600 Family Manuals.
        add     b,x1
        bra     _CONT
_TRUE   add     b,x0
_CONT   .....

Using conditional instructions, the code can be written more compactly, as listed below:

        cmp     Y0,a
        add     b,x0    IFeq
        add     b,x1    IFne

The only difference between the two versions is that in the latter the Status Register is not updated according to the calculation result. Conditional execution with CCR update may in some cases solve the problem, as in the following example:

        btst    #0,a0
        asr     a       IFcs.
In absolute addressing, the argument is the numerical value of the address. In PC relative addressing, the argument is the displacement of the address relative to the PC. For both absolute and PC relative addressing, the address argument can be specified in one of four ways:
1. explicit, as part of the 1-word opcode (restricted to short arguments),
2. explicit, as a second program word,
3. stored in a register, or,
4.
Program Control PC Relative Instructions Table 3-5 Instructions with Program Memory Arguments The Address Argument Function jump to subroutine Address Argument destination jump to subroutine on destination CCR condition jump if bit clear/set destination jump to subroutine if bit clear/set destination DO loop last address Mnemonic Encoded in the opcode (total 1 w) JSR address < 4096 + – + BSR –257 < disp < 256 + + – JScc address < 4096 + – + BScc –257 < disp < 256 + + – JCLR,
Table 3-5 Instructions with Program Memory Arguments

Note: 1. The LRA opcode can only add a displacement to the PC. The Assembler translates the absolute address to displacement from the Location Counter, so the two modes (absolute address/ displacement register) are the same from the machine’s point of view.
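The translation from an absolute address to a PC-relative displacement, and the short-form range that Table 3-5 gives for BSR and BScc, can be sketched in Python. The function names are hypothetical, and measuring the displacement from the location counter of the instruction itself is an assumption of this sketch.

```python
def displacement(target: int, location_counter: int) -> int:
    """Displacement the assembler would derive from an absolute target address.
    Assumption: measured from the location counter of the instruction."""
    return target - location_counter


def fits_short_form(disp: int) -> bool:
    """Short (1-word) range listed in Table 3-5 for BSR/BScc: -257 < disp < 256."""
    return -257 < disp < 256
```

A branch from $000100 to $000120 has a displacement of $20 and fits the short form, whereas a displacement of 300 would need the 2-word version.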
Program Control Using Fast Interrupts instructs the assembler to try and compact the argument into a 1-word opcode. Without it, the assembler may use the 2-word version. Note: If a 2-word opcode is used, the value of _CONT1 is 3 (and not 2). This is because the extra word pushes the RTS instruction address one position forward. 3.
Program Control Using Fast Interrupts ready to transmit another word and expects the core to move data (generally from the memory) to the transmit register. Both these actions include moving data from one memory-mapped register to another, which in many processors can be done in 2 instructions only through an intermediate core register that must be kept ready continuously in anticipation of that event. For this reason the MOVEP instruction (move to/from peripheral) is included in the instruction set.
Program Control Using Fast Interrupts org movep p:I_SI0RD x:<
Section 4 USING THE DMA

4.1 INTRODUCTION

The DSP56300 DMA is a powerful functional block for moving data. It has special registers and data paths that enable it to perform various transfer tasks without stalling the core. Its main features are:
• Parallel operation with the core
• Complex address calculation modes
• Transfer triggered by peripheral events, external interrupts or software

This section describes the main DMA features and how they can be used to enhance performance.
Using the DMA Conserving Core MIPS by Working In Parallel The core may contend with the DMA in one of two cases only: 1. Accessing the same internal memory block (contiguous 256 RAM words or 3 K ROM words), in which case the DMA stalls—otherwise simultaneous core and DMA access to the internal memory is possible without any delays 2.
        move    #0,m1                   ;modulo 1. each increment, R1 will flip
                                        ;between two consecutive addresses.
        move    #BASE_AREA1,x:(r1)+
        move    #BASE_AREA2,x:(r1)+     ;R1 will now point to
                                        ;the address storing BASE_AREA1.
                                        ;until changed, x:(r1) stores the
                                        ;base address of the current core
                                        ;processing area;
                                        ;x:(r+1) stores the base address of the
                                        ;DMA area.
Another possible application of this kind is in a multi-tasking operating system: the DMA can be periodically activated by the timer to load the program of the next process while the core executes another code segment.

4.3 USING SLOW, LOW-COST MEMORIES

In many systems, data that is stored in external memory is not frequently used, and can be loaded at a relatively slow rate. In principle, this permits the use of slow, low-cost memories.
Using the DMA Using Slow, Low-Cost Memories memory locations. In a read access, the 3 bytes are concatenated to one 24-bit word that is written to the destination by the DMA. In a write access, the 24-bit word is unpacked to 3 single-byte write accesses. After initialization, all this activity is done automatically without software overhead.
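The packing and unpacking that the DMA performs for byte-wide memories can be illustrated in Python. The byte order used here (first byte least significant) is an assumption made for the sketch; the text above does not state which byte lands in which part of the 24-bit word. The function names are hypothetical.

```python
def pack_bytes(b0: int, b1: int, b2: int) -> int:
    """Concatenate three bytes into one 24-bit word.
    Assumption: b0 is the least significant byte."""
    for b in (b0, b1, b2):
        if not 0 <= b <= 0xFF:
            raise ValueError("each input must be a single byte")
    return b0 | (b1 << 8) | (b2 << 16)


def unpack_word(word: int):
    """Split a 24-bit word into three single-byte values (same byte order)."""
    if not 0 <= word <= 0xFFFFFF:
        raise ValueError("input must be a 24-bit word")
    return word & 0xFF, (word >> 8) & 0xFF, (word >> 16) & 0xFF
```

A round trip through both functions returns the original word, which mirrors the read and write paths described above.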
;DAM[5:3] 101   destination address post-increment
;DAM[2:0] 000   source address: 2D with offset register 0
;DS[3:2]  01    transfer destination: y memory.
;DS[1:0]  01    transfer source: y memory.
        movep   #$580285,x:M_DCR0       ;load control register.
;============ main program
        ...
        bset    #23,x:M_DCR0
        ...
;============ interrupt definition
        org     p:I_DMA0
        jsr
In the following example, the DMA receives data from the ESSI and passes it to a memory buffer. The DMA interrupts the core only after the buffer is filled. Each ESSI request triggers a transfer of one word. After N words are transferred, the DMA is disarmed and interrupts the core. The core re-arms the DMA at the end of the interrupt routine.

;================= initialize DMA
        movep   #M_RX0,x:M_DSR0         ;address of ESSI receive
                                        ;register is transfer source
                                        ;address.
Note: Until it services the data processing interrupt raised after the buffer is filled, the core does not allocate any resources (registers or processing time) to the data acquisition that is going on in the background.

The DMA’s flexible addressing modes can also be used to support special data structures and I/O mapped addresses. Consider the SCI, which can only transmit and receive serial data that is 8 bits long.
Using the DMA Servicing a Peripheral Mx), and the MIPS required to process a fast interrupt for every 24-bit word transfer. • Using the DMA 3-D Addressing mode to increment the source address after the basic 3-word transfer—The core is interrupted only after N words are transmitted to fill the buffer again. The “cost” is one DMA channel and four offset registers.
Figure 4-1 DMA Addressing Modes for SCI Transmitters (figure labels: Destination, Source; TX_BUF, TX_BUF + 1, TX_BUF + 2, ..., TX_BUF + SIZE – 1; SRXL, SRXM, SRXH; X I/O Space, Y Space)

The following assembler code is needed for this configuration.
        movep   #0,x:M_DOR2             ;offset register 2,
                                        ;added every word
                                        ;(DCOL) to destination address.
        movep   #-2,x:M_DOR3            ;offset register 3,
                                        ;added every 3 words
                                        ;(DCOM) to destination address
;DMA ch.0 control word:% 0 1 001 10 0 01111 1 111 0 00 01 00
;DE   0     channel not armed (yet)
;DIE  1     DMA interrupts enabled
;DTM  001   word transfer triggered by request source,
;           DE disarmed at end of count.
;DPR  10    chann.
4.5 DATA TRANSFER OPTIMIZATION HINTS

Some points should be borne in mind when optimizing the code for performance:
• While transferring words between two data memory locations takes approximately the same number of cycles whether it is done by software or by DMA, the DMA has an advantage when transferring to or from program memory. This is due to the 6 cycles required for every software access (MOVEM instruction) to program memory.
Section 5 INSTRUCTION CACHE AND MEMORY FEATURES

The DSP56300 supports running programs from the external memory, but each fetch of a program word inserts wait states (depending on the memory type, with a minimum of one wait state per fetch). The performance of such a program may be severely impaired, but the user can reduce system cost by using slower and cheaper memory devices, such as slow EPROMs and Dynamic RAMs.
Instruction Cache and Memory Features The Instruction Cache Activating the cache requires only setting the CE bit in the SR. The following instruction activates the cache: bset #19,SR Because of pipelining, allow four instructions to execute before assuming the cache is active. Disabling the cache is done by clearing that bit. Note: For obvious reasons, the user should not enable the cache while running from the cacheable memory area itself.
preceding it. The DO instruction, being a 2-word instruction, suffers the wait states of two fetches. Instructions in a loop are re-fetched on each iteration, with the wait states inserted each time. The column in Table 5-1 labeled “hit cycles” is the number of cycles needed for the execution of the instructions if they were run from internal memory or were cache hits.
Instruction Cache and Memory Features The Instruction Cache independent addresses. The instruction addresses must have one of the allocated tag fields, and only eight different tag fields can be allocated at any given time. An application that depends on the cache for efficient execution should be designed taking into account the sector allocation.
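The effect of the eight-tag limit on a program’s working set can be explored with a toy Python model. It is only a sketch: it tracks sector allocation by tag, assumes least-recently-used replacement (the replacement policy this application note refers to when it describes sector unlocking), and ignores locking and the per-word valid state. The class and attribute names are hypothetical.

```python
from collections import OrderedDict

NUM_SECTORS = 8  # only eight different tag fields can be allocated at a time


class SectorCacheModel:
    """Toy model of sector allocation by tag with LRU replacement."""

    def __init__(self) -> None:
        self._tags = OrderedDict()  # oldest (least recently used) entry first
        self.allocations = 0        # number of times a sector had to be allocated

    def access(self, tag: int) -> bool:
        """Touch the sector holding `tag`; return True if it was already allocated."""
        if tag in self._tags:
            self._tags.move_to_end(tag)      # mark as most recently used
            return True
        if len(self._tags) == NUM_SECTORS:
            self._tags.popitem(last=False)   # replace the least recently used sector
        self._tags[tag] = True
        self.allocations += 1
        return False
```

In this model, a loop that cycles through eight tags allocates each sector once, while a loop that cycles through nine tags re-allocates a sector on every access. This is the kind of layout problem that sector-aware code placement is meant to avoid.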
and written over. Locking a sector is useful for time-critical code sections that should execute at maximum speed whenever called. Locking them avoids the need to re-allocate a sector and re-load it by slow fetches.
2. Unlock a sector (PUNLOCK, PUNLOCKR): Make the sector available again for the LRU replacement algorithm. The unlocked sector is considered “most-recently-used”, that is, last in line for replacement.
3.
Instruction Cache and Memory Features The Instruction Cache Notes: 1. Disabling the cache controller and enabling it again implicitly flushes the cache. Data stored in the cache prior to its activation cannot be accessed as “hits”. A program section cannot be copied into the cacheable array for use as cached instructions. 2. The user should refrain from using old data from the cacheable array after the cache is disabled. 5.1.
Instruction Cache and Memory Features The Instruction Cache controlled by the user. In a program segment that advances consecutively (no change of flow), the fetches will be done in groups of four, initiated by instructions with addresses ending with “00”. The following example is of a program that uses the same DRAM for the program and the data. The DRAM has 2 wait states for in-page access and 8 wait states for out-of-page access.
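The grouping of burst fetches in straight-line code can be sketched in Python. The sketch assumes that “addresses ending with 00” refers to the two least significant address bits, and it only identifies which addresses would initiate a burst; it does not model wait states. The function names are hypothetical.

```python
BURST_LENGTH = 4  # fetches are done in groups of four


def starts_burst(address: int) -> bool:
    """True if the two low address bits are 00 (assumed meaning of 'ending with 00')."""
    return address % BURST_LENGTH == 0


def burst_starts(first_address: int, word_count: int):
    """Addresses within a consecutive program segment that initiate a burst."""
    return [a for a in range(first_address, first_address + word_count)
            if starts_burst(a)]
```

For an eight-word segment starting at $100, bursts are initiated at $100 and $104.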
Instruction Cache and Memory Features The Instruction Cache Table 5-2 Cycle Count Example With and Without Burst Mode Burst Mode Disabled No Burst Mode Enabled Instruction External External Cyc Cyc Accesses Accesses i11 mac x0,y0,a i12 macr x0,y0,a i13 nop i14 move i15 1do,1di,1po 21 2di 6 1do,1di,1po 21 2di,1po,3pi 24 1do,1po 18 1do 9 1do,1di,1po 21 2di 6 nop 1pi 3 — i16 nop 1pi 3 1po,3pi 18 i17 nop 1do 9 1do 9 TOTAL: 257 — 186 1po 9 1po,3pi 15 i0 x:(
Instruction Cache and Memory Features Memory Switch When the same code is run in Burst mode, every fourth external fetch is replaced by four fetches, and the other fetches are cache hits. In a hit state the internal fetch and instruction execution take (in this example) 1 cycle. This cycle may be in parallel to an external data access, so the total cycle count of such an instruction will be equal to the cycle count of the external data access.
Figure 5-1 DSP56302 Memory Maps (panels: Default Mode and Memory Switch Mode; Program Memory, X Memory and Y Memory maps showing Internal Memory, Icache and External Memory regions; boundary addresses shown: $6000, $5C00, $5000, $4C00, $1C00, $1400)
Possible advantages of using the Memory Switch mode:
1. A program may dynamically change its internal memory map according to need.
2. A system may be developed with the intention of using Program and data ROM in the end product. Using a ROM allows much more data to be placed on chip. Development is done with a RAM-based emulation version, which has a much smaller internal memory.
Instruction Cache and Memory Features Using the Bootstrap ROM memory port, etc. The boot program initializes the relevant port, then starts reading data in. The program generally interprets the first two 24-bit words that are read as the number of words to be read, and the internal program memory address destination, respectively. That number of data words is read and written to the internal memory starting at the given address. The boot program then passes program control to that address.
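The stream layout that the boot program expects can be sketched as a host-side Python check, for example to validate an image before it is placed in a boot device. The function name is hypothetical, and the model covers only the general case described above (word count, destination address, then the program words).

```python
def load_boot_stream(words):
    """Interpret a boot stream: word 0 is the number of program words,
    word 1 is the destination address, and the program words follow.
    Returns (memory image as {address: word}, entry address)."""
    words = list(words)
    if len(words) < 2:
        raise ValueError("stream must contain a word count and a destination address")
    count, destination = words[0], words[1]
    if len(words) < 2 + count:
        raise ValueError("stream is shorter than the declared word count")
    payload = words[2:2 + count]
    memory = {destination + i: word for i, word in enumerate(payload)}
    return memory, destination  # control is passed to the destination address
```

A stream declaring three words at destination $100 yields an image covering $100 to $102 with an entry address of $100.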
Section 6 PIPELINE INTERLOCKS

Due to the pipelined nature of the DSP56300 and DSP56600 Cores, certain instruction sequences cause a delay in execution. There are seven types of instruction sequence delays:
• External Bus Wait States
• External Bus Arbitration
• Instruction fetch delays
• Data ALU Pipeline Interlocks
• Address Generation Pipeline Interlocks
• Stack Extension Delays
• Program Flow-Control Pipeline Interlocks

This section describes various Pipeline Interlocks and suggests ways to avoid them.
Pipeline Interlocks Data ALU Pipeline Interlocks 6.1.1 What are the Data ALU Pipeline Interlocks? There are three types of Data ALU pipeline interlocks: • Arithmetic Interlock—An arithmetic interlock causes a single cycle delay in the execution of the MOVE instruction. It is caused by moving the contents of (or one part of) an accumulator that was the destination in the preceding arithmetic instruction. Example: mpy X0,Y0,A move A,X:(R0)+ ;Arithmetic Instruction.
and Status Interlock may be avoided by adding some useful instructions in between the instructions. The following paragraph demonstrates a few ways to overcome the Arithmetic Interlock.

6.1.2 Avoiding Data ALU Pipeline Interlocks

There are a few common ways to avoid the Arithmetic Interlock. The first is to change the order of the instructions so that the sequence that caused the interlocks is no longer part of the re-ordered code.
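The benefit of re-ordering can be estimated with a simple Python cycle-count model. It is a sketch under strong assumptions: every instruction takes one cycle, only the Arithmetic Interlock is modeled (one extra cycle when a move reads the accumulator written by the immediately preceding arithmetic instruction), and parallel moves are not represented. The `Instr` type and `cycle_count` function are hypothetical.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Instr(NamedTuple):
    text: str                         # assembly text, for readability only
    arith_dest: Optional[str] = None  # accumulator written by an arithmetic instruction
    move_src: Tuple[str, ...] = ()    # accumulators read by a move


def cycle_count(program: Sequence[Instr]) -> int:
    """One cycle per instruction plus one cycle per Arithmetic Interlock."""
    cycles = 0
    previous_dest = None
    for instr in program:
        cycles += 1
        if previous_dest is not None and previous_dest in instr.move_src:
            cycles += 1  # move reads the accumulator just written
        previous_dest = instr.arith_dest
    return cycles


original = [
    Instr("mpy x0,y0,a", arith_dest="a"),
    Instr("move a,x:(r0)+", move_src=("a",)),
    Instr("mpy x1,y1,b", arith_dest="b"),
    Instr("move b,x:(r0)+", move_src=("b",)),
]
reordered = [original[0], original[2], original[1], original[3]]
```

In this model the original order costs 6 cycles and the re-ordered sequence costs 4, because no move immediately follows the arithmetic instruction that produced its source.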
        move    x:(r4)+,A    A,y:(r1)   ;A=next a,PUT d’
        move    B,y:(r7)                ;PUT b’, A=next a
        move    y:(r0)+,B               ;B=next c

The parallel source moves that caused the pipeline interlocks were shifted to the following instructions. This example illustrates the importance of ordering the arithmetic instructions and the parallel read operations. Taking this approach when writing a program can shorten the execution time by preventing unnecessary pipeline interlocks.

6.1.2.
The read operations in the tenth and eleventh instructions will not cause arithmetic pipeline interlocks. Although the loop contains twice as many words as the original code, it is executed half as many times, giving the desired performance of 2N cycles.
6.1.2.2.
                                        ;previous data
        move    x:(r0)+,b   b,y:(r4)+   ;write destination memory,
                                        ;read next data
_end
        move    a,y:(r4)+               ;write last-1 word to
                                        ;destination memory
        move    b,y:(r4)+               ;write last word to destination
                                        ;memory

6.1.2.3 Saving Interlocks by Using the TFR Instruction.
6.2 ADDRESS GENERATION PIPELINE INTERLOCKS

Some sequences related to the Address Generation Unit cause the insertion of one, two, or three pipeline interlock cycles. The following paragraphs describe these sequences and suggest a few ways to avoid them in the application.
6.2.
instruction and no interlock cycles will be added to the execution of the second MPY instruction:

        move    #addr,R0                ;R0 is destination
        move    #data,X0
        move    #data,Y0
        mpy     X0,Y0,A   X:(R0)+,Y0    ;Instruction uses R0
        mpy     X0,Y0,A   X:(R0)+,Y0    ;Instruction uses R0

6.2.2 Avoiding Address Generation Pipeline Interlocks

There are a few common ways to avoid the Address Generation Pipeline Interlock.
hardware stack so that it will not be full or empty and the execution of instructions can continue. This stack extension activity delays the execution phases by the number of cycles required to move data to or from the stack, usually two cycles for each move.

6.3.1 Stack Extension Full/Empty Cases

The stack-full and stack-empty states are defined by the contents of the SC (Stack Counter) register.
6.4.1 What are the Program Flow-Control Pipeline Interlocks?

Some of the flow-control interlocks occur only in very unusual sequences and are not described in this section.
6.4.1.3 JMP to Last Addresses of a Do-Loop (LA or LA–1)

Whenever I1 is any type of JMP with a target address equal to (LA) or to (LA–1), the instruction following the instruction at (LA) is delayed by 2 or 1 clock cycles, respectively.
6.4.1.
        cmp     B,x0            ;compare to threshold
        blt     cont
        move    (r4)+           ;increment counter
        sub     x0,b            ;subtract threshold from sum
cont    add     b,a
LoopEnd

;efficient version - loop reordered.
;the main point - the CMP and subsequent branch are split between two
;iterations
;execution time of one iteration (condition true): 7 clocks
        move    X:(r0)+,B       ;read first data to B
        cmp     B,x0            ;first compare - before loop.
Section 7 COMPACT OPCODE USE The rich instruction set of the DSP56300 and DSP56600 gives a great amount of flexibility to the DSP software engineer when writing the DSP code. However, careful selection of the right opcode will help the user to generate an optimized application.
        move    r4,n4
        move    r0,n0
        do      #N,_loop
        ...
        move    x:(r0)+,x0              y:(r4)+,y0
        rep     #10
        mac     x0,y0,a   x:(r0)+,x0    y:(r4)+,y0
        move    n0,r0
        move    n4,r4
        move    x:(r1)+,x1
        ...
_loop

The cycle count of this loop is increased by the number of cycles it takes to decode the REP instruction, which is 5. The code may be optimized by replacing the REP with in-line assembly and restructuring some instructions to have parallel moves, saving 8N cycles: move move do ...
Example 7-1 First Example: Original Code with Conditional Branch

        tst     a
        bgt     _else
        add     x0,b
        bra     _endif
_else
        add     y0,b
_endif

Example 7-2 First Example: Code with Conditional Branch Replaced by Conditional Execution Opcodes (IFcc)

        tst     a
        add     x0,b    ifgt
        add     y0,b    ifle

In the second example, the Tcc instruction is used in parallel with a move instruction to replace a conditional branch, saving 6 cycles.
Example:

        tst     a
        blt     rare_error
frequent_code
        ...
rare_error
        ...

By choosing the inverse of the condition, the code can be optimized and some cycles can be saved:

        tst     a
        bge     frequent_code
rare_error
        ...
frequent_code
        ...
By choosing the conditions more carefully, the code can be optimized:

        tst     a
        bne     _case_4
        add     #2,a
        bra     _end_case
_case_4
        cmp     #4,a
        bne     _case_9
        tfr     b,a
        bra     _end_case
_case_9
        cmp     #9,a
        bne     _default
        asl     a
        bra     _end_case
_default
        add     x0,a
_end_case

7.2 ADDRESSING MODES

The cycle count of an instruction may depend upon the specific addressing mode used with this instruction.
7.2.2 Short Addressing Mode

The lower portion of data memory (the first 64 locations, 0–63) can be accessed by special short addressing modes that specify the location as part of the opcode, unlike other locations, where a second instruction word is required. Example:

        move    X:5,x0

This instruction executes in 1 clock cycle. This makes it possible to use the lower portion of the data memory as general-purpose registers without a significant increase in code length.
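One way to apply this when laying out data is to reserve the short-addressable locations for the variables that are accessed most often. The Python sketch below is a planning aid only, under stated assumptions: the function names, the example variable names, and the idea of ranking by access count are illustrative and are not part of any Motorola tool.

```python
from typing import Dict

SHORT_ADDRESS_LIMIT = 64  # locations 0-63 fit inside the opcode


def move_instruction_words(address: int) -> int:
    """Instruction length in words for an absolute-address data move."""
    if address < 0:
        raise ValueError("address must be non-negative")
    # Addresses below the limit are encoded in the opcode itself;
    # anything else needs a second (extension) instruction word.
    return 1 if address < SHORT_ADDRESS_LIMIT else 2


def assign_short_slots(access_counts: Dict[str, int]) -> Dict[str, int]:
    """Give the most frequently accessed variables the addresses 0..63."""
    ranked = sorted(access_counts, key=lambda name: (-access_counts[name], name))
    return {name: slot for slot, name in enumerate(ranked[:SHORT_ADDRESS_LIMIT])}
```

For example, `assign_short_slots({"gain": 10, "tmp": 50})` places `tmp` at location 0 and `gain` at location 1, so both can be reached with single-word moves.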
7.2.5 Register Addressing

Register addressing can also be used to decrease the total cycle count. The next example is an implementation of a jump table that uses register addressing. The code is used when exiting reset to jump to a location that corresponds to the specific mode that was chosen at power-up: org move and move move move jmp j_table dc dc dc dc dc dc dc dc 7.2.
7.4 SPECIAL INSTRUCTIONS

7.4.1 Dual Data Spaces

The Harvard architecture of the DSP56300/DSP56600 cores includes two data memory spaces: X and Y. Structuring the application’s data segment efficiently can improve code performance by allowing the use of instructions that support this architecture.
7.4.3 Clearing Registers

The code often needs to clear a register or accumulator. This operation can also be optimized.
Compact Opcode Use Special Instructions 7-10 Optimizing DSP56300/DSP56600 Applications MOTOROLA
Appendix A SAVING POWER

A very important attribute of code efficiency is its power requirement. The DSP programmer should use the various power-saving techniques that minimize the power required by the application. This section describes ways to optimize the application for minimal power consumption.

A.1 LOW POWER MODES

The DSP56300 and DSP56600 have several low power modes:
• Wait Standby Mode
• Stop Standby Mode
• Low-Power Clock Divider
A.1.
peripheral, an interrupt request is generated to take the core out of the Wait mode. Power consumption in Wait Standby mode is very low, in the range of a few milliamperes. Please refer to the specific device data sheet for more accurate numbers.

A.1.2 Stop Standby Mode

The Stop Standby mode is entered by using the special STOP instruction.
A.2 DISABLING FUNCTIONAL BLOCKS

There are a few functional blocks that can be disabled during normal operation if they are not required by the application. A special control bit exists for each block; use it to disable the block and thereby reduce the total power consumption.
Appendix B DEBUG AND TEST SUPPORT The DSP56300 and DSP56600 families provide board and chip-level testing capability through the On-Chip Emulation (OnCE) module and the Test Access Port (TAP) commonly referred to as the JTAG port. These two ports are both accessed through the JTAG port pins. The DE pin is the only direct access to the OnCE module. The presence of the JTAG interface allows the user to insert the DSP chip into a target system while retaining debug control.
• Trace one (single stepping) or up to 256 instructions
• Save or restore the current pipeline state of the DSP core
• Display the contents of the real-time instruction trace buffer
• Return to user mode from Debug mode
• Set up breakpoints without being in Debug mode
• All OnCE events can either force the chip into Debug mode or force a vectored interrupt, based on the user’s needs

B.2 JTAG PORT FEATURES

The JTAG port conforms to the IEEE 1149.
• Force test data onto the outputs of a DSP or DSPs, while replacing its boundary scan register in the serial data path with a single-bit register
• Enable a weak pull-up current device on all input signals of a DSP or DSPs; this helps ensure deterministic test results in the presence of a continuity fault during interconnect testing
B.
Appendix C USING THE PROFILER

C.1 SCOPE

Profiling capabilities are built into the Motorola DSP Simulator. The profiler provides dynamic and static analysis. The analysis results are displayed in profiling report files.

Note: Familiarity with the Motorola DSP Simulator is required to activate the profiler. Please refer to the Simulator’s user’s manual for a detailed description of the DSP Simulator.
C.
C.3 THE PROFILING REPORT

The profiling report is provided in two formats: ASCII and Postscript. Assuming the profiler was invoked using the command ‘log p filename’, the ASCII report is written into the file named filename.log and the Postscript report is written into the file named filename.ps. The profile report consists of several sections, each pertaining to some metrics of the DSP program. The following sections describe each of the report sections.
C.3.
C.3.2 Symbol Report

The symbol report section provides a profile of the accesses made during program execution to the memory objects defined by the program symbols. This report can highlight the usage patterns of memory objects. For each array in memory, the report specifies the number of read and write accesses performed to each of the cells of the array. When memory locations are aliased by several symbols, accesses to the locations are reported under all aliasing symbols.
Example C-3 Typical Instruction Set Usage Report

                   static                 dynamic
mnemonic     # occur   % of 100     # occur   % of 100
-------------------------------------------------------
abs               15       0.22       21536       0.08
add              392       5.84     1327468       5.14
and               13       0.19       36372       0.14
andi              50       0.75        7357       0.03
asl              133       1.98      866526       3.35
asr              166       2.47      534554       2.07

For move instructions, statistics are provided to describe the level of parallelization of moves with Data ALU instructions.
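Rows of the instruction set usage report lend themselves to simple post-processing. The Python sketch below assumes whitespace-separated rows with five fields, as in Example C-3; the parser, its field names, and the "dynamic share above static share" heuristic are illustrative assumptions and are not features of the Simulator.

```python
from typing import Dict, List

# Data rows copied from Example C-3:
# mnemonic, static count, static %, dynamic count, dynamic %
REPORT_ROWS = """\
abs   15  0.22    21536  0.08
add  392  5.84  1327468  5.14
and   13  0.19    36372  0.14
andi  50  0.75     7357  0.03
asl  133  1.98   866526  3.35
asr  166  2.47   534554  2.07
"""


def parse_usage_rows(text: str) -> Dict[str, Dict[str, float]]:
    """Parse report rows into {mnemonic: {field: value}}; skip other lines."""
    rows: Dict[str, Dict[str, float]] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # header, rule, or blank line
        mnemonic, s_count, s_pct, d_count, d_pct = fields
        try:
            rows[mnemonic] = {
                "static_count": int(s_count),
                "static_pct": float(s_pct),
                "dynamic_count": int(d_count),
                "dynamic_pct": float(d_pct),
            }
        except ValueError:
            continue  # not a numeric data row
    return rows


def dynamically_heavy(rows: Dict[str, Dict[str, float]]) -> List[str]:
    """Mnemonics whose share of executed instructions exceeds their share of code."""
    return sorted(m for m, r in rows.items() if r["dynamic_pct"] > r["static_pct"])
```

Instructions flagged this way appear more often at run time than in the program text, which suggests they sit inside frequently executed loops and are natural candidates for closer inspection.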
C.3.4 Code Coverage Report

The code coverage report juxtaposes the assembly source code with dynamic profile information pertaining to the code generated for that source. The report provides, for each source line that corresponds to an assembly instruction, the number of times control has passed through that instruction and the total number of machine cycles spent in executing the instruction.
C.3.5 Basic Subroutine Report

This section of the profile report lists the subroutines that have been executed during the DSP program simulation. For each subroutine, the report provides the number of times the subroutine has been called, the number of different places from which the subroutine was called, the number of entry points used for the subroutine, and the total number of machine cycles spent executing the subroutine.
Example C-8 Typical Subroutine Call Graph Report

Subroutine Call Graph report
-----------------------------------------------------------------------------------------
speechEncoder calls - 100/100, cycles - 9189668
aflat calls - 100, cycles - 15100/9174568
flat calls - 100/100, cycles - 1676900
rcToCorrDpL calls - 100/100, cycles - 188300
vad_algorithm calls - 100/100, cycles - 323968
swComfortNoise calls - 100/100, cycles - 4900
lpcCorrQntz calls - 100/100, cycles - 6980500
-------------
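The cycle counts in Example C-8 can be cross-checked with a few lines of Python. The sketch below rests on an interpretation that the excerpt does not state explicitly and should be treated as an assumption: that the first line is the caller, that the `15100/9174568` pair on the aflat line means cycles spent in the subroutine itself versus cycles spent in its callees, and that the remaining lines are those callees. The variable and function names are illustrative.

```python
from typing import Dict

# Figures copied from Example C-8.
CALLER_CYCLES = 9189668        # speechEncoder line
SELF_CYCLES = 15100            # aflat line, first number (assumed: own cycles)
DESCENDANT_CYCLES = 9174568    # aflat line, second number (assumed: callee cycles)
CALLEE_CYCLES: Dict[str, int] = {
    "flat": 1676900,
    "rcToCorrDpL": 188300,
    "vad_algorithm": 323968,
    "swComfortNoise": 4900,
    "lpcCorrQntz": 6980500,
}


def callee_shares(callees: Dict[str, int]) -> Dict[str, float]:
    """Fraction of the callee cycle total attributed to each callee."""
    total = sum(callees.values())
    return {name: cycles / total for name, cycles in callees.items()}


def hottest_callee(callees: Dict[str, int]) -> str:
    """Name of the callee that accounts for the most cycles."""
    return max(callees, key=lambda name: callees[name])
```

Under this reading the numbers are self-consistent: the callee cycles sum to 9174568, and adding the 15100 cycles of the subroutine itself gives the 9189668 reported on the caller line. The breakdown also shows lpcCorrQntz accounting for roughly three quarters of the callee cycles, making it the first place to look for optimization.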
C.3.8 Subroutine Call Report

This section exists only in the Postscript profile report. It illustrates the relationships between the subroutines that have been active during program simulation. Each such subroutine appears as a node in a graph. Nodes are connected using directed edges, which correspond to the caller/callee relationships. Subroutines that have not been invoked during program simulation will not appear in this graph.
C.
Mfax and OnCE are trademarks of Motorola, Inc. Motorola reserves the right to make changes without further notice to any products herein. Motorola makes no warranty, representation or guarantee regarding the suitability of its products for any particular purpose, nor does Motorola assume any liability arising out of the application or use of any product or circuit, and specifically disclaims any and all liability, including without limitation consequential or incidental damages.