ow bit-serial architectures for this application. To match the divider bitstreams exactly to the multiplier bitstreams, it is then just a matter of inserting extra delays along the FA sum pipeline so that the addition of PPs from a number of different multiplications can be performed simultaneously, as shown by Bellis et al. [13].

A study of the bit-serial interleaved divider and multiplier reveals that the two architectures show a large degree of similarity. Both work in load/operational phases; the divider requires $m$ PEs for the 1's complement error correction that occurs for negative dividends, and the multiplier requires $m-1$ HA PEs to add the output carries from the PPs. It is therefore possible to combine the two designs into a programmable bit-serial device which allows $m+1$ computations to be simultaneously interleaved, as shown in figure 1. The processor has two mode selection inputs, DIVi and SUBi, which control four modes of operation: $X_0 = \pm Z_i / Y_i$ or $Z_0 = Z_i \pm X_i Y_i$, where $Z_i$ and $Z_0$ are both double precision. Ldi is the load/operational mode select signal: it selects load mode for the storage of $Y_i$ and $Z_i$ over the first $m(m+1)$ clock cycles, then switches into operational mode over the next $m(m+1)$ clock cycles, during which the remaining data is input and the bulk of the computation is performed in the FA array. All control signals are fully pipelined in the same way as the data, allowing the shortest possible block pipeline period of $2m(m+1)$ clock cycles and continuous input/output of data (i.e. while one block set of $m+1$ computations is being output, the next block set may be loaded in). The pipeline also allows independent functionality between the separate interleaves, and on the same interleave a division may immediately follow an inner product step computation and vice versa.

4. INTERLEAVED PROCESSOR BASED MODIFIED COVARIANCE SYSTEM

Cost-benefit analysis on systolic array implementation of the CMR and Cholesky sections of the MC spectral estimator shows that a 12-bit fixed-point wordlength is sufficient for these computations [7]. Using the bit-serial processor with a 12-bit wordlength results in the capacity for interleaving 13 computations.

On interleaves 0 to 4 the CMR multiplications are performed over N consecutive block sets, such that the lag-$i$ products of the samples $x_n$ and $x_{n-i}$ are produced on interleave $i$ ($0 \le i \le 4$) and block set $n$ ($0 \le n \le N-1$). A bit-serial systolic array provides the correct input data sequencing from consecutive Doppler signal samples, and a separate MSB-first double precision accumulator, whose architecture is similar to that of the HA section in figure 1, computes the covariance matrix elements, which are then stored in RAM. The system for computing the CMR calculation is shown in figure 2.

The entire Cholesky, forward elimination, back substitution and WNV computations are performed on interleave 5 on the system shown in figure 3. Here division and inner product step computation are necessary. Once the covariance matrix elements are stored in the dual port RAM after block set N, the Cholesky decomposition can commence on interleave 5 while, in parallel, the CMR computation on the next set of data is processed on interleaves 0 to 4. A ROM block controls the addressing of the dual port RAM, both for retrieval of stored data to go onto the processor inputs and for storage of the processor results. To achieve good dynamic resolution for the low wordlength used, a systolic array scaling module is included between the RAM and the processor; its scaling factors are also produced by the ROM controller, along with the mode control. Overall timing in the system is controlled by three counters, qi (range 0 to 12), qb (range 0 to 23) and qw (range 0 to N), corresponding to the interleave, bit position and input word respectively.

A zero padded point DFT is computed on interleaves 6, 7, 8 and 9.
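The counter ranges quoted above tie in with the block period given for the processor: 13 interleaves times 24 bit positions is 312 clock cycles, which equals $2m(m+1)$ for $m = 12$. The following Python sketch illustrates this bookkeeping. It assumes, purely for illustration, that qi is the fastest-changing counter (one clock cycle per interleave), that qb advances once per full pass over the interleaves, and that qw advances once per block; the text does not state the nesting order, the function names are hypothetical, and pipeline skew between PEs is ignored.

```python
M = 12                          # wordlength in bits, as stated in the text
INTERLEAVES = M + 1             # 13 interleaved computations (qi: 0..12)
BIT_POSITIONS = 2 * M           # 24 bit positions per block (qb: 0..23)
BLOCK_PERIOD = 2 * M * (M + 1)  # block pipeline period in clock cycles


def decode_cycle(t):
    """Split clock cycle t into the counter values (qw, qb, qi).

    Assumption (not stated in the source): qi is the fastest counter,
    then qb, then qw.
    """
    qw, rem = divmod(t, BLOCK_PERIOD)
    qb, qi = divmod(rem, INTERLEAVES)
    return qw, qb, qi


def is_load_phase(qb):
    """True for the first m bit positions of a block, i.e. the first
    m(m+1) clock cycles under the counter ordering assumed above."""
    return qb < M


if __name__ == "__main__":
    # The two ways of counting the block period should agree.
    print(BLOCK_PERIOD, INTERLEAVES * BIT_POSITIONS)
    for t in (0, 13, 311, 312):
        print(t, decode_cycle(t))
```

Under these assumptions the load phase covers exactly half of each block, which matches the two $m(m+1)$-cycle phases described for the Ldi signal.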
The DFT is essentially a matrix-vector multiplication and is computed by using the processor in inner product step mode. The system for this section consists of a ROM to provide storage of the twiddle factor matrix $W_n$, another ROM to control the addressing of the twiddle factors for a particular qw, and 4 registers which continuously recirculate the filter parameter results ($a_n$) from the Cholesky decomposition stage. On interleave 6 the real and imaginary parts of the first set of products, those of the twiddle factors with $a_1$, are alternately formed. Using a single flip-flop delay, the results of these computations are then fed back into the $Z_i$ input of the interleaved processor to be added to the products involving $a_2$, and the DFT is built up in this way.

The dynamic range of the PSD computation is quite high compared to the rest of the system; therefore, at this stage a floating-point representation of the DFT results is taken using a systolic-based conversion circuit. PIPO registers are used to store the 6-bit exponents of the real and imaginary parts of the DFT, whose squares are computed on interleave 10. On interleave 11 the absolute value of the DFT is computed. The maximum of each pair of real and imaginary results from interleave 10 is fed to the $Z_i$ input, while the other value is piped into the $Y_i$ input to be appropriately scaled by the difference in the two squared exponents appearing on the $X_i$ input. The PSD is then computed on interleave 12, involving N/2 divisions of the WNV formed on interleave 5 by the absolute values from interleave 11. The exponents of the PSD are then easily derived from the exponents of the DFT results.

5. CONCLUSION

This paper has proposed a bit-serial interleaved processor which can be programmed for use in division or inner product step computations. The interleaving idea was introduced in order to perform bit-serial division at the same high clock rate as multiplication without resorting to carry look-ahead schemes to remove the communication bottleneck.
The result is a high-throughput processor which is cost-efficient in terms of VLSI implementation, since communication between PEs in