CN112286864B: Sparse data processing method and system for accelerating the operation of a reconfigurable processor
Publication number: CN112286864B (application CN202011552162.8A, China)
Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Classifications

 G—PHYSICS
 G06—COMPUTING; CALCULATING; COUNTING
 G06F—ELECTRIC DIGITAL DATA PROCESSING
 G06F15/00—Digital computers in general; Data processing equipment in general
 G06F15/76—Architectures of general purpose stored program computers
 G06F15/78—Architectures of general purpose stored program computers comprising a single central processing unit
 G06F15/7867—Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture
 G06F15/7871—Reconfiguration support, e.g. configuration loading, configuration switching, or hardware OS

 G—PHYSICS
 G06—COMPUTING; CALCULATING; COUNTING
 G06N—COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS
 G06N3/00—Computer systems based on biological models
 G06N3/02—Computer systems based on biological models using neural network models
 G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
 G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Abstract
The invention provides a sparse data processing method for accelerating the operation of a reconfigurable processor, comprising the following steps: dividing the sparse weight matrix to be calculated into a plurality of unit blocks along its row and column directions, using P × Q as the dividing unit; combining the column-direction unit blocks of the weight matrix to be calculated into groups; and dividing the weight matrix to be calculated into a plurality of calculation groups along its column direction according to the number of group divisions. The PE array reads the vector values of all unit blocks in a calculation group in sequence, and stores each non-zero weight value, together with the number of zero-weight unit intervals separating it from the previous non-zero weight, as the effective weight address of the current unit into the storage address corresponding to the calculation group. The grouped regular sparsification strategy adopted by the invention thus favors algorithm precision convergence and can provide a higher sparsity rate at the same algorithm precision. The invention also provides a sparse data processing system for accelerating the operation of a reconfigurable processor.
Description
Technical Field
The invention relates to the field of reconfigurable processors, in particular to deep-learning neural network computation on reconfigurable processors in fields such as image detection, image recognition, and voice recognition. The invention particularly relates to a sparse data processing method and a sparse data processing system for accelerating the operation of a reconfigurable processor.
Background
Neural network computation based on deep learning is widely applied in fields such as image detection, image recognition, and voice recognition. The convolution and fully-connected operations in neural networks consume large amounts of storage, computation, and bandwidth resources, which makes them a bottleneck for deployment on intelligent devices such as smart cameras, smart earphones, and smart speakers. Sparsification is a technique that constrains, through training, the proportion of non-zero weights among the weights of convolution and fully-connected operations, so as to reduce the cost of storing those weights. Research has also found that sparsification can reduce the number of multiply-add operations in convolution and fully-connected calculations and reduce the bandwidth of data transmission. However, weights sparsified randomly during training are not conducive to fully exploiting the computing and bandwidth resources of the hardware.
Disclosure of Invention
The invention aims to provide a sparse data processing method for accelerating the operation of a reconfigurable processor. The grouped regular sparsification strategy it adopts favors algorithm precision convergence and can provide a higher sparsity rate at the same algorithm precision.
Another object of the present invention is to provide a sparse data processing system for accelerating the operation of a reconfigurable processor, which can likewise provide a higher sparsity rate at the same algorithm precision.
In a first aspect of the invention, a method for sparse data processing to accelerate the operation of a reconfigurable processor is provided, wherein the reconfigurable processor comprises a PE array. The PE array has P × Q PE units. The sparse data processing method comprises the following steps:
Step S101: divide the sparse weight matrix to be calculated into a plurality of unit blocks along its row and column directions, using P × Q as the dividing unit. Each unit block includes a plurality of effective weights.
Step S102: combine the column-direction unit blocks of the weight matrix to be calculated into a group. Judge whether the total number of effective weights in the unit blocks of a group exceeds P × Q/2; if so, split the group evenly into two groups of unit blocks. Take the number of unit blocks in a group that does not exceed P × Q/2 effective weights as the group-division number, and divide the weight matrix to be calculated into a plurality of calculation groups along its column direction according to that number.
Step S103: the PE array reads the vector values of the unit blocks in a calculation group in sequence. If the vector value of the current unit block is a non-zero weight, the non-zero weight value of the current unit block, together with the number of zero-weight unit-block intervals separating it from the previous non-zero weight, is stored as the effective weight address of the current unit block into the storage address corresponding to the calculation group.
In another embodiment of the present invention, the method for processing sparse data to accelerate the operation of the reconfigurable processor further includes, after step S103:
Step S104: through the P × Q PE units in the PE array, obtain the non-zero weight value corresponding to each effective weight address, and its corresponding storage address, according to the effective weight addresses of each calculation group to be processed. Then read the convolution calculation value corresponding to the storage address of the non-zero weight value.
Step S105: perform the convolution or fully-connected layer calculation in the deep-learning neural network model according to the convolution calculation value corresponding to the non-zero weight value in each calculation group.
In another embodiment of the present invention, the sparse data processing method further includes, after step S105: step S106, outputting the convolution or fully-connected layer calculation result of the neural network model. In another embodiment of the sparse data processing method for accelerating the operation of the reconfigurable processor, the P × Q PE units in the PE array are 8 × 8 PE units.
In a second aspect of the invention, a sparse data processing system is provided for accelerating the operation of a reconfigurable processor, the reconfigurable processor comprising a PE array. The PE array has P × Q PE units. The sparsifying data processing system includes:
a weight dividing unit, configured to divide the sparse weight matrix to be calculated into a plurality of unit blocks along its row and column directions, using P × Q as the dividing unit, each unit block including a plurality of effective weights;
a grouping unit, configured to combine the column-direction unit blocks of the weight matrix to be calculated into a group; judge whether the total number of effective weights in the unit blocks of a group exceeds P × Q/2 and, if so, split the group evenly into two groups of unit blocks; take the number of unit blocks in a group not exceeding P × Q/2 effective weights as the group-division number; and divide the weight matrix to be calculated into a plurality of calculation groups along its column direction according to that number; and
a storage unit, configured so that the PE array reads the vector values of the unit blocks in a calculation group in sequence and, if the vector value of the current unit block is a non-zero weight, stores the non-zero weight value of the current unit block, together with the number of zero-weight unit-block intervals separating it from the previous non-zero weight, as the effective weight address of the current unit block into the storage address corresponding to the calculation group.
In another embodiment of the present invention, a sparse data processing system for accelerating the operation of a reconfigurable processor is further provided, the system further comprising:
an extracting unit, configured to obtain, through the P × Q PE units in the PE array, the non-zero weight value corresponding to each effective weight address, and its corresponding storage address, according to the effective weight addresses of each calculation group to be processed, and to read the convolution calculation value corresponding to the storage address of the non-zero weight value; and
and the calculation unit is configured to realize convolution or fullconnection layer calculation in the deeplearning neural network model according to the convolution calculation value corresponding to the nonzero weight value in each calculation group.
In another embodiment of the present invention, a sparse data processing system for accelerating the operation of a reconfigurable processor further comprises: an output unit configured to output a convolution or fullconnected layer calculation result in the neural network model.
In another embodiment of the present invention, the present invention provides a sparse data processing system for accelerating the operation of a reconfigurable processor, wherein the P × Q PE elements in the PE array are 8 × 8 PE elements.
The characteristics, technical features, advantages and implementation manners of the sparse data processing method and system for accelerating the operation of the reconfigurable processor will be further described in a clear and easy manner with reference to the attached drawings.
Drawings
Fig. 1 is a flowchart illustrating a sparse data processing method for accelerating the operation of a reconfigurable processor in one embodiment of the present invention.
Fig. 2 is a flowchart illustrating a sparse data processing method for accelerating the operation of a reconfigurable processor in another embodiment of the present invention.
Fig. 3 is a flowchart illustrating a sparse data processing method for accelerating the operation of a reconfigurable processor in still another embodiment of the present invention.
Fig. 4 is a schematic diagram illustrating a sparse data processing system for accelerating the operation of a reconfigurable processor according to an embodiment of the present invention.
Fig. 5 is a schematic diagram for illustrating a division of the weight matrix according to an embodiment of the present invention.
Fig. 6 is a schematic diagram for explaining another division of the weight matrix in an embodiment of the present invention.
Fig. 7 is a schematic diagram for explaining a sparse matrix storage format in an embodiment of the present invention.
Fig. 8 is a schematic diagram for explaining another sparse matrix storage format in an embodiment of the present invention.
Fig. 9 is a schematic diagram for explaining still another sparse matrix storage format in an embodiment of the present invention.
Detailed Description
In order to more clearly understand the technical features, objects and effects of the present invention, embodiments of the present invention will now be described with reference to the accompanying drawings, in which the same reference numerals indicate the same or structurally similar but functionally identical elements.
"exemplary" means "serving as an example, instance, or illustration" herein, and any illustration, embodiment, or steps described as "exemplary" herein should not be construed as a preferred or advantageous alternative. For the sake of simplicity, the drawings only schematically show the parts relevant to the present exemplary embodiment, and they do not represent the actual structure and the true scale of the product.
In a first aspect of the invention, a method for sparse data processing to accelerate the operation of a reconfigurable processor is provided, wherein the reconfigurable processor comprises a PE array. The PE array has P × Q PE units. As shown in fig. 1, the sparse data processing method includes:
in step S101, a plurality of cell blocks are divided.
In this step, the weight matrix is divided into a plurality of cell blocks by using P × Q as a dividing unit along the row and column direction of the sparse weight matrix to be calculated. The cell block includes a plurality of valid weights.
The invention provides a hardware-friendly regular sparsification method and the corresponding accelerated hardware design. The regular sparsification is a grouped sparsification structure.
For example, an M × N weight matrix is divided into (M/Q) × (N/P) small blocks with granularity Q × P, where the number of non-zero weights in each K × Q constraint matrix does not exceed P × Q/2 (P and Q being the dimensions of the convolution array, i.e., the size of the P × Q array of PE units).
By way of specific example, as shown in fig. 5, a 64 × 64 weight matrix is given, where P is 8 and Q is 8 (i.e., the PE array is 8 × 8 PE units), i.e., the dividing unit of the weight matrix is the number of PE units in the PE array, so as to facilitate the calculation of the weight matrix by the PE array.
As shown in fig. 5, the matrix is divided into unit blocks 1 through 64 (corresponding to division areas 1, 2, …, 64), each containing 8 × 8 cells, so that the entire 64 × 64 weight matrix is divided into 64 matrices of 8 × 8.
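The block partition of step S101 can be sketched as follows (a minimal illustration, not taken from the patent; the function and variable names are my own, and blocks are numbered along the rows first, which is an assumption about fig. 5):

```python
def partition_into_blocks(matrix, p, q):
    """Split an M x N matrix into (M/p) * (N/q) blocks of size p x q."""
    m, n = len(matrix), len(matrix[0])
    assert m % p == 0 and n % q == 0, "matrix must tile evenly into p x q blocks"
    blocks = []
    for bi in range(0, m, p):          # walk the row direction in steps of p
        for bj in range(0, n, q):      # walk the column direction in steps of q
            blocks.append([row[bj:bj + q] for row in matrix[bi:bi + p]])
    return blocks

# A 64 x 64 weight matrix with P = Q = 8 yields the 64 blocks of fig. 5.
weights = [[0] * 64 for _ in range(64)]
blocks = partition_into_blocks(weights, 8, 8)
```

With P = Q = 8 this reproduces the 64 unit blocks, each 8 × 8, described above.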
In step S102, a plurality of calculation groups are acquired.
In this step, the column-direction unit blocks of the weight matrix to be calculated are combined into a group. Whether the total number of effective weights in the unit blocks of a group exceeds P × Q/2 is judged; if so, the group is split evenly into two groups of unit blocks. The number of unit blocks in a group not exceeding P × Q/2 effective weights is taken as the group-division number, and the weight matrix to be calculated is divided into a plurality of calculation groups along its column direction according to that number.
For example, as shown in fig. 5, column-direction unit blocks 1 to 8 of the weight matrix are combined into one group. The grouping principle is that the number of effective weights (i.e., non-zero weights) in a group must not exceed (8 × 8)/2 = 32, half the number of PE units, because half of the 64 PE units are reserved as address storage locations for the effective weights.
For example, suppose the number of effective weights in each group of eight unit blocks is less than 32: blocks 1–8 contain 20 effective weights, blocks 9–16 contain 15, blocks 17–24 contain 10, blocks 25–32 contain 31, blocks 33–40 contain 30, blocks 41–48 contain 28, blocks 49–56 contain 8, and blocks 57–64 contain 11.
From these counts, the group with the largest number of effective weights is blocks 25–32, with 31. Since no group in the weight matrix exceeds 32, eight unit blocks in the column direction can form one group, and the weight matrix is divided into 8 groups: a first group of blocks 1–8, a second group of blocks 9–16, and so on up to an eighth group of blocks 57–64.
As shown in fig. 6, when the number of effective weights of unit blocks 1–8 in a group exceeds 32 (for example, 56), the group is split: blocks 1–4 form one group and blocks 5–8 another, and so on, until the number of effective weights in each resulting calculation group is less than 32. Four unit blocks in the column direction then form one group, dividing the weight matrix into 16 groups: a first group G1 of blocks 1–4, a second group of blocks 5–8, and so on up to a sixteenth group of blocks 61–64. Within a weight matrix, therefore, the calculation groups are formed from those combinations of column-direction unit blocks whose effective-weight counts remain below 32.
Fig. 5 exemplifies a 64 × 64 weight matrix with K = 32 and P = 8, where the number of non-zero weights in each K × Q constraint matrix must not exceed 8 × 8/2 = 32. Different grouping strategies can be chosen flexibly according to the requirements of the engineering application. For example, eight matrices can be combined into one group (G8): as shown in fig. 6, each G8 region contains eight 8 × 8 matrices (one square represents one 8 × 8 matrix) and no more than P × Q/2 non-zero weights, i.e., fewer than 32. Alternatively, four matrices can be combined into one group.
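The grouping rule of step S102 can be sketched as below. This is an illustrative simplification with names of my own: a real split would recount the non-zero weights of each half of a group, whereas this sketch simply halves the group totals.

```python
def split_groups(group_counts, limit=32):
    """group_counts: total non-zero weights per candidate column group.
    Returns how many column-direction unit blocks end up in each group:
    8 if every total fits under the limit, else 4, and so on."""
    blocks_per_group = 8                      # start with 8 blocks per column group
    while any(c > limit for c in group_counts):
        blocks_per_group //= 2                # split each group evenly in two
        group_counts = [c // 2 for c in group_counts]  # rough halves (sketch only)
    return blocks_per_group

# The example counts from the text: every group already <= 32, so 8 blocks/group.
counts = [20, 15, 10, 31, 30, 28, 8, 11]
print(split_groups(counts))

# One group with 56 non-zero weights forces a split into groups of 4 blocks.
print(split_groups([56, 15, 10, 31, 30, 28, 8, 11]))
```

The limit of 32 corresponds to P × Q/2 with an 8 × 8 PE array, matching the constraint stated above.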
for the weight matrix of the fullconnection calculation, M is fo, and N is fi; wherein fo is: outputting the number of characteristic channels; fi is: and inputting the number of the characteristic channels.
For convolution weight templates calculated by convolution, M ═ fo, N ═ kx × ky ═ fi; wherein fo is: outputting the number of characteristic channels; fi is: inputting the number of characteristic channels; kx and ky are as follows: the dimensions of the rolltoroll template.
Therefore, the grouped sparsification mode adopted by the invention is suitable for weight sparsification of both convolution and fully-connected calculations. In addition, compared with the clustered regular sparsification of the prior art, the grouped regular sparsification strategy adopted by the invention favors algorithm precision convergence and can provide a higher sparsity rate at the same algorithm precision.
Step S103, obtaining the effective weight address.
In this step, the PE array reads the vector values of the unit blocks in the calculation group in sequence. If the vector value of the current unit block is a non-zero weight, the non-zero weight value of the current unit block, together with the number of zero-weight unit-block intervals separating it from the previous non-zero weight, is stored as the effective weight address of the current unit block into the storage address corresponding to the calculation group.
As shown in fig. 7, in the sparse matrix storage format, the invention stores the sparse weight matrix by means of sparse coding, arranging non-zero weight values alternately with the interval counts between them, so as to compress the weight matrix; under the G8 condition, for example, a compression factor of 4 can be achieved. Fig. 7 shows how a 16-bit vector is compressed with this storage format: the shaded part is non-zero and the white parts are all zero. According to the storage method of the invention, the vector is recorded as (A,0) (B,3) (C,7) (D,2), where each number is the count of zeros between two non-zero weights. Compared with storing the original vector A 000 B 0000000 C 00 D, this effectively reduces the storage capacity and the bandwidth of data transmission.
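The storage format just described amounts to a run-length encoding of the non-zero weights; a minimal sketch (function name is my own) follows:

```python
def sparse_encode(vector):
    """Encode a vector as (non-zero value, zeros since previous non-zero) pairs."""
    pairs, gap = [], 0
    for v in vector:
        if v == 0:
            gap += 1                  # count the zero-run
        else:
            pairs.append((v, gap))    # emit the value with its preceding gap
            gap = 0
    return pairs

# The 16-element vector A 000 B 0000000 C 00 D from the text:
vec = ['A', 0, 0, 0, 'B', 0, 0, 0, 0, 0, 0, 0, 'C', 0, 0, 'D']
print(sparse_encode(vec))
```

Encoding this 16-element vector yields the four pairs (A,0) (B,3) (C,7) (D,2), as stated in the description.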
In the hardware acceleration design, the invention adopts a P × Q MAC array to accelerate convolution and sparsification operation. And reading an input feature vector and P weights of one dimension P each time by the MAC array of P x Q, and calculating to obtain an output feature vector of the dimension Q.
In the sparse mode, a K-dimensional feature vector and P × Q/2 sparse non-zero weights are read each time. During calculation, the constraint matrix is restored by extracting the interval-length values from the storage format, yielding the position of the input-feature element to be multiplied by each non-zero weight, and a Q-dimensional output feature vector is calculated.
Sparse decoding: according to the sparse coding, a K × Q matrix is filled in starting from the top-left corner, proceeding from top to bottom within each column and from left to right across columns. Take a 6 × 4 matrix as an example, with sparse encoding (1,0) (2,3) (4,5) (3,6) (5,5): in each pair, the first number is the non-zero weight value and the second is the interval between this non-zero value and the previous non-zero value (or the starting point). This matrix is shown in fig. 8.
The sparse code is then decoded into a data-and-address format (value, address); since the constraint matrix contains 64 × 8 = 512 (2^9) entries in total, the address length is 9 bits.
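Sparse decoding under these conventions can be sketched as follows. Column-major fill order is assumed from the description above, and the names are my own:

```python
def sparse_decode(pairs, k, q):
    """Decode (value, gap) pairs into a dense k x q matrix (column-major fill)
    and into the (value, address) format, address = linear position."""
    dense = [[0] * q for _ in range(k)]
    addressed, pos = [], -1
    for value, gap in pairs:
        pos += gap + 1                 # skip `gap` zeros, land on the value
        col, row = divmod(pos, k)      # columns filled top to bottom, left to right
        dense[row][col] = value
        addressed.append((value, pos)) # (value, address) form
    return dense, addressed

# The 6 x 4 example from the text:
dense, addressed = sparse_decode([(1, 0), (2, 3), (4, 5), (3, 6), (5, 5)], 6, 4)
print(addressed)
```

Decoding places value 1 at row 1 and value 2 at row 5 of the first column (sequence numbers 1 and 5, as stated above); for a 64 × 8 constraint matrix the addresses range over 512 = 2^9 values, matching the 9-bit address length.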
In the constrained K × Q matrix, each column allows at most 8 non-zero values, which are extracted by a logic circuit together with their sequence numbers within the column. Taking the matrix shown in fig. 8 as an example, the first non-zero number has a weight value of 1 and sequence number 1; the second non-zero number has a value of 2 and sequence number 5.
Based on the sequence numbers read from a column, the values at the corresponding sequence numbers of the given K-dimensional input feature vector are fetched, and the column's non-zero weights are multiplied with those input values and accumulated to obtain the column's output value; in the case of fig. 9 this is 1 × 2 + 2 × 9 = 20. Expanding in parallel, the non-zero weights of every column are simultaneously multiplied and accumulated with the input feature vector, yielding Q result values in total, which are output as a Q-dimensional result vector.
For example, in the second column there is only one non-zero number, 4, with sequence number 5, so the fifth value of the feature vector, 9, is taken, giving 4 × 9 = 36. In the third column, the non-zero number 3 with sequence number 6 is multiplied by the 6th value of the feature vector, giving 3 × 8 = 24. In the fourth column, the non-zero number 5 with sequence number 6 is multiplied by the 6th value, giving 5 × 8 = 40. This operation thus yields four numbers, 20, 36, 24, 40, and the output is (20, 36, 24, 40). For a matrix of Q columns, Q values are obtained and formed into the output vector.
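The per-column multiply-accumulate walked through above can be sketched as below. The input-vector entries the example never reads are set to 0 here, which is an assumption; the text only fixes the 1st, 5th, and 6th values (2, 9, 8).

```python
def sparse_matvec(pairs, k, q, x):
    """Multiply a sparsely encoded k x q weight matrix (column-major fill)
    with a k-dimensional input vector x, one accumulator per column."""
    out, pos = [0] * q, -1
    for value, gap in pairs:
        pos += gap + 1
        col, row = divmod(pos, k)      # column-major position of this weight
        out[col] += value * x[row]     # multiply-accumulate into that column
    return out

x = [2, 0, 0, 0, 9, 8]                 # 1st value 2, 5th value 9, 6th value 8
print(sparse_matvec([(1, 0), (2, 3), (4, 5), (3, 6), (5, 5)], 6, 4, x))
```

With the 6 × 4 example this reproduces the output vector (20, 36, 24, 40) computed in the text.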
As shown in fig. 2, in another embodiment of the present invention, which provides a method for processing sparse data to accelerate the operation of a reconfigurable processor, after step S103, the method further includes:
step S104, reading the convolution calculation value.
In this step, through the P × Q PE units in the PE array, the non-zero weight value corresponding to each effective weight address, and its corresponding storage address, are obtained according to the effective weight addresses of each calculation group to be processed. The convolution calculation value corresponding to the storage address of the non-zero weight value is then read.
And step S105, realizing convolution or full connection layer calculation.
In this step, the convolution or fullconnected layer calculation in the deep learning neural network model is realized according to the convolution calculation value corresponding to the nonzero weight value in each calculation group.
In another embodiment of the present invention, the method for processing sparse data to accelerate the operation of the reconfigurable processor further includes, after step S105, as shown in fig. 3:
and step S106, outputting the result.
In this step, the convolution or fully-connected layer calculation results of the neural network model are output.
In another embodiment of the sparse data processing method for accelerating the operation of the reconfigurable processor, the P × Q PE units in the PE array are 8 × 8 PE units.
In a second aspect of the present invention, a sparse data processing system is provided for accelerating the operation of a reconfigurable processor, as shown in fig. 4, the reconfigurable processor comprising a PE array. The PE array has P × Q PE units. The sparsifying data processing system includes:
and a weight dividing unit 101 configured to divide the weight matrix into a plurality of unit blocks with P × Q as a dividing unit in a rowcolumn direction of the thinned weight matrix to be calculated. The cell block includes a plurality of valid weights.
A grouping unit 201 configured to group columnwise cells in the weight matrix to be calculated into a group. And judging whether the total number of the effective weights in the unit blocks in one group is more than P × Q/2, if so, averagely splitting one group into two groups of unit blocks. And acquiring the number of a group of unit blocks not exceeding P × Q/2 in the weight matrix to be calculated as the number of the grouped divisions. And dividing the weight matrix to be calculated into a plurality of calculation groups along the column direction of the weight matrix to be calculated according to the grouping division quantity. And
a storage unit 301, configured so that the PE array reads the vector values of the unit blocks in a calculation group in sequence and, if the vector value of the current unit block is a non-zero weight, stores the non-zero weight value of the current unit block, together with the number of zero-weight unit-block intervals separating it from the previous non-zero weight, as the effective weight address of the current unit block into the storage address corresponding to the calculation group.
As shown in fig. 4, in another embodiment of the present invention, a sparse data processing system for accelerating the operation of a reconfigurable processor is provided, the system further comprising:
an extracting unit 401, configured to obtain, through the P × Q PE units in the PE array, the non-zero weight value corresponding to each effective weight address, and its corresponding storage address, according to the effective weight addresses of each calculation group to be processed, and to read the convolution calculation value corresponding to the storage address of the non-zero weight value; and
and a calculating unit 501, configured to implement convolution or fullconnected layer calculation in the deeplearning neural network model according to the nonzero weight value and the corresponding convolution calculation value in each calculation group.
In another embodiment of the present invention, a sparse data processing system for accelerating the operation of a reconfigurable processor further comprises: an output unit configured to output a convolution or fullconnected layer calculation result in the neural network model.
In another embodiment of the sparse data processing system for accelerating the operation of a reconfigurable processor, the P × Q PE units in the PE array are 8 × 8 PE units.
It should be understood that although the present description is described in terms of various embodiments, not every embodiment includes only a single embodiment, and such description is for clarity purposes only, and those skilled in the art will recognize that the embodiments described herein as a whole may be suitably combined to form other embodiments as will be appreciated by those skilled in the art.
The abovelisted detailed description is only a specific description of a possible embodiment of the present invention, and they are not intended to limit the scope of the present invention, and equivalent embodiments or modifications made without departing from the technical spirit of the present invention should be included in the scope of the present invention.
Claims (8)
1. The sparse data processing method for accelerating the operation of the reconfigurable processor is characterized in that the reconfigurable processor comprises a PE array; the PE array has P × Q PE units; the sparse data processing method comprises the following steps:
step S101, dividing the sparse weight matrix to be calculated into a plurality of unit blocks along its row and column directions, with P × Q as the dividing unit; the unit block comprising a plurality of effective weights;
step S102, combining the column-direction unit blocks of the weight matrix to be calculated into a group; judging whether the total number of effective weights in the unit blocks of the group exceeds P × Q/2 and, if so, splitting the group evenly into two groups of unit blocks; taking the number of unit blocks in a group not exceeding P × Q/2 effective weights in the weight matrix to be calculated as the group-division number; and dividing the weight matrix to be calculated into a plurality of calculation groups along its column direction according to the group-division number;
step S103, the PE array reads vector values of all unit blocks in the calculation group in sequence, and if the vector value of the current unit block is a nonzero weight, the nonzero weight value of the current unit block and the zero weight unit block interval number which is away from the current unit block by a nonzero weight are stored in a storage address corresponding to the calculation group as effective weight addresses of the current unit block.
2. The sparse data processing method according to claim 1, further comprising, after step S103:
step S104, acquiring, by means of the P × Q PE units in the PE array, the non-zero weight value corresponding to each effective weight address and its corresponding storage address according to the effective weight addresses of each calculation group of the array to be processed; and reading the convolution or fully-connected feature input value corresponding to the storage address of the non-zero weight value according to that storage address;
step S105, performing the convolution or fully-connected layer calculation in the deep-learning neural network model according to the non-zero weight values in each calculation group and the feature input values corresponding to those non-zero weight values.
3. The sparse data processing method according to claim 2, further comprising, after step S105: step S106, outputting the calculation result of the convolution layer or fully-connected layer in the neural network model.
4. The sparse data processing method according to claim 1, wherein the P × Q PE units in the PE array are 8 × 8 PE units.
5. A sparse data processing system for accelerating the operation of a reconfigurable processor, wherein the reconfigurable processor comprises a PE array and the PE array has P × Q PE units, the sparse data processing system comprising:
a weight dividing unit configured to divide the weight matrix into a plurality of unit blocks, taking P × Q as the dividing unit, along the row and column directions of the sparse weight matrix to be calculated, each unit block comprising a plurality of effective weights;
a grouping unit configured to form the column-direction unit blocks of the weight matrix to be calculated into a group; judge whether the total number of effective weights in the unit blocks of the group exceeds P × Q/2, and if so, evenly split the group into two groups of unit blocks; take the number of groups of unit blocks in the weight matrix to be calculated that do not exceed P × Q/2 as the number of group divisions; and divide the weight matrix to be calculated into a plurality of calculation groups along its column direction according to the number of group divisions; and
a storage unit configured to cause the PE array to read the vector values of all unit blocks in each calculation group in sequence and, if the vector value of the current unit block is a non-zero weight, to store the non-zero weight value of the current unit block, together with the number of zero-weight unit blocks separating it from the previous non-zero weight, as the effective weight address of the current unit block in the storage address corresponding to the calculation group.
6. The sparse data processing system according to claim 5, further comprising:
an extraction unit configured to acquire, by means of the P × Q PE units in the PE array, the non-zero weight value corresponding to each effective weight address and its corresponding storage address according to the effective weight addresses of each calculation group of the array to be processed, and to read the convolution or fully-connected feature input value corresponding to the storage address of the non-zero weight value according to that storage address; and
a calculation unit configured to perform the convolution or fully-connected layer calculation in the deep-learning neural network model according to the convolution or fully-connected feature input values corresponding to the non-zero weight values in each calculation group.
7. The sparse data processing system according to claim 6, further comprising:
an output unit configured to output the calculation result of the convolution or fully-connected layer in the neural network model.
8. The sparse data processing system according to claim 5, wherein the P × Q PE units in the PE array are 8 × 8 PE units.
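The block partition of step S101 and the grouping of step S102 (mirrored by the weight dividing and grouping units of claim 5) can be sketched in software. This is a minimal illustration under stated assumptions, not the patented hardware flow: the function names, the zero-padding of matrix edges, and the recursive even split are choices made for the sketch; the claims only fix the P × Q block size and the P × Q/2 effective-weight threshold.

```python
import numpy as np

P, Q = 8, 8  # PE array dimensions (claims 4 and 8: an 8 x 8 PE array)

def partition_into_unit_blocks(weights):
    """Step S101 (sketch): split the sparse weight matrix into P x Q unit
    blocks along its row and column directions, zero-padding the edges."""
    rows = -(-weights.shape[0] // P) * P   # round rows up to a multiple of P
    cols = -(-weights.shape[1] // Q) * Q   # round cols up to a multiple of Q
    padded = np.zeros((rows, cols), dtype=weights.dtype)
    padded[:weights.shape[0], :weights.shape[1]] = weights
    # blocks[i][j] is the P x Q unit block at block-row i, block-column j
    return [[padded[r:r + P, c:c + Q] for c in range(0, cols, Q)]
            for r in range(0, rows, P)]

def split_group(blocks):
    """Step S102 (sketch): if the unit blocks of a group hold more than
    P*Q/2 effective (non-zero) weights in total, split the group evenly
    into two and recurse, yielding the final calculation groups."""
    total = sum(int(np.count_nonzero(b)) for b in blocks)
    if total <= P * Q // 2 or len(blocks) <= 1:
        return [blocks]
    mid = len(blocks) // 2
    return split_group(blocks[:mid]) + split_group(blocks[mid:])
```

For example, a column of two 8 × 8 blocks holding 20 effective weights each (40 total, above the threshold of 32) is split into two calculation groups of one block each.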
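Step S103 stores each non-zero weight together with its distance from the previous non-zero weight, and steps S104/S105 then walk that encoding so only the feature inputs paired with non-zero weights are ever fetched and multiplied. The following is a minimal software sketch of this zero-skipping scheme; it assumes a flat scan order and reads the claim's "interval number" as a count of skipped zero entries, which is an interpretation of the translated wording rather than a definitive reading.

```python
import numpy as np

def encode_group(blocks):
    """Step S103 (sketch): scan the unit blocks of a calculation group in
    order and record each non-zero weight with the number of zeros that
    separate it from the previous non-zero weight (its effective address)."""
    flat = np.concatenate([b.ravel() for b in blocks])
    encoded, zero_run = [], 0
    for v in flat:
        if v != 0:
            encoded.append((v, zero_run))  # (non-zero weight, zero gap)
            zero_run = 0
        else:
            zero_run += 1
    return encoded

def sparse_dot(encoded, inputs):
    """Steps S104/S105 (sketch): each (weight, gap) pair advances a read
    pointer past the skipped zeros, fetches the matching feature input,
    and accumulates the multiply-add; zero weights cost no computation."""
    acc, addr = 0.0, 0
    for value, gap in encoded:
        addr += gap               # skip the zero-weight positions
        acc += value * inputs[addr]
        acc = float(acc)
        addr += 1
    return acc
```

The design is the same idea as run-length or CSR-style compressed sparse encodings: storing gaps instead of absolute indices keeps each address field small, which is what makes the scheme attractive for the limited storage of a PE array.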
Priority Applications (1)
Application Number  Priority Date  Filing Date  Title 

CN202011552162.8A CN112286864B (en)  20201224  20201224  Sparse data processing method and system for accelerating operation of reconfigurable processor 
Publications (2)
Publication Number  Publication Date 

CN112286864A CN112286864A (en)  20210129 
CN112286864B true CN112286864B (en)  20210604 
Family
ID=74426070
Family Applications (1)
Application Number  Title  Priority Date  Filing Date 

CN202011552162.8A Active CN112286864B (en)  20201224  20201224  Sparse data processing method and system for accelerating operation of reconfigurable processor 
Country Status (1)
Country  Link 

CN (1)  CN112286864B (en) 
Families Citing this family (1)
Publication number  Priority date  Publication date  Assignee  Title 

CN113076083B (en) *  20210604  20210831  南京后摩智能科技有限公司  Data multiplyadd operation circuit 
Citations (3)
Publication number  Priority date  Publication date  Assignee  Title 

US8972958B1 (en) *  20121023  20150303  Convey Computer  Multistage development workflow for generating a custom instruction set reconfigurable processor 
CN110737628A (en) *  20191017  20200131  辰芯科技有限公司  reconfigurable processor and reconfigurable processor system 
CN110888832A (en) *  20180910  20200317  东京计器株式会社  Reconfigurable processor 
Legal Events
Date  Code  Title  Description 

PB01  Publication  
SE01  Entry into force of request for substantive examination  
GR01  Patent grant 