We analyze the parallelism in direct iterative total least squares (TLS) methods for the forward-backward linear prediction (FBLP) of multidimensional signals. It turns out that the most complex operation in all direct iterative methods is the matrix-matrix multiplication Y = C^T C X, where C^T C is the Gramian (Gram matrix) of the matrix C and X is the iteration matrix that approximates a subset of the right singular vectors. In our case the matrix C is Toeplitz-block, which enables a very efficient way to compute Y. The standard algorithm for computing Y is based on the associative law (i.e., Y = C^T (C X)) and on embedding each Toeplitz block of C into a circulant matrix. We present another algorithm based, first, on computing a generator of the Toeplitz-like Gramian C^T C from its displacement structure and, second, on applying the generalized Gohberg-Semencul formula to the multiplication Y = C^T C X. The time complexity of the computational phase of the standard algorithm is O(m log m), while that of the new one is O(n log n), so a considerable gain in speed can be expected whenever m >> n, i.e., when the Toeplitz blocks are tall and thin. Parallel versions of both algorithms were implemented on an IBM Power4 Regatta parallel computer using the Message Passing Interface (MPI) library. We discuss profiling results, which confirm the expected gain in speed for the computational phase of the new algorithm.
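The standard algorithm's core step, multiplying a Toeplitz block by a vector via circulant embedding and the FFT, can be sketched as follows. This is a minimal NumPy illustration of the well-known technique, not the paper's implementation; the function name and interface are ours.

```python
import numpy as np

def toeplitz_matvec(c, r, x):
    """Compute T @ x for the m-by-n Toeplitz matrix T with first column c
    (length m) and first row r (length n, r[0] == c[0]), by embedding T
    into an (m+n-1)-point circulant and using the FFT."""
    m, n = len(c), len(r)
    # First column of the embedding circulant: c followed by the reversed
    # first row, excluding the shared corner entry r[0].
    v = np.concatenate([c, r[:0:-1]])
    # Zero-pad x to the circulant size m + n - 1.
    xpad = np.concatenate([x, np.zeros(m - 1)])
    # Circulant-vector product = circular convolution = pointwise FFT product.
    y = np.fft.ifft(np.fft.fft(v) * np.fft.fft(xpad))
    return y[:m].real
```

Applying this to each Toeplitz block of C, and then of C^T, evaluates Y = C^T (C X) one column at a time at O(m log m) cost per block, which is the complexity quoted for the standard algorithm.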
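To see why a short generator for the Gramian exists at all, note that the Gramian of a Toeplitz matrix has low displacement rank: with Z the down-shift matrix, G - Z G Z^T has rank at most 4 regardless of n, even though G itself is generically full rank. The following NumPy check illustrates this; it is our own sketch, not the paper's code.

```python
import numpy as np

def toeplitz(c, r):
    """Dense m-by-n Toeplitz matrix with first column c and first row r."""
    m, n = len(c), len(r)
    return np.array([[c[i - j] if i >= j else r[j - i] for j in range(n)]
                     for i in range(m)])

rng = np.random.default_rng(0)
m, n = 50, 8                       # tall and thin, as in the paper's setting
c = rng.standard_normal(m)
r = np.concatenate([c[:1], rng.standard_normal(n - 1)])
T = toeplitz(c, r)

G = T.T @ T                        # the n-by-n Gramian
Z = np.eye(n, k=-1)                # down-shift matrix
D = G - Z @ G @ Z.T                # displacement of the Gramian
# rank(D) <= 4: G is determined by O(n) generator data.
print(np.linalg.matrix_rank(D))
```

Because the displacement has rank at most 4, G can be stored as a short generator, and a Gohberg-Semencul-type representation reduces each product G x to a few FFT-based multiplications with triangular Toeplitz factors of size n, which is the source of the O(n log n) cost of the new algorithm.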