Document Description

A STUDY OF INVERSE SHORT-TIME FOURIER TRANSFORM Bin Yang Chair of System Theory and Signal Processing, University of Stuttgart, Germany
ABSTRACT In this paper, we study the inverse short-time Fourier transform (STFT). We propose a new vector formulation of STFT. We derive a family of inverse STFT estimators and a least squares one. We discuss their relationship and compare their performance with respect to both additive and multiplicative modiﬁcations to STFT. The inﬂuence of window, overlap, an

Document Share

Document Tags

Document Transcript

A STUDY OF INVERSE SHORT-TIME FOURIER TRANSFORM
Bin Yang
Chair of System Theory and Signal Processing, University of Stuttgart, Germany
ABSTRACT
In this paper, we study the inverse short-time Fourier trans-form (STFT). We propose a new vector formulation of STFT.We derive a family of inverse STFT estimators and a leastsquares one. We discuss their relationship and compare theirperformance with respect to both additive and multiplicativemodiﬁcations to STFT. The inﬂuence of window, overlap, andzero-padding are investigated as well.
IndexTerms
—
inverseshort-timeFouriertransform, time-frequency analysis, least squares methods
1. INTRODUCTION
The short-time Fourier transform (STFT) is a useful tool toanalyze nonstationary signals and time-varying systems. Ithas been successively applied to a large number of signalprocessing applications like time-frequency analysis, speechenhancement, echo cancelation, and blind source separation.The problem of the inverse STFT (ISTFT) is to constructthe srcinal time domain sequence from a modiﬁed STFT.The modiﬁcation could be additive or multiplicative. Sur-prisingly, ISTFT has not been studied systematically in theliterature. Almost all papers up to now including many recentpublications use a heuristic overlap-add method for comput-ing ISTFT [1, 2, 3, 4]. It combines the results of the inverseFourier transform of different sections by overlap-add. In[5], a weighted overlap-add procedure has been proposed, butwithout a guideline about the optimum weighting.STFT is a linear operation. Due to section overlappingandzero-padding, theresultofSTFTcontainsalargernumberof samples than the srcinal time domain sequence. Clearly,ISTFT is a linear overdetermined problem for which the leastsquares (LS) approach is well applicable [6].In this paper, we take a detailed look at ISTFT. In com-parison to [6], our contributions are: 1) We derive a novelcompact vector formulation of STFT. It facilitates the work-ing with STFT and is also useful for other purposes. 2) Wederive a family of heuristic ISTFT estimators including theclassical overlap-add method. 3) We also derive a LS esti-mator. While [6] assumed a continuous-frequency represen-tation, we focus on the discrete-frequency Fourier transform.In addition, we allow zero-padding and non-equally spacedfrequencies. 4) We clarify the relationship between differ-ent ISTFT estimators. 5) We compare their performance withrespect to both additive and multiplicative modiﬁcations toSTFT.6)Finally, wealsostudytheinﬂuenceofwindow, over-lap, and zero-padding on ISTFT.The following notations are used in the paper. Matricesand column vectors are represented by boldface and under-lined characters. The superscript
T
and
H
denote transposeand Hermitian transpose, respectively.
·
is the Euclideanvector norm.
diag(
·
)
describes a diagonal matrix.
2. VECTOR FORMULATION OF STFT
We ﬁrst derive a new vector formulation of STFT. This willbe the basis for further investigations.A time domain sequence
x
(
n
) (
n
≥
0)
is divided into
M
overlapping sections. Each section has the length
N
. Theshift length from section to section is
1
≤
S
≤
N
. Theoverlap length between two adjacent sections is
N
−
S
, seeFig. 1. Let
x
m
= [
x
(
mS
)
,x
(
mS
+1)
,...,x
(
mS
+
N
−
1)]
T
∈
C
N
(1)denote the
m
-th section (
0
≤
m
≤
M
−
1
) of
x
(
n
)
.
M
such overlapping sections with a shift length
S
contain a totalnumber of
J
= (
M
−
1)
S
+
N
samples
x
(0)
,...,x
(
J
−
1)
with
J
≤
MN
. Let
x
= [
x
(0)
,x
(1)
,...,x
(
J
−
1)]
T
∈
C
J
(2)be the vector of all involved samples. The relationship be-tween
x
and
x
m
in (1) is described by
[
x
T
0
,x
T
1
,...,x
T M
−
1
]
T
=
O
x
(3)where
O
=
⎡⎢⎢⎢⎢⎢⎣
I
N S
I
N
...
I
N
⎤⎥⎥⎥⎥⎥⎦
∈
R
MN
×
J
(4)is a so called overlap matrix. It consists of
M
identity ma-trices
I
N
along the main diagonal. Each identity matrix isshifted by
S
columns to the right with respect to the aboveone. The left-multiplication of
x
by
O
corresponds to its seg-mentation into
M
overlapping sections as shown in Fig. 1.Each section
x
m
is now weighted by a real valued win-dow
w
(
n
)
of the same length. We assume that
w
(
n
)
is non-zero and thus invertible. Then this windowed sequence is ap-pended by
K
−
N
zeros before the
K
-point Fourier trans-form
X
m
(
k
)
is calculated at
K
discrete frequencies
ω
k
(0
≤
k
≤
K
−
1)
with
K
≥
N
. In general,
ω
k
do not need to
35411-4244-1484-9/08/$25.00 ©2008 IEEE ICASSP 2008
Authorized licensed use limited to: Nanyang Technological University. Downloaded on August 11, 2009 at 02:38 from IEEE Xplore. Restrictions apply.
x
⇓
N N N
x
0
x
1
x
2
SS
Fig. 1
.
O
x
: Divide a sequence
x
into overlapping sectionsbe equally spaced like in discrete Fourier transform (DFT). Inmatrix-vector-notation, the Fourier transform is described by
X
m
= [
X
m
(0)
,...,X
m
(
K
−
1)]
T
=
FPW
x
m
∈
C
K
(5)with
W
= diag(
w
(0)
,...,w
(
N
−
1))
∈
R
N
×
N
,
P
=
I
N
0
(
K
−
N
)
×
N
∈
R
K
×
N
,
(6)
F
= [
e
−
jω
k
n
]
0
≤
k,n
≤
K
−
1
∈
C
K
×
K
.
W
is a diagonal window matrix.
P
consists of an
N
×
N
identity and a
(
K
−
N
)
×
N
zero matrix and describes thezero-padding.
F
is the square Fourier transform matrix withthe element
e
−
jω
k
n
at the
k
-th row and
n
-th column, bothcounted starting from zero.We stack the Fourier transforms of all
M
sections into asingle column vector
X
= [
X
T
0
,X
T
1
,...,X
T M
−
1
]
T
∈
C
MK
.
(7)By using (3), (5), and the notation
A
⊗
B
= [
a
ij
B
]
i,j
forthe Kronecker tensor product, we ﬁnally obtain the linear re-lationship between the complete time domain sequence
x
andits windowed zero-padded STFT
X X
=
⎡⎢⎣
FPW
...
FPW
⎤⎥⎦⎡⎢⎣
x
0
...
x
M
−
1
⎤⎥⎦
=
H
x,
H
= (
I
M
⊗
(
FPW
))
O
.
(8)The meaning of the different matrices in (8) is self-explained:
ã
O
: overlapped segmentation
ã
W
: windowing
ã
P
: zero-padding
ã
F
: discrete-frequency Fourier transform
ã
I
M
⊗
: section-by-section processing
3. A FAMILY OF HEURISTIC ISTFT ESTIMATORS
Starting from (8), we now study how to compute the corre-sponding ISTFT.If
X
denotes the exact STFT of a given time domain se-quence
x
, we can uniquely determine
x
. In practical applica-tions, however, the STFT is often modiﬁed before it is trans-formed back into the time domain. In this case, there is ingeneral no time domain sequence
x
whose STFT matches ex-actly
X
because
X
contains more elements than
x
if
S < N
(true overlapping) or
K > N
(true zero-padding). The prob-lem is overdetermined. The question is how to compute areasonable estimate
ˆ
x
for
x
from
X
?A simple but heuristic idea is based on the following ob-servation: If we multiply both sides of (8) by the
J
×
MK
matrix
O
H
(
I
M
⊗
A
p
)
from left with
A
p
=
W
p
−
1
P
H
F
−
1
∈
C
N
×
K
, weobtainaccordingto
(
A
⊗
B
)(
C
⊗
D
) =
AC
⊗
BD
[7] the following result
O
H
(
I
M
⊗
A
p
)
X
=
O
H
(
I
M
⊗
A
p
)(
I
M
⊗
(
FPW
))
O
x
=
O
H
(
I
M
⊗
(
A
p
FPW
))
O
x
=
O
H
(
I
M
⊗
W
p
)
O
x,
(
p
= 0
,
1
,
2
...
)
.
(9)
W
p
is a diagonal matrix containing the diagonal elements
w
p
(
n
)
. This motivates the following family of estimates
ˆ
x
p
=
D
−
1
p
O
H
(
I
M
⊗
A
p
)
X,
D
p
=
O
H
(
I
M
⊗
W
p
)
O
.
(10)We call it the
p
-ISTFT estimate.Step Meaning1)
F
−
1
inverse Fourier transform of
X
m
2)
P
H
keep only the ﬁrst
N
samples of
F
−
1
X
m
3)
W
p
−
1
weight these
N
samples by
w
p
−
1
(
n
)
4)
I
M
⊗
do the above computations for all
M
sections5)
O
H
overlap-add of the results of all sections6)
D
−
1
p
ﬁnal normalization
Table 1
. Steps of
p
-ISTFTNote that Eq. (10) has a simple interpretation. Table1 summarizes all steps of
p
-ISTFT. The ﬁrst four steps areeasy to understand. According to the deﬁnition of the over-lap matrix
O
in (4), the operation
O
H
z
in the 5-th step with
z
= [
z
T
0
,...,z
T M
−
1
]
T
∈
C
MN
describes the well knownoverlap-add of the
M
sections
z
m
. The shadowed areas inFig. 2 represent the overlap-add regions. Fig. 3 illustratesthe matrix overlap-add operation
O
H
(
I
M
⊗
U
)
O
where
U
is any
N
×
N
matrix. The result is a
J
×
J
matrix in whichadjacent matrices
U
along the main diagonal overlap and addin an
(
N
−
S
)
×
(
N
−
S
)
area. Clearly, if
U
=
W
p
=diag(
w
p
(0)
,...,w
p
(
N
−
1))
isdiagonal, thenthematrix
D
p
=
O
H
(
I
M
⊗
W
p
)
O
is diagonal as well
D
p
= diag(
d
p
(0)
,...,d
p
(
J
−
1))
∈
R
J
×
J
.
(11)The diagonal elements
d
p
(
n
)
are obtained by overlap-adding
M
sections of
w
p
(0)
,...,w
p
(
N
−
1)
as in Fig. 2. The lastnormalization step in Table 1 involves thus only
J
scalar di-visions.
3.1. LS inverse STFT estimator
Since ISTFT is a linear overdetermined problem, it is naturalto apply the least squares (LS) approach to the signal model
3542
Authorized licensed use limited to: Nanyang Technological University. Downloaded on August 11, 2009 at 02:38 from IEEE Xplore. Restrictions apply.
N N N
z
0
z
1
z
2
++
SS
O
H
z
⇓
Fig. 2
.
O
H
z
: Overlap-add of vectors
UUU
N N SS
...
Fig. 3
.
O
H
(
I
M
⊗
U
)
O
: Overlap-add of matrices(8). After some calculations, we obtain the LS estimate
ˆ
x
LS
= (
H
H
H
)
−
1
H
H
X
=
D
−
1LS
O
H
(
I
M
⊗
(
WP
H
F
H
))
X,
D
LS
=
H
H
H
=
O
H
(
I
M
⊗
U
)
O
,
U
=
WP
H
F
H
FPW
.
(12)It is referred to as LS-ISTFT.
4. DISCUSSIONS
Below we discuss the relationship between different ISTFT:
ã
Theclassicaloverlap-add method[1,2,3,4]isaspecialcase of our
p
-ISTFT with
p
= 1
.
ã
The LS solution from [6] is identical to
p
-ISTFT with
p
= 2
. It assumed, however, a continuous-frequencyFourier transform.
ã
OurLS-ISTFTisderivedforthediscrete-frequencycase.In general, i.e. for arbitrary discrete frequencies
ω
k
,
U
and thus
D
LS
in (12) are not diagonal. In this case, thematrixinversion
D
−
1LS
isexpansive and
ˆ
x
LS
differsfrom
ˆ
x
p
and the LS solution from [6].
ã
In the special case of DFT with equally spaced discretefrequencies
ω
k
=
2
πkK
(0
≤
k
≤
K
−
1)
,
F
H
F
=
K
I
K
,
U
=
K
W
2
,
F
H
=
K
F
−
1
, and
ˆ
x
LS
simpliﬁesto
ˆ
x
2
.Besides the choice of the estimator, the inverse STFT alsodepends on a number of other factors like the modiﬁcation of the STFT, the choice of the window, the shift length
S
, andthe number of appended zeros
K
−
N
. Below we study theinﬂuence of these factors on ISTFT.
ã
If
X
denotestheexactSTFTofatimedomainsequence
x
, then both
ˆ
x
p
and
ˆ
x
LS
return the same
x
. This can beeasily shown be combining (8) and (10) as well as (12).
ã
If we use the rectangular window
w
(
n
) = 1
, then allestimators
ˆ
x
p
return the same result. In this case,
W
=
I
N
and
ˆ
x
p
in (10) does not depend on
p
.
ã
If there is no overlap between adjacent sections (
S
=
N
), thenallestimators
ˆ
x
p
returnthesameresultaswell.In this case,
O
=
I
MN
and
ˆ
x
p
simpliﬁes to
(
I
M
⊗
(
W
−
1
P
H
F
−
1
))
X
.We expect a small difference between various
p
-ISTFT esti-mators if the modiﬁcation to STFT or the deviation of
w
(
n
)
from the rectangular window or the overlap length is small.Concerning the computational complexity, all
p
-ISTFTestimators and the DFT-based version of LS-ISTFT are com-parable to the classical overlap-add method. For each section,one inverse Fourier transform has to be computed. The onlydifference is the use of different windows
w
p
−
1
(
n
)
for thescaling of the inverse Fourier transform and the computationof the ﬁnal normalization sequence
d
p
(
n
)
in (11).
5. EXPERIMENTS
Inthissection, wecomparetheperformanceofdifferentISTFTestimators. We use clean speech signals of roughly 6 sec-ond duration sampled at 16 kHz in our experiments. For eachspeech signal
x
, we compute its STFT and modify it by ad-ditive noise or multiplicative masking. Then we compute thesignal estimate
ˆ
x
for different ISTFT estimators. Since DFT(FFT) is used,
ˆ
x
LS
is identical to
ˆ
x
2
and will not be consid-ered separately. The performance measure is the signal-to-distortion ratio (SDR) in the time domain after the signal re-construction
SDR
p
= 10log
10
(
x
2
/
x
−
ˆ
x
p
2
) dB
.
(13)In particular, we focus on
SDR
2
of the LS estimator and theperformance loss of other estimators
ΔSDR
p
= SDR
2
−
SDR
p
with respect to that. For statistical averaging, we use8 different utterances from a male and a female speaker andcalculate the average values of
SDR
2
and
ΔSDR
p
over these8 speech signals. The default parameter set for the Fouriertransform is hamming window, window length
N
= 512
,shift length
S
=
N/
2
, and FFT length
K
=
N
unless other-wise stated.
Additive distortion
First we consider additive distortion.
X
is modeled as
H
x
+
N
.
H
x
is the exact STFT of
x
and
N
contains realizationsof zero-mean uncorrelated random variables with equal vari-ance
σ
2
. The variance is chosen to achieve a certain signal-to-noise ratio (SNR) in the time-frequency domain
SNR =10log
10
(
H
x
2
/
N
2
)
. For this particular signal model, itis well known that the LS estimator
ˆ
x
LS
= ˆ
x
2
achieves thesmallest variance among all linear unbiased estimators like
ˆ
x
p
. In addition, the variance of
ˆ
x
p
increases linearly with
σ
2
.The default value of SNR is 10 dB. In Table 2, we usedifferent invertible windows. The window names are takenfrom MATLAB. In Table 3, we vary the overlap length
N
−
S
.In Table 4, we change the FFT length
K
.
Multiplicative distortion
Inasecondseriesofexperiments, wemultiplytheexactSTFT
3543
Authorized licensed use limited to: Nanyang Technological University. Downloaded on August 11, 2009 at 02:38 from IEEE Xplore. Restrictions apply.
with a mask, a typical operation in underdetermined blindsource separation based on the spareness of speech signalsin the time-frequency domain [4]. Due to limited space, weonly consider a binary mask. The mask is determined suchthat
p
mask
percentage of the time-frequency points with thehighest amplitude of
H
x
pass the mask. In all other time-frequency points, the binary mask is zero.In Table 5, we choose different values for
p
mask
. Theother parameters are identical to the default choice in the pre-vious subsection. In Table 6 to 8, we keep
p
mask
= 30%
andrepeat the same performance study with respect to window,overlap, and zero-padding as previously.window
SDR
2
ΔSDR
0
ΔSDR
1
ΔSDR
3
rectwin 13.00 0 0 0kaiser (
β
=0) 13.00 0.01 0 0hamming 12.47 9.26 1.08 0.10triang 7.22 20.30 0.33 0.03
Table 2
. Additive distortion: Varying window
N
−
S
SDR
2
ΔSDR
0
ΔSDR
1
ΔSDR
3
0 0.23 0 0 0
N/
8
5.00 0.82 0.17 0.07
N/
4
8.50 3.97 0.55 0.09
N/
2
12.47 9.26 1.08 0.10
3
N/
4
15.27 9.13 1.15 0.20
Table 3
. Additive distortion: Varying overlap length
K
SDR
2
ΔSDR
0
ΔSDR
1
ΔSDR
3
N
12.47 9.28 1.08 0.10
1
.
5
N
14.25 9.30 1.10 0.10
2
N
15.50 9.34 1.10 0.10
Table 4
. Additive distortion: Varying zero-padding
Observations
We draw the following conclusions from the above experi-ments:
ã
In additive distortion,
ˆ
x
LS
= ˆ
x
2
is the best one as ex-pected. In binary masking,
ˆ
x
LS
is almost the best oneamong the tested linear estimators, but not always be-cause of
ΔSDR
3
<
0
in Table 6 and 8.
ã
ˆ
x
0
has a poor performance. The weighting of the in-verse Fourier transform with
w
−
1
(
n
)
ampliﬁes the dis-tortion if
w
(
n
)
is close to zero. This happens at bothend of each section, particularly for the window “tri-ang”.
ã
The classical overlap-add method
ˆ
x
1
is always worsethan the LS one with a SDR loss of up to 1 dB. Inter-estingly, it is also worse than
ˆ
x
3
.
ã
There is almost no performance difference between
ˆ
x
2
and
ˆ
x
3
.
ã
As expected, the larger the overlap length is, the largerthe SDR improvement of
ˆ
x
2
is.
ã
Zero-padding has a fairly small impact to the perfor-mance difference.
ã
In order to achieve a good absolute performance
SDR
2
,a large overlap is highly desirable resulting in a largernumber of noisy samples. Also zero-padding is advan-
p
mask
SDR
2
ΔSDR
0
ΔSDR
1
ΔSDR
3
10% 21.32 5.48 0.29 0.0720% 28.16 6.31 0.43 0.0630% 33.60 7.41 0.60 0.0640% 38.74 8.75 0.82 0.06
Table 5
. Multiplicative distortion: Varying mask window
SDR
2
ΔSDR
0
ΔSDR
1
ΔSDR
3
rectwin 29.18 0 0 0kaiser (
β
=0) 29.40 0.18 0.09
−
0
.
09
hamming 33.60 7.41 0.60 0.06triang 33.52 21.74 0.36 0.06
Table 6
. Multiplicative distortion: Varying window
N
−
S
SDR
2
ΔSDR
0
ΔSDR
1
ΔSDR
3
0 23.51 0 0 0
N/
8
28.85 1.59 0.45
−
0
.
03
N/
4
31.33 3.88 0.43 0.09
N/
2
33.60 7.41 0.60 0.06
3
N/
4
34.49 5.87 0.51 0.02
Table 7
. Multiplicative distortion: Varying overlap length
K
SDR
2
ΔSDR
0
ΔSDR
1
ΔSDR
3
N
33.60 7.41 0.60 0.06
1
.
5
N
33.90 5.95 0.38 0.06
2
N
34.02 5.72 0.35 0.06
Table 8
. Multiplicative distortion: Varying zero-paddingtageous though the improvement is much smaller.
ã
Foradditivedistortion, ﬂatwindowslike“rectwin, kaiser,hamming”arebetter. Formultiplicativedistortion, “ham-ming, triang” windows are preferred. Hence the ham-ming window seems to be a good compromise.
6. REFERENCES
[1] J. B. Allen and L. R. Rabiner, “A uniﬁed approach toshort-time Fourier analysis and synthesis,”
Proc. IEEE
,vol. 65, pp. 1558–1564, 1977.[2] L. R. Rabiner and R. W. Schafer,
Digital processing of speech signals
, Prentice-Hall, 1978.[3] “Inverse short-time FFT,” MATLAB Signal ProcessingBlockset Documentation.[4] S. Araki, H. Sawada, et al., “Underdetermined blindsparse source separation for arbitrarily arranged multi-ple sensors,”
Signal Processing
, vol. 87, pp. 1833–1847,2007.[5] R. Crochiere, “A weighted overlap-add method of short-time Fourier analysis/synthesis,”
IEEE Trans. Acous-tics, Speech, and Signal Processing
, vol. 28, pp. 99–102,1980.[6] D.W.GrifﬁnandJ.S.Lim, “Signalestimationfrommod-iﬁed short-time Fourier transform,”
IEEE Trans. Acous-tics, Speech, and Signal Processing
, vol.32, pp.236–243,1984.[7] A. Graham, Ed.,
Kronecker Products and Matrix Calcu-lus with Applications
, John Wiley & Sons, 1981.
3544
Authorized licensed use limited to: Nanyang Technological University. Downloaded on August 11, 2009 at 02:38 from IEEE Xplore. Restrictions apply.

Similar documents

Search Related

We Need Your Support

Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks

SAVE OUR EARTH

We need your sign to support Project to invent "SMART AND CONTROLLABLE REFLECTIVE BALLOONS" to cover the Sun and Save Our Earth.

More details...Sign Now!

We are very appreciated for your Prompt Action!

x