From 3624e70d939115c89fae823cdc02125b38c6a482 Mon Sep 17 00:00:00 2001
From: drowe67
Date: Mon, 7 Sep 2009 04:40:59 +0000
Subject: [PATCH] added lots of doco to the README

git-svn-id: https://svn.code.sf.net/p/freetel/code@41 01035d8c-6547-0410-b346-abe4f91aad63
---
 codec2/README.txt | 221 ++++++++++++++++++++++++++++++++++++++--------
 codec2/TODO.txt   |   4 +
 2 files changed, 190 insertions(+), 35 deletions(-)

diff --git a/codec2/README.txt b/codec2/README.txt
index a62ee9a4..3b638924 100644
--- a/codec2/README.txt
+++ b/codec2/README.txt
@@ -1,19 +1,45 @@
-README.txt for codec2
-David Rowe
-21 August 2009
+Codec2 - Open Source Low Bit Rate Speech Codec
+===============================================

 Introduction
 ------------

-codec2 is a open source low bit rate speech codec designed for
+Codec2 is an open source low bit rate speech codec designed for
 communications quality speech at around 2400 kbit/s. Applications
-include low bandwidth HF/VHF digital radio.
+include low bandwidth HF/VHF digital radio. It fills a gap in open
+source, free-as-in-speech voice codecs below 5000 bit/s.

 Status
 ------

-Unquantised encoder and decoder running under Linux/gcc, pitch
-estimator untested.
+1. Unquantised encoder (sinenc) and decoder (sinedec) running under
+   Linux/gcc, pitch estimator untested. The decoder (sinedec) is a
+   test-bed for various modelling and quantisation options - these are
+   controlled via command line switches.
+
+2. LPC modelling working and first pass LSP vector quantiser working
+   at 37 bits/frame with acceptable voice quality. Lots could be done
+   to improve this.
+
+3. Phase model developed that uses 0 bits for phase and 1 bit/frame
+   for voiced/unvoiced decision.
+
+The Source Code
+---------------
+
+Browse:
+  http://freetel.svn.sourceforge.net/viewvc/freetel/codec2/
+
+Check Out:
+  $ svn co https://freetel.svn.sourceforge.net/svnroot/freetel/codec2 codec2
+
+The Mailing List
+----------------
+
+For any questions, comments, support, suggestions for applications by
+end users and developers:
+
+  https://lists.sourceforge.net/lists/listinfo/freetel-codec2

 Quick Start
 -----------
@@ -21,36 +47,154 @@ Quick Start
 To encode the file raw/hts1a.raw to a set of sinusoidal model
 parameters (src/hts1.mdl) then decode to a raw file src/hts1a_uq:

-$ cd src
-$ make
-$ ./sinenc ../raw/hts1a.raw hts1.mdl 300 ../pitch/hts1a.p
-$ ./sinedec ../raw/hts1a.raw hts1.mdl -o hts1a_uq.raw
-$ play -f s -r 8000 -s w ../raw/hts1a.raw
-$ play -f s -r 8000 -s w hts1a_uq.raw
+    $ cd codec2/src
+    $ make
+    $ ./sinenc ../raw/hts1a.raw hts1.mdl 300 ../pitch/hts1a.p
+    $ ./sinedec ../raw/hts1a.raw hts1.mdl -o hts1a_uq.raw
+    $ play -f s -r 8000 -s w ../raw/hts1a.raw
+    $ play -f s -r 8000 -s w hts1a_uq.raw

 Plan
 ----

-[X] Milestone 0 - Project kick off
-[X] Milestone 1 - Baseline unquantised codec running under Linux/gcc
-[ ] Milestone 2 - Spectral amplitudes quantised
-[ ] Milestone 3 - Prototype 2400 bit/s codec
+    [X] Milestone 0 - Project kick off
+    [X] Milestone 1 - Baseline unquantised codec running under Linux/gcc
+    [ ] Milestone 3 - Prototype 2400 bit/s codec
+        [X] Spectral amplitudes modelled and quantised
+        [X] Phase and voicing model developed
+        [ ] Pitch estimator integrated into encoder
+        [ ] Frame rate/quantisation schemes for 2400 bit/s developed
+        [ ] Refactor to develop encoder/decoder functions
+        [ ] Test phone call over LAN
+
+How it Works
+------------

-LPC Modelling
--------------
+Speech is modelled as a sum of sinusoids:

-$ ./sinedec ../raw/hts1a.raw hts1a.mdl --lpc 10 - hts1a_lpc10.raw
+  for(m=1; m<=L; m++)
+    s[n] += A[m]*cos(Wo*m*n + phi[m]);
+
+The sinusoids are multiples of the fundamental frequency Wo
+(omega-naught), so the technique is known as "harmonic sinusoidal
+coding".
+
+For each frame, we analyse the speech signal and extract a set of
+parameters:
+
+  Wo, {A}, {phi}
+
+Where Wo is the fundamental frequency (also known as the pitch), {A}
+is a set of L amplitudes and {phi} is a set of L phases. L is
+chosen to be equal to the number of harmonics that can fit in a 4 kHz
+bandwidth:
+
+  L = floor(pi/Wo)
+
+Wo is specified in radians normalised to 4 kHz, such that pi radians
+= 4 kHz. The fundamental frequency in Hz is:
+
+  F0 = (8000/(2*pi))*Wo
+
+We then need to encode (quantise) Wo, {A}, {phi} and transmit them to
+a decoder which reconstructs the speech. A frame might be 10-20ms in
+length so we update the parameters every 10-20ms (100 to 50 Hz
+update rate).
+
+The speech quality of the basic harmonic sinusoidal model is pretty
+good, close to transparent. It is also relatively robust to Wo
+estimation errors. Unvoiced speech (e.g. consonants) is well modelled
+by a bunch of harmonics with random phases. Speech corrupted with
+background noise also sounds OK; the background noise doesn't
+introduce any unpleasant artifacts.
+
+As the parameters are quantised to a low bit rate and sent over the
+channel, the speech quality drops. The challenge is to achieve a
+reasonable trade off between speech quality and bit rate.
+
+Challenges
+----------

-Discuss why LPC modelling works so well when Am recovered via RMS method
-(Section 5.1 of thesis). Equal area model of LPC spectra versus harmonic?
-Seems to work remarkably well, especially compared to sampling. SNRs up to
-30dB on female frames.
+The tough bits of this project are:

-m=1 harmonic problem for males when LPC modelled. The amplitude of this harmonic
-comes up by as much as 30dB after LP modelling as (I think) LPC spectra must
-have zero derivative at DC. This means it's poor at modelling very low freq
-harmonics which unfortunately ear is very sensitive to. Consider automatic
-lowering for 20dB of this harmonic or maybe an extra few bits to quantise error.
+1. Parameter estimation, in particular pitch (Wo) detection.
+
+2. Reduction of a time-varying number of parameters (L changes with Wo
+   each frame) to a fixed number of parameters required for a fixed
+   bit rate. The trick here is that {A} tend to vary slowly with
+   frequency, so we can "fit" a curve to the set of {A} and send
+   parameters that describe that curve.
+
+3. Discarding the phases {phi}. In most low bit rate codecs phases
+   are discarded, and synthesised at the decoder using a rule-based
+   approach. This also implies the need for a "voicing" model as
+   voiced speech (vowels) tends to have a different phase structure to
+   unvoiced (consonants). The voicing model needs to be accurate (not
+   introduce distortion), and relatively low bit rate.
+
+4. Quantisation of the amplitudes {A} to a small number of bits while
+   maintaining speech quality. For example 30 bits/frame is 1500
+   bits/s, a large part of our 2400 bit/s "budget".
+
+Is it Patent Free?
+------------------
+
+I think so - much of the work is based on old papers from the 60s, 70s
+and 80s and the PhD thesis work [2] used as a baseline for this codec
+was original. A nice little mini project would be to audit the
+patents used by proprietary 2400 bit/s codecs (MELP and xMBE) and
+compare.
+
+Proprietary codecs typically have small, novel parts of the algorithm
+protected by patents. However the designers of these codecs rely on
+large bodies of existing, public domain work. The patents cover
+perhaps 5% of their codec algorithms. They did not invent most of the
+algorithms they use in their codec. Typically, the patents just cover
+enough to make designing an interoperable codec very difficult. These
+also tend to be the parts that make their codecs sound good.
+
+However there are many ways to make a codec sound good, so we simply
+need to choose and develop other methods.
+
+Is Codec2 compatible with xMBE or MELP?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Nope - I don't think it's possible to build a compatible codec without
+infringing on patents or access to commercial in confidence
+information.
+
+LPC Modelling Notes
+-------------------
+
+$ ./sinedec ../raw/hts1a.raw hts1a.mdl --lpc 10 - hts1a_lpc10.raw
+
+The blog post [4] discusses why LPC modelling works so well when Am is
+recovered via the RMS method (Section 5.1 of thesis). Equal area model of
+LPC spectra versus harmonic seems to work remarkably well, especially
+compared to sampling. SNRs up to 30dB on female frames.
+
+m=1 harmonic problem for males when LPC modelled. The amplitude of
+this harmonic is raised by as much as 30dB after LPC modelling as (I
+think) LPC spectra must have zero derivative at DC. This means it's
+poor at modelling very low freq harmonics which unfortunately the ear
+is very sensitive to. Consider automatically lowering this harmonic
+by 20dB, or maybe spending a few extra bits to quantise the error.
+
+Phase Modelling Notes
+---------------------
+
+Phase modelling makes no attempt to match harmonic phases at frame
+edges. This area would be worth experimenting with, as it could cause
+roughness. Then again it might be responsible for effective mixed
+voicing modelling.
+
+Unvoiced speech can be represented well by random phases and a Wo
+estimate that jumps around randomly. If Wo is small the number of
+harmonics is large which makes the noise less periodic and more noise
+like to the ear. With Wo jumping around phase tracks are
+discontinuous between frames which also makes the synthesised signal
+more noise like and prevents time domain pulses forming that the ear
+is sensitive to.

 Octave Scripts
 --------------
@@ -76,19 +220,26 @@ speex - LSP quantisation code borrowed from Speex for testing
 src - C source code
 octave - Matlab/Octave scripts
 pitch - pitch estimator output files
-raw - speech files in raw format (16 bits signed linear)
+raw - speech files in raw format (16 bits signed linear 8 kHz)
 unittest - Unit test source code
 wav - speech files in wave file format

 References
 ----------

-[1] Introductory Blog Post:
-    http://www.rowetel.com/blog/?p=128
-
-[2] Bruce Parens introducing the project:
+[1] Bruce Perens introducing the project concept:
     http://codec2.org/

-[3] Davids Thesis, used for baseline algorithm:
+[2] David's PhD Thesis, "Techniques for Harmonic Sinusoidal Coding",
+    used for baseline algorithm:
     http://www.itr.unisa.edu.au/~steven/thesis/dgr.pdf

+[3] Open Source Low rate Speech Codec Part 1 - Introduction:
+    http://www.rowetel.com/blog/?p=128
+
+[4] Open Source Low rate Speech Codec Part 1 - Spectral Magnitudes:
+    http://www.rowetel.com/blog/?p=128
+
+[5] Open Source Low rate Speech Codec Part 2 - Phase and Male Speech:
+    http://www.rowetel.com/blog/?p=128
+
diff --git a/codec2/TODO.txt b/codec2/TODO.txt
index 4da876f3..c8faad82 100644
--- a/codec2/TODO.txt
+++ b/codec2/TODO.txt
@@ -14,3 +14,7 @@ TODO for codec2
 [ ] Is BW expansion and Rk noise floor required before LSP quant
 [ ] test split VQ to make sure no silly errors
     + for example test MSE or index historgram for training data
+
+[ ] Go through papers referenced in thesis and credit various
+    techniques to papers.
+    + sure there was something about zero phase synthesis in those papers
-- 
2.25.1