Started on Tuesday. After doing some administrative stuff, we started talking about possibilities for the project. It seems like we'll be using Reinforcement Learning (RL) applied to real time processing of musical audio signals, rather than MIDI data. We haven't yet decided what exactly the application will be, although we talked about a few possibilities:
We've decided to use Pd (Pure Data) and/or Max/MSP, two similar real time object-based audio and midi programming environments, to implement our system. One can write externals for these systems in C in order to implement algorithms that would be too slow, impossible or very messy to program in the graphical environment. Pd and Max/MSP have very similar frameworks for developing externals, due to their common origin in the work of Miller Puckette. Pd has the advantage of being open source (and therefore free), but lacks the polish and documentation of Max/MSP (a commercial product).My week's goal was to write a simple external for one of the systems. First I was intending to write a Max external, but I found out quickly that the version of Max installed on my OS X system is 4.2.1, which only supports externals compiled in the CFM binary format, which is a holdover from Mac OS 9 and earlier, and which neither gcc or xcode can produce. The latest version of Max/MSP (4.5) supports the newer Mach-O binary executable format. We are in the process of obtaining Max/MSP 4.5, but in the meantime I've decided to pursue Pd, since it supports Mach-O-style externals, as well as using flext...
There exists a very nicely written and documented C++ wrapper layer called flext, which allows one to write externals once for flext, then compile them specifically for Pd or Max/MSP on any platform: Linux, Mac OS X, or Windows. I've decided to learn to write externals with flext rather than writing directly for the Pd or Max API for a couple reasons: (1) it is obviously nice to have the platform independence, (2) it actually seems cleaner and easier than either of the native APIs, and (3) if we later decide to use Max/MSP rather than Pd (once we have the 4.5 version), it will be relatively painless to switch. Here are some of my notes on installing and compiling Pd and the flext libraries on OS X
There are several binary distributions of Pd available for OS X, which
at first I was wary of, because I knew I'd need the source tree, and
wasn't comfortable with the idea of having a big old Pd.app in
/Applications/. Fortunately this works out fine as I'll show.
(There are some Pd installers for OS X that have been made which include a
collection of externals libraries, but I chose to use the one available
straight from Miller Puckette's website (linked above) in order to keep it
simple.) The install couldn't be easier; just drop Pd.app into the
applications folder and you're ready to go. I wanted a UNIX-style install
of Pd in order to install the flext library and headers though, which I actually
accomplished simply by symlinking:
ln -s /Applications/Pd.app/Contents/Resources/ /usr/local/pd
I am really starting to like OS X! Anyway, now on to installing flext.
Downloaded the flext source distribution from the flext link above,
untar'ed into /usr/src. Then I just modified the config file and pointed
/usr/local/pd, etc. and the compile went fine. I chose
/usr/local/pd/flext to install the flext library to.
I downloaded the flext-intro.pdf document as well as the archive to
tutorial examples from the flext web site. For some reason the signal
based externals won't compile, which is something we might eventually need
to contend with. The non-signal based externals all compiled fine, and once
installed in the
/usr/local/pd/extra directory, the example
patches from the tutorial distribution run fine.
I found the answer to the signal externals not compiling...need to add
-framework veclib to the linker flags for OSX 10.3...found this on the PD mailing list archives (http://lists.puredata.info/pipermail/pd-list/2004-01/016979.html)
I wrote a simple external of my own from scratch called
accum. It has two inlets and two outlets. Inlet 0 receieves
floats and adds them to a running sum. Inlet 1 sets the variable
maxiumum. When the running total exceeds maximum, the object
sends a bang to Outlet 1 and the running sum gets set to the sum fmod
To compile my external, I modified the makefile provided with the tutorial examples. There are a bunch of compiler options for Mac hardware that I don't understand; the compile options specified in the Makefile in the Linux tutorial package are much simpler. Once I had the Makefile set up properly for each platform, I was able to compile my external without any changes for Pd on either platform; very nice!
Here is my source code for
accum: [ accum-main.cpp ] There are comments in the source.
NOTE: When creating an external for flext/PD with a variable argument list, it seems that if you create the object with arguments in a patcher, you can't then create another object without arguments....strange.
I made a second external using Flext for PD, called
[Mport]. I wanted to deal with MIDI data in this one. To keep it simple, I use input from the
[notein]/[noteout] objects, along with
[stripnote] which eliminates note-offs. The object uses attributes
which is a consistent system provided by Flext for getting and setting,
um, attributes in your object. They can be made persistent (saved with
the patch) which is nice, and have other nice features.
NOTE: I found out the hard way that
#define FLEXT_ATTRIBUTES 1 must come before
Mport keeps internal variables for the current MIDI pitch and velocity, and attributes for Portamento Time (port-time), Speed (speed), and Duration Ratio (dur-ratio). Upon receiving a new velocity/pitch pair in its two inlets, it generates a sequence of notes, beginning with the pitch/velocity of the previous note, and ending with those of the note just received, a sort of discrete MIDI portamento.
In addition to using attributes, the big thing here was using the Timer class in Flext. It is actually fairly simple:
Timer.Delay(int s) causes the timer to trigger in s
seconds. Similarly there are Now(), Periodic(int s) and Reset() methods
for the Timer. When that timer triggers, it calls the method according
FLEXT_ADDTIMER(timer,m_method) called in the
object's constructor. In order to play my sequence of notes, I had my
timer callback method call itself in a loop until it has played all the
notes in the array I set up before triggering the timer.
The source code, fairly well commented, is here: [ Mport-main.cpp ]
It is definitely worth mentioning that the name of an object MUST match the name of the direectory of its source (i.e. Mport/main.cpp corresponds to an object given PD name Mport). This frustrated me for a while.
Talked this morning about project ideas. What we are currently thinking is setting up a reinforcement learning agent to control a granular synthesis engine that will generate "responses" to live audio input.
Link: SndObj Library Homepage. SndObj (say: "sound objects") is a library of C++ classes for various signal processing and synthesis routines. Flext supports it nicely, so developing signal externals for PD/flext is made a a lot easier with SndObj. The library includes a class for doing simple Synchronous granular synthesis, so I was thinking at first of using this for our application.
Investigating more about granular synthesis and how it relates to our task... Here is a good overview of granular synthesis that I found. Synchronous granular synthesis refers to a fixed linear relation on the length and windowing of the grains. Asynchronous is more complex and irregular since the timing and windowing of each grain might be controlled individually. This has its advantage in the sheer amount of control over the texture produced, but the disadvantage is the number of parameters that one must control continuously! Many approaches have been made to this problem by composers such an Iannis Xenakis, et al. by using statistical/stochastic models for parameter control. It seems like a natural extension to try and apply a learning technique to the problem of parameter control of asynch. gran. synthesis.
So on the practical side, it seems a lot more feasable to build a suitable GS engine in PD than to program one as an external, using SndObj or otherwise. I was looking at syncgrain~ which is a Pd/flext object that basically just ports the functionality of the SndObj SyncGrain class. Then I discovered a 32-voice asynchronous granular synthesizer implemented entirely as a Pd patch...very nice. It's called Particlechamber. It offers basic random variation for all the basic granular parameters (you set the mean and deviation, basically). Eight separate sample buffers to choose from (although as provided it only uses one at a time...a simple modification lets you choose grains randomly from the separate buffers).
I went on today to build the beginnings of my own simple one voice
buffer granulator. It skips around the buffer at random, and generates
grains between and upper and lower bound (in milliseconds). I am
currently applying a triangular window to each grain. I would like to
give it the ability to overlap grains; this will require more than one
[tabplay~] object, used in a polyphonic manner. Grain overlap (crossfade) within a single voice is something that Particlechamber doesn't do, although you do get a sort of overlap (less strictly controlled) with running numerous voices at once.
Also started to read Microsound by Curtis Roads (cool website, check it out). Seems like a really great book, and very useful.
Read most of Microsound through chapter 3. I'm going to skip
to the section about file granulation since that's what we'll be
dealing with, although the pure synthetic granular stuff is neat.
Since I wrote last, I've read most of Microsound
chapters 1 (an overview), 2 (history of microsound), 3 (intro to GS), 5
(transformations of microsound) and most of 6 (windowed analysis and
transformation). I've begun to implement some basic GS algorithms in
Pd. One problem I ran into last week was multiple voice overlap. I
found an abstraction called not-quite-poly
that does simple polyphonic voice management and have been using it to
manage the generation of grains. I simple wrote my single voice grain
player (which accepts its arguments as a list in the first inlet) as an
abstraction that satisfies the template needed by nqpoly, and was able
to generate overlapping streams of grains. I also switched to using a
Hanning window, which is described by .5(1-cos(2*pi*t)) for t in [0,1].
This gave the grains a much smoother envelope.
Monday: worked on my GS program, experimenting with various controls on the grains. Trying to come up with a reliable way of controlling the frequency range of the grains. My very crude solution is to provide a "transposition" factor for each grain, which simply scales the number of samples played in the fixed duration of the grain. So for example, if transp = 1.5, and grain dur = 1000ms = 44100 samp, the final values for playing the grain will be 44100*1.5 samples played in 1000ms. This actually turns out to control the percieved freqency spread quite well, although of course it is relative to the frequency content of the input material being granulated. Having an absolutely controlled freq. range would obviously require some tricky analysis of the grains. Update: I've changed the control units to octave transposition up and down.
Wednesday: Today I would like to think about analysis. Yesterday afternoon I read a paper by Miller Puckette called "Low-dimensional parameter mapping using spectral envelopes" (ICMC 2004 Proceedings). In the paper he describes a system for analyzing a live input stream and mapping it to a parametrized curve in "timbre space", some high dimensional vector space. There is also a library of synthetic or stored and pre-analyzed sounds (though storing waveforms for quick access will require a lot of memory). These sounds are also assigned values in the same timbre space. In real time, the system then synthesizes sound to match the live input. The simplest way of doing this is, for each analysis interval to simply find the synthetic sound with the minimum euclidean distance from the live input timbre.
Particularly valuable to think about right now are Puckette's criteria for success. He mentions: (1) Perceptibility: changes in input timbre should correspond to audible changes in output. (2) Robustness: Slight variation in input shouldn't cause wild changes in output (and the converse). (3) Continuity: Smoothly changing input shouldn't create discontinuities in output. (4) Correspondence: When there is an audible change in the input, the change in the output should be compatible (softer->softer for example). (5) Fast response: musician must feel in control of the system.
I think that many of these apply in some way to our question of a RL system using granular synthesized sounds as output and live audio as input. At this point, I think it is an adequate challenge to use RL to have a performer "teach" the GS to create a certain sort of texture. I am still wondering whether it is necessary for the computer to be granulating the actual audio input, or using it to control synthesis parameters...
I also am going to look at doing some basic FFT analysis of the granular output stream, with different window sizes. The idea is to use this information for the RL state, consisting of a representation of the frequency content of a recent timeframe of synthesized output. Then maybe this information could be combined with/compared to a similar windowed analysis of the live input signal to come up with a useful ACTION. Some FFT based analysis techniques worth looking into seem to be:
[bonk~]object, which outputs strength of certain freqency bins. I'm going to look more into that.
Thursday:Worked on my GS patch a lot. I don't really know where to go with the RL thinking so I'm leaving it alone until next week. I have turned my patch (currently called GranStretch) into something very usable for composition/improvisation. I made several examples yesterday of some possibilites. Most of them came after I realized that there are great possibilities of input granulation by sequentially moving the grain starting index through the buffer, rather than constantly random positions being chosen. I stumbled across pitch-preserving time stretching, which was exciting, I wasn't even trying to create that ability. So a summary of the synth capabilities is in order. There are essentially two parts; I'll start from the lowest level:
[oats2v~]>It is built to work with the
nqpoly4polyphonic voice manager patch. It accepts its parameters as a list in the first inlet, and a bang on loading the abstraction (unused currently). The parameter list is: grain start index (ms), grain duration (ms), grain playback speed scaling (natural units), pan (from -45 to 45 degrees) and source table number. Uses the
[xplay~]object from Thomas Grill's xsample~ library. (Update: I have a version that just uses tabplay~ that works just as well since I couldn't get xsample to work properly on my linux machine.) Upon recieving a start time, throw~'s the grain audio output to OUTL and OUTR summing busses. Has one outlet, which bangs when the voice is done playing the current grain (which tells
[nqpoly4]that the voice is free).
[pan~]external, which accepts an angle in the range [-45,45] degrees. This is just constant power panning, which I should code as an abstraction because I don't know how standard the external is (I didn't have it on my linux machine). (An abstraction is simply a Pd patch with inlets and outlets, saved as a separate file. It can then be invoked the same way an external is, as an object. The advantage over requiring an external is that the abstraction can simply be packaged with the main patch, without requiring a certain external to be installed [externals have to be loaded at application start time].)
[freeverb~]object at the output adds some useful reverb possibilities (note: freeverb~ on linux doesn't seem to work properly on my machine). The reverb tail "freeze" is a neat effect. I've also set up a recording function using
[sfwrite~](note that this outputs a RAW sound file; I've been using the freeware program SoundHack to set the header information and save as AIFF).
I uploaded the example pieces I made using my GranStretch patch last week. Here they are, annotated. The source sounds for these came from the speech accent archive at George Mason University. The description from their site: "This site examines the accented [English] speech of speakers from many different language backgrounds reading the same sample paragraph." These proved to be interesting source material not only for the frequency variation of the speech sounds, but also for the possibility of granulating several of them in parallel, so grains of similar sounds appear correlated in time in several of these examples.
We've been talking about using the analysis of an input audio signal to create some sort of control stream for the granulation parameters. For these experiments, we'll probably use some fixed sound material for granulation, rather than using the live audio input itself in order to simplify the experiment. Before we consider using RL to make granular action decisions based on some audio input, we decided that it would be good to see what we can do in terms of direct parameter mapping from some parameters extracted from analysis.
I've also posted a few examples of grain clouds that I created using Curtis Roads's CloudGenerator program, which can perform synthetic GS or granulation, asynchronous or synchronous, with starting and ending values for frequency range, grain duration, grain density, etc. I did these several weeks ago but they are still somewhat instructive...
This week I focused more on the analysis end of things. The goal for right now is just to come up with some useful analysis that will produce a consistent control stream from some audio input (whether it's a file or live audio input). I have been investigating the spectral centroid, which I mentioned several weeks ago. It can be thought of as the "center of gravity" of the spectrum in a certain finite frame of analysis. It has been found to correspond to the intuitive "brightness" of a sound. I've found it to be fairly easy to work with, although it took me a while to realize how to implement it.
My implementation takes a signal input, applies a Hanning window to
it, then performs a real FFT. The modulus of the FFT output is taken
and written to an array (whose length is N/2 where N is the blocksize,
i.e. the window size of the analysis). I used the
object, which sends out a bang at the end of each DSP cycle. The bang
triggers an iteration through the array and accumulates two sums, one
which is the dot product of a vector of FFT bin numbers and a vector of
their respective amplitudes, and the other which is the dot product of
the vector (1,1,...,1) and the amplitudes. A bang then divides the
first by the second and outputs the centroid of that frame. Here are
And a sonification (just a sine wave oscillator following the centroid (which has been lowpassed to smooth it, and then scaled to an appropriate frequency range): birdSpectralCentroid.mp3
I made a post to the PD mailing list about other possible methods for feature extraction.... The replies are definitely worth reading through. Some of the statistical measures of the power spectrum could prove to be useful. although I don't know anything about how the correlate to perceived features in the audio.
We've decided on experimental setup that will be our work for the
rest of the summer. We've gotten Judy's flext port of the reinforcement
learning code working fairly well. The external is currently called
and implements the Sarsa-lambda algorithm for Sutton & Barto, based
on the implementation of another student of Judy's. I added flext
attribute functionality, so we can change the learning parameters on
the fly without recompiling the object.
Our thinking started with the pole-balancer task from the Sutton & Barto's "Reinforcement Learning". This RL problem involves a cart on a track which can move left and right, learning to balance a vertical pole which is hinged to the cart. From the book:
The objective here is to apply forces to a cart moving along a track so as to keep a pole hinged to the cart from falling over. A failure is said to occur if the pole falls past a given angle from vertical or if the cart reaches an end of the track. The pole is reset after each failure. This task could be treated as episodic, where the natural episodes are the repeated attempts to balance the pole. The reward in this case could be +1 for every time step on which failure did not occur, sos that the return at each time would be the number of steps until failure....
The first mini-experiment with RL in Pure Data, superficially similar to the pole-balancing problem, was to get the RL object to learn a hidden rule (one particular desirable state). The states are integers [0, 9]. The two possible actions are: increment the state by one; or decrement the state by one. A reward of 0 is given when the current state is the desired state (the hidden rule), and -1 is given otherwise. Each time a reward of -1 is received, the RL algorithm "fails" and begins a new episode, setting the state to a random value and starting again.
The second setup, shown below, uses the same states and actions, but gives a reward of 1 upon hitting the target, and a reward of 0 otherwise. Upon getting a zero reward, the algorithm continues by generating a new action. When a reward of 1 is received, the episode is finished (so the actual reward "1" is used instead of the expected reward from the algorithm's value function) but we still continue generating actions. In summary, the state never gets reset to a random starting value, making this more of a continuous learning setup.The test setup looks like this:
Below is the attributes dialog box, brought up by right-clicking on
GrControl and selecting "Properties". Here we can update the learning
parameters, which much be followed by an
[init( message to the first inlet to actually update the params inside the algorithm. The
[reset( message resets the Q values to random low values and clears the eligibility traces.
The next step will be to apply this to some audio output from the granular synth patch ("Gran"). We decided it would be best to start out with a very simple state space and actions space, similar to the experiment above. The action space will be simply increasing or decreasing the grain transposition setting in Gran. The states will be read from a spectral centroid measurement of the granular output. This seems to be a good choice since in my earlier experiments I found the centroid to be strongly correlated to the transposition setting (as one would expect). A positive (or zero) reward will be given if the centroid at a given time step is within the desired range of values. A negative reward will be given if the centroid at a given time step is outside of that range. We hope that the RL program will be able to find its way to the hidden rule of the desired centroid range, and possibly that the learning process itself will generate interesting musical material.
In order to have better control over the RL experiment, I created an altered version of my Gran patch, called GranWave, which uses wavetable oscillators to generate pure grains, rather than granulation techniques. This is especially useful when I simply generate sine grains, because I can set the grain transposition upper and lower bounds to certain values, then actually expect the output to fall within those ranges of the frequency spectrum (no harmonics to deal with or anything). This works very well in practice, and will make it possible to do much more controlled experiments.
I dicovered that my previous spectral centroid measurement patch actually doesn't work properly (for whatever reason). I had a hunch that it didn't, since it's measurement was sensitive to amplitude changes in an otherwise constant sound input. I discovered that it is much simpler and cleaner to implement in Python using pyext. Now I am confident that the measurement works properly. I've set up a new spectral centroid visualization patch that shows the instantaneous power spectrum with a range bar displaying the centroid right alongside it. Feeding the input from GranWave into it, is is very satisfying to watch the spectroid indicator consistently seek right to the visual as well as aural center of the frequency distribution.
I also devised another power spectrum measurement that I'm calling the "spread", which I also implemented in the same python script. I wanted a measurement that would correspond to the "bandwidth" of the grain cloud. At first I was thinking that a sort of "standard deviation", using the centroid as the mean, would do it, but of course that doesn't indicate the width of the spread around the centroid. After thinking about it for a bit, I decided to try the following calculation:
Where x = x0,...,xN-1 is the signal vector of length N, and i' is the spectral centroid (in units of FFT bins). The idea was to use the distance of each bin from the centroid bin, weighted by the amplitude in that bin. This way, the more bins that are far away from the centroid bin and have high amplitudes, the higher the total measure will be. It turned out to be a good guess which correlates quite well with the random transposition range of GranWave.
One thing to note about this "spread" measurement is that the way it is currently set up, the spread is linear over the FFT bins, and therefore doesn't correspond well to what we hear as the subjective pitch spread. I think it wouldn't be too hard to change this to a "constant Q" spread that would give values that made sense with respect to pitch rather than frequency. I think that might be unnecessary complication for the experiment right now though, so I'm going to stick with this frequency-linear version.
I'm working on applying our RL experiment setup to granular sound.
I'm planning on using 100 states, so first about that. There was a bit
of difficulty in getting the
[GrControl] object to work with 100 states. I realized that I had to change a constant in RLControl.h:
#define MaxState 10 to
#define MaxState 100
Then it is just a matter of setting the numAgents attribute in the
GrControl object to 100, which you can do either with the "properties"
dialog box, or by creating the object as
[GrControl @numAgents 100]
in which case that value is saved with the patch. When I had numAgents
exceeded MaxState, the code was generating some strange action values,
like constantly outputting "45", etc.
Next I found that if the actions are only +1,0,-1, a state space of 100 values is difficult to explore efficiently. I added two more actions, +5 and -5, which allow for quicker exploration.
I'm going to use almost the same setup as we used in the simple "number finding" example (described in Week Nine). The state will range in [0, 100]. Each state corresponds to a 220.5Hz wide subband of the frequency spectrum. I am keeping it as linear divisions of the spectrum for simplicity's sake, although it would be more intuitive to divide it logarithmically.
Each signal block, the centroid is measured (in the GranAnalyze
patch). The GranRL patch receives that value and checks whether it is
within a certain range (settable by number boxes). If it is within the
range, a reward of "1" is sent, otherwise a reward of "0". The
object then first gets the state (the current setting of the
transposition control), followed by the reward. It outputs one of five
actions, which are adjustments of 0,+1,-1,+5,or -5 to the transposition
control. The process is then repeated.
When I got all this to work properly, it succeeded in finding the optimal setting fairly quickly.