[Clam-devel] Re: Attacking your core goal

Mon Jul 14 03:35:27 PDT 2008

I'm moving the discussion to the list. Context: Jun is planning the steps
until the final term. Topics: Aggregator Python scripts, MusicBrainz, Qt
interfaces.

On Friday 11 July 2008, JunJun wrote:
> David,
> Hi,
> The aggregator.py need some script as input. I failed in finding such a
> script as a reference. What is that script like?

Just take a look at AggregatorTest.py for examples.

It is a pity that the class is not fully covered by tests, because an example
of execution would also be very clarifying.

Each line of a script specifies a command (always 'copy'), the index of the
origin XML file (you may aggregate several sources), the scope and name of
the descriptor you want to take from the origin, and the scope and name of
the descriptor you want to create on the target.
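
To make that description concrete, here is a rough sketch of reading such a
script. It is not the actual aggregator.py parser: the line layout
`copy <index> <Scope>::<Name> <Scope>::<Name>`, the function name
parse_script and the descriptor names are all assumptions made up for the
illustration. The real syntax is whatever AggregatorTest.py shows.

```python
from typing import List, NamedTuple


class CopyCommand(NamedTuple):
    source_index: int   # index of the origin XML file
    source_scope: str
    source_name: str
    target_scope: str
    target_name: str


def parse_script(text: str) -> List[CopyCommand]:
    """Parse a hypothetical aggregation script, one 'copy' command per line."""
    commands = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # blank lines carry no command
        fields = line.split()
        if len(fields) != 4 or fields[0] != "copy":
            raise ValueError(
                f"line {lineno}: expected 'copy N Scope::Name Scope::Name'")
        _, index, source, target = fields
        source_scope, source_name = source.split("::")
        target_scope, target_name = target.split("::")
        commands.append(CopyCommand(int(index), source_scope, source_name,
                                    target_scope, target_name))
    return commands


# Invented example: take one descriptor from each of two sources.
example = """
copy 1 Song::Artist Song::Artist
copy 2 Frame::Energy Frame::Loudness
"""
commands = parse_script(example)
```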

> I planned several milestones of the core target. For the clarity, I just
> take concrete instances in the steps: milestone-1:
> build a new extractor, who calls a public web service, say musicBrainz.
> Below is an example result that is fetched from musicBrainz:
> <?xml version="1.0" encoding="UTF-8"?>
> <metadata xmlns="http://musicbrainz.org/ns/mmd-1.0#"
> xmlns:ext="http://example.org/ext-9.1#"> <track
> id="d6118046-407d-4e06-a1ba-49c399a4c42f">
>         <title>Silent All These Years</title>
>         <duration>253466</duration>
>         <ext:annotation>This is a <em>very</em> nice song.</ext:annotation>
>     </track>
> </metadata>

It seems like you have given MusicBrainz a lot of weight, and it is very much
a side goal, unless you have a reason I am missing. You already have a lot of
extractors to play with in order to get the aggregator working. Indeed, you
could even use the aggregator to disaggregate existing ones and do the tests
by joining them back. So my advice is to go for the aggregation itself and
the interface to control it. Once we have the core project done, adding
MusicBrainz would be a very good closure for your project.

In summary, MusicBrainz is interesting, and it is not that far out of reach,
but let's give more priority to the core.

> (double-quick Todo: figure out what the python-musicbrainz2 is about.)

It is an abstraction of the web service, so I guess that it hides in some way
the XML communication with the server. I also hope it deals with the
fingerprint computation.
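
To give an idea of the XML handling such a library presumably hides, here is
a small sketch that reads the response you quoted using only the Python
standard library. It does not use python-musicbrainz2 at all; the function
name track_summary and the keys of the returned dictionary are made up for
the example.

```python
import xml.etree.ElementTree as ET

# The response quoted earlier in this thread, as bytes.
RESPONSE = b"""<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns="http://musicbrainz.org/ns/mmd-1.0#"
          xmlns:ext="http://example.org/ext-9.1#">
    <track id="d6118046-407d-4e06-a1ba-49c399a4c42f">
        <title>Silent All These Years</title>
        <duration>253466</duration>
        <ext:annotation>This is a <em>very</em> nice song.</ext:annotation>
    </track>
</metadata>
"""

# Prefix-to-namespace map used for the element lookups below.
MMD = {"mmd": "http://musicbrainz.org/ns/mmd-1.0#"}


def track_summary(xml_bytes):
    """Return id, title and duration of the first track in the response."""
    root = ET.fromstring(xml_bytes)
    track = root.find("mmd:track", MMD)
    return {
        "id": track.get("id"),
        "title": track.findtext("mmd:title", namespaces=MMD),
        "duration": int(track.findtext("mmd:duration", namespaces=MMD)),
    }


summary = track_summary(RESPONSE)
```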

> milestone-2:
> A related mapping schema, e.g. MusicBrainzDescriptors.sc.
>
> milestone-3:
> wrap the result of the extractor as CLAM pool xml, according to the schema
> above.
>
> milestone-4:
>
> Here the files in ./scripts should be taken advantage of--
> Aggregate the new schema with an existing schema. For instance,
> MusicBrainzDescriptors.sc will be aggregated with
>
> CLAMDescriptors.sc
> Aggregate the new pool with the existing pool.
> Test on the annotator.
>
> (The below is pending)
> A graphical interface to build merging script.
> There is no training session about the graphical interface, right?

Yes, and that's another reason to deal with it now, while I am still around.
In any case, most people in the CLAM devel community, and most GSoC students,
are proficient in Qt, so you will be able to get help and advice from them.
Just another reason to get on IRC and post to the mailing list ;-)

So, regarding the planning, I would suggest the following:

First of all, I would create a new extractor that does the aggregation of
some fixed sources. Such a script will obey the same command line that
existing extractors use [1], and will reveal any pending problems (or not).

With such an extractor in place, let's figure out which information varies
when the sources change, and how to feed that variable information from the
Annotator to the script (extra parameters, config files...). It would be
constant within the Annotator but an input parameter for the script.

Then, once we know which information changes between different aggregations,
let's design an interface to configure it, replacing the current 'Extractor'
field of the Project. It should also provide means to store such a
configuration in the project.

When storing, we are writing into an aggregated pool, but the original pools
won't be written, so we need a write-back path, which currently doesn't exist.
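
As a sketch of what I mean by a write-back path: if the aggregation is a list
of copy commands, the same list can be walked in the opposite direction to
push edited values back into the sources. Pools are modelled here as plain
nested dictionaries (scope, then descriptor name), and both function names,
the 0-based source index and the sample values are assumptions for the
example, not existing CLAM code.

```python
def aggregate(sources, commands):
    """Build the aggregated pool by copying descriptors from the sources."""
    target = {}
    for index, s_scope, s_name, t_scope, t_name in commands:
        value = sources[index][s_scope][s_name]
        target.setdefault(t_scope, {})[t_name] = value
    return target


def write_back(target, sources, commands):
    """Push values of the aggregated pool back into their source pools."""
    for index, s_scope, s_name, t_scope, t_name in commands:
        sources[index][s_scope][s_name] = target[t_scope][t_name]


# Two invented source pools and the commands that join them.
sources = [
    {"Song": {"Artist": "Unknown"}},
    {"Song": {"Danceability": 0.7}},
]
commands = [
    (0, "Song", "Artist", "Song", "Artist"),
    (1, "Song", "Danceability", "Song", "Danceability"),
]

pool = aggregate(sources, commands)
pool["Song"]["Artist"] = "Edited by hand"   # the user edits the aggregate
write_back(pool, sources, commands)         # the edit reaches the source pool
```

Renaming in the commands is handled for free, since each command knows both
the source and the target name. What the sketch does not address is a source
that cannot be written at all.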

This will cover the minimum core part. Then we should stop again and 
prioritize the following aspects. Due to the current timing I will be very 
happy if you cover just 2 or 3 of them, happier if you end up doing more :-)

- Adding extractors (i.e. MusicBrainz; I like the steps you wrote for it)
- Doing a second iteration on the configuration interface (just having it
working is not enough, for sure)
- Writing an upload script as an example (MusicBrainz? Boca?)
- Being able to configure parameters on the extractors with a configuration
file
- A per-descriptor read-only flag in order to control which descriptors can
be modified (avoiding the write-back if not supported).
- A per-descriptor modified flag in order to control which descriptors must
be saved (avoiding the write-back if not needed).
- Addressing building a description from a blank sheet, i.e. what to do when
MusicBrainz has not found the song, or when you don't have an extractor for
it and want to generate it by hand.
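
The two per-descriptor flags in the list above could look roughly like this.
It is only a sketch of the idea: the Descriptor class, its set method and
descriptors_to_save are hypothetical names, and the pool is modelled as a
plain dictionary.

```python
from dataclasses import dataclass


@dataclass
class Descriptor:
    value: object
    read_only: bool = False   # source does not support write-back
    modified: bool = False    # value changed since it was loaded

    def set(self, new_value):
        """Change the value, honouring the read-only flag."""
        if self.read_only:
            raise PermissionError("descriptor is read-only")
        if new_value != self.value:
            self.value = new_value
            self.modified = True


def descriptors_to_save(pool):
    """Names of the descriptors that actually need to be written back."""
    return sorted(name for name, d in pool.items() if d.modified)


# Invented pool: one editable descriptor, one coming from a read-only source.
pool = {
    "Artist": Descriptor("Unknown"),
    "Danceability": Descriptor(0.7, read_only=True),
}
pool["Artist"].set("Edited by hand")
pending = descriptors_to_save(pool)
```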

> One more question: the Project.GetExtractor(), where is it?

Project is a DynamicType. I thought that we saw them before; if not, you have
been lucky to deal with CLAM code for two months without having to deal with
them :-) They are just a kind of Component that may or may not have a given
attribute. Attributes are declared with macros that conveniently expand into
code for getters, setters, an interface to add and remove them, XML
storage... 'Extractor' is an attribute and 'GetExtractor' is the generated
getter.
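
If it helps, here is a loose Python analogy of what those macros do. It is
not CLAM code and it skips many details of the real DynamicType; it only
shows why you cannot find a hand-written GetExtractor anywhere. The helper
dynamic_attribute and the 'Has' accessor are made up for the illustration.

```python
def dynamic_attribute(cls, name):
    """Generate Get/Set/Add/Remove/Has accessors for 'name' on cls."""

    def has(self):
        return name in self._present

    def add(self):
        self._present.setdefault(name, None)

    def remove(self):
        self._present.pop(name, None)

    def get(self):
        if name not in self._present:
            raise AttributeError(f"{name} has not been added")
        return self._present[name]

    def set_(self, value):
        if name not in self._present:
            raise AttributeError(f"{name} has not been added")
        self._present[name] = value

    accessors = {"Has": has, "Add": add, "Remove": remove,
                 "Get": get, "Set": set_}
    for prefix, function in accessors.items():
        setattr(cls, prefix + name, function)


class Project:
    def __init__(self):
        self._present = {}   # attributes currently instantiated


# Plays the role of the attribute declaration macro.
dynamic_attribute(Project, "Extractor")

project = Project()
project.AddExtractor()
project.SetExtractor("MyExtractor")
```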

BTW, I read in your blog that you got Sebastian's danceability algorithm
working. Congratulations. Of course we would like it in CLAM :-) The
algorithm gives greater numbers for less danceable excerpts, so the inverted
result is not a bug in your code. Really confusing, I agree.

Regards.

-- 
David García Garzón
(Work) dgarcia at iua dot upf anotherdot es
http://www.iua.upf.edu/~dgarcia