Monday, April 16, 2007

16 April 2007: Dollars

Working on UIMA right now, which isn't particularly interesting - the core product seems to be rather like GATE except slightly better organised. The idea is that you feed in a document or ten, and it will annotate the text of said document according to whatever you want it to pick out - sentences, nouns, titles, specific words, even grammatical constructs from which you might be able to determine the meaning of the document. Then we write it out to some kind of file (probably RDF in our case), stick it in a triplestore and call it 'semantic annotation'.

The advantage of 'semantic' annotation over regular full-text indexing is, well, say you wanted to look up documents that were about, or contained a reference to, the current US President. Instead of doing the standard search on the word 'bush' and coming up with all manner of references to gardens, rock bands and the broadcast centre for the BBC World Service, you do a semantic search and get only references to, say, 'person:Bush', which narrows your results down to a couple of US Presidents, a Florida Governor and the singer of 'Wuthering Heights'. Maybe the tagger is smarter: maybe you could in fact search for 'president:Bush' to narrow it still further. So anyway, that's what I'm up to.
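To make the 'bush' versus 'person:Bush' difference concrete, here is a minimal sketch in plain Java. It is not UIMA code and uses no UIMA API: the Mention class, the document ids, the type names and the 'type:Name' query syntax are all made up for illustration, and a real tagger would be producing the types rather than having them typed in by hand.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class SemanticSearchSketch {

    /** One tagged mention: source document, matched text, and the types a (hypothetical) tagger assigned. */
    public static final class Mention {
        final String docId;
        final String text;
        final Set<String> types;

        public Mention(String docId, String text, String... types) {
            this.docId = docId;
            this.text = text;
            this.types = new HashSet<>(Arrays.asList(types));
        }
    }

    /** Hand-written sample data standing in for real annotator output. */
    public static List<Mention> sampleMentions() {
        return Arrays.asList(
            new Mention("gardening-tips", "bush", "plant"),
            new Mention("rock-review", "Bush", "band"),
            new Mention("bbc-history", "Bush House", "building"),
            new Mention("gulf-war", "George H. W. Bush", "person", "president"),
            new Mention("us-news", "George W. Bush", "person", "president"),
            new Mention("florida-news", "Jeb Bush", "person", "governor"),
            new Mention("music-history", "Kate Bush", "person", "singer"));
    }

    private static boolean mentionsWord(Mention m, String word) {
        return m.text.toLowerCase(Locale.ROOT).contains(word.toLowerCase(Locale.ROOT));
    }

    /** Full-text style: every document whose text contains the word, whatever it refers to. */
    public static List<String> fullTextSearch(List<Mention> mentions, String word) {
        List<String> hits = new ArrayList<>();
        for (Mention m : mentions) {
            if (mentionsWord(m, word)) {
                hits.add(m.docId);
            }
        }
        return hits;
    }

    /** Semantic style: query is "type:Name"; both the type and the name must match. */
    public static List<String> semanticSearch(List<Mention> mentions, String query) {
        String[] parts = query.split(":", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected type:Name, got " + query);
        }
        String type = parts[0].toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Mention m : mentions) {
            if (m.types.contains(type) && mentionsWord(m, parts[1])) {
                hits.add(m.docId);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Mention> mentions = sampleMentions();
        System.out.println("bush           -> " + fullTextSearch(mentions, "bush"));
        System.out.println("person:Bush    -> " + semanticSearch(mentions, "person:Bush"));
        System.out.println("president:Bush -> " + semanticSearch(mentions, "president:Bush"));
    }
}
```

With this sample data the plain search returns every document, shrubbery included, while the typed queries return only the people, and then only the presidents. The searching is the easy part; all the hard work is in whatever assigns those types in the first place.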

Problem is, it's a pain to implement. No disrespect to the UIMA folks or the Apache developers, but for someone coming to this from cold I've found it hard enough to even get through the walk-throughs, let alone understand the processes enough to allow me to develop my own solutions. In fact, if I hadn't spent a good amount of time on GATE a couple of years ago, I might well be totally lost. Took me almost a full day just to get the thing fully and correctly set up in Eclipse, what with all the extra downloads and additional bits here and there.

The whole process reminded me of the first time I dabbled with J2EE and spent days just downloading stuff I didn't understand and faithfully following walk-through tutorials, performing Ant deployments and editing XML files that were a mystery to me. I came out far more confused than I'd gone in, and didn't understand why it was so much more complex than the PHP and ASP alternatives that I was used to. And the thing is, I'd made servlets before and fully understood them. Only later, after discovering an actual 'Hello World' JSP, was I able to build up from there and realise that half the stuff I'd done in the initial tutorial hadn't been required; it was just a way to keep the programmer from having to write code.

So that's my gripe. I don't want to be saved from writing code by means of complex helper applications and Eclipse plug-ins, OK? Give me the code, whether it's OO, script or even something XMLish. I understand stuff like variables, data types, arrays, branching, looping, subroutines, classes, recursion, regular expressions and all that stuff. I'm happy with client-server constructs, port connections, messaging. But when I'm faced with new stuff - and lots of new stuff at that - it throws me. Chunks of UIMA seem to be about nice GUI forms that you fill in to make the background XML file: why can't I just make the XML file myself? Answer: I can, but it's not how the tutorials take you, and I need tutorials since I'm coming to this pretty much as a starter. So I go through the tutorials, and then figure out how to translate it into the lower-level concepts I'm more familiar with.

So I'll battle on, but wish things could be simpler. One of the best things I programmed recently was dead simple: a short Java application to sit on the desktop and tell me the current exchange rate between the pound and the dollar, so we can figure out when is best to transfer funds to the US (for things such as student loan payments and root beer). It's short but elegant, multi-threaded, web-enabled and runs in the background, only popping up when important events occur, such as heading towards the two-dollar pound. Which, it seems, we may not be far off accomplishing as I write this...




I just wish the Reuters website didn't insist on calling it BST when they're actually giving the time in GMT.
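For the curious, the general shape of that sort of rate watcher can be sketched in a few lines. This is a reconstruction under assumptions, not the actual application: the RateWatcher class and its method names are invented here, the rate source is a pluggable DoubleSupplier fed with canned numbers rather than a real web lookup, and the 'pop-up' is just a line printed to the console.

```java
import java.util.Locale;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.DoubleSupplier;

public class RateWatcher {
    private final DoubleSupplier rateSource; // hypothetical: wherever the GBP/USD rate comes from
    private final double threshold;          // e.g. 2.00 for the two-dollar pound
    private final Consumer<String> alert;    // stand-in for the desktop pop-up
    private boolean above;                   // was the previous reading at or above the threshold?

    public RateWatcher(DoubleSupplier rateSource, double threshold, Consumer<String> alert) {
        this.rateSource = rateSource;
        this.threshold = threshold;
        this.alert = alert;
    }

    /** Take one reading and alert only when the rate crosses the threshold from below. */
    public synchronized void poll() {
        double rate = rateSource.getAsDouble();
        boolean nowAbove = rate >= threshold;
        if (nowAbove && !above) {
            alert.accept(String.format(Locale.ROOT,
                "GBP/USD reached %.4f (threshold %.2f)", rate, threshold));
        }
        above = nowAbove;
    }

    /** Poll on a background daemon thread. Note: if poll() throws, the schedule stops silently. */
    public ScheduledExecutorService start(long period, TimeUnit unit) {
        ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "rate-watcher");
            t.setDaemon(true);
            return t;
        });
        exec.scheduleAtFixedRate(this::poll, 0, period, unit);
        return exec;
    }

    public static void main(String[] args) throws InterruptedException {
        // Canned readings in place of a real feed; the last value repeats once they run out.
        double[] readings = {1.9850, 1.9920, 2.0010, 2.0040, 1.9980};
        int[] next = {0};
        RateWatcher watcher = new RateWatcher(
            () -> readings[Math.min(next[0]++, readings.length - 1)],
            2.00,
            System.out::println);
        ScheduledExecutorService exec = watcher.start(100, TimeUnit.MILLISECONDS);
        Thread.sleep(700);
        exec.shutdownNow();
    }
}
```

Keeping the rate source behind an interface is the one real design choice here: the threshold logic can be checked with made-up numbers, and the web lookup, with all its quirks about time zones, stays in one replaceable place.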
