Fixing UTF-8 encoding on my Tomcat websites

Just spent a few hours fixing some UTF-8 encoding problems on my blog.

I had a problem with non-ASCII characters being displayed incorrectly.

Turns out that I had a number of different problems to solve.

First I read through Cagan Senturk’s (very useful) UTF-8 Encoding fix (Tomcat, JSP, etc) post.

Fortunately I’d already read Joel Spolsky’s epic unicode post so I had the theory.

First off I needed to make sure all my JSPs had the correct pageEncoding at the top.

I also added the ‘Content-Type’ meta header to my template file.

Next I needed to wire in the EncodingFilter that Cagan so kindly provided.

That meant that non-ASCII characters in my JSPs rendered fine, but I still had two problems.

Any text that I entered into a form was still being screwed up, as was anything read from the database.

Stack Overflow had the solution (as usual) for the form input.

I needed to amend my Tomcat config to ensure my connector had URIEncoding="UTF-8" added to it.

That fixed the form input problem.
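
To illustrate why that setting matters, here is a minimal Java sketch (not Tomcat’s own code) that decodes the same percent-encoded bytes with two different charsets. FormDecodingDemo and its decode method are made-up names for illustration; URLDecoder is the standard JDK class. ISO-8859-1 is used as the “wrong” charset because, as far as I know, it was the default for URI decoding on older Tomcat versions.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Illustration only: a hypothetical helper, not part of Tomcat.
class FormDecodingDemo {

    // Decodes a percent-encoded value, interpreting the bytes in the given charset.
    static String decode(String encoded, String charsetName) throws UnsupportedEncodingException {
        return URLDecoder.decode(encoded, charsetName);
    }
}
```

Calling FormDecodingDemo.decode("%C3%A9", "UTF-8") gives the single character é, while decoding the same input as "ISO-8859-1" gives the two characters Ã©, which is the kind of garbling described above.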

That just left my Postgres database.

I first used ‘psql -l’ to see what encoding my database had.

It was set to ‘LATIN1’ – obviously it needed to be ‘UTF-8’.

To fix this I needed to drop and recreate my database.

Luckily this was only my local development database (my production one was already UTF-8) so that was simple enough.

Finally, after all that was done, I had proper UTF-8 support on my site.

And to prove it – here’s some non-ascii content from the UTF-8 SAMPLER website.

¥ · £ · € · $ · ¢ · ₡ · ₢ · ₣ · ₤ · ₥ · ₦ · ₧ · ₨ · ₩ · ₪ · ₫ · ₭ · ₮ · ₯

Adding Sphinx to your Java website with jsphinx

I’ve been using Sphinx on my FilmDev website to search users’ recipes and it’s been working really well.

So well that I wanted to add it to my Java websites too.

Setting up Sphinx on a Rails site is made very easy thanks to the Thinking Sphinx plugin.

Unfortunately there is no such plugin for Java so setting it up requires a little more work (though not too much).

First off I downloaded and configured Sphinx until I could call search on the command line and get results back from my database.

I then grabbed the sphinxapi.jar from the downloaded package and dropped it into my WEB-INF/lib directory.

The Java source for that jar is included in the downloaded package – plus a file called “test.java” that I used as the starting point for my own code.

The test.java code works but is fairly basic. I’ve expanded upon it a fair bit and put the result in a github project called jsphinx.

Feel free to grab this code and use and amend as appropriate for your own site.

I encourage you to share any changes you make by forking it on github.

Bear in mind it’s coded against the 0.9.9-release version; I have no idea if it works with the 2.0.1-beta version.

The code includes examples for doing weighting, filtering and ordering.

The command object also supports pagination.

I’m using the code on this blog right now and it works great.

The final thing in that code is something to handle delta indexing.

That’s enough of an involved topic to warrant another blog post…

Always be Debugging

Back when I was a Junior Programmer at my first programming gig I used to have conversations like this with my mentor.

Me: See, I enter the value here and then I get back the correct result. So, it works.

Mentor: Did you run it through in the debugger and see what the code was doing?

Me: I don’t need to, it works.

Mentor: It might be working just by luck, run it through and check.

I’d do as he said and run it through and around 50% of the time I’d spot something that could have gone wrong (we were writing C++, there was an awful lot that could go wrong).

So, gradually, I got into the habit of always stepping through any new code I had written in the debugger.

Now I’m more experienced I don’t do that so often (I tend not to make those silly mistakes so much – plus we have unit tests for picking up on said silly mistakes, oh yeah, and I no longer code in C++).

But one habit that has stayed with me is to always run my application in debug mode in Eclipse (I develop web apps in Java now).

That way if I do see something dodgy I can set a breakpoint, refresh the browser and immediately find out what’s going on.

The option to run my application in non-debug mode may as well not exist in Eclipse for me – I think I’ve only ever clicked that button by accident.

However, when I am asked by another team mate to help them out with a coding issue at some point we might have a conversation like this.

Me: Stick a breakpoint on that line there.

Team mate: I’m not running in debug mode, I’ll need to restart.

Me: Hmm, you should always run in debug mode, it’s great for things like this.

Team mate: Debug mode is slower.

Me: Restarting every time you have a problem isn’t exactly fast either.

I try not to go off on one at this point and lecture on the benefits of running in debug mode, but I do think that stepping through code and seeing what’s going on can help make people better programmers.

It’s not just the scenario described above.

If you’re running your application and see something not quite right it’s very easy to add a breakpoint and find out what’s going on.

If you’re not running in debug mode the temptation is to think “I’ll take a look at that later” and then forget all about it.

If you don’t already do this then you might want to try it for a week or two.

My betting is that you won’t switch back.

Automatically adding photos to Flickr photosets

I’m quite lazy when it comes to organising my photos into photosets on Flickr.

The whole process has always been a bit too manual for my liking.

It’s been on my todo list to find a way of automating it so this weekend I tried to do just that.

My thinking was to somehow link my photosets to the tags I already use for my photos. These are set when I upload from my photo database (photodb).

I know Flickr Set Manager already does this but I wanted something integrated into my photo database.

I’d already decided I didn’t want to store details of the photosets in my database as it would be a maintenance pain if I removed a set.

Plus I’d need to write some code and web pages for managing it all.

As I was pondering alternatives I had the idea to add some metadata to a photoset description on Flickr then parse and match on that in my app.

Cluttering up my set description with such metadata was a little messy but as you can’t add tags to sets it seemed the simplest way.

The basic plan then was to load all my photosets from Flickr when I chose to upload a photo.

Then parse the set descriptions for my metadata and match that against my photo’s tags.

This would then pre-select those sets in a multi-choice select box displayed on my upload page.

I could then de-select any incorrect choices and choose additional sets too.
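
The matching step can be sketched roughly as follows. This is an illustration under assumptions, not the code from my app: the “[tags: ...]” marker format, the PhotosetMatcher class and its method names are all made up for the example, and the set descriptions are passed in as a plain map rather than fetched from Flickr.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: matches a photo's tags against markers in set descriptions.
class PhotosetMatcher {

    // Assumed marker format: "[tags: holga, london]" anywhere in the description.
    private static final Pattern MARKER = Pattern.compile("\\[tags:([^\\]]*)\\]");

    // Extracts the lower-cased tags listed in a description's markers (empty if none).
    static Set<String> parseTags(String description) {
        Set<String> tags = new LinkedHashSet<>();
        Matcher m = MARKER.matcher(description);
        while (m.find()) {
            for (String raw : m.group(1).split(",")) {
                String tag = raw.trim().toLowerCase(Locale.ROOT);
                if (!tag.isEmpty()) {
                    tags.add(tag);
                }
            }
        }
        return tags;
    }

    // Returns the names of the sets whose marker shares at least one tag with the photo.
    static List<String> matchingSets(Map<String, String> descriptionsBySetName, Set<String> photoTags) {
        Set<String> wanted = new LinkedHashSet<>();
        for (String tag : photoTags) {
            wanted.add(tag.toLowerCase(Locale.ROOT));
        }
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, String> entry : descriptionsBySetName.entrySet()) {
            Set<String> setTags = parseTags(entry.getValue());
            setTags.retainAll(wanted); // keep only the tags the photo also has
            if (!setTags.isEmpty()) {
                matches.add(entry.getKey());
            }
        }
        return matches;
    }
}
```

The returned set names are the ones that would be pre-selected in the select box; anything without a marker is simply never matched.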

Once I had knocked together a little prototype it occurred to me that as I store lots of other metadata about my photos I could automatically add to sets based on all sorts of criteria.

So I set about feeding location data, camera and film information into it too.

The really nice part about this solution is that if I want to create a new set based on a particular location or new camera I just need to add an entry into the set description and it all “just works”.

I mentioned above my plan to use a multiple choice select box – I forgot to mention how much I hate them though. Luckily for me I’m not the only one who hates them.

This article talks about various alternatives – the best one for me being the jquery-asmselect plugin which provides a clean and elegant solution to the problem.

Of course, all this only works for newly added photos. What about the 2000+ photos I already have on Flickr?

I need some sort of batch process to re-organise my existing photos.

Fortunately I’ve already written something similar for tagging photos which I can re-use.

Finally, here’s how it looks on screen.



If you click through to the photo on Flickr you’ll see notes I’ve added to explain things in more detail.

cdripper

As a follow-up to yesterday’s post on importing CDs I’ve decided to add my CD importing code to github.

It lives here.

I suspect it’s more useful as source code for people to look at than as a project for people to use.

It’s a little bit too much “MeWare” at the moment for it to be generally useful.

Of course, it would not be that much work to make it more generally useful:

  • Put all the paths to binaries in a config file (they’re currently hard-coded in the source).
  • Write some documentation (it has JavaDoc but that is all).
  • Make it a little more flexible (it makes some assumptions about output files with specific names and in specific formats).
  • Test it on other platforms (it’s only ever been run on Linux and I’m not sure if all the binaries it requires exist on Windows).

Plus, there are a million and one programs out there to rip CDs (this is the bit where I’m supposed to justify writing yet another one…).

Anyway, the code is available to browse so feel free to “check it out”.

My convoluted CD importing system

And here’s how a CD gets from HMV on to my iPod…

Hmmm, it’s complicated alright.

But it needs to be as I have several goals I am trying to meet.


Never have to rip the CD more than once.

I rip to FLAC format which is lossless so I can recreate the original wav file at any time.

Be able to play my music back from a variety of sources.

I need to be able to play music on both my iPod(s) and through MPD.

The ripping and encoding bit is done with a Java app I wrote that encodes in parallel (it can handle Ogg Vorbis too if I want).

It’s not as automated as I’d like: I need to run it by hand when I insert a CD and choose the matching CDDB entry if there are multiple matches.

Oh, and here’s what I bought.

Thinking in Ruby

So, I’m learning Ruby (it only took me a year to get started!).

I’m working my way through Programming Ruby and doing a few different scripts to see what it can and can’t do.

Most of it seems fairly straightforward stuff and I’m liking what I’m seeing for the most part.

One of the things that crops up from time to time in examples in books and online is something along these lines:

print total unless total.zero?

That’s it, the “unless construct”.

I’ve seen this before in Perl and I’ve always avoided using it – I personally find it unintuitive so I always write my code in the if x do y style.

Do x unless y has always seemed a little, errr, backwards.

Seeing it again in Ruby I again decided I’d avoid using it and carry on as I had before – then I began to wonder if I was simply imposing my “Java programming style” on to my Ruby code.

It’s an easy enough trap to fall into, much like early C++ programmers wrapping their C-style static methods up in a class and thinking they were doing OO.

Thinking about it, most of my Perl code is written in a similar style to my Java – I always apply “use strict” and enable warnings, always put code into methods, almost always have a main method etc.

But hold on, am I writing Perl in a Java style and thereby restricting my ability with the language, or am I simply applying sensible practices to my Perl code?

My Perl code never really extended much beyond occasional scripts to process photos so I have no clear answer to that.

I hope that my Ruby coding will move beyond that (possibly into the realm of Ruby on Rails) so as it does I’ll have to constantly be asking myself if I am thinking in Java or thinking in Ruby.

Unchecked Exceptions

On our new project at work we’re using JPA sitting on top of Hibernate.

I’ve used Hibernate several times now and am familiar with it.

JPA is mostly similar in use but there are a few gotchas.

One that got me the other day was what happens when you write a query that you expect to return a single result.

In Hibernate I’d have called query.uniqueResult();

The Javadoc for that method says:

Convenience method to return a single instance that matches the query, or null if the query returns no results.

So, the query either returns my object or null (an exception is thrown if my query returns more than one result – fair enough).

I had to do something similar in JPA-land so I looked at its Query class.

It offered a similarly named method: query.getSingleResult();.

All good, I wrote my code, compiled it and restarted my application server.

Unfortunately, when I ran the code, it fell over with a NoResultException.

For my particular query, there were no results in our test database.

Fine, my code can deal with that, but clearly the JPA method works quite differently from the Hibernate version.

Its Javadoc says:

Execute a SELECT query that returns a single result.

Returns:

the result

Throws:

NoResultException – if there is no result


So, unlike the Hibernate version this one will throw an exception if the query returns no results.

Hmmmm, I think I prefer the Hibernate version.
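
One way to get the Hibernate-style behaviour back is to ask for a list and unwrap it yourself. The sketch below assumes the caller passes in the result of query.getResultList(); QueryResults and uniqueResultOrNull are hypothetical names, and it throws a plain IllegalStateException for multiple results (rather than a JPA exception type) so the example stays free of JPA dependencies.

```java
import java.util.List;

// Hypothetical helper: mimics the "single result or null" contract for a result list.
class QueryResults {

    // Returns the only element, or null when the list is empty.
    // Throws IllegalStateException if there is more than one result.
    static <T> T uniqueResultOrNull(List<T> results) {
        if (results.isEmpty()) {
            return null;
        }
        if (results.size() > 1) {
            throw new IllegalStateException(
                    "Expected at most one result but got " + results.size());
        }
        return results.get(0);
    }
}
```

The trade-off is that a “no result” case now comes back as null, so the calling code has to check for it, just as it would with the Hibernate method.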

Of course, if it had thrown a checked exception my code would not have even compiled.

As it was it was just luck that the database had no results so I found the problem right away.

I’m not saying unchecked exceptions are bad; on the whole I prefer them.

But there’s a certain element of retraining your brain to no longer rely on the compiler to tell you that you’re dealing with all possible error conditions.

I know, I know, there wouldn’t be a problem if I’d read the Javadoc up front, but how many people can honestly say that they read the Javadoc for every new method the first time that they call it?

Purging jsessionids

jsessionid is the parameter that a Servlet engine adds to your site’s URL if you’ve enabled cookies in your config but the user viewing the site doesn’t have cookies enabled.

It then allows a cookie-less user to use your site and maintain their session.

It seems like a good idea but it’s a bit flawed.

The author of randomCoder has summarised the flaws quite well.

Every link on your site needs manual intervention

Cookieless sessions are achieved in Java by appending a string of the format ;jsessionid=SESSION_IDENTIFIER to the end of a URL. To do this, all links emitted by your website need to be passed through HttpServletResponse.encodeURL(), either directly or through mechanisms such as the JSTL <c:out /> tag. Failure to do this for even a single link can result in your users losing their session forever.

Using URL-encoded sessions can damage your search engine placement

To prevent abuse, search engines such as Google associate web content with a single URL, and penalize sites which have identical content reachable from multiple, unique URLs. Because a URL-encoded session is unique per visit, multiple visits by the same search engine bot will return identical content with different URLs. This is not an uncommon problem; a test search for ;jsessionid in URLs returned around 79 million search results.

It’s a security risk

Because the session identifier is included in the URL, an attacker could potentially impersonate a victim by getting the victim to follow a session-encoded URL to your site. If the victim logs in, the attacker is logged in as well – exposing any personal or confidential information the victim has access to. This can be mitigated somewhat by using short timeouts on sessions, but that tends to annoy legitimate users.

There’s one other factor for me too; public users of my site don’t require cookies – so I really don’t need jsessionids at all.

Fortunately, he also presents an excellent solution to the problem.

The solution is to create a servlet filter which will intercept calls to HttpServletResponse.encodeURL() and skip the generation of session identifiers. This will require a servlet engine that implements the Servlet API version 2.3 or later (J2EE 1.3 for you enterprise folks). Let’s start with a basic servlet filter:

He then goes on to dissect the code section by section and presents a link at the end to download it all.

So I downloaded it, reviewed it, tested it and implemented it on my site.

It works a treat!

However, I still had a problem: Google and other search engines still have lots of links to my site with jsessionid in the URL.

I wanted a clean way to remove those links from their indexes.

Obviously I can’t make Google do that directly.

But I can do it indirectly.

The trick is first to find a way to rewrite incoming URLs that contain a jsessionid to drop that part of the URL.

Then to tell the caller of the URL to not use that URL in future but to use the new one that doesn’t contain jsessionid.

Sounds complicated, but there are ways of doing both.

I achieved the first part using an Apache module called mod_rewrite.

This allows me to map an incoming URL to a different URL – it’s commonly used to provide clean URLs on Web sites.

For the second part there is a feature of the HTTP spec that allows me to indicate that a link has been permanently changed and that the caller should update their link to my site.

301 Moved Permanently

The requested resource has been assigned a new permanent URI and any future references to this resource SHOULD use one of the returned URIs. Clients with link editing capabilities ought to automatically re-link references to the Request-URI to one or more of the new references returned by the server, where possible.

So, putting these two together, I wrote the following mod_rewrite rules for Apache.


ReWriteRule ^/(\w+);jsessionid=\w+$ /$1 [L,R=301]
ReWriteRule ^/(\w+\.go);jsessionid=\w+$ /$1 [L,R=301]

The first rule says that any URLs ending in jsessionid will be rewritten without the jsessionid.

The second does the same but maps anything ending in .go – I was too lazy to work out a single pattern to do both types of URLs in one line.
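
For what it is worth, the two rules can probably be folded into a single pattern. The sketch below expresses that pattern with Java’s regex classes rather than as an Apache rule, so treat it as an illustration of the matching logic and not as a tested replacement for the rules above. JsessionidStripper and strip are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: strips a trailing ;jsessionid=... from a simple path.
class JsessionidStripper {

    // A path made of word characters, optionally ending in ".go",
    // followed by ";jsessionid=" and a run of word characters.
    private static final Pattern WITH_SESSION =
            Pattern.compile("^/(\\w+(?:\\.go)?);jsessionid=\\w+$");

    // Returns the path without the jsessionid suffix, or the path unchanged if it has none.
    static String strip(String path) {
        Matcher m = WITH_SESSION.matcher(path);
        return m.matches() ? "/" + m.group(1) : path;
    }
}
```

With this pattern, both /archive;jsessionid=ABC123 and /search.go;jsessionid=ABC123 lose their suffix, while a path with no jsessionid is returned as it was.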

And I used that all-important 301 code to persuade Google to update its index to the new link.

So, from now on – my pages will no longer output jsessionids and any incoming links that include them will have them stripped out.

In other words: jsessionids purged.

How to mentor programmers

I was reading an entry over at Raganwald where the author talks about managing projects.

He covers a list of things that one should try to do to ensure a project is a success.

One of his main points is that a tech lead should always know exactly what everyone is doing on a daily basis.

Whenever I’ve allowed the details of a project to escape me, I’ve failed. […] And I can tell you, whenever the details of a project have slipped from my grasp, the project has started to drift into trouble.

I’ve been the tech lead on many projects over the last 6 years or so but I’ve always stopped short of asking people what they are doing on a daily basis.

It has always struck me as a form of “micro-managing”, which is something that I’ve hated when I’ve been on the receiving end of it.

I should clarify though; I always know who is working in what area on a day to day basis (Jim is working on the email module for these two weeks), but I don’t necessarily know what specific task they are trying to achieve on a particular day (I don’t know if Jim is writing display code today, or back-end logic).

However, after reflecting on how this has worked on my projects, I have to conclude that my approach was wrong.

I should know what people are doing – I just need to find a balance between knowing what they are doing and getting on their nerves.

Clearly a balance can be found.

I make no apologies for now insisting on knowing exactly who, what, where, when, and why. There’s a big difference between being asked to explain your work in detail and being told how to do your job.

I’m not sure of the best way to handle getting to that level of detail though.

Daily meetings where everyone reports progress?

I find these to be a bit of a waste of time (especially in large teams) where you talk for 2 minutes then sit and listen to everyone else for 20 minutes.

Walking around and sitting down next to each person in turn (sort of like a Doctor doing his rounds)?

This is better for the team as they are only interrupted when I am talking to them.

I’ve done this before but never in a “tell me what code you are writing now” way.

I still think this might annoy me if I was on the receiving end of this.

Another way?

What about other tech lead people reading this, what works for you?

Or, if you’re on the receiving end of this, where exactly does that all-important line sit?