My plan for spam – one month on

A while back I wrote about “my plan for spam”.

“My plan” has been running for a month now so it’s time to review it.

Before I implemented it I was getting around 600 spams a month.

After running it for a full month I’m down to around 250.

So, I’d consider that a fairly successful plan.

Of course, all the ones that were caught in my spam filter were being sent to my home account and put into a spam folder.

I said I’d monitor that and if there were no false positives I’d set it to delete them upon arrival.

There have been no false positives, so I will be setting it to delete (I need to test properly before I go putting delete rules into my filters).

As for getting that 250 even lower – I’m kinda stuck.

Around 80% of those 250 go directly to my gmail address so they don’t touch my filtering system.

I don’t really use my gmail address so I could set gmail to delete anything sent to that address.

The thought of doing that scares me a bit – I’ll wait and see how annoyed I get by it all I think.

Of course, no doubt in 6 months time I’ll be back up to 600 a month again, but what can you do?

Update:

I found out that if I tell gmail to delete an email in a filter rule it puts it into the Deleted Items folder which is automatically cleaned out after being in there for 30 days.

So, with that in mind I’ve set it to delete any emails that are addressed to my gmail address.

That way I have 30 days to find a real email if I have reason to believe it was sent to my gmail address.

Since doing that I’ve received an average of just over one spam a day!

That’s going to be around 40 a month.

Eat that, spammers!

Purging jsessionids

jsessionid is the parameter that a Servlet engine appends to your site’s URLs when sessions are enabled in your config but the user viewing the site doesn’t have cookies enabled.

It then allows a cookie-less user to use your site and maintain their session.

It seems like a good idea but it’s a bit flawed.

The author of randomCoder has summarised the flaws quite well.

Every link on your site needs manual intervention

Cookieless sessions are achieved in Java by appending a string of the format ;jsessionid=SESSION_IDENTIFIER to the end of a URL. To do this, all links emitted by your website need to be passed through HttpServletResponse.encodeURL(), either directly or through mechanisms such as the JSTL <c:out /> tag. Failure to do this for even a single link can result in your users losing their session forever.

Using URL-encoded sessions can damage your search engine placement

To prevent abuse, search engines such as Google associate web content with a single URL, and penalize sites which have identical content reachable from multiple, unique URLs. Because a URL-encoded session is unique per visit, multiple visits by the same search engine bot will return identical content with different URLs. This is not an uncommon problem; a test search for ;jsessionid in URLs returned around 79 million search results.

It’s a security risk

Because the session identifier is included in the URL, an attacker could potentially impersonate a victim by getting the victim to follow a session-encoded URL to your site. If the victim logs in, the attacker is logged in as well – exposing any personal or confidential information the victim has access to. This can be mitigated somewhat by using short timeouts on sessions, but that tends to annoy legitimate users.

There’s one other factor for me too: public users of my site don’t require cookies, so I really don’t need jsessionids at all.

Fortunately, he also presents an excellent solution to the problem.

The solution is to create a servlet filter which will intercept calls to HttpServletResponse.encodeURL() and skip the generation of session identifiers. This will require a servlet engine that implements the Servlet API version 2.3 or later (J2EE 1.3 for you enterprise folks). Let’s start with a basic servlet filter:
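The core trick of such a filter is a response wrapper whose encodeURL() hands the URL back untouched instead of delegating to the container. Roughly, the idea looks like this – sketched in Python with made-up class names purely for illustration, since the real thing is Java servlet code:

```python
class SessionEncodingResponse:
    """Stand-in for a servlet response that URL-encodes the session."""
    def encode_url(self, url):
        return url + ";jsessionid=ABC123"

class SessionStrippingResponse:
    """Wrapper that skips session-id generation – the filter's whole job."""
    def __init__(self, wrapped):
        self._wrapped = wrapped

    def encode_url(self, url):
        # Hand the URL back untouched instead of delegating.
        return url

resp = SessionStrippingResponse(SessionEncodingResponse())
print(resp.encode_url("/home"))  # /home – no jsessionid appended
```

The filter simply passes this wrapper down the chain in place of the real response, so every link on the site stops growing a session identifier.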

He then goes on to dissect the code section by section and presents a link at the end to download it all.

So I downloaded it, reviewed it, tested it and implemented it on my site.

It works a treat!

However, I still had a problem; Google and other engines still have lots of links to my site with jsessionid in the URL.

I wanted a clean way to remove those links from its index.

Obviously I can’t make Google do that directly.

But I can do it indirectly.

The trick is first to find a way to rewrite incoming URLs that contain a jsessionid to drop that part of the URL.

Then to tell the caller of the URL to not use that URL in future but to use the new one that doesn’t contain jsessionid.

Sounds complicated, but there are ways of doing both.

I achieved the first part using a module called mod_rewrite.

This allows me to map an incoming URL to a different URL – it’s commonly used to provide clean URLs on Web sites.

For the second part there is a feature of the HTTP spec that allows me to indicate that a link has been permanently changed and that the caller should update their link to my site.

301 Moved Permanently

The requested resource has been assigned a new permanent URI and any future references to this resource SHOULD use one of the returned URIs. Clients with link editing capabilities ought to automatically re-link references to the Request-URI to one or more of the new references returned by the server, where possible.

So, putting these two together, I wrote the following mod rewrite rules for Apache.


RewriteRule ^/(\w+);jsessionid=\w+$ /$1 [L,R=301]
RewriteRule ^/(\w+\.go);jsessionid=\w+$ /$1 [L,R=301]

The first rule says that any URLs ending in jsessionid will be rewritten without the jsessionid.

The second does the same but maps anything ending in .go – I was too lazy to work out a single pattern to do both types of URLs in one line.
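As a quick sanity check, the same stripping can be sketched in Python – purely illustrative, Apache’s mod_rewrite is doing the real work, but it shows what the two patterns match (here folded into one with an optional .go):

```python
import re

# One pattern covering both rules: a single path segment, optionally
# ending in ".go", followed by ";jsessionid=<something>".
JSESSIONID = re.compile(r"^/(\w+(?:\.go)?);jsessionid=\w+$")

def strip_jsessionid(path):
    """Return the path without its jsessionid, or unchanged if none."""
    m = JSESSIONID.match(path)
    return "/" + m.group(1) if m else path

print(strip_jsessionid("/home;jsessionid=ABC123"))     # /home
print(strip_jsessionid("/home.go;jsessionid=ABC123"))  # /home.go
print(strip_jsessionid("/home"))                       # /home (untouched)
```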

And I used that all-important 301 code to persuade Google to update its index to the new link.

So, from now on – my pages will no longer output jsessionids and any incoming links that include them will have them stripped out.

In other words; jsessionids purged.

Me and my Yashica

I’ve just bought this camera on Ebay; it arrived yesterday.

It’s a Yashica Mat-124G and shoots 6×6 medium format film.

So, today I wandered around Covent Garden at lunchtime to take a few shots.

Wow!

What a world of difference from using a digital SLR.

Having to check and set exposure for each shot and wind the film on kept throwing me (predictably).

I’m shooting colour negative film which has decent exposure latitude so hopefully it won’t matter too much if I under or over expose.

Manually focusing the camera also made a change, but it was so easy with such a massive bright viewfinder – I manually focus with my SLR too sometimes but with the tiny dark viewfinder I often struggle to see if my subject is sharp enough.

It’s so much easier with this camera.

Those are all big changes sure, but the biggest change for me was the way that I held the camera and shot with it.

With my SLR I hold the camera to my eye and use the viewfinder to shoot.

The Yashica though is on a strap around my neck and I look down at the viewfinder.

It may not seem like a big deal but it’s a crucial distinction for me.

I enjoy street photography but often feel very self-conscious when I raise my camera to my eye to shoot a street scene – I guess I’m paranoid that someone I’m shooting will take offence.

But when shooting with the Yashica there’s no obvious sign that I’m taking a photo as you don’t move the camera at all.

It makes all the difference for me.

I know I’m probably making myself sound like some sort of sneak who takes pictures of people without their permission – and it’s true that I don’t ask permission when taking street photos.

But I’m not exactly invading privacy: my photos tend to be from a respectable distance and often you can’t make out people’s faces.

As I walked around Covent Garden today with the camera around my neck I got a few curious looks from people who were close enough to me to see what I was carrying, but the vast majority of people didn’t look twice at me.

At one point I stood in the middle of the crowded central market and took a photo of some people browsing a market stall. I couldn’t really imagine myself doing that with my SLR – hard to define why, just natural shyness I guess.

As it was, nobody batted an eyelid.

So, I managed to shoot about half a roll of film (around 6 shots) – I need to go out again on the weekend and shoot the rest of the roll, then get it developed and scanned.

Hopefully the camera works fine and I didn’t mess up too much and I’ll get some half-decent shots.

Hopefully.

Errrr…

I’ve just been asked to track time for Christmas Day and Boxing Day.

Just as well I keep detailed notes about this sort of thing.

My plan for spam

Every now and again I get sufficiently annoyed by spam to want to do something about it.

Today was one of those days where I had enough time to sit down and work on my plan for dealing with spam.

My email setup is a little unusual so it became a bit complicated.

All my incoming email is processed on my mail server; then one copy is forwarded to gmail, the other to my computer at home.

I then read all my email via gmail, and keep the copy at home just for backup purposes.

I’ve been doing this for a while, but found the manual task of scanning through spam in gmail to be tiresome and annoying.

What I wanted to do was cut out a lot of the spam on my server, before it was even passed on to gmail.

I figured I could script a better spam-filtering solution than gmail’s system.

My plan was to install bogofilter on the mail server and use that to filter out a lot of the spam.

There were two problems with this, one was that I needed a body of spam and ham emails to train it with and I didn’t keep email on the server.

The other was that it would need constant training as new emails came in; this was tricky because emails were passed on to gmail and my home computer in an automated process with no scope for human intervention.

To deal with the first problem I decided to install bogofilter at home too and train it there, then upload the training database to the mail server.

For the second problem I came up with the following solution:

I would use bogofilter on the mail server; send anything flagged as non-spam on to gmail and home; and send anything classed as spam just to home.

Once it got to home, it would be passed through bogofilter a second time; this instance would be set up slightly differently to the first one, classifying emails into one of three folders: ham, spam or unsure.

I would then use mutt to periodically re-train bogofilter telling it that anything in the “unsure” folder was either ham or spam.

Finally, the newly trained database would be copied back up to the mail server each night.
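bogofilter’s classification is naturally three-way – it compares a spamicity score against two configurable cutoffs (spam_cutoff and ham_cutoff in bogofilter.cf) – so the home-side routing boils down to something like this sketch. The cutoff values here are made up for illustration; the real ones live in the config file:

```python
# Hypothetical cutoffs; the real values come from bogofilter.cf.
SPAM_CUTOFF = 0.99
HAM_CUTOFF = 0.45

def route(spamicity):
    """Pick a destination folder from bogofilter's score in [0, 1]."""
    if spamicity >= SPAM_CUTOFF:
        return "spam"
    if spamicity <= HAM_CUTOFF:
        return "ham"
    return "unsure"  # queued for manual re-training via mutt
```

Anything landing in “unsure” is exactly what I’d be reviewing in mutt to feed back into training.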

The more astute reader may have noticed a problem with this solution.

I seem to have replaced scanning a folder full of spam on gmail with scanning a folder full of spam at home.

This is true; initially I will be dealing with the same amount of spam.

However, I have a longer term plan here.

Once I’m happy with the filtering I’m going to tweak my solution so that anything tagged as spam will be deleted outright once it hits my home PC.

This I expect will reduce the amount of spam that I see by about 90%.

I am comfortable with the fact that I will probably lose the occasional non-spam email.

I’m gonna run this system for about a month; if I get through that with zero false positives I’ll feel brave enough to set it to delete.

Reclaiming ext3 disk space

A while back I bought an external hard drive for backing up my flacs and my photos.

Recently it started to fill up.

This was, of course, a bad thing.

I looked at how much space I had left on it:


Filesystem Size Used Avail Use% Mounted on
/dev/external 276G 259G 2.5G 100% /mnt/external

So, I had 2.5GB free. But hold on, the maths doesn’t make sense.

I have a 276GB drive, with 2.5GB free, yet I’ve only used 259GB.

I’m missing 14.5GB!

I did some googling and found out the following:

The most likely reason is reserved space. When an ext2/3 filesystem is formatted, by default 5% is reserved for root. Reserved space is supposed to reduce fragmentation and allow root to log in in case the filesystem becomes 100% used. You can use tune2fs to reduce the amount of reserved space.

So, ext2/3 reserves 5% of space, which on my drive is 13.8GB – well, that’s close enough to 14.5GB, so that explains that mystery.
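A quick back-of-the-envelope check of those numbers (using the sizes as df reports them, so everything is approximate):

```python
size_gb = 276.0

reserved_default = size_gb * 0.05  # ext2/3 default: 5% reserved for root
reserved_reduced = size_gb * 0.01  # after tune2fs -m 1

print(round(reserved_default, 1))                     # 13.8 – the "missing" GB
print(round(reserved_default - reserved_reduced, 1))  # 11.0 – GB reclaimed
```

Which lines up nicely with the ~11.5GB that df later showed reappearing.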

The next question was: can I, and should I, reduce that amount of reserved space?

More googling:

The reserved blocks are there for root’s use. The reason being that the system gets really narky if you completely run out of room on / (specifically /var, or /tmp, I think). Programs won’t start, weird errors will pop up, that sort of thing. With some room reserved for root, you can at least be sure to be able to run the really important programs, like sudo and rm.

So, in short, if the drive doesn’t contain /var or /tmp, then there’s not much point in having space reserved for root.

So, some poster on some Internet forum says it’s probably OK to do away with that reserved space.

That’s usually good enough for me, but I figured this time I’ll play it safe and reduce it to 1%.

So I unmounted the drive and ran the following command: tune2fs -m 1 /dev/external

I re-mounted and voila, 11.5GB appeared out of nowhere!


Filesystem Size Used Avail Use% Mounted on
/dev/external 276G 259G 14G 95% /mnt/external

I’ve now run this on my other non-booting partitions.

All seems fine so far.

I’ll leave my system partition at 5% I think though, just to be safe.

How to mentor programmers

I was reading an entry over at Raganwald where the author talks about managing projects.

He covers a list of things that one should try to do to ensure a project is a success.

One of his main points is that a tech lead should always know exactly what everyone is doing on a daily basis.

Whenever I’ve allowed the details of a project to escape me, I’ve failed. […] And I can tell you, whenever the details of a project have slipped from my grasp, the project has started to drift into trouble.

I’ve been the tech lead on many projects over the last 6 years or so but I’ve always stopped short of asking people what they are doing on a daily basis.

It has always struck me as a form of “micro-managing”, which is something that I’ve hated when I’ve been on the receiving end of it.

I should clarify though; I always know who is working in what area on a day to day basis (Jim is working on the email module for these two weeks), but I don’t necessarily know what specific task they are trying to achieve on a particular day (I don’t know if Jim is writing display code today, or back-end logic).

However, after reflecting on how this has worked on my projects, I have to conclude that my approach was wrong.

I should know what people are doing – I just need to find a balance between knowing what they are doing and getting on their nerves.

Clearly a balance can be found.

I make no apologies for now insisting on knowing exactly who, what, where, when, and why. There’s a big difference between being asked to explain your work in detail and being told how to do your job.

I’m not sure of the best way to handle getting to that level of detail though.

Daily meetings where everyone reports progress?

I find these to be a bit of a waste of time (especially in large teams) where you talk for 2 minutes then sit and listen to everyone else for 20 minutes.

Walking around and sitting down next to each person in turn (sort of like a doctor doing his rounds)?

This is better for the team as they are only interrupted when I am talking to them.

I’ve done this before but never in a “tell me what code you are writing now” way.

I still think this might annoy me if I was on the receiving end of this.

Another way?

What about other tech lead people reading this, what works for you?

Or, if you’re on the receiving end of this, where exactly does that all-important line sit?

Do algorithms still matter?

I attended the BCS Mini-SPA event a few months back.

The premise of the mini SPA is as follows:

If you attended SPA2006 you might find that the miniSPA2006 programme allows you to catch up with sessions you didn’t select at the event.



The annual SPA conference (formerly known as OT) is where IT practitioners gather to reflect, exchange ideas and learn.

It also served as a convenient advert for next year’s full SPA event.

It was also free, had a free lunch and got me out of the office for a day, so it pretty much fulfilled all my criteria.

The structure of the day was 6 sessions, divided into two parallel streams.

I attended “Distributed workforces”, “Modelling with Views” and “A Good Read”; it was the last one that really interested me.

This was a panel of five people who had each proposed a book to discuss. Each member of the panel then read each book so they could discuss it and give their views and insights.

The really interesting part for me was that someone proposed Programming Pearls by Jon Bentley.

I’ve owned a copy of this book for years but have yet to finish it, (it’s back on my “to read” list now though).

Everybody roundly praised the book but one of the members of the panel questioned whether we needed to know that level of detail when it comes to coding efficient algorithms – “wouldn’t it be simpler to throw more CPU and RAM at a problem?” they said.

Someone in the audience then countered that algorithm efficiency was relevant once again when programming Web-apps. They said something along the lines of “Wait until 100 people hit that page on your site”.

Sadly the session ran out of time at that point so no conclusion was reached.

My own belief is that you do need to know code at that level, especially if you write Web sites or other similar client/server apps with many concurrent client requests.

I’m not saying everyone should know the Quicksort algorithm inside out, but if you program in Java (for example) you should know the difference between a Vector, an ArrayList and a plain old array and when to use each.
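Those Java classes are one instance of a universal point: pick the structure that matches your access pattern. A quick illustration of the same idea in Python (an analogy of my choosing, not something from the session) – membership tests scan a list element by element, but hash straight into a set:

```python
import timeit

n = 10_000
as_list = list(range(n))
as_set = set(as_list)

# Look for an element near the "end" of the data, 1,000 times each way.
list_time = timeit.timeit(lambda: (n - 1) in as_list, number=1000)
set_time = timeit.timeit(lambda: (n - 1) in as_set, number=1000)

print(set_time < list_time)  # the hash lookup wins comfortably
```

Nothing about the code looks slow at a glance – which is exactly the trap when a few dozen concurrent requests hit it.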

I have had personal experience of a badly written for loop bringing down a Web site on launch day.

The for loop in itself wasn’t the worst code ever written by any means, but it was probably executed 30 to 40 times per individual home page hit.

Multiply that by a few dozen concurrent hits (it was a busy site) and any flaws in that code were mercilessly exposed.

Embarrassingly for me, it was my code. Oops.

Ever since that day I’ve been unable to forget that no amount of “CPU and RAM” (and we had a lot) will help if you don’t get your algorithms right in the first place.

Sporadic service will resume

The server that hosts this site is dying.

It’s due to be replaced by a brand new server with lots of RAM – once I’ve finished building it and can get it installed.

So until then, the site may be up and down a lot.

Once it’s fixed, sporadic service will resume.

Update: A reboot seems to have fixed it; no random restarts since.

Need a new Programming Language

I need to learn a new Programming Language.

This is for two reasons.

In my time as a programmer I’ve learned and used: Basic, Ada, C, C++, VB, Perl and Java.

So that’s 7 (5 if you merge Basic with VB and C with C++).

It’s a reasonable amount, if a little on the small side.

But that list is only half the truth; most of those languages I’ve not touched in years, some I’m definitely never going to touch again (Ada!).

The only ones I use in any form now are Java and Perl.

I use Java in my day job and to write things like this site, and I use Perl for the odd scripting task.

My first reason for needing a new language is a pragmatic one. I need to learn a new scripting language.

I need a new scripting language because every time I go to do something in Perl I find I have forgotten how to do one of:

a. list the files in a directory.

b. pass an array to a function.

c. iterate over an array.

d. all of the above.

This is because I find Perl’s syntax to be on the whole inconsistent and unintuitive.

So, I’ve had enough of Perl’s kooky ways and would like to learn something a little bit more “sane” (definition: consistent and intuitive syntax).
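To show what “sane” looks like, here’s how those three stumbling blocks come out in Python (one of the candidates below) – a throwaway temp directory stands in for a real one:

```python
import os
import tempfile

def total_name_length(names):
    """b) a function that takes a list..."""
    total = 0
    for name in names:  # c) ...and iterates over it
        total += len(name)
    return total

with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "a.txt"), "w").close()
    open(os.path.join(d, "b.txt"), "w").close()

    files = sorted(os.listdir(d))    # a) list the files in a directory
    print(files)                     # ['a.txt', 'b.txt']
    print(total_name_length(files))  # 10
```

No sigils, no list-flattening surprises – each of the three tasks reads the way you’d guess it does.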

My second reason goes a little deeper.

I’ve been reading a few articles and blogs of late that in some way or another point out some problems with Java.

A Quick Tour of Ruby

Java doesn’t provide a utility method for opening a text file, reading or processing its lines, and closing the file. This is something you do pretty frequently.

— Steve Yegge

Can Your Programming Language Do This?

Java required you to create a whole object with a single method called a functor if you wanted to treat a function like a first class object.

— Joel Spolsky

What was interesting was that once I was over my initial denial of such heresy, I found myself mostly agreeing with what they had said.

The surprising part for me was that I had not consciously noticed these things myself – even though I now realise such things had annoyed me at the time.

The reason that they had not bubbled up to the level of consciousness was that I could not see beyond the Java language itself.

Something was awkward to do in Java (ever tried reading a file?) – well, that’s just the way Java is.

I couldn’t question it, because I was so deeply ingrained in the ways of Java, I could see no alternatives.

This worried me somewhat, what other concepts and ideas was I ignorant of due to my Java mindset?

Sometimes you need to take a step back and get a fresh perspective on things.

And what better way than to learn a new programming language.

I’m a busy guy though.

I simply can’t afford to take two weeks off just to learn a new language.

So, to be pragmatic (I’m a pragmatic guy too) I’m going to try to solve both of these problems with a single language.

So, I want a general purpose language that’s also good for scripting work.

My shortlist of languages is not long:

Python.

I’ve dabbled with Python.

It’s fun, quick, easy etc.

I’ve not done enough to know if it’s “sane” as defined above, it doesn’t seem as freaky as Perl though.

Ruby.

Ummm, everyone’s talking about it.

A friend of mine is learning it and he’s not swearing about it too much yet.

Apparently it’s mostly “sane”.

I’ve not completely decided yet, I’m leaning towards Ruby at the moment mind.

Anyone care to convince me either way, or suggest other languages I should be looking at?