Catch Up

I've not posted for ages, so here is a summary of a bunch of stuff I've been looking at for fun.

Machine Learning

First up, after finishing the MIT Introduction to Algorithms lectures, I was excited to hear about Stanford's free computer science courses. They are full, taught and (machine) assessed university modules for free! I'm studying Machine Learning and I am really impressed with the quality of the teaching. Thanks Stanford.

There is of course speculation that this is a trial for a new paid remote service. To be honest I feel the quality of the course I've done would be worth paying for if they could find a way to accredit it as a real qualification without proper human assessment.

C++ Experiments

Following on from my experiments with LevelDB, I have played around with creating a C++ gossip implementation based on Cassandra's, using ZeroMQ. I spent a lot of time getting a really basic grasp of the intricacies of threading vs event-driven style, message passing etc. I ended up with multiple processes on the same machine (on different ports) gossiping and effectively sharing cluster state. I didn't get around to implementing the full phi accrual failure detection for inferring whether a machine is up or down, and I'm sure the code would need to be torn apart and rewritten for anything resembling real use, but it was a good learning exercise.
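For reference, the core of a phi accrual detector is small. The sketch below is not the code from these experiments (that part was never written); it is a minimal illustration that assumes heartbeat inter-arrival times follow an exponential distribution, whereas the original phi accrual paper models them with a normal distribution. The class name `PhiAccrualDetector` and its methods are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <deque>
#include <numeric>

// Hypothetical helper, not code from the experiments described above.
// Tracks heartbeat arrival times for one peer and reports a suspicion
// level phi, ASSUMING exponentially distributed inter-arrival times.
class PhiAccrualDetector {
public:
    explicit PhiAccrualDetector(std::size_t window = 1000) : window_(window) {
        assert(window > 0);
    }

    // Record a heartbeat that arrived at time now_ms.
    void heartbeat(double now_ms) {
        if (has_last_) {
            intervals_.push_back(now_ms - last_ms_);
            // Keep only the most recent `window_` intervals.
            if (intervals_.size() > window_) intervals_.pop_front();
        }
        last_ms_ = now_ms;
        has_last_ = true;
    }

    // phi = -log10(P(silence lasts this long)) with P = exp(-t / mean),
    // which simplifies to t / (mean * ln 10). Returns 0 until at least
    // two heartbeats have been seen.
    double phi(double now_ms) const {
        if (intervals_.empty()) return 0.0;
        double sum = std::accumulate(intervals_.begin(), intervals_.end(), 0.0);
        double mean = sum / static_cast<double>(intervals_.size());
        if (mean <= 0.0) return 0.0;
        return (now_ms - last_ms_) / (mean * std::log(10.0));
    }

private:
    std::size_t window_;
    std::deque<double> intervals_;
    double last_ms_ = 0.0;
    bool has_last_ = false;
};
```

A gossiper would call `heartbeat()` whenever it hears about a peer and treat the peer as down once `phi()` rises above a chosen threshold. The value grows continuously with the length of the silence, so the threshold trades detection speed against false positives.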

I've now moved on to fiddling about with on-disk data structures. So far I'm mostly just learning. I've read through the specs for SQLite's db file and some articles on CouchDB's copy-on-write B-tree (not to mention LevelDB/Cassandra's LSM trees). I've also read Acunu's paper on Stratified B-Trees, which is all really interesting stuff. I'm not quite sure what I want to implement now, but I may start with trying to get a basic block and free-list allocator working. Just the experience of actually working with C++ and "real" algorithms is fascinating for me, a lowly PHP developer.
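To make the block and free-list idea concrete, here is a minimal sketch under some loud assumptions: a `std::vector<char>` stands in for the database file, blocks are fixed-size, and each freed block stores the id of the next free block in its first four bytes. The names (`BlockStore`, `allocate`, `release`) are hypothetical, and a real on-disk version would also need to persist the free-list head in a file header.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical sketch: a std::vector<char> stands in for the database file.
class BlockStore {
public:
    static constexpr std::uint32_t kNone = 0xFFFFFFFFu;  // "no block" marker

    explicit BlockStore(std::uint32_t block_size) : block_size_(block_size) {
        // Each free block must be able to hold the link to the next one.
        assert(block_size >= sizeof(std::uint32_t));
    }

    // Reuse the most recently freed block if there is one, else grow the file.
    std::uint32_t allocate() {
        if (free_head_ != kNone) {
            std::uint32_t id = free_head_;
            // The freed block's first bytes hold the next free block id.
            std::memcpy(&free_head_, block(id), sizeof free_head_);
            return id;
        }
        std::uint32_t id = block_count();
        data_.resize(data_.size() + block_size_, 0);
        return id;
    }

    // Push a block onto the free list. Double frees are NOT detected.
    void release(std::uint32_t id) {
        assert(id < block_count());
        std::memcpy(block(id), &free_head_, sizeof free_head_);
        free_head_ = id;
    }

    // Pointer to the start of block `id`; invalidated when the store grows.
    char* block(std::uint32_t id) {
        return data_.data() + static_cast<std::size_t>(id) * block_size_;
    }

    std::uint32_t block_count() const {
        return static_cast<std::uint32_t>(data_.size() / block_size_);
    }

private:
    std::uint32_t block_size_;
    std::uint32_t free_head_ = kNone;
    std::vector<char> data_;
};
```

Because the links live inside the free blocks themselves, the free list costs no extra space, and both `allocate()` and `release()` touch only one block.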

In summary then, I'm still doing loads of geeky computer stuff, just forgetting to write about any of it.

9th November 2011 01:47 PM
C++, Machine Learning

LevelDB Fun

Google recently open-sourced LevelDB which is "a fast key-value storage library". I've used it as an excuse to play about in C++.

There is nothing new or exciting to report to the tech world here - just that I've enjoyed playing about in a language I've not worked much with in the past.

So far I have hooked up libevent to LevelDB and made my own little Key-Value database server that can accept multiple clients.

I've also written a C++ client library to talk to it and made up my own ASCII-based data transfer format.

None of this is useful to anyone other than me, but it's great to actually play around with a language like this and to get a feel for it. Much more productive than code tutorials or algorithm exercises.

17th September 2011 08:50 PM
C, C++, LevelDB

More Efficient PHP Arrays

One of my first posts here was about how surprisingly inefficient PHP arrays can get. Today I learned of a solution that is probably a lot better than my PHP string serialisation. It's an extension called intarray.

The extension exposes integer-only arrays as strings to PHP but provides several useful methods for interacting with them such as sort, slice and binary search. This means if you are using PHP arrays to store sets of integers, you will likely see a very large improvement in speed and memory usage using this extension.
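The general idea of keeping integers in a packed string and searching it in place can be sketched in a few lines. This is only an illustration of the technique, written in C++ rather than PHP; the function names are hypothetical and are not the intarray extension's API, and the 4-byte host-order layout is an assumption of the sketch rather than a statement about how the extension stores its data.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helpers; these are NOT the intarray extension's functions.

// Pack integers into a byte string, 4 bytes each, in host byte order.
std::string pack_ints(const std::vector<std::int32_t>& values) {
    std::string out(values.size() * sizeof(std::int32_t), '\0');
    if (!values.empty()) std::memcpy(&out[0], values.data(), out.size());
    return out;
}

// Read the i-th integer back out of the packed string.
std::int32_t int_at(const std::string& packed, std::size_t i) {
    assert((i + 1) * sizeof(std::int32_t) <= packed.size());
    std::int32_t v;
    std::memcpy(&v, packed.data() + i * sizeof v, sizeof v);
    return v;
}

// Binary search over a packed string whose integers are sorted ascending.
// Returns the index of `needle`, or -1 if it is absent.
long find_int(const std::string& packed, std::int32_t needle) {
    std::size_t lo = 0;
    std::size_t hi = packed.size() / sizeof(std::int32_t);
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        std::int32_t v = int_at(packed, mid);
        if (v == needle) return static_cast<long>(mid);
        if (v < needle) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return -1;
}
```

Each element costs exactly four bytes here, with no per-element hash table entry, which is where the memory saving over a general-purpose array comes from.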

I've yet to do any real benchmarking but I thought I'd post this as a follow-up to my original post. I know of at least one very large site that has used this extension in production with no issues, although I obviously urge anyone to evaluate the stability etc. of any software themselves before deploying.

24th August 2011 10:25 AM
PHP, intarray

PHPUnit's Expensive SetUp

I've been working a lot with PHPUnit 3.5 recently. It's good in many ways but it is not fast. That's understandable perhaps given the feature set but there is one apparently obvious oversight which totally ruins the experience.

I've reported the problem as a possible bug, yet it has had zero attention in over two months. I'll describe it again here.

The Problem

PHPUnit has a whole multitude of ways to construct a test suite and pick which tests to run. Using the command line runner, you can name specific test case files or directories and you can use the filter and group options to further restrict what runs.

The problem is that, whatever you pass as filter or group arguments, all setUp() and setUpBeforeClass() methods in all test cases loaded will be run. That's because filtering is applied after the setup methods have been called. I really don't see the rationale behind that decision.

At work we have a large test suite. One part of it is for our database layer and as such has some very expensive setup routines which set up an entire test environment in our test db. Even when you limit the runner to a specific test case, that may mean this very expensive setup operation has to run for every test in the file - even when you are just trying to work with a single test method.

But we shouldn't have to fiddle about with specifying specific test cases. The filter and group options are powerful and (should be) very useful for cherry picking from a suite. This seemingly obvious error totally ruins them and makes working with big test suites decidedly awkward.

Even more confusing is the fact that no-one else I've seen online seems to think this is a problem. I've found no other mention of the behaviour and zero interest in my ticket. Did I miss something here? Is there an obvious reason that setup should be run all the time even when filtering tests? All my colleagues and other PHP developers I've mentioned this to personally seem to agree it is very odd behaviour. I'd have expected many people to be using PHPUnit with large suites. Does no one else wonder why running a single simple test can take minutes?

I hope I can update this post when something changes, but I've not been encouraged by the response to my ticket so far.

9th August 2011 01:45 PM
PHPUnit, PHP, Testing