CodeMash 2008

Posted on January 17, 2008 1:49 pm

CodeMash 2008 was my first conference since coming to work for SitePen, and my first opportunity to talk about Dojo. I had two talks: an introduction to Dojo (i.e. “how much Dojo can I cram into an hour”) and a talk about Dojo Offline and Google Gears. I love talking about cool technology, and CodeMash is great fun, so this was a treat.

I’d like to highlight a couple of topics that came up at CodeMash. Over on my blog I have posted more general comments about CodeMash 2008 with a bit less detail on these topics.

Concurrency

The best talk I attended, not counting open spaces (participant-driven discussions), was Brian Goetz’s talk about concurrency. Brian works for Sun and knows his concurrency (he’s a co-author of “Java Concurrency in Practice”). Concurrency is becoming an increasingly hot topic, as it’s common now for machines to have two or more CPU cores. He used Java as a backdrop and had a great example of how threads are hard. What made the example perfect was its simplicity: an object representing a bank account that had only three methods. Working from memory, the example was something like this:

public class BankAccount {
    private int balance;
    
    public int getBalance() {
        return balance;
    }
    
    public void debit(int amount) {
        balance -= amount;
    }
    
    public void credit(int amount) {
        balance += amount;
    }
}

What can go wrong? Quite a bit, it turns out, once you go multithreaded. Consider this method:

public static void transfer(BankAccount acct1, 
                            BankAccount acct2, 
                            int amount) 
                   throws LowBalanceError {
    if (acct1.getBalance() < amount) {
        throw new LowBalanceError();
    }
    acct1.debit(amount);
    acct2.credit(amount);
}

If two threads try to transfer money out of acct1 at the same time, it’s possible that the first debit will happen after the balance check in the second transfer. Imagine a sequence like this:

  • Thread 1: transfer $500 from acct1 to acct2
  • Thread 2: transfer $750 from acct1 to acct2
  • Thread 1 checks balance ($1000)
  • Thread 2 checks balance ($1000)
  • Thread 1 debits acct1 $500
  • Thread 2 debits acct1 $750 (oops: the balance is now -$250)

acct1 has now gone negative, whereas the correct behavior is to throw an exception. Without completely rehashing Brian’s example: the obvious solution is to synchronize access to the accounts, since Java has the “synchronized” keyword built in. But now you’ve traded your data consistency problem for another one: as soon as a transfer has to hold locks on two accounts at once, two transfers that take those locks in opposite orders can deadlock.
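One common way around that deadlock is to always acquire the two locks in the same global order. The sketch below is an illustration rather than Brian’s code: OrderedAccount, its id field, TransferDemo, and the use of IllegalStateException in place of LowBalanceError are all assumptions made to keep the example self-contained.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical variant of the account above: each account gets a unique id
// so that locks can always be acquired in the same global order.
class OrderedAccount {
    private static final AtomicLong NEXT_ID = new AtomicLong();

    final long id = NEXT_ID.incrementAndGet();
    private int balance;

    OrderedAccount(int openingBalance) {
        this.balance = openingBalance;
    }

    synchronized int getBalance() {
        return balance;
    }

    // The balance check and both updates happen while holding both locks,
    // so no other transfer can slip in between the check and the debit.
    static void transfer(OrderedAccount from, OrderedAccount to, int amount) {
        // Lock the account with the lower id first, whatever the direction
        // of the transfer, so two transfers can never wait on each other.
        OrderedAccount first = from.id < to.id ? from : to;
        OrderedAccount second = (first == from) ? to : from;
        synchronized (first) {
            synchronized (second) {
                if (from.balance < amount) {
                    throw new IllegalStateException("low balance");
                }
                from.balance -= amount;
                to.balance += amount;
            }
        }
    }
}

class TransferDemo {
    public static void main(String[] args) throws InterruptedException {
        OrderedAccount acct1 = new OrderedAccount(1000);
        OrderedAccount acct2 = new OrderedAccount(0);
        Thread t1 = new Thread(() -> attempt(acct1, acct2, 500));
        Thread t2 = new Thread(() -> attempt(acct1, acct2, 750));
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        // Whichever transfer ran second was rejected, so acct1 is never negative.
        System.out.println(acct1.getBalance() + " / " + acct2.getBalance());
    }

    private static void attempt(OrderedAccount from, OrderedAccount to, int amount) {
        try {
            OrderedAccount.transfer(from, to, amount);
        } catch (IllegalStateException e) {
            System.out.println("rejected transfer of " + amount);
        }
    }
}
```

Lock ordering only helps if every piece of code that touches two accounts follows the same rule, which is part of why shared state stays hard even when the individual fix looks simple.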

Fairly quickly, you come to the conclusion that shared state is hard to handle well concurrently. This is part of the reason that Python’s creator, Guido van Rossum, has favored using multiple processes for concurrency. For people who are used to Java, using multiple processes might seem annoying and painful. Take a look at some Parallel Python examples, and I think you’ll find that it’s not that difficult… it is less difficult and more reliable than dealing with messed up data and deadlocks. As a bonus, Parallel Python lets you move easily from one machine to many.

Some promising approaches to concurrency mentioned by Brian:

  1. Software transactional memory. It sounds like a silver bullet to people who like the threaded model, but it’s still very much a research project. The idea is that you can declare that an operation is atomic and if two atomic operations try to manipulate the same data, one will fail. This is an optimistic kind of concurrency, because you don’t actively lock the data. You just assume everything will work out okay and deal with it if it doesn’t.
  2. Erlang’s Actor model has proven to be quite a successful way for doing many things concurrently. However, Erlang is a functional programming language that doesn’t have many of the language features that people today are used to (like, say, class-based OO).
  3. Scala includes an Actor library modeled after Erlang’s. It combines functional and OO styles and provides access to the Java libraries. If you write programs in Scala’s Actor style, you can get excellent, reliable concurrency… you just have to be careful about Java code that you import.
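The optimistic idea in the first item can be felt in plain Java without any STM library. The sketch below is not software transactional memory; it only applies the same “assume it will work, detect the conflict, retry” pattern to a single integer using compare-and-set. OptimisticAccount is a hypothetical name, and returning false stands in for the exception used earlier.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Not STM: just the optimistic "try, detect the conflict, retry" idea,
// applied to a single integer with compare-and-set.
class OptimisticAccount {
    private final AtomicInteger balance;

    OptimisticAccount(int openingBalance) {
        balance = new AtomicInteger(openingBalance);
    }

    int getBalance() {
        return balance.get();
    }

    // Returns false if funds are insufficient; never holds a lock.
    boolean debit(int amount) {
        while (true) {
            int seen = balance.get();
            if (seen < amount) {
                return false;
            }
            // Succeeds only if nobody changed the balance since we read it;
            // otherwise loop and try again with the fresh value.
            if (balance.compareAndSet(seen, seen - amount)) {
                return true;
            }
        }
    }
}
```

This works for one value at a time. Making a change to two accounts atomic, as a transfer needs, is exactly the part that STM aims to handle and that this sketch does not.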

Brian also pointed out that you can improve your concurrency picture in Java by making your objects immutable. If operations that change the object return new copies of the object instead of mutating in place, your code is much more concurrency friendly. Unfortunately, you’ll still need to watch out for libraries that other people created.
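As a rough sketch of what that could look like for the bank account, here is an immutable variant. ImmutableAccount is a hypothetical name, and IllegalArgumentException is used in place of LowBalanceError only to keep the example self-contained.

```java
// Hypothetical immutable take on the account: no field ever changes after
// construction, so an instance can be shared between threads freely.
final class ImmutableAccount {
    private final int balance;

    ImmutableAccount(int balance) {
        this.balance = balance;
    }

    int getBalance() {
        return balance;
    }

    // Returns a new account instead of changing this one.
    ImmutableAccount debit(int amount) {
        if (balance < amount) {
            throw new IllegalArgumentException("low balance");
        }
        return new ImmutableAccount(balance - amount);
    }

    // Returns a new account instead of changing this one.
    ImmutableAccount credit(int amount) {
        return new ImmutableAccount(balance + amount);
    }
}
```

A thread holding a reference to an ImmutableAccount can never see it change underneath it. What remains is deciding how the rest of the program learns about the new copy, which is a much smaller piece of shared state to protect.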

While it is certainly true that machines are going multicore, and that’s something to consider when building an app, it’s important to also consider the kind of application you’re building. I’ve found that it’s pretty easy to avoid shared state concurrency issues in web applications, because you build your code around the lifecycle of a request. Since your data objects are not shared between requests, the kinds of concurrency issues above don’t come into play. Of course, this assumes you’re not doing something stateful on your web server.

If concurrency is important to your app, Scala is a good language to keep an eye on. Erlang has a better story to tell as far as the purity of its approach to concurrency, but Scala’s design and implementation on the JVM have some practical advantages for apps people are building today.

Non-Relational Databases

I hosted two open spaces sessions. The first was about non-relational databases. We discussed when you might consider using a non-relational DB and many of the available options.

CouchDB has gotten a lot of press. It’s a “document-oriented” database written in Erlang. You can access it via any language using a REST API and JSON documents. It has an interesting style of querying where you write “views” in JavaScript. Recently, CouchDB’s author, Damien Katz, was hired to work full time on CouchDB for IBM. CouchDB has also spawned imitators (RDDB and Basura), which shows just how much interest there is in easy-to-use, easy-to-access databases.

From what I’ve seen of it, Persevere seems more “done” than CouchDB. I think I’ve had more exposure to Persevere than most people, because its author, Kris Zyp, recently became a coworker here at SitePen. Persevere is written in Java and uses Rhino for its JavaScript goodies. As with CouchDB, you send JSON documents and get JSON documents in reply. Unlike CouchDB, Persevere is built around a hierarchical object database and includes a JavaScript library to give you transparent persistence for your JS objects. I’ve only toyed with Persevere a little bit at this point, but it looks neat at first glance.

The Zope Object Database (ZODB) is a great choice for Python programmers. It has the unusual characteristics of being mature and transactional. It’s been around for a decade or so, so you can bet that people have worked out many of its kinks through real world usage. Like Persevere, it provides a hierarchical object database, this time with transparent persistence for Python objects. Also like Persevere, ZODB has pluggable storages. I have personal experience working with the ZODB and I really like it. My only complaint is that I have yet to see a reliable storage that doesn’t require “packing” like the default FileStorage does. A ZODB can grow quite quickly, depending on how you manage your writes.

We talked a bit about tuplespaces and Gigaspaces. There was also a mention of a similar project for Python called NetWorkSpaces. These are not really object databases so much as they are shared data transfer mechanisms. Tuplespaces are great for coordinating the work of multiple processes, potentially across multiple machines.
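To make the coordination idea concrete, here is a toy, single-process tuplespace. It is a sketch only: TinyTupleSpace and its out/take methods are hypothetical names in a loosely Linda-like style, not the API of Gigaspaces or NetWorkSpaces, and a real tuplespace would be shared across processes or machines.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Toy in-process tuplespace: producers drop tuples in, consumers pull out
// the first tuple that matches a template.
class TinyTupleSpace {
    private final List<Object[]> tuples = new ArrayList<>();

    // Publish a tuple and wake up any waiting consumers.
    synchronized void out(Object... tuple) {
        tuples.add(tuple.clone());
        notifyAll();
    }

    // Block until a tuple matches the template, then remove and return it.
    // A null field in the template matches any value.
    synchronized Object[] take(Object... template) throws InterruptedException {
        while (true) {
            Iterator<Object[]> it = tuples.iterator();
            while (it.hasNext()) {
                Object[] candidate = it.next();
                if (matches(template, candidate)) {
                    it.remove();
                    return candidate;
                }
            }
            wait();
        }
    }

    private static boolean matches(Object[] template, Object[] tuple) {
        if (template.length != tuple.length) {
            return false;
        }
        for (int i = 0; i < template.length; i++) {
            if (template[i] != null && !template[i].equals(tuple[i])) {
                return false;
            }
        }
        return true;
    }
}
```

A worker might call take("job", null) in a loop while a coordinator calls out("job", someId). Neither side needs to know about the other, only about the shape of the tuples.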

Amazon has SimpleDB. SimpleDB is a recent addition to their Amazon Web Services lineup. It provides a document-oriented database with automatic indexes for every column. Since SimpleDB is one of the services that store their data in Amazon’s cloud, you have “eventual consistency”… that means that data that you store in SimpleDB will eventually land there and be queryable, but you can’t predict exactly when that will happen. During our CodeMash session, we talked a fair bit about eventual consistency. For people used to traditional client/server RDBMS interactions, this is a bit of a shocker, but eventual consistency is how big sites like Amazon and Google can achieve amazing performance.
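If eventual consistency is new to you, a toy model may help. The sketch below is purely illustrative: LaggingStore and its methods are hypothetical, it is not the SimpleDB API, and it says nothing about how Amazon actually replicates data. It only shows the observable effect: a write is accepted, but a read may not see it until replication has caught up.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Toy model only: one primary copy and one replica that lags behind.
class LaggingStore {
    private final Map<String, String> primary = new HashMap<>();
    private final Map<String, String> replica = new HashMap<>();
    private final Queue<String[]> pending = new ArrayDeque<>();

    // Writes are acknowledged as soon as the primary has them.
    void put(String key, String value) {
        primary.put(key, value);
        pending.add(new String[] {key, value});
    }

    // Reads are served by the replica, which may not have caught up yet.
    String get(String key) {
        return replica.get(key);
    }

    // Stand-in for background replication that runs "eventually".
    void replicate() {
        String[] write;
        while ((write = pending.poll()) != null) {
            replica.put(write[0], write[1]);
        }
    }
}
```

Calling put and then get right away returns null in this model; the same get returns the stored value once replicate has run. Code written against an eventually consistent store has to tolerate that window.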

There’s also MonetDB, a column-oriented database. Column-oriented databases provide for blazing fast reads for things like data mining. Another example of a column-oriented DB is Google’s Bigtable, an open source version of which is under development as HBase in the Hadoop project. I found out about MonetDB because it provides an SQL query interface and there is work afoot to provide an SQLAlchemy connector for it. That makes MonetDB easy to use for things that I do in Python. Thus far, I have no experience with it, though.
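The reason column orientation helps with analytic reads can be sketched in a few lines. This is an in-memory illustration with made-up names (ColumnLayoutDemo, OrderRow, OrderColumns) and has nothing to do with MonetDB’s actual storage format: the same records are laid out once by row and once by column, and a total over one field only has to touch one array in the column layout.

```java
// Toy comparison of row and column layouts for the same three-field records.
class ColumnLayoutDemo {
    // Row-oriented: each record keeps all of its fields together.
    static final class OrderRow {
        final int id;
        final String customer;
        final int totalCents;

        OrderRow(int id, String customer, int totalCents) {
            this.id = id;
            this.customer = customer;
            this.totalCents = totalCents;
        }
    }

    // Column-oriented: one array per field, with position i across the
    // arrays making up record i.
    static final class OrderColumns {
        final int[] ids;
        final String[] customers;
        final int[] totalCents;

        OrderColumns(OrderRow[] rows) {
            ids = new int[rows.length];
            customers = new String[rows.length];
            totalCents = new int[rows.length];
            for (int i = 0; i < rows.length; i++) {
                ids[i] = rows[i].id;
                customers[i] = rows[i].customer;
                totalCents[i] = rows[i].totalCents;
            }
        }
    }

    // Walks every record object just to read one field from each.
    static long sumFromRows(OrderRow[] rows) {
        long sum = 0;
        for (OrderRow row : rows) {
            sum += row.totalCents;
        }
        return sum;
    }

    // Reads one contiguous array and never touches ids or customers.
    static long sumFromColumns(OrderColumns columns) {
        long sum = 0;
        for (int cents : columns.totalCents) {
            sum += cents;
        }
        return sum;
    }
}
```

Both methods return the same total. The difference is how much unrelated data has to be read along the way, which is what matters when the table is far larger than memory.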

We did also talk a bit about OR mapping, though the focus of the discussion was alternative database technologies.

The mismatch between the kinds of problems we’re solving and the kinds of development tools we use (largely OO languages today) makes other kinds of databases an interesting line of experimentation. Some are ready for prime time use, but you have to figure out which problems you’re trying to solve and find the solution that solves them best.

The Server Side Is Changing

SitePen is at the forefront of changes happening on the browser end of browser-based software. We’re also looking deeply at how things are changing on the server. An increasing number of CPU cores in our systems and techniques like Comet require us to think about concurrency beyond simple threading on the server. Relational databases are powerful and reliable, but non-relational databases have the potential to speed access to our ever-increasing volumes of data or to improve productivity by more closely matching our main programming languages.

Keep an eye on this space for more thoughts about how server-side development is evolving.

Comments

  • CouchDB is quite interesting. The REST DB API for CouchDB is very simple, which is part of its appeal (http://www.couchdbwiki.com/index.php?title=HTTP_Db_API). Since the ZODB is great at storing Python objects, and Python is great at handling objects with arbitrary attributes, and since there is a wealth of Python web code out there, it would make for an interesting experiment to implement the CouchDB API in a Python stack.

    Performance-wise, CouchDB is likely to scale up much better once the CouchDB developers start to tackle optimization, but it’d be interesting to see how fast a ZODB solution could go. A Python/ZODB-backed CouchDB-a-like could have a relationship to CouchDB similar to the one SQLite has to MySQL/PostgreSQL: start small in development, and export/import your data once you need to.

    The JavaScript views would be trickier to implement. However, you would also have the option of keeping server-side views written in Python on the filesystem rather than in the database, since storing code in the database can be a pain sometimes.

  • Sam Ruby has started a Python implementation of CouchDB that is backed by Berkeley DB:

    http://www.intertwingly.net/blog/2007/09/18/Introducing-Basura

    It seems to me that the main reason to implement something like CouchDB in Python rather than just using CouchDB itself would be to allow for views written in Python.

    One thing to consider here, also, is that if you treat these databases with HTTP APIs (CouchDB and Persevere) as database servers in the traditional MySQL/PostgreSQL sense, then it may not matter that they’re written in languages other than Python.