Note: This is an experimental site, if you have a feed you would like to add or you would like your feed to be removed please contact me.

Planet 5

August 31, 2010

Airs – Ian Lance Taylor

Condensation Computing

It’s a frequent observation that computing has oscillated between centralized and distributed. We’ve gone from centralized computing facilities (which were effectively single-user, but only the administrators could use them) to time-sharing to personal computers to server farms. Data has moved from the card deck you kept in your office to the tape deck at the computer center to the hard disk on your personal computer to the hard disk on a centralized server to a cloud site like Flickr. E-mail has moved from a mailbox on a timesharing system to your personal computer to a cloud site like GMail. People increasingly access data from their phones, but the data is stored in the cloud.

Right now we are clearly in a distributed trend. Data is increasingly stored in the cloud and accessed from a variety of devices. People shift from phones to laptops to desktops and expect to see the same list of contacts, the same e-mail, the same calendar. The cloud sites are an extreme version of centralization: millions of users store their data in the same place. What will the computing world look like when and if it oscillates back to a more distributed system?

One possibility is that people will increasingly acquire their own data storage which will be accessible over the net. They’ll keep small cheap redundant servers to hold their data. They’ll have one server at home and one at the office, and they will automatically sync up. Access will be very fast most of the time, and will be possible at over times. The servers will be updated automatically and so forth, and they will (somehow) be easy to administer. The advantage will be fast access to data most of the time and actual control over your data. If you want to delete something, it’s gone, and not available for resurrection.

That particular vision is easy for me to think of because it’s similar to our ideas when I co-founded a company, Zembu, back in 1998. I don’t know how compelling it is. I suspect that going back to a distributed environment will require some cost advantage, and I’m not sure I see that here. Much of cloud computing these days tends to be free, in the sense that advertising pays the bills. Few people will spend money to avoid ads. Few is more than zero, but it’s not enough to build a business on.

During any predominant paradigm it’s difficult to see what the next paradigm will be. History suggests that we will oscillate back, that the cloud will condense at some point. But history is not always right. It seems inherently unlikely to me that data will increasingly be centralized. But I don’t know what the alternative will look like.

by Ian Lance Taylor at August 31, 2010 02:28 PM

August 30, 2010

RSC

_rsc: @jespern metoo on the ssh access failures. ssh accepts the public key but refuses to run the hg --serve command.

_rsc: @jespern metoo on the ssh access failures. ssh accepts the public key but refuses to run the hg --serve command.

August 30, 2010 07:28 PM

Google's Go Guide

パッケージドキュメント更新のお知らせ

パッケージドキュメントの翻訳を更新しました。

mime
regexp
syscall

by noboru at August 30, 2010 03:38 PM

August 29, 2010

August 28, 2010

Command Center

Know your science

Except for the TV show "The Big Bang Theory", popular culture gets science wrong. We all know that.

But there's a way it tends to get science wrong that upsets me more than most. That is when it misuses the tools of science by willfully ignoring what science actually means.

One common example is celebrity equations, wherein some mathematical-looking expression mixes two or more celebrities together, as in (I'm making this one up and I'm not a cultural critic, let alone a comic, so please bear with me): Lady Gaga = (2*Madonna + Carrot Top)/3. Mathematically savvy readers will recognize that I normalized that equation. If you don't know what that means, you shouldn't be writing celebrity equations, because mathematical equations mean something, they're not just symbols. Like musical comedy based on bad notes, bogus mathematical equations are not funny, just lazy.

Some years ago I even wrote a letter to Entertainment Weekly when they had a long article full of egregious celebrity equations. To their credit, they published the letter and even mended their ways for a while. I quote the letter here:

According to EW math, the more buzz or intelligence you have, the less likely you are to be on the It List. That may be true, but I bet you didn't mean that. Your equation is art-directed nonsense. EW seems to think the joke is that the equations look cute: If Einstein is funny, his square root is hilarious...
In short, mathematics may look funny if you don't understand it but that doesn't make it funny if you misuse it in ignorance.

Another sort of abuse is comedy periodic tables: periodic tables of the vegetables, period table of the desserts, periodic table of the presidents, and on and on. There are zillions of them. I believe the vegetables one was the first widely distributed example.

What's wrong with them? Again, they miss the point about the one true periodic table, Mendeleev's periodic table of the elements. In fact, to put things with no structure into a periodic table not only misses the point of the periodic table, it misses the profound idea that some things have periods.

Mendeleev's table, by recognizing the periodic structure of the elements, predicted not only properties of the elements, but the very existence of undiscovered elements. It was a breakthrough.

The periodic table is not some artistic layout of letters, it's science at its very best, one of the great results of the 19th century and the birth of modern chemistry. It doesn't honor science to take, say, typefaces and put them in a funny-looking grid. That just mocks the idea that science can predict the way the world works.

Science is not arbitrary. Making arbitrary cultural artifacts by abusing scientific ideas is not just wrong, it's offensive. It cheapens science.

Another area of abuse is quantum mechanics, and a common victim is Heisenberg's uncertainty principle. Despite what some ill-informed academics would have you believe, Heisenberg's principle is not some general statment about weird shit happening in the world, it is a fantastically precise scientific statement about the limits of measurement of two simultaneous physical properties: position and momentum. It's not a metaphor!

What's really sad is that many of the commonest misuses of the terminology of quantum mechanics come from other areas of science and technology. For instance, there is a term in computer engineering called a Heisenbug, which refers to faults that are unpredictable, most often for bugs that go away when you examine them. It's a cute name but it isn't even a correct reference. The quantum mechanical property of things changing when you observe them is not the Heisenberg uncertainty principle, it's the observer effect. These two ideas are often confused but they are not the same. They're not even closely related.

The observer effect in quantum mechanics describes how the act of measuring a quantum system forces the system to cough up a measurable quantity, which triggers a "wave function collapse". Heisenberg's uncertainty principle says that the minimum product of the error in simultaneous measurement of a particle's position and momentum is Planck's constant divided by 4π, or as we write it in physics, ℏ/2. (By the way, that's an extremely small value.)

Not only are these very different ideas, neither of them has anything to do with computer bugs. The term Heisenbug is trendy but bogus and ignores some strange and beautiful ideas. It's no better informed than the square root of Einstein or the periodic table of the typefaces.

If you're going to use the terms of science to inform your world, please make a point to understand the science too. Your world will be richer for it.

by rob (noreply@blogger.com) at August 28, 2010 01:46 AM

August 27, 2010

Nick Johnson

New in 1.3.6: Namespaces

The recently released 1.3.6 update for App Engine introduces a number of exciting new features, including multi-tenancy - the ability to shard your app for multiple independent user groups - using a new Namespaces API. Today, we'll take a look at the Namespaces API and how it works.

One common question from people designing multi-tenant apps is how to charge users based on usage. While I'd normally recommend a simpler charging model, such as per user, that isn't universally applicable, and even when it is, it can be useful to keep track on just how much quota each tenant is consuming. Since multi-tenant apps just got a whole pile easier, we'll use this as an opportunity to explore per-tenant accounting options, too.

First up, let's take a look at the basic setup for namespacing. You can check out this demo for an example of what a fully featured, configurable namespace setup looks like, but presuming we want to use domain names as our namespaces, here's the simplest possible setup:

def namespace_manager_default_namespace_for_request():
  import os
  return os.environ['SERVER_NAME']

That's all there is to it. If we wanted to switch on Google Apps domain instead, we could do this:

def namespace_manager_default_namespace_for_request():
  from google.appengine.api import namespace_manager
  return namespace_manager.google_apps_namespace()

Edit: Note that this function is broken in 1.3.6, unfortunately.

With that, presto, your app now handles multi-tenancy. On each request, your configuration function is called to determine the current namespace, and that namespace is used to segment calls to the other APIs for the rest of the request.

Of course, your app may not be that simple - it may need to access data that isn't namespace-specific, or even access data across namespaces. In which case, the set_namespace function, and recipes like this are your friend.

Moving on to accounting, we'll need a way to record the resources consumed on each request. For that, we'll need to write some middleware, so that we can run before and after each request. Here's how we'll start:

class AccountingMiddleware(object):
  CURRENT_INSTANCE = None

  def __init__(self, application):
    self.application = application

  def __call__(self, environ, start_response):
    self.counters = {}
    self.disabled = False
    AccountingMiddleware.CURRENT_INSTANCE = self
    try:
      ret = list(self.application(environ, start_response))
      self.update_counters(environ, ret)
      self.store_counters()
      return ret
    finally:
      AccountingMiddleware.CURRENT_INSTANCE = None

Our middleware in this case is a callable class. The call method initializes a dict of counters, representing values we're accounting for, then calls the original application, saving the response. On some WSGI platforms, materializing the response like this would be a problem, but since we know App Engine receives the entire response before sending anything to the client, this should be fine. Then, we call a to-be-defined method, update_counters(), and another, store_counters(), to store it. Let's take a look at store_counters first:

  def store_counters(self):
    if self.disabled: return
    memcache.offset_multi(self.counters, key_prefix='quota:', initial_value=0)
    if memcache.add("quota:_lock", 1):
      defer(write_counters, self.counters.keys())

  def disable_accounting(self):
    """Disables accounting for this request."""
    self.disabled = True

Here's where our first tradeoff becomes apparrent. We need to account for the quota usage of every request, but we don't want to impose too much overhead doing so, by doing something like writing a datastore record for every request. We also can't risk datastore contention by performing a transaction on the quota record for every update.

Fortunately, there's a lightweight way to do this. While we would like our accounting to be as accurate as possible, the slight risk of under-accounting is probably acceptable, so we can use memcache. In the routine above, we do two things: We use the offset_multi function to update the counters for each of the values we've recorded, and we use memcache.add to atomically add a lock record. If the lock record did not already exist, we start off a task to write the counters to the datastore. With a setup like this, every request will record its own usage, and then add a task queue task if one isn't already in progress. If we want to reduce overhead further, at the cost of more risk of under-accounting, we could add a delay to the new task, enforcing a maximum rate at which we will write out quota updates.

It's also worth noting that none of the methods here have an explicit namespace specified. This is because the namespace facility automatically handles this for us. Let's see how the write_counters function works:

def write_counters(counter_names):
  """Writes counters from memcache to the datastore."""
  AccountingMiddleware.CURRENT_INSTANCE.disable_accounting()
  # Get the amounts to increment
  counters = memcache.get_multi(counter_names, key_prefix='quota:')
  counters = dict((k, int(v)) for k, v in counters.items())
  # Subtract the retrieved amounts from the counters
  memcache.offset_multi(dict((k, -v) for k, v in counters.items()),
                        key_prefix='quota:')
  # Remove the lock
  memcache.delete("quota:_lock")
  # Update the datastore
  QuotaUsage.write_quotas(datetime.date.today(), counters)

This is a little more complicated than anything we've seen so far, so let's go through it step by step. The order of operations here is important - otherwise we can introduce undesirable race conditions. The first thing we do is disable accounting for this request - otherwise, we'd create an infinite loop of task queue requests, each recording the quota used by the previous update! Then, we fetch the current values of the counters from memcache, so we know what to write. Next, we subtract the values we just received from the same counters. We have to do this, rather than deleting them, because the values could've been updated by another request between us fetching them and deleting them. Then, we remove the memcache lock, allowing another task to be enqueued, and finally we run the QuotaUsage.write_quotas method, which performs the actual transaction. Let's see that now.

class QuotaUsage(db.Model):
  """Records quota usage for a given namespace in a given time interval."""
  cpu = db.IntegerProperty(required=True, default=0)
  api_cpu = db.IntegerProperty(required=True, default=0)
  data_in = db.IntegerProperty(required=True, default=0)
  data_out = db.IntegerProperty(required=True, default=0)
  
  @classmethod
  def write_quotas(cls, date, quota_dict):
    """Increments quotas for the specified time interval.
    
    Args:
      date: A datetime.Date object for the date to increment quotas for.
      quotas: A dict mapping quota names to quantities to increment.
    """
    def _tx():
      key_name = date.isoformat()
      quotas = cls.get_by_key_name(key_name)
      if not quotas:
        quotas = cls(key_name=key_name)
      for k, v in quota_dict.items():
        setattr(quotas, k, getattr(quotas, k) + v)
      quotas.put()
    db.run_in_transaction(_tx)

This code, at least, should be fairly straightforward - it's a simple read, modify, write operation. The only complication is that we're making it data-driven, iterating over the values in the dict to determine which fields we update. Also note that we name the entity after the current date, neatly assuring we have daily accounting.

Finally, let's see update_counters, which is fairly straightforward:

  def update_counters(self, environ, ret):
    self.increment_counter('cpu', quota.get_request_cpu_usage())
    self.increment_counter('api_cpu', quota.get_request_api_cpu_usage())
    self.increment_counter('data_in', int(environ.get('CONTENT_LENGTH') or 0))
    self.increment_counter('data_out', sum(len(x) for x in ret))

  def increment_counter(self, name, value):
    self.counters[name] = self.counters.get(name, 0) + value

Here, we use a couple of sources of data for our quota values: the request and response, which we extract the incoming and outgoing bandwidth from (or rather, a lower bound on those values), and the quota API, which we use to get the CPU time used for both API calls and regular CPU usage.

Here's the magic appengine_config invocation to add our middleware:

def webapp_add_wsgi_middleware(app):
  import accounting
  app = accounting.AccountingMiddleware(app)
  return app

And that's all there is to it! Questions? Comments? Post them below!

by Nick Johnson at August 27, 2010 07:39 PM

August 24, 2010

Google's Go Guide

パッケージドキュメント更新のお知らせ

しばらく更新をさぼっていたら、かなり原文が更新されていました。
すべてではありませんが、パッケージドキュメントの翻訳を更新しました。

【削除】

exp/bignum

【更新】

asn1, big, bufio, bytes, container/vector, crypto/rand, crypto/tls, encoding/binary, exp/draw, exp/eval, exp/iterable, exp/nacl/av, fmt, go/parser, go/scanner, http, io, io/ioutil, net, os, runtime, scanner, strings, sync, template

残りは後日、アップデートします。

by noboru at August 24, 2010 10:42 AM

Airs – Ian Lance Taylor

Changing Minds

I’ve long felt that most people do not change their minds about things that are important to them, including areas like religion and politics. I think most people seem to pick a set of beliefs sometime in early adulthood or before and stick to them. It’s not that people are necessarily close-minded as such. It’s just that there are major aspects of life for which no argument can be sufficiently convincing in practice.

This has always just been a pet belief of mine. I’ve rarely managed to convince people that it is correct, which of course just reinforces my belief. Oddly, however, there was recently some scientific research which somewhat supports it. The theory is, essentially, that people disregard facts that disagree with their beliefs in order to reduce cognitive dissonance. The research was based on facts seen in news reports, so one interpretation of it is that people don’t believe the news, and that it doesn’t say anything about personal conversations. Still, I find it interesting, and of course, if true, it has implications for the ideal of an informed voting population.

Here is a Google Docs link to the study. Here is a Boston Globe article. I don’t expect you to be convinced, unless of course you already are.

by Ian Lance Taylor at August 24, 2010 06:57 AM

August 23, 2010

Airs – Ian Lance Taylor

Medical Side-Effects

My aunt died 12 days ago. I don’t want to write about my personal feelings about this, but I do want to write something about the events that led to her death. I don’t know all the details, but this is the basic story I heard as it evolved. Two or three years ago, at the age of 60, my aunt had a chest X-ray for some reason. She was in reasonably good health, but this X-ray picked up an unexpected growth of some sort. The doctor recommended that they wait six months and take another look. After six months, they looked again, and the growth was still there, and, though it was hard to be certain, it might have gotten slightly larger. The doctors started discussing the possibility of doing a biopsy to see if it was cancerous. My aunt was understandably alarmed about all this, and on the advice of a doctor agreed to go ahead with the biopsy.

I don’t know exactly where the growth was, but the operation for the biopsy was major. They had to left up the rib cage and make a major incision in the chest. I spoke to her just after the operation and she was still grappling mentally with how major it was, how weak she felt, and what her projected recovery was going to be like. As it happened, she never truly recovered. While she did leave the hospital for a time, she had a series of increasingly serious infections due to the side-effects from the operation, and spent more and more time at the hospital. After a couple of years of this, she died from one of the infections. The biopsy showed that the growth was benign, and she would have been fine if nothing had been done.

This all happened in the U.S., where medicine has a strong bias toward doing something rather than nothing. Doctors get paid when something is done, not when nothing is done. I believe that U.S. culture in general is biased toward taking action rather than waiting. It would be unfair of me to blame this entirely on the medical system. Once the growth was detected, my aunt would have worried about it if nothing had been done. But I don’t think she truly understood what she was getting into with the operation even if all had gone well.

In the end, she died at the age of 62. The cause of death was the modern medical system. Perhaps she would have done better if she had felt differently about the information, or if she had gotten different advice, or if she had lived in a different country. We’ll never really know. But it’s certainly hard to feel comfortable about what actually happened.

by Ian Lance Taylor at August 23, 2010 05:49 AM

August 19, 2010

Go on Web

To new() or Not To new() [Link]

Rob ‘Commander’ Pike has asked for feedback on a proposal to remove the “new” function for a number of reasons, one of which is to reduce confusion for new users. There’s a lot activity around the Proposed Change to the language.  Most people … Continue reading


by hokapoka at August 19, 2010 08:49 AM

August 17, 2010

Renee French

from edison steelhead's lost portfolio - sorry

by renee (noreply@blogger.com) at August 17, 2010 06:47 AM

August 16, 2010

Adam Langley

DNSSEC and TLS

Ever since the DNS root was signed on July 15th, quite a few people have been wondering about the interaction of TLS and DNSSEC. On the one hand, trust in the CA system is lukewarm but, on the other, the deployment issues with getting DNSSEC to the client seem immense.

Those who saw Dan Kaminsky's talk at BlackHat, which included a patched version of Chromium performing DNSSEC validation, have probably already guessed that Dan, myself and others have been working on this problem. In this blog post I'm going to try to explain the design space as I see it.

Objectives

In the long term, we want a stronger foundation of trust for the Internet. This means both pairing back the power of the 1500 root certificates and making TLS easier to deploy and thus more commonly used.

So one of the goals is to serve sites which currently don't have a CA signed certificate. We can do that by allowing them to publish fingerprints in DNS and having browsers accept those (DNSSEC secured) fingerprints.

There might also be speed advantages to be had by avoiding OCSP and CRL lookups.

We also want to serve those sites which currently do have CA signed certificates. For them we can provide exclusion of other certificates issued either mistakenly or maliciously by CAs.

Yet it's important to note that this isn't a plan for the elimination of CAs. CAs are still a good way to link DNS names to legal entities. For that we need EV certificates and CAs to issue them.

The Design Space

Firstly we have the issues which are, in many respects, the least important. How should we encode the data?

What type of record? The two major candidates are a TXT record and a new type of CERT record. It's important that we figure it out as DNS queries can only ask for one type of response (ignoring ANY, which we don't want) and each query is another chance for packet loss and a painful timeout.

TXT records have the advantage that the crippled web interfaces, by which many people manage their DNS, often support them. They are also already common (try looking up TXT records for google.com, amazon.com etc). They are human readable. They are easily extendable given a sensible format.

A new CERT type is the ‘cleaner’ solution. Stuffing things in TXT records is clearly a hack and working around crappy web interfaces feels like being a good man standing still.

Where to put the record? If one were to use a TXT record then the temptation would be to put it on a prefix label like _tls.www.example.com. However, if www.example.com is a CNAME then we won't get a helpful answer: probably just an NXDOMAIN and we'll have to hunt around: taking many round trips and a painful latency hit. Because of this, I like putting the record on the target domain name.

Next we'll consider some of the deployment options because they inform some of the points to follow.

What about clients without DNSSEC resolution ability? This is a big question. We don't care if your ISP's resolver is going to set the AD (Authenticated Data) bit in DNS replies: we don't trust it. We need either a local recursive resolver or we need the full DNSSEC chain so that we can check the signatures ourselves.

(It's worth pointing out that, although we can't trust any fingerprint information without DNSSSEC, we can provide exclusion without DNSSEC. It's a bit of a weak threat model: the attacker would have to control some of your network but not your DNS resolutions: maybe you have the records cached with a long TTL.)

One option is to put an aggressive DNSSEC resolver in the client and set DO (DNSSEC OK) and CD (Checking Disabled) bits in the requests to get the signatures themselves.

What about clients which can't even do DNS resolution correctly? (Also known as the Hotel Network Problem.) For them we could tunnel DNS over HTTP. In fact, if you encode the request in a GET request you can set the HTTP caching headers so that HTTP caches are DNS caches too (Dan's trick).

If we're talking over HTTP, why not get the server to give us the whole chain? In that case, what about an EDNS0 option to ask for the full chain? Both are possibilities.

How about putting the DNSSEC chain in other protocols? What about embedding the chain in an X.509 certificate? Chain embedding can solve the needs of sites which want to use self-signed certificates.

Now we can start to get to some of the more meaty issues:

Fingerprint in record? If you only have a single certificate then you want to put the fingerprint in a record on the domain name. That way the client can start the lookup at the same time as for the A record. But if you have many fingerprints, that becomes troublesome and you want to lookup a domain named by the fingerprint. That's slower because you can only start that lookup once you have the certificate from the server. The answer looks to be that one has to handle both types of lookup.

What to hash? If we are going to embed a DNSSEC chain in a certificate, then the fingerprint can't cover the whole certificate (because then the hash would have to cover itself). So that suggests hashing only the public key. However, if we do that then the attacker can change other parts of the certificate at will.

Here we have a bit of an impedance mismatch with the X.509 world. Existing APIs are usually of the form “please validate this certificate”. Once validated, everything in the certificate is trusted because CAs are the Voice of God. However, in a DNSSEC world, authority is scoped.

If we hash only the public key then an attacker could include things in an X.509 certificate for example.com which would appear to have the authority of example.com to an unsuspecting application. If we hash the whole certificate then example.com could put things in its certificate which might assert things which example.com shouldn't be allowed to. Applications which work on the Voice of God model could be misled.

It's tricky. At the moment we are supporting both models.

Should we include a flag to perform CA validation in addition? For performance reasons, people might want to use just DNSSEC because one can avoid OCSP and CRL lookups that way. For EV certs, we certainly want to perform CA validation in addition, but what about DV certs? Some people might like the idea of enforcing the expiry and OCSP checks.

TLS extensions? Embedding a DNSSEC chain in an X.509 certificate means that you need to regenerate the certificate once a week or so (because DNSSEC signatures expire). Also, CAs aren't going to sign certificates with a random blob of binary data that they don't understand. Because of this, it's been suggested that we could carry the DNSSEC chain in a TLS extension. However, chain embedding is for those servers which don't have a CA issued certificate. Clients aren't going to be able to require DNSSEC for many years, if ever. And if we're going to accept CA certificates without DNSSEC, then an embedded chain can't provide exclusion.

The Code

The code is minimal so far. Chrome trunk already includes support for validating DNSSEC chains embedded inside an X.509 cert, although it requires a command line flag to enable. I also have code to generate the chains. However, the format of the chain is going to change so I'm not making that public yet.

Hopefully there will be something more substantial in a couple of months.

August 16, 2010 07:00 AM

August 15, 2010

Go on Web

GOAuth – An OAuth Consumer Written in Go

We have just released GOauth an OAuth Consumer Written in the Go Programming language. GOAuth has been tested with twitter, Buzz & other Google OAuth products and has worked flawlessly. Opps removed some of the Bespoke google params, such as … Continue reading


by hokapoka at August 15, 2010 11:25 AM

August 14, 2010

Go on Web

Mindchunk: A quick tour of Go’s dict package [Link]

Check out this post on the new addition of Golang “net/dict” package. Link : Mindchunk: A quick tour of Go’s dict package. Related Posts:Web chat using WebSockets and Go [Link]Serve on multiple hosts / domains with GoTo new() or Not To … Continue reading


by hokapoka at August 14, 2010 01:35 PM

Using GO to render HTML for WordPress or PHP

Some people maybe already be aware, others may not, we’re using WordPress. Which, of course, is written in PHP. It’s time we added some Go Power to WordPress! We’ve created a Go Routine that is generating HTML we want to … Continue reading


by hokapoka at August 14, 2010 11:18 AM

August 13, 2010

Nick Johnson

Reduced posting schedule

As some of you will no doubt have already noticed, I've been posting here less frequently since I went on holiday than I previously was. Previously, my schedule was 3 posts a week - Monday, Wednesday, Friday. This let me turn out a lot of content, but it also consumed a lot of time.

Unfortunately, demands on my time are such that at the moment I can't keep that schedule up. Also, though there's still lots of App Engine and other posts I'd like to write, I'm not sure I can come up with 3 a week at the moment. For that reason, I'm going to be reducing the posting schedule, for now, to 1 post a week.

On the plus side, you can expect bigger, more detailed posts than was the average when I was posting 3 a week. An example of this would be the recent Damn Cool Algorithms post on Levenshtein Automata.

Also, due to unforseen problems with both of the posts in my pipeline currently, there'll be no post at all this week - but you can look forward to two posts next week, instead!

by Nick Johnson at August 13, 2010 12:51 PM

August 12, 2010

Go on Web

Append / Add to a Go lang Collection Part 2

In the previous couple of Article’s we added some methods to a struct that contained a []*T a few people have pointed out that you can do that same with type foo []*T. This article looks at how to achieve … Continue reading


by hokapoka at August 12, 2010 01:22 PM

August 11, 2010

Cheesesun Works

Circular Number System in C

In a previous post I discussed the idea of a circular number system. To reiterate: it is a system whereby the passing the maximum value will return one to the beginning. Take a twenty-four hour clock, for example, which has the range of [0, 23]. The hour that is two hours beyond 23 is not 25, but 1.

Another example is angle measurement. If one has two angles, one at 5 degrees and another at 355 degrees, how far is the second from the first. Linear measurement has one subtracting the two, resulting in an answer of 350 degrees, but looking at the two angles marked on a circle one can see that that is not entirely correct. It is a distance and in some situations has value, but it is not the shortest distance. For that, the correct answer is -10 degrees.

The implementation of this system that I posted was written in Python, but different programming languages have different behaviours for the modulo operator. In some cases, like C++, the behaviour depends upon the implementation. I will not discuss these differences here, as more knowledgeable individuals have already done so elsewhere. For the purpose of this article, I refer to the default C standard used by gcc under Ubuntu 10.04 Linux.

One further difference between C and Python is that Python's modulo operator also works with floating point numbers, while in C it works only with integers. For floating point numbers one must use fmod, which is found in math.h.

Here is the difference function in C:

double diff(double x0, double x1) {
double r = fmod(x1 - x0 + 0.5*UPPER, UPPER);
return (r
}

The function finds the shortest distance to x1 from x0 distance, assuming a range of [0, UPPER). The maximum absolute distance between any two values will never be more than half of UPPER.

by Ostsol (noreply@blogger.com) at August 11, 2010 06:24 PM

Go on Web

Append / Add to a Go lang Collection

Go lang’s composite literal notation is really nice, but more often than not we find ourselves needing to dynamically append to slices & arrays. This article looks at how implement a .Add feature into a struct that contains a []*T, … Continue reading


by hokapoka at August 11, 2010 08:33 AM

Implementing sort.Sort on a struct in Golang – Part 2

We’re about to release an implementation of an OAuth Consumer written in Go lang that required some additional Sorting logic. In this article we’ll look at this sorting logic. According to the OAuth specification RFC5849 the request parameters need to … Continue reading


by hokapoka at August 11, 2010 07:59 AM

August 06, 2010

Nick Johnson

Using OpenID authentication on App Engine

With the release of SDK 1.3.4, preliminary support is available for native OpenID authentication in App Engine. Today, we'll demonstrate how to use the new OpenID support in your app.

Edit: There's now an official article on OpenID on App Engine!

The first step in setting up OpenID authentication is to change your app's authentication settings. Log in to the admin console, select your app, and go to "Application Settings". There, you can pull down the "Authentication Options" box, and select "(Experimental) Federated Login".

Once you've enabled OpenID authentication for your app, a few things change:

  • URLs generated by create_login_url without a federated_identity parameter specified will redirect to the OpenID login page for Google Accounts.
  • URLs that are protected by "login: required" in app.yaml or web.xml will result in a redirect to the path "/_ah/login_required", with a "continue" parameter of the page originally fetched. This allows you to provide your own openid login page.
  • URLs generated by create_login_url with a federated_identity provider will redirect to the specified provider.

In order to make best use of this functionality, here's what we'll do:

  1. Provide an OpenID login page on /_ah/login_required
  2. Have this login page use create_login_url to redirect users to their openid login page.
  3. Modify existing calls to create_login_url to instead send users to /_ah/login_required

Here's a straightforward implementation of the /_ah/login handler:

class OpenIdLoginHandler(webapp.RequestHandler):
  def get(self):
    continue_url = self.request.GET.get('continue')
    openid_url = self.request.GET.get('openid')
    if not openid_url:
      path = os.path.join(os.path.dirname(__file__), 'templates', 'login.html')
      self.response.out.write(template.render(path, {'continue': continue_url}))
    else:
      self.redirect(users.create_login_url(continue_url, None, openid_url))

This handler takes care of the first two changes we described earlier. First, we retrieve the 'continue' parameter (if it's provided), and the 'openid' parameter. If no openid parameter is provided, we show the login form. If an openid URL is provided, we pass both that and the continue URL to create_login_url, redirecting the user to the URL that function generates.

Here's an excessively simplistic version of the login.html template:

<html>
<head>
  <title>Log in with OpenID</title>
</head>
<body>
  <h1>Log in with OpenID</h1>
  <form method="get" action="/_ah/login_required">
    {% if continue %}
      <input type="text" name="continue" value="{{continue|escape}}" />
    {% endif %}
    <input type="text" name="openid" />
    <input type="submit" value="Log In" />
  </form>
</body>
</html>

As you can see, all the page does is solicit the user's OpenID URL, and send it back, along with the already-provided continue URL, to the same handler we looked at earlier.

Now that we've got basic OpenID support working, we should modify our app to send users to our own login page, instead of the default one for OpenID login with Google accounts. To do this, we'll define a simple function to take the place of create_login_url in these places:

def create_openid_url(continue_url):
  continue_url = urlparse.urljoin(self.request.url, continue_url)
  return "/_ah/login?continue=%s" % urllib.quote(continue_url)

Then, we replace any invocations of "users.create_login_url(something)" with "create_openid_url(something)" throughout our code, and we're done!

Friendlier login pages

The login page we demonstrated is pretty spartan. That's easily fixed, of course - you can style it as you would the rest of your site, and add descriptive text, and so forth. Worse, though, many users, when asked for their "openid URL" will simply look at you in puzzlement. What we need is a solution that avoids the need for most users to enter their URL themselves, while still allowing savvy users to do just that.

Fortunately, there are a number of such solutions. One of them is clickpass. After signing in there and setting up an entry for your site, they'll provide you with code for a button that you can embed in your login page, and which allows users to choose from a number of well-known identity providers, including Hotmail, Yahoo!, Google, and Facebook, as well as entering their own URL.

Once you've added your site in Clickpass, click on 'Log users in', and you'll be asked for a few details. Here's what they ask for, and how you should fill it out:

OpenID trust rootThe root URL of your site, eg http://myapp.appspot.com/ or http://www.myapp.com/
begin_openid_login/_ah/login_required
OpenID parameter labelopenid
Submission methodGET
SSL enabledTrue if you're hosting off appspot.com and have "secure: optional" or "secure:required" for your URLs; false otherwise.

There's one further advantage of using a system such as ClickPass's: You can embed the button widget on any and all pages of your site - which means your users don't have to visit your login page at all in the usual case: they'll go straight from your site to their provider's login page.

by Nick Johnson at August 06, 2010 11:09 AM

Go on Web

Will Go, go like wave

After the news earlier this week about Google abandoning Wave what that the chances of the same thing happening to Golang? The more time & effort developers and companies put into using, learning & implementing Go the more they have … Continue reading


by hokapoka at August 06, 2010 10:38 AM

August 05, 2010

Nick Johnson

Using BlobReader, wildcard subdomains and webapp2

Today we'll demonstrate a number of new features and libraries in App Engine, using a simple demo app. First and foremost, we'll be demonstrating BlobReader, which lets you read Blobstore blobs just as you would a local file, but we'll also be trying out two other shinies: Wildcard subdomains, which allow users to access your app as anything.yourapp.appspot.com (and now, anything.yourapp.com), and Moraes' excellent new webapp2 library, a drop-in replacement for the webapp framework.

Moraes has built webapp2 to be as compatible with the existing webapp framework as possible, while improving a number of things. Improvements include an enhanced response object (based on the one in webob), better routing support, support for 'status code exceptions', and URL generation support. While the app we're writing doesn't require any of these, per-se, it's a good opportunity to give webapp2 a test drive and see how it performs.

But what are we writing, you ask? Well, to show off just how useful BlobReader is, I wanted something that demonstrates how you can use it practically anywhere you can use a 'real' file object - such as using it to read zip files from the blobstore using the (native code) zipfile module. So, we'll build a simple app that lets users upload zips of static content, and serve them on a custom subdomain.

Let's get started. We'll begin by defining our data model:

BASE_DOMAIN = "%s.appspot.com" % os.environ['APPLICATION_ID']


class Site(db.Model):
  owner = db.UserProperty(required=True)
  last_updated = db.DateTimeProperty(required=True, auto_now=True)
  zipfile = blobstore.BlobReferenceProperty(required=True)

  @property
  def url(self):
    return '%s.%s' % (self.key().name(), BASE_DOMAIN)

Simple enough. Our Site entities will use the subdomain as the key name, and we store the user that owns the site, when it was last updated, and a reference to the zip they uploaded. Next, let's define a base request handler, and a main page handler, which will list the user's sites and let them update them or upload a new one:

class BaseHandler(webapp2.RequestHandler):
  def __call__(self, *args, **kwargs):
    self.user = users.get_current_user()
    if not self.user:
      self.redirect(users.create_login_url(self.request.url))
    else:
      return super(BaseHandler, self).__call__(*args, **kwargs)

  def render_template(self, file, template_vars):
    path = os.path.join(os.path.dirname(__file__), 'templates', file)
    self.response.out.write(template.render(path, template_vars))


class MainHandler(BaseHandler):
  def get(self):
    sites = Site.all().filter('owner =', self.user).fetch(20)
    self.render_template('index.html', {
        'sites': sites,
        'upload_url': self.url_for('upload'),
    })

This should look eerily familiar - as I've said, webapp2 imitates the webapp framework fairly closely. One item of note is our overriding of __call__: unlike webapp, webapp2 does its own dispatch, which means that we can do 'middleware' type things such as checking the current user by overriding that functionality in our subclass, rather than needing to wrap each method in a decorator. Otherwise, everything we do here is no different to how it would work in webapp. We won't include the templates here, since they're pretty straightforward - see the full code if you're interested.

Next, we should define an upload handler. The part to render the upload form is pretty simple:

class UploadHandler(BaseHandler):
  def get(self):
    self.render_template('upload.html', {
        'upload_url': blobstore.create_upload_url(self.url_for('upload')),
        'site_name': self.request.GET.get('site', None),
    })

The only item of note here is our use of webapp2's support for URL generation by calling the 'url_for' method, which generates a URL given the name of the route (which we'll define later). The code to process the upload is a little more complicated, since we need to handle a few edge cases: handling both new sites and updates to existing ones, as well as making sure people can't update others's sites, but it's still fairly straightforward:

  def post(self):
    site = self.request.POST['site']
    blob_key = blobstore.parse_blob_info(self.request.POST['file'])
    db.run_in_transaction(self.upload_tx, site, blob_key)
    self.redirect_to('main')

  def upload_tx(self, site_name, blob_key):
    site = Site.get_by_key_name(site_name)
    if site:
      if site.owner != self.user: return
      site.zipfile.delete()
      site.zipfile = blob_key
    else:
      site = Site(key_name=site_name, owner=self.user, zipfile=blob_key)
    db.put(site)

Note the use of the redirect_to method - this is syntactic sugar for self.redirect(self.url_for('main')).

Finally, we should define the handler that actually serves the sites up. This is where the important stuff happens:

INDEX_FILES = ['index.html', 'index.htm']
SUBDOMAIN_RE = re.compile("^([^.]+)\.%s\.appspot\.com$"
                          % os.environ['APPLICATION_ID'])


class SiteHandler(webapp2.RequestHandler):
  def get(self, path):
    site_name = SUBDOMAIN_RE.search(self.request.host).group(1)
    site = Site.get_by_key_name(site_name)
    if not site:
      self.abort(404)
    zip_key = Site.zipfile.get_value_for_datastore(site)
    site_zip = zipfile.ZipFile(blobstore.BlobReader(zip_key))

    path, data = self.get_contents(site_zip, path)
    self.response.headers['Content-Type'] = mimetypes.guess_type(path)[0]
    self.response.out.write(data)

First, we extract the subdomain that was used, and look up the site entry from that. Then, we extract the blob key of the zipfile, and pass that to blobstore.BlobReader, which we pass in turn to zipfile.ZipFile, returning a ZipFile object we can use to read individual files from the zip. Then we call a method, self.get_contents, where the important bits happen:

  def get_contents(self, site_zip, path):
    if path.endswith('/'):
      for idx in INDEX_FILES:
        newpath = os.path.join(path, idx)[1:]
        try:
          data = site_zip.read(newpath)
          return newpath, data
        except KeyError:
          pass
      self.abort(404)
    else:
      try:
        return path, site_zip.read(path[1:])
      except KeyError:
        self.abort(404)

If the path ends with a trailing slash, it's a directory, so we check the zip for the existence of various common index files - index.html, index.htm - and return those. Otherwise, we simply fetch the requested file and return it to the caller. Easy - just as if it really were a local file. Finally, back in the get() method of the handler, we use the mimetypes module to guess the correct content type to return, and send the data back to the user.

There's still a little more glue required to finish things up. First, we need to define webapps, with the appropriate routes, for the admin site and for the sub-sites:

main_app = webapp2.WSGIApplication([
    webapp2.Route(r'/', MainHandler, name='main'),
    webapp2.Route(r'/upload', UploadHandler, name='upload'),
])


site_app = webapp2.WSGIApplication([
    webapp2.Route(r'<path:.*>', SiteHandler),
])

This is very similar to how webapp handles things - and indeed, you can even use regular webapp routes if you want - but the Route class has more flexibility, including naming the handlers, so we can use URL building as we demonstrated above. Also note the use of named groups for the site_app, to extract the path and provide it as an argument to our SiteHandler's get method.

Next, we need some way to direct requests to the appropriate webapp depending if the request is for the bare domain or a subdomain. For that, we'll write a simple piece of middleware:

def domain_middleware(domain_map):
  domain_map = [(re.compile('^%s$' % x) if isinstance(x, basestring) else x, y)
                for x, y in domain_map]
  def middleware(environ, start_response):
    domain = environ['SERVER_NAME']
    for regex, app in domain_map:
      if regex.match(domain):
        return app(environ, start_response)
  return middleware

The domain_middleware function accepts a list of tuples, corresponding to regular expressions used to match domains, and the WSGI app that matching requests should be sent to. Look out for integrated support for this in webapp2 in the near future! In the meantime, here's our own middleware in action:

app = domain_middleware([
    (SUBDOMAIN_RE, site_app),
    ('.*', main_app),
])


def main():
  run_wsgi_app(app)


if __name__ == '__main__':
  main()

And that's all there is to it. To see it in action, check out http://pydocs.zip-site.appspot.com/, where I've uploaded the pydocs for Python 2.7, or see the admin interface at http://zip-site.appspot.com/. The full source code is online here.

by Nick Johnson at August 05, 2010 01:17 PM

August 04, 2010

Go's official blog

Defer, Panic, and Recover

Go has the usual mechanisms for control flow: if, for, switch, goto. It also has the go statement to run code in a separate goroutine. Here I'd like to discuss some of the less common ones: defer, panic, and recover.

A defer statement pushes a function call onto a list. The list of saved calls is executed after the surrounding function returns. Defer is commonly used to simplify functions that perform various clean-up actions.

For example, let's look at a function that opens two files and copies the contents of one file to the other:

func CopyFile(dstName, srcName string) (written int64, err os.Error) {
src, err := os.Open(srcName, os.O_RDONLY, 0)
if err != nil {
return
}

dst, err := os.Open(dstName, os.O_WRONLY|os.O_CREATE, 0644)
if err != nil {
return
}

written, err = io.Copy(dst, src)
dst.Close()
src.Close()
return
}

This works, but there is a bug. If the second call to os.Open fails, the function will return without closing the source file. This can be easily remedied by putting a call to src.Close() before the second return statement, but if the function were more complex the problem might not be so easily noticed and resolved. By introducing defer statements we can ensure that the files are always closed:

func CopyFile(dstName, srcName string) (written int64, err os.Error) {
src, err := os.Open(srcName, os.O_RDONLY, 0)
if err != nil {
return
}
defer src.Close()

dst, err := os.Open(dstName, os.O_WRONLY|os.O_CREATE, 0644)
if err != nil {
return
}
defer dst.Close()

return io.Copy(dst, src)
}

Defer statements allow us to think about closing each file right after opening it, guaranteeing that, regardless of the number of return statements in the function, the files will be closed.

The behavior of defer statements is straightforward and predictable. There are three simple rules:

1. A deferred function's arguments are evaluated when the defer statement is evaluated.

In this example, the expression "i" is evaluated when the Println call is deferred. The deferred call will print "0" after the function returns.

func a() {
i := 0
defer fmt.Println(i)
i++
return
}

2. Deferred function calls are executed in Last In First Out order after the surrounding function returns.

This function prints "3210":

func b() {
for i := 0; i < 4; i++ {
defer fmt.Print(i)
}
}

3. Deferred functions may read and assign to the returning function's named return values.

In this example, a deferred function increments the return value i after the surrounding function returns. Thus, this function returns 2:

func c() (i int) {
defer func() { i++ }()
return 1
}

This is convenient for modifying the error return value of a function; we will see an example of this shortly.

Panic is a built-in function that stops the ordinary flow of control and begins panicking. When the function F calls panic, execution of F stops, any deferred functions in F are executed normally, and then F returns to its caller. To the caller, F then behaves like a call to panic. The process continues up the stack until all functions in the current goroutine have returned, at which point the program crashes. Panics can be initiated by invoking panic directly. They can also be caused by runtime errors, such as out-of-bounds array accesses.

Recover is a built-in function that regains control of a panicking goroutine. Recover is only useful inside deferred functions. During normal execution, a call to recover will return nil and have no other effect. If the current goroutine is panicking, a call to recover will capture the value given to panic and resume normal execution.

Here's an example program that demonstrates the mechanics of panic and defer:

package main

import "fmt"

func main() {
f()
fmt.Println("Returned normally from f.")
}

func f() {
defer func() {
if r := recover(); r != nil {
fmt.Println("Recovered in f", r)
}
}()
fmt.Println("Calling g.")
g(0)
fmt.Println("Returned normally from g.")
}

func g(i int) {
if i > 3 {
fmt.Println("Panicking!")
panic(fmt.Sprintf("%v", i))
}
defer fmt.Println("Defer in g", i)
fmt.Println("Printing in g", i)
g(i+1)
}

The function g takes the int i, and panics if i is greater than 3, or else it calls itself with the argument i+1. The function f defers a function that calls recover and prints the recovered value (if it is non-nil). Try to picture what the output of this program might be before reading on.

The program will output:

Calling g.
Printing in g 0
Printing in g 1
Printing in g 2
Printing in g 3
Panicking!
Defer in g 3
Defer in g 2
Defer in g 1
Defer in g 0
Recovered in f 4
Returned normally from f.

If we remove the deferred function from f the panic is not recovered and reaches the top of the goroutine's call stack, terminating the program. This modified program will output:

Calling g.
Printing in g 0
Printing in g 1
Printing in g 2
Printing in g 3
Panicking!
Defer in g 3
Defer in g 2
Defer in g 1
Defer in g 0
panic: 4

panic PC=0x2a9cd8
[stack trace omitted]

For a real-world example of panic and recover, see the json package from the Go standard library. It decodes JSON-encoded data with a set of recursive functions. When malformed JSON is encountered, the parser calls panic is to unwind the stack to the top-level function call, which recovers from the panic and returns an appropriate error value (see the 'error' and 'unmarshal' functions in decode.go). There is a similar example of this technique in the Compile routine of the regexp package. The convention in the Go libraries is that even when a package uses panic internally, its external API still presents explicit error return values.

Other uses of defer (beyond the file.Close() example given earlier) include releasing a mutex:

mu.Lock()
defer mu.Unlock()

printing a footer:

printHeader()
defer printFooter()

and more.

In summary, the defer statement (with or without panic and recover) provides an unusual and powerful mechanism for control flow. It can be used to model a number of features implemented by special-purpose structures in other programming languages. Try it out.

by Andrew Gerrand (noreply@blogger.com) at August 04, 2010 11:20 PM

August 03, 2010

Go on Web

Implementing sort.Sort on a struct in Golang – Part 1

There maybe a number of times when you’re using Go that you will want to sort a collection.  In order to do do so you just need to implement the sort.Interface. This article assume you are know how to Implement … Continue reading


by hokapoka at August 03, 2010 05:32 PM

Web chat using WebSockets and Go [Link]

Check you this great article using Golang & WebSockets to create a Browser based Chat client. WebSockets are part of the HTML5 initiative, to extended full-duplex communications to Clients via Javascript. The author, Gary Burd, shows just how easy it is to … Continue reading


by hokapoka at August 03, 2010 11:40 AM

Nick Johnson

Webapps on App Engine, part 6: Lazy loading

This is part of a series on writing a webapp framework for App Engine Python. For details, see the introductory post here.

A major concern for many people developing for App Engine, particularly those building low-to-medium traffic sites, is instance load time. When App Engine serves the first request to a new instance of your app, it must import the request handler module you specified, which in turn imports all the other modules required to serve the request. In large apps, this can add up to quite a lot of additional overhead for loading requests, and substantially impact the experience for end users.

There are a number of things you can do to reduce loading times, including using lighter weight frameworks instead of all inclusive ones, and breaking seldom used components up into separate handlers - an approach taken by bloggart for the admin interface. One source of inefficiency stands out as a prime candidate for optimisation, though: unnecessary imports.

Many frameworks, including the built in webapp framework, require you to provide a list of handler classes that should be instantiated to serve requests, in a 'url map'. When a request comes in, the framework simply instantiates the relevant class and calls it to handle the request. However, doing this requires you to import all your handler classes so you can construct the url map, and this likely results in transitively importing your entire webapp, even though only a small proportion of it may be required to handle the request at hand.

You may have noticed that our framework, so far, doesn't appear to do any better on this front - and you'd be right. That's about to change, however. One thing we've kept a constant throughout writing the framework is making as many components as possible independent, often by making them WSGI applications in their own right. The router is a WSGI application, and so are handler classes, and so is the WebOb response object. Today, we'll make use of that feature to add support for lazy loading in a manner that's completely independent of our framework - and would work on any other WSGI-based framework, too.

The interface to our lazy loader will be straightforward: Instantiate a class with the fully qualified name of a module containing a WSGI application (such as a handler for our framework), and it will act as a WSGI application that imports the handler module on first access, and calls it in turn. To do this, we need to be able to import a module whose name is defined at runtime - and for that we use the Python builtin __import__.

__import__'s operation is fairly straightforward: Simply call it with the name of the module to be imported as the first argument, and it imports the relevant module into the Python runtime. Somewhat confusingly, though, it returns the top-level module rather than the one we asked for - so __import__('foo.bar.baz') returns 'foo'. As additional information, we pass __import__ it our local and global variables using two more arguments, to allow it to resolve relative imports. In short, this:

from mypackage import mymodule

Is equivalent to this:

__import__('mypackage.mymodule', globals(), locals())
mymodule = sys.modules['mypackage.mymodule']

We also need to know how to retrieve a name from a module dynamically. This is even simpler: Every object in python (more or less) is backed by a dict, which can be accessed with its '__dict__' attribute. Retrieving a name from a module is thus achieved like so:

myclass = mymodule.__dict__['myclass']

With that in mind, writing our lazy importer is simple:

class WSGILazyLoader(object):
  def __init__(self, fullname):
    self.modulename, self.objname = fullname.rpartition('.')
    self.obj = None

  def __call__(self, environ, start_response):
    if not self.obj:
      __import__(self.modulename, globals(), locals())
      module = sys.modules[self.modulename]
      self.obj = module.__dict__[self.objname]
    return self.obj(environ, start_response)

To use it, we simply define our handler (and create an instance of it) in one module:

import framework

class HomeHandler(framework.RequestHandler):
  def get(self):
    self.response.body = "Hello, world!"

home_handler = HomeHandler()

And reference it using the lazy loader in another:

import framework
from google.appengine.ext.webapp.util import run_wsgi_app

application = framework.WSGIRouter()
application.connect('/', framework.WSGILazyLoader('home.home_handler'))

def main():
  run_wsgi_app(application)

if __name__ == "__main__":
  main()

You can enhance the lazy loader, of course. Possibilities include having it take the name of a class instead of an instance, and have it automatically instantiate the handler class for each new request, thus duplicating the instance-per-request mechanism of webapp and other frameworks.

by Nick Johnson at August 03, 2010 09:31 AM

August 02, 2010

Go on Web

Using Makefile with Golang

It won’t be long before you start to create more complex Go projects with a large number of .go files that need to be compiled. While there’s no Go tool to do this it’s possible to use the GNU make, … Continue reading


by hokapoka at August 02, 2010 03:47 PM

July 30, 2010

Adam Langley

SSL Survey

Ivan Ristić has put together a fantastic survey of the state of SSL/TLS on the web. Some highlights include:

  • 50% of the root CAs in Firefox appear to be unused.
  • 44% of sites send unneeded certificates in their chains.
  • 99% of sites work with only 23 roots.
  • A total of 3 DSA keys from all valid sites found.
  • 50% of servers still support SSLv2
  • Almost nobody uses TLS 1.1 or 1.2
  • Only 12 sites support STS :(
  • 20.5% support the reneg extension.
  • (32% support insecure renegotiation.)

July 30, 2010 07:00 AM

July 28, 2010

Nick Johnson

Damn Cool Algorithms: Levenshtein Automata

In a previous Damn Cool Algorithms post, I talked about BK-trees, a clever indexing structure that makes it possible to search for fuzzy matches on a text string based on Levenshtein distance - or any other metric that obeys the triangle inequality. Today, I'm going to describe an alternative approach, which makes it possible to do fuzzy text search in a regular index: Levenshtein automata.

Introduction

The basic insight behind Levenshtein automata is that it's possible to construct a Finite state automaton that recognizes exactly the set of strings within a given Levenshtein distance of a target word. We can then feed in any word, and the automaton will accept or reject it based on whether the Levenshtein distance to the target word is at most the distance specified when we constructed the automaton. Further, due to the nature of FSAs, it will do so in O(n) time with the length of the string being tested. Compare this to the standard Dynamic Programming Levenshtein algorithm, which takes O(mn) time, where m and n are the lengths of the two input words! It's thus immediately apparrent that Levenshtein automaton provide, at a minimum, a faster way for us to check many words against a single target word and maximum distance - not a bad improvement to start with!

Of course, if that were the only benefit of Levenshtein automata, this would be a short article. There's much more to come, but first let's see what a Levenshtein automaton looks like, and how we can build one.

Construction and evaluation

The diagram on the right shows the NFA for a Levenshtein automaton for the word 'food', with maximum edit distance 2. As you can see, it's very regular, and the construction is fairly straightforward. The start state is in the lower left. States are named using a ne style notation, where n is the number of characters consumed so far, and e is the number of errors. Horizontal transitions represent unmodified characters, vertical transitions represent insertions, and the two types of diagonal transitions represent substitutions (the ones marked with a *) and deletions, respectively.

Let's see how we can construct an NFA such as this given an input word and a maximum edit distance. I won't include the source for the NFA class here, since it's fairly standard; for gory details, see the Gist. Here's the relevant function in Python:

def levenshtein_automata(term, k):
  nfa = NFA((0, 0))
  for i, c in enumerate(term):
    for e in range(k + 1):
      # Correct character
      nfa.add_transition((i, e), c, (i + 1, e))
      if e < k:
        # Deletion
        nfa.add_transition((i, e), NFA.ANY, (i, e + 1))
        # Insertion
        nfa.add_transition((i, e), NFA.EPSILON, (i + 1, e + 1))
        # Substitution
        nfa.add_transition((i, e), NFA.ANY, (i + 1, e + 1))
  for e in range(k + 1):
    if e < k:
      nfa.add_transition((len(term), e), NFA.ANY, (len(term), e + 1))
    nfa.add_final_state((len(term), e))
  return nfa

This should be easy to follow; we're basically constructing the transitions you see in the diagram in the most straightforward manner possible, as well as denoting the correct set of final states. State labels are tuples, rather than strings, with the tuples using the same notation we described above.

Because this is an NFA, there can be multiple active states. Between them, these represent the possible interpretations of the string processed so far. For example, consider the active states after consuming the characters 'f' and 'x':

This indicates there are several possible variations that are consistent with the first two characters 'f' and 'x': A single substitution, as in 'fxod', a single insertion, as in 'fxood', two insertions, as in 'fxfood', or a substitution and a deletion, as in 'fxd'. Also included are several redundant hypotheses, such as a deletion and an insertion, also resulting in 'fxod'. As more characters are processed, some of these possibilities will be eliminated, while other new ones will be introduced. If, after consuming all the characters in the word, an accepting (bolded) state is in the set of current states, there's a way to convert the input word into the target word with two or fewer changes, and we know we can accept the word as valid.

Actually evaluating an NFA directly tends to be fairly computationally expensive, due to the presence of multiple active states, and epsilon transitions (that is, transitions that require no input symbol), so the standard practice is to first convert the NFA to a DFA using powerset construction. Using this algorithm, a DFA is constructed in which each state corresponds to a set of active states in the original NFA. We won't go into detail about powerset construction here, since it's tangential to the main topic. Here's an example of a DFA corresponding to the NFA for the input word 'food' with one allowable error:

Note that we depicted a DFA for one error, as the DFA corresponding to our NFA above is a bit too complex to fit comfortably in a blog post! The DFA above will accept exactly the words that have an edit distance to the word 'food' of 1 or less. Try it out: pick any word, and trace its path through the DFA. If you end up in a bolded state, the word is valid.

I won't include the source for powerset construction here; it's also in the gist for those who care.

Briefly returning to the issue of runtime efficiency, you may be wondering how efficient Levenshtein DFA construction is. We can construct the NFA in O(kn) time, where k is the edit distance and n is the length of the target word. Conversion to a DFA has a worst case of O(2n) time - which leads to a pretty extreme worst-case of O(2kn) runtime! Not all is doom and gloom, though, for two reasons: First of all, Levenshtein automata won't come anywhere near the 2n worst-case for DFA construction*. Second, some very clever computer scientists have come up with algorithms to construct the DFA directly in O(n) time, [SCHULZ2002FAST] and even a method to skip the DFA construction entirely and use a table-based evaluation method!

Indexing

Now that we've established that it's possible to construct Levenshtein automata, and demonstrated how they work, let's take a look at how we can use them to search an index for fuzzy matches efficiently. The first insight, and the approach many papers [SCHULZ2002FAST] [MIHOV2004FAST] take is to observe that a dictionary - that is, the set of records you want to search - can itself be represented as a DFA. In fact, they're frequently stored as a Trie or a DAWG, both of which can be viewed as special cases of DFAs. Given that both the dictionary and the criteria (the Levenshtein automata) are represented as DFAs, it's then possible to efficiently intersect the two DFAs to find exactly the words in the dictionary that match our criteria, using a very simple procedure that looks something like this:

def intersect(dfa1, dfa2):
  stack = [("", dfa1.start_state, dfa2.start_state)]
  while stack:
    s, state1, state2 = stack.pop()
    for edge in set(dfa1.edges(state1)).intersect(dfa2.edges(state2)):
      state1 = dfa1.next(state1, edge)
      state2 = dfa2.next(state2, edge)
      if state1 and state2:
        s = s + edge
        stack.append((s, state1, state2))
        if dfa1.is_final(state1) and dfa2.is_final(state2):
          yield s

That is, we traverse both DFAs in lockstep, only following edges that both DFAs have in common, and keeping track of the path we've followed so far. Any time both DFAs are in a final state, that word is in the output set, so we output it.

This is all very well if your index is stored as a DFA (or a trie or DAWG), but many indexes aren't: if they're in-memory, they're probably in a sorted list, and if they're on disk, they're probably in a BTree or similar structure. Is there a way we can modify this scheme to work with these sort of indexes, and still provide a speedup on brute-force approaches? It turns out that there is.

The critical insight here is that with our criteria expressed as a DFA, we can, given an input string that doesn't match, find the next string (lexicographically speaking) that does. Intuitively, this is fairly easy to do: we evaluate the input string against the DFA until we cannot proceed further - for example, because there's no valid transition for the next character. Then, we repeatedly follow the edge that has the lexicographically smallest label until we reach a final state. Two special cases apply here: First, on the first transition, we need to follow the lexicographically smallest label greater than character that had no valid transition in the preliminary step. Second, if we reach a state with no valid outwards edge, we should backtrack to the previous state, and try again. This is pretty much the 'wall following' maze solving algorithm, as applied to a DFA.

For an example of this, take a look at the DFA for food(1), above, and consider the input word 'foogle'. We consume the first four characters fine, leaving us in state 3141. The only out edge from here is 'd', while the next character is 'l', so we backtrack one step to the previous state, 21303141. From here, our next character is 'g', and there's an out-edge for 'f', so we take that edge, leaving us in an accepting state (the same state as previously, in fact, but with a different path to it) with the output string 'fooh' - the lexicographically next string in the DFA after 'foogle'.

Here's the Python code for it, as a method on the DFA class. As previously, I haven't included boilerplate for the DFA, which is all here.

  def next_valid_string(self, input):
    state = self.start_state
    stack = []
    
    # Evaluate the DFA as far as possible
    for i, x in enumerate(input):
      stack.append((input[:i], state, x))
      state = self.next_state(state, x)
      if not state: break
    else:
      stack.append((input[:i+1], state, None))

    if self.is_final(state):
      # Input word is already valid
      return input
    
    # Perform a 'wall following' search for the lexicographically smallest
    # accepting state.
    while stack:
      path, state, x = stack.pop()
      x = self.find_next_edge(state, x)
      if x:
        path += x
        state = self.next_state(state, x)
        if self.is_final(state):
          return path
        stack.append((path, state, None))
    return None

In the first part of the function, we evaluate the DFA in the normal fashion, keeping a stack of visited states, along with the path so far and the edge we intend to attempt to follow out of them. Then, assuming we didn't find an exact match, we perform the backtracking search we described above, attempting to find the smallest set of transitions we can follow to come to an accepting state. For some caveats about the generality of this function, read on...

Also needed is the utility function find_next_edge, which finds the lexicographically smallest outwards edge from a state that's greater than some given input:

  def find_next_edge(self, s, x):
    if x is None:
      x = u'\0'
    else:
      x = unichr(ord(x) + 1)
    state_transitions = self.transitions.get(s, {})
    if x in state_transitions or s in self.defaults:
      return x
    labels = sorted(state_transitions.keys())
    pos = bisect.bisect_left(labels, x)
    if pos < len(labels):
      return labels[pos]
    return None

With some preprocessing, this could be made substantially more efficient - for example, by pregenerating a mapping from each character to the first outgoing edge greater than it, rather than using binary search to find it in many cases. I once again leave such optimizations as an exercise for the reader.

Now that we have this procedure, we can finally describe how to search the index with it. The algorithm is surprisingly simple:

  1. Obtain the first element from your index - or alternately, a string you know to be less than any valid string in your index - and call it the 'current' string.
  2. Feed the current string into the 'DFA successor' algorithm we outlined above, obtaining the 'next' string.
  3. If the next string is equal to the current string, you have found a match - output it, fetch the next element from the index as the current string, and repeat from step 2.
  4. If the next string is not equal to the current string, search your index for the first string greater than or equal to the next string. Make this the new current string, and repeat from step 2.

And once again, here's the implementation of this procedure in Python:

def find_all_matches(word, k, lookup_func):
  """Uses lookup_func to find all words within levenshtein distance k of word.
  
  Args:
    word: The word to look up
    k: Maximum edit distance
    lookup_func: A single argument function that returns the first word in the
      database that is greater than or equal to the input argument.
  Yields:
    Every matching word within levenshtein distance k from the database.
  """
  lev = levenshtein_automata(word, k).to_dfa()
  match = lev.next_valid_string(u'\0')
  while match:
    next = lookup_func(match)
    if not next:
      return
    if match == next:
      yield match
      next = next + u'\0'
    match = lev.next_valid_string(next)

One way of looking at this algorithm is to think of both the Levenshtein DFA and the index as sorted lists, and the procedure above to be similar to App Engine's "zigzag merge join" strategy. We repeatedly look up a string on one side, and use that to jump to the appropriate place on the other side. If there's no matching entry, we use the result of the lookup to jump ahead on the first side, and so forth. The result is that we skip over large numbers of non-matching index entries, as well as large numbers of non-matching Levenshtein strings, saving us the effort of exhaustively enumerating either of them. Hopefully it's apparrent from the description that this procedure has the potential to avoid the need to evaluate either all of the index entries, or all of the candidate Levenshtein strings.

As a side note, it's not true that for all DFAs it's possible to find a lexicographically minimal successor to any string. For example, consider the successor to the string 'a' in the DFA that recognizes the pattern 'a+b'. The answer is that there isn't one: it would have to consist of an infinite number of 'a' characters followed by a single 'b' character! It's possible to make a fairly simple modification to the procedure outlined above such that it returns a string that's guaranteed to be a prefix of the next string recognized by the DFA, which is sufficient for our purposes. Since Levenshtein DFAs are always finite, though, and thus always have a finite length successor (except for the last string, naturally), we leave such an extension as an exercise for the reader. There are potentially interesting applications one could put this approach to, such as indexed regular expression search, which would require this modification.

Testing

First, let's see this in action. We'll define a simple Matcher class, which provides an implementation of the lookup_func required by our find_all_matches function:

class Matcher(object):
  def __init__(self, l):
    self.l = l
    self.probes = 0

  def __call__(self, w):
    self.probes += 1
    pos = bisect.bisect_left(self.l, w)
    if pos < len(self.l):
      return self.l[pos]
    else:
      return None

Note that the only reason we implemented a callable class here is because we want to extract some metrics - specifically, the number of probes made - from the procedure; normally a regular or nested function would be perfectly sufficient. Now, we need a sample dataset. Let's load the web2 dictionary for that:

>>> words = [x.strip().lower().decode('utf-8') for x in open('/usr/share/dict/web2')]
>>> words.sort()
>>> len(words)
234936

We could also use a couple of subsets for testing how things change with scale:

>>> words10 = [x for x in words if random.random() <= 0.1]
>>> words100 = [x for x in words if random.random() <= 0.01]

And here it is in action:

>>> m = Matcher(words)
>>> list(automata.find_all_matches('nice', 1, m))
[u'anice', u'bice', u'dice', u'fice', u'ice', u'mice', u'nace', u'nice', u'niche', u'nick', u'nide', u'niece', u'nife', u'nile', u'nine', u'niue', u'pice', u'rice', u'sice', u'tice', u'unice', u'vice', u'wice']
>>> len(_)
23
>>> m.probes
142

Working perfectly! Finding the 23 fuzzy matches for 'nice' in the dictionary of nearly 235,000 words required 142 probes. Note that if we assume an alphabet of 26 characters, there are 4+26*4+26*5=238 strings within levenshtein distance 1 of 'nice', so we've made a reasonable saving over exhaustively testing all of them. With larger alphabets, longer strings, or larger edit distances, this saving should be more pronounced. It may be instructive to see how the number of probes varies as a function of word length and dictionary size, by testing it with a variety of inputs:

String lengthMax stringsSmall dictMed dictFull dict
17947 (59%)54 (68%)81 (100%)
213281 (61%)103 (78%)129 (97%)
318594 (50%)120 (64%)147 (79%)
423894 (39%)123 (51%)155 (65%)
529194 (32%)124 (43%)161 (55%)

In this table, 'max strings' is the total number of strings within edit distance one of the input string, and the values for small, med, and full dict represent the number of probes required to search the three dictionaries (consisting of 1%, 10% and 100% of the web2 dictionary). All the following rows, at least until 10 characters, required a similar number of probes as row 5. The sample input string used consisted of prefixes of the word 'abracadabra'.

Several observations are immediately apparrent:

  1. For very short strings and large dictionaries, the number of probes is not much lower - if at all - than the maximum number of valid strings, so there's little saving.
  2. As the string gets longer, the number of probes required increases significantly slower than the number of potential results, so that at 10 characters, we need only probe 161 of 821 (about 20%) possible results. At commonly encountered word lengths (97% of words in the web2 dictionary are at least 5 characters long), the savings over naively checking every string variation are already significant.
  3. Although the size of the sample dictionaries differs by an order of magnitude, the number of probes required increases only a little each time. This provides encouraging evidence that this will scale well for very large indexes.

It's also instructive to see how this varies for different edit distance thresholds. Here's the same table, for a max edit distance of 2:

String lengthMax stringsSmall dictMed dictFull dict
12054413 (20%)843 (41%)1531 (75%)
210428486 (5%)1226 (12%)2600 (25%)
324420644 (3%)1643 (7%)3229 (13%)
444030646 (1.5%)1676 (4%)3366 (8%)
569258648 (0.9%)1676 (2%)3377 (5%)

This is also promising: with an edit distance of 2, although we're having to do a lot more probes, it's a much smaller percentage of the number of candidate strings. With a word length of 5 and an edit distance of 2, having to do 3377 probes is definitely far preferable to doing 69,258 (one for every matching string) or 234,936 (one for every word in the dictionary)!

As a quick comparison, a straightforward BK-tree implementation with the same dictionary requires examining 5858 nodes for a string length of 5 and an edit distance of 1 (with the same sample string used above), while the same query with an edit distance of 2 required 58,928 nodes to be examined! Admittedly, many of these nodes are likely to be on the same disk page, if structured properly, but there's still a startling difference in the number of lookups.

One final note: The second paper we referenced in this article, [MIHOV2004FAST] describes a very nifty construction: a universal Levenshtein automata. This is a DFA that determines, in linear time, if any pair of words are within a given fixed Levenshtein distance of each other. Adapting the above scheme to this system is, also, left as an exercise for the reader.

The complete source code for this article can be found here.

* Robust proofs of this hypothesis are welcome.

[SCHULZ2002FAST] Fast string correction with Levenshtein automata

[MIHOV2004FAST] Fast approximate search in large dictionaries

<style type="text/css">sup { vertical-align: super; }</style>

by Nick Johnson at July 28, 2010 03:56 PM

July 27, 2010

Go on Web

mongoDB & Golang == gomongo

Golang’s type based design makes it a language that’s good for BigTable / Document Based Databases.  One of the best Document based Database’s is mongoDB used by some very high profile services.   In this article we look at a great mongoDB … Continue reading


by hokapoka at July 27, 2010 02:38 PM

July 26, 2010

Go on Web

Embedding or Nesting Go Templates

Go templates are a very powerful means to generate HTML with Go lang.  It’s possible to Nest or Embed the templates in each other so you can easily modularize content and thus reduce repetition of code and work, by reusing … Continue reading


by hokapoka at July 26, 2010 07:18 PM

July 25, 2010

Airs – Ian Lance Taylor

Clever Machines

I’ve always found it easy to deal with machines, as I expect is true of most computer programmers. The interface to a machine is not always logical, but it is normally consistent in the sense that it always behaves the same way given the same inputs, and it is normally unambiguous in the sense that it either works or it doesn’t, and it is clear which state is which. At lest for me, dealing with machines is simpler than dealing with people when things go wrong—machines may be frustrating but at least they’re frustrating for relatively simple and ultimately comprehensible reasons.

Unfortunately, I’ve started to notice that as programs get smarter and as interface designers get more clever, machines are becoming more like people. Interfaces for web sites and phones are increasingly adjusting based on your past interactions. In many ways this is good, as over time the interaction gets smoother and easier. However, it means that there is a lack of consistency: an input today does not produce the same effect as the same input did yesterday. It also means that there is an increase in ambiguity: it’s difficult to tell the difference between working correctly and being slightly broken.

In effect, the computing world is becoming increasingly tuned for people who prefer dealing with people rather than people who prefer dealing with machines. On average this is of course a good thing, as most of the population seems to find it frustrating to deal with machines. But it’s somewhat ironic considering that the programmers doing most of the work tend to be people who prefer dealing with machines.

I don’t want to give up the advantages I get when things go well, so I guess I’m stuck in an increasingly inconsistent and ambiguous world.

by Ian Lance Taylor at July 25, 2010 08:05 PM

July 23, 2010

Go on Web

That’s it, I’ve had enough of Squarespace

A few days ago I had a short moan about SquareSpace and how I really wasn’t getting on with the UI / editor. I have to say the degree configuration is ready quite good, but it’s implementation of the blogging aspects really let it … Continue reading


by hokapoka at July 23, 2010 10:46 PM

July 22, 2010

Go on Web

Go’s interfaces

One of the key aspects of Go are interfaces, this article covers what an interface is, how to implement them and why they are used. To start with, Go has struct's, you could look at these as classes in an OO language, but they aren't. They are Structures that hold data, be it another structure or some other native type such as a string or []byte. Any logic in a go program is contain within a function (func), thus to enable you to pass different types to these functions they need's to be a means to associate the different types, this is where interfaces come in. Continue reading


by hokapoka at July 22, 2010 06:37 PM

Go Templates, {.repeated} & Arrays/Slices

This article looks at how to use Go Templates with Arrays or Slices & {.repeated section field } notation.  It assumes that you have an understanding of using {.section field} in Go templates and {field} in Go Templates. The {.repeated section … Continue reading


by hokapoka at July 22, 2010 09:29 AM

July 21, 2010

Nick Johnson

Getting unicode right in Python

Yup, I'm back from holidays! Apologies to everyone for the delayed return - it's taking me a long while to catch up on everything that built up while I was away.

Proper text processing - specifically, correct handling of unicode - is one of those things that consistently confounds even experienced developers. This isn't because it's difficult, but rather, I believe, because most developers carry around a few key misconceptions about what text (in the context of software) is and how it's represented. A search on StackOverflow for UnicodeDecodeError demonstrates just how prevalent these misconceptions are. These misconceptions date back to the days before unicode - longer than many developers have been in the industry, including myself - but they're still nothing if not widespread. This is in part because a number of well known and popular languages continue to, at worst, perpetuate the misunderstandings, and at best are insufficiently good at helping developers get it right.

We can divide languages into four categories along the axis of unicode support:

  1. Languages that were written before unicode was defined, or widespread. C and C++ fall into this category. Languages in this category tend to have unicode support that's spotty, not built into the language, or difficult to use correctly, making the path of least resistance the wrong one, more often than not.
  2. Languages that should know better. These languages were written after unicode was already widespread, but managed to get things horribly wrong. They have all the weaknesses of category 1, without the excuse of age. Prime amongst these, in my experience, is PHP, though there are doubtless many more languages that do just as badly.
  3. Languages that get it basically right, but have a few critical flaws. Languages in this category are 'modern' and understand unicode, but fail to make the path of least resistance for developers the correct one, which results in some serious shortfalls in unicode support for things implemented in these languages. Python 2.X, to my dismay, falls into this category - more about which, later.
  4. Languages that Get It Right. They support unicode, and they make it easy to do the right thing, and hard to do the wrong thing. Java and the .NET platform both fall into this category.

So what's the deal with unicode, and how are we getting it wrong? Joel's post, the absolute minimum every software developer absolutely, positively must know about unicode is an excellent place to start for this, but for the sake of brevity and those who are naturally impatient, I'll summarize.

Characters and bytes

The essential fact that you must understand in order to handle text properly is the abstract concept of a character. A character is a representation of a single symbol in a piece of text - a platonic ideal, of sorts. Crucially, a character is not a byte. Let me repeat for emphasis: A character is not, not, not a byte. Furthermore, there's no single way of representing a given character as one or more bytes - as I said, a character is the platonic ideal of the smallest unit of text.

Unicode, then, is a way of defining a set of characters that everyone can agree on. It consists of a huge database of characters, and each one is associated with a unique number, called a code point. Thus, the english letter capital A has the codepoint U+0041. The euro symbol has codepoint U+20A0, and so forth. A text string is simply a series of these codepoints, representing the character for each element in the string.

Of course, sooner or later, you need a way to store and transmit your platonic unicode strings. It helps if you choose a method that other people can understand, so that you can send text to them, and they to you, in a mutually comprehensible fashion. This is where character encodings come in.

A character encoding defines a mapping between our platonic characters and some way of representing them as bytes. The mapping doesn't have to be complete - it may have no way to represent certain characters - and it doesn't need to use the same amount of space for each character - some characters may be encoded as a single byte, while others may require several.

Because there are many ways of representing the same character as bytes, this means that if you have a series of bytes, but do not know their encoding - even if you know the data is textual - the data is meaningless. You can guess, but you'd be doing just that, guessing. In short, bytes are not text. If you forget everything else in this article, remember that. In order to read and write text, you must first know the encoding you are using, whether from of convention, out of band information, or any other mechanism.

How Python handles unicode

This is where Python's unicode support comes in. In Python's type heirarchy, there are three distinct string types: 'unicode', which represents unicode strings (text strings), 'str', which represents byte strings (binary data), and 'basestring', which acts as a parent class for both of the other string types. This is where, in my opinion, Python makes its misstep in handling unicode which pushes it into category 3, instead of category 4, by my definition above.

As I've just spent several paragraphs belabouring, bytes and characters are fundamentally different entities, only interconvertible with the help of a character encoding. Python, unfortunately, does its best to help you forget this, with two separate missteps:

The first is of debatable significance: treating sequences of bytes as strings. It's arguable whether or not this is a good thing; Java and .NET support the proposition that it's not, while other languages make good arguments in the other direction. In any case, it's certainly true that certain operations you might want to preform on text strings - regular expression matching, string replacement, and so forth - don't entirely make sense on sequences of bytes. Python, though, treats bytes as just a different type of string, and allows the same set of operations on both.

The second misstep is the more significant one: Python's attempt at transparently converting between byte strings and character strings. In a variety of circumstances, Python will attempt to convert a byte string to a unicode string or vice-versa, when the situation warrants - for example, when attempting to concatenate a byte string and a unicode string together. Since, as we've previously detailed, conversion between the two types is meaningless without an encoding, Python relies on a 'default encoding', specified by sys.setdefaultencoding(). On most platforms, this defaults to ASCII, and it's almost certainly wrong for any given conversion. Other places this encoding is used include calls to str() or unicode() without specifying an encoding yourself, and functions that expect one type of string but are passed the other.

One solution to some of your unicode woes, then, would be to call sys.setdefaultencoding(), setting the default encoding to whatever you ought to be using. This only masks the root problem, though, which is your failure to handle text correctly in the first place. It may also be impractical, since many apps, particularly webapps, may have to deal with multiple different text encodings in different places.

The correct solution is to fix your code so that you handle text correctly. Here's the cliff's notes on what you should be doing:

  • All text strings, everywhere should be of type unicode, not str. If you're handling text, and your variable is a str, it's a bug!
  • To decode a byte string as text, use var.decode(encoding) (eg, var.decode('utf-8'), with the correct encoding. To encode a text string as bytes, use var.encode(encoding).
  • Never ever use str() on a unicode string, or unicode() on a byte string without a second argument specifying the encoding.
  • Whenever you read data from outside your app, expect it to be bytes - eg, of type str - and call .decode() on it to interpret it as text. Likewise, always call .encode() on text you want to send to the outside world.
  • If a string literal in your code is intended to represent text, it should always be prefixed with 'u'. In fact, you probably never want to define a raw string literal in your code at all. For what it's worth, though, I'm terrible at this one, as I'm sure pretty much everyone else is, too.

Python 3, by the way, fixes things, and gets unicode and string handling right, putting it solidly into category 4. See this section of the What's new page for details.

Hopefully this made sense, and if you had any doubts about what exactly unicode is and how to handle it, they're now cleared up. Next time you get a UnicodeEncodeError or UnicodeDecodeError in your app, then, you'll know exactly what's gone wrong, and how to fix it!

by Nick Johnson at July 21, 2010 01:25 PM

Go on Web

Adding logic to Go Templates

This article assumes you know how to create a struct and use Go’s *Template.Execute() with a html template and a struct to create merged content. In the previous article I demonstrated a very simple example, in this article I’ll show … Continue reading


by hokapoka at July 21, 2010 01:15 PM

July 20, 2010

Go on Web

Use Go’s HTML Templates

This article assumes you know how to create a simple http server to respond to different requests.  If you don’t know it only take 5 mins, check out how to setup a basic Http Server on go before continuing. Regardless … Continue reading


by hokapoka at July 20, 2010 08:58 PM

GO & web.go

Thus far I’ve only talked about using Go’s built-in http packages directly.  And while it’s very extensive you have to implement many of the standard features yourself, Such as Session State and cookies.  While these features are relatively simple you don’t want to “re-design the … Continue reading


by hokapoka at July 20, 2010 11:28 AM

Serve on multiple hosts / domains with Go

Previously I talked about using Go’s built-in http Package to serve web content.  As some of you might have noticed, you’re binding to the IP address of the server thus regardless of the host or domain you’re serving the same … Continue reading


by hokapoka at July 20, 2010 10:27 AM

July 19, 2010

Go on Web

Make A Basic GO http Server

If you’ve read any of the Go Documentation there are a couple of examples of howto use “http” package.  In particular if you’ve read the Effective Go Doc there’s an example that’s very similar to this implymentation. The http Package Go has … Continue reading


by hokapoka at July 19, 2010 10:35 AM

July 18, 2010

Airs – Ian Lance Taylor

Living with the Past

I’m in Stockholm for a few weeks, which is why I haven’t been updating this blog. I’ve been in Sweden many times before, but one thing I’ve noticed particularly this time is the way that old existing buildings have been adapted for modern times. It’s quite common to see stone steps which look positively ancient with two pieces of wood, looking nearly as ancient, laid on top of them for use with strollers and/or wheelchairs.

When life changes for an existing city, you can either adapt the city or you can replace it piece by piece. The U.S. pretty reliably picks replacement. It’s interesting to see a place which tries harder to adapt, a spirit no doubt encouraged by the historical nature of the buildings.

Stockholm is also notable for how easy it is to get around on bike. The bike lanes here are serious alternatives to pedestrian or car traffic, with their own signs and traffic lights. They aren’t universal, but they seem to cover the city and the immediate suburbs pretty well. This too is of course fitted into the existing streets and bridges, somehow. Particularly impressive is a few construction sites I’ve come across where a temporary bike lane was built because the existing one was being built over.

Creating high quality bike lanes may seem like an inefficient use of public funds, but of course it’s really no less efficient than building roads. The U.S. does still mostly agree that roads are a common good, and it seems like, in cities, real bike lanes could be as well.

by Ian Lance Taylor at July 18, 2010 07:23 PM

July 16, 2010

Go's official blog

Share Memory By Communicating

Traditional threading models (commonly used when writing Java, C++, and Python programs, for example) require the programmer to communicate between threads using shared memory. Typically, shared data structures are protected by locks, and threads will contend over those locks to access the data. In some cases, this is made easier by the use of thread-safe data structures such as Python's Queue.

Go's concurrency primitives - goroutines and channels - provide an elegant and distinct means of structuring concurrent software. (These concepts have an interesting history that begins with C. A. R. Hoare's Communicating Sequential Processes.) Instead of explicitly using locks to mediate access to shared data, Go encourages the use of channels to pass references to data between goroutines. This approach ensures that only one goroutine has access to the data at a given time. The concept is summarized in the document Effective Go (a must-read for any Go programmer):

Do not communicate by sharing memory; instead, share memory by communicating.

Consider a program that polls a list of URLs. In a traditional threading environment, one might structure its data like so:

type Resource struct {
url string
polling bool
lastPolled int64
}

type Resources struct {
data []*Resource
lock *sync.Mutex
}

And then a Poller function (many of which would run in separate threads) might look something like this:

func Poller(res *Resources) {
for {
// get the least recently-polled Resource
// and mark it as being polled
res.lock.Lock()
var r *Resource
for _, v := range res.data {
if v.polling {
continue
}
if r == nil || v.lastPolled < r.lastPolled {
r = v
}
}
if r != nil {
r.polling = true
}
res.lock.Unlock()
if r == nil {
continue
}

// poll the URL

// update the Resource's polling and lastPolled
res.lock.Lock()
r.polling = false
r.lastPolled = time.Nanoseconds()
res.lock.Unlock()
}
}

This function is about a page long, and requires more detail to make it complete. It doesn't even include the URL polling logic (which, itself, would only be a few lines), nor will it gracefully handle exhausting the pool of Resources.

Let's take a look at the same functionality implemented using Go idiom. In this example, Poller is a function that receives Resources to be polled from an input channel, and sends them to an output channel when they're done.

type Resource string

func Poller(in, out chan *Resource) {
for r := range in {
// poll the URL

// send the processed Resource to out
out <- r
}
}

The delicate logic from the previous example is conspicuously absent, and our Resource data structure no longer contains bookkeeping data. In fact, all that's left are the important parts. This should give you an inkling as to the power of these simple language features.

There are many omissions from the above code snippets. For a walkthrough of a complete, idiomatic Go program that uses these ides, see the Codewalk "Share Memory By Communicating".

by Andrew Gerrand (noreply@blogger.com) at July 16, 2010 12:41 AM

July 14, 2010

Embrace change

Good bye mue.tideland.biz, but hello ...

The time of mue @ Tideland is over. Due to a re-orientation I decided to delete this blog. But this doesn't mean I'm stopping writing in the web. I just reactivated my old site Frank @ M.Web as my private blog in German language. It will only contain few IT related entries, mostly when the targeted audience is German. Additionally I will post articles about software development - and here especially languages, architecture, requirements engineering, and software cost estimation - on my Tideland site.

Thx for reading here, we'll see on my other sites.

mue

Permalink | Leave a comment  »

July 14, 2010 10:47 AM

July 12, 2010

Go's official blog

Go's Declaration Syntax

Newcomers to Go wonder why the declaration syntax is different from the tradition established in the C family. In this post we'll compare the two approaches and explain why Go's declarations look as they do.

C syntax

First, let's talk about C syntax. C took an unusual and clever approach to declaration syntax. Instead of describing the types with special syntax, one writes an expression involving the item being declared, and states what type that expression will have. Thus

    int x;

declares x to be an int: the expression 'x' will have type int. In general, to figure out how to write the type of a new variable, write an expression involving that variable that evaluates to a basic type, then put the basic type on the left and the expression on the right.

Thus, the declarations

    int *p;
int a[3];

state that p is a pointer to int because '*p' has type int, and that a is an array of ints because a[3] (ignoring the particular index value, which is punned to be the size of the array) has type int.

What about functions? Originally, C's function declarations wrote the types of the arguments outside the parens, like this:

    int main(argc, argv)
int argc;
char *argv[];
{ /* ... */ }

Again, we see that main is a function because the expression main(argc, argv) returns an int. In modern notation we'd write

    int main(int argc, char *argv[]) { /* ... */ }

but the basic structure is the same.

This is a clever syntactic idea that works well for simple types but can get confusing fast. The famous example is declaring a function pointer. Follow the rules and you get this:

    int (*fp)(int a, int b);

Here, fp is a pointer to a function because if you write the expression (*fp)(a, b) you'll call a function that returns int. What if one of fp's arguments is itself a function?

    int (*fp)(int (*ff)(int x, int y), int b)

That's starting to get hard to read.

Of course, we can leave out the name of the parameters when we declare a function, so main can be declared

    int main(int, char *[])

Recall that argv is declared like this,

    char *argv[]

so you drop the name from the *middle* of its declaration to construct its type. It's not obvious, though, that you declare something of type char *[] by putting its name in the middle.

And look what happens to fp's declaration if you don't name the parameters:

    int (*fp)(int (*)(int, int), int)

Not only is it not obvious where to put the name inside

    int (*)(int, int)

it's not exactly clear that it's a function pointer declaration at all. And what if the return type is a function pointer?

    int (*(*fp)(int (*)(int, int), int))(int, int)

It's hard even to see that this declaration is about fp.

You can construct more elaborate examples but these should illustrate some of the difficulties that C's declaration syntax can introduce.

There's one more point that needs to be made, though. Because type and declaration syntax are the same, it can be difficult to parse expressions with types in the middle. This is why, for instance, C casts always parenthesize the type, as in

    (int)M_PI

Go syntax

Languages outside the C family usually use a distinct type syntax in declarations. Although it's a separate point, the name usually comes first, often followed by a colon. Thus our examples above become something like (in a fictional but illustrative language)

    x: int
p: pointer to int
a: array[3] of int

These declarations are clear, if verbose - you just read them left to right. Go takes its cue from here, but in the interests of brevity it drops the colon and removes some of the keywords:

    x int
p *int
a [3]int

There is no direct correspondence between the look of [3]int and how to use a in an expression. (We'll come back to pointers in the next section.) You gain clarity at the cost of a separate syntax.

Now consider functions. Let's transcribe the declaration for main, even though the main function in Go takes no arguments:

    func main(argc int, argv *[]byte) int

Superficially that's not much different from C, but it reads well from left to right:

function main takes an int and a pointer to a slice of bytes and returns an int.

Drop the parameter names and it's just as clear - they're always first so there's no confusion.

    func main(int, *[]byte) int

One value of this left-to-right style is how well it works as the types become more complex. Here's a declaration of a function variable (analogous to a function pointer in C):

    f func(func(int,int) int, int) int

Or if f returns a function:

    f func(func(int,int) int, int) func(int, int) int

It still reads clearly, from left to right, and it's always obvious which name is being declared - the name comes first.

The distinction between type and expression syntax makes it easy to write and invoke closures in Go:

    sum := func(a, b int) int { return a+b } (3, 4)

Pointers

Pointers are the exception that proves the rule. Notice that in arrays and slices, for instance, Go's type syntax puts the brackets on the left of the type but the expression syntax puts them on the right of the expression:

    var a []int
x = a[1]

For familiarity, Go's pointers use the * notation from C, but we could not bring ourselves to make a similar reversal for pointer types. Thus pointers work like this

    var p *int
x = *p

We couldn't say

    var p *int
x = p*

because that postfix * would conflate with multiplication. We could have used the Pascal ^, for example:

    var p ^int
x = p^

and perhaps we should have (and chosen another operator for xor), because the prefix asterisk on both types and expressions complicates things in a number of ways. For instance, although one can write

    []int("hi")

as a conversion, one must parenthesize the type if it starts with a *:

    (*int)(nil)

Had we been willing to give up * as pointer syntax, those parentheses would be unnecessary.

So Go's pointer syntax is tied to the familiar C form, but those ties mean that we cannot break completely from using parentheses to disambiguate types and expressions in the grammar.

Overall, though, we believe Go's type syntax is easier to understand than C's, especially when things get complicated.

Notes

Go's declarations read left to right. It's been pointed out that C's read in a spiral! See The "Clockwise/Spiral Rule" by David Anderson.

- Rob Pike, July 2010

by Andrew Gerrand (noreply@blogger.com) at July 12, 2010 02:26 AM

July 09, 2010

June 30, 2010

Nick Johnson

A short hiatus

This Saturday, I'm heading off to a Greek island to spend a little under two weeks with my gorgeous wife and a collection of good friends in what I hope is a well-deserved holiday. Relaxation will be the order of the day, and as a result, I'm not going to be blogging much, if at all, during that period.

Not to fear, though: I have several good posts planned, including new Damn Cool Algorithms posts, and you can look forward to seeing them after I get back.

See you all in a couple of weeks!

by Nick Johnson at June 30, 2010 12:23 PM

June 25, 2010

Nick Johnson

Using Python magic to improve the deferred API

Recently, my attention was drawn, via a blog post to a Python task queue implementation called Celery. The object of my interest was not so much Celery itself - though it does look both interesting and well written - but the syntax it uses for tasks.

While App Engine's deferred library takes the 'higher level function' approach - that is, you pass your function and its arguments to the 'defer' function - I've never been entirely happy with that approach. Celery, in contrast, uses Python's support for decorators (one of my favorite language features) to create what, in my view, is a much neater and more flexible interface. While defining and calling a deferred function looks like this:

def my_task_func(some_arg):
  # do something

defer(my_task_func, 123)

Doing the same in Celery looks like this:

@task
def my_task_func(some_arg):
  # do something

my_task_func.delay(123)

Using a decorator, Celery is able to modify the function it's decorating such that you can now call it on the task queue using a much more intuitive syntax, with the function's original calling convention preserved. Let's take a look at how this works, first, and then explore how we might make use of it in the deferred library.

Functions as objects in Python

In Python, everything is an object. That includes everything from what other languages call 'primitive' values like the number 3, all the way up to and including things such as functions, classes, and modules. This has many implications for the design of the language and what you can do with it. You probably think of functions as things which you can only do one thing to - call them - but because every function in Python is also an object, that's not the case: a Python function is simply a 'callable' object. We can explore the members of a function in Python using the dir() builtin:

>>> def my_func(arg1, arg2=None):
...   return arg2 or arg1
... 
>>> type(my_func)
<type 'function'>
>>> dir(my_func)
['__call__', '__class__', '__delattr__', '__dict__', '__doc__', '__get__', '__getattribute__', '__hash__', '__init__', '__module__', '__name__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__str__', 'func_closure', 'func_code', 'func_defaults', 'func_dict', 'func_doc', 'func_globals', 'func_name']

As you can see, my_func is an instance of type 'function', and it has a number of members, including __call__, the special method that makes an object callable. It also includes a bunch of standard members, such as __hash__, __class__, and so forth, and a few function-specific ones, which we'll come back to later.

Celery takes advantage of this by modifying the function, adding a new member called 'delay'. You can modify functions in this way just by setting the member as you would anything else:

>>> my_func.foo = 'bar'
>>> my_func.foo
'bar'

Based on this, we could define a decorator for the deferred library trivially:

def task(func):
  task.defer = lambda *args, **kwargs: defer(func, *args, **kwargs)
  return task

And indeed, this will work exactly as expected, allowing us to call my_func.defer(args) instead of defer(my_func, args). There's a couple of missed opportunities, however. First up, you can call this new function with any set of arguments, and it will happily try and defer a call to the function with those arguments. If you passed an invalid set of arguments - for example, passing 4 arguments to a function that expects only 3 - you won't know about it until your task runs, and fails, making tracking down the culprit a lot harder.

There are a couple of solutions for this. One is to use inspect.getargspec to retrieve the argument specification from the original function, and check it against the passed in arguments. This will impose some additional overhead, however, and require us to write our own argument checking code, which will be difficult to test for correctness.

The second approach is slightly kludgier, but provides a better result: Dynamically create a new function and evaluate it using eval(). The function we create can have the same argument specification as the original function, and call our more permissive function as its sole activity. Fortunately, someone's already done this for us, in the form of the decorator library. With the decorator library, our new signature-preserving defer decorator looks like this:

def task(func):
  func.defer = decorator(defer, func)

Not only is it signature preserving, but it's even simpler! Note we didn't have to define a function at all: the decorator function expects its first argument to be a function with the signature func(f, *args, **kwargs), which is exactly the signature of the existing deferred.defer function. Given the decorator function and the function being decorated, decorator returns the decorated function, which we add as the 'defer' method of the original function, instead of replacing it.

The second feature I'd like to take advantage of is the ability to provide arguments to the task queue service, such as ETA and task name, both at definition time and at runtime. The existing defer function handles this using reserved, underscore-prefixed arguments for these parameters, but this seems less than ideal.

If we follow the pattern used by Celery, the former is easy to achieve: we can have our 'task' decorator take arguments, which will serve as defaults for anything that calls 'defer' on the decorated function. This only handles specification at definition-time, though, and we'd like to be able to specify them when we call our deferred function, too. We've got several, possibly conflicting, goals here:

  1. Make basic invocation as simple as possible. myfunc.defer(some_args) should continue to work as expected.
  2. Make it possible to specify arguments for the task queue at runtime, without modifying the signature of the function we're deferring a call to.
  3. Make it possible to bulk-add deferred tasks, rather than having them implicitly added individually, without complicating the interface for goals 1 and 2.

Satisfying all of these may be a tall order. I'm actually not entirely certain what the best way to handle this will be, yet. Currently, I think the best option may be this:

  1. Define a 'DeferredTaskTemplate' abstract class that encapsulates the arguments to be used in creating a deferred task. Instances of the class are immutable. The class has a method called something like 'with' or 'options', which when called with task queue options returns a new instance of itself with those options modified.
  2. Have the task decorator create a new subclass of DeferredTaskTemplate with the __call_ method defined using the decorator trick we discussed above. When called, it returns the taskqueue.Task object for the new deferred task, after optionally enqueueing it.
  3. Add another keyword argument to the with/options method, 'add', which causes the class to not enqueue the task if it's set to True.

With a framework such as that, the calling conventions look like this:

# Regular call
my_func(a, b)

# Standard deferred call
my_func.defer(a, b)

# Deferred call with task name and countdown
my_func.defer.with(name='foo', countdown=60)(a)

# Deferred calls using batch interface:
tasks = [my_func.defer.with(add=False)(x) for x in l]
taskqueue.Queue('test').add(tasks)

# Shortcut for commonly used options
gone_in_60s = my_func.defer.with(countdown=60)
gone_in_60s(a, b)

This seems like a reasonable interface to me. What do you think? Do you have any ideas on how to improve the interface further?

by Nick Johnson at June 25, 2010 04:04 PM

Adam Langley

Overclocking SSL

(This is a write up of the talk that I gave at Velocity 2010 last Thursday. This is a joint work of myself, Nagendra Modadugu and Wan-Teh Chang.)

The ‘S’ in HTTPS stands for ‘secure’ and the security is provided by SSL/TLS. SSL/TLS is a standard network protocol which is implemented in every browser and web server to provide confidentiality and integrity for HTTPS traffic.

If there's one point that we want to communicate to the world, it's that SSL/TLS is not computationally expensive any more. Ten years ago it might have been true, but it's just not the case any more. You too can afford to enable HTTPS for your users.

In January this year (2010), Gmail switched to using HTTPS for everything by default. Previously it had been introduced as an option, but now all of our users use HTTPS to secure their email between their browsers and Google, all the time. In order to do this we had to deploy no additional machines and no special hardware. On our production frontend machines, SSL/TLS accounts for less than 1% of the CPU load, less than 10KB of memory per connection and less than 2% of network overhead. Many people believe that SSL takes a lot of CPU time and we hope the above numbers (public for the first time) will help to dispel that.

If you stop reading now you only need to remember one thing: SSL/TLS is not computationally expensive any more.

The first part of this text contains hints for SSL/TLS performance and then the second half deals with things that Google is doing to address the latency that SSL/TLS adds.

Basic configuration

Modern hardware can perform 1500 handshakes/second/core. That's assuming that the handshakes involve a 1024-bit RSA private operation (make sure to use 64-bit software). If your site needs high security then you might want larger public key sizes or ephemeral Diffie-Hellman, but then you're not the HTTP-only site that this presentation is aimed at. But pick your certificate size carefully.

It's also important to pick your preferred ciphersuites. Most large sites (Google included) will try to pick RC4 because it's very fast and, as a stream cipher, doesn't require padding. Recent Intel chips (Westmere) contain AES instructions which can make AES the better choice, but remember that there's no point using AES-256 with a 1024-bit public key. Also keep in mind that ephemeral Diffie-Hellman (EDH or DHE) ciphersuites will handshake at about half the speed of pure RSA ciphersuites. However, with a pure RSA ciphersuite, an attacker can record traffic, crack (or steal) your private key at will and decrypt the traffic retrospectively, so consider your needs.

OpenSSL tends to allocate about 50KB of memory for each connection. We have patched OpenSSL to reduce this to about 5KB and I understand that the Tor project have independently written a similar patch that is now upstream (although not in any release at the time of writing). Keeping memory usage down is vitally important when dealing with many connections.

Resumption

There are two types of SSL/TLS handshake: a full handshake and an abbreviated handshake. The full handshake takes two round trips (in addition to the round trip from the TCP handshake):

The abbreviated handshake occurs when the connection can resume a previous session. This can only occur if the client has the previous session information cached. Since the session information contains key material, it's never cached on disk so the attempted client resume rate, seen by Google, is only 50%. Older clients also require that the server cache the session information. Since these old clients haven't gone away yet, it's vitally important to setup a shared session cache if you have multiple frontend machines. The server-side miss rate at Google is less than 10%.

An abbreviated handshake saves the server performing an RSA operation, but those are cheap anyway. More importantly, it saves a round-trip time:

Addressing round trips is a major focus of our SSL/TLS work at Google (see below).

Certificates

We've already mentioned that you probably don't want to use 4096-bit certificates without a very good reason, but there are other certificate issues which can cause a major slowdown.

Firstly, most certificates from CAs require an intermediate certificate to be presented by the server. Rather than have the root certificate sign the end certificates directly, the root signs the intermediate and the intermediate signs the end certificate. Sometimes there are several intermediate certificates.

If you forget to include an intermediate certificate then things will probably still work. The end certificate will contain the URL of the intermediate certificate and, if the intermediate certificate is missing, the browser will fetch it. This is obviously very slow (a DNS lookup, TCP connection and HTTP request blocking the handshake to your site). Unfortunately there's a constant pressure on browsers to work around issues and, because of this, many sites which are missing certificates will never notice because the site still functions. So make sure to include all your certificates (in the correct order)!

There's a second certificate issue that can increase latency: the certificate chain can be too large. A TCP connection will only send so much data before waiting to hear from the other side. It slowly ramps this amount up over time, but a new TCP connection will only send (typically) three packets. This is called the initial congestion window (initcwnd). If your certificates are large enough, they can overflow the initcwnd and cause an additional round trip as the server waits for the client to ACK the packets:

For example, www.bankofamerica.com sends four certificates: 1624 bytes, 1488 bytes, 1226 bytes and 576 bytes. That will overflow the initcwnd, but it's not clear what they could do about that if their CA really requires that many intermediate certificates. On the other hand, edgecastcdn.net has a single certificate that's 4093 bytes long, containing 107 hostnames!

Packets and records

SSL/TLS packages the bytes of the application protocol (HTTP in our case) into records. Each record has a signature and a header. Records are packed into packets and each packet has headers. The overhead of a record is typically 25 to 40 bytes (based on common ciphersuites) and the overhead of a packet is around 52 bytes. So it's vitally important not to send lots of small packets with small records in them.

I don't want to be seen to be picking on Bank Of America, it's honestly just the first site that I tried, but looking at their packets in Wireshark, we see many small records, often sent in a single packet. A quick sample of the record sizes: 638 bytes, 1363, 15628, 69, 182, 34, 18, … This is often caused because OpenSSL will build a record from each call to SSL_write and the kernel, with Nagle disabled, will send out packets to minimise latency.

This can be fixed with a couple of tricks: buffer in front of OpenSSL and don't make SSL_write calls with small amounts of data if you have more coming. Also, if code limitations mean that you are building small records in some cases, then use TCP_CORK to pack several of them into a packet.

But don't make the records too large either! See the 15KB record that https://www.bankofamerica.com sent? None of that data can be parsed by the browser until the whole record has been received. As the congestion window opens up, those large records tend to span several windows and so there's an extra round trip of delay before the browser gets any of that data. Since the browser is pre-parsing the HTML for subresources, it'll delay discovery and cause more knock-on effects.

So how large should records be? There's always going to be some uncertainty in that number because the size of the TCP header depends on the OS and the number of SACK blocks that need to be sent. In the ideal case, each packet is full and contains exactly one record. Start with a value of 1415 bytes for a non-padded ciphersuite (like RC4), or 1403 bytes for an AES based ciphersuite and look at the packets which result from that.

OCSP and CRLs

OCSP and CRLs are both methods of dealing with certificate revocation: what to do when you lose control of your private key. The certificates themselves contain the details of how to check if they have been revoked.

OCSP is a protocol for asking the issuing authority “What's the status of this certificate?” and a CRL is a list of certificates which have been revoked by the issuing authority. Both are fetched over HTTP and a certificate can specify an OCSP URL, a CRL URL, or both. But certificate authorities will typically use at least OCSP.

Firefox 2 and IE on Windows XP won't block an SSL/TLS handshake for either revocation method. IE on Vista will block for OCSP, as will Firefox 3. Since there can be several OCSP requests resulting from a handshake (one for each certificate), and because OCSP responders can be slow, this can result in hundreds of milliseconds of additional latency for the first connection. I don't have really good data yet, but hopefully soon.

The answer to this is OCSP stapling: the SSL/TLS server includes the OCSP response in the handshake. OCSP responses are public and typically last for a week, so the server can do the work of fetching them and reuse the response for many connections. The latest alpha of Apache supports this (httpd 2.3.6-alpha). Google servers will hopefully support this soon.

However, OCSP stapling has several issues. Firstly, the protocol only allows the server to staple one response into the handshake: so if you have more than one certificate in the chain the client will probably end up doing an OCSP check anyway. Secondly, an OCSP response is about 1K of data. Remember the issue with overflowing the initcwnd with large certificates? Well the OCSP response is included in the same part of the handshake, so it puts even more pressure on certificate sizes.

Google's SSL/TLS work

Google is working on a number of fronts here. I'll deal with them in order of deployment and complexity.

Firstly, simple and deployed, is False Start. All you really need to know is that it reduces the number of round trips for a full handshake from two to one. Thus, there's no longer any latency advantage to resuming. It's a client-side only change and it's live in Chrome (5 on Linux, 6 on Mac and Windows) and Android Froyo.

Slightly more complex, but still deployed, is Next Protocol Negotiation. This pushes negotiation of the application level protocol down into the TLS handshake. It's how we trigger the use of SPDY. It's live on the Google frontend servers and in Chrome (6).

OCSP stapling should be coming to the Google frontend servers within the next couple of months. Once we have basic stapling working, we'll need to address its shortfalls (see above). That means that we'll be implementing multi-stapling and compression of certificates and OCSP responses at some point, although we haven't started work on that yet.

Lastly, and most complex, is Snap Start. This reduces the round trip times to zero for both types of handshakes. It's a client and server side change and it assumes that the client has the server's certificate cached and has up-to-date OCSP information. However, since the certificate and OCSP responses are public information, we can cache them on disk for long periods of time.

Conclusion

I hope that was helpful. We want to make the web faster and more secure and this sort of communication helps keep the world abreast of what we're doing.

Also, don't forget that we recently deployed encrypted web search on https://encrypted.google.com. Switch your search engine!

June 25, 2010 07:00 AM

June 24, 2010

Google's Go Guide

ドキュメント更新のお知らせ

原文更新に伴い、ドキュメントの翻訳を更新しました。

Goプログラミング言語仕様

可変引数パラメータ(…)には、型の指定が必須になりました。
なお、ドキュメントのバージョンの表記は変わっていません。(Version of June 7, 2010)

チュートリアル

これも、可変引数の説明が変わりました。

実践Go言語

セクションの追加(defer,  panic, recover)

by noboru at June 24, 2010 03:11 PM

Airs – Ian Lance Taylor

Afghanistan

I don’t have a well thought out view of Afghanistan. But General McChrystal’s counter-insurgency plan never made much sense to me. The plan by definition requires a government which the people can trust. But all reports are that Hamid Karzai is not trusted by the people in Afghanistan. The election last year was a total fiasco. That kind of seems like a big gaping hole in the middle of the counter-insurgency plan. You can’t build trust in a government whose leader stays in power by fraud. McChrystal was reportedly trying to build trust in Karzai by travelling with him and boosting his position, but since frankly I can’t see why the Afghan people would trust McChrystal or the U.S. either, that seems like a flawed plan.

So now McChrystal is out amid reports of bickering and infighting. But we’re still going to follow the same plan under General Petraeus. The basic dynamic of the situation is unchanged. How is this not going to be a disaster?

The U.S. made progress in Iraq, against my expectations, by showing that people had more to gain by participating in politics than they did by staying out. In particular, the Iraqis showed themselves what a civil war would look like, and many of them backed away. Iraq remains a long way from normal, and the former middle class remains largely outside the country, but it’s hugely better than it was four years ago.

Afghanistan is a much bigger country than Iraq with a much smaller population. The political dynamics are by necessity quite different. The political class is much smaller. I don’t see why one would expect the same process to work.

It’s also worth questioning what the U.S. has to gain from Afghanistan. Al Qaeda has relocated into Pakistan. No reasonable person would want to let the Taliban regain control, but there is no U.S. national interest in Afghanistan. There is no oil. The recently trumpeted minerals wealth has little national interest to the U.S., no slouch in mineral wealth itself. What is going to keep us there for the time it takes to turn Afghanistan into a modern society?

At this point I think the military approach is entirely wrong. I think an economic approach would be much more effective. Maybe we should try to make Kabul as secure as we can and as rich as we can, and open its gates to anybody who will enter without weapons. Hand out radios and food. Let the Taliban fight for the rest of the country, but show most of the people that a better way is available. I don’t know if this would work at all, but it would be cheaper in lives and money than the current approach.

Since we’re not going to do that, I just hope that I’m wrong again, and that something useful comes out of this, even if I can’t see what.

by Ian Lance Taylor at June 24, 2010 12:44 AM

June 23, 2010

Nick Johnson

Guessing subreddits with the Prediction API

Edit: Now with a live demo!

I've written before about the new BigQuery and Prediction APIs, and promised to demonstrate them. Let's take a look at the Prediction API first.

The Prediction API, as I've explained, does a restricted form of machine learning, as a web service. Currently, it supports categorizing textual and numeric data into a preset list of categories. The example given in the talk - language detection - is a good one, but I wanted to come up with something new. A few ideas presented themselves:

  1. Training on movie/book reviews to try and predict the score given based on the text
  2. Training on product descriptions to try and predict their rating
  3. Training on Reddit submissions to try and predict the subreddit a new submission belongs in

All three have promise, but the first could suffer from the fact that the prediction API as it currently stands doesn't understand a relationship between categories - it would have no way to know that the '5 star' rating tag is 'closer to' the '4 star' one than the '1 star' tag. The second seems very ambitious, and it's not clear there's enough information to do that. The third one, though, seemed just right.

With that in mind, I started collecting data on reddit submissions. Using the API, I collected every submission to reddit between 2010-06-02 13:39:31 UTC and 2010-06-09 23:11:34, using this simple script:

import logging
import time
import urllib2
import simplejson

REDDIT_URL = 'http://www.reddit.com/r/all/new/.json'

class RedditProcessor(object):
  def __init__(self, outfile):
    self.seen = set()
    self.outfile = outfile
    self.interval = 120

  def run(self):
    self.init()
    while True:
      data = self.get_reddit_data()
      if data:
        self.process_reddit_data(data)
      time.sleep(self.interval)

  def init(self):
    fh = open(self.outfile, 'r')
    self.seen.update(simplejson.loads(x)['id'] for x in fh)
    fh.close()
    logging.info("Read %d IDs", len(self.seen))
    self.writer = open(self.outfile, 'a')

  def get_reddit_data(self):
    try:
      request = urllib2.urlopen(REDDIT_URL)
      return simplejson.loads(request.read())
    except urllib2.URLError, e:
      logging.exception("Request failed")
      return None

  def process_reddit_data(self, data):
    if 'data' not in data or 'children' not in data['data']:
      logging.warn("Data does not contain expected keys: %r", data)
      return
    try:
      num_written = 0
      for entry in data['data']['children']:
        entry_data = entry['data']
        if entry_data['id'] not in self.seen:
          simplejson.dump(entry_data, self.writer)
          self.writer.write('\n')
          num_written += 1
          self.seen.add(entry_data['id'])
      logging.info("Wrote %d new entries", num_written)
    except Exception, e:
      logging.exception("Error processing entries")


def main():
  logging.basicConfig(level=logging.DEBUG)
  processor = RedditProcessor("reddit_dump.json")
  processor.run()

if __name__ == '__main__':
  main()

In total, I collected a little over 75MB of JSON-encoded data, comprising 72,986 submissions. I then determined the 20 subreddits with the most submissions over that time, and generated a training dataset from just the submissions to those subreddits. This subset made up 42,753 submissions, or about 58% of the original. Submissions were randomly split into either the training set (98%) or the validation set (2%):

import csv
import random
import simplejson


fh = open('reddit_dump.json', 'r')
subreddits = {}
for line in fh:
  data = simplejson.loads(line)
  subreddits[data['subreddit']] = subreddits.get(data['subreddit'], 0) + 1

top20 = sorted(subreddits.items(), key=lambda x:x[1], reverse=True)[:20]
print top20
top20_set = set(x[0] for x in top20)
fh = open('reddit_dump.json', 'r')
training = csv.writer(open('reddit_training.csv', 'w'))
validation = csv.writer(open('reddit_validation.csv', 'w'))
for line in fh:
  data = simplejson.loads(line)
  if data['subreddit'] not in top20_set:
    continue
  row = (data['subreddit'].encode('utf-8'), data['title'].encode('utf-8'), data['domain'].encode('utf-8'))
  if random.random() >= 0.98:
    validation.writerow(row)
  else:
    training.writerow(row)

The top 20 reddits, incidentally, are:

RedditSubmissions
reddit.com14578
pics4157
AskReddit3375
reportthespammers3258
politics3162
funny2176
WTF1773
gaming1367
worldnews938
videos849
atheism834
Music833
technology732
trees703
comics639
nsfw611
circlejerk600
news567
environment537
DoesAnybodyElse537

Next, I uploaded the new dataset to the Google Storage service, using the 'gsutil' tool, and followed the getting started instructions for the Prediction API to start it training on my model:

$ curl -X POST -H "Content-Type:application/json" -d "{\"data\":{}}" -H "Authorization: GoogleLogin auth=$XAPI_AUTH" https://www.googleapis.com/prediction/v1/training?data=reddit-dump%24reddit_training.csv
{"data":{"data":"reddit-dump$reddit_training.csv"}}

$ curl -H "Authorization: GoogleLogin auth=$XAPI_AUTH" https://www.googleapis.com/prediction/v1/training/reddit-dump%2freddit_training.csv
{"data":{"data":"reddit-dump/reddit_training.csv","modelinfo":"Training has not completed."}}

Once training completed, we can see the Prediction API's own estimate of accuracy:

$ curl -H "Authorization: GoogleLogin auth=$XAPI_AUTH" https://www.googleapis.com/prediction/v1/training/reddit-dump%2freddit_training_2.csv
{"data":{"data":"reddit-dump/reddit_training_2.csv","modelinfo":"estimated accuracy: 0.61"}}

That's not bad, though perhaps not as stellar as we might have hoped. For perspective, if we simply picked the most popular category (reddit.com) every time, we'd only get it right about 34% of the time. Let's run our own test with the data we set aside for validation, though, and see how it goes. Here's our test script:

import csv
import logging
import simplejson
import sys
import urllib
import urllib2


def predict(target, auth, text):
  request_data = {
      'data': {
          'input': {
              'text': text,
          }
      }
  }
  request = urllib2.Request(
      'https://www.googleapis.com/prediction/v1/training/%s/predict' % target,
      data=simplejson.dumps(request_data),
      headers={
          "Authorization": "GoogleLogin auth=%s" % auth,
          "Content-Type": "application/json",
      })
  response = urllib2.urlopen(request)
  response_data = simplejson.load(response)
  return response_data['data']['output']['output_label']

def main(args):
  infile, auth, target = args[1:4]
  reader = csv.reader(open(infile))
  count = 0
  correct = 0
  for tag, text, domain in reader:
    count += 1
    retries = 0
    while retries < 10:
      try:
        result = predict(target, auth, [text.strip('\r\n\\n'), domain])
        break
      except urllib2.HTTPError, e:
        retries += 1
    if result == tag:
      correct += 1
    else:
      print ("Incorrectly predicted %r (%s) as %s (should be %s)"
             % (text, domain, result, tag))
  print "%d of %d predicted correctly." % (correct, count)


if __name__ == '__main__':
  main(sys.argv)

And the output...

$ python training_check.py reddit_validation.csv $XAPI_AUTH reddit-dump%2freddit_training_2.csv
...
484 of 857 predicted correctly.

56% - not far off the system's own estimate. Let's take a look at some of the failures to predict, though:

Incorrectly predicted 'Small Businesses Still Worried About Reform Bill' (nytimes.com) as politics (should be reddit.com)
Incorrectly predicted 'Seriously Reddit?' (imgur.com) as pics (should be reddit.com)
Incorrectly predicted '3 YEARS of crappy skin spam' (reddit.com) as reportthespammers (should be reddit.com)
Incorrectly predicted ' President George W. Bush A.K.A. The Decider ("I\'m not much of an E-mailing kind of guy") - Joins Facebook.' (news.bbc.co.uk) as politics (should be reddit.com)
Incorrectly predicted 'Meet Leroy Stick, The Man Behind @BPGlobalPR' (gizmodo.com) as funny (should be reddit.com)
Incorrectly predicted "I don't understand why this is so funny...\n[N S F Charles Guiteau]" (imgur.com) as pics (should be reddit.com)
Incorrectly predicted 'Freaky New Dead Space 2 Screenshots' (dasreviews.com) as gaming (should be reddit.com)
Incorrectly predicted 'Once upon a time...' (i.imgur.com) as pics (should be reddit.com)
Incorrectly predicted 'The Flotilla Choir' (youtube.com) as funny (should be reddit.com)
Incorrectly predicted '15 Million Unemployed and 41,000 non census jobs created last month.' (cnbc.com) as politics (should be reddit.com)
Incorrectly predicted 'EU ruling prompts calls for overhaul of gambling laws' (telegraph.co.uk) as worldnews (should be reddit.com)
Incorrectly predicted 'I woke up this morning to find my car had been towed out of my driveway [pic]' (imgur.com) as pics (should be reddit.com)
Incorrectly predicted 'Old White Guy Late Night Talk Show' (buzzfeed.com) as funny (should be reddit.com)

Hm. There's an awful lot that look correct, but were actually posted to reddit.com. The reverse is true, too:

Incorrectly predicted 'Coulomb Details Huge Electric Car Charging Infrastructure Plans' (earthtechling.com) as reddit.com (should be technology)
Incorrectly predicted 'AT&T Announces Tethering Support For iPhone 4G / HD; Updates Data Plans' (mygadgetnews.com) as reddit.com (should be technology)
Incorrectly predicted '"And those are the facts" Same faulty arguments from all theists.' (youtube.com) as reddit.com (should be atheism)
Incorrectly predicted 'For VPS users! A great link!' (gallantfx.com) as reddit.com (should be technology)
Incorrectly predicted '\xe2\x80\x9cIsraeli commandos had paintball guns\xe2\x80\x9d \xe2\x80\x93 Israeli Ambassador' (rt.com) as reddit.com (should be worldnews)
Incorrectly predicted 'Cool, Rayguns!' (cnn.com) as reddit.com (should be technology)
Incorrectly predicted 'India vows to sabotage ACTA.' (arstechnica.com) as reddit.com (should be technology)
Incorrectly predicted 'Spy on my dog!' (ustream.tv) as reddit.com (should be funny)
Incorrectly predicted 'California poised to OK supertoxic pesticide' (sfgate.com) as reddit.com (should be environment)
Incorrectly predicted 'Voters of Reykjavik know where their towels are' (news.bbc.co.uk) as reddit.com (should be WTF)

And, of course, there's ordinary errors too:

Incorrectly predicted "WHAT'S UPPPPPPPP?" (i.imgur.com) as pics (should be funny)
Incorrectly predicted 'Google Chrome extensions the new facebook apps when it comes to privacy?' (imgur.com) as technology (should be pics)
Incorrectly predicted 'DAE actually understand the words to the songs "Smells Like Teen Spirit" or "Song 2" ?' (self.DoesAnybodyElse) as politics (should be DoesAnybodyElse)
Incorrectly predicted 'BP May Sell Prudhoe Bay Stake as Spill Costs Mount ' (bloomberg.com) as politics (should be news)
Incorrectly predicted 'Bananas are pervert ' (i.imgur.com) as pics (should be funny)
Incorrectly predicted 'New BP Logo...anyone else?' (imgur.com) as pics (should be environment)
Incorrectly predicted 'Petition for Net Neutrality - 74 Democrats sold you out to AT&T, Verizon and Comcast ' (self.technology) as politics (should be technology)
Incorrectly predicted 'Neighborhood Watch' (i.imgur.com) as pics (should be funny)
Incorrectly predicted 'U MAY BE A GEEK IF...' (streetwalkermag.ca) as funny (should be technology)
Incorrectly predicted 'Ron Paul says criticism of Obama on oil spill is wrong ' (content.usatoday.com) as news (should be politics)
Incorrectly predicted 'Watching CNN and the cap appears to be in place over the well' (self.worldnews) as gaming (should be worldnews)

Based on this, it looks to me like we can talk about three distinct types of categorization errors:

  1. reddit.com vs other subreddit posts. This occurs because the reddit.com subreddit gets an amazingly diverse set of submissions, and doesn't really have its own 'type' of posts like most of the other subreddits to.
  2. Ambiguous categorization. Many pics are also funny, and a lot of news is also politics. In some cases, you could even argue that the submitter got it wrong, and the learning API got it right!
  3. Ordinary, legitimate errors. The last one above is a good example of that.

There's not much we can do about category 3, and it's debatable that anything needs to be done about category 2, but let's see what we can do about category 1 by excluding the reddit.com subreddit from consideration, and concentrating on everything else:

$curl -X POST -H "Content-Type:application/json" -H "Authorization: GoogleLogin auth=$XAPI_AUTH" -d "{data:{}}" https://www.googleapis.com/prediction/v1/training?data=reddit-dump%2freddit_training_nodotcom.csv

...

$ curl -H "Authorization: GoogleLogin auth=$XAPI_AUTH" https://www.googleapis.com/prediction/v1/training/reddit-dump%2freddit_training_nodotcom.csv

{"data":{"data":"reddit-dump/reddit_training_nodotcom.csv","modelinfo":"estimated accuracy: 0.63"}}

$ python training_check.py reddit_validation_nodotcom.csv $XAPI_AUTH reddit-dump%2freddit_training_nodotcom.csv

...

375 of 581 predicted correctly.

The API has modestly increased its estimate to 63% accuracy; our test supports that with a more significant increase to 64% (from 56%). Of the remaining miscategorizations, most fall into category 2 - here's the last 10 miscategorized results, unedited - judge for yourself:

Incorrectly predicted "As rumours swell that the government staged 7/7, victims' relatives call for a proper inquiry" (dailymail.co.uk) as funny (should be news)
Incorrectly predicted 'Wow.' (pici.se) as WTF (should be nsfw)
Incorrectly predicted 'Gabe: What if Coruscant had a newspaper? What would their stupid political cartoons look like?' (penny-arcade.com) as funny (should be comics)
Incorrectly predicted "Abramoff released from Md. prison to halfway house.  Can we have a pool to see how long before he's back at work with the Republicans?" (google.com) as worldnews (should be politics)
Incorrectly predicted "Lady Gaga 'Alejandro' music video released - Very, very sexy. Potentially [NSFW]." (newsfeed.time.com) as funny (should be WTF)
Incorrectly predicted 'Official Decree - Take Notice.' (farm2.static.flickr.com) as science (should be funny)
Incorrectly predicted 'Philosophy of Ghost In the Shell' (video.google.com) as WTF (should be funny)
Incorrectly predicted 'Taco Bell Petitions National Reserve To Circulate More $2 Bills' (knucklesunited.com) as politics (should be WTF)
Incorrectly predicted 'Seriously, how is Japan not fat from eating this?!' (i.imgur.com) as pics (should be WTF)
Incorrectly predicted 'Cuba declines UN mission related to torture' (wireupdate.com) as politics (should be worldnews)

In conclusion, even for more subjective categorization tasks like this, and with a fairly large number of categories (20), thus increasing the number of possible ways to be wrong, the Prediction API performs fairly well, with most of the errors being ones humans would plausibly make as well. Determining the actual error rate - by going through the errors and categorizing them based on legitimacy - would be an interesting task.

If there's interest, I may put up an App Engine app that allows you to query for Reddit predictions yourself - but right now, this blog post is already 2 days late, so it'll have to wait. Live demo now, er, live! Try it out here.

by Nick Johnson at June 23, 2010 05:26 PM

Google's Go Guide

パッケージドキュメント更新のお知らせ

原文更新に伴い、パッケージドキュメントの翻訳を更新しました。

exp/datafmt
fmt
image
path
reflect
rpc
strconv
strings

by noboru at June 23, 2010 02:28 PM

Adam Langley

curve25519 in iOS4

Someone pointed out to me that I'm in the credits for iOS4 (the operating system formerly known as iPhone OS) for curve25519-donna. Unfortunately, I have no idea what they're using it for! I would be fascinated to know.

June 23, 2010 07:00 AM

Airs – Ian Lance Taylor

Death-taxis

I’ve come across a few articles recently about how modern medicine is on the road to conquer death in the next thirty years or so. I find this to be very unlikely, and I feel that people aren’t thinking about the real issues. I’ve seen two general themes. One is that the singularity will come and change everything, which is essentially unanswerable except by rolling your eyes and backing away. The other is that death is essentially a type of disease, and we will learn to cure it.

Unfortunately, death is not a disease to be cured. It’s a fundamental aspect of life. In the competition for food and other resources necessary for life, the most significant competitors of any individual organism are the other members of its own species. They are the ones who seek to occupy exactly the same niche. Complex organisms which do not die will have more size and experience than their descendants, and will therefore tend to outcompete them. It follows that species whose organisms do not die will tend to not evolve. They will over time be outcompeted by other species which do evolve. Thus death is a key evolutionary strategy for any successful species. The fact that individuals may prefer not to die is irrelevant to long term evolutionary history.

What this means is that death is a finely tuned aspect of ourselves, just as finely tuned as our rather remarkable ability to reproduce ourselves. And it’s not just an aspect of ourselves, it’s an aspect of our evolutionary forebears for eons.

It may seem superficially that humans pass through a period of childhood, then enter a phase of stasis, and then decline and die. However, in fact humans change slowly throughout their lives. Arresting the aging process would be just as complex as arresting the growth process during the teenage years. All our bodily systems are shaped by evolution to head in a particular direction. Stopping that means changing all aspects of our bodies. It would mean a person aged 20 who does not turn into a person aged 30. That means changing a hundred different aspects of how the body grows.

The fundamental argument of the people seeking to conquer death is that the body is a machine, and that we can figure out how to fix the machine so that it does not fail. However, the bodily machine was created by an evolutionary process, not by human design. Think of the ugliest least comprehensible computer program you’ve ever seen, code which is uncommented and full of cross dependencies. Think of the hacker who wrote that code–code that works but is unmaintainable. Imagine letting that hacker work on a computer program for a million years, continually micro-optimizing and never doing a comprehensive overhaul or redesign. Now you have to reverse engineer it. That’s what figuring out the human body is like. Every system in the body has deep layers of complexity and is related to other systems in strange and surprising ways. Despite all the near-miraculous advances of modern medicine, we are still only scratching the surface of understanding how the body works. Increasing computer power will help, of course, but we don’t even know the questions to ask. This is going to be a task of many generations, and even as we start to understand it will take far more work before we have any idea how to actually change anything.

Of course I could be entirely wrong, and I do think that research on aging should continue. I just don’t see any reason for optimism. A human who does not age would really be an entirely different species. What reason do we have to think that we can create such a species any time in the foreseeable future? If we could create it, what reason do we have to think that we can somehow convert ourselves?

by Ian Lance Taylor at June 23, 2010 12:33 AM

June 22, 2010

Airs – Ian Lance Taylor

Martin Beck

Apparently the great popularity of Stieg Larsson’s novels have triggered a new interest in Swedish mystery authors. I’d like to plug the Martin Beck series by Maj Sjöwall and Per Wahlöö. It’s ten books written in the 60s and 70s.

Actually, other than being Swedish, they are entirely different from Larsson’s novels. Larsson reads like an intelligent Dan Brown with real characterization. The Beck novels are police procedurals, telling the story of solving a crime from the perspective of a policeman, Martin Beck. The novels were also intended to be an examination of Swedish society, which sounds daunting but is quite effective in practice.

The Beck novels have some extremely funny scenes, scenes which are made all the funnier by the fact that nobody in the story considers the amusing at all, and indeed they would not be funny if you were involved in them in real life. For example, the police breaking into what turns out to be a completely empty room in The Terrorists (Terroristerna), resulting through a series of completely plausible mishaps in several shootings and near fatalities.

Henning Mankell, a popular current Swedish mystery writer, is clearly strongly influenced by Sjöwall and Wahlöö. Many of Mankell’s novels are quite good, but I prefer the earlier ones.

by Ian Lance Taylor at June 22, 2010 01:21 PM

June 19, 2010

Adam Langley

TLS latency improvements

We've published several drafts recently concerning reducing the number of round trips in SSL/TLS. Firstly, False Start is a client-side only change which reduces the number of round trips for a full handshake to one. It's in Android Froyo and Chrome (Linux since 5.x, Mac and Windows since 6.x). Secondly, Snap Start is a client and server change which reduces the overhead to zero. The code for it is still preliminary, but it might make it into Chrome 6.x. Snap Start requires that the client cache some information about the server but, unlike resume information, that data can be cached on disk.

This is all part of an ongoing effort to make the web faster.

Full handshakeResume handshake
(Round trip times)
Standard TLS21
False Start11
Snap Start00

June 19, 2010 07:00 AM

June 18, 2010

Nick Johnson

Using remote_api with OpenID authentication

When we recently released integrated OpenID support for App Engine, one unfortunate side-effect for apps that enable it was disruption to authenticated, programmatic access to your App Engine app. Specifically, if you've switched your app to use OpenID for authentication, remote_api - and the remote_api console - will no longer work.

The bad news is that fixing this is tough: OpenID is designed as a browser-interactive authentication mechanism, and it's not clear what the best way to do authentication for command line tools like the remote_api console is going to be. Quite likely the solution will involve our OAuth support and stored credentials - stay tuned!

The good news, though, is that there's a workaround that you can use right now, without compromising the security of your app. It's a bit of a hack, though, so brace yourself!

The essential insight behind the hack is that if we can trick the SDK into thinking that it's authenticating against the development server instead of production, it will prompt the user for an email address and password, then send that email address embedded in the 'dev_appserver_login' cookie with all future requests. We can then use the email field to instead hold a secret key that we can use to authenticate you to your app.

This requires only a simple modification to the server component of remote_api, using the techniques I previously outlined, and no modification at all to your SDK. Without further ado, here's the patch:

from google.appengine.ext.remote_api import handler
from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app

import re


MY_SECRET_KEY = 'topsecret'


cookie_re = re.compile('^"([^:]+):.*"$')


class ApiCallHandler(handler.ApiCallHandler):
  def CheckIsAdmin(self):
    login_cookie = self.request.cookies.get('dev_appserver_login', '')
    match = cookie_re.search(login_cookie)
    if (match and match.group(1) == MY_SECRET_KEY
        and 'X-appcfg-api-version' in self.request.headers):
      return True
    else:
      self.redirect('/_ah/login')
      return False


application = webapp.WSGIApplication([('.*', ApiCallHandler)])


def main():
  run_wsgi_app(application)


if __name__ == '__main__':
  main()

This should be fairly easy to understand, but let's go through it piece by piece. First, we define a constant, 'MY_SECRET_KEY'. I highly recommend setting this to something other than the string 'topsecret', as we have in this demo! Then, we define a regular expression that matches the email part of the dev_appserver_login cookie.

The class ApiCallHandler extends the regular remote_api handler as we demonstrated in the previous article, and overrides its "CheckIsAdmin" method. This method is responsible for doing authentication, and usually it simply checks that the user is logged in as an administrator using the Users API. In this case, though, we instead look for the dev_appserver_login cookie, and if it exists, check if the secret key matches. If it does not, we send the user a redirect to '/_ah/login', and indicate that authentication was not successful.

It's this redirect that is essential to the hack. The way the SDK determines if it's authenticating against a development server or production is by checking the nature of the redirect sent back from unauthenticated requests; if it gets a redirect to '/_ah/login', it assumes it's the dev_appserver it's talking to, and assembles the login cookie that we check for above.

Only one other change is required to make use of this hack - you must update your mapping in app.yaml to send remote_api requests to our custom handler, rather than the standard one. Save the above file as remote_api.py, and replace the remote_api handler in app.yaml with the following:

- url: /remote_api
  script: remote_api.py

Note that we didn't include the usual "login: admin" declaration in this handler. It's important that you leave that out: without it, our handler will do the authentication itself using our hack, but if you leave it in, execution never reaches our handler - the App Engine infrastructure checks for a regular OpenID login, and refuses to execute our code if it's not present.

by Nick Johnson at June 18, 2010 11:45 AM

June 16, 2010

Nick Johnson

Google I/O playlist, day 8: Next gen queries

This is the eighth in a series of posts providing a day-by-day playlist to help break up the Google I/O session videos - specifically the App Engine ones - into manageable chunks for those that haven't seen them. Don't worry, we're nearly done!

First up, apologies for the lack of posts the last few days. I spent the weekend at Hack Camp, so Friday was spent preparing for it, and Monday and Tuesday were spent recovering and catching up!

Today's session is Alfred Fuller's session, Next Gen Queries. If you're interested in the continued evolution of the datastore API, this is a must-watch talk. It's completely language-agnostic, too. If you're looking for better background knowledge, Brett's talk from I/O '09 is well worth watching first, though.

Alfred goes into a lot of detail about some of the exciting improvements you can expect in future iterations of the Datastore API, largely due to his work on extending and improving the ZigZag merge join strategy in App Engine, as well as better support for MultiQuery. As an extra bonus, check out 31:29 for details of an improvement you can take advantage of today!

Alfred also talks extensively about doing spatial and other multi-dimensional indexing using Hilbert Curves, something I've written about before.

<object height="385" width="640"><param name="movie" value="http://www.youtube.com/v/ofhEyDBpngM&amp;hl=en_US&amp;fs=1&amp;"><param name="allowFullScreen" value="true"><param name="allowscriptaccess" value="always"><embed allowfullscreen="true" allowscriptaccess="always" height="385" src="http://www.youtube.com/v/ofhEyDBpngM&amp;hl=en_US&amp;fs=1&amp;" type="application/x-shockwave-flash" width="640"></embed></object>

by Nick Johnson at June 16, 2010 02:00 PM

June 15, 2010

Renee French



cover design for my upcoming fall release book, H DAY, published by picturebox

by renee (noreply@blogger.com) at June 15, 2010 08:14 AM

June 14, 2010

Google's Go Guide

Goプログラミング言語仕様更新のお知らせ(June 7, 2010 版)

Go言語仕様の原文更新(June 7, 2010版)に伴ない、日本語訳を更新しました。

【原文】The Go Programming Language Specification

言語的にはほとんど影響のない変更です。

  • 組み込み関数copyで、配列が使えなくなり、スライスのみとなった。(たぶん、もともとドキュメントの誤り)
  • 「代入の適合性(assignment compatible)」と呼ばれていたものが、「代入可能(assignable)」と表記されるようになった。
  • 上記の「代入可能」ルールや、「比較」ルールなどが整理された。

by noboru at June 14, 2010 02:30 PM

Airs – Ian Lance Taylor

gccgo panic/recover

Back in March Go picked up a dynamic exception mechanism. Previously, when code called the panic function, it simply aborted your program. Now, it walks up the stack to the top of the currently running goroutine, running all deferred functions. More interestingly, if a deferred function calls the new recover function, the panic is interrupted, and stops walking the stack at that point. Execution then continues normally from the point where recover was called. The recover function returns the argument passed to panic, or nil if there is no panic in effect.

I just completed the implementation of this in gccgo. It turned out to be fairly complex, so I’m writing some notes here on how it works.

The language requires that panic runs the deferred functions before unwinding the stack. This means that if the deferred function calls runtime.Callers (which doesn’t work in gccgo, but never mind, it will eventually) it gets a full backtrace of where the call to panic occurred. If the language did not work that way, it would be difficult to use recover as a general error handling mechanism, because there would be no good way to dump a stack trace. Building up a stack trace through each deferred function call would be inefficient.

The language also requires that recover only return a value when it is called directly from a function run by a defer statement. Otherwise it would be difficult for a deferred function to call a function which uses panic and recover for error handling; the recover might pick up the panic for its caller, which would be confusing.

As a general gccgo principle I wanted to avoid requiring new gcc backend features. That raised some difficulty in implementing these Go language requirements. How can the recover function know whether it is being invoked directly by a function started by defer? In 6g, walking up the stack is efficient. The panic function can record its stack position, and the recover function can verify that it is at the correct distance below. In gccgo, there is no mechanism for reliably walking up the stack other than exception stack unwinding, which does not provide a helpful API. Even if it did, gccgo’s split stack code can introduce random additional stack frames which are painful to account for. And there is no good way for panic to mark the stack in gccgo.

What I did instead was have the defer statement check whether the function is it deferring might call recover (e.g., it definitely calls recover, or it is a function pointer so we don’t know). In that case, the defer statement arranges to have the deferred thunk record the return address of the deferred function at the top of the defer stack. This value is obtained via gcc’s address-of-label extension, so no new feature was required. This gives us a value which a function which calls recover can check, because a function can always reliably determine its own return address via gcc’s __builtin_return_address function.

However, if the stack is split, then __builtin_return_address will return the address of the stack splitting cleanup code rather than the real caller. To avoid that problem, a function which calls recover is split into two parts. The first part is a small thunk which is marked to not permit its stack to be split. This thunk gets its return address and checks whether it is being invoked directly from defer. It passes this as a new boolean parameter to the real function, which does permit a split stack. The real function checks the new parameter before calling recover; if it is false, it just produces a nil rather than calling recover. The real function is marked uninlinable, to ensure that it is not inlined into its only call site, which could blow out the stack.

That is sufficient to let us know whether recover should return a panic value if there is one, at the cost of having an extra thunk for every function which calls recover. Now we can look at the panic function. It walks up the defer stack, calling functions as it goes. When a function sucessfully calls recover, the panic stack is marked. This stops the calls to the deferred functions, and starts a stack unwind phase. The stack unwinding is done exactly the way that g++ handles exceptions. The g++ exception mechanism is general and cross-language, so this part was relatively easy. This means that every function that calls recover has an exception handler. The exception handlers are all the same: if this is the function in which recover returned a value, then simply return from the current function, effectively stopping the stack unwind. If this is not the function in which recover returned a value, then resume the stack unwinding, just as though the exception were rethrown in C++.

This system is somewhat baroque but it appears to be working. Everything is reasonably efficient except for a call to recover which does not return nil; that is as expensive as a C++ exception. Perhaps I will think of ways to simplify it over time.

by Ian Lance Taylor at June 14, 2010 07:35 AM

June 10, 2010

Nick Johnson

Google I/O playlist, day 7: Data pipelines with Google App Engine

This is the seventh in a series of posts providing a day-by-day playlist to help break up the Google I/O session videos - specifically the App Engine ones - into manageable chunks for those that haven't seen them.

Today's session is Brett Slatkin's talk on Data Pipelines with Google App Engine. This is a really fascinating talk, and it's a must-watch for anyone wanting to do advanced, high-throughput processing on App Engine. Brett describes in detail, with working code, a couple of high level concepts that make it possible to do 'eventually consistent' processing on App Engine, with one of the goals being Materialized Views.

The examples are in Python, but the talk will be useful for Java developers too - it's the concepts that are really important here.

<object height="385" width="640"><param name="movie" value="http://www.youtube.com/v/zSDC_TU7rtc&amp;hl=en_US&amp;fs=1&amp;"><param name="allowFullScreen" value="true"><param name="allowscriptaccess" value="always"><embed allowfullscreen="true" allowscriptaccess="always" height="385" src="http://www.youtube.com/v/zSDC_TU7rtc&amp;hl=en_US&amp;fs=1&amp;" type="application/x-shockwave-flash" width="640"></embed></object>

by Nick Johnson at June 10, 2010 12:49 PM

June 09, 2010

Nick Johnson

Google I/O playlist, day 6: Batch data processing with App Engine

This is the sixth in a series of posts providing a day-by-day playlist to help break up the Google I/O session videos - specifically the App Engine ones - into manageable chunks for those that haven't seen them.

Today's video is Mike Aizatsky's Batch data processing with App Engine, where he describes the recently-released mapper framework for App Engine. I blogged previously about this framework, giving a detailed breakdown of the demo mapreduce used in this very talk!

The mapper API is initially being released for Python, so it'll be mostly of interest to Python users - but Java is coming soon, so it's well worth watching even if you only speak Java.

<object height="385" width="640"><param name="movie" value="http://www.youtube.com/v/_7fJotosrNQ&amp;hl=en_US&amp;fs=1&amp;"><param name="allowFullScreen" value="true"><param name="allowscriptaccess" value="always"><embed allowfullscreen="true" allowscriptaccess="always" height="385" src="http://www.youtube.com/v/_7fJotosrNQ&amp;hl=en_US&amp;fs=1&amp;" type="application/x-shockwave-flash" width="640"></embed></object>

by Nick Johnson at June 09, 2010 12:51 PM

June 08, 2010

Nick Johnson

Google I/O playlist, day 5: Data migration in App Engine

This is the fifth in a series of posts providing a day-by-day playlist to help break up the Google I/O session videos - specifically the App Engine ones - into manageable chunks for those that haven't seen them.

Today's video is Data migration in App Engine, Matthew Blain's talk on new improvements to the bulkloader. I've blogged about the new bulkloader previously, but Matthew's talk goes into a lot more detail.

Matt starts talking about the new configuration format at 6:38, if you want to skip the intro.

<object height="385" width="640"><param name="movie" value="http://www.youtube.com/v/4XBqdu8dYE8&amp;hl=en_US&amp;fs=1&amp;"><param name="allowFullScreen" value="true"><param name="allowscriptaccess" value="always"><embed allowfullscreen="true" allowscriptaccess="always" height="385" src="http://www.youtube.com/v/4XBqdu8dYE8&amp;hl=en_US&amp;fs=1&amp;" type="application/x-shockwave-flash" width="640"></embed></object>

by Nick Johnson at June 08, 2010 02:54 PM

June 07, 2010

Nick Johnson

Google I/O playlist, day 4: The BigQuery and Prediction APIs

This is the fourth in a series of posts providing a day-by-day playlist to help break up the Google I/O session videos - specifically the App Engine ones - into manageable chunks for those that haven't seen them.

Today's session isBigQuery and Prediction APIs. These are two awesome APIs that I described previously, and you can look forward to some forthcoming posts exploring how they work and what they can be used for.

This is another language-agnostic video - the APIs, by their nature, are pretty indifferent about what language you access them with. They both depend on Google Storage for their storage needs, so you should probably watch that talk first, though.

If you're only interested in one API or the other, the BigQuery talk starts at 6:15, and the Prediction API talk starts at 24:40. The whole talk is definitely worth watching, though.

<object height="385" width="640"><param name="movie" value="http://www.youtube.com/v/dbkwv1wjs3A&amp;hl=en_US&amp;fs=1&amp;"><param name="allowFullScreen" value="true"><param name="allowscriptaccess" value="always"><embed allowfullscreen="true" allowscriptaccess="always" height="385" src="http://www.youtube.com/v/dbkwv1wjs3A&amp;hl=en_US&amp;fs=1&amp;" type="application/x-shockwave-flash" width="640"></embed></object>

Have something you'd particularly like to see demonstrated using the Prediction or BigQuery APIs in a future post? Leave a comment!

by Nick Johnson at June 07, 2010 05:11 PM

Porting Go to NetBSD (Abandoned)

Well, I have an update: I’ve pulled the plug.

Well, I have an update: I’ve pulled the plug.

Rationale: I’m giving up on Go. The language (IMHO, naturally) is not a “systems programming language” while it discourages access to system calls, and has only an (undocumented and weak) facility for accessing other operating system APIs.

Similarly, having Unicode and UTF-8 is great … but if you can’t even normalise a UTF-8 string, then comparisons (and sorting) degenerate to byte comparisons. I suppose it works for ASCII ….

There are other items of friction, both linguistic and process wise, but in short: I don’t see the value of putting my time into a port, and am pulling the plug.

Game over.

by admin at June 07, 2010 08:08 AM

June 06, 2010

Go's official blog

Go Programming session video from Google I/O

Below is the video of the talk given by Rob Pike and Russ Cox at Google I/O 2010.

<object height="243" width="400"><param name="movie" value="http://www.youtube.com/v/jgVhBThJdXc&amp;hl=en_US&amp;fs=1&amp;"><param name="allowFullScreen" value="true"><param name="allowscriptaccess" value="always"><embed allowfullscreen="true" allowscriptaccess="always" height="243" src="http://www.youtube.com/v/jgVhBThJdXc&amp;hl=en_US&amp;fs=1&amp;" type="application/x-shockwave-flash" width="400"></embed></object>

by Andrew Gerrand (noreply@blogger.com) at June 06, 2010 05:12 PM

June 05, 2010

Airs – Ian Lance Taylor

Proposition 16

California’s proposition 16, which will be voted on next Tuesday, is an interesting use of California’s bizarre ballot initiative process. The proposition says that if a local government wants to start a municipal electrical utility, it must get a 2/3 majority of votes. The proposition was initiated and almost entirely funded by PG&E, a California electric company, which has reportedly spent over $35 million on advertising in support of the proposition. I rarely watch television, but I’ve received quite a few flyers in the mail about it. Some even list a long set of candidates and positions to endorse, along with proposition 16. However, since these flyers are required to list the groups which explicitly endorsed the flyer, it’s easy to see that the flyers are being put out for proposition 16, and they are trying to slide in support for it along with other candidates I might be inclined to vote for anyhow.

PG&E is the only company from which I can buy my electricity. Electricity distribution is a classic natural monopoly; why would two different companies put up wires to everybody’s house? Since modern life requires electricity, and since I can only get it from PG&E, I am a PG&E customer. Proposition 16 appears to be designed solely to preserve PG&E’s monopoly position, by making it much harder for communities to create their own municipal power companies. It would be hard enough to get a majority vote in favor of government run power; I think we can assume that a 2/3 majority would be effectively impossible. It’s pretty darn annoying that PG&E is spending $35 million, including money they collect from me, on this. Is that a good use of my money?

Municipal power is not impossible. For example, Palo Alto, California, uses a municipal power utility. It was notable during the rolling blackouts that affected most of California in the early 2000s that Palo Alto was immune. So it’s not as though PG&E deserves to be protected because they are doing a particularly good job. When a real crunch time came, they did a very poor job indeed.

It’s difficult for me to imagine why anybody would vote in favor of such a blatant power grab by a private company. But then, of course, there’s the $35 million. The opponents of proposition 16 have reportedly raised less than $100,000. I assume that if PG&E succeeds we will see more and more cases where private companies spend lots of money on ballot initiatives in their favor. I hope that it fails, and if I have any readers in California I encourage you to vote against proposition 16 this Tuesday.

by Ian Lance Taylor at June 05, 2010 06:21 PM

June 04, 2010

Nick Johnson

Google I/O playlist, day 3: What's hot in Java for App Engine

This is the third in a series of posts providing a day-by-day playlist to help break up the Google I/O session videos - specifically the App Engine ones - into manageable chunks for those that haven't seen them.

Today's video is "What's hot in Java for App Engine" by Don Schwarz and Toby Reyelts. It provides an overview of the first year of Java support on App Engine, focusing on a demo app that shows off a number of features of the Java runtime.

At first glance, this is definitely a talk for the Java programmers amongst us, and it certainly has a lot of content on those lines; the demo shows off to good advantage a number of the App Engine APIs. The secret hidden surprise, though, is something that will be of interest to nearly everyone: Details about the forthcoming channel API. The channel API implements the promised support for Comet on App Engine. For the juicy details, jump to 10:49, and keep watching up to 15:10. There's more details in Moishe's talk on building Real-time Webapps in App Engine, the video for which will be going up later today, and which we'll cover in a future post.

The other part that will be of interest to everyone starts at 42:52, and details more upcoming improvements to App Engine. Some of these are Java-specific, such as reduced JIT and GC time, but others are likely to apply to everyone, such as improved API call and I/O latency.

<object height="385" width="640"><param name="movie" value="http://www.youtube.com/v/vm-RdH_eWK4&amp;hl=en_US&amp;fs=1&amp;"><param name="allowFullScreen" value="true"><param name="allowscriptaccess" value="always"><embed allowfullscreen="true" allowscriptaccess="always" height="385" src="http://www.youtube.com/v/vm-RdH_eWK4&amp;hl=en_US&amp;fs=1&amp;" type="application/x-shockwave-flash" width="640"></embed></object>

by Nick Johnson at June 04, 2010 01:08 PM

June 03, 2010

Nick Johnson

Google I/O playlist, day 2: Google Storage for Developers

This is the second in a series of posts providing a day-by-day playlist to help break up the Google I/O session videos - specifically the App Engine ones - into manageable chunks for those that haven't seen them.

Today's video is on Google Storage for Developers. Google Storage for Developers is a new API for storing, serving, and sharing large numbers of files, of nearly any size. While Google Storage isn't directly linked to App Engine, it's likely to be of interest to many App Engine developers. It's also the foundation for the BigQuery and Prediction APIs, which I discussed earlier.

The talk is language-agnostic, so both Python and Java developers will find it of interest. Currently, the only officially supported client library is the Python one, but the protocol is deliberately compatible with other similar services, so the Boto library works as-is, as will other libraries. Further, the API is simple enough that I expect Google-Storage-specific libraries will quickly spring up in most major languages.

<object height="385" width="640"><param name="movie" value="http://www.youtube.com/v/GLbRvcbkAwE&amp;hl=en_US&amp;fs=1&amp;"><param name="allowFullScreen" value="true"><param name="allowscriptaccess" value="always"><embed allowfullscreen="true" allowscriptaccess="always" height="385" src="http://www.youtube.com/v/GLbRvcbkAwE&amp;hl=en_US&amp;fs=1&amp;" type="application/x-shockwave-flash" width="640"></embed></object>

by Nick Johnson at June 03, 2010 01:59 PM

Airs – Ian Lance Taylor

Shrek Distances

I’m sure I’m not the only person troubled by the loose geography in the Shrek movies. In the first movie Shrek takes a couple of days to get to the dragon’s castle. In the second movie Shrek and Fiona appear to take a few days to get from Shrek’s swamp to Far Far Away by coach. However, in the fourth movie, Shrek walks from his swamp to the dragon’s castle and then to Far Far Away all in a single night. There is no explanation. How is this possible?

Even allegorical fairy tales are weakened by a lack of internal story consistency.

by Ian Lance Taylor at June 03, 2010 12:23 AM

June 02, 2010

Nick Johnson

Google I/O playlist, day 1: Appstats

With the full videos of the App Engine sessions at I/O now released, those who missed I/O, or, like me, didn't have time to catch all the sessions can now catch up on what they missed out on. The list is quite intimidating, though - there's a lot of content there, and most of us have a limited amount of time to spend watching each day.

With that in mind, over the next couple of weeks, I'm going to be posting session videos, one or two each day, in an order that's hopefully useful and makes sense to people. I'll also explicitly note who'll find each video of interest, and what other videos, if any, you should watch first.

First in the lineup is Guido's Appstats talk. This talk will be of interest to both Python and App Engine developers, at all levels of skill. There's some Python-specific code in the talk, but it mostly applies to both languages.

<object height="385" width="640"><param name="movie" value="http://www.youtube.com/v/bvp7CuBWVgA&amp;hl=en_US&amp;fs=1&amp;"><param name="allowFullScreen" value="true"><param name="allowscriptaccess" value="always"><embed allowfullscreen="true" allowscriptaccess="always" height="385" src="http://www.youtube.com/v/bvp7CuBWVgA&amp;hl=en_US&amp;fs=1&amp;" type="application/x-shockwave-flash" width="640"></embed></object>

What's that, you say? You already know about appstats? Even if you do, there's almost certainly something you didn't know in Guido's talk. If you're already familiar with the basics, you might want to skip to 29:41, where Guido talks about some of the more advanced configuration options, followed by a discussion of Jens' excellent Patterns of Doom. Of course, it doesn't hurt that he gives me a shout-out at 44:07. ;)

by Nick Johnson at June 02, 2010 12:06 PM

June 01, 2010

Nick Johnson

The wonders of HDMI

Recently, we embarked upon an update of our 'media center' / 'home theater' setup. Our projector had reached the end of its 2000 hour bulb life, and since it was a cheap projector to start, a replacement bulb would've cost nearly as much as the projector itself did a couple of years ago. Also, it's become extremely unreliable - 9 times out of 10, pressing the power button results in a brief flash of light from the bulb, followed by nothing. Plus, we promised ourselves that when the bulb ran out, we'd buy an HD projector, as they ought to be affordable by then.

Well, that time has come, and we've done the upgrade. Since we were upgrading to HD, it seemed necessary to get an AV receiver that supports HDMI to replace our 2-channel amplifier and let us switch audio and video at the same time. And if we have that, we may as well get a blu-ray player, too - after all, they're pretty cheap now.

Here's our new setup:

Going by the letters, we have:

  1. Our HTPC media box (Existing)
  2. Netgear XAVB1004 powerline networking switch (Existing)
  3. Philips BD3000/05 Blu-ray player (New)
  4. Yamaha RX-V367 A/V Receiver (New)
  5. XBox 360 (Existing)
  6. Benq W1000 1080p HD Projector (New)

The point of all this, though, is not just to show off our new gear; it's because I wanted to write about how impressed I am by HDMI. Not the Copy-Prevention BS it supports, but rather the mere fact of combining, for the first time since SCART, both audio and video in a single cable, and doing it right.

Instead of anywhere from 1 to 4 wires for video, and another 2 to 6 wires for audio, a single connector does it all, digitally, and in a manner that is forward compatible enough to continue to work for the forseeable future. Wiring up all this new gear was a pleasure, and for the first time ever, we have a setup that isn't an embarrassing mess of spaghetti wiring. Here's the back of the unit:

Granted, it's not quite as tidy as some, but I'm pretty happy with it. The four HDMI cables you can see emerging from the AV switch (D) consist of all the connections I need - one to each of the AV devices we use, and one output one to the projector. Ethernet connections from the switch to those same three devices occupy most of the rest of the clutter, as well as power cables, which are all neatly trunked out-of-frame on the left. Our dining table, pictured in the background, is not nearly as neat!

Incidentally, if you're planning on setting up audio out over HDMI in linux, this page is very much your friend. Specifically, one single sentence under "Side Notes":

On some systems with e.g. an HDA Intel sound card, HDMI sound transmission only works if the video is also transmitted

This one had me troubleshooting, debugging, and scratching my head for a good couple of hours. All the obvious things - drivers, kernels, unmuting the audio output - are mentioned all over the web, but only there did anyone think to mention that, oh, by the way, this AV connector only does A if it's also doing V. Doh.

by Nick Johnson at June 01, 2010 05:00 PM

Google's Go Guide

Goプログラミング言語仕様更新のお知らせ(May 24, 2010版)

Go言語仕様の原文更新(May 24, 2010版)に伴ない、日本語訳を更新しました。

【原文】The Go Programming Language Specification

今回は、特筆すべき変更点はありません。

by noboru at June 01, 2010 02:23 PM

Airs – Ian Lance Taylor

GCC in C++

I’m very pleased to see that the GCC steering committee has agreed to permit GCC to be written in C++. At one time RMS, who is a member of the steering committee, had felt that C++ was never appropriate for systems programs like GCC. It’s good to see that he has apparently come around.

There has been a long effort to prepare for this, by moving GCC’s code base from C to the common subset of C and C++. While people naturally think of C++ as an extension to C, they are different languages and there is a lot of C code which is not valid C++. In the GCC code base, one of the biggest issues was that enums are more restricted by the type system in C++ than they are in C. Another was that in C++ you may not use the same name as a typedef and a struct tag, except for the special case of making the struct tag be a typedef for the struct itself.

Gabriel Dos Reis did the first substantial work on moving the GCC code base to the common subset, and many other people contributed. I think it’s fair to say that I did the lion’s share of the work, starting with my surprise presentation on the advantages of C++ at the 2008 GCC Summit. I did a lot of work to improve the -Wc++-compat warning option to warn about C code which was not in the C/C++ common subset, and I did a lot of work to make GCC code compile with that option without warnings.

C++ will not magically make the GCC code base better. However, I believe that it will give us some useful tools to incrementally improve the code base over time, making it easier to read, easier to modify, and more efficient. I say this not based on theory, but on my experiences with gold and with the gccgo frontend. I’ve already started writing some draft C++ coding conventions which I hope we can use to guide our efforts.

by Ian Lance Taylor at June 01, 2010 01:29 PM

May 31, 2010

Airs – Ian Lance Taylor

GCC Project

GCC as a free software project is clearly very successful. Over more than 20 years it’s grown from nothing to become the standard compiler for several operating systems and many microprocessors. So far in 2010 the core part of the compiler alone has seen over 1000 commits by over 100 contributors. GCC continues to get significant new features; e.g., the recent GCC 4.5 release includes a new link time optimization facility.

On the other hand, the GCC project has some problems. The major individual contributors to GCC are hired to work on it. That means that they have a lot of time and resources to use to improve the compiler, which is good. However, it also has some negative effects. It’s difficult for new volunteers to join the community. It’s hard for them to learn the code base and it’s hard for them to keep up with the pace of change. It’s also hard for them to learn the conventions of how the project works, and they get little help in getting their patches in. Also, the people who work on GCC have learned the intricacies of the code base over time. They do not rely on the internal documentation. The effect is that the internal documentation for some parts of the code base is quite weak, and none of the main contributors are motivated to fix it.

Another, separate, problem is that there is no single person or group with a clear ability and willingness to decide on the direction of the project. In the past the direction has been set at different times by people like Richard Stallman, Richard Kenner, Jeff Law, and Richard Henderson. None of them are playing that role today. The effect is that nobody can see whether significant new features should or should not go into the project, which leads to a tendency for inconclusive discussions and unreviewed patches. People hoping to contribute are left with no clear path forward. (I should mention that groups like the GCC Steering Committee and the GCC Release Managers are effective but do not take on this role, which is more that of an architect.)

A third problem is that GCC has no public relations activity. The project web page tells you what GCC is but says nothing about how it compares to other compilers or how it has improved over time. There are some common criticisms of GCC, such as the belief that it is measurably worse than proprietary compilers, or that it is stagnating, which the project makes no attempt to discuss or dispute.

None of these issues are critical. As I said, GCC is highly successful. But they are areas where I think GCC could improve. Unfortunately, pointing out these issues is insufficient; it’s necessary for peole to step up to take on these roles. The various companies which pay people to work on GCC are generally less interested in these aspects of the project, which makes it that much harder to find people to work on them.

by Ian Lance Taylor at May 31, 2010 04:45 AM

May 28, 2010

Nick Johnson

Exploring the new mapper API

One of the new features announced at this year's Google I/O is the new mapper library. This library makes it easy to perform bulk operations on your data, such as updating it, deleting it, or transforming/filtering/processing it in some fashion, using the 'map' (and soon, 'reduce') pattern. I'm happy to say that I'm deprecating my own bulkupdate library in favor of it.

The mapper API isn't just limited to mapping over datastore entities, either. You can map over lines in a text file in the blobstore, or over the contents of a zip file in the blobstore. It's even possible to write your own data sources - something we'll cover in a later post. Today, though, I'd like to dissect the demo that was presented at I/O. The demo uses a number of the mapper framework's more sophisticated features, so it's a good one to use to get an idea for how the framework works.

For basic usage, the Getting Started page in the mapper docs is the place to go. If you're interested in seeing something more complex in practice, read on...

The demo at I/O consisted of a "code search" system, whereby you could upload a zipfile containing source code (or other text files) to the blobstore, then find every line matching a regular expression inside that file. Without further ado, let's see the mapper function that demo used:

def codesearch((file_info, reader)):
  """Searches files in a zip for matches against a regular expression."""
  params = context.get().mapreduce_spec.mapper.params
  regex = re.compile(params["regex"])
  parent_key = db.Key(params["parent_key"])
  
  if file_info.file_size == 0:
    return

  results = model.CodesearchResults(
      parent=parent_key, filename=file_info.filename)
  file_data = reader()
  for line_no, line in enumerate(file_data.split('\n')):
    if regex.search(line):
      results.match_lines.append(line_no)
      results.matches.append(line)
  
  if results.matches:
    yield op.db.Put(results)

This is our mapper function. It gets called once for each entry in the zip file that's being processed, and is passed a single argument. When iterating over a zipfile, that argument is a tuple, consisting of a zipfile.ZipInfo object providing information about the entry, and a zero-argument function (here, called 'reader') that when called will return the entire contents of the stored file. The reason it doesn't simply pass in the contents of the entry directly is for efficiency: If you're writing a mapper that only needs to process some files in a zip, there's no point wasting time reading them all out and passing them to the mapper, just to be discarded.

The first thing our function does is get the user-defined parameters. This is a dict of values provided by the user (and as we'll see later, optionally by our code) when they started the mapreduce process. We then extract 'regex', the regular expression to search on, and 'parent_key'. parent_key provides the key of the entity under which we should create all our result entities*.

Next up, we check the length of the file. If it's empty, we skip it. The main reason for this check is that directories are also zip entries, and this is the easiest way to skip over directories, since we don't care about directories or empty files.

Now we do the real work. First, we create a results entity under the parent key we retrieved earlier. Then, we call the passed-in reader function to retrieve the contents of the file, and we iterate over each line of it. Inside the for loop, we apply the regular expression to each line, and if it matches, we update the result entity with the line the match occurred in, and the contents of that line.

You'll note that we're compiling the regular expression afresh with each call to the mapper function. This seems inefficient, but here we're relying on a feature of the Python regular expression library: The regular expression cache. It transpires that Python caches regular expressions, so that if you call compile on the same regex twice, it will use the cached copy the second time. Since the mapper function will be called over and over again, we're able to use this to good effect.

Finally, we check if any lines matched in the file. If they did, we yield a 'Put' operation, instructing the mapper framework to write the new result entity to the datastore. If there were no matches in this file, we don't make the call, so the entity never gets written.

That's all there is to the mapper function, but there's still a little more we need to do. You'll note that we accepted as one of our parameters a 'parent_key'. It's not particularly friendly to require users to create the parent entity and supply its key, though - it'd be far more helpful if we could allow them to supply the ID of a file in the datastore, and figure out the parent key ourselves. Well, it turns out that's possible, by using a 'validator' function.

Validator functions get called once before the mapreduce starts, and are given the set of user-supplied parameters for the mapreduce, and allowed the opportunity to validate them (what a surprise, eh?) and modify them if they wish. Here's ours:

def codesearch_validator(user_params):
  """Validates and extends parameters for a codesearch MR."""
  file = model.ZipFile.get_by_id(int(user_params["file_id"]))
  user_params["blob_key"] = str(file.blob.key())

  parent_model = model.CodesearchJob(
    name=user_params["job_name"],
    file=file,
    regex=user_params["regex"])
  parent_model.put()
  user_params["parent_key"] = str(parent_model.key())

  return True

In this case, we expect the user to supply a 'file_id' parameter, which should be the id of an entity in our datastore with information about an uploaded file. From that, we extract the key of the blob to process, which allows our user to avoid having to enter that information by hand, as well. Then, we construct a new CodesearchJob entity, which will act as the parent entity we referred to above, and add its key as the 'parent_key' parameter. Finally, we return True to indicate that validation succeeded.

The one remaining components is the definition of the mapreduce. Here it is, from mapreduce.yaml:

mapreduce:
- name: Codesearch
  mapper:
    input_reader: mapreduce.input_readers.BlobstoreZipInputReader
    handler: mr.codesearch
    params_validator: mr.codesearch_validator
    params:
    - name: file_id
    - name: regex
    - name: job_name
    - name: processing_rate
      default: 10000
    - name: shard_count
      default: 20

None of this should be unexpected at this point. We give our mapreduce a name, and specify BlobstoreZipInputReader as the input reader class to use. For the handler, we provide the fully qualified name of the handler function we defined above, and likewise for the params_validator, we supply the fully qualified name of our validator function. Finally, we define a set of parameters that may be provided by the user when they create the mapreduce: file_id, regex, job_name, processing_rate, and shard_count.

The eagle-eyed amongst you may have noticed that the last two parameters appear nowhere in our code. That's because they're used by the mapper framework to decide how many shards to start up, and how fast to process entities. There are several parameters of this type -we already saw another one in the validator function, 'blob_key', which is used by the input reader.

That's all for today. In a future post, we'll take a look inside the BlobstoreZipInputReader, and discuss how to write your own input reader class for the mapper framework.

by Nick Johnson at May 28, 2010 05:22 PM

Go's official blog

Go at I/O: Frequently Asked Questions

Among the high-profile product launches at Google I/O last week, our small team gave presentations to packed rooms and met many present and future Go programmers. It was especially gratifying to meet with so many people who, after learning a bit about Go, were excited by the potential benefits (both immediate and long-term) they could gain from using it.

We were asked a lot of good questions during I/O, and in this post I'd like to recap and expand upon some of them.

How suitable is Go for production systems?
Go is ready and stable now. We are pleased to report that Google is using Go for some production systems, and they are performing well. Of course there is still room for improvement - that's why we're continuing to work on the language, libraries, tools, and runtime.

Do you have plans to implement generics?
Many proposals for generics-like features have been mooted both publicly and internally, but as yet we haven't found a proposal that is consistent with the rest of the language. We think that one of Go's key strengths is its simplicity, so we are wary of introducing new features that might make the language more difficult to understand. Additionally, the more Go code we write (and thus the better we learn how to write Go code ourselves), the less we feel the need for such a language feature.

Do you have any plans to support GPU programming?
We don't have any immediate plans to do this, but as Go is architecture-agnostic it's quite possible. The ability to launch a goroutine that runs on a different processor architecture, and to use channels to communicate between goroutines running on separate architectures, seem like good ideas.

Are there plans to support Go under App Engine?
Both the Go and App Engine teams would like to see this happen. As always, it is a question of resources and priorities as to if and when it will become a reality.

Are there plans to support Go under Android?
Both Go compilers support ARM code generation, so it is possible. While we think Go would be a great language for writing mobile applications, Android support is not something that's being actively worked on.

What can I use Go for?
Go was designed with systems programming in mind. Servers, clients, databases, caches, balancers, distributors - these are applications Go is obviously useful for, and this is how we have begun to use it within Google. However, since Go's open-source release, the community has found a diverse range of applications for the language. From web apps to games to graphics tools, Go promises to shine as a general-purpose programming language. The potential is only limited by library support, which is improving at a tremendous rate. Additionally, educators have expressed interest in using Go to teach programming, citing its succinct syntax and consistency as well-suited to the task.

Thanks to everyone who attended our presentations, or came to talk with us at Office Hours. We hope to see you again at future events.

The video of Rob and Russ' talk will be uploaded to YouTube within the next week, and will then be posted on this blog.

by Andrew Gerrand (noreply@blogger.com) at May 28, 2010 05:45 AM

May 27, 2010

Phil Bayfield » Go

GoMySQL 0.2

Today the latest version of GoMySQL is available from the Github repository.

This is a beta version and can be obtained by checking out the master branch.

This version includes bug fixes and standardisation of function format, it will also include some new features which will be added over the next few days.

Share/Bookmark

by Phil at May 27, 2010 05:48 PM

Airs – Ian Lance Taylor

Iron Man 2

A few thoughts on Iron Man 2.

I liked it.

How odd to see a decent romantic comedy mixed into a superhero movie. Most recent romantic comedies have been terrible. Forgetting Sarah Marshall wasn’t too bad, but the last one I can remember as being solidly good was Fifty First Dates.

The movies was much more like a comic book than most, with a few scenes of subplots tossed in every so often. In comic books it works because you get more of the story every month. Can they really make that pay off in other movies which are at best a year later? Or is it mainly aimed at people who read the adaptations?

Don Cheadle and Scarlet Johansson did good jobs with minor characters, which shows the importance of getting good actors. Samuel L. Jackson was amusing as always. Mickey Rourke was excellent.

The final scene, after the credits, sets up for Thor, which according to IMDB is going to be a movie next year. Thor is everybody’s favorite Norse god, but he’s a much weaker character than Iron Man. He has no character weaknesses, except for a tendency toward bravado which becomes rapidly uninteresting. His difficulties are all structural: take away his hammer and he turns back to human. All the best Thor comic book stories are very long, very cosmic, and concentrate mainly on the characters around him. The very best one, the multi-year epic by Walter Simonson, starts off by finding a character who is an even better Thor than Thor himself, and has a whole issue in which Thor does not appear at all. None of this suggests a good movie to me. IMDB does list Kenneth Branagh as director; he’s made some great movies (my favorite is Much Ado About Nothing) and some very weak ones (Frankenstein).

by Ian Lance Taylor at May 27, 2010 04:29 AM

May 26, 2010

Airs – Ian Lance Taylor

Libertarian Civil Rights

The recent clamor over Rand Paul’s comments on the Civil Rights Act were a useful indicator of one of the problems with the libertarian approach to society. Paul was clear, in retrospect, that he supported the Civil Rights Act, but he was also clear that he was concerned about its effect on business owners.

Any society is a balancing of rights among all its members. All societies agree that people have different rights in different roles. The complex cases for societies are deciding what to do when those rights conflict. The libertarian point of view tends to emphasize one particular right over all others: the right to private property. But that is not the only right in our society, and cases like segregated lunch counters give that a nice clarity. If a business is open to the public, then if we are a member of the public, we have the right to expect it to be open for us. There are a number of ways which society permits business to discriminate; most obviously, businesses may discriminate against people without money. But society does not permit businesses to discriminate against people on the basis of skin color. This is not a grey area.

If you focus only on the right to private property, then the ability of businesses to discriminate against customers is a troubling case. That is how Paul got into trouble and had a hard time giving a clear answer to a relatively simple question. If you consider this issue as a balancing of rights, then there is no difficulty.

There are certainly hard cases in rights balancing; this just isn’t one of them. A hard case is how much accommodation a small business must provide a disabled customer. E.g., we all agree that the business must serve someone in a wheelchair, but is a business required to make it possible for that person to get to all parts of the store?

If Paul wants to get elected and be an effective senator, he must not only learn to answer simple questions in a straightforward way. He must also learn that the role of the politicians is to balance rights, not to promote one specific right over all others.

by Ian Lance Taylor at May 26, 2010 01:25 PM