On optimizing keyboard layout for programming

2013-02-20

Many programmers worldwide prefer the US keyboard layout, mostly because it works best with most programmers' text editors, and because many programming languages were designed with the availability of certain keys in mind. Some non-US layouts lack important symbols; for example, there is no ~ in the Italian layout.

However, is the US layout really optimal for programming? In particular, is it reasonable to enter digits with the plain digit keys and punctuation with Shift?

I decided to count the use of the unshifted digit-row symbols ("lower": digits, plus `, - and =) versus the shifted ones ("upper": punctuation) in several popular projects.

I used this (zsh) function to count character use:

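function count-keys {
    infile=$1 ;
    # tr -d -c deletes every character NOT in the set, so wc -c counts the rest.
    # '-' goes last in the set: written as 0-= it would be read as a range.
    printf "lower keys (digits):      % 7d\n" \
        $(tr -d -c '`1234567890=-' < $infile | wc -c) ;
    printf "upper keys (punctuation): % 7d\n" \
        $(tr -d -c '~!@#$%^&*()_+' < $infile | wc -c) ;
}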
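
One detail matters in any tr-based count like this: tr reads a - between two characters as a range, so 0-= means every character from 0 to = in ASCII order, which includes : ; and <. A quick check, using a throwaway helper whose name (kept_chars) is made up for this illustration:

```shell
# Hypothetical helper: prints how many characters of the second argument
# survive when tr keeps only the set given as the first argument.
kept_chars() {
    printf '%s' "$2" | tr -d -c "$1" | wc -c | tr -d ' '
}

kept_chars '0-=' ':;<'   # the range 0-= covers : ; and <
kept_chars '0-=' '-'     # ...but not the minus sign itself
kept_chars '0=-' '-'     # a trailing - is taken literally
```

With GNU tr this prints 3, 0 and 1, so the minus sign belongs at the end of the set.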

Clojure (git)

In core.clj:

% count-keys clojure/src/clj/clojure/core.clj
lower keys (digits):         4292
upper keys (punctuation):   13978

Entire project:

% count-keys =(cat clojure/**/*.clj)
lower keys (digits):        41548
upper keys (punctuation):   71241

Across the whole project, the upper row is used about 1.7 times as often as the lower row. The ratio is higher in typical high-level code: in core.clj it is more than three to one.
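
The ratio can also be computed directly instead of being read off the two counts. A minimal sketch, assuming a hypothetical helper named key_ratio (not part of the original measurements) that uses the same two character sets, with - placed last so that tr takes it literally:

```shell
# Hypothetical helper: prints how many shifted digit-row symbols
# occur per unshifted one in the given file.
key_ratio() {
    infile=$1
    lower=$(tr -d -c '`1234567890=-' < "$infile" | wc -c)
    upper=$(tr -d -c '~!@#$%^&*()_+' < "$infile" | wc -c)
    # awk does the floating-point division; avoid dividing by zero
    awk -v u="$upper" -v l="$lower" \
        'BEGIN { if (l + 0 == 0) print "n/a"; else printf "%.2f\n", u / l }'
}
```

A value above 1.00 means the shifted symbols dominate in that file.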

Haskell (Xmonad 0.11)

We can observe the same prevalence of punctuation over digits in Haskell:

% count-keys =(cat xmonad-0.11/**/*.hs)
lower keys (digits):         3260
upper keys (punctuation):    5214

Python (NumPy git)

Supposedly, NumPy is a very digit-intensive project:

% count-keys =(cat numpy/**/*.py)
lower keys (digits):       255711
upper keys (punctuation):  280361

In practice, even with heavy use of = and - (lower row), the upper row (punctuation) is still about 10% more common.

C (Linux 3.8 and UMFPACK 5.6.1)

C seems to be roughly 50-50: the upper row is used somewhat more often in Linux, and the lower row slightly more often in UMFPACK:

% count-keys =(cat linux-3.8/**/*.[hc])
lower keys (digits):       30150650
upper keys (punctuation):  38324161

% count-keys =(cat UMFPACK/**/*.[hc])
lower keys (digits):       120075
upper keys (punctuation):  111357

C++ (boost trunk)

Interestingly, C++ breaks the pattern:

% count-keys =(cat boost-trunk/**/*.[hc]pp)
lower keys (digits):       13433937
upper keys (punctuation):  7575912

The lower row is much more popular in Boost overall, although in many individual libraries punctuation prevails:

% count-keys =(cat boost-trunk/libs/algorithm/**/*.[hc]pp)
lower keys (digits):        14168
upper keys (punctuation):   17382

% count-keys =(cat boost-trunk/libs/regex/**/*.[hc]pp)
lower keys (digits):        49490
upper keys (punctuation):   69126

Only a few Boost libraries use digits really heavily: geometry, math, multiprecision, wave. Given their names, it's easy to understand why. Still, all of them use underscore (from the upper row) more often than minus (from the lower row).
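
The underscore-versus-minus comparison is easy to repeat for any file. A small sketch, where the function name underscore_vs_minus is hypothetical and the counting is done with awk's gsub, which returns the number of matches it replaced:

```shell
# Hypothetical helper: counts underscores and minus signs in one file.
# gsub(/x/, "&") replaces each match with itself and returns the match count.
underscore_vs_minus() {
    awk '{ u += gsub(/_/, "&"); m += gsub(/-/, "&") }
         END { printf "underscore: %7d\nminus:      %7d\n", u, m }' "$1"
}
```

As with count-keys, a whole tree can be checked in zsh by passing =(cat dir/**/*.[hc]pp) as the argument.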

You may wish to find out for yourself which symbols are used most. This is the function I used to count frequencies:

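function most-used-keys {
    infile=$1 ;
    # Keep only the digit-row characters ('-' goes last so that tr reads it
    # literally, not as the range 0-=), then count each one with awk.
    # An empty FS makes every character its own field (works in gawk).
    tr -d -c '`1234567890=~!@#$%^&*()_+-' < $infile | \
        awk -vFS="" '{for(i=1;i<=NF;i++)w[$i]++}END{for(i in w) print i,w[i]}' | \
        sort -n -k 2 -r
}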

Proposal

Remap the digit row of the US layout so that it enters punctuation without Shift and digits with Shift. Consider leaving = intact, as many programming languages use it a lot.
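
One way to try this out on X11 is xmodmap. The sketch below only prints the remapping expressions and does not apply anything. It assumes the usual X.Org keycodes for a PC keyboard (10 to 19 for the keys 1 to 0), which should be verified with xev first, and the function name swapped_digit_row is hypothetical:

```shell
# Hypothetical helper: prints xmodmap expressions that swap the shifted and
# unshifted symbols on the keys 1..0. The keys ` - = are left untouched.
# ASSUMPTION: keycodes 10..19 are the digit keys (check with xev).
swapped_digit_row() {
    keycode=10
    for pair in 'exclam 1' 'at 2' 'numbersign 3' 'dollar 4' 'percent 5' \
                'asciicircum 6' 'ampersand 7' 'asterisk 8' \
                'parenleft 9' 'parenright 0'; do
        # xmodmap syntax: keycode N = <unshifted keysym> <shifted keysym>
        echo "keycode $keycode = $pair"
        keycode=$((keycode + 1))
    done
}
```

After reviewing the output, it can be applied to the running session with swapped_digit_row | xmodmap - (a lone dash makes xmodmap read from standard input).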