Desirable language features


Syntax

Prefix arithmetic

Syntax and semantics are not entirely independent.  At the minimum, there should not be multiple syntaxes to describe the same concept.  But currently, many languages have not two but three different ways to describe function application: ordinary prefix calls (func(arg1, arg2)), infix operators (arg1 + arg2), and method calls (obj.func(arg2)).

It is a sad historical accident that arithmetic is usually written infix.  Getting rid of operator precedence is such a huge gain that switching to pre- or postfix notation would be worth it for this alone.  Further, both are much less ambiguous and more machine-friendly than infix: a prefix expression maps trivially onto the operation tree, and postfix operations can be implemented with a stack machine very easily (which is why old HP calculators famously used postfix).

Since a programming language needs functions that accept a variable number of arguments, prefix notation with explicit parentheses is probably the best choice.  Which is why it is already so widespread: func(arg1, arg2).  If arithmetic were expressed the same way, other gains would naturally follow.
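In fact Python already exposes its arithmetic operators as ordinary prefix functions, so the infix expression 2 * 3 + 4 can be written as an explicit operation tree:

    from operator import add, mul

    # Prefix form of 2 * 3 + 4: the nesting *is* the operation tree,
    # so no precedence rules are needed.
    print(add(mul(2, 3), 4))   # 10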

Accidental elegance

If arithmetic operations were moved to the front, the boundaries of tokens would become easier to determine.  Notably, the former operator characters could become legal parts of variable names.  In particular, the hyphen could be used in long-variable-name much the same way it is used in natural text, and programmers would not have to put up with long_variable_name or camel-cased LongVariableName.

Also, we would no longer have to fret about overloading the operators to do addition "properly".  With + being just another ordinary character, the + function can be overloaded just like any other, user-defined or not.

What about struct field access, you ask?  After all, the obj.func(arg2) syntax grew out of struct.field.  Well, we can certainly follow this thread backwards, and it leads us out of the labyrinth nicely: field(struct).  What is especially appealing is that accessing the field directly or through accessor functions is no longer syntactically different.  obj.getfoo(), obj.setfoo(value), and obj.foo all melt away into foo(obj), which returns a reference so we can read or set it.
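A minimal sketch of the idea in Python (Ref, foo, and Thing are made-up names for illustration, not an existing API):

    class Ref:
        # A settable reference to one field of one object.
        def __init__(self, obj, name):
            self._obj, self._name = obj, name
        def get(self):
            return getattr(self._obj, self._name)
        def set(self, value):
            setattr(self._obj, self._name, value)

    def foo(obj):
        # Direct field access and accessor functions look identical to the caller.
        return Ref(obj, "foo")

    class Thing:
        def __init__(self):
            self.foo = 0

    t = Thing()
    foo(t).set(42)
    print(foo(t).get())   # 42

Whether foo(obj) reads the field straight out of the struct or runs arbitrary accessor code becomes an implementation detail of foo.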

Stamping out more reserved syntax

So far so good, but we still have things like array[index].  Which was acceptable when it was only used for this purpose, and would not be conflated with anything else.  Enter C++ and friends, bringing along operator[](int) and equivalents.  Making this operation overloadable for user-defined types is a fine idea, but please, don't introduce yet another incompatible syntax for function call in the process.  What about aref(array, index) instead?
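A toy sketch in Python (aref is a hypothetical name): indexing becomes an ordinary function, so user-defined types overload it the same way they overload any other function.

    def aref(container, index):
        # Just an ordinary function call; no special bracket syntax needed.
        return container[index]

    print(aref([10, 20, 30], 1))    # 20
    print(aref({"a": 1}, "a"))      # 1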

Nesting block comments

Programmers often deal with pairs of matching delimiters, and (quite naturally) come to expect them to behave the same way.  Whether single-character, like ()[]{}, or compound (e.g. HTML tags), most of them nest in most of their uses.  This is so common because it is very computer-friendly: being unambiguous and trivially transformable into a tree (or into an error) are generally recognised as beneficial.  Block comments should be no exception, yet in C, /* */ does not nest, so commenting out a region that already contains a comment fails.

Delimiters

In many languages (especially those with infix arithmetic), tokens can begin or end at a great variety of characters: operators (e.g. +-*/%), commas, semicolons, various parentheses (()[]{}<>), and of course whitespace.  However, in many cases whitespace is mostly irrelevant.

Recently, some languages (notably Python) have embraced the opposite philosophy about whitespace.  Having indentation define blocks can sound great at first, and it is a valuable teaching tool, but it gets in the way of "serious" programming.  In languages with explicit block-delimiting tokens, refactoring is a lot easier: any block can be taken from anywhere and pasted somewhere else, and if the indentation is off, IDEs and smart editors can realign it automatically.  No such luck in Python: the programmer has to adjust the indentation by hand, and woe betide him if he forgets where the block ends.

Thus I support block structure defined by brackets.  However, beyond the designated parentheses, we should not have multiple different characters delimiting identifiers.  (This is another reason to get rid of infix syntax.)  Comma, semicolon, whitespace: pick one and stick with it.

Syntax sugar

Too much syntax turns a language into a candygrammar.  It can also rot your teeth (if you don't use enough FLOSS) and give your code syntax diabetes. 


Semantics

Let us leave the eye candy behind and delve deeper.  Perhaps I should first mention some features that are already widely available.

Dynamic typing

What if we took variables not to name a certain memory address, but held everything through references (pointers)?  All variables, regardless of their type, would look pretty much the same.  But then, why should their type be determined at compile time, if it is their values that have types?  Indeed, why should variables have a type at all?
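This is exactly how Python behaves: a name is just a reference, and only the values it points at carry types.

    x = 42          # x refers to an int
    x = "hello"     # the same name now refers to a str
    x = [1, 2, 3]   # and now to a list; values have types, names do not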

Garbage collection

Manually keeping track of all resources is both time-consuming and hard.  In a largely unrelated article, Joel Spolsky describes well some of the things garbage collection allows.

He also writes that (in his experience) development in a language with garbage collection goes roughly three or four times faster than in one without.  Of course, GC is not the only feature that makes programming easier; more features mean further gains.  Some sources state that differences of over an order of magnitude are possible.

Of course, there is no silver bullet.  Each kind of garbage collection has some downside that can occasionally jump out at you.

Cutting-edge stuff

First-class functions

Somewhat surprisingly, C's function pointers are halfway to what we want.  The reason you might not have heard much about them is that, being only halfway there, they are very unsafe and cannot do several things we want.  But you might have come across them without knowing: how else does qsort get your comparison function?

First-class functions mean that you can treat a function in pretty much the same way you can treat any other type of value: you can assign a function as a variable's value, pass a function as an argument in a function call, and return a function from a function.

Having first-class functions naturally leads to using higher-order functions: functions that take other functions as arguments, or return them.  You might not see it immediately, but this eliminates several patterns; Strategy and Command, for instance, collapse into plain function arguments, as the sketch below shows.
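A small Python illustration, using nothing beyond the standard library: the sorting "strategy" is just a function handed to another function.

    def apply_twice(f, x):
        # A higher-order function: takes a function and calls it.
        return f(f(x))

    print(apply_twice(lambda n: n * 2, 10))          # 40

    # The Strategy pattern in one line: the comparison key is a plain function.
    print(sorted(["ccc", "a", "bb"], key=len))       # ['a', 'bb', 'ccc']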

Lexical closures

Once you can create lambda functions and return them, you run into a problem.  The function you are returning may reference variables from the enclosing function.  An example (in Python) should help:
    def counter_factory(initial):
        foo = [initial]
        def counter(i):
            foo[0] += i
            return foo[0]
        return counter

This is not pie in the sky, this is completely legal Python.  And the same thing, with a more elegant approach, in javascript:
    function counter_factory(foo) {
        return function (i) {
            return foo += i
        }
    }

As you can see, the returned inner function references foo, but no one else can see it (because it is not in scope).  This is actually a great thing, because now functions can have state between successive calls, if necessary.  However, this state is neither visible from the outside, nor shared between instances of the function - which are the main problems plaguing globals.
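Using the Python version above, the hidden state persists between calls but stays invisible to everyone else:

    c = counter_factory(10)
    print(c(1))    # 11
    print(c(4))    # 15 - state carried over, yet no global in sight

    d = counter_factory(100)
    print(d(1))    # 101 - a separate instance, with separate state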

If you want to look at everything through object-colored glasses, then you can look at closures as taking objects and turning them from their heads (on which they have been standing so far) onto their feet.  Closures attach state to functions, as opposed to objects gluing behaviour onto state (structs).  This also means that in languages without first-class functions and closures, these features are simulated with objects - and such workarounds for the shortcomings of the language are proudly called patterns (see the previous part).

Expressions

The distinction between expressions and statements is an unfortunate historical accident.  It appeared in the earliest programming languages (notably Fortran) because those were little more than "assembly with math".  Expressions were necessary for math, but there was little point in anything else (such as control flow constructs, i.e. goto) returning a value, since nothing could be waiting for one.  In block-structured languages (starting with Algol 60) this limitation disappeared, yet the distinction remains.  It became so entrenched that Python has separate exec and eval.
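Python makes the split tangible: eval handles expressions, exec handles statements, and the two are not interchangeable.

    print(eval("1 + 2"))       # expressions produce a value: 3
    print(exec("x = 1"))       # statements do not: None
    # eval("x = 1")            # an assignment under eval is a SyntaxError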

The by now mostly gone distinction between functions and procedures is a good analogy for our comparison.  (Especially since user-defined functions are syntactically indistinguishable from built-in ones.)

To give a few examples (in curly-brace syntax) of how this feature would simplify code, I will start with the low-hanging fruit:
    T function(...) {
        /* something */
        expression();
    }

Which is equivalent to:
    T function(...) {
        T retval;
        /* retval = something */
        retval = expression();
        return retval;
    }

Which is a common pattern to see.  (Why patterns are evil is beyond the scope of this essay. You can read more here.)

To do something harder, assign an "arbitrary" block's value to a variable.  What about a loop's?
    var = while(condition) {
        /* something */
        expression();
    }

This is equivalent to:
    {
        T tempval;
        while(condition) {
            /* tempval = something */
            tempval = expression();
        }
        var = tempval;
    }   /* here tempval leaves scope */

The above example also answers what happens when a loop is the last element of a function: just substitute return tempval; for the last line.  (Or retval = tempval; return retval; if you are overly mechanical.)  Clearly, this is a doable but very tedious way to cure this shortcoming of programming languages.  And since tedium is not a problem for compilers, why should languages not have this feature?

3-part exception handling

Currently most languages have a two-part exception-handling system: throw and catch.  I hope I don't have to introduce them.  However, I would like to split catch in half, maybe calling them handle and restart for now.

The reason I want to split catch is that, with it in one piece, properly responding to unforeseen events (the whole reason exception handling exists) comes down to a choice between a few bad solutions.  This is because "what to do in this situation?" is a matter of policy.

Ideally, the exception should go to a high level, where the decision on how to proceed should be made.  And once that happens, execution should resume (after some cleanup) at a lower level, without any state having been lost during the decision-making upstairs.

And this is exactly what handle and restart allow.  handle gets the exception, and can proceed to do anything with it.  It can even ask the user!  If it declines to decide, the exception resumes going up the stack, looking for a new handle clause.

The stack is only unwound after a choice has been made.  After automatic cleanup (destructors of objects leaving scope, etc.) takes place, execution resumes at the chosen restart block, after which things keep running just as they do after today's catch clauses.
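Ordinary Python cannot defer unwinding, but a toy sketch (all names here are invented for illustration) can at least show the division of labour: the low level offers named restarts, and the high level picks one as a matter of policy.

    class Condition(Exception):
        def __init__(self, message, restarts):
            super().__init__(message)
            self.restarts = restarts    # name -> callable, offered to the handler

    def parse_number(token):
        # Low level: it knows *how* it could continue, but not *which* way is wanted.
        try:
            return int(token)
        except ValueError:
            raise Condition("bad token: " + token,
                            {"use-zero": lambda: 0,
                             "skip": lambda: None})

    def parse_all(tokens):
        # High level: policy lives here.  (Unlike a real handle/restart system,
        # Python has already unwound the stack by the time we get the condition.)
        out = []
        for t in tokens:
            try:
                out.append(parse_number(t))
            except Condition as c:
                out.append(c.restarts["use-zero"]())
        return out

    print(parse_all(["1", "oops", "3"]))   # [1, 0, 3]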

Multiple return values

This is going to be heavy on low-level stuff.

The same mechanisms that allow functions to take multiple arguments also allow them to return multiple values.  Whether they are pushed onto the stack, into register windows (on RISC architectures), left in registers, somewhere out on the heap, or a combination of these, there is no reason why a function could not work the same way with parameters and results.  This is more than returning C structs, or Python tuples.
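Python, for instance, only approximates the feature: a function that appears to return several values really returns one tuple object, which the caller then unpacks.

    def minmax(seq):
        # Looks like two return values, but is actually one packed tuple.
        return min(seq), max(seq)

    lo, hi = minmax([3, 1, 4, 1, 5])
    print(lo, hi)   # 1 5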

Of course, in a stack-based approach this is a perfect way to misalign the stack, if the called function pushes a different number of return values than the caller expects.  Thankfully, many calling conventions make a single party responsible for moving the stack pointer in both directions, so at least control returns to the point it jumped away from.  And of course, the problem does not exist in the other methods of passing parameters: register windows always shift by a fixed amount, the heap does not move anywhere, and neither do registers.

Dynamic scope

In the same way lexical scope (with closures) is local scope done right, dynamic scope is in a way global scope done right.  The main difference from "simple" globals is that definitions can be shadowed, and are automatically restored when the block of the shadowing definition ends.

So how does it work?  Suppose that we define a variable globally, and make it have dynamic scope.  Now, this variable will be visible from everywhere, and work just like any other global variable would.  If a function changed its value, that change would be visible to every other function.  Nothing new so far.

The new thing comes when a function shadows this variable by defining one with the same name.  As this function calls others, they all see the new variable, not the old one.  And the real trick comes when the shadowing function returns: with that, its shadowing definition ceases to exist, and the old variable, with the value it had when it was shadowed, is restored.

As you might expect, declaring a variable only to have it turn out to be a global is a very bad experience.  Thus dynamic variables should live in a separate namespace.  One way to do this has a nice tie-in with syntax: if "unusual" characters such as * are allowed in names, they can be used to mark dynamic variables.  Putting an asterisk on both ends of a name, to make sure it does not collide with normal (lexical) variables, is called the earmuff convention by some, because *foo* looks cute.
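Python has only lexical scope, but a minimal sketch (dlet and the *indent* variable are made-up names) can emulate the shadow-and-restore behaviour with an explicit binding table:

    import contextlib

    _dynamic = {"*indent*": 0}          # the dynamic namespace, earmuffs and all

    @contextlib.contextmanager
    def dlet(name, value):
        old = _dynamic[name]
        _dynamic[name] = value          # shadow the old binding...
        try:
            yield
        finally:
            _dynamic[name] = old        # ...and restore it automatically

    def report():
        print(" " * _dynamic["*indent*"] + "hello")

    report()                    # hello        (global binding: 0)
    with dlet("*indent*", 4):
        report()                #     hello    (sees the shadowing binding)
    report()                    # hello        (old binding restored)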

More theoretically, dynamic variables are the perfect counterpart to lexicals.  This is because all variables have two qualities: scope (where the variable is visible) and extent (how long it exists).

Lexical variables have limited scope (they are visible only inside the block they are defined in) but unlimited extent (with closures, they exist as long as anyone holds a reference to them).  Dynamic variables are exactly the other way around: they have unlimited scope (they are visible in the whole program) but limited extent (they cease to exist when the block that defined them returns).

Multiple dispatch

In most object-oriented programming languages, subtypes can override the methods of their base class.  When such a method is called, the implementation is chosen depending on the type/class of the first argument.  However, in practice, behavior often should depend on the types of multiple arguments, not just one.  In fact this is such a common case that a design pattern called visitor exists, to kludge a semblance of this feature into languages that lack it. 

Again we have a semantic feature with ties to the cleanliness of syntax.  When function calls look like obj.func(args) and their semantics are described with nonsense like "sending a message to the object" or "invoking a method on the object" the result is that the first argument (obj) is treated specially.  However, if functions are called like func(arg1, arg2) there is no reason to treat the first argument in any special manner.  It is only natural to do the right thing and involve the other parameters in the dispatching.
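A minimal sketch in Python (defmethod, collide, and the lookup table are invented for illustration): the implementation is selected on the types of all arguments, not just the first.

    _methods = {}

    def defmethod(*types):
        def register(fn):
            _methods[types] = fn
            return fn
        return register

    def collide(a, b):
        # Dispatch on the types of *both* arguments.
        return _methods[(type(a), type(b))](a, b)

    class Ship: pass
    class Asteroid: pass

    @defmethod(Ship, Ship)
    def _(a, b): return "bounce"

    @defmethod(Ship, Asteroid)
    def _(a, b): return "explode"

    print(collide(Ship(), Asteroid()))   # explode
    print(collide(Ship(), Ship()))       # bounce

(A production version would also search base classes; this toy matches exact types only.)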

The end

The compound effect of these features on programmer productivity can be huge.  Allow me to cite the results of Erann Gat, building on previous work by Lutz Prechelt.
