Speaking of the shackles of C: any ideas why C chose "signed int" as the default? Especially the signed part.

I read this recently

and it made me wonder: why was the signed int chosen as the default type? The “int” part I get, the “natural size” for the platform, but why signed?

Just historical accident?

Or something coming from BCPL? (“word”)

Or common in the languages of the day for the default to be signed?

Or one of the uses mentioned in the article, namely being able to indicate failure? (They missed the chance to unify the negative error return and errno, but that’s another sad story.)

Or somehow handy on the PDP-11?

It sounds like C originally lacked the unsigned type:
The Development of the C Language
says

During 1973-1980, the language grew a bit: the type structure gained unsigned, long, union, and enumeration types…

I might guess that if a language were going to provide just one of the two flavours - signed or unsigned - then providing signed types would be more generally convenient.

It looks like BCPL’s word - the only datatype, I think - is a signed type, running from minint to maxint:

The constant minint is 1<<(bitsperword-1) and maxint is minint-1. They hold the most negative and largest positive numbers that can be represented by a BCPL word. On 32-bit implementations, they are normally #x80000000 and #x7FFFFFFF.

It seems

Arithmetic overflow is undefined

Thanks! Duh, I had forgotten this paper.

This might be the crux of the matter:

I might guess that if a language were going to provide just one of the two flavours - signed or unsigned - then providing signed types would be more generally convenient.

Unsigned integers were indeed introduced some time between Unix V6 in 1975 and V7 in 1979; there are “V6.5” patches floating around that add unsigned integers to a compiler which is still substantially the V6 compiler (e.g., without real struct “types”). In the V6 sources, where unsigned math is required, variables are typically declared as char *. On the PDP-11 (but not all platforms!), this gives a pointer with a stride of 1 byte, which is effectively simple unsigned arithmetic.
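Roughly, the idiom looked like this (a hypothetical sketch of the PDP-11/V6 style; the comments describe PDP-11 behavior, so don't expect it to mean much as modern C):

    /* Hypothetical sketch of the V6-era idiom: PDP-11, 16-bit int,
       no "unsigned" keyword yet.  A char * stood in for an unsigned
       quantity, because pointer arithmetic on char has a stride of
       one byte. */
    void sketch(void)
    {
        int   count;    /* signed 16-bit value on the PDP-11            */
        char *ucount;   /* the same 16 bits, handled as if unsigned     */

        count  = 0100000;           /* bit pattern 0x8000: -32768 as int  */
        ucount = (char *)0100000;   /* the same bits, behaving like 32768 */

        if (count > 0) {            /* false on the PDP-11: sign bit set  */
        }
        if (ucount > (char *)0) {   /* true: pointer compare is unsigned  */
        }
        ucount++;                   /* "unsigned" increment, one byte     */
    }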

Possibly, as BCPL’s “word” is signed (2’s complement on the 16- and 32-bit systems I’ve used).

My suspicion is that maybe some of the early systems it ran on had support for signed arithmetic, but it’s hard to know.

The only things that seem to care about signedness (in BCPL) are the basic arithmetic and compare operations; otherwise it’s just a word. Right shift is logical, so there's no sign propagation.

The automatic signed nature did cause me a moment of head scratching recently when writing an emulator for a system that had unsigned compares (as well as signed compares).

And to add to the confusion, the default signedness of the char data type in C is implementation dependent…
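A one-liner shows it; what this prints depends entirely on the platform ABI:

    #include <stdio.h>

    int main(void)
    {
        char c = (char)0x80;   /* bit pattern 1000 0000 */

        /* Prints -128 where plain char is signed (as on x86-64 Linux)
           and 128 where it is unsigned (as on ARM Linux). */
        printf("%d\n", (int)c);
        return 0;
    }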

-Gordon

I have debugged many a bug caused by assuming this, either way.

This is one of the many reasons that I regret the current hegemony of just a couple of dominant platforms with broadly similar semantics for a lot of historically diverse platform details. I teach a systems programming course at the university level, and many of the points that I harp on as architectural concerns (endianness, alignment, and signedness, for example) simply never come up for students who encounter only x86-64 and ARM on any regular basis.

Not that many years ago it was perfectly usual for a typical programmer to encounter SPARC (big endian, very picky about alignment, signed char), PowerPC (default big endian, moderately picky about alignment, unsigned char), and x86 (little endian, alignment only a performance detail, signed char) on a daily basis, and various other platforms (Alpha, ARM, M68k, etc.) depending on position and project. Now, the whole world is little endian, indifferent about alignment, and, on x86-64 at least, signed of char.

While in some sense it’s nice to be able to ignore those details, from a practical perspective it seems obvious that this will either constrain future platforms or bite back when some platform with different behavior rises in popularity. RISC-V, for example, is picky about alignment. I know that in the 1990s, when a lot of Wintel software (and software born on x86 Linux) was being ported to PowerPC and other platforms with differing details, this caused a fair amount of consternation. I’m too young to remember the Unix Wars or the true proliferation of workstation architectures in the 80s, but of course portability was a huge concern then as well.

The TL;DR is that I think the programmers of the 201x/202x years are going to be in for a rude awakening if we ever achieve a diversity of architecture like we enjoyed at the end of the 20th century.

Strongly agreed. Monoculture is a mind killer, and a liability.

I’ve seen suggestions from representatives of (you’ll never guess which company) that we should all just agree on little-endian.

Well, everything I write at $work or at home I write endian-independent and alignment-safe. In general my co-workers do too, even if these days most of our customers are on Linux. We used to have customers on SPARC, Alpha, and MIPS; not anymore, but we still have some on Power/AIX. Writing endian- and alignment-independent code isn’t exactly hard to do, so at least for me that practice won’t change.
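The kind of thing I mean is just a few lines; read_le32 and load_u32 here are illustrative helper names, not anything from a real code base:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Read a 32-bit little-endian value from a byte buffer.  Shifting
       bytes into place gives the same answer on big- and little-endian
       hosts, so the host's byte order never enters the picture. */
    static uint32_t read_le32(const unsigned char *p)
    {
        return (uint32_t)p[0]
             | ((uint32_t)p[1] << 8)
             | ((uint32_t)p[2] << 16)
             | ((uint32_t)p[3] << 24);
    }

    /* Alignment-safe load: copy instead of casting the pointer, so it
       works even if p is not 4-byte aligned. */
    static uint32_t load_u32(const void *p)
    {
        uint32_t v;
        memcpy(&v, p, sizeof v);
        return v;
    }

    int main(void)
    {
        unsigned char buf[5] = { 0, 0x78, 0x56, 0x34, 0x12 };
        printf("%08x\n", (unsigned)read_le32(buf + 1)); /* 12345678 everywhere  */
        printf("%08x\n", (unsigned)load_u32(buf + 1));  /* host order, but no
                                                           misaligned access     */
        return 0;
    }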

I am envisioning a qemu setup with a big-endian, strict-alignment target, to trap lax code.

The TL;DR is that I think the programmers of the 201x/202x years are going to be in for a rude awakening if we ever achieve a diversity of architecture like we enjoyed at the end of the 20th century.

Ostensibly that’s true. But I think the days are long past when variations in signedness and endianness across architectures conferred an actual benefit (if in fact they ever did). Today there is effectively no value, or even negative value, in a new architecture adopting signedness or endianness defaults different from the prevailing platforms. To the extent that this discourages a proliferation of architectures, it is a good thing, as it reduces the complexity burden on the software engineer, who is already straining under the demands of modern software engineering.

I think data alignment is in a slightly different boat, as it can have a broader impact on hardware architecture, resulting in meaningful cost vs. performance trade-offs for programmers. Even so, the vast majority of software can be written to a single alignment standard (natural alignment) without consequence, giving new architectures little incentive to deviate from that norm.
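To make “natural alignment” concrete, here is a small illustration (the offsets assume a typical ABI where a 4-byte integer wants 4-byte alignment):

    #include <stdio.h>
    #include <stddef.h>
    #include <stdint.h>

    /* With natural alignment each member sits at an offset that is a
       multiple of its own size, so the compiler pads after 'tag'. */
    struct record {
        char     tag;    /* offset 0, then (typically) 3 bytes of padding */
        uint32_t value;  /* offset 4, naturally aligned                   */
    };

    int main(void)
    {
        printf("offset of value: %zu\n", offsetof(struct record, value)); /* 4 */
        printf("sizeof record:   %zu\n", sizeof(struct record));          /* 8 */
        return 0;
    }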

A possible reason: because FORTRAN’s integers are signed. Before C compilers were widely useful, quite a bit of Unix software was written in FORTRAN 66, or at least ratfor. From ANSI X3.9-1966 (FORTRAN 66):

4.2.1 Integer Type. An integer datum is always an exact representation of an integer value. It may assume positive, negative, and zero values. It may only assume integral values.

This may now seem like a lost-in-the-mists-of-time reason, like tabs in Makefiles. It may have been useful for FORTRAN-generated binary fields or in-memory structures, but more likely it was helpful to avoid having to remember new numerical ranges for a new programming language.

Also, in real life, integers can be negative. If C really wanted a non-negative integral type, they could have used something like whole or natural.

I was bold enough to ask Brian Kernighan by email about this… he said he has no clear idea, but he thinks all our guesses (which I listed) have some partial validity. He suggested asking in the TUHS mailing list if other “elder statesmen” have any better recollections.

I’m not sure why choosing signed int as the default is particularly strange. Back then, programming examples were basically bits of code doing subtractions, additions, and other numeric work, and for that of course you want signed integers - you shouldn’t have to remember to tag all your declarations with “signed”.
Of course for indexing you’d be better off with unsigned integers, but indexing isn’t the major use for numbers. Try to imagine a calculator where the integers are unsigned unless you declare them to be signed…
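A tiny illustration of that calculator analogy (assuming 32-bit int, as on today’s common platforms):

    #include <stdio.h>

    int main(void)
    {
        /* With the default signed int, small subtractions behave the way
           a calculator user expects. */
        int a = 3, b = 5;
        printf("%d\n", a - b);      /* -2 */

        /* If unsigned were the default, the same expression would wrap
           around instead of going negative. */
        unsigned int ua = 3, ub = 5;
        printf("%u\n", ua - ub);    /* 4294967294 with 32-bit unsigned int */
        return 0;
    }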

I didn’t think it was strange as such, I was just being curious about the background.

I was surprised to learn of the three-fold nature of char!

I do recall a friend of mine was miffed to have an endianness issue in his code, as he thought he’d been straightforward. I think he’d written on an x86 (Linux) and I was trying to run his code on a SPARC (SunOS).

(We also learn that sometimes - these days, usually - it’s the platform, not the CPU, which chooses an endianness. Here’s a story about Apple’s journey which has left traces in the header of the Universal Binary format.)

Note that “integer” was assumed to be signed back then. That is why Intel created the “ordinal” data type for its iAPX432 processor (started in 1976, released in 1981):

  • character: 8 bits, for text and booleans
  • short ordinal: 16 bits, unsigned
  • ordinal: 32 bits, unsigned
  • short integer: 16 bits, signed
  • integer: 32 bits, signed
  • short real: 32 bits, IEEE floating point
  • real: 64 bits, IEEE floating point
  • temporary real: 80 bits, floating point

The 432 allowed logic operations on ordinals, but not on integers. On the other hand, the “neg” instruction was invalid for ordinals.

The main difference between signed and unsigned integers is in comparisons, so having both implies two sets of those. Shifting to the right is also different for signed and unsigned.
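For concreteness, a small C sketch of both differences, assuming 32-bit int (the right-shift behavior for negative signed values is only “commonly” arithmetic; the standard leaves it implementation-defined):

    #include <stdio.h>

    int main(void)
    {
        int          si = -1;
        unsigned int ui = 1;

        /* Mixed comparison: the usual arithmetic conversions turn -1
           into UINT_MAX, so the "obvious" answer is not what you get. */
        printf("%d\n", si < ui);           /* 0: compared as unsigned    */
        printf("%d\n", si < (int)ui);      /* 1: compared as signed      */

        /* Right shift: logical for unsigned operands, implementation-
           defined for negative signed values (commonly arithmetic,
           i.e. the sign bit is copied in). */
        printf("%u\n", (unsigned)si >> 1); /* 2147483647 with 32-bit int */
        printf("%d\n", si >> 1);           /* usually -1                 */
        return 0;
    }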

For the Inmos Transputer, the designers decided that the main use for unsigned integers was addresses, so they defined their address space as signed (0 is in the middle of the memory map) to avoid having to do everything twice. They quickly added a set of instructions for unsigned integers, however, when they figured out that some clients would want to run C on their chip instead of their own Occam language.

And Pascal had integers and ordinals.

All of the two’s complement machines I know can juggle signed integers just as well as unsigned integers, as long as the comparison-and-branch functionality is there or can be synthesized economically. If the added functionality is there for free or nearly free, why not use it?

What about shifting? What does the C standard say about that?
Ben.