UTF-8 encoding/decoding in C

I was working on a simple database-to-Excel-XML exporter the other day and decided to write it in C. The problem was that the Swedish language contains non-ASCII characters, so the output needs to be UTF-8 encoded. C doesn’t have a built-in function for this (or so it seems, I should add, since I’m a C rookie), and no matter how I searched Google I couldn’t find anything useful. So I thought…

Look at PHP

…why not look at the source code of PHP and see how the PHP functions utf8_encode and utf8_decode are implemented. So I downloaded the PHP source, and with a little find . -name "*.c" -print | xargs grep "utf8_encode" I found the functions in xml.c. Thankfully they weren’t too complicated once dug out from the rest of the XML functions, so it didn’t take too long before I had them as standalone functions.

This is how they are used:

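  #include <stdlib.h>
  #include "utf8.h"

  int main(int argc, char **argv)
  {
    /* Assumes this source file is saved as ISO-8859-1 */
    char *iso_str = "Pontus Östlund";
    char *utf8_str = utf8_encode(iso_str);  /* ISO-8859-1 -> UTF-8 */
    char *decoded = utf8_decode(utf8_str);  /* UTF-8 -> ISO-8859-1 */

    /* Both results are heap-allocated, so free them when done */
    free(decoded);
    free(utf8_str);
    return 0;
  }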

And it seems to be working quite OK!
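
To show what the conversion boils down to, here is a minimal sketch. It is not the code from PHP’s xml.c; it assumes the input is ISO-8859-1, where every byte maps directly to the Unicode code point with the same value. The names latin1_to_utf8 and utf8_to_latin1 are made up for this example and are not part of utf8.h.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of utf8.h: encode an ISO-8859-1 string as
 * UTF-8. The caller owns the returned buffer and must free() it. */
char *latin1_to_utf8(const char *src)
{
    size_t len = strlen(src);
    /* Each Latin-1 byte becomes at most two UTF-8 bytes, plus the NUL. */
    char *out = malloc(len * 2 + 1);
    char *p = out;
    size_t i;

    if (out == NULL)
        return NULL;

    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)src[i];
        if (c < 0x80) {
            *p++ = (char)c;                    /* ASCII passes through */
        } else {
            *p++ = (char)(0xC0 | (c >> 6));    /* lead byte: 110xxxxx */
            *p++ = (char)(0x80 | (c & 0x3F));  /* continuation: 10xxxxxx */
        }
    }
    *p = '\0';
    return out;
}

/* Hypothetical helper: decode UTF-8 back to ISO-8859-1. Code points above
 * U+00FF and malformed sequences are replaced with '?'. */
char *utf8_to_latin1(const char *src)
{
    size_t len = strlen(src);
    char *out = malloc(len + 1);  /* output is never longer than the input */
    char *p = out;
    size_t i = 0;

    if (out == NULL)
        return NULL;

    while (i < len) {
        unsigned char c = (unsigned char)src[i];
        if (c < 0x80) {
            *p++ = (char)c;
            i++;
        } else if ((c == 0xC2 || c == 0xC3) && i + 1 < len
                   && ((unsigned char)src[i + 1] & 0xC0) == 0x80) {
            /* Two-byte sequence covering U+0080..U+00FF */
            unsigned char next = (unsigned char)src[i + 1];
            *p++ = (char)(((c & 0x03) << 6) | (next & 0x3F));
            i += 2;
        } else {
            /* Not representable in Latin-1 (or malformed): emit '?' and
             * skip any continuation bytes that follow */
            *p++ = '?';
            i++;
            while (i < len && ((unsigned char)src[i] & 0xC0) == 0x80)
                i++;
        }
    }
    *p = '\0';
    return out;
}
```

With this sketch, latin1_to_utf8("Pontus \xD6" "stlund") turns the single byte 0xD6 (Ö) into the two bytes 0xC3 0x96, and utf8_to_latin1 turns them back. Both functions return malloc’ed memory, so the caller has to free() the result.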

Sources at GitHub

8 comments

  1. This is REALLY what I need! Thanks a lot:)

  2. Now, I haven’t really tested these functions thoroughly so be careful, there might be some bugs in there!

  3. Rodrigo P. A.

    Hi, you need to free the pointers after use. For example, create this function:

    void utf8_clean(void *ptr)
    {
        if (ptr) free(ptr);
    }

    and use it like this:

    #include <stdlib.h>
    #include "utf8.h"

    int main(int argc, char **argv)
    {
        char *iso_str = "Pontus Östlund";
        char *utf8_str;

        utf8_str = utf8_encode(iso_str);
        iso_str = utf8_decode(utf8_str);
        utf8_clean(utf8_str);
        utf8_clean(iso_str);

        return 0;
    }

  4. Rodrigo P. A.

    Hi again, I found a problem when using this with Brazilian Portuguese (pt-BR). This sample is not encoded correctly:

    "Olá Mundo"

    I found a bug in the function xml_utf8_encode and changed it to:

    static char *xml_utf8_encode(const char *s, int len, int *newlen,
                                 const XML_Char *encoding)
    {
        int pos = len;
        int size;
        char *newbuf;
        unsigned int c;
        unsigned short (*encoder)(unsigned char) = NULL;
        xml_encoding *enc = xml_get_encoding(encoding);

        *newlen = 0;
        if (enc)
            encoder = enc->encoding_function;
        else
            /* If the target encoding was unknown, fail */
            return NULL;

        if (encoder == NULL) {
            /* If no encoder function was specified, return the data as-is. */
            newbuf = (char*)emalloc(len + 1);
            memcpy(newbuf, s, len);
            *newlen = len;
            newbuf[*newlen] = '\0';
            return newbuf;
        }

        /* Start with room for the input plus the terminating NUL and grow
         * the buffer on demand */
        size = len + 1;
        newbuf = (char*)emalloc(size);
        while (pos > 0) {
            c = encoder ? encoder((unsigned char)(*s)) : (unsigned short)(*s);
            /* changed: if the new buffer may be too small for the next
             * character (up to 4 bytes plus the NUL), grow it */
            if (*newlen + 4 >= size) {
                size += 16; /* add 16 bytes to the new buffer */
                newbuf = (char*)erealloc(newbuf, size);
            }
            if (c < 0x80) {
                newbuf[(*newlen)++] = (char) c;
            }
            else if (c < 0x800) {
                newbuf[(*newlen)++] = (0xc0 | (c >> 6));
                newbuf[(*newlen)++] = (0x80 | (c & 0x3f));
            }
            else if (c < 0x10000) {
                /* continuation bytes always carry the 0x80 prefix */
                newbuf[(*newlen)++] = (0xe0 | (c >> 12));
                newbuf[(*newlen)++] = (0x80 | ((c >> 6) & 0x3f));
                newbuf[(*newlen)++] = (0x80 | (c & 0x3f));
            }
            else if (c < 0x200000) {
                newbuf[(*newlen)++] = (0xf0 | (c >> 18));
                newbuf[(*newlen)++] = (0x80 | ((c >> 12) & 0x3f));
                newbuf[(*newlen)++] = (0x80 | ((c >> 6) & 0x3f));
                newbuf[(*newlen)++] = (0x80 | (c & 0x3f));
            }
            pos--;
            s++;
        }

        newbuf[*newlen] = 0;
        //newbuf = erealloc(newbuf, (*newlen)+1);
        return newbuf;
    }

    and now it works fine!

    Thank you

  5. Thanks a bunch for your contribution Rodrigo. I’ll add it to the downloadable code.

  6. Ernst Scheller

    Hello Pontus,

    I don’t quite understand Rodrigo’s changes, and as far as I can see, the modifications are not in your code. I have no example to verify Rodrigo’s changes. Will you update your downloadable code?

    Kind regards
    Ernst

  7. I had forgotten about this! Now I have changed the sources according to Rodrigo’s suggestions. But I can’t guarantee it’s bug free since I haven’t tested it thoroughly.

    https://github.com/poppa/PlayStation/tree/master/c/utf8

  8. Hello Pontus,

    Thanks a lot. I only need the decode part of it. We fight with ä, ü and ö (German). I will test some of the use cases we need, and I will also do some error checking for our purposes. You’ve saved my day ;-)

    Cu
    Ernst
