UTF-8 encoding/decoding in C
I was working on a simple database-to-Excel-XML exporter the other day and decided to write it in C. The problem was that, since the Swedish language contains non-ASCII characters, the output needed to be UTF-8 encoded. C doesn’t have a built-in function for this (or so it seems, I should add, since I’m a C rookie), and no matter how much I searched on Google I couldn’t find anything useful. So I thought…
Look at PHP
…why not look at the source code of PHP and see how the PHP functions utf8_encode and utf8_decode are implemented? So I downloaded the PHP source, and with a little find . -name '*.c' -print | xargs grep "utf8_encode" I found the functions in xml.c. Thankfully they weren’t too complicated once dug out from the rest of the XML functions, so it didn’t take too long before I had them as standalone functions.
This is how they are used:
#include "utf8.h"

int main(int argc, char **argv)
{
    /* Assumes this source file is saved as ISO-8859-1 */
    char *iso_str = "Pontus Östlund";
    char *utf8_str;

    utf8_str = utf8_encode(iso_str);  /* ISO-8859-1 -> UTF-8 */
    iso_str = utf8_decode(utf8_str);  /* UTF-8 -> ISO-8859-1 */
    return 0;
}
And it seems to be working quite OK!
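For readers who only want the core idea, here is a minimal sketch of the encode direction. It is not the code from utf8.h: the function name latin1_to_utf8 is made up for illustration, and it assumes the input is a NUL-terminated ISO-8859-1 string. Every byte below 0x80 is copied unchanged and every other byte becomes exactly two UTF-8 bytes, so a buffer of 2 * len + 1 bytes is always enough.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not the function from utf8.h: encodes an
 * ISO-8859-1 string as UTF-8. Returns a malloc'ed, NUL-terminated
 * string that the caller must free(), or NULL if allocation fails. */
char *latin1_to_utf8(const char *s)
{
    size_t len = strlen(s);
    /* Each Latin-1 byte becomes at most two UTF-8 bytes. */
    char *out = malloc(len * 2 + 1);
    char *p = out;

    if (out == NULL)
        return NULL;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)s[i];
        if (c < 0x80) {
            *p++ = (char)c;                    /* ASCII: copied as-is */
        } else {
            *p++ = (char)(0xc0 | (c >> 6));    /* lead byte: 110xxxxx */
            *p++ = (char)(0x80 | (c & 0x3f));  /* continuation: 10xxxxxx */
        }
    }
    *p = '\0';
    return out;
}
```

Called with the ISO-8859-1 bytes of "Östlund", this should turn the single byte 0xD6 into the two bytes 0xC3 0x96 and leave the ASCII letters untouched.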
This is REALLY what I need! Thanks a lot:)
Now, I haven’t really tested these functions thoroughly, so be careful: there might be some bugs in there!
Hi, you need to free the pointers after use. For example, create this function:
void utf8_clean(void *ptr)
{
    if (ptr) free(ptr);
}
and use it like this:
#include "utf8.h"

int main(int argc, char **argv)
{
    char *iso_str = "Pontus Östlund";
    char *utf8_str;

    utf8_str = utf8_encode(iso_str);
    iso_str = utf8_decode(utf8_str);
    utf8_clean(utf8_str);
    utf8_clean(iso_str);
    return 0;
}
Hi again. I found a problem when using this with Brazilian Portuguese (pt-BR). This sample is not encoded correctly:
“Olá Mundo”
I found a bug in the function xml_utf8_encode and changed it to:
static char *xml_utf8_encode(const char *s, int len, int *newlen,
                             const XML_Char *encoding)
{
    int pos = len;
    int size;
    char *newbuf;
    unsigned int c;
    unsigned short (*encoder)(unsigned char) = NULL;
    xml_encoding *enc = xml_get_encoding(encoding);

    *newlen = 0;
    if (enc)
        encoder = enc->encoding_function;
    else
        /* If the target encoding was unknown, fail */
        return NULL;

    if (encoder == NULL) {
        /* If no encoder function was specified, return the data as-is. */
        newbuf = (char*)emalloc(len + 1);
        memcpy(newbuf, s, len);
        *newlen = len;
        newbuf[*newlen] = '\0';
        return newbuf;
    }

    /* Start with room for the input plus the terminator (so that an
     * empty input still has space for the final 0); the buffer is grown
     * inside the loop whenever an encoded character might not fit. */
    size = len + 1;
    newbuf = (char*)emalloc(size);
    while (pos > 0) {
        c = encoder ? encoder((unsigned char)(*s)) : (unsigned short)(*s);
        /* Changed: if the new buffer is too small, grow it. The exact
         * condition was lost when this comment was posted; the check
         * below (room for 4 bytes plus the terminator) is a
         * reconstruction. */
        if (*newlen + 4 >= size) {
            size += 16; /* add 16 bytes to the new buffer */
            newbuf = (char*)erealloc(newbuf, size);
        }
        if (c < 0x80)
            newbuf[(*newlen)++] = (char) c;
        else if (c < 0x800) {
            newbuf[(*newlen)++] = (0xc0 | (c >> 6));
            newbuf[(*newlen)++] = (0x80 | (c & 0x3f));
        }
        else if (c < 0x10000) {
            /* Continuation bytes carry the 10xxxxxx prefix (0x80), not 0xc0 */
            newbuf[(*newlen)++] = (0xe0 | (c >> 12));
            newbuf[(*newlen)++] = (0x80 | ((c >> 6) & 0x3f));
            newbuf[(*newlen)++] = (0x80 | (c & 0x3f));
        }
        else if (c < 0x200000) {
            newbuf[(*newlen)++] = (0xf0 | (c >> 18));
            newbuf[(*newlen)++] = (0x80 | ((c >> 12) & 0x3f));
            newbuf[(*newlen)++] = (0x80 | ((c >> 6) & 0x3f));
            newbuf[(*newlen)++] = (0x80 | (c & 0x3f));
        }
        pos--;
        s++;
    }
    newbuf[*newlen] = 0;
    //newbuf = erealloc(newbuf, (*newlen)+1);
    return newbuf;
}
and now it works fine!
Thank you
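The branch thresholds in the function above (0x80, 0x800, 0x10000) follow the standard UTF-8 layout. As a rough sketch of that layout on its own, here is a hypothetical helper (utf8_put is not part of the posted code or of PHP) that writes a single code point and returns how many bytes it used. It assumes the caller provides room for 4 bytes.

```c
#include <stddef.h>

/* Hypothetical helper: writes code point c as UTF-8 into out, which
 * must have room for 4 bytes. Returns the number of bytes written,
 * or 0 if c is a surrogate or lies above U+10FFFF. */
size_t utf8_put(unsigned long c, unsigned char *out)
{
    if (c < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xc0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3f));
        return 2;
    }
    if (c < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (c >= 0xd800 && c <= 0xdfff)
            return 0;                     /* surrogates are not encodable */
        out[0] = (unsigned char)(0xe0 | (c >> 12));
        out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
        out[2] = (unsigned char)(0x80 | (c & 0x3f));
        return 3;
    }
    if (c <= 0x10ffff) {                  /* 11110xxx followed by 3 x 10xxxxxx */
        out[0] = (unsigned char)(0xf0 | (c >> 18));
        out[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3f));
        out[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3f));
        out[3] = (unsigned char)(0x80 | (c & 0x3f));
        return 4;
    }
    return 0;
}
```

For the á in “Olá Mundo” (U+00E1) this should produce the two bytes 0xC3 0xA1. Since a single-byte source encoding such as ISO-8859-1 never yields a code point above U+00FF, only the first two branches are reached in that case.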
Thanks a bunch for your contribution Rodrigo. I’ll add it to the downloadable code.
Hello Pontus,
I don’t quite understand Rodrigo’s changes, and as far as I can see the modifications are not in your code. I have no example to verify Rodrigo’s changes. Did you modify your downloadable code?
Kind regards
Ernst
I had forgotten about this! Now I have changed the sources according to Rodrigo’s suggestions, but I can’t guarantee they are bug-free since I haven’t tested them thoroughly.
https://github.com/poppa/PlayStation/tree/master/c/utf8
Hello Pontus,
Thanks a lot. I only need the decode part of it. We struggle with ä, ü and ö (German). I will test some of the use cases we need, and I will also do some error checking for our purposes. You’ve saved my day.
Cu
Ernst
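For the decode direction with error checking mentioned in the comment above, one possible shape is sketched below. It is not the utf8_decode from the downloadable code: the name utf8_to_latin1 and the choice to return NULL on any error are assumptions made for illustration. It accepts only ASCII and the two-byte sequences that map to U+0080..U+00FF, and rejects everything else instead of silently substituting a character.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: decodes UTF-8 to ISO-8859-1. Returns a malloc'ed
 * string the caller must free(), or NULL if the input is malformed,
 * contains a code point above U+00FF, or allocation fails. */
char *utf8_to_latin1(const char *s)
{
    size_t len = strlen(s);
    char *out = malloc(len + 1);   /* output is never longer than input */
    char *p = out;

    if (out == NULL)
        return NULL;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)s[i];
        if (c < 0x80) {
            *p++ = (char)c;        /* ASCII: copied as-is */
        } else if (c == 0xc2 || c == 0xc3) {
            /* Only lead bytes 0xC2 and 0xC3 map into U+0080..U+00FF.
             * Reading s[i + 1] is safe: at worst it is the NUL. */
            unsigned char next = (unsigned char)s[i + 1];
            if ((next & 0xc0) != 0x80) {
                free(out);         /* missing continuation byte */
                return NULL;
            }
            *p++ = (char)(((c & 0x03) << 6) | (next & 0x3f));
            i++;                   /* skip the continuation byte */
        } else {
            free(out);             /* stray continuation byte, overlong
                                      form, or outside ISO-8859-1 */
            return NULL;
        }
    }
    *p = '\0';
    return out;
}
```

With this sketch, the UTF-8 bytes for ä, ü and ö (0xC3 0xA4, 0xC3 0xBC, 0xC3 0xB6) should decode to 0xE4, 0xFC and 0xF6, while a truncated sequence or a character such as € makes the function return NULL so the caller can decide how to handle it.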