This is the mail archive of the cygwin mailing list for the Cygwin project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

[1.7] Invalid UTF8 while creating a file -> cannot delete?


After a few problems with monotone's unit tests on Cygwin-1.7, I began
searching and experimenting a bit with new 1.7 support for wide chars.

I also read the full thread about its last change:
http://www.cygwin.com/ml/cygwin/2009-05/msg00344.html
which really makes some sense to me (when I create a file from the
console I want "ls" to show back that file to me with same encoding).

Problem is, that unit test assumes filenames are "raw data" and tries to
create three types of filenames: ISO-8859-1, EUC-JP and UTF-8.
Except on OSX where it only tries UTF-8 as that's the disk format.

Now we have an UTF-16 disk format, except the library is using
LANG-value-from-process-start to initialize some LANG-to-UTF16
conversion as far as I understoof so there's not really one "correct"
format: it depends on the LANG env value when the test unit is launched.

OK, that's a side issue since I can probably modify the tests to always
be launched with LANG=C instead of using the current value so that at
least it is consitent. And then maybe remove the creation of ISO-8859-1
and EUC-JP tests just like on OSX. Which could be correct... but a bit
less so than on OSX itself, when that is really "the format" and not the
"the DEFAULT format which could be overridden with a correct setlocale".

But the real problem with that test is not really what shows and how,
the biggest problem is that it seems that filenames created with a
"wrong" filename are quite limited in usage and can't seemingly be deleted.

% export LANG=en_EN.UTF-8
% cat t.c
#include <stdio.h>
int main() {
    fopen("a-\xF6\xE4\xFC\xDF", "w"); //ISO-8859-1
    fopen("b-\xC3\xB6\xC3\xA4\xC3\xBc\xC3\x9F", "w"); //UTF-8
    return 0;
}
% gcc -o t t.c
% mkdir test ; cd test ; ../t ; cd ..
% ls -l test
ls: cannot access test/a-âââ: No such file or directory
total 0
-????????? ? ?    ?    ?                ? a-âââ
-rw-r--r-- 1 lapo None 0 2009-09-10 21:19 b-ÃÃÃÃ
% find test
test
test/a-???
test/b-ÃÃÃÃ
% find test -delete
find: cannot delete `test/a-\366\344\374': No such file or directory
find: cannot delete `test': Directory not empty
% find test
test
test/a-???

Now... I don't know how exactly `find` works but it seems strange to me
it isn't capable of deleting something it is capable of listing.
Also seems strange `ls` is not capable of stat-ing something it's
capable of listing.

Yep, I do know that filename is "broken" in the first place, but since
in the Unix world such stuff can happen as filenames are really raw
data, I think probably an error on file creation would be better than
creating a file that can't be consequently stat-ed or even unlinked.

% cat u.c
#include <stdio.h>
int main() {
    remove("a-\xF6\xE4\xFC\xDF");
    remove("b-\xC3\xB6\xC3\xA4\xC3\xBc\xC3\x9F");
    return 0;
}
% gcc -o u u.c

OK, a program using a similarly-broken filename can delete it, but the
fact it can't be deleted with "normal" tools is a bit of an inconvenience...

-- 
Lapo Luchini - http://lapo.it/

âPremature optimisation is the root of all evil in programming.â (C. A.
R. Hoare)


--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]