This is the mail archive of the cygwin mailing list for the Cygwin project.
Re: Grepping Unicode files?
- From: Vince Rice <vrice at solidrocksystems dot com>
- To: cygwin at cygwin dot com
- Date: Thu, 14 May 2015 11:32:26 -0500
- Subject: Re: Grepping Unicode files?
- Authentication-results: sourceware.org; auth=none
- References: <3C280897-291A-4A8C-8C3F-46D1D9BEFCFE at solidrocksystems dot com> <746170827 dot 20150514185648 at yandex dot ru>
On May 14, 2015, at 10:56 AM, Andrey Repin <anrdaemon@yandex.ru> wrote:
>
> Greetings, Vince Rice!
>
>> uname says "CYGWIN_NT-6.1 machinename 1.7.35(0.287/5/3) 2015-03-04 12:07 i686 Cygwin".
>> I'm running grep 2.21.2, which cygcheck -c says is OK.
>
>> Does Cygwin's grep support Unicode files? The output from a SQL Server SQL
>> Agent job is a Unicode file, i.e. if you look at it in a hex editor every
>> other character is 00 because each character is taking up two bytes. The
>> filename itself is fine, it's the contents that is Unicode. I can't get grep
>> to work on it, either with or without -a.
>
>> This may not be a Cygwin-specific question, but I haven't been able to find
>> anything after several Google searches, including the archives, and neither
>> --help nor the man page for grep references Unicode.
>
>> By default I have neither LC_ALL nor LC_COLLATE set.
>
>> A pointer to a better search or a website that explains this would be
>> great, or if it can't currently be done, that's OK, too.
>
> grep only treats files as text if they match the current locale.
> Check `locale` output to see your current settings.
First, to the other responder(s): running it through iconv, converting from UTF-16 to UTF-8, did work. Thanks for the pointer. (I've never had to deal with anything but ANSI files, so I didn't know about iconv. And I guessed at UTF-8, given what I found below.)
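For anyone following along, the iconv approach can be sketched roughly as below. The file name job_output.txt and the sample lines are made up for illustration, and the sketch assumes the file starts with a byte-order mark so that plain UTF-16 is detected correctly; a file without one may need UTF-16LE as the source encoding instead.

```shell
# Hypothetical sample: build a small UTF-16 file to stand in for the job output.
printf 'Job started\nStep 1 failed\nJob finished\n' | iconv -f UTF-8 -t UTF-16 > job_output.txt

# Convert to UTF-8 on the fly and let grep search the converted stream.
matches=$(iconv -f UTF-16 -t UTF-8 job_output.txt | grep 'failed')
echo "$matches"
```

The original file is left untouched; grep only ever sees the UTF-8 stream coming out of iconv.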
locale run from a cmd.exe session says that everything is "C.UTF-8", while locale run from mintty says that everything is en_US.UTF-8. A "which" in both cases shows that the locale being run is Cygwin's, so I assume mintty does something slightly different from the normal console? I don't even know if there's a difference. (Have I mentioned I don't know anything about all of this?)
From cmd.exe:
LANG=
LC_CTYPE="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_TIME="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_ALL=
From mintty:
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=
Now, pardon my continued ignorance, but which of those variables needs to be set to UTF16 in order for grep to work? And I assume it (they?) should be set to en_US.UTF-16?
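As a side note on how these variables interact, here is a small sketch rather than an answer to the UTF-16 question. It relies only on the standard precedence (LC_ALL overrides the individual LC_* categories, which override LANG) and on locale printing implied values in double quotes. The values C and POSIX are chosen only because they exist everywhere.

```shell
# With LC_ALL set, every category takes its value and is shown as implied (quoted).
forced=$(LC_ALL=C locale)
echo "$forced"

# With LC_ALL empty, an explicitly set category wins over LANG and is shown unquoted.
mixed=$(LC_ALL= LANG=C LC_CTYPE=POSIX locale)
echo "$mixed"
```

This would also explain why the two listings differ only in LANG: the quoted category values in each appear to be derived defaults rather than variables set explicitly.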
Thanks to everyone for your help. I think you've all confirmed this isn't Cygwin-specific, but I couldn't find anything even searching generically ("grep unicode" and now "grep utf16"). I did finally find an external reference to iconv, but if grep is supposed to handle this natively, I haven't been able to find much on how to do it.
--
Problem reports: http://cygwin.com/problems.html
FAQ: http://cygwin.com/faq/
Documentation: http://cygwin.com/docs.html
Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple