2 votes

Comment compter tous les caractères d'un fichier, y compris les caractères de contrôle et les caractères Unicode ?

Tout d'abord, je vous prie de m'excuser pour cette longue question. Je cherchais un script qui détaillerait par caractère tout ce qui se trouve dans un fichier. Je suis tombé sur un script et j'ai décidé de l'étendre pour montrer les caractères de contrôle et l'unicode. Voici ma tentative, mais elle n'est pas tout à fait correcte. Je demande donc de l'aide. J'ai fait des recherches sur la façon de lire un fichier en UTF-8 correctement, beaucoup de commentaires sur la façon de ne pas le faire, mais peu sur une méthode qui fonctionne pour moi.

En utilisant un fichier .DS_Store de mon Mac, j'obtiens le résultat suivant. J'aimerais comprendre comment résoudre les avertissements (c'est-à-dire ne pas simplement les ignorer, mais les traiter correctement). Je cherche également un moyen de vérifier que je procède correctement. Par exemple od -c .DS_Store est une méthode, mais je ne vois pas de correspondance univoque avec mon résultat.

>charlist_v4 .DS_Store
utf8 "\x80" does not map to Unicode at /Users/ericdp/bin/charlist_v4 line 210.
utf8 "\x80" does not map to Unicode at /Users/ericdp/bin/charlist_v4 line 210.
utf8 "\x80" does not map to Unicode at /Users/ericdp/bin/charlist_v4 line 210.
utf8 "\x80" does not map to Unicode at /Users/ericdp/bin/charlist_v4 line 210.
utf8 "\x80" does not map to Unicode at /Users/ericdp/bin/charlist_v4 line 210.
utf8 "\x80" does not map to Unicode at /Users/ericdp/bin/charlist_v4 line 210.
               Dec     Hex  Letter   Count  Desc

     1       0  0x0000  [NUL]        6,020  C0 Control Character Set - Null (^@ \0)                                         
     2       1  0x0001  [SOH]           59  C0 Control Character Set - Start of Header (^A)                                 
     3       2  0x0002  [STX]            8  C0 Control Character Set - Start of Text (^B)                                   
     4       3  0x0003  [ETX]            1  C0 Control Character Set - End of Text (^C)                                     
     5       4  0x0004  [EOT]            7  C0 Control Character Set - End of Transmission (^D)                             
     6       8  0x0008  [BS]             9  C0 Control Character Set - Backspace (^H \b)                                    
     7      11  0x000B  [VT]             2  C0 Control Character Set - Vertical Tabulation (^K \v)                          
     8      16  0x0010  [DLE]            9  C0 Control Character Set - Data Line Escape (^P)                                
     9      24  0x0018  [CAN]            1  C0 Control Character Set - Cancel (^X)                                          
    10      32  0x0020  [SP]             7  Space                                                                           
    11      37  0x0025  [%]          2  PERCENT SIGN                                                                    
    12      48  0x0030  [ ]          6  DIGIT ZERO                                                                      
    13      49  0x0031  [1]          1  DIGIT ONE                                                                       
    14      56  0x0038  [8]          6  DIGIT EIGHT                                                                     
    15      64  0x0040  [@]          7  COMMERCIAL AT                                                                   
    16      66  0x0042  [B]          2  LATIN CAPITAL LETTER B                                                          
    17      68  0x0044  [D]          2  LATIN CAPITAL LETTER D                                                          
    18      69  0x0045  [E]          1  LATIN CAPITAL LETTER E                                                          
    19      83  0x0053  [S]          1  LATIN CAPITAL LETTER S                                                          
    20      92  0x005C  [\]          6  REVERSE SOLIDUS                                                                 
    21      96  0x0060  [`]          1  GRAVE ACCENT                                                                    
    22     100  0x0064  [d]          1  LATIN SMALL LETTER D                                                            
    23     117  0x0075  [u]          1  LATIN SMALL LETTER U                                                            
    24     120  0x0078  [x]          6  LATIN SMALL LETTER X     

  #!/usr/bin/perl
  # ========== ========== ========== ========== ========== ========== ==========
  # charlist2.pl
  #
  # count every character in a file
  #
  # Version 1: 16 Aug 05  bb
  # Version 2: 21 Sep 05 jw v2 modified layout of output file
  # Version 3: 2005-10-15 bh Added -f and -r options
  # Version 4: 31 Jan 2010 EDP - added UTF-8 functionality
  # ========== ========== ========== ========== ========== ========== ==========
  $| = 1;             # Do not buffer output
  use strict;
  use warnings;
  use Encode qw(encode :fallbacks);

  #use open IO => ':utf8'; # all I/O in utf8
  #no warnings 'utf8'; # but ignore utf-8 warnings
  #binmode( STDIN, ":utf8" );
  #binmode( STDOUT, ":utf8" );
  #binmode( STDERR, ":utf8" );

  use Unicode::UCD 'charinfo';
  use Cwd 'abs_path'; # get full absolute path to files, regardless of where it is ran from
  {
    no warnings;      # warnings doesn't like $0 below
    use constant {
      PROGRAM  => abs_path( $0 ),  # get full path, not relative path
      DEBUG    => $ENV{ 'DEBUG' }  # to turn on debugging:  export DEBUG=1
    };
  }

  # ---------- ---------- ----------
  our $Version = "4.0";

  # ---------- ---------- ----------
  use Getopt::Std;
  our ( $opt_f, $opt_r );
  getopts( 'fr' );

  # ---------- ---------- ----------
  die <<"eof" unless $#ARGV >= 0;
  Usage:
    charlist2.pl [-f] [-r]  infile > outfile

  Given a text file, count the number of times each character occurs.
  Print out the count, also giving the decimal equivalent of each character.

  -f sort by frequency

  -r reverse sort order

  Version $Version
  eof
  my $file = $ARGV[0];
  my %ctrls;

    sub commify {
    # ---------- ---------- ---------- ---------- ---------- ---------- ----------
    # Description : commify a number
    #
    # Arguments   : number
    #
    # Returns     : string equivalent with commas every three numbers to the
    #               left of the decimal
    #
    # Example     : $num_str = commify 1234.5678  # == 1,234.5678
    # ---------- ---------- ---------- ---------- ---------- ---------- ----------

      my $text = reverse $_[0];
      $text =~ s/(\d\d\d)(?=\d)(?!\d*\.)/$1,/g;
      return scalar reverse $text;

    } # commify

    sub trim {
    # ---------- ---------- ---------- ---------- ---------- ---------- ----------
    # Description : Trim spaces before and after a string
    #
    # Arguments   : string
    #
    # Returns     : regex out any leading/trailing spaces
    #
    # Example     : print trim( '     a  ' )  # 'a'
    # ---------- ---------- ---------- ---------- ---------- ---------- ----------

      my ( $str ) = shift =~ m!^\s*(.+?)\s*$!i;
      defined $str ? return $str : return '';

    } # trim

    sub ident {
    # ---------- ---------- ---------- ---------- ---------- ---------- ----------
    # Description : Identify everything about this character
    #
    # Arguments   : line counter
    #               character code (i.e. space = 32)
    #               count of how many we found
    #
    # Returns     : output line to STDOUT
    #
    # Example     : ident( line_num=>$cnt,
    #                      char_code=>$idx,
    #                      count=>$count[$idx] );
    # ---------- ---------- ---------- ---------- ---------- ---------- ----------

    my %args = @_;
    my $line_num = $args{line_num} || die 'ident( line_num=> ) paramer required';
    my $char_code = $args{char_code} ;#|| die 'ident( char_code=> ) paramer required';
    my $count = $args{count} || die 'ident( count=> ) paramer required';

    my ( $c, $h, $n );

    # ---------- ---------- ----------
    # Gather what unicode information about this character
    # ---------- ---------- ----------
    my $info=eval { charinfo( $char_code ) };

    # ---------- ---------- ----------
    # and we find something
    # ---------- ---------- ----------
    if ( defined $info )
    {

      # ---------- ---------- ----------
      # what if it is one of the control
      # characters defined at the end of
      # this file?
      # ---------- ---------- ----------
      if ( defined $ctrls{$char_code} )
      {

        $c = trim( $ctrls{$char_code}[0] );
        $h = $info->{code};
        $n = trim( $ctrls{$char_code}[1] );

      }
      else
      {

        # ---------- ---------- ----------
        # what did we find?
        # ---------- ---------- ----------
        $c = chr( $char_code ) || ' ';
        eval {

          no warnings;
          if ( $info->{combining} > 0 )
          {
            $c = ' ' . $c;
          }

        };
        $h = $info->{code} || ' ';
        $n = trim( $info->{name} ) || ' ';

      }

    }
    else
    {

      # ---------- ---------- ----------
      # we didn't find anything in the system files.
      # it may not be up-to-date
      # ---------- ---------- ----------
      $n = '<undef>';

    }
    print sprintf( "%6d", $line_num ) . "\t";
    print sprintf( "%6d", $char_code ) ."\t";
    print '0x' . $h . "\t";
    print sprintf( "[%-1s]\t", $c );
    print sprintf( "%10s", commify( $count ) ) . "\t";
    print sprintf( "%-80s", $n );
    print "\n";
    } # ident

  # ---------- ---------- ----------
  # Load special control characters from DATA below
  # ---------- ---------- ----------
  while ( <DATA> )
  {

    chomp;
    last unless /\S/;
    my ( $key, @data ) = split /,/;
    $ctrls{$key} = \@data;

  }

  # ---------- ---------- ----------
  # Read the file
  # ---------- ---------- ----------
  my $line;
  my @count;

  #open( my $fh, '<', $file ) or die "Unable to open $file - $!\n";
  #while ( $line = <$fh> )

  open( my $fh, '<:encoding( UTF-8 )', $file ) or die "Unable to open $file - $!\n";
  while ( $line = encode( 'UTF-8', <$fh>, FB_PERLQQ ) )
  {

    my @chars = split( //, $line );
    foreach my $char ( @chars )
    {

  #    utf8::decode( $char ) or die "unable to change [$char] to utf8";
      $count[ ord( $char ) ]++;

    }

  }
  close $fh or die "Unable to close $file: $!\n";

  # ---------- ---------- ----------
  #  http://unicode.org/faq/utf_bom.html#gen6
  #  1114111 = 0x10FFFF - max possible value in Unicode UTF-8 v.5.2.
  # ---------- ---------- ----------
  my @list = ( 0 .. 1114111 );
  @list = sort { $count[$a] || 0 <=> $count[$b] || 0 } @list if $opt_f;
  @list = reverse @list if $opt_r;

  # ---------- ---------- ----------
  # Show what we found
  # ---------- ---------- ----------
  print "\t   Dec\t   Hex\tLetter\t Count\tDesc\n\n";
  my $cnt = 1;
  for my $idx ( @list )
  {

    if ( $count[$idx] )
    {

      print "line_num=>$cnt\tchar_code=>$idx\tcount=>$count[$idx]\n" if DEBUG;
      ident( line_num=>$cnt,
             char_code=>$idx,
             count=>$count[$idx] );
      $cnt++;

    }

  }

  # ---------- ---------- ----------
  # All done
  # ---------- ---------- ----------
  exit;

  # ========== ========== ========== ========== ========== ========== ==========

  # ---------- ---------- ----------
  # These special characters don't have all
  # this extra definition, so let's make this list
  # ---------- ---------- ----------
  __DATA__
  0,NUL,C0 Control Character Set - Null (^@ \0)
  1,SOH,C0 Control Character Set - Start of Header (^A)
  2,STX,C0 Control Character Set - Start of Text (^B)
  3,ETX,C0 Control Character Set - End of Text (^C)
  4,EOT,C0 Control Character Set - End of Transmission (^D)
  5,ENQ,C0 Control Character Set - Enquiry (^E)
  6,ACK,C0 Control Character Set - Acknowledge (^F)
  7,BEL,C0 Control Character Set - Bell(^G \a)
  8,BS,C0 Control Character Set - Backspace (^H \b)
  9,HT,C0 Control Character Set - Horizontal Tabulation (^I \t)
  10,LF,C0 Control Character Set - Line Feed (^J \n)
  11,VT,C0 Control Character Set - Vertical Tabulation (^K \v)
  12,FF,C0 Control Character Set - Form Feed (^L \f)
  13,CR,C0 Control Character Set - Carriage Return (^M \r)
  14,SO,C0 Control Character Set - Shift Out (^N)
  15,SI,C0 Control Character Set - Shift In (^O)
  16,DLE,C0 Control Character Set - Data Line Escape (^P)
  17,DC1,C0 Control Character Set - Device Control One (^Q) - XON
  18,DC2,C0 Control Character Set - Device Control Two (^R)
  19,DC3,C0 Control Character Set - Device Control Three (^S) - XOFF
  20,DC4,C0 Control Character Set - Device Control Four (^T)
  21,NAK,C0 Control Character Set - Negative Acknowledge (^U)
  22,SYN,C0 Control Character Set - Synchronous Idle (^V)
  23,ETB,C0 Control Character Set - End of Transmission Block (^W)
  24,CAN,C0 Control Character Set - Cancel (^X)
  25,EM,C0 Control Character Set - End of Medium (^Y)
  26,SUB,C0 Control Character Set - Substitute (^Z)
  27,ESC,C0 Control Character Set - Escape (^[, \e)
  28,FS,C0 Control Character Set - File Separator (^\)
  29,GS,C0 Control Character Set - Group Separator (^])
  30,RS,C0 Control Character Set - Record Separator (^^)
  31,US,C0 Control Character Set - Unit Separator (^_)
  32,SP,Space
  127,DEL,Delete (^?)
  128,PAD,C1 Control Character Set - Padding Character
  129,HOP,C1 Control Character Set - High Octet Preset
  130,BPH,C1 Control Character Set - Break Permitted Here
  131,NBH,C1 Control Character Set - No Break Here
  132,IND,C1 Control Character Set - Index
  133,NEL,C1 Control Character Set - Next Line
  134,SSA,C1 Control Character Set - Start of Selected Area
  135,ESA,C1 Control Character Set - End of Selected Area
  136,HTS,C1 Control Character Set - Horizontal Tabulation Set
  137,HTJ,C1 Control Character Set - Horizontal Tabulation with Justification
  138,VTS,C1 Control Character Set - Vertical Tabulation Set
  139,PLD,C1 Control Character Set - Partial Line Down
  140,PLU,C1 Control Character Set - Partial Line Up
  141,RI,C1 Control Character Set - Reverse Index
  142,SS2,C1 Control Character Set - Single-Shift Two
  143,SS3,C1 Control Character Set - Single-Shift Three
  144,DCS,C1 Control Character Set - Device Control String
  145,PU1,C1 Control Character Set - Private Use One
  146,PU2,C1 Control Character Set - Private Use Two
  147,STS,C1 Control Character Set - Set Transmit State
  148,CCH,C1 Control Character Set - Cancel Character
  149,MW,C1 Control Character Set - Message Waiting
  150,SPA,C1 Control Character Set - Start of Guarded Protected Area
  151,EPA,C1 Control Character Set - End of Guarded Protected Area
  152,SOS,C1 Control Character Set - Start of String
  153,SGCI,C1 Control Character Set - Single Graphic Character Introducer
  154,SCI,C1 Control Character Set - Single Character Introducer
  155,CSI,C1 Control Character Set - Control Sequence Introducer
  156,ST,C1 Control Character Set - String Terminator
  157,OSC,C1 Control Character Set - Operating System Command
  158,PM,C1 Control Character Set - Privacy Message
  159,APC,C1 Control Character Set - Application Program Command
  __END__

  # ========== ========== ========== ========== ========== ========== ==========

5voto

tchrist Points 47116

Réponse triviale

En voici les grandes lignes. Ne procédez jamais vous-même à un décodage manuel ! La seule fois où j'ai eu à le faire, c'était pour traiter un fichier dont l'encodage variait d'une ligne à l'autre. Au lieu de cela, il faut toujours définir l'encodage sur le flux, que ce soit par l'une des méthodes suivantes :

  • En PERLUNICODE envariable : standard S pour std{in,out,err} et dangereux D pour les dossiers
  • En use open pragma.
  • Dans l'argument de mode du 3⁺-arg open .
  • Dans le deuxième argument de binmode .

En voici les grandes lignes :

use warnings;
use warnings FATAL => "utf8";
use charnames ();
my %seen = ();
binmode(STDOUT, ":utf8") || die "binmode failed";
binmode(STDIN, ":encoding(UTF-8)") || die "binmode failed";

while (<STDIN>) {
    $seen{$_}++ for split //;
}
close(STDIN) || die "can't close STDIN: $!";

Vous avez maintenant un %seen qui est indexé par chaque caractère dont la valeur associée est le nombre d'instances.

Réponse simple

Il s'agit d'une solution complète qui suppose que toutes les entrées sont en UTF-8. Elle produit de jolies sorties que vous pouvez trier sur différentes colonnes si vous n'aimez pas l'ordre des points de code.

#!/usr/bin/env perl
#
# unicount - count code points in input
# Tom Christiansen <tchrist@perl.com>

use v5.12;
use strict;
use sigtrap;
use warnings;
use open qw( :encoding(UTF-8) :std );
use charnames ();

use List::Util          qw(max);
use Unicode::UCD        qw(charinfo charblock);

my $total = 0;
my %seen = ();

while (<>) {
    $total += length;
    $seen{$_}++ for split //;
};

my $dec_width = length($total);
my $hex_width = max(4, length sprintf("%x", max map { ord } keys %seen));

for (sort keys %seen) {
    my $count = $seen{$_};
    my $gcat  = charinfo(ord())->{category};
    my $name  = charnames::viacode(ord())
             || "<unnamed code point in @{[charblock(ord())]}>";

    printf "%*d U+%0*X GC=%2s %s\n",
            $dec_width => $count,
            $hex_width => ord(),
            $gcat      => $name;
}

exit;

Réponse extravagante

Cela ne suppose plus que l'entrée est UTF-8.

  • Il coupe .gz à l'aide de la fonction "magic open".
  • Il recherche dans les podfiles un fichier intégré =encoding . Cela pourrait être étendu aux fichiers html et xml.
  • Si les fichiers ont une extension qui correspond à un alias d'encodage valide, alors utiliser cet encodage. Par exemple, foo.latin1 , foo.utf8 , foo.cp1252 , foo.utf16 , foo.utf16be , foo.macroman . Je suis fermement convaincu qu'il n'existe pas de fichier texte en clair et que, par conséquent, le fichier .txt Il convient donc d'interdire immédiatement l'utilisation de cette extension.
  • Dans le cas contraire, bytes est supposé être utilisé pour les fichiers binaires et utf8 dans le cas contraire.

Le traitement pourrait se faire par ligne plutôt que par fichier entier, mais c'est un exercice que je laisse au lecteur.

#!/usr/bin/env perl
#
# unicount - count code points in input
# Tom Christiansen <tchrist@perl.com>

use v5.12;
use strict;
use sigtrap;
use warnings;
use charnames ();

use Carp                qw(carp croak confess cluck);
use List::Util          qw(max);
use Unicode::UCD        qw(charinfo charblock);

sub fix_extension;
sub process_input   (&) ;
sub set_encoding    (*$);
sub yuck            ($) ;

my $total = 0;
my %seen = ();

# deep magic here
process_input {
    $total += length;
    $seen{$_}++ for split //;
};

my $dec_width = length($total);
my $hex_width = max(4, length sprintf("%x", max map { ord } keys %seen));

for (sort keys %seen) {
    my $count = $seen{$_};
    my $gcat  = charinfo(ord())->{category};
    my $name  = charnames::viacode(ord())
             || "<unnamed code point in @{[charblock(ord())]}>";

    printf "%*d U+%0*X GC=%2s %s\n",
            $dec_width => $count,
            $hex_width => ord(),
            $gcat      => $name;
}

exit;

##################################################

sub yuck($) {
    my $errmsg = $_[0];
    $errmsg =~ s/(?<=[^\n])\z/\n/;
    print STDERR "$0: $errmsg";
}

sub process_input(&) {
    my $function = shift();
    my $enc;

    if (@ARGV == 0 && -t STDIN && -t STDERR) {
        print STDERR "$0: reading from stdin, type ^D to end or ^C to kill.\n";
    }

    unshift(@ARGV, "-") if @ARGV == 0;

FILE:

    for my $file (@ARGV) {
        # don't let magic open make an output handle
        next if -e $file && ! -f _;
        my $quasi_filename = fix_extension($file);
        $file = "standard input" if $file eq q(-);
        $quasi_filename =~ s/^(?=\s*[>|])/< /;

        no strict "refs";
        my $fh = $file;   # is *so* a lexical filehandle! ###98#
        unless (open($fh, $quasi_filename)) {
            yuck("couldn't open $quasi_filename: $!");
            next FILE;
        }
        set_encoding($fh, $file) || next FILE;

        my $whole_file = eval {
            # could just do this a line at a time, but not if counting \R's
            use warnings "FATAL" => "all";
            local $/;
            scalar <$fh>;
        };

        if ($@) {
            $@ =~ s/ at \K.*? line \d+.*/$file line $./;
            yuck($@);
            next FILE;
        }

        do {
            # much faster to alias than to copy
            local *_ = \$whole_file;
            &$function;
        };

        unless (close $fh) {
            yuck("couldn't close $quasi_filename at line $.: $!");
            next FILE;
        }

    } # foreach file

}

# Encoding set to (after unzipping):
#    if file.pod => use whatever =encoding says
#    elsif file.ENCODING for legal encoding name -> use that one
#    elsif file is binary => use bytes
#    else => use utf8
#
# Note that gzipped stuff always shows up as bytes this way, but
#   it internal unzipped bytes are still counted after unzipping
#
sub set_encoding(*$) {
    my ($handle, $path) = @_;

    my $enc_name = (-f $path && -B $path) ? "bytes" : "utf8";

    if ($path && $path =~ m{ \. ([^\s.]+) \z }x) {
        my $ext = $1;
        die unless defined $ext;

        if ($ext eq "pod") {
            my $int_enc = qx{
                perl -C0 -lan -00 -e 'next unless /^=encoding/; print \$F[1]; exit' $path
            };
            if ($int_enc) {
                chomp $int_enc;
                $ext = $int_enc;
              ##print STDERR "$0: reset encoding to $ext on $path\n";
            }
        }

        require Encode;
        if (my $enc_obj = Encode::find_encoding($ext)) {
            my $name = $enc_obj->name || $ext;
            $enc_name = "encoding($name)";
        }
    }

    return 1 if eval {
        use warnings FATAL => "all";
        no strict "refs";
      ##print STDERR qq(binmode($handle, ":$enc_name")\n);
        binmode($handle, ":$enc_name") || die "binmode to $enc_name failed";
        1;
    };

    for ($@) {
        s/ at .* line \d+\.//;
        s/$/ for $path/;
    }

    yuck("set_encoding: $@");

    return undef;
}

sub fix_extension {
    my $path = shift();
    my %Compress = (
        Z       =>  "zcat",
        z       => "gzcat",            # for uncompressing
        gz      => "gzcat",
        bz      => "bzcat",
        bz2     => "bzcat",
        bzip    => "bzcat",
        bzip2   => "bzcat",
        lzma    => "lzcat",
    );

    if ($path =~ m{ \. ( [^.\s] +) \z }x) {
        if (my $prog = $Compress{$1}) {
            # HIP HIP HURRAY! for magic open!!!
            # HIP HIP HURRAY! for magic open!!!
            # HIP HIP HURRAY! for magic open!!!
            return "$prog $path |";
        }
    }

    return $path;
}

END {
    close(STDIN)  || die "couldn't close stdin: $!";
    close(STDOUT) || die "couldn't close stdout: $!";
}

UNITCHECK {
    $SIG{  PIPE  } = sub { exit };
    $SIG{__WARN__} = sub {
        confess "trapped uncaught warning" unless $^S;
    };
}

4voto

Anomie Points 43759

Su .DS_Store contient des données binaires et non du texte encodé en UTF-8. Les avertissements proviennent du fait que certaines séquences d'octets ne sont pas valides en UTF-8.

Prograide.com

Prograide est une communauté de développeurs qui cherche à élargir la connaissance de la programmation au-delà de l'anglais.
Pour cela nous avons les plus grands doutes résolus en français et vous pouvez aussi poser vos propres questions ou résoudre celles des autres.

Powered by:

X