How can I organize data in rows and columns with Perl?

Time:01-05

The base problem is that I have lots of datapoints with normalized names that are just dumped from the server into a file, but I need to organize these datapoints into a file with rows and columns automatically, according to the data they contain (indicated in their normalized names).


The original file with all the datapoints comes as follows (these are not the original datapoint tags but rather simplified ones):

temp_r301
airflow_r301
temp_r345
airflow_r345
solar_w
solar_e
...

As you can see, they all come as one column, so there is one tag per row.

And I want to organize them so that for each state ("temp" as in temperature), I have the corresponding information in the same row, such as:

temp_r301 301 airflow_r301 solar_w solar_e     #airflow in 301 and general solar radiation affect temperature (state) in room 301
temp_r345 345 airflow_r345 solar_w solar_e     #airflow in 345 and general solar radiation affect temperature (state) in room 345

Of course the length of the array can vary, so the idea is to make an algorithm that detects the length and organizes the data accordingly. Also, I am aware I will have to use regular expressions to find the matches and to determine which datapoints are states and which are inputs, as well as the room to which they belong.


So far I have tried the following:

use strict;
use warnings;
use diagnostics;

my @transpose = ();
my @sorted = ();

push(@sorted, [qw(temp_r301 temp_r345)]);
push(@sorted, [qw(301 345)]);
push(@sorted, [qw(airflow_r301 airflow_r345 solar_w solar_e)]);

for my $sorted (@sorted) {
  for my $column (0 .. $#sorted) {
    push(@{$transpose[$column]}, $sorted->[$column]);
  }
}

for my $new_row (@transpose) {
  for my $new_col (@{$new_row}) {
      print "$new_col ";
  }
  print "\n";
}

But this only works fine if all the arrays have the same length (which is not the case here).

I also discovered a loop that can be used to store data into matrix form (array of arrays), but still, I can't seem to find a solution to write in the matrix the data from different arrays:

use strict;
use warnings;
use diagnostics;
use feature 'say';

my @states = qw(temp_r301 temp_r345);
my @zones = qw(301 345);
my @inputs = qw(airflow_r301 airflow_r345 solar_w solar_e);

my @matrix = ();

for my $x (0 .. $#states) {
    for my $y (0 .. $#inputs) {
        $matrix[$x][$y] = $states[$x];           #of course this only copies the states array and
    }                                            #repeats it for each created array
}

for my $aref (@matrix) {                         #print array of arrays
    say "[ @$aref ],";
}

So, knowing that I have all the data dumped into an input file, what would be the best way to sort that data into a matrix? Is there any loop I should give more attention to? Should I be working with arrays? Any idea would be great.

Thank you in advance.

CodePudding user response:

Details of this problem are still unclear, although the explanations did help. So here is what I'll assume.

I take the data to have one piece of information per line. Some lines contain a tag (description) followed by the room number, and I assume the format tag_rN, identifying the room number that the tag applies to.

As for the others, which have no room number, additional processing is needed to decide where that information belongs. The question gives only one example of such tags, those related to solar radiation, which apply to all rooms (see comments), so that is all that is processed.

The fact that some of the data does not neatly classify with a room is what makes organization of the parsed data non-trivial. Since no details are given, I merely split it into two hashes: one keyed by room number and another whose structure will depend on the specifics.

use warnings;
use strict;
use feature 'say';
use Data::Dump qw(dd);

my $file = shift // die "Usage: $0 file\n";
open my $fh, '<', $file or die "Can't open $file: $!";

my (%room, %other);
while (<$fh>) { 
    chomp;
    if ( my ($tag, $room_num) = /([^_]+)_r([0-9]+)/ ) {
        $room{$room_num}{$tag} = $_;                  # have room number
    }
    else {                                            # more processing needed
        my ($tag, $value) = parse_line($_);
        next unless defined $tag;                     # skip lines parse_line can't classify
        push @{ $other{$tag} }, $value;
    }
}
dd \%room; dd \%other; say '';

# Print in CSV format. Header first
my @tags = ( keys %{ $room{ (keys %room)[0] } }, keys %other );
say join ',', 'room', @tags;
foreach my $rnum (keys %room) { 
    say join ',', 
        $rnum, map { $room{$rnum}{$_} // join ' ', @{$other{$_}}  } @tags;
}

sub parse_line {
    my ($line) = @_;
    my ($tag, $value);

    if ($line =~ /solar_w|solar_e/) {   # example from sample data
        $tag   = 'solar';
        $value = $line;
    }
    else { }  # other possibilities

    return $tag, $value;
}

Data with a room number is stored under its identifying description ("tag") as the key, with the whole line as the value. These key-value pairs live in a hashref, one per room number.
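
To illustrate that layout, here is a minimal sketch that writes the same shape of %room out literally and reads from it. The helper name room_values is made up for this example and is not part of the code above.

```perl
use strict;
use warnings;

# Same shape as %room above, written out literally for illustration.
my %room = (
    301 => { airflow => 'airflow_r301', temp => 'temp_r301' },
    345 => { airflow => 'airflow_r345', temp => 'temp_r345' },
);

# Hypothetical helper: the names recorded for one room, ordered by tag.
# Returns an empty list for a room number that was never seen.
sub room_values {
    my ($rooms, $num) = @_;
    my $tags = $rooms->{$num} or return;
    return map { $tags->{$_} } sort keys %$tags;
}

print "$room{301}{temp}\n";                       # single lookup: room, then tag
print join(' ', room_values(\%room, 345)), "\n";  # everything known about room 345
```

Sorting the tag names gives a stable order, which a plain keys call on a hash does not guarantee.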

The data without the room number is parsed in a separate sub, with just some token code since no details are given. Then that is stored in another hash, for easier manipulation (since it's not tied to any one room).

How tags are extracted from data is a bit arbitrary, since it's not specified in the question.
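
As one possible convention, the sketch below treats everything before "_r" as the tag and the digits after it as the room. The helper name split_tag is hypothetical, and anchoring the pattern to the whole name is an assumption that is stricter than the unanchored match used in the program above.

```perl
use strict;
use warnings;

# Hypothetical helper: split a datapoint name into (tag, room number).
# Returns an empty list when the name has no "_r<digits>" suffix.
sub split_tag {
    my ($name) = @_;
    return $name =~ /^([^_]+)_r([0-9]+)$/ ? ($1, $2) : ();
}

for my $name (qw(temp_r301 airflow_r345 solar_w)) {
    my ($tag, $room) = split_tag($name);
    print defined $room ? "$name: tag=$tag room=$room\n"
                        : "$name: no room number\n";
}
```

If the real tags contain underscores of their own (the question says the sample names are simplified), the [^_]+ part would need to be loosened accordingly.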

All this is combined into a CSV format. The above, with the input file from the question and the explanation in comments that the solar radiation from both west and east affects all rooms, prints:

{
  301 => { airflow => "airflow_r301", temp => "temp_r301" },
  345 => { airflow => "airflow_r345", temp => "temp_r345" },
}
{ solar => ["solar_w", "solar_e"] }

room,airflow,temp,solar
345,airflow_r345,temp_r345,solar_w solar_e
301,airflow_r301,temp_r301,solar_w solar_e

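Comment out the line with dd ... (from Data::Dump) to remove the initial diagnostic prints. The last few lines are then the CSV that would go into a file. Note that the order of columns and rows can differ between runs, since Perl does not keep hash keys in any particular order.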

Some data may be missing for some rooms, and there is yet more data which may not classify so uniformly. Then the fields for those headers will be merrily empty in some rows, as desired.
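
If the row layout from the question (state, room number, that room's inputs, then the inputs common to all rooms) is wanted instead of CSV, it could be sketched roughly as follows. The name build_rows is hypothetical, and the code assumes that "temp" is the only state tag and that every name without an "_r<digits>" suffix applies to all rooms.

```perl
use strict;
use warnings;

# Hypothetical helper: one row per room, laid out as
#   state, room number, room-specific inputs, shared inputs.
# Assumptions: "temp" is the only state tag; names without "_r<digits>"
# (e.g. solar_w) are inputs shared by every room.
sub build_rows {
    my @names = @_;
    my (%room, @shared);
    for my $name (@names) {
        if ( my ($tag, $num) = $name =~ /^([^_]+)_r([0-9]+)$/ ) {
            $room{$num}{$tag} = $name;
        }
        else {
            push @shared, $name;
        }
    }
    my @rows;
    for my $num (sort { $a <=> $b } keys %room) {
        my %tags  = %{ $room{$num} };             # copy, so delete leaves %room intact
        my $state = delete $tags{temp} // '';     # empty field if the room has no state
        push @rows, [ $state, $num, (map { $tags{$_} } sort keys %tags), @shared ];
    }
    return @rows;
}

my @rows = build_rows(qw(temp_r301 airflow_r301 temp_r345 airflow_r345 solar_w solar_e));
print join(' ', @$_), "\n" for @rows;
```

With the sample names from the question this prints the two rows asked for, one for room 301 and one for room 345. Rows may have different lengths, because each row is built independently from whatever inputs that room has.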
