Perl Weekly Challenge 110: Transpose File

by Abigail

Challenge

You are given a text file.

Write a script to transpose the contents of the given file.

Example

Input File

name,age,sex
Mohammad,45,m
Joe,20,m
Julie,35,f
Cristina,10,f

Output

name,Mohammad,Joe,Julie,Cristina
age,45,20,35,10
sex,m,m,f,f

Solution

A straight forward solution would be to read in the input line by line, split each line on commas, and put the fields in a two dimensional array. Then transpose the array and printing the result row by row, joining the fields by commas.

We won't do that. Mostly because not all languages we will be using have two dimensional arrays (I'm looking at you, Bash).

Instead, we will read in the input line by line, and split the line on commas and then use the fields to construct the lines we're going to output. That way, we only need a single, one-dimensional array. We're adding the \(n^\text{th}\) field of each input line to the end of the \(n^\text{th}\) output line — and each time we're adding a comma too.

When we have created all the output lines, we print the lines, after chopping off the final comma.

Note that we read the input file from standard input. We also assume the input is sane, that is, each line contains the same number of fields.

Perl

map {$- = 0; $; [$- ++] .= "$_," for /[^,\n]+/g} <>;

Here, we use <> to read the input, and we use map to process each line. We're using a regular expression to extract the fields (/[^,\n]+/); we could have used split, but then we would have needed some extra code to get rid of the trailing newline. The regular expression just matches anything which isn't a comma or a newline. Due to the /g modifier, there are implicit parenthesis, so the regular expression returns all the fields. Which we iterate over using for. We are storing the output lines into the array @;. We use the variable $- to keep track of which line we need to update. It's set to 0 for each line of input, and incremented for each field.

chop, say for @;

For each line in @;, we chop of the final comma, and then we print it.

Find the full program on GitHub.

AWK

We could use split to split the input into records, or extract the fields using a regular expression, but why would we in AWK? One of the strengths of AWK is it is line and field oriented. We just have to tell AWK our fields aren't white space separated, but they're separated by commas:

BEGIN {
    FS = ","
}

This uses the special variable FS, the Field Separator, which controls how AWK splits the input into fields.

We can now just iterate over the input fields (which are available in the special variables $1, $2, etc), and build our output string in the array out:

{
    for (i = 1; i <= NF; i ++) {
        out [i] = out [i] "," $i
    }
}

All what is left is to print out the output string, without the trailing comma:

END {
    for (i = 1; i <= length (out); i ++) {
        print substr (out [i], 2)
    }
}

Find the full program on GitHub.

Bash

Just like AWK, Bash will split the input into records when reading data using read. We're telling to split fields on commas by assigning to the variable IFS, the Input Field Separator:

IFS=,

Commas are not special to Bash, so the don't have to be escaped.

We're now ready to read in the input. We read each line of input into the array chunks — Bash has neatly split the input into the right fields for us. We then iterate over the array chunks, adding each field to the appropriate output line, which are stored in the array out. Concatenation in Bash is done by just sticking the things you want to concatenate together:

IFS=","
while read -a chunks
do    for   ((i = 0; i < ${#chunks[@]}; i ++))
      do    out[$i]=${out[$i]}${chunks[$i]},
      done
done

We can now print the output string, without the trailing commas:

IFS=""
for   ((i = 0; i < ${#out[@]}; i ++))
do    echo ${out[$i]:0:-1}
done

Find the full program on GitHub.

Lua

In Lua, we follow the same strategy we used for most languages. We use the following to read in the data, split the input lines into fields, and to construct the output lines:

local output = {}
for line in io . lines () do
    local i = 1
    for field in line : gmatch ("[^,\n]+")
    do if   output [i] == nil
       then output [i] = ""
       end
       output [i] = output [i] .. field .. ","
       i = i + 1
    end
end

Things to note are that Lua does not have a split method, instead we use gmatch, a global match. And while Lua autovivifies array elements, we cannot concatenate to undefined (nil) values, so we have to handle that case. Lua does not have a post decrement operator (++) either.

Print the output without their trailing commas is easy:

for _, line in ipairs (output)
do  print (line : sub (1, -2))
end

Find the full program on GitHub.

Node.js

Our solution in Node.js follows a similar strategy. Just like in Lua, we cannot concatenate strings to undefined values, we have to deal with that. In our Node.js solution however, we don't add a comma after we have added a field to an output string; instead, we first add the comma, then the field. This is purely because in Node.js, we cannot as easily use substr to include all, but the last the character. On the other hand, using substr to include all but the first character is easy.

let out = []

  require      ("fs")
. readFileSync (0)                     // Read all.
. toString     ()                      // Turn it into a string.
. split        ("\n")                  // Split on newlines.
. filter       (_ => _ . length)       // Filter out empty lines.
. map          (_ => _ . split (','))  // Split each line on commas
. forEach      (_ => _ .
      forEach ((field, index) => {     // Create output strings
           if (out [index] == null) {
               out [index] = ""
           }
           out [index] += "," + field
      }))

out . forEach (_ => console . log (_ . substr (1)))

Find the full program on GitHub.

Python

A slight deviation for our Python solution. Python doesn't autovivify array elements. So, for the first line of input read, we just copy the fields to the output array. This gives two added bonusses: we won't have to deal with concatenating undefined strings, and we don't have a problem with a trailing (or leading) comma. With hindsight, we could have used this strategy for other languages, but Python was one of the last languages we implemented the problem in, and it didn't seem worth it to update the other solutions.

import fileinput

outputs = []

for line in fileinput . input ():
    fields = line . rstrip () . split (",")
    if (len (outputs)):
        for i in range (len (fields)):
            outputs [i] = outputs [i] + "," + fields [i]
    else:
        outputs . extend (fields)

for line in outputs:
    print (line)

Find the full program on GitHub.

Ruby

By now, we know the drill...

output = []

ARGF . each_line do
   |_|
    _ . strip . split(/,/) . each_with_index do
       |_, i|
        output [i] = (output [i] || "") + _ + ","
    end
end

output . each do
   |_|
    puts _ [0 .. -2]
end

Find the full program on GitHub.

C

In C, we will use a different strategy. We won't be building output strings. Instead, we will read the complete input into a string, and position a pointer at the start of each input line. To generate the output, we visit each pointer (in order), and advance the pointer character by character to the next comma, printing each character while advancing.

First, we read the input, counting lines and columns:

int main (void) {
    char *  text = NULL;
    size_t  len  = 0;
    size_t  str_len;

    if ((str_len = getdelim (&text, &len, '\0', stdin)) == -1) {
        perror ("Failed to read stdin");
        exit (1);
    }
    /*
     * Count the number of input lines and columns.
     */
    size_t nr_of_lines   = 0;
    size_t nr_of_columns = 0;
    char * ptr = text;
    while (* ptr) {
        if (nr_of_lines == 0) {
            if (* ptr == ',' || * ptr == '\n') {
                nr_of_columns ++;
            }
        }
        if (* ptr ++ == '\n') {
            nr_of_lines ++;
        }
    }

We're then position the pointers (which themselves are placed in an array). We also turn the newlines in the input into commas, to make life easier later on:

    char ** outputs;
    if ((outputs = (char **) malloc (nr_of_lines * sizeof (char *)))
        == NULL) {
        perror ("Malloc failed");
        exit (1);
    }
    ptr = text;
    size_t c = 0;
    outputs [0] = ptr;
    while (* ptr) {
        if (* ptr == '\n') {
            * ptr = ',';
            if (* (ptr + 1) != '\0') {
                outputs [++ c] = ptr + 1;
            }
        }
        ptr ++;
    }

Now we can print:

    for (size_t i = 0; i < nr_of_columns; i ++) {
        for (size_t j = 0; j < nr_of_lines; j ++) {
            if (j) {
                printf (",");
            }
            while (* outputs [j] != ',') {
                printf ("%c", * outputs [j] ++);
            }
            outputs [j] ++;
        }
        printf ("\n");
    }

Find the full program on GitHub.


Please leave any comments as a GitHub issue.