First steps with Scala revisited: bash and python strike back


I’ve received a good amount of positive feedback on my previous article on scala.

A couple of readers preferred the bash one-liner version, and many of them argued that for such a simple task a bash or python script would be preferable. Luckily all of them understood that this was just a (maybe lousy, I admit) excuse to give scala a try, and to talk a little bit about functional programming, type inference, interacting with java, higher order functions, and, well, scala itself.

Nevertheless, to do justice to bash and python, I took some advice from the discussion at Hacker News, and even though I’m no bash or python expert, with some googling around I managed to reproduce the functionality of the scala script.

Bash eight-liner version

Well, here’s the bash version:

# Total size of all .textile files, as reported by du (last line is the grand total)
total_size=$(du --summarize *.textile --total | tail -n 1 | cut -f 1)
# grep -L lists the files that do NOT contain the "not yet translated" marker
translated_files=$(grep -L "Esta página todavía no ha sido traducida al castellano" *.textile)
# Size of the translated files only (file names are assumed to contain no spaces)
translated_size=$(echo $translated_files | tr '\n ' '\0' | xargs -0 du --summarize --total | tail -n 1 | cut -f 1)
translated_percent=$(($translated_size*100/$total_size))
echo "translated size: ${translated_size}kb/${total_size}kb ${translated_percent}% \
(pending $(($total_size-$translated_size))kb $((100-$translated_percent))%)"

# Same report again, this time counting files instead of adding up sizes
total_count=$(ls *.textile | wc -l)
translated_count=$(echo $translated_files | tr ' ' '\n' | wc -l)
translated_percent=$(($translated_count*100/$total_count))
echo "translated files: ${translated_count}/${total_count} ${translated_percent}% \
(pending $(($total_count-$translated_count)) $((100-$translated_percent))%)"

I just had to read a couple of man pages and struggle a little bit with tr, wc, xargs, tail, cut and that sort of stuff.

Python version

#! /usr/bin/env python
# -*- coding: utf-8 -*-

import fnmatch
import os

total_files = [file for file in os.listdir('.') if fnmatch.fnmatch(file, '*.textile')]
translated_files = [file for file in total_files if "Esta página todavía no ha sido traducida al castellano" not in open(file).read()]

total_size = sum([os.path.getsize(file) for file in total_files]) / 1000
translated_size = sum([os.path.getsize(file) for file in translated_files]) / 1000
translated_percent= translated_size * 100 / total_size

print "translated size: %dkb/%dkb %d%% (pending %dkb %d%%)" % \
      (translated_size, total_size, translated_percent, total_size-translated_size, 100-translated_percent)

total_count=len(total_files)
translated_count=len(translated_files)
translated_percent= translated_count * 100 / total_count

print "translated files: %d/%d %d%% (pending %d %d%%)" % \
      (translated_count, total_count, translated_percent, total_count-translated_count, 100-translated_percent)

What else can I say? The python version was really easy.
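The script above is written for Python 2 (print statement, integer `/`). For readers on Python 3, here is a minimal sketch of the same logic. The function names `is_translated`, `status` and `report` are my own, and comparing raw UTF-8 bytes instead of decoded text is an assumption made so that the file encoding never matters.

```python
#!/usr/bin/env python3
"""Python 3 sketch of the status script (the original targets Python 2)."""
import fnmatch
import os

# Marker sentence from the original script, compared as UTF-8 bytes so the
# files never have to be decoded.
MARKER = "Esta página todavía no ha sido traducida al castellano".encode("utf-8")


def is_translated(path):
    """A file counts as translated when it does not contain the marker."""
    with open(path, "rb") as handle:  # 'with' closes the file promptly
        return MARKER not in handle.read()


def status(directory="."):
    """Return (translated_kb, total_kb, translated_count, total_count)."""
    names = [n for n in os.listdir(directory) if fnmatch.fnmatch(n, "*.textile")]
    paths = [os.path.join(directory, n) for n in names]
    translated = [p for p in paths if is_translated(p)]
    # '//' keeps the integer division that '/' performed in Python 2.
    total_kb = sum(os.path.getsize(p) for p in paths) // 1000
    translated_kb = sum(os.path.getsize(p) for p in translated) // 1000
    return translated_kb, total_kb, len(translated), len(paths)


def percent(part, whole):
    """Integer percentage; 0 when there is nothing to measure."""
    return part * 100 // whole if whole else 0


def report(directory="."):
    """Build the same two report lines as the original script."""
    t_kb, all_kb, t_count, all_count = status(directory)
    size_pct = percent(t_kb, all_kb)
    count_pct = percent(t_count, all_count)
    return [
        "translated size: %dkb/%dkb %d%% (pending %dkb %d%%)"
        % (t_kb, all_kb, size_pct, all_kb - t_kb, 100 - size_pct),
        "translated files: %d/%d %d%% (pending %d %d%%)"
        % (t_count, all_count, count_pct, all_count - t_count, 100 - count_pct),
    ]


if __name__ == "__main__":
    print("\n".join(report()))
```

Returning the numbers from `status` instead of printing them directly makes the logic easy to check against a temporary directory.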

Scala, Bash and Python… FIGHT!

Well, now let’s see the output of each version:

sas@ubuntu:~/devel/apps/playdoces/documentation/1.2.4/manual$ ./status.scala 
translated size: 407kb/624kb 65% (pending 217kb 35%)
translated files: 37/64 57% (pending 27 43%)

sas@ubuntu:~/devel/apps/playdoces/documentation/1.2.4/manual$ ./status.sh 
translated size: 476kb/752kb 63% (pending 276kb 37%)
translated files: 37/64 57% (pending 27 43%)

sas@ubuntu:~/devel/apps/playdoces/documentation/1.2.4/manual$ ./status.py 
translated size: 407kb/624kb 65% (pending 217kb 35%)
translated files: 37/64 57% (pending 27 43%)

It seems that du rounds each file’s size up, since it reports allocated disk blocks rather than byte counts, but apart from that everything works as expected.
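To make the difference concrete, here is a rough sketch of the two ways of counting. It assumes a filesystem block size of 4096 bytes (this varies between filesystems), and the file sizes are made up for illustration. It is a simplified model of what du does, not a reimplementation.

```python
import math

BLOCK_SIZE = 4096  # assumed filesystem block size in bytes; varies by filesystem


def apparent_kb(sizes):
    """What the python script reports: total bytes divided by 1000."""
    return sum(sizes) // 1000


def allocated_kib(sizes, block_size=BLOCK_SIZE):
    """Rough model of du: each file occupies whole blocks, reported in KiB."""
    blocks = sum(math.ceil(size / block_size) for size in sizes)
    return blocks * block_size // 1024


sizes = [100, 5000, 9000]  # hypothetical file sizes in bytes
print(apparent_kb(sizes), allocated_kib(sizes))  # prints: 14 24
```

With many small files the rounding adds up quickly, which would explain why the bash figures are larger. GNU du also has an `--apparent-size` option that reports byte counts instead of allocated blocks.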

What about performance?

While the scala version does have a startup penalty, with the savecompiled option turned on the delay is pretty bearable (without it, compiling takes a little less than two seconds). Moreover, with long-running or more complex tasks, I suspect that the benefits of having a compiled script, and the performance optimizations of the JVM, would show up.

Here are some figures to compare.

sas@ubuntu:~/devel/apps/playdoces/documentation/1.2.4/manual$ time ./status.scala
translated size: 407kb/624kb 65% (pending 217kb 35%)
translated files: 37/64 57% (pending 27 43%)
real	0m0.475s
user	0m0.388s
sys	0m0.056s

sas@ubuntu:~/devel/apps/playdoces/documentation/1.2.4/manual$ time ./status.sh
translated size: 476kb/752kb 63% (pending 276kb 37%)
translated files: 37/64 57% (pending 27 43%)
real	0m0.045s
user	0m0.004s
sys	0m0.008s

sas@ubuntu:~/devel/apps/playdoces/documentation/1.2.4/manual$ time ./status.py
translated size: 407kb/624kb 65% (pending 217kb 35%)
translated files: 37/64 57% (pending 27 43%)
real	0m0.039s
user	0m0.020s
sys	0m0.012s
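Each figure above comes from a single run of `time`, so they are only a rough indication. To average over several runs, a small harness along these lines could be used. The helper name `average_runtime` and the number of runs are my own choices, and the command shown is only a placeholder.

```python
import subprocess
import sys
import time


def average_runtime(command, runs=5):
    """Average wall-clock seconds taken to run `command`, over `runs` runs."""
    started = time.perf_counter()
    for _ in range(runs):
        # Discard the script's output; check=True raises if the command fails.
        subprocess.run(command, stdout=subprocess.DEVNULL, check=True)
    return (time.perf_counter() - started) / runs


if __name__ == "__main__":
    # Placeholder command: swap in ["./status.scala"], ["./status.sh"]
    # or ["./status.py"] to compare the three scripts.
    command = [sys.executable, "-c", "pass"]
    print("%.3fs" % average_runtime(command))
```

This measures wall-clock time only, which corresponds to the `real` line in the output of `time`.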

Conclusion

After playing a bit with all three of them, for this kind of task I’d definitely go with python. It’s really a joy to use, it’s got great documentation, and there’s lots of interesting information on Stack Overflow. Moreover, like the scala version and unlike bash, it is portable across platforms: I haven’t tried it, but it should work just fine on windows.

Nevertheless I expect to keep playing with scala, for learning purposes and just to have some fun…

In the next article, I give scala another chance and at the same time have a look at implicit conversions, Scala’s answer to ruby’s open classes.

5 responses to this post.

  1. Posted by xxx on 13 January, 2012 at 8:09

    Kill latin1.

  2. Posted by joe on 17 January, 2012 at 12:19

    One of the reasons I like scala for scripting is that I can simply import my jar with my rather complicated object model + API for accessing my database in a script.

    And everything is type-safe. I have less fear that, on some rarely used path that a small test dataset does not cover, the program dies on a big dataset because of a typo.

    After seeing the usefulness of my scripts, I refactor a little bit and move them into the library.

    I work however mostly on data-analysis problems. Python is also good, bash is hell.

    What I miss from scala is Python’s Popen(shell=True). I like to do the piping myself.

  3. Posted by R on 19 January, 2012 at 13:11

    About the python version.
    You said that “Esta página todavía no ha sido traducida al castellano” would be the first line, so why not use the f.readline() function instead of the .read() function? I am only baby steps into programming, and I’m not sure about this, but wouldn’t you be able to squeeze out a little more performance this way? This might also apply to the bash and scala versions, but I wouldn’t know since I can’t read those 2 languages (yet).

    • nice tip (Hey, I’m no python expert either), I tried it and it really went a bit faster

      translated size: 434kb/626kb 69% (pending 192kb 31%)
      translated files: 41/64 64% (pending 23 36%)
      real 0m0.033s
      user 0m0.016s
      sys 0m0.008s

      I guess the performance benefits aren’t bigger because in general the files are pretty small…
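A minimal sketch of the readline() variant discussed in this thread, written for Python 3. The function name is made up for illustration, and the check is only equivalent to the full read() when the marker sentence is always on the first line of the file.

```python
# Marker sentence from the script, compared as UTF-8 bytes.
MARKER = "Esta página todavía no ha sido traducida al castellano".encode("utf-8")


def is_translated_first_line(path):
    """Look for the marker in the first line only, not in the whole file."""
    with open(path, "rb") as handle:
        # readline() stops at the first newline, so the rest is never read.
        return MARKER not in handle.readline()
```

Note that the figures above differ from the earlier runs (41 translated files instead of 37). One possible explanation is that a few files carry the marker on a later line, in which case this variant counts them as translated.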
