The power of *nix tools and regular expressions

Discussion in 'Linux / Unix Discussion' started by ffreeloader, Sep 25, 2007.

  1. ffreeloader

    ffreeloader Terabyte Poster

    One of the guys I'm working with has spent the last three days trying to move data from a proprietary database format into a new database. Yesterday evening, after all the frustration he had experienced trying to do this in Windows, he asked me for help and emailed me the data in an XML file. In less than 2 hours I had transformed the XML file into a .csv file and deleted all the extraneous text that XML uses for its formatting information, leaving only the raw data he needed. That includes the time I spent searching for, and installing, a utility that would translate an XML file into a straight text file, plus my learning curve with that tool and with sed, and re-familiarizing myself with regular expressions.

    The power of the *nix text-based tools is just amazing. I did this with two commands. First I transformed the file, then I simply used "less" to pipe the data into "sed" and massaged the text using regular expressions. I would imagine someone with quite a bit of previous sed experience could have done the entire job in 5 minutes or so.
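
    For the curious, the sed end of it was roughly along these lines. The tag and file names are made up since I can't post his data, it assumes GNU sed, and it's the shape of the thing rather than my exact commands:

        # collect each <row>'s values in the hold space, then print them as one CSV line
        sed -n '
          # skip blank lines
          /^[[:space:]]*$/ d
          # start of a record: empty the hold space
          /<row>/ { s/.*//; h; d; }
          # end of a record: pull the collected values back, join them with commas, print
          /<\/row>/ { x; s/\n/,/g; s/^,//; p; d; }
          # anything else: strip the surrounding tags and stash the bare value
          s/^[[:space:]]*<[^>]*>//
          s/<\/[^>]*>[[:space:]]*$//
          H
        ' export.xml > export.csv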

    Uhh, and before you think this guy isn't experienced or doesn't know what he is doing, he is, and he does. He's been working with databases for about a decade and does this kind of stuff on a regular basis. He just hadn't thought of trying any *nix tools until he was ready to give up in frustration on what was available to him at a reasonable cost in Windows. He's a smart guy; he just didn't know the correct tool for the job was available.
     
    Certifications: MCSE, MCDBA, CCNA, A+
    WIP: LPIC 1
  2. hbroomhall

    hbroomhall Petabyte Poster Gold Member

    I find I use sed, grep, awk and cut on a regular basis to prep data for loading into a database!
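
    A typical prep run looks something like this (file name, delimiters and column numbers are invented for the example):

        # drop blank lines and the header row, strip DOS line endings,
        # keep three columns, and re-delimit them with commas for the loader
        grep -v '^[[:space:]]*$' raw.tsv \
          | sed -e '1d' -e 's/\r$//' \
          | cut -f2,4,7 \
          | awk -F'\t' -v OFS=',' '{ $1 = $1; print }' > load.csv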

    Harry.
     
    Certifications: ECDL A+ Network+ i-Net+
    WIP: Server+
  3. tripwire45
    Honorary Member

    tripwire45 Zettabyte Poster

    I came across a handy article on regular expressions and posted it at the Linux Tutorial site.
     
    Certifications: A+ and Network+
  4. dmarsh
    Honorary Member 500 Likes Award

    dmarsh Petabyte Poster

    Anyone with basic programming or scripting knowledge should have been able to complete the task in a day by the sounds of it.

    All these 'UNIX' tools, grep, awk, sed, are also available on Windows.

    http://www.cygwin.com/

    I messed with an early version of awk and sed for DOS about 12 years back!

    I ended up rewriting it in C as the performance of the AWK/SED version was horrible!

    Regular Expressions make good lexers but they do not make good parsers in general.

    God knows what this guy's been doing for the last decade if his database experience is mainly on Windows and he can't perform basic ETL tasks...

    Also, in general, other than for very basic tasks, why on earth would you want it in CSV, and why would you not just use a standard XML parser instead of writing your own with AWK/SED?

    In fact you could probably have used XSLT on any platform to convert to CSV and had it done in under 2 hours.
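
    Something like xmlstarlet, for instance, which turns a handful of options into an XSLT stylesheet under the hood and runs it through libxslt, would get you to CSV in one line; the element names here are invented:

        # pull the fields out of every <row> and emit them comma-separated, one record per line
        xmlstarlet sel -t -m '//row' \
            -v 'field1' -o ',' -v 'field2' -o ',' -v 'field3' -n \
            export.xml > export.csv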
     
  5. ffreeloader

    ffreeloader Terabyte Poster

    Well, for the first thing, the only way he had to pull the data out of the proprietary database left it in an XML format that wasn't exactly standard; he had used a couple of XML parsers and couldn't get the data out in a usable format.

    The second thing was that he was writing an XSLT and having problems with it because of how the data came out of the database. Me? I've never seen an XSLT and wouldn't know one if it hit me upside the head, so I would have a real steep learning curve with it.

    Third, I'm pretty comfortable with the format the *nix tools take, even the ones I hadn't really used before, and I worked my way through a book on regexes about a year ago, so I wasn't going into that cold. As to regexes in general, I have found them to be really powerful at parsing and finding data, and I'm not alone in that regard. The way I did it, though, was to eliminate everything I didn't want and then put the data into the format I wanted. To do any more from this point forward would only take me a matter of a few minutes, as I have a much better idea of how to use sed's syntax to solve a problem. The first time through anything is always really slow.
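
    To make that concrete, it boiled down to something like this, assuming (purely for the example) a dump that is just a <table> wrapper around flat <tag>value</tag> lines, three fields to a record:

        # pass 1: delete the XML declaration, the wrapper tags and blank lines
        # pass 2: turn each remaining "<tag>value</tag>" line into its bare value
        # pass 3: glue every three values back together as one CSV row
        sed -e '/<?xml/d' -e '/<\/*table>/d' -e '/^[[:space:]]*$/d' dump.txt \
          | sed 's/^[[:space:]]*<\([A-Za-z_][A-Za-z0-9_]*\)>\(.*\)<\/\1>[[:space:]]*$/\2/' \
          | paste -d, - - - > table.csv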
     
    Certifications: MCSE, MCDBA, CCNA, A+
    WIP: LPIC 1
  6. ffreeloader

    ffreeloader Terabyte Poster

    I can see why. I was pretty amazed at how fast it is. Once I'm really comfortable with sed I'll be able to massage text pretty rapidly.
     
    Certifications: MCSE, MCDBA, CCNA, A+
    WIP: LPIC 1
  7. ffreeloader

    ffreeloader Terabyte Poster

    Certifications: MCSE, MCDBA, CCNA, A+
    WIP: LPIC 1
  8. dmarsh
    Honorary Member 500 Likes Award

    dmarsh Petabyte Poster

    Well, well done freddy anyway for sorting it out! :biggrin I still think this other guy sounds like a loser though! :oops:

    Well, XML is a standard metadata format; the end format is of course down to the writer or another standard. It sounds like a very sensible thing for the DB vendor to do. Most DBs these days have some form of XML support, and that, along with TSQL or another database scripting language and maybe some temp tables, would probably get the job done as well. Then there are all the ETL tools, DTS and the like, though it sounds like he's a point-and-click man! These generally only work well in certain scenarios.

    Finally, XSLT: basically, think Prolog. It's a tree-walking algorithm with predicates, and in this case the tree is a DOM. It does take a little getting used to; however, all these solutions would probably be a much better fit in general for the problem than awk, grep and sed.
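
    If you want a feel for the predicate side on its own, a bare XPath query shows the idea (element and file names invented again):

        # walk the tree and keep only the <row> nodes whose <qty> child is over 100,
        # then print the text of their <name> children
        xmllint --xpath '//row[qty > 100]/name/text()' export.xml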
     
  9. ffreeloader

    ffreeloader Terabyte Poster

    Well, he's not a loser. He's actually been pretty successful, and he has the best product of its type in a very specialized, niche market. He's a pretty smart guy, to tell the truth. He was just struggling with this one. I don't know, he had a heart attack a month and a half or so ago, and maybe it has affected him more than we know. I know he had to have 5 stents put in, so he was in pretty tough shape. As to his being a point-and-click guy, well, you nailed that one. :biggrin

    As my "reward" for doing the first one, he sent me more last night. :rolleyes: I finished the last table in about 10 minutes, and all of them in quite a bit less time than it took me to do the first job he sent. So, whether or not sed is suitable for the job, well, I'll leave that up to you to decide, but as I get more familiar with it my productivity is starting to get there. The funny thing about all of this is that not a single table came out of that database in the same format. I just think that's what he gets for choosing a proprietary solution for his database: when you try to move away from it because of its limitations, you run into problems.
     
    Certifications: MCSE, MCDBA, CCNA, A+
    WIP: LPIC 1
  10. dmarsh
    Honorary Member 500 Likes Award

    dmarsh Petabyte Poster

    Sorry to hear about his ill health; I hope he makes a speedy recovery. I'd take a year off if I had a successful product and a heart attack, to recover and get things in perspective.

    I expect they are in the same format; the tags will be named and laid out in accordance with the tables, relations, columns etc. XML is a standard, and it should be effectively isolating you from the legacy database. You still need to know the format of YOUR data though!

    I didn't say awk, sed and grep wouldn't work. ANY general-purpose language could be used to solve the problem, as well as quite a few non-general ones; it's just that some are more suitable than others.

    In the same situation I'd often use C/C++ or Java, simply because I'm most familiar with them. For one-off utils it often doesn't really matter; it's just that it's good to know the various ways you could have done it and why. Also, these 'one-off' utils often have a habit of taking on a life of their own.
     
