The power of *nix tools and regular expressions

Discussion in 'Linux / Unix Discussion' started by ffreeloader, Sep 25, 2007.

  1. ffreeloader

    ffreeloader Terabyte Poster

    One of the guys I'm working with has spent the last three days trying to move data from a proprietary database format into a new database. Yesterday evening, after all the frustration he had experienced trying to do this in Windows, he asked me for help and emailed me the data in an XML file. In less than 2 hours I had transformed the XML file into a .csv file and deleted all the extraneous text that XML uses for its formatting information, leaving only the raw data he needed. That includes the time I spent searching for, and installing, a utility that would translate an XML file into a straight text file, plus my learning curve with that tool and with sed, and re-familiarizing myself with regular expressions.

    The power of the *nix text-based tools is just amazing. I did this with two commands. First I transformed the file, then I simply used "less" to pipe the data into "sed" and massaged the text using regular expressions. I would imagine someone with quite a bit of previous sed experience could have done the entire job in 5 minutes or so.
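
    For the curious, the sed end of it was roughly along these lines. The tag and file names are made up since I can't post his data, it assumes GNU sed, and it's the shape of the thing rather than my exact commands:

        # collect each <row>'s values in the hold space, then print them as one CSV line
        sed -n '
          # skip blank lines
          /^[[:space:]]*$/ d
          # start of a record: empty the hold space
          /<row>/ { s/.*//; h; d; }
          # end of a record: pull the collected values back, join them with commas, print
          /<\/row>/ { x; s/\n/,/g; s/^,//; p; d; }
          # anything else: strip the surrounding tags and stash the bare value
          s/^[[:space:]]*<[^>]*>//
          s/<\/[^>]*>[[:space:]]*$//
          H
        ' export.xml > export.csv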

    Uhh, and before you think this guy isn't experienced or doesn't know what he is doing, he is, and he does. He's been working with databases for about a decade and does this kind of stuff on a regular basis. He just hadn't thought of trying any *nix tools until he was ready to give up in frustration on what was available to him at a reasonable cost in Windows. He's a smart guy; he just didn't know the correct tool for the job was available.
     
    Certifications: MCSE, MCDBA, CCNA, A+
    WIP: LPIC 1
  2. hbroomhall

    hbroomhall Petabyte Poster Gold Member

    I find I use sed, grep, awk and cut on a regular basis to prep data for loading into a database!
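
    A typical prep run looks something like this (file name, delimiters and column numbers are invented for the example):

        # drop blank lines and the header row, strip DOS line endings,
        # keep three columns, and re-delimit them with commas for the loader
        grep -v '^[[:space:]]*$' raw.tsv \
          | sed -e '1d' -e 's/\r$//' \
          | cut -f2,4,7 \
          | awk -F'\t' -v OFS=',' '{ $1 = $1; print }' > load.csv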

    Harry.
     
    Certifications: ECDL A+ Network+ i-Net+
    WIP: Server+
  3. tripwire45
    Honorary Member

    tripwire45 Zettabyte Poster

    I came across a handy article on regular expressions and posted it at the Linux Tutorial site.
     
    Certifications: A+ and Network+
  4. dmarsh
    Honorary Member 500 Likes Award

    dmarsh Petabyte Poster

    Anyone with basic programming or scripting knowledge should have been able to complete the task in a day by the sounds of it.

    All these 'UNIX' tools, grep, awk, sed, are also available on Windows.

    http://www.cygwin.com/

    I messed with an early version of awk and sed for DOS about 12 years back!

    I ended up rewriting it in C as the performance of the AWK/SED version was horrible!

    Regular Expressions make good lexers but they do not make good parsers in general.

    God knows what this guy's been doing for the last decade if his database experience is mainly on Windows and he can't perform basic ETL tasks...

    Also, in general, other than for very basic tasks, why on earth would you want it in CSV, and why would you not just use a standard XML parser instead of writing your own with AWK/SED?

    In fact you could probably have used XSLT on any platform to convert to CSV and had it done in under 2 hours.
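
    Something like xmlstarlet, for instance, which turns a handful of options into an XSLT stylesheet under the hood and runs it through libxslt, would get you to CSV in one line; the element names here are invented:

        # pull the fields out of every <row> and emit them comma-separated, one record per line
        xmlstarlet sel -t -m '//row' \
            -v 'field1' -o ',' -v 'field2' -o ',' -v 'field3' -n \
            export.xml > export.csv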
     
  5. ffreeloader

    ffreeloader Terabyte Poster

    Well, for the first thing, the only way he had to pull the data out of the proprietary database left it in an XML format that wasn't exactly standard; he had used a couple of XML parsers and couldn't get the data out in a usable format.

    The second thing was that he was writing an XSLT and having problems with it because of how the data came out of the database. Me? I've never seen an XSLT and wouldn't know one if it hit me upside the head, so I would have a real steep learning curve with it.

    Third, I'm pretty comfortable with the format the *nix tools take, even the ones I hadn't really used before, and I worked my way through a book on regexes about a year ago, so I wasn't going into that cold. As to regexes in general, I have found them to be really powerful at parsing and finding data, and I'm not alone in that regard. The way I did it, though, was to eliminate everything I didn't want and then put the data into the format I wanted. To do any more from this point forward would only take me a matter of a few minutes, as I have a much better idea of how to use sed's syntax to solve a problem. The first time through anything is always really slow.
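
    To make that concrete, it boiled down to something like this, assuming (purely for the example) a dump that is just a <table> wrapper around flat <tag>value</tag> lines, three fields to a record:

        # pass 1: delete the XML declaration, the wrapper tags and blank lines
        # pass 2: turn each remaining "<tag>value</tag>" line into its bare value
        # pass 3: glue every three values back together as one CSV row
        sed -e '/<?xml/d' -e '/<\/*table>/d' -e '/^[[:space:]]*$/d' dump.txt \
          | sed 's/^[[:space:]]*<\([A-Za-z_][A-Za-z0-9_]*\)>\(.*\)<\/\1>[[:space:]]*$/\2/' \
          | paste -d, - - - > table.csv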
     
    Certifications: MCSE, MCDBA, CCNA, A+
    WIP: LPIC 1
  6. ffreeloader

    ffreeloader Terabyte Poster

    I can see why. I was pretty amazed at how fast it is. Once I'm really comfortable with sed I'll be able to massage text pretty rapidly.
     
    Certifications: MCSE, MCDBA, CCNA, A+
    WIP: LPIC 1
  7. ffreeloader

    ffreeloader Terabyte Poster

    Certifications: MCSE, MCDBA, CCNA, A+
    WIP: LPIC 1
  8. dmarsh
    Honorary Member 500 Likes Award

    dmarsh Petabyte Poster

    Well, well done freddy anyway for sorting it out! :biggrin I still think this other guy sounds like a loser though! :oops:

    Well, XML is a standard metadata format; the end format is of course down to the writer or another standard. It sounds like a very sensible thing for the DB vendor to do. Most DBs these days have some form of XML support, and that, along with TSQL or another database scripting language and maybe some temp tables, would probably get the job done as well. Then there are all the ETL tools, DTS and the like, though it sounds like he's a point-and-click man! These generally only work well in certain scenarios.

    Finally, XSLT: basically, think Prolog. It's a tree-walking algorithm with predicates, and in this case the tree is a DOM. It does take a little getting used to; however, all these solutions would probably be a much better fit in general for the problem than awk, grep and sed.
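
    If you want a feel for the predicate side on its own, a bare XPath query shows the idea (element and file names invented again):

        # walk the tree and keep only the <row> nodes whose <qty> child is over 100,
        # then print the text of their <name> children
        xmllint --xpath '//row[qty > 100]/name/text()' export.xml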
     
  9. ffreeloader

    ffreeloader Terabyte Poster

    Well, he's not a loser. He's actually been pretty successful, and he has the best product of its type in a very specialized, niche market. He's a pretty smart guy, to tell the truth. He was just struggling with this one. I don't know, he had a heart attack a month and a half or so ago, and maybe it has affected him more than we know. I know he had to have 5 stents put in, so he was in pretty tough shape. As to his being a point-and-click guy, well, you nailed that one. :biggrin

    As my "reward" for doing the first one, he sent me more last night. :rolleyes: I finished the last table in about 10 minutes, and all of them in quite a bit less time than it took me to do the first job he sent. So, whether or not sed is suitable for the job, well, I'll leave that up to you to decide, but as I get more familiar with it my productivity is starting to get there. The funny thing about all of this is that not a single table came out of that database in the same format. I just think that's what he gets for choosing a proprietary solution for his database: when you try to move away from it because of its limitations, you run into problems.
     
    Certifications: MCSE, MCDBA, CCNA, A+
    WIP: LPIC 1
  10. dmarsh
    Honorary Member 500 Likes Award

    dmarsh Petabyte Poster

    Sorry to hear about his ill health; I hope he makes a speedy recovery. I'd take a year off if I had a successful product and a heart attack, to recover and get things in perspective.

    I expect they are in the same format; the tags will be named and laid out in accordance with the tables, relations, columns etc. XML is a standard, and it should be effectively isolating you from the legacy database. You still need to know the format of YOUR data though!

    I didn't say awk, sed and grep wouldn't work. ANY general-purpose language could be used to solve the problem, as well as quite a few non-general ones; it's just that some are more suitable than others.

    In the same situation I'd often use C/C++ or Java, simply because I'm most familiar with them. For one-off utils it often doesn't really matter; it's just that it's good to know the various ways you could have done it and why. Also, these 'one-off' utils often have a habit of taking on a life of their own.
     
