Event 5: My notes…

We are very close to the end of Scripting Games 2013. But before we finish with event 6, it’s time to say a few words about event 5… This time I will focus on things I liked and things I didn’t like, without paying too much attention to the category in which I saw them.

Likes

I must say that I enjoyed more than one script and one-liner. I’ve seen quite a few interesting techniques implemented. I haven’t paid too much attention to the performance of those techniques because, quite frankly, I had no way to test it. I know a few attendees, especially in the advanced category, spent quite some time on getting the information as fast as possible. Hat tip to those of you who made it quicker than my solution (which was partially wrong: it would report both client and server IPs) by orders of magnitude.

So which techniques did I like? Using hashtables and other collections to keep only valid results. Really: if you want to be efficient, you won’t waste memory on duplicates. A simple array is probably “good enough” if you have few results, but you should really avoid it when you expect it to grow very fast, because adding an element with += creates a new array every time. Use collections that support adding; System.Array is not one of them.
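A minimal sketch of the idea, assuming the IP addresses have already been extracted from the log lines. The sample values and variable names are made up for illustration.

```powershell
# Hypothetical sample data: IP addresses as they might come out of log lines.
$found = '10.0.0.1', '10.0.0.2', '10.0.0.1', '192.168.1.5', '10.0.0.2'

# Option 1: hashtable keys are unique, so assigning the same key twice is harmless.
$seen = @{}
foreach ($ip in $found) {
    $seen[$ip] = $true
}

# Option 2: a HashSet stores each value once; Add() returns $false for a duplicate.
$unique = New-Object 'System.Collections.Generic.HashSet[string]'
foreach ($ip in $found) {
    [void]$unique.Add($ip)
}

$seen.Count
$unique.Count
```

Either way only the distinct addresses are kept in memory, no matter how many duplicates pass through.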

OK, this is for collecting unique values, but first you need to find any… I liked that so many people read about IIS logs (I haven’t, my bad probably) and then took advantage of their format to use the Import-Csv cmdlet instead of regular expressions. From what I’ve heard it was relatively slow, but still: it sounds like a smart idea to follow the format of your log file rather than parse it like crazy. ;)
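Here is a rough sketch of that approach, assuming the logs use the space-delimited W3C format with a #Fields directive. The log lines are invented, and ConvertFrom-Csv stands in for Import-Csv so that the example needs no file on disk.

```powershell
# Hypothetical log lines in the W3C extended format (content made up for this example).
$log = @(
    '#Software: Microsoft Internet Information Services 7.5'
    '#Fields: date time s-ip cs-method cs-uri-stem c-ip sc-status'
    '2013-05-20 10:00:01 10.0.0.5 GET /index.html 192.168.1.10 200'
    '2013-05-20 10:00:02 10.0.0.5 GET /logo.png 192.168.1.11 200'
    '2013-05-20 10:00:03 10.0.0.5 GET /index.html 192.168.1.10 304'
)

# Column names come from the #Fields directive.
$fieldsLine = $log | Where-Object { $_ -like '#Fields:*' } | Select-Object -First 1
$header = ($fieldsLine -replace '^#Fields:\s*', '') -split ' '

# Skip the directive lines and let the CSV parser split the rest on spaces.
$rows = $log |
    Where-Object { $_ -notlike '#*' } |
    ConvertFrom-Csv -Delimiter ' ' -Header $header

# Each row is now an object; the client IP is just a property.
$rows | ForEach-Object { $_.'c-ip' }
```

With real files, the same -Delimiter and -Header parameters are available on Import-Csv, but the lines starting with # would still have to be filtered out.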

Using good regex patterns: some people really get it (or at least are smart enough to use a good source for working examples). I also liked the idea of validating the IP using the [IPAddress] type accelerator after a more basic regex that would gladly accept strings that may or may not be IP addresses. Neat. :)
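A small sketch of that two-step check, under the assumption that the loose regex always yields four dot-separated groups. The candidate strings are made up, and TryParse is used here instead of a cast so that invalid input does not throw.

```powershell
# Loose pattern: four groups of 1-3 digits, no range check on the octets.
$loosePattern = '\b(\d{1,3}\.){3}\d{1,3}\b'

# Hypothetical candidates; two of them are not valid IPv4 addresses.
$candidates = '192.168.1.10', '999.300.1.1', '10.0.0.256', '8.8.8.8'

$valid = foreach ($text in $candidates) {
    $parsed = $null
    # Step 1: cheap regex match. Step 2: let [IPAddress] decide whether it is valid.
    if ($text -match $loosePattern -and
        [IPAddress]::TryParse($Matches[0], [ref]$parsed)) {
        $Matches[0]
    }
}

$valid
```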

Dislikes

First and foremost: the “Get-Content | Select-String” style. I’ve seen a few people who would combine the content of all logs in one big variable and then parse the whole thing. And most of those people didn’t use any ReadCount value, so they would read all files line by line. Select-String has a -Path parameter, why not use it? I think Get-Content is valid only if you really want to take advantage of its options.
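To illustrate, here is a sketch that lets Select-String open the files itself. The temporary folder, file names and log lines are all hypothetical and exist only to make the example self-contained.

```powershell
# Hypothetical setup: two small log files in a temporary folder.
$dir = Join-Path ([IO.Path]::GetTempPath()) ([Guid]::NewGuid().ToString())
New-Item -ItemType Directory -Path $dir | Out-Null
Set-Content -Path (Join-Path $dir 'a.log') -Value 'GET / 192.168.1.10 200', 'GET /x 192.168.1.11 404'
Set-Content -Path (Join-Path $dir 'b.log') -Value 'GET / 192.168.1.10 200'

$pattern = '\b(\d{1,3}\.){3}\d{1,3}\b'

# Select-String reads every file matching the wildcard; no Get-Content needed.
$hits = Select-String -Path (Join-Path $dir '*.log') -Pattern $pattern |
    ForEach-Object { $_.Matches[0].Value }

$hits

# Clean up the temporary folder.
Remove-Item -Path $dir -Recurse -Force
```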

The next thing I didn’t like was using regular expressions that were too general to give valid results. Also: when you use regular expressions, you should consider using everything they have to offer. The two patterns below are equivalent:

'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b'            
'\b(\d{1,3}\.){3}\d{1,3}\b'

Neither is an accurate pattern for an IPv4 address (both accept octets above 255). Notice the difference? It’s not very big here, where we repeat the same pattern three times. Imagine you repeat it more often…:

if ('My MAC Address is: 00-27-10-1A-55-14' -match             
    '([\da-f]{2}-){5}[\da-f]{2}') {            
    $Matches[0]            
}
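The same grouping helps when tightening the IPv4 pattern. The following is one possible stricter version, not taken from any entry: the $octet helper is a name introduced only for this sketch, and it limits each octet to 0-255.

```powershell
# One octet in the range 0-255 ($octet is a helper name used only in this sketch).
$octet = '(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)'

# Three "octet + dot" groups followed by a final octet, as in the shorter pattern above.
$strict = "\b($octet\.){3}$octet\b"

$okAddress = '192.168.1.10' -match $strict
$badOctet  = '10.0.0.256'   -match $strict
$badStart  = '999.1.1.1'    -match $strict

$okAddress, $badOctet, $badStart
```

Writing the octet once and repeating the group keeps the stricter pattern readable; spelled out four times it would be much harder to check.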

I’ve also seen people who tried to make sure that they are matching something surrounded by spaces. Well, that’s what \b (word boundary) is for. If you use it, you don’t have to .Trim() the resulting string to get rid of extra spaces. And if your “target” is surrounded by some other characters (e.g. quotes), you can always use another technique, lookbehind and lookahead:

if ('This is long text with "quoted text" in it.' -match             
    '(?<=").*?(?=")') {            
    $Matches[0]            
}

As you can see, I got only the text inside the quotes. No .Trim() needed. ;)

And last but certainly not least: getting all the data first and extracting unique values only in the last line. From my point of view, having so many possible duplicates is screaming for some way of throwing them away during the whole process, not at the final stage. I know that the example zip file worked pretty well with that approach, but if you plan to get real work done, you can expect many more IPs than the few we got from the examples. Don’t collect all of them if you do not plan to use them… ;)

3 thoughts on “Event 5: My notes…”

  1. I agree that Get-Content | Select-String is horrible, but to me the horror is the Select-String.

    I like Select-String for searching and reporting on matches in a file, but IMHO it is a poor choice for parsing operations. It returns MatchInfo objects. These are relatively complex, “rich” objects that contain a lot of information about each match (filename, line number, precontext, postcontext, etc.) in addition to the match values. Once you get past the first few lines in the file header, you’re going to get one of those for every line in the log file. And then throw everything away except the string value of the capture.

    For parsing, Get-Content followed by simple -match and/or -replace operators will serve just as well.

      • It’s great information to have if you need it and you can’t get that doing a simple -match operation, but getting it and creating those nice objects for you predictably comes with a cost.

        Using Select-String takes 4-5 times as long as applying the same regex using -match. It’s well worth it if you need the additional information it provides, but overkill if all you wanted out of it is the strings.
