I am trying to parse a table from a Microsoft site (the GitHub version of it) into proper PowerShell objects. I'll share the relevant code so you can test it. It parses what I want, but I want the results to be already trimmed (no leading or trailing spaces or line breaks). I also need the result for "CNG Key Isolation", which is formatted differently. Only for that block of data does my regex include line breaks, and I could not get it to work. I know I could do some post-processing in PowerShell after the regex, but I want to get better at regex.
My not-yet-optimized regex looks like this:
(?:^##\s*(?<ServiceTitle>[^\r\n#]*)[\r\n\s]*\|\s Name\s \|\s Description\s \|(?:[\r\n\s\|\-\*] Service name[\|\*\s] (?<ServiceName>[^\|]*?)(?: ?\|)|[\r\n\s\|\-\*] Description[\|\*\s] (?<Description>[^\|]*?)(?: ?\|)|[\r\n\s\|\-\*] Installation[\|\*\s] (?<Installation>[^\|]*?)(?: ?\|)|[\r\n\s\|\-\*] Startup type[\|\*\s] (?<StartupType>[^\|]*?)(?: ?\|)|[\r\n\s\|\-\*] Recommendation[\|\*\s] (?<Recommendation>[^\|]*?)(?: ?\|)|[\r\n\s\|\-\*] Comments[\|\*\s] (?<Comments>[^\|]*?)(?: ?\|))*)
You can test it here: https://regex101.com/r/xQDRCO/1
The data to parse comes from: https://raw.githubusercontent.com/MicrosoftDocs/windowsserverdocs/main/WindowsServerDocs/security/windows-services/security-guidelines-for-disabling-system-services-in-windows-server.md
It should take one block of data per service and try to capture
"ServiceTitle","ServiceName","Description","Installation","StartupType","Recommendation","Comments"
regardless of the order they appear in or whether one of them is missing. "ServiceTitle" is special: it must always be present.
Here is the PowerShell code I have tested so far:
$fields = "ServiceTitle","ServiceName","Description","Installation","StartupType","Recommendation","Comments"
$RequestData = Invoke-WebRequest -UseBasicParsing -Uri https://raw.githubusercontent.com/MicrosoftDocs/windowsserverdocs/main/WindowsServerDocs/security/windows-services/security-guidelines-for-disabling-system-services-in-windows-server.md
$RegExMatches = [Regex]::Matches($RequestData.content,'(?:^##\s*(?<ServiceTitle>[^\r\n#]*)[\r\n\s]*\|\s Name\s \|\s Description\s \|(?:[\r\n\s\|\-\*] Service name[\|\*\s] (?<ServiceName>[^\|]*?)(?: ?\|)|[\r\n\s\|\-\*] Description[\|\*\s] (?<Description>[^\|]*?)(?: ?\|)|[\r\n\s\|\-\*] Installation[\|\*\s] (?<Installation>[^\|]*?)(?: ?\|)|[\r\n\s\|\-\*] Startup type[\|\*\s] (?<StartupType>[^\|]*?)(?: ?\|)|[\r\n\s\|\-\*] Recommendation[\|\*\s] (?<Recommendation>[^\|]*?)(?: ?\|)|[\r\n\s\|\-\*] Comments[\|\*\s] (?<Comments>[^\|]*?)(?: ?\|))*)',[System.Text.RegularExpressions.RegexOptions]::Multiline)
$FullList = @()
foreach ($entry in $RegExMatches) {
    $ServiceAsObject = [pscustomobject]@{}
    foreach ($field in $fields) {
        $ServiceAsObject | Add-Member -MemberType NoteProperty -Name $field -Value $entry.Groups[$field].Value
    }
    $FullList += $ServiceAsObject
}
$FullList[15..17] # three items to show the problem I have with "CNG Key Isolation"
I don't often write regexes as large as this one, so feel free to give me feedback that helps me improve.
Thank you, An-Dir
CodePudding user response:
This may not be what you are looking for, but you could do something like the following to output an array of custom objects:
$output = switch -regex ($RequestData.Content -split '\r?\n') {
    '^##\s' {
        # tracking empty lines since there is one under the service title
        # start new hash table when a new service is found
        # remove ## from service title names
        $emptyLineCount = 0
        $hash = [ordered]@{}
        $hash.ServiceTitle = $_ -replace '^##\s'
    }
    '\| \*\*' {
        # split on | and surrounding spaces
        # replace ** so name is cleaner
        $key,$value = ($_ -split '\s*\|\s*' -replace '\*\*')[1,2]
        $hash[$key] = $value
    }
    '^$' {
        # count the empty lines seen since the last service title
        $emptyLineCount++
        # when second empty line is reached in a service block, output object
        if ($hash.ServiceTitle -and $emptyLineCount -eq 2) {
            [pscustomobject]$hash
        }
    }
}
# Finding a service by title
$output | Where ServiceTitle -eq 'CNG Key Isolation'
Splitting the content produces an array of lines, which makes it easy to process with a switch statement.
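To see why the indexes [1,2] pick out the key and the value, here is a small sketch of the split step on a single row. The sample line is made up to mimic the row layout the code above expects, and $line and $parts are illustrative names only.

```powershell
# Made-up sample row in the "| **key** | value |" layout (an assumption about the page)
$line = '| **Service name** | KeyIso |'

# -split keeps the empty strings before the first and after the last pipe,
# so the array holds: '', 'Service name', 'KeyIso', ''
$parts = $line -split '\s*\|\s*' -replace '\*\*'

# elements 1 and 2 are therefore the key and the value, already free of padding
$key, $value = $parts[1,2]
"$key = $value"
```

Because the split pattern swallows the whitespace around each pipe, no separate Trim() call is needed for rows in this layout.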
CodePudding user response:
Assuming you have all that text in $RequestData.Content, I wouldn't try to create one large regex to parse it all into usable objects. Instead, I would do:
# first split the tables from the rest of the text and work on the table lines only
$result = ($RequestData.content -split '(?m)^The following tables.*:')[-1].Trim() -split '(?m)^## ' |
    Where-Object { $_ -match '\S' } |
    ForEach-Object {
        # split each block to parse out the title and the table data
        $title, $table = ($_.Trim() -split '(\r?\n){2}', 2).Trim()
        # now remove the markdown stuff from the data and convert it using ConvertFrom-Csv
        $data = (($table -replace '(?m)^\|--\|--\||[*]{2}|^\||\|$' -replace '\s\|\s', '|') -split '\r?\n' -ne '').Trim() | ConvertFrom-Csv -Delimiter '|'
        # set up an ordered Hashtable to store the data
        $hash = [ordered]@{ServiceTitle = $title}
        foreach ($item in $data) {
            $hash[$item.Name] = $item.Description
        }
        # output real objects
        [PsCustomObject]$hash
    }
$result
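On the original wish to get trimmed values straight out of the regex: one option is to keep the padding outside the named groups and make the value capture lazy. The following is only a sketch under assumptions. The sample text is made up with deliberately uneven spacing (it is not copied from the real page), and the group names Name and Value as well as the variables $sample, $pattern and $rows are illustrative.

```powershell
# Made-up sample rows with deliberately uneven spacing (not taken from the real page)
$sample = @'
## Example Service

| Name | Description |
|--|--|
| **Service name** |   ExampleSvc   |
|   **Startup type** | Manual|
'@

# [ \t]* sits outside the groups and absorbs the padding; the lazy .*? stops
# before any trailing spaces, so both captures come back without padding.
# The required ** skips the header row and the |--|--| separator row.
$pattern = '(?m)^\|[ \t]*\*\*(?<Name>[^*|\r\n]+)\*\*[ \t]*\|[ \t]*(?<Value>.*?)[ \t]*\|[ \t]*\r?$'

$rows = foreach ($m in [regex]::Matches($sample, $pattern)) {
    [pscustomobject]@{
        Name  = $m.Groups['Name'].Value
        Value = $m.Groups['Value'].Value
    }
}
$rows
```

Matching one row at a time like this sidesteps the ordering and missing-field problem of the single large pattern, at the cost of grouping the rows per service in a separate step.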
