Home > Blockchain >  Cleaning data by adding/removing characters to a string when it meets certain conditions T-SQL
Cleaning data by adding/removing characters to a string when it meets certain conditions T-SQL

Time:02-04

I'm looking for help with cleaning a column in my data set so that I can join to another table.

The first data set is my complete data and includes what we call "reference_numbers" which relate to a specific case. Here is a dummy sample:

reference_number case_opened case_closed
01353568-00000001 11/01/2021 03/02/2022
09736473-00000009 21/04/2005 19/07/2021
05839576-00000012 13/09/2014 19/12/2017
09364857-00000006 13/09/2014 19/12/2017

As you can see, the "reference_number" is 8 digits then hyphen (-) and then another 8 digits. This is how a reference number should look.

My second data set is full of the same "reference_numbers". However, there is inconsistencies in the character length as they are often written differently by individuals:

reference_number Case_workers
1353568-00000001 5
09736473-9 10
5839576-12 7
09364857-000000006 4
  • The first reference_number in the second data set is missing the first "0"
  • The second reference_number in the second data set is missing seven "0" after the hyphen
  • The third reference_number in the second data set is missing both the first "0" and six "0" after the hyphen
  • The fourth reference_number in the second data set has too many digits after the hyphen (there is supposed to be seven 0's)

I want to be able to join the first data set onto the second data set using the reference_number. However, I need to clean them first. Is this possible and is there any efficient way of doing this?

Thanks

CodePudding user response:

If the rules are so specific, you could try to use a combination of STRING_SPLIT and STRING_AGG:

SELECT 
    t.reference_number,
    STRING_AGG(RIGHT('00000000' s.value,8),'-') new_reference_number
FROM dbo.SecondTable t
CROSS APPLY STRING_SPLIT(t.reference_number,'-') s
GROUP BY t.reference_number
;

Using the sample data you posted, the results are:

╔════════════════════╦══════════════════════╗
║  reference_number  ║ new_reference_number ║
╠════════════════════╬══════════════════════╣
║ 09364857-000000006 ║ 09364857-00000006    ║
║ 09736473-9         ║ 09736473-00000009    ║
║ 1353568-00000001   ║ 01353568-00000001    ║
║ 5839576-12         ║ 05839576-00000012    ║
╚════════════════════╩══════════════════════╝

CodePudding user response:

select reference_number,
CONCAT(left(reference_number, charindex('-', reference_number) - 1),'-',RIGHT(CONCAT('000000000',right(reference_number, len(reference_number) - charindex('-', reference_number))),9)) as NewReferenceNumber
from YourSecondTableName

Reference_Number New_Reference_Number
1353568-00000001 1353568-000000001
09736473-9 09736473-000000009
5839576-12 5839576-000000012
09364857-000000006 09364857-000000006
  •  Tags:  
  • Related