Home > Back-end >  Trying to automate a data standardization process I started in SSMS in Python
Trying to automate a data standardization process I started in SSMS in Python

Time:01-25

I have a bunch of address point data collected from a variety of sources. I had earlier standardized these datasets in SQL into the following columns:

  1. Address1: Street Address
  2. Address2: Unit Address
  3. City
  4. State
  5. Zip

Each of these columns were built up from different datasets, and since the data were collected from different sources, I needed to manually build it up.

For instance, the standardization for one county might look like this:

INSERT INTO Standardized_Addresses_02020 (Address1, Address2, City, State, Zip)
SELECT 
    Address1 = Number   PreDir   Street  StType
    Address2 = Unit
    City = Postal_Area
    State = StateAbbv
    Zip = ZIP_CODE
FROM Raw_Address_02020

But for another county, the columns in the raw data used to build up the columns in the standardized data are different. For instance:

INSERT INTO Standardized_Addresses_02100 (Address1, Address2, City, State, Zip)
SELECT 
    Address1 = Add_num  Prefix   St_name  PostDir   Type
    Address2 = Suite
    City = POST_COMM
    State = State
    Zip = ZIP
FROM Raw_Address_02100

I'm trying to write some python code that will automate this whole process, but because the raw data have different column names, it is starting to get complicated.

The method that I'm currently considering is to create a table with all column names ('Master_Columns') from INFORMATION_SCHEMA.COLUMNS, and tag those columns as either being in Address1, Address2, City, State, or ZIP. Then I would create a python function that would check if the columns in the raw data exist in the Master_Columns, and if they do, to use them in the standardization.

This seems messy and long winded though, and I was curious to see if anyone has better ideas that could optimize this. I hope my explanation was clear. Thank you in advance!

CodePudding user response:

You can create a config file like config.ini where you can define the mapping for each column for each source data. This way you only to refer the config file and not many changes are required in the code.

[source-1]
address1=Number   PreDir   Street  StType

[source-2]
address=Add_num  Prefix   St_name  PostDir   Type
  •  Tags:  
  • Related