I have a bunch of address point data collected from a variety of sources. I had earlier standardized these datasets in SQL into the following columns:
- Address1: Street Address
- Address2: Unit Address
- City
- State
- Zip
Each of these columns were built up from different datasets, and since the data were collected from different sources, I needed to manually build it up.
For instance, the standardization for one county might look like this:
INSERT INTO Standardized_Addresses_02020 (Address1, Address2, City, State, Zip)
SELECT
Address1 = Number PreDir Street StType
Address2 = Unit
City = Postal_Area
State = StateAbbv
Zip = ZIP_CODE
FROM Raw_Address_02020
But for another county, the columns in the raw data used to build up the columns in the standardized data are different. For instance:
INSERT INTO Standardized_Addresses_02100 (Address1, Address2, City, State, Zip)
SELECT
Address1 = Add_num Prefix St_name PostDir Type
Address2 = Suite
City = POST_COMM
State = State
Zip = ZIP
FROM Raw_Address_02100
I'm trying to write some python code that will automate this whole process, but because the raw data have different column names, it is starting to get complicated.
The method that I'm currently considering is to create a table with all column names ('Master_Columns') from INFORMATION_SCHEMA.COLUMNS, and tag those columns as either being in Address1, Address2, City, State, or ZIP. Then I would create a python function that would check if the columns in the raw data exist in the Master_Columns, and if they do, to use them in the standardization.
This seems messy and long winded though, and I was curious to see if anyone has better ideas that could optimize this. I hope my explanation was clear. Thank you in advance!
CodePudding user response:
You can create a config file like config.ini where you can define the mapping for each column for each source data. This way you only to refer the config file and not many changes are required in the code.
[source-1]
address1=Number PreDir Street StType
[source-2]
address=Add_num Prefix St_name PostDir Type
