How a Temporary Position Taught Me Data Entry Is Doomed to Be Replaced by ETL
Desperate for a summer job that would add more data science experience to my resume, I accepted an offer from my brother to work for a company specializing in golf cart parts distribution. It was a data entry role: take around 10,000 rows of messy, manually entered inventory data and move them into a more organized CSV that could be read by new software Eagle was planning to use.
Any computer scientist or data scientist would probably wince at the thought of moving 10,000 rows of data from one place to another by hand, especially since data entry has long been treated as a relatively unskilled role. However, having the mind of a data scientist, I saw an easy opportunity to turn the job into a data transformation project, all for a wage not far above what I could have easily earned as a barista at Wawa.
Now, your first reaction to that may be, “Why use more skills than you have to, especially for a paycheck like that?” That is a completely understandable thought! First of all, I probably wouldn’t do this again; I was mainly looking for resume-worthy experience while still in school. Moreover, I cannot stand unskilled labor. My brain needs far more stimulation than tedious, repetitive tasks provide.
That’s why I decided to make this an ETL project. The description of the work sounded like a perfect fit for ETL (extract, transform, load) techniques. Thinking like a computer scientist as well as a data scientist, I saw that I could loop over the rows and parse their strings in code rather than scanning them by eye, which is far less reliable than even basic parsing. That lets me answer the earlier question with another question: “Why do more work than you have to?”
This time, instead of my usual choice of T-SQL, I chose to learn pandas and do all my transformations in Python. Around 40 columns in the new table had to be filled according to several different constraints, and for most of them the only inputs I had were the part name and part number from a three-column source table. I have a Jupyter notebook available on GitHub that details the transformations and includes the code I used, for anyone interested. The data in it is modified, and unfortunately I am not allowed to provide the datasets themselves.
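To give a flavor of what deriving new columns from a part name and part number can look like in pandas, here is a minimal sketch. It is not the code from the notebook: the column names, the prefix-to-make lookup, the " - " naming pattern, and the sample rows are all made up for illustration, and the real table had far more columns and messier rules.

```python
import pandas as pd

# Hypothetical stand-in for the three-column source table (the real data is not public).
raw = pd.DataFrame({
    "part_number": [" ez-1001 ", "CC-2040", "ya-0315"],
    "part_name": [
        "Brake Cable - EZGO TXT",
        "Seat Cover - Club Car DS",
        "headlight kit - Yamaha Drive",
    ],
    "quantity": ["4", "12", "1"],
})

# Hypothetical rule: the part-number prefix identifies the cart manufacturer.
PREFIX_TO_MAKE = {"EZ": "EZGO", "CC": "Club Car", "YA": "Yamaha"}


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Derive cleaned, structured columns from the messy source columns."""
    out = pd.DataFrame(index=df.index)

    # Normalize part numbers: trim stray whitespace and force upper case.
    out["part_number"] = df["part_number"].str.strip().str.upper()

    # Map the prefix before the dash to a manufacturer name.
    out["make"] = out["part_number"].str.split("-").str[0].map(PREFIX_TO_MAKE)

    # Assumed naming pattern: "<description> - <model it fits>".
    # partition() always yields three columns: before, separator, after.
    parts = df["part_name"].str.partition(" - ")
    out["description"] = parts[0].str.strip().str.title()
    out["fits_model"] = parts[2].str.strip()

    # Quantities were typed as text; coerce them to integers, treating junk as 0.
    out["quantity"] = (
        pd.to_numeric(df["quantity"], errors="coerce").fillna(0).astype(int)
    )
    return out


clean = transform(raw)
csv_text = clean.to_csv(index=False)  # CSV text ready to write out for the new software
print(csv_text)
```

Each rule is one vectorized line applied to every row at once, which is the whole appeal: once a rule is written, it costs the same for three rows as for 10,000.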
Although I am still working in the role, I can already say I have come away with a lot more experience and wisdom. It is obvious to me how easily the data could be transformed using Python. I doubt I am the first one to change the responsibilities of a data entry position. Even if I am, there is no way I am going to be the last.
Hope you enjoyed the read! As usual, if you have any questions, please feel free to reach out! LinkedIn or email is best.