Data Archaeology as a Skill

Depending on the age of the organization, the data needed for modeling, analyses, and other important reporting is likely to be stored in archaic, undocumented, outdated systems. An important skill for a data professional is being willing, and technically able, to dig into these old data systems in order to keep moving the business forward. This often means reverse-engineering old code, testing and researching how tables interact with one another, tracing where in the application the data originates, talking with business unit leaders and users for additional context, and documenting what you find.

The official, wiki definition of data archaeology deals primarily with outdated storage systems in the physical sense (floppy disks, etc.). But the concept extends to the purely digital sense, and in my experience it is more prevalent than one might think, especially in organizations that have been around and collecting data for more than a few years.

From what I can tell, this is a ubiquitous data skill that isn’t talked about often, but anyone who employs and masters it will excel at almost any data professional task.

The What

The time to throw on your detective / archaeologist costume is when:

  • There’s a need to rebuild a model / algorithm

  • A report, analysis, or model stops performing / running as expected

  • A former project needs to be “dusted off” and restarted after having been “shelved” for X number of years

  • Someone pops into the team’s email inbox with a question about some data point no one recognizes, but that somehow originated from the team

The Why

Any organization of any age begins accruing technical debt almost immediately, in some cases even before it’s fully launched. I’ve seen it in small organizations of 1-5 people, and I’ve seen it in some of the biggest, most visible, and most technically advanced organizations. Technical debt is a real thing, and it exists nearly everywhere to varying degrees.

Having processes, systems, and technology in place that require work to maintain and modify as business needs change trickles into any data professional’s job in the form of a mystery. Depending on a lot of things, documentation levels will vary, code structure and best practices will shift, software solutions will contradict one another, and tribal knowledge will be a driving force. Being comfortable with testing, trying, troubleshooting, and munging is key to a Data Archaeologist’s success.

In my experience it is not practical to expect:

  • One software solution for one function (e.g., one database solution, one data visualization / analysis solution, one ETL solution, etc.)

  • One overall ecosystem to support all departments, data needs, etc. (e.g., a Microsoft-only stack, a Salesforce-only stack, etc.)

  • Centralized repositories, documentation, and simple data structures

Even as a data professional running a small business, using multiple solutions and knowing the pitfalls of technical debt, it takes a lot for me to run any sort of data analysis. I have data in my POS, my online sales system, various merchant vendors, and Etsy. My business is ~1.5 years old at the time of writing, and I have an obscene (I shouldn’t be admitting this) number of Google Sheets supporting my budgeting and inventory operations. I say this to illustrate how quickly technical debt can accrue, even with the smallest number of people.

“Solutions” out there are often expensive, complex, and exclusionary (meaning they want you to be 100% bought into ONLY their ecosystem). This complicates digging an organization OUT of technical debt (or at least reducing / mitigating it).

As a data professional, much of this is out of your control. Much of it came before you and will continue long after you. Other decision-making powers are at work, and you can’t fix it all. The skill comes in learning how to exist within this framework, ride out the waves of reverse-engineering and munging, and help solve some of the data problems and mysteries otherwise lost to history.

The How

Soft Skills

Interviewing / Iterative Communication

While all of the skills listed in this section are important, this is the one that will speed up the technical research the fastest. Figuring out who knows something about the problem you’re trying to solve and then effectively interviewing that person to build out context is the foundation for the rest.

The fact is, oftentimes the “who” is no longer available: either they’ve left the organization or they’re unable to carve out the time. Learning how the “who” documented their work and where they stored information won’t replace being able to talk to the person (or people), but it is a step forward in the event talking isn’t an option.

Prepare for the “interview” by:

  • Reading what you can find beforehand (notes, code, presentations, etc.), so the interview can be used to fill in gaps

  • Preparing an elevator pitch for the current problem to be solved (this gives them context as well)

  • Asking them the best way to work through the interview (a meeting, detailed emails, documentation that might be stored elsewhere or missed, a working session, etc.)

The goal of the interview(s) is to understand where any documentation exists, what gaps need to be filled before proceeding, and any undocumented context or roadblocks, so as to prevent rework.

Creativity

Depending on the level of documentation, creative problem solving will come in handy. If the person (or people) who created the system, algorithm, or code being reverse-engineered and researched is no longer available at the organization, the archaeological work becomes more difficult and will require detective ingenuity. A few techniques:

  • Searching file names / text / folders on shared drives and clouds, using either OS-level tools or data engineering tools that can read OS-level data (e.g., Alteryx, R, and other tools can iterate through file folders, read in file names, and output to a more search-friendly file for faster digging); see the sketch after this list

  • Copies of copies. Oftentimes developers and IT (rightfully so) don’t want people digging around in Production (or even Dev) systems. But if you ask them for a read-only version or copy of the code / data systems, they might be willing to spin one up for you to dig around in.

  • Find the originating application. If you can’t talk with the developers (and maybe even if you can), it’s worth finding the system / application where the data is entered by end-users and business units. Often the end-users can help fill in context on what the data is, where it comes from, and what it means to them. Employ the “Interviewing” skills above.
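For the file-search idea in the first bullet, here is a minimal Python sketch, assuming a hypothetical shared-drive path; it walks the folders and writes a search-friendly inventory file:

    # Walk a (hypothetical) shared drive and write every file's folder,
    # name, and last-modified date to a CSV you can search and filter.
    import csv
    import os
    from datetime import datetime

    ROOT = r"\\shared-drive\analytics"  # hypothetical search root

    with open("file_inventory.csv", "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["folder", "file_name", "modified"])
        for folder, _subdirs, files in os.walk(ROOT):
            for name in files:
                path = os.path.join(folder, name)
                modified = datetime.fromtimestamp(os.path.getmtime(path))
                writer.writerow([folder, name, modified.isoformat()])

Once the inventory exists, hunting for a half-remembered file name or a project keyword becomes a simple filter rather than a manual click-through.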

There will likely be other creative problem-solving techniques depending on the organization and the tools at your disposal, but in most cases the above will get you started.

Adaptability (& Persistence)

Things break. Don’t work as expected. Come from places unknown.

The key is to roll with it, and to do so in a curious fashion. Leave behind assumptions and try to stifle judgment. Just because something was designed to be used in a certain way doesn’t mean it actually is. Being able to have a curious, safe conversation with the end-users and understand the context and processes that explain how the data is actually used will lead to a much more fruitful research project.

Based on those conversations, be prepared to:

  • Adjust priorities (e.g., if the end-users indicate the data input is not generally valid or useful, stop researching it and move on)

  • Lengthen timelines (if you find something you weren’t expecting or need to talk to other people)

  • Identify roadblock(s) (if you were hopeful a data point would be useful, only to later find out it’s not what you expected)

The point is, until you start really understanding the surrounding context and historical politics of a system, its data, its design, and the origination / lineage of it all, there really is no way to tell what’s going to happen.

Reserve excitement; adjust anger or disappointment; adapt timelines.

Organization

While documenting what one knows is great, it’s oftentimes just as important to document what one doesn’t know.

As the archaeological dig progresses, consider documenting (in a consistent, easily available way):

  • What has been tested / tried, even (especially) if it didn’t work / produce results

  • Where searches were run, even (especially) if nothing was found

  • Who was spoken to / interviewed (name, title, department), even (especially) if they didn’t have any information / context

  • Gaps in knowledge / open issues or tasks, to be filled in as you go

  • Code and interdependencies, with notes on gaps in understanding

Use the organization’s standard for storing this information. If there is no standard, cloud-based spreadsheet options are great (Google Sheets, Microsoft 365 Excel). Atlassian also provides free cloud-based options for Confluence and Jira, if your organization allows for it (data security / IT would need to be consulted for something like this). “Readme.txt” files are a simple go-to for me as an index for all of my documentation: something like “Go here for more details”, where “here” is a more structured, easier-to-read document.
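As a small, made-up example of what one of those index files might look like (the project name, file names, and paths are all hypothetical):

    Readme.txt
    ----------
    Project: client contact data dig
    Full notes: see the "Dig Log" spreadsheet in the shared drive under
      Analytics/Documentation/client_contacts/
    Open questions / gaps: tracked in the "Gaps" tab of that log
    Interviews: who was spoken to, tracked in the "Interviews" tab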

Documenting all the research and context you find will be imperative to aiding the next, inevitable archaeological dig, which will occur in 3-5 years when this problem resurfaces with a slightly new spin to it. That’s the cycle of a data life.

Technical Skills & Hard Knowledge

Soft skills, as in any project, must be paired with the technical skills to move through all of the code, systems, and applications. The types of skills I’ve found myself regularly using:

  • Knowing where things are stored at an organization: cloud-based drives, shared network drives, other software programs. I can’t search what I don’t know exists!

  • Google. It’s inevitable I’ll run into some code in some software or application I don’t know. Being able to research various programming languages (and styles) helps with the reverse-engineering process.

  • ETL & data prep. Being able to pull the raw data out and make best-guesses as to relationships helps testing / research. I found I couldn’t always readily find documentation on relationships, but using conventional naming conventions, and trying relationships, I could easily join two tables, see the outputs of those joins, verify the inner join works as expected*, and then draw conclusions. This helps push through roadblocks and keep moving the “dig” further.

Persistence and organization will be important here as well. Documenting storage locations, noting what you can read from the code, including reference links from your research, and being willing to TRY things through (read-only) data prep and ETL tools will help move things along without relying on documentation that may not exist.

*If you look at the inner join of two tables and compare your assumptions / expectations to a known-credible front-end application view, you can work to verify your data connection (e.g., join “Client” with “Contacts” using some ID, go to the front-end application, search a few of the records in the “Client” screen, and check whether the related contacts appear there and match. If you know the front-end application is right, you can safely assume your data join is also right).
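Continuing the hypothetical sketch above, a self-contained way to run that spot-check is to sample a few joined pairs and verify them by hand in the front end:

    # Sample a few joined Client / Contact pairs to verify by hand against
    # the front end's "Client" screen (column names are hypothetical).
    import pandas as pd

    clients = pd.read_csv("clients.csv")
    contacts = pd.read_csv("contacts.csv")
    joined = clients.merge(contacts, left_on="id",
                           right_on="client_id", how="inner")

    sample = joined[["client_id", "client_name", "contact_name"]].sample(5)
    print(sample.to_string(index=False))
    # If each sampled contact appears under its client in the front end,
    # the guessed join is probably right.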

Conclusion

In real-life data projects and organizations, systems and data storage can get messy quickly. I’ve worked with all sorts of organizations in all sorts of different industries, and I’ve heard stories of, or directly experienced, technical debt, archaic systems needing to be reverse-engineered and brought into new processes / solutions, and messy, messy data. I’ve spent most of my data career digging through these types of problems: tracking back through someone else’s code, digging through complex data stores, and trying to generate some sort of knowledge base of what is going on in an organization.

I often hear expectations that data will be clean, organized, in one solution, and well-documented. Those goals are lovely in theory, but in my experience, executing on them is a [lofty] project in and of itself. Depending on the organization’s priorities, workforce, and resources, it may or may not happen to the level someone would like. Be prepared for messiness, and be willing to jump into it to help clear it all up.

