Starting a career in data analysis
NB(mmoran): I wrote this a few years back for a bartender who was thinking about changing careers. I haven't done any editing to this aside from formatting.
For almost two decades now, I have worked in the field of data analysis. Things are much different now than they were when I started in terms of tooling, packages, tutorials, and more. What has not changed are the underlying principles of trying to answer questions with data. This document will discuss those underlying principles, as well as some thoughts on learning and development that I’ve accrued over time.
All of this is my opinion, and some people will disagree with what I write below. You may even become one of those people in the near future. I am one of them now: I am further along in my career, and these ideas don’t always apply to me anymore. Where they should help is in the first few months or years of your development, as you create and enhance your own opinions.
Learn one language and learn it well
The primary tool in our toolbox is our base programming language. For data analysis, the choice is almost always R or Python. No matter which one you pick, you need to get comfortable with the language itself, even before you start processing data. This includes writing functions, learning how to write simple tests, understanding the different data structures and how and when to use them, and the general ideas around organizing your code.
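For a taste of what “getting comfortable with the language itself” looks like in Python (if R is your pick, the same ideas apply), here is a small function plus a couple of assertion-style tests. The function is just an illustration, not anything you’d need to memorize:

```python
def mean(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    if not values:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)

# Simple tests: assertions like these are the seed of real test suites.
assert mean([1, 2, 3]) == 2.0
assert mean([10]) == 10.0
```

Writing tiny checks like these alongside your functions is one of the habits that separates “I copied some code” from “I understand this language.”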
Both languages have their benefits and drawbacks, but very early on those differences matter less. Not really knowing how to write R code (I can read it decently well) may have held me back earlier in my career; on the other hand, knowing Python well from an engineering perspective helped me in those same years. It somewhat comes down to what type of data person you want to be. Some companies will be “Python shops” or “R shops” based on the primary language in use by the data team at that company.
Both languages have their own communities. In general, I think the R community is a little nicer and easier to approach given that it is a more cohesive ecosystem. Python has two distinct communities: web development and data analysis. The R and Python data communities are overlapping more and more over time, with tooling from each side being used by the other, so even this distinction is becoming less important.
In any case, do not try to learn both languages at the same time. After learning one of the languages well, decide if you need the other language to solve a particular problem and learn about it.
I have used and learned from resources from both communities, as well as communities from other languages. That does not mean that I know those other languages as well as Python, just that I know a bare minimum amount to understand what is happening.
...and learn SQL at the same time
SQL is the most common language for data analysis, and it thankfully is very easy to pick up. My favorite resource is https://sqlbolt.com/, which has interactive lessons that build on each other. I used this back in 2017 when I needed to learn SQL, and it’s a great way to get your basic footing. Knowing SQL will help you learn and use other data analysis packages, and it will likely be the most used language on whatever data team you end up joining.
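One nice thing if you go the Python route: the standard library ships with SQLite, so you can practice SQL without installing a database at all. A minimal sketch with made-up sales data:

```python
import sqlite3

# A throwaway in-memory database, perfect for practicing queries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("Mon", 100.0), ("Mon", 150.0), ("Tue", 80.0)],
)

# Grouping and aggregating: the bread and butter of analysis SQL.
rows = conn.execute(
    "SELECT day, SUM(amount) FROM sales GROUP BY day ORDER BY day"
).fetchall()
print(rows)  # [('Mon', 250.0), ('Tue', 80.0)]
```

`GROUP BY` with an aggregate like `SUM` or `COUNT` will cover a surprising share of the questions you get asked on a data team.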
Learn through projects
All too often, “learning” a new language means spending too much time on the individual concepts of the language and not enough on how they fit together. Yes, you need to know the basics in order to write code, but you can pick those up in a weekend. From there, you will develop your knowledge and abilities much faster by focusing on projects.
At the beginning, this may be as simple as “calculate the score of a bowling game” or “find all of the anagrams of a word.” There are many sites online that have these practice questions.
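As an example, the anagram exercise has a classic insight at its core: two words are anagrams if they contain the same letters once sorted. One possible Python solution:

```python
def find_anagrams(word, candidates):
    """Return the candidates that are anagrams of word (ignoring case)."""
    signature = sorted(word.lower())
    return [c for c in candidates
            if c != word and sorted(c.lower()) == signature]

print(find_anagrams("listen", ["enlist", "google", "silent", "listen"]))
# ['enlist', 'silent']
```

Small problems like this teach you the language’s data structures (here, lists and sorting) in a way that reading about them never will.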
Once you have a decent handle on the language, the project-based learning doesn’t stop. You’ll soon want to answer questions like “what day is best for sales?” or “can I find differences between these two groups?” which can be solved with the base language, but are usually easier to solve with some extra package. In R, this is the tidyverse. In Python, this is the scientific stack (numpy, scipy, pandas). In many ways, these packages are like their own language, but they build on and work with concepts in the base language.
When you’re answering these more data-centric questions, try to answer it in two ways: once with the base language, and once using these “helper” packages. Then start trying to answer different questions with the same data.
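Here is what “answer it two ways” might look like for the sales question, assuming pandas is installed (the numbers are invented):

```python
import pandas as pd

sales = [("Mon", 100), ("Tue", 80), ("Mon", 150), ("Wed", 120)]

# 1. Base language: accumulate totals in a dict, then take the max.
totals = {}
for day, amount in sales:
    totals[day] = totals.get(day, 0) + amount
best_day = max(totals, key=totals.get)

# 2. The same question with the "helper" package.
df = pd.DataFrame(sales, columns=["day", "amount"])
best_day_pd = df.groupby("day")["amount"].sum().idxmax()

assert best_day == best_day_pd == "Mon"
```

Doing both versions shows you exactly what the package is saving you from writing, which makes its abstractions much less magical.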
...including those that you make yourself
It’s very easy to find a problem or data set online that comes along with a question to answer. Kaggle is a great example of this. I personally do not think Kaggle is useful beyond being a repository of data sets, and even that is questionable. Instead, try to think of questions yourself, either using the data you have available or data you don’t even have yet! There are a ton of open data sources online, many of which are free and open to anyone. Cities are a great source of this data, including NYC: https://opendata.cityofnewyork.us/data/.
Think of a question that you might want to answer that’s complex but not too complex. Something like “which borough is the most prone to fires?” Answer that basic question, then go deeper. What about fires per capita, or land area, or number of buildings? Are there trends or changes from year to year? What if you exclude or include the airports in Queens? What if you break it down by neighborhood instead of borough? Year of construction of the building? Type of building? Proximity to a fire station?
Starting from a simple question and moving to additional questions gives you practice in two distinct skills:
- Understanding the data
- Asking questions of the data
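For a sketch of what that first pass might look like with pandas (the numbers below are made up; the real data lives on the NYC Open Data portal):

```python
import pandas as pd

# Hypothetical counts -- pull the real figures from NYC Open Data.
fires = pd.DataFrame({
    "borough": ["Brooklyn", "Queens", "Manhattan"],
    "fire_count": [300, 250, 200],
    "population": [2_600_000, 2_300_000, 1_600_000],
})

# The simple question: which borough has the most fires?
most_fires = fires.loc[fires["fire_count"].idxmax(), "borough"]

# Going deeper: which has the most fires per capita?
fires["per_100k"] = fires["fire_count"] / fires["population"] * 100_000
most_per_capita = fires.loc[fires["per_100k"].idxmax(), "borough"]

print(most_fires, most_per_capita)
```

Note how the “deeper” question can flip the answer: the borough with the most raw fires is not necessarily the one with the most fires per person.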
Understand the data and its generation
One of the main purposes of data analysis is to answer the question “how was this data generated?” You can frame almost all problems like this, including the example above. Instead of asking “which borough has the most fires?” you are asking “was the number of fires we saw generated by properties of the borough?” That is, was the data generated by things that are present within the data itself?
As you ask questions and try to find answers, think about the data generation process. That process is usually what you are really asking about, as in “was the amount of sales based on or generated from the day of the week?” This sounds more verbose than “do days have different amounts of sales?” but it expresses the focus on data generation a little more.
This suggestion is more about the way to think about problems, but it also helps you understand questions from a business perspective. Wondering about the amount of sales per day is one thing, but you’re probably being asked that question so that future sales can be forecasted. A forecast is essentially asking: if the data generation process in the future is the same as in the past, what will the future data look like? If we get a similar amount of sales on future Mondays, can we predict or forecast how many sales we’ll have next Monday, or the Monday two months from now?
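That framing leads directly to the simplest possible forecast: assume the process is stable and predict the historical average. A tiny sketch with made-up numbers:

```python
# Sales from past Mondays. If the generating process is stable,
# the historical mean is a reasonable naive forecast for the next one.
monday_sales = [120, 135, 110, 140, 125]
forecast_next_monday = sum(monday_sales) / len(monday_sales)
print(forecast_next_monday)  # 126.0
```

Fancier forecasting methods are mostly refinements of this idea: more careful statements about how stable the process is and in what ways it drifts.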
...as well as the terminology of the community
I’ve tried to keep jargon out of this document, but some has likely snuck in. The words used within the data analysis community can be intimidating at first, but they exist to provide a consistent “shorthand” for discussion. A lot of the time, that makes talking about data analysis sound like talking about math equations, and you wouldn’t be wrong. Think about the fire questions from before: what were the average fires per parameter in NYC? That parameter might be borough, or population and borough, or neighborhood.
More complex questions don’t always need more complex answers or solutions. Almost every question can be answered by counting or averaging, faceted or split up by some factors or parameters you are interested in. Total fires by borough? Counting. Fires per capita? Average fires per person in each borough.
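That “almost everything is counting” point is fairly literal. Even without any extra packages, the base language will take you far (the incident list here is invented):

```python
from collections import Counter

# One entry per fire incident: counting is just tallying rows.
incidents = ["Brooklyn", "Queens", "Brooklyn", "Manhattan", "Brooklyn"]
fires_by_borough = Counter(incidents)
print(fires_by_borough.most_common(1))  # [('Brooklyn', 3)]
```

When someone says “group by” or “facet,” this tallying is usually all that’s happening underneath.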
Continuously learn
One main underlying aspect of this entire document is learning something. Taking the time to learn is critical, but possibly even more so is exposure to learning. If you watch a YouTube video from a tech conference, you probably won’t understand everything, but you’ll at least be exposed to things you don’t know. From there, you can get a basic understanding of that thing so that the next time it comes up, you’ll understand what is being talked about better.
You might get this from books or online, or both! There are a ton of resources available, but I’m going to list a few that I’ve found have been helpful:
- The PyData YouTube channel: recordings of conference talks, covering mostly Python but also some R.
- The NormConf YouTube channel: recordings from a single conference that was organized by a few of my coworkers.
- Allen Downey’s “Think Python” and “Think Stats”: books (available for free online) that treat learning these topics similarly to what I’ve written above.
- Cassie Kozyrkov’s blogs, LinkedIn feed, and courses are great general data discussions covering the range of what a data professional’s career might touch.
...without getting too distracted
There is a fear within data analysis that if you aren’t learning about or using the fanciest new thing, you aren’t a “real” data professional. This is patently false, and NormConf is a great example of trying to make this fear a known yet irrational one. My early suggestion to learn something well stands here. If you try to collect a whole bunch of tools too early, you won’t understand any of them that well. As hard as it is, try to learn something 80% of the way before moving on to the next new thing, then learn that 80% of the way too.
The above is true even going back to Python or R. There are a ton of things within these languages, and whole aspects of using them that I don’t even know or use. Don’t feel like you need to know everything about something before moving on, but also make sure that you learn enough about it before moving on.
This pattern exists at every scale, from an entire language down to a particular technique. Learn something to the point that you would feel pretty confident using it again, then move on. You’ll need to revisit that tool or technique every so often to maintain that knowledge. Over time, your foundation becomes stronger, so it becomes easier to learn more new things.
Closing
Data careers are exciting, but built on a few simple foundations: SQL and Python or R, data structures, and presenting results to provide value. With these foundations in place, it becomes easier to expand your boundary with new skills and techniques.
If you have a solid foundation in R, it’s easier to add a specific R package that solves a problem. If you have a solid foundation in figuring out the underlying reason for a question, the “why does this question matter?”, it becomes easier to know how best to present your results: should this be a report, a graph, a table, a sentence? If you know how to work with lists, it’s easier to then work with dataframes.
There are a lot of people going through the exact same thing as you, so there are a ton of resources beyond what I’ve provided, some of which will be more or less helpful. And those people, both the ones going through the process now and the ones who already have, are there to help. That’s why people put out blog posts, or YouTube videos, or give talks at conferences: to share their learnings with others.
I’m also always available as a resource. I might not always have the answer, but I can hopefully point you in a direction to get one. I also have old projects, questions, and ideas that I can share (easier via real links), so if this document is a little too high level and you need an exact list of skills, ask away!
Hopefully this helps! And again, ask any questions you have!