Reading Between The Lines

An NLP Exploratory around Suicide and Mental Health

Knowing I had to use the computer to “read” text, a spark was lit and I was haphazardly out of the gates, inspired. I planned to use the written works of authors who have committed suicide to see whether any glaring insights could be gleaned or decoded from them. Is that a good idea? It sounded helpful in a world plagued by too-soons. I didn’t even know how to “do NLP”, but I know inspiration and speed are good vectors in Data Science Projects With Quick Timelines (DSPWQT).

Suicide is the 10th leading cause of death, resulting in more than fifty thousand deaths per year. With 1.5M attempts per year at a cost of more than $70B, it impacts families, children, and generations to come. This was not going to be a topic to dive into lightly. Underpinning suicide is mental illness, with a number of disorders (Generalized Anxiety Disorder, Panic Disorder, Bipolar Disorder, etc.) affecting more than 40M Americans annually, or as I told my room of 12:

This means 2–3 of us suffer from some form of mental anguish clinically. At the moment, it’s Python that sparks mine.

Back to my project. As has become routine on my march towards Data Scientist, I continued on with my question: “Are there patterns in the books from authors who’ve committed suicide?” I marched forward, aggregating a set of books from a few well-known authors who have committed suicide, as well as books from authors with “healthy” minds, to see if the topic models (more on that in a minute) could find patterns in the text from a mixed set of writers:

My set of authors for the NLP project. I feel horrible for the Hemingway family.

Text processing, EDA, and Topic Modeling are not only the keys to doing NLP well, they are also laboriously never-ending. As I cleaned my text using various tools (tokenizing, lemmatizing, stemming, and so on to prepare it for modeling), I’d find ghost characters, Unicode artifacts, and other text-vermin determined to muddy my results. With each run, I’d go back to the start of the process, narrowing my targets.

One goal in NLP is to take “clean” text and vectorize it, or more properly “Count Vectorize” it: build giant arrays of word counts. (Imagine an Excel sheet where each row is a book and each column is an individual word, with the columns going on to the right forever.) With seven million words, you can imagine how, dimensionally speaking, this becomes incredibly unwieldy. Data Scientists, Linguistic Researchers, and Engineers have therefore worked out various approaches to “Dimensionality Reduction” that cut the number of words down to the most impactful and meaningful.

To pause for a moment and get more technical: I used Gensim, TF-IDF, spaCy, NLTK, MongoDB for unstructured data, and so on, but I’m moving quickly past the tech to see if I can keep you engaged until the end. All this to say, I cleaned my text over and over until I could begin using NLP models for Topic Modeling.
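For readers who want to see the "Excel sheet" of word counts concretely, here is a minimal sketch in plain Python. It is not the code from this project (which leaned on libraries such as NLTK and spaCy); the function names `tokenize` and `count_vectorize` and the two sample sentences are made up for illustration, and the tokenizer is far cruder than real lemmatizing or stemming.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def count_vectorize(documents):
    """Return (vocabulary, matrix): one row per document, one column per word."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # The vocabulary is every distinct word seen in any document, sorted
    # so the column order is reproducible.
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for words a document never used.
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix


# Hypothetical stand-ins for full-length books.
books = [
    "The sea was calm and the old man was tired.",
    "The boy waited by the sea.",
]
vocabulary, matrix = count_vectorize(books)

for word, column in zip(vocabulary, zip(*matrix)):
    print(f"{word:>8}: {column}")
```

Even with two short sentences the table is mostly zeros, which hints at why a corpus of millions of words calls for dimensionality reduction.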

Topic Modeling lets us see, via some very impressive math, the word-categories the computer is finding. These are not necessarily categories that you and I would think of, like “cars” or “types of cats”, but rather groupings of words, or word-topics, that each algorithm has found tend to occur “near” each other. I used LSA (Latent Semantic Analysis), NMF (Non-negative Matrix Factorization), and LDA (Latent Dirichlet Allocation) to evaluate my corpus, then (through my own subjective lens) noted the more “human” categories listed below. You’ll note I color-coded where it felt like the models came up with similar themes. As our instructors noted, though, the art of Topic Modeling is highly subjective, and I’ll readily admit to grasping for understandable topics on my first pass at a project of this type.
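As an aside, to give a feel for what one of these models does under the hood, here is a rough sketch of NMF on a toy word-count table. It is not the code or data from this project: the vocabulary, the counts, and the helper names are invented, and it uses the textbook Lee and Seung multiplicative updates in plain Python rather than a library implementation. The idea is to factor the documents-by-words matrix into a documents-by-topics matrix and a topics-by-words matrix, then read each topic off by its heaviest words.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(v, n_topics, n_iter=500, seed=0, eps=1e-9):
    """Approximate v (docs x words) as w (docs x topics) times h (topics x words).

    Uses multiplicative updates, which keep every entry non-negative.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + 0.1 for _ in range(n_words)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # h <- h * (w^T v) / (w^T w h), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_words)]
             for k in range(n_topics)]
        # w <- w * (v h^T) / (w h h^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_docs)]
    return w, h


def top_words(h, vocabulary, n=3):
    """For each topic (row of h), list the n words with the largest weights."""
    return [[vocabulary[j] for j in sorted(range(len(row)), key=lambda j: -row[j])[:n]]
            for row in h]


# Invented counts: two "sea" documents and two "wizard" documents.
vocabulary = ["sea", "boat", "fish", "wizard", "wand", "spell"]
counts = [
    [4, 3, 2, 0, 0, 0],
    [3, 4, 3, 0, 0, 0],
    [0, 0, 0, 5, 2, 3],
    [0, 0, 1, 4, 3, 4],
]
w, h = nmf(counts, n_topics=2)
topics = top_words(h, vocabulary)
for number, words in enumerate(topics):
    print(f"topic {number}: {', '.join(words)}")
```

The model only hands back lists of words; deciding that one list means "the sea" and another means "magic" is the subjective, human part.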

So, you’re probably wondering as I was:

“Ok get to your point, what does all this mean?”

For starters, it means I learned something. As you get older in your career, savor the moments when you learn. It’s a gift. We’re moving at a furious clip in our bootcamp, and just for me to gather, process, model, and analyze 7M words is an incredible feat I’ll savor as I begin to prepare myself for sitting behind a desk again with a full inbox.

Second, after much discussion, my topic models really tell me nothing about suicide or mental health. Bummer. The learning here is that ultimately (my absolutist word) I chose a set of authors that might write about these topics. That’s pretty anticlimactic, but it also lets me appreciate just how tedious and hard it is to set up a proper experiment, then process and feature-engineer all the text. To really evaluate the works of authors who have committed suicide requires quite a bit of additional rigor. Our instructors were hoping we’d get our feet muddy in the NLP swamp, so I can check the box there. My whole NLP-ensemble needs a thorough washing.

Sentiment Analysis Example Scoring from the AFINN Dictionary

As it turns out (below), none of them wrote more negatively, except for Hunter S. Thompson (who, according to the chart, was one bitter-writing SOB…). As expected, the two more kid-friendly authors (Judy Blume and J.K. Rowling) are the most positive.

Sentiment Analysis using AFINN
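For anyone curious how dictionary-based scoring of this kind works, here is a minimal sketch. AFINN assigns each listed word an integer valence from -5 (very negative) to +5 (very positive), and a text is scored by adding up the values of the words it contains. The tiny lexicon below is a made-up stand-in in that style, not the real AFINN list, and `mean_sentiment` (dividing by token count so long and short texts are comparable) is an assumption of this sketch, not necessarily how the chart above was produced.

```python
import re

# Hypothetical mini-lexicon in the AFINN style (integer valence, -5 to +5).
# These entries are illustrative only, not copied from the real word list.
TOY_LEXICON = {
    "love": 3, "happy": 3, "good": 3,
    "bad": -3, "hate": -3, "terrible": -3,
}


def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_score(text, lexicon=TOY_LEXICON):
    """Sum the valence of every token; words missing from the lexicon count as 0."""
    return sum(lexicon.get(token, 0) for token in tokenize(text))


def mean_sentiment(text, lexicon=TOY_LEXICON):
    """Average valence per token, so a long book is not penalized for its length."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(lexicon.get(token, 0) for token in tokens) / len(tokens)


for sample in ["I love this good book", "A terrible, bad day"]:
    print(sample, "->", sentiment_score(sample), round(mean_sentiment(sample), 2))
```

A word-by-word sum like this ignores negation and sarcasm ("not good" still scores as positive), which is one reason to read a chart of per-author scores with some caution.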

Wrapping this one up, I can tell you I’m intrigued by NLP, and specifically by this project. The idea that you can evaluate Twitter and Facebook posts, blogs, and even full-length books, as I did, to better help people battle depression, other disorders, and even suicide is a great one. I had never heard of it when I began, so I’m happy to know my idea was a good one, even if, like most good ideas, it was already in practice many times over and is being put to good use.

I’m glad too that my ability to write code has started to get better, so I’m hoping to further refine my program to evaluate more text. I’d also like to evaluate these works against those First Person and Absolutist Word Lists I mentioned earlier. No rest here at Metis: we’re on to our final projects and then back to work. If Data Science is the new frontier, NLP is the Wild Wild West.
