Last week I attended the UCREL Summer School in corpus-based natural language processing (NLP). The summer school is taught by leading experts in the field from both Lancaster University and other institutions.
Here are a couple of my thoughts and takeaways from the week.
Cambridge Analytica has provided a perfect example of how not to use data ethically. It serves as an important reminder to always think about how you want to use the data before starting any analysis, and to keep your research questions in mind throughout.
Another topic close to my heart ❤️.
- Document EVERYTHING! From how you scraped and cleaned the data, to creating that pretty plot.
- Release all code and data with any papers you write. I mean, if it isn’t on GitHub, is it even research?
- Before releasing a corpus publicly, think carefully about possible ethical and legal issues.
Be an interdisciplinary hero
Get involved! There are so many application areas that make for interesting NLP projects. During this week alone, we had talks from the biosciences, accounting and finance, geography and the publishing industry, to name a few. However, it’s important to work closely with the domain experts and utilise their understanding of the area.