NLP course
This is the webpage for the Introduction to Natural Language Processing course, which closely follows the material of my new book “Getting Started with Natural Language Processing” (available on Manning and Amazon). It covers a wide range of topics in NLP and provides you with fundamental knowledge of NLP concepts as well as practical skills. By the end of this course, you will be able to build your own NLP application in an end-to-end manner.
The course was taught in Fall 2022 at the University of Bath.
Prerequisites: The course assumes programming knowledge of Python and some familiarity with Machine Learning algorithms; it does not require any prior knowledge of linguistics or Natural Language Processing.
Contents:
- Overview
- Learning outcomes
- Reading list
- Material: Week 1, Week 2, Week 3, Week 4, Week 5, Week 6, Week 7, Week 8, Week 9
Overview
This is a semester-long course with one two-hour lecture per week. In addition to the lectures, students are provided with detailed handouts and practical programming exercises.
Each week addresses a different NLP application and discusses it in detail, introducing relevant NLP concepts and techniques.
Applications and topics covered include, among others:
- Information retrieval
- Information extraction
- Text classification
- Topic modelling
- Word embeddings
- Semantic models
Learning outcomes
- Demonstrate knowledge of the fundamental principles of natural language processing.
- Demonstrate understanding of key algorithms for natural language processing.
- Write programs that process language.
- Design your own end-to-end projects in NLP.
- Evaluate the performance of programs that process language.
- Assess the feasibility and appropriateness of novel NLP approaches presented in the literature.
Reading list
- Ekaterina Kochmar (2022). Getting Started with Natural Language Processing. Manning Publications, ISBN: 9781617296765. URL: Manning; Amazon; for the students at the University of Bath, the book is available via the university’s library service.
- Dan Jurafsky and James H. Martin (2009). Speech and Language Processing (2nd edition). Prentice-Hall Inc., Upper Saddle River, NJ, USA. ISBN: 0131873210. URL: 2nd edition; 3rd edition
- Steven Bird, Ewan Klein, and Edward Loper (2009). Natural Language Processing with Python. O’Reilly Media, Inc. ISBN: 978-0-596-51649-9. URL: NLTK book
Week 1: Introduction to Natural Language Processing
This week’s material will introduce you to the field of Natural Language Processing: first by surveying its history and the way the field and its algorithms have developed over the decades, then by presenting and discussing the most popular NLP applications and the techniques used to tackle NLP tasks, and finally by linking NLP techniques and concepts to other fields and approaches.
In addition, in the course of this week you will take your first practical steps in implementing an NLP algorithm. You will learn how to structure an NLP project from beginning to end, and you will focus on the first crucial step in an NLP application – tokenization. You will learn why this step is challenging and how a tokenizer can be implemented in practice. Finally, we will conclude with remarks on language use and word distribution, and observations on the implications these have for NLP algorithms. You will also run some frequency analysis yourself – such analysis is often an important step in preliminary data investigation, and it may inform and help you shape your approach to the specific NLP task you are working on.
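To give you a taste of what is ahead, here is a minimal sketch of a naive tokenizer and frequency count in plain Python; the regular expression and example text are illustrative only, not the implementation developed in the book:

```python
import re
from collections import Counter

def tokenize(text):
    # A naive tokenizer: lowercase the text, then split on any run of
    # characters that is not a letter, digit or apostrophe.
    return [tok for tok in re.split(r"[^a-z0-9']+", text.lower()) if tok]

text = ("A tokenizer splits text into tokens. It's harder than it looks: "
        "abbreviations like U.S.A., hyphens and clitics such as \"it's\" "
        "all cause trouble for naive rules.")
tokens = tokenize(text)

# Frequency analysis: count how often each token occurs.
for word, count in Counter(tokens).most_common(5):
    print(f"{word}\t{count}")
```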
Week 2: Introduction to Information Retrieval
This week, we will “zoom in” on one of the popular and widely used NLP applications – Information Search or Information Retrieval (IR). We will look closely into each step involved in this application, and by the end of this week you will be able to implement an information search algorithm yourself.
Information search is not only a popular application (you may recall that it helps you find relevant information on the Internet as well as in a collection of documents on your computer), but also a suitable one to work on in Week 2. Besides learning a few practical aspects of information search algorithms, you will also learn about such fundamental NLP concepts and techniques as vector-based representations, lemmatization and stemming, and term and document weighting. These concepts and techniques are used across multiple tasks in NLP, and you will be using them again and again in the weeks to come.
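As a preview, here is a minimal vector-based retrieval sketch, assuming scikit-learn and a toy document collection (both are illustrative choices, not necessarily those used in the exercises): tf-idf weighting turns each document and the query into a vector, and cosine similarity ranks the documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "the cat sat on the mat",
    "dogs and cats are popular pets",
    "information retrieval finds relevant documents in a collection",
]
query = ["retrieval of relevant information"]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)  # one tf-idf vector per document
query_vector = vectorizer.transform(query)         # weight the query in the same space

# Rank documents by cosine similarity to the query.
scores = cosine_similarity(query_vector, doc_vectors).ravel()
for idx in scores.argsort()[::-1]:
    print(f"{scores[idx]:.3f}  {documents[idx]}")
```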
- Handout for Week 2
- Slides on Introduction to IR
- Programming exercises:
- Sample solutions:
- For TermWeighting.ipynb
- For EndToEnd.ipynb
Week 3: Part-of-Speech Tagging
This week and the next will follow up with another popular and widely used NLP application – Information Extraction.
In addition to searching for a set of documents that answer your information need, which is performed by Information Retrieval algorithms, you may be interested in getting a precise answer to a specific question. For example, if you Google for “artificial intelligence”, the search engine will come back with a long list of pages discussing various aspects of artificial intelligence, from the definition and an overview of the field, to specific techniques and applications. However, if you are interested in the definition only, you would ask “What is Artificial Intelligence?” and expect to get a specific answer giving such a definition. Information Extraction (IE) is the NLP task that addresses such challenges.
As with IR, there is a reason why we are talking about IE relatively early in a course on NLP: while working on an IE algorithm, you will also learn about fundamental NLP concepts and techniques, starting this week with part-of-speech tagging. This week, we will focus on how this NLP task is solved: specifically, we will discuss the sequence modelling approaches used in NLP, look into the theory behind such models, and learn how part-of-speech tagging is performed using a sequence model.
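For a first practical taste, here is how an off-the-shelf pre-trained tagger can be applied with NLTK (covered by the reading list); the exact resource names may vary with the NLTK version, and the theory behind such taggers is the subject of this week's material:

```python
import nltk

nltk.download("punkt", quiet=True)                       # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)  # tagger model

sentence = "Artificial intelligence is transforming how we process language."
tokens = nltk.word_tokenize(sentence)

# Each token is assigned a part-of-speech tag, e.g. JJ = adjective,
# NN = singular noun, VBZ = 3rd-person singular present verb.
print(nltk.pos_tag(tokens))
```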
Week 4: Syntactic Analysis
Last week we started looking into another popular and widely used NLP application – Information Extraction (IE). A reminder: while Information Retrieval algorithms help you find a set of documents that generally answer your information need, IE algorithms are used to identify precise answers to specific questions.
Last week, we focused on part-of-speech tagging – the task of identifying which category (part of speech) a word belongs to. This week we will continue looking into the challenges that have to be solved in order to implement an IE application. Specifically, we will focus on how to detect the grammatical relations that link words of different parts of speech to each other, and how to identify the roles that different groups of words play in a sentence. We will look into chunking and parsing. The final section will then bring together the concepts and techniques studied over these two weeks and show you how to implement an IE application in practice.
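To illustrate chunking, here is a minimal noun-phrase chunker built with NLTK's RegexpParser; the grammar below is a deliberately simplified assumption for illustration, not the one developed in this week's material:

```python
import nltk

# A sentence that has already been part-of-speech tagged (see Week 3).
sentence = [("the", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
            ("jumped", "VBD"), ("over", "IN"),
            ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]

# An NP chunk is an optional determiner, any number of adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"
parser = nltk.RegexpParser(grammar)
tree = parser.parse(sentence)
print(tree)
# (S (NP the/DT quick/JJ brown/JJ fox/NN) jumped/VBD over/IN (NP the/DT lazy/JJ dog/NN))
```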
Week 5: Text Classification Approaches
This week you will build upon the knowledge acquired over the previous weeks and start working on applications at the intersection of machine learning (ML) and NLP. You may recall from Week 1 that ML is widely used in many NLP tasks. This week, you will start with one of the most popular frameworks – supervised machine learning – and specifically with text classification tasks. Classification is an activity in which we humans engage on a regular basis: it is concerned with identifying groups of objects or phenomena on the basis of their traits, similarities, features or other criteria. Often the categories are clearly defined and the features are easy to determine; however, in some cases classification may be a challenging task even for humans. A computer can perform classification too, but within a supervised learning framework it needs to be “told” what the classes are and which features may distinguish between these classes.
This week, you will focus on two popular text classification tasks – sentiment analysis, concerned with classifying texts into those expressing positive and negative sentiment, and topic classification, concerned with classifying texts based on their topic.
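As a minimal sketch of the supervised setup, the following trains a Naive Bayes sentiment classifier on bag-of-words features, assuming scikit-learn and a tiny toy dataset (both are illustrative choices; the exercises work with realistic corpora):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["I loved this film", "a wonderful, moving story",
               "utterly boring and slow", "I hated every minute"]
train_labels = ["pos", "pos", "neg", "neg"]

# Bag-of-words counts as features, Naive Bayes as the classifier:
# the model is "told" the classes and learns word-class associations.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["what a wonderful film", "utterly boring"]))
# -> ['pos' 'neg']
```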
- Handout for Week 5
- Slides on Text classification approaches
- Programming exercises:
- Sample solutions:
Week 6: Unsupervised approaches in NLP
This week you will continue learning about the application of machine learning approaches to NLP. One of the key aspects that allow you to frame a task as a supervised machine learning task is the availability of clearly defined classes and, importantly, of data annotated with such classes, for example by domain experts. A machine learning algorithm can then be trained on such labelled data and learn to associate the features with the classes. Although the amount of labelled training data is consistently growing, enabling researchers and practitioners to develop further ML and NLP applications, data annotation is a challenging, time-consuming and often expensive task. An alternative to this framework is unsupervised machine learning. Unsupervised approaches are useful not only in cases where labelled data is unavailable or hard to collect, but also for tasks where the classes are not known in advance or can change over time.
This week, you will learn about the applications of unsupervised machine learning in NLP and, continuing with the theme of topic analysis, you will apply two unsupervised methods in practice – k-means clustering for topic analysis and Latent Dirichlet allocation (LDA) for topic modelling.
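Here is a compact sketch of both techniques, assuming scikit-learn and a four-document toy corpus (illustrative choices; the exercises may use other libraries, such as gensim):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

docs = ["the match ended in a draw", "the striker scored two goals",
        "the election results are in", "voters went to the polls today"]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

# k-means: partition documents into k clusters by feature similarity.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(counts)
print("cluster assignments:", clusters)

# LDA: model each document as a mixture of k latent topics,
# then inspect the most heavily weighted words per topic.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    print(f"topic {k}:", ", ".join(terms[topic.argsort()[-3:]]))
```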
- Handout for Week 6
- Slides on Unsupervised approaches in NLP
- Programming exercises:
- Sample solutions:
Week 7: Semantics and meaning representation
All the approaches discussed so far have essentially used words as symbols devoid of any particular meaning. While it is true that the algorithms you have been looking into did not need to know what a word means in order to use it as an informative feature in a particular task (e.g., a spam filter does not actually need to understand what the word lottery means to associate it with the spam class), this is a simplistic view of language, and word meaning plays a central role in more challenging natural language understanding and reasoning tasks.
The subfield of linguistics and NLP that studies meaning in language is called semantics, and this course would not be complete without a discussion of the methods of semantic analysis and meaning representation.
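As a preview of what distributional semantics looks like in practice, the sketch below loads pre-trained GloVe embeddings via gensim's downloader (an illustrative choice of model and library): words used in similar contexts end up close together in vector space.

```python
import gensim.downloader as api

# Downloads ~65 MB of pre-trained 50-dimensional GloVe vectors on first use.
vectors = api.load("glove-wiki-gigaword-50")

# Nearest neighbours in the embedding space are semantically related words.
print(vectors.most_similar("frog", topn=3))

# The classic analogy: king - man + woman is close to queen.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```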
- Handout for Week 7
- Slides on Semantics and meaning representation
- Programming exercises:
- Sample solutions:
Week 8: Sequence modelling and labelling
Most of the tasks that you’ve addressed so far have treated text as a collection of individual words or groups of words. Such an approach is called bag-of-words (or bag-of-n-grams) because it does not take into account the order in which words and groups of words follow each other. In Week 3, we observed that text is not a mere collection of disconnected words: behind the way words are put together in sentences (and sentences into larger units) lies a well-defined structure determined by the laws of language, and that is where we first discussed sequence models. You have since explored the syntactic structures and grammatical relations that link words together; however, you have not yet explored to the full extent the structure that governs word composition in language.
This week’s topic is sequence modelling and sequence labelling in NLP. You have already encountered one NLP task that relies on sequential information (part-of-speech tagging in Week 3), and this week you will explore further sequence modelling and labelling approaches in more detail, as well as their application to such tasks as named entity recognition and language modelling.
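As a preview, here is a from-scratch sketch of a bigram language model – estimating P(word | previous word) from counts and using it to score sentences; this is illustrative only and not necessarily the approach taken in SimpleLM.ipynb:

```python
from collections import Counter

corpus = "the cat sat on the mat . the dog sat on the log .".split()

bigrams = Counter(zip(corpus, corpus[1:]))  # counts of adjacent word pairs
unigrams = Counter(corpus[:-1])             # counts of first words in pairs

def prob(prev, word):
    # Maximum-likelihood estimate of P(word | prev); real models add smoothing.
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

def sentence_prob(words):
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= prob(prev, word)
    return p

print(sentence_prob("the cat sat on the log .".split()))  # > 0: all bigrams seen
print(sentence_prob("the mat sat on the cat .".split()))  # 0: "mat sat" unseen
```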
- Handout for Week 8
- Slides on Sequence modelling and labelling
- Programming exercises:
- Sample solutions:
- For SimpleLM.ipynb
Week 9: Current trends and challenges in NLP
This week concludes the course on Natural Language Processing. The previous weeks introduced the fundamental concepts and techniques in NLP and provided you with an in-depth analysis of the main tasks and applications. Like any other subfield of Artificial Intelligence, NLP is a fast-developing field that has attracted an increasing level of attention in recent years. Firstly, language is our primary means of communication: the ability to use language is one of the core intellectual abilities in humans, which makes NLP one of the key areas for AI to address. Secondly, language is a highly structured system, which lends itself to the application of formalisms and machine learning models and makes it feasible for computers to process, understand and generate natural language with relative success. Thirdly, despite being a structured system, language is highly creative and full of exceptions, which makes this field interesting and challenging to work in: despite the impressive progress on many tasks achieved by NLP researchers in recent years, we are still far from having systems that actually understand natural language. These factors combined make the field popular and actively researched.
With the amount of research going on in NLP, it would be impossible to cover all current approaches in a single course. Therefore, this week describes the current challenges in the field and points in the direction of current trends.
- Handout for Week 9
- Exercises:
- Sample solutions:
- For the exercise sheet