Here is the most time-efficient and collaboration-friendly way I have found to improve upon spaCy’s existing NER model. If you’re working on a digital humanities (or any) project with someone who isn’t particularly tech-savvy, this workflow will help.

Before any of you embark on the model training adventure, I should address the first question on your mind: How difficult is this going to be? The first thing you should know is that training a model is time-consuming and tedious. The time you put in is roughly proportional to the accuracy you’ll get. This is why English models for common entities are more accurate: they have more training data.

In simple terms, training data teaches the model what defines an entity. You label hundreds, thousands, even millions of entities in a block of text, and the model picks up the patterns using machine learning (ML) algorithms (more on this in a different article). The larger the quantity and variety of training data you supply, the more accurately the ML algorithm will discern those patterns.

Tagging 1,069 entities took me about 4 hours, given that I have only a basic understanding of Spanish and no familiarity with Colombia. Is this enough data? Well, that depends on:
The text has names of people, places, organisations, dates, positions, etc. To keep things general, we’ll call each of these an entity. By reading, you can tell which groups of words (tokens) represent each entity, but you don’t want to have to identify all of them yourself, especially if you’ve got hundreds or thousands of pages to go through. It is simply not an efficient use of your time.

Named Entity Recognition (NER): A method for identifying groups of words that represent a specific entity (like a person, organisation, brand, place).

You could use a readily available library like spaCy, which has a brilliant NER model for English but is not terribly accurate for other languages (in my use case, Colombian Spanish). The model fails to recognise entities that I care about. Perhaps you’ve run into the same problem if you’re looking for rather niche kinds of entities in less popular languages and are getting low accuracy with current NER models.
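To make the NER definition above a bit more concrete, here is a minimal sketch of what "groups of words that represent an entity" can look like as data: each entity is a labelled character span over the text. This is not spaCy code and not model output. The sentence, the `Entity` type, the `find_span` helper and the `PER`/`LOC` labels are all assumptions made up for illustration, and the spans are annotated by hand.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    start: int   # character offset where the entity begins
    end: int     # character offset just past the entity's last character
    label: str   # entity type, e.g. "PER" or "LOC" (illustrative labels)


def find_span(text: str, phrase: str, label: str) -> Entity:
    """Locate the first occurrence of `phrase` in `text` and label it."""
    start = text.index(phrase)  # raises ValueError if the phrase is absent
    return Entity(start, start + len(phrase), label)


# Hypothetical example sentence, annotated by hand (not model output).
text = "Gabriel García Márquez nació en Aracataca, Colombia."
entities = [
    find_span(text, "Gabriel García Márquez", "PER"),
    find_span(text, "Aracataca", "LOC"),
    find_span(text, "Colombia", "LOC"),
]

# Show each entity with the whitespace-separated words it covers.
for ent in entities:
    words = text[ent.start:ent.end].split()
    print(f"{ent.label}: {' '.join(words)} ({len(words)} token(s))")
```

An NER model's job is to produce spans like these automatically. In spaCy, a loaded pipeline exposes comparable information through `doc.ents`, where each entity carries `text`, `label_`, `start_char` and `end_char`.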