ai-labs.org blog

Context-free grammars for English noun phases in Brown terms.

We designed an extremely simple np extractor. See it on github.

Simple noun phrases extractor which uses context-free 
grammar appears to be more robust for cases with noise 
and grammatically broken structure 

There are many modern NLP tools, like spacy that are able to find noun-phrases. All of them work well on a consistent text that can be parsed into a correct syntax tree. In cases when syntax is broken, POS and tree parsers become weak and things go worse. This solution has a goal to use extremely simple context free grammars on top of very simple and in general less accurate POS tagger consuming it’s only strong side - robustness for noised broken text. It is expected to be much weaker on a normal text, but perform better on input like twitter and chats.

Context-free grammar

Can be represented in code as a set of templates like this (see here with detailed examples):

np_rules = {
    ('CD', 'JJ', 'NN'),
    ('CD', 'NN'),
    ('CD', 'RB', 'JJ', 'NN'),
    ('DT', 'CD', 'JJ', 'NN'),
    ('DT', 'CD', 'NN'),
    ('DT', 'CD', 'RB', 'JJ', 'NN'),
    ('DT', 'JJ', 'CD', 'JJ', 'NN'),
    ('DT', 'JJ', 'CD', 'NN'),
    ('DT', 'JJ', 'CD', 'RB', 'JJ', 'NN'),
    ('DT', 'JJ', 'JJ', 'NN'),
    ('DT', 'JJ', 'NN'),
    ('DT', 'JJ', 'RB', 'JJ', 'NN'),
    ('DT', 'NN'),
    ('DT', 'RB', 'JJ', 'NN'),
    ('DT', 'RB', 'VB', 'NN'),
    ('DT', 'VB', 'NN'),
    ('JJ', 'CD', 'JJ', 'NN'),
    ('JJ', 'CD', 'NN'),
    ('JJ', 'CD', 'RB', 'JJ', 'NN'),
    ('JJ', 'JJ', 'NN'),
    ('JJ', 'NN'),
    ('JJ', 'RB', 'JJ', 'NN'),
    ('NN',)
}

What is a Noun Phrase?

A noun phrase is a group of words that work together to name and describe a person, place, thing, or idea. Better to say “an entity” that behaves like a simple noun. Paraphrasing this, we can say that when we look at the structure of writing, we treat a noun phrase the same way we treat a common noun. See this to read related grammar topic.

Simple guide

here you can see a jupyter notebook with examples

  • First, ipmort required functions
    from np_extractor import get_nps_from_text, get_nps_from_tokens
    
  • If you just want to get NP groups from a text Just call a get_nps_from_text() giving the text as an input.
    text = 'What a great shame you do not look after the wildlife as well as a noun chunk.'
    print(get_nps_from_text(text))
    
    >>> ['a great shame', 'the wildlife', 'a noun chunk']
    
  • If you want get some more advanced info on rules which caused mathces and tokens indecies Use brown-based POS tagger (for instance NLTK) to obtain tagged tokens like this
    from nltk.tokenize import word_tokenize
    from nltk import pos_tag
    pos_tag(word_tokenize(text))
    

    and feed the list to get_nps_from_tokens()

print(get_nps_from_tokens(pos_tag(word_tokenize(text))))
>>> {'matches': [(1, 3), (9, 10), (14, 16)], 'rules': [('DT', 'NN'), ('DT', 'NN'), ('DT', 'JJ', 'NN')], 
'matches_text': [['a', 'great', 'shame'], ['the', 'wildlife'], ['a', 'noun', 'chunk']]}