Detecting paragraph boundaries with a regular expression in messy text in Python

I have text that looks like this:

\n1. When an injunction is obtained against an innocent intermediary to prevent \nthe use of his facilities by wrongdoers for unlawful purposes, who should pay the \n\ncost of complying with the order? \n\n2. The respondents are three Swiss or German companies belonging to the \nRichemont Group. They design, manufacture and sell luxury branded goods such as \n\njewellery, watches and pens under well-known trade marks including Cartier, \n\nMontblanc and IWC. The internet has provided infringers with a powerful tool for \n\nselling counterfeit copies of branded luxury goods, generally of lower quality than \n\nthe genuine article and at lower prices. It allows them access to a world-wide market, \n\nas well as a simple way of concluding sales and collecting the price with practically \n\ncomplete anonymity. This illicit business is carried out on a large scale. The \n\nevidence is that at the outset of this litigation the respondents alone had identified \n\nsome 46,000 websites offering infringing copies of their branded goods. \n\n3. The two appellants and three other defendants in the proceedings below (who \ndid not participate in this appeal) are the five largest internet service providers (or \n\n"ISPs") serving the United Kingdom, with a combined market share exceeding 90%.

I am trying to split the block into two paragraphs:

Paragraph 1

\n1. When an injunction is obtained against an innocent intermediary to prevent \nthe use of his facilities by wrongdoers for unlawful purposes, who should pay the \n\ncost of complying with the order?

Paragraph 2

\n\n2. The respondents are three Swiss or German companies belonging to the \nRichemont Group. They design, manufacture and sell luxury branded goods such as \n\njewellery, watches and pens under well-known trade marks including Cartier, \n\nMontblanc and IWC. The internet has provided infringers with a powerful tool for \n\nselling counterfeit copies of branded luxury goods, generally of lower quality than \n\nthe genuine article and at lower prices. It allows them access to a world-wide market, \n\nas well as a simple way of concluding sales and collecting the price with practically \n\ncomplete anonymity. This illicit business is carried out on a large scale. The \n\nevidence is that at the outset of this litigation the respondents alone had identified \n\nsome 46,000 websites offering infringing copies of their branded goods. \n\n3. The two appellants and three other defendants in the proceedings below (who \ndid not participate in this appeal) are the five largest internet service providers (or \n\n"ISPs") serving the United Kingdom, with a combined market share exceeding 90%.

I am trying to achieve this split with a regular expression, but I am struggling to get the lookahead to work correctly.

I am testing in regex101 with the following expression: (\\n\d+\..*)(?!\\n\\n\d+\.)

In group 1 I am trying to capture the paragraph number and everything that follows it, until the lookahead kicks in and stops the match at the next paragraph number. Instead, the regex simply consumes the whole block, and I am not sure where I am going wrong. I would appreciate any pointers in the right direction that SO could provide.
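For anyone reproducing this outside regex101, here is a minimal sketch of what seems to be happening. The `text` value is a shortened, made-up stand-in for the block above (not the real judgment text), written as a raw string so that each `\n` is two literal characters, backslash and `n`, as in the demo. The greedy `.*` runs to the end of the string, and at that point the negative lookahead succeeds trivially because nothing follows at all.

```python
import re

# Hypothetical short stand-in for the question's text. The raw string keeps
# "\n" as two literal characters (backslash + n), matching the regex101 input.
text = r"\n1. First para \nmore text \n\n2. Second para \n\nstill second \n\n3. Third para"

# The expression from the question, unchanged.
original = re.compile(r"(\\n\d+\..*)(?!\\n\\n\d+\.)")

m = original.search(text)

# Greedy ".*" consumes everything up to the end of the string. A negative
# lookahead only checks the position where the match ends; at end-of-string
# there is no next paragraph marker, so the assertion passes.
print(m.group(1) == text)  # True: the whole block is captured
```

A negative lookahead does not make the preceding quantifier stop early. It only vetoes candidate end positions, and the very first candidate (end of string) is accepted.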

regex101 demo here

python-3.x regex paragraph

DanielH 08.11.2018

comment

Why not split on the regex \\n\\n\d+? The first paragraph has no \n before \n1, but that is not a problem, since it really is the first paragraph. - Asunez 08.11.2018
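A rough sketch of the splitting idea in Python, using `re.split`. The `text` value is a shortened, made-up stand-in for the question's block. Splitting on `\\n\\n\d+` exactly as written consumes the paragraph numbers, so the second call uses a lookahead variant (my assumption, not part of the comment) that keeps them.

```python
import re

# Hypothetical short stand-in for the question's text; "\n" is literal
# backslash + n, as in the regex101 demo.
text = r"\n1. First para \nmore text \n\n2. Second para \n\nstill second \n\n3. Third para"

# Splitting on the suggested pattern as-is: the digits are part of the
# delimiter, so they disappear from the resulting pieces.
without_numbers = re.split(r"\\n\\n\d+", text)
print(without_numbers[1])  # ". Second para \n\nstill second "

# Variant (an assumption): move the number into a lookahead so only the
# literal "\n\n" is consumed and each piece keeps its "N." prefix.
with_numbers = re.split(r"\\n\\n(?=\d+\.)", text)
print(len(with_numbers))   # 3
print(with_numbers[1])     # "2. Second para \n\nstill second "
```

Note that a `\n\n` followed by ordinary words (as in `\n\nstill`) is left alone, because the lookahead requires digits and a dot. A line that happens to start with a number and a dot would still be treated as a boundary.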

comment

Yes, you are right. There was also a problem with the quantifier in the first group. The working regex is \\n\\n\d+\..+?(?=\\n\\n\d+\.). Demo - DanielH 08.11.2018
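A quick check of that expression in Python, on a shortened, made-up stand-in for the question's block. With the lazy `.+?` and a positive lookahead it stops at the next paragraph marker, but as far as I can tell it skips the first paragraph (which starts with a single `\n`) and the last one (which has no following marker). The second pattern is a hypothetical variant, not from the thread, that tries to cover both ends.

```python
import re

# Hypothetical short stand-in for the question's text; "\n" is literal
# backslash + n, as in the regex101 demo.
text = r"\n1. First para \nmore text \n\n2. Second para \n\nstill second \n\n3. Third para"

# The expression from the comment: lazy ".+?" plus a positive lookahead.
working = re.compile(r"\\n\\n\d+\..+?(?=\\n\\n\d+\.)")
middle_only = working.findall(text)
print(middle_only)  # only paragraph 2 is found in this sample

# Hypothetical variant: accept one or more literal "\n" before the number,
# and let the lookahead also succeed at the end of the string.
variant = re.compile(r"(?:\\n)+(\d+\..+?)(?=\\n\\n\d+\.|$)")
paragraphs = variant.findall(text)
print(len(paragraphs))  # 3
```

With a single capturing group, `findall` returns only the group contents, so `paragraphs` holds the numbered text without the leading `\n` markers.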

comment

I was thinking of something more like this: link - Asunez 08.11.2018
