gavin/transformers

Fork 0

Files

陈赣 06f1fd69a6

Self-hosted runner (nightly-past-ci-caller) / Get number (push) Has been cancelled

Details

Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.11 (push) Has been cancelled

Details

Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.10 (push) Has been cancelled

Details

Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.9 (push) Has been cancelled

Details

Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.8 (push) Has been cancelled

Details

Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.7 (push) Has been cancelled

Details

Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.6 (push) Has been cancelled

Details

Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.5 (push) Has been cancelled

Details

Self-hosted runner (benchmark) / Benchmark (aws-g5-4xlarge-cache) (push) Has been cancelled

Details

Build documentation / build (push) Has been cancelled

Details

Build documentation / build_other_lang (push) Has been cancelled

Details

CodeQL Security Analysis / CodeQL Analysis (push) Has been cancelled

Details

New model PR merged notification / Notify new model (push) Has been cancelled

Details

PR CI / pr-ci (push) Has been cancelled

Details

Slow tests on important models (on Push - A10) / Get all modified files (push) Has been cancelled

Details

Secret Leaks / trufflehog (push) Has been cancelled

Details

Update Transformers metadata / build_and_package (push) Has been cancelled

Details

Slow tests on important models (on Push - A10) / Model CI (push) Has been cancelled

Details

Check Tiny Models / Check tiny models (push) Has been cancelled

Details

Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Model CI (push) Has been cancelled

Details

Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Pipeline CI (push) Has been cancelled

Details

Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Example CI (push) Has been cancelled

Details

Self-hosted runner (Intel Gaudi3 scheduled CI caller) / DeepSpeed CI (push) Has been cancelled

Details

Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Trainer/FSDP CI (push) Has been cancelled

Details

Nvidia CI - Flash Attn / Setup (push) Has been cancelled

Details

Nvidia CI - Flash Attn / Model CI (push) Has been cancelled

Details

Nvidia CI / Setup (push) Has been cancelled

Details

Nvidia CI / Model CI (push) Has been cancelled

Details

Nvidia CI / Torch pipeline CI (push) Has been cancelled

Details

Nvidia CI / Example CI (push) Has been cancelled

Details

Nvidia CI / Trainer/FSDP CI (push) Has been cancelled

Details

Nvidia CI / DeepSpeed CI (push) Has been cancelled

Details

Nvidia CI / Quantization CI (push) Has been cancelled

Details

Nvidia CI / Kernels CI (push) Has been cancelled

Details

Doctests / Setup (push) Has been cancelled

Details

Doctests / Call doctest jobs (push) Has been cancelled

Details

Doctests / Send results to webhook (push) Has been cancelled

Details

Extras Smoke Test / Get supported Python versions (push) Has been cancelled

Details

Extras Smoke Test / Test extras on Python ${{ matrix.python-version }} (push) Has been cancelled

Details

Extras Smoke Test / Check Slack token availability (push) Has been cancelled

Details

Extras Smoke Test / Notify failures to Slack (push) Has been cancelled

Details

Self-hosted runner (AMD scheduled CI caller) / Trigger Scheduled AMD CI (push) Has been cancelled

Details

Stale Bot / Close Stale Issues (push) Has been cancelled

Details

first commit

2026-06-05 16:53:03 +08:00

11 KiB

Raw Permalink Blame History

This model was published in HF papers on 2021-10-16 and contributed to Hugging Face Transformers on 2022-09-30.

MarkupLM

Overview

The MarkupLM model was proposed in MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding by Junlong Li, Yiheng Xu, Lei Cui, Furu Wei. MarkupLM is BERT, but applied to HTML pages instead of raw text documents. The model incorporates additional embedding layers to improve performance, similar to LayoutLM.

The model can be used for tasks like question answering on web pages or information extraction from web pages. It obtains state-of-the-art results on 2 important benchmarks:

WebSRC, a dataset for Web-Based Structural Reading Comprehension (a bit like SQuAD but for web pages)
SWDE, a dataset for information extraction from web pages (basically named-entity recognition on web pages)

The abstract from the paper is the following:

Multimodal pre-training with text, layout, and image has made significant progress for Visually-rich Document Understanding (VrDU), especially the fixed-layout documents such as scanned document images. While, there are still a large number of digital documents where the layout information is not fixed and needs to be interactively and dynamically rendered for visualization, making existing layout-based pre-training approaches not easy to apply. In this paper, we propose MarkupLM for document understanding tasks with markup languages as the backbone such as HTML/XML-based documents, where text and markup information is jointly pre-trained. Experiment results show that the pre-trained MarkupLM significantly outperforms the existing strong baseline models on several document understanding tasks. The pre-trained model and code will be publicly available.

This model was contributed by nielsr. The original code can be found here.

Usage tips

In addition to input_ids, [~MarkupLMModel.forward] expects 2 additional inputs, namely xpath_tags_seq and xpath_subs_seq. These are the XPATH tags and subscripts respectively for each token in the input sequence.
One can use [MarkupLMProcessor] to prepare all data for the model. Refer to the usage guide for more info.

MarkupLM architecture. Taken from the original paper.

Usage: MarkupLMProcessor

The easiest way to prepare data for the model is to use [MarkupLMProcessor], which internally combines a feature extractor ([MarkupLMFeatureExtractor]) and a tokenizer ([MarkupLMTokenizer] or [MarkupLMTokenizerFast]). The feature extractor is used to extract all nodes and xpaths from the HTML strings, which are then provided to the tokenizer, which turns them into the token-level inputs of the model (input_ids etc.). Note that you can still use the feature extractor and tokenizer separately, if you only want to handle one of the two tasks.

from transformers import MarkupLMFeatureExtractor, MarkupLMProcessor, MarkupLMTokenizerFast


feature_extractor = MarkupLMFeatureExtractor()
tokenizer = MarkupLMTokenizerFast.from_pretrained("microsoft/markuplm-base")
processor = MarkupLMProcessor(feature_extractor, tokenizer)

In short, one can provide HTML strings (and possibly additional data) to [MarkupLMProcessor], and it will create the inputs expected by the model. Internally, the processor first uses [MarkupLMFeatureExtractor] to get a list of nodes and corresponding xpaths. The nodes and xpaths are then provided to [MarkupLMTokenizer] or [MarkupLMTokenizerFast], which converts them to token-level input_ids, attention_mask, token_type_ids, xpath_subs_seq, xpath_tags_seq. Optionally, one can provide node labels to the processor, which are turned into token-level labels.

[MarkupLMFeatureExtractor] uses Beautiful Soup, a Python library for pulling data out of HTML and XML files, under the hood. Note that you can still use your own parsing solution of choice, and provide the nodes and xpaths yourself to [MarkupLMTokenizer] or [MarkupLMTokenizerFast].

In total, there are 5 use cases that are supported by the processor. Below, we list them all. Note that each of these use cases work for both batched and non-batched inputs (we illustrate them for non-batched inputs).

Use case 1: web page classification (training, inference) + token classification (inference), parse_html = True

This is the simplest case, in which the processor will use the feature extractor to get all nodes and xpaths from the HTML.

from transformers import MarkupLMProcessor


processor = MarkupLMProcessor.from_pretrained("microsoft/markuplm-base")

html_string = """
 <!DOCTYPE html>
 <html>
 <head>
 <title>Hello world</title>
 </head>
 <body>
 <h1>Welcome</h1>
 <p>Here is my website.</p>
 </body>
 </html>"""

# note that you can also add provide all tokenizer parameters here such as padding, truncation
encoding = processor(html_string, return_tensors="pt").to(model.device)
print(encoding.keys())
dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'xpath_tags_seq', 'xpath_subs_seq'])

Use case 2: web page classification (training, inference) + token classification (inference), parse_html=False

In case one already has obtained all nodes and xpaths, one doesn't need the feature extractor. In that case, one should provide the nodes and corresponding xpaths themselves to the processor, and make sure to set parse_html to False.

from transformers import MarkupLMProcessor


processor = MarkupLMProcessor.from_pretrained("microsoft/markuplm-base")
processor.parse_html = False

nodes = ["hello", "world", "how", "are"]
xpaths = ["/html/body/div/li[1]/div/span", "/html/body/div/li[1]/div/span", "html/body", "html/body/div"]
encoding = processor(nodes=nodes, xpaths=xpaths, return_tensors="pt").to(model.device)
print(encoding.keys())
dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'xpath_tags_seq', 'xpath_subs_seq'])

Use case 3: token classification (training), parse_html=False

For token classification tasks (such as SWDE), one can also provide the corresponding node labels in order to train a model. The processor will then convert these into token-level labels. By default, it will only label the first wordpiece of a word, and label the remaining wordpieces with -100, which is the ignore_index of PyTorch's CrossEntropyLoss. In case you want all wordpieces of a word to be labeled, you can initialize the tokenizer with only_label_first_subword set to False.

from transformers import MarkupLMProcessor


processor = MarkupLMProcessor.from_pretrained("microsoft/markuplm-base")
processor.parse_html = False

nodes = ["hello", "world", "how", "are"]
xpaths = ["/html/body/div/li[1]/div/span", "/html/body/div/li[1]/div/span", "html/body", "html/body/div"]
node_labels = [1, 2, 2, 1]
encoding = processor(nodes=nodes, xpaths=xpaths, node_labels=node_labels, return_tensors="pt").to(model.device)
print(encoding.keys())
dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'xpath_tags_seq', 'xpath_subs_seq', 'labels'])

Use case 4: web page question answering (inference), parse_html=True

For question answering tasks on web pages, you can provide a question to the processor. By default, the processor will use the feature extractor to get all nodes and xpaths, and create [CLS] question tokens [SEP] word tokens [SEP].

from transformers import MarkupLMProcessor


processor = MarkupLMProcessor.from_pretrained("microsoft/markuplm-base")

html_string = """
 <!DOCTYPE html>
 <html>
 <head>
 <title>Hello world</title>
 </head>
 <body>
 <h1>Welcome</h1>
 <p>My name is Niels.</p>
 </body>
 </html>"""

question = "What's his name?"
encoding = processor(html_string, questions=question, return_tensors="pt").to(model.device)
print(encoding.keys())
dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'xpath_tags_seq', 'xpath_subs_seq'])

Use case 5: web page question answering (inference), parse_html=False

For question answering tasks (such as WebSRC), you can provide a question to the processor. If you have extracted all nodes and xpaths yourself, you can provide them directly to the processor. Make sure to set parse_html to False.

from transformers import MarkupLMProcessor


processor = MarkupLMProcessor.from_pretrained("microsoft/markuplm-base")
processor.parse_html = False

nodes = ["hello", "world", "how", "are"]
xpaths = ["/html/body/div/li[1]/div/span", "/html/body/div/li[1]/div/span", "html/body", "html/body/div"]
question = "What's his name?"
encoding = processor(nodes=nodes, xpaths=xpaths, questions=question, return_tensors="pt").to(model.device)
print(encoding.keys())
dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'xpath_tags_seq', 'xpath_subs_seq'])

Resources

MarkupLMConfig

autodoc MarkupLMConfig - all

MarkupLMFeatureExtractor

autodoc MarkupLMFeatureExtractor - call

MarkupLMTokenizer

autodoc MarkupLMTokenizer - build_inputs_with_special_tokens - get_special_tokens_mask - create_token_type_ids_from_sequences - save_vocabulary

MarkupLMTokenizerFast

autodoc MarkupLMTokenizerFast - all

MarkupLMProcessor

autodoc MarkupLMProcessor - call

MarkupLMModel

autodoc MarkupLMModel - forward

MarkupLMForSequenceClassification

autodoc MarkupLMForSequenceClassification - forward

MarkupLMForTokenClassification

autodoc MarkupLMForTokenClassification - forward

MarkupLMForQuestionAnswering

autodoc MarkupLMForQuestionAnswering - forward

11 KiB Raw Permalink Blame History

MarkupLM

Overview

Usage tips

Usage: MarkupLMProcessor

Resources

MarkupLMConfig

MarkupLMFeatureExtractor

MarkupLMTokenizer

MarkupLMTokenizerFast

MarkupLMProcessor

MarkupLMModel

MarkupLMForSequenceClassification

MarkupLMForTokenClassification

MarkupLMForQuestionAnswering

11 KiB

Raw Permalink Blame History