How to Read an Xml File in Python

Python classical way of handling big XML files

Logically reading the XML file in python

What are the modules we can attempt parsing XML files?

xml.etree.ElementTree

xml.dom.minidom

xml.dom.pulldown

xml.sax

How do nosotros measure out the operation of XML file processing?

In real world scenarios, well-nigh of the time we read the larger XML file, merely we might need but a few blocks of elements to compute the results.

          <ProteinEntry id="123">
<header>
<uid>CCCZ</uid>
<accession>A00002</accession>
<created_date>17-Mar-1987</created_date>
<seq-rev_date>17-Mar-1987</seq-rev_date>
<txt-rev_date>03-Mar-2000</txt-rev_date>
</header>

<protein>
<proper noun>cytochrome c</name>
</poly peptide>
<organism>
<source>chimpanzee</source>
<mutual>chimpanzee</mutual>
<formal>Pan troglodytes</formal>
</organism>
<reference>
<refinfo refid="A94601">
<authors>
<writer>Needleman, Southward.B.</author>
</authors>
<citation type="submission">submitted to the Atlas</citation>
<calendar month>October</month><twelvemonth>1968</year>
</refinfo>
<accinfo label="NEE">
<accession>A00002</accession>
<mol-type>protein</mol-type>
<seq-spec>1-104</seq-spec>
</accinfo>
</reference>
</ProteinEntry>

1. Get the file size

Go the file size in MB

2. Process the file through xml.etree

xml.etree.ElementTree file process

3. Identify the processing time

Processing Time with xml.etree

It took 34.11 seconds to read 683 MB of XML file, and there are 262K headers we captured.

Executed with xml.etree, took 34.xi seconds to read 600 MB of file

4. Writing customized XML processor

Compare xml.etree and custom method

All it took only 4.92 seconds to read the 683 MB of file and there are same 262K headers nosotros catpured . We are able to save virtually thirty–33 seconds of processing time. I would say execution time plays a key role when information technology comes to process multiple files concurrently (similar multi-threading).

Executed with customized Python classic manner of reading the file and information technology just took four.92 seconds to read 600 MB size file
          import os
import fourth dimension
import xml.etree.cElementTree as ET
def get_file_size_in_mb(file_name):
"""
Method to cheque the file size in MB
"""
file_size = 0
if bone.path.isfile(file_name):
file_size = bone.path.getsize(file_name) # Get the file size
file_size = round(file_size / (1024 * 1024.0), 2) # Convert into MB
file_size = '{:,.2f}'.format(file_size)
return str(file_size) + ' MB'
def read_xml_file(xml_file, element):
"""
Parse the xml file to xml.etree.cElementTree
"""
tree = ET.parse(xml_file)
root = tree.getroot()
number_of_element = len(root.findall(chemical element))
render '{:,.0f}'.format(number_of_element)
def read_xml_file_line_basis(xml_file, chemical element):
"""
Read the xml file and capture only the elements we need.
"""
start_tag = f'<{chemical element}>' # 3
end_tag = f'</{chemical element}>' # 2
start_tag_identified = Fake # three
captured_records = list() # four
captured_line = ''
with open up(xml_file) every bit f: # v
for line in f: # 6
if start_tag in line: # 7
start_tag_identified = True
if start_tag_identified: # 8
captured_line += line
if end_tag in line: # 9
captured_records.suspend(captured_line)
start_tag_identified = Fake # 10
captured_line = '' # x
render '{:,.0f}'.format(len(captured_records))
if __name__ == '__main__':
xml_file_name = 'large_xml_file.xml'
print(f'File Size: {get_file_size_in_mb(xml_file_name)}')
start_time = time.perf_counter()
counter = read_xml_file(xml_file_name, 'ProteinEntry/header')
end_time = time.perf_counter()
total_time = round(end_time - start_time, 2)
print(f'xml.etree.cElementTree - Total time taken:[{total_time}] seconds to identify the number of elements: [{counter}]')
print("<---------------------------------------->") start_time = time.perf_counter()
counter = read_xml_file_line_basis(xml_file_name, 'header')
end_time = time.perf_counter()
total_time = round(end_time - start_time, 2)
print(f'Customized XML Read - Total time taken:[{total_time}] seconds to identify the number of elements: [{counter}]')

Determination:

palumboformand54.blogspot.com

Source: https://medium.com/@ggopi19/python-classical-way-of-handling-large-xml-files-e7f86e95e6f9

0 Response to "How to Read an Xml File in Python"

Post a Comment

Iklan Atas Artikel

Iklan Tengah Artikel 1

Iklan Tengah Artikel 2

Iklan Bawah Artikel