I'm working with a massive XML file that is exported from Confluence to represent the current state of a given Confluence space. For those familiar with Confluence this is used for backing up and restoring or migrating Confluence spaces in or across environments.
I'm trying to automate some basic analysis on the XML so I can output some useful information for determining if our export data is "OK" based on a set of rules we have defined.
Given the size of some of these exports and the structure of the XML, analyzing them manually is painful and very time-consuming.
Essentially, I've whittled the XML down to an IEnumerable<XElement> of "object" elements.
using System.IO;
using System.Linq;
using System.Xml.Linq;

var filename = "export.xml";
var currentDirectory = Directory.GetCurrentDirectory();
var confluenceExportFilePath = Path.Combine(currentDirectory, filename);
XDocument confluenceExport = XDocument.Load(confluenceExportFilePath);
var objects = confluenceExport.Descendants("object");
Then I've taken that further and selected only the objects whose class attribute equals "Page", as I only care about the Page "objects". So far I've returned some basic "header" information about each Page.
var pages =
    from page in objects
    where (string)page.Attribute("class") == "Page"
    select new Page
    {
        Id = (string)page.Element("id"),
        // Cast the name attribute to string instead of reading .Value,
        // so a <property> without a name attribute does not throw.
        Title = (string)page.Elements("property").FirstOrDefault(property =>
            (string)property.Attribute("name") == "title"),
        Version = (int)page.Elements("property").FirstOrDefault(property =>
            (string)property.Attribute("name") == "version"),
    };
An example page "object" may look like this:
<object class="Page" package="com.atlassian.confluence.pages">
<id name="id">001</id>
<property name="title"><![CDATA[Test Page]]></property>
<property name="lowerTitle"><![CDATA[test page]]></property>
<property name="version">022</property>
<property name="creationDate">2020-06-15 20:13:00.195</property>
<property name="lastModificationDate">2020-06-18 12:01:04.482</property>
<property name="versionComment"><![CDATA[]]></property>
<collection name="bodyContents" >
<element class="BodyContent" package="com.atlassian.confluence.core">
<id name="id">011</id>
</element>
</collection>
<collection name="historicalVersions" >
<element package="com.atlassian.confluence.pages">
<id name="id">021</id>
</element>
<element package="com.atlassian.confluence.pages">
<id name="id">022</id>
</element>
</collection>
<property name="contentStatus"><![CDATA[current]]></property>
<collection name="attachments" >
<element package="com.atlassian.confluence.pages">
<id name="id">031</id>
</element>
<element package="com.atlassian.confluence.pages">
<id name="id">032</id>
</element>
</collection>
</object>
However, I want to dig a little deeper into the XML to get some more specific data, and I'm struggling to do that. For example, I would like to select the "id" value that is nested inside the bodyContents collection.
<collection name="bodyContents" >
<element class="BodyContent" package="com.atlassian.confluence.core">
<id name="id">011</id>
</element>
</collection>
Ultimately what I would like is to be able to output:
Page ID: 001
Page Title: Test Page
Page Version: 022
Page Body Content ID: 011
How can I go about getting this?
Answer:
The code below looks for the first <element> descendant whose class attribute is BodyContent and takes the value of its id child element. For the XML in your example, these search criteria will suffice.
var pages =
    from page in objects
    where (string)page.Attribute("class") == "Page"
    select new Page
    {
        BodyContentId =
            (string)page
                .Descendants("element")
                .FirstOrDefault(o => (string)o.Attribute("class") == "BodyContent")
                ?.Element("id"),
        // Other properties
    };
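To put the pieces together, here is a minimal sketch that fills all four requested values in one query. It is only an illustration under a few assumptions: `PageInfo`, `PageQuery`, `ReadPages` and `PropertyValue` are made-up names (the question does not show its own `Page` class), the version is kept as a string so that "022" keeps its leading zero, and the body content id is located through the collection named "bodyContents" rather than through the class attribute.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical container type; the question does not show its Page class.
public sealed record PageInfo(string? Id, string? Title, string? Version, string? BodyContentId);

public static class PageQuery
{
    // Returns one PageInfo per <object class="Page"> element in the export.
    public static List<PageInfo> ReadPages(XDocument export) =>
        export.Descendants("object")
            .Where(o => (string?)o.Attribute("class") == "Page")
            .Select(o => new PageInfo(
                Id: (string?)o.Element("id"),
                Title: PropertyValue(o, "title"),
                Version: PropertyValue(o, "version"),
                // First <id> under <collection name="bodyContents">/<element>.
                BodyContentId: (string?)o.Elements("collection")
                    .Where(c => (string?)c.Attribute("name") == "bodyContents")
                    .Elements("element")
                    .Elements("id")
                    .FirstOrDefault()))
            .ToList();

    // Value of the direct <property name="..."> child, or null when absent.
    private static string? PropertyValue(XElement obj, string name) =>
        (string?)obj.Elements("property")
            .FirstOrDefault(p => (string?)p.Attribute("name") == name);
}
```

Calling `PageQuery.ReadPages(XDocument.Load(confluenceExportFilePath))` and writing each field with `Console.WriteLine` should give output in the shape you asked for. Casting a missing element to `string?` yields null instead of throwing, so pages without a body content entry are still listed.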
Here is also a pointer on how to handle large XML files: in short, use an XmlReader to loop over the page <object> elements, then use XPath (https://www.w3schools.com/xml/xpath_syntax.asp) to retrieve the required values.
Code snippet:
// Requires: using System; using System.Xml.XPath;
// FILE_PATH is a placeholder for the path to the export file.
var docNav = new XPathDocument(FILE_PATH);
var navigator = docNav.CreateNavigator();
// "//object" visits every <object>; use "//object[@class='Page']" to visit only pages.
var nodeIterator = navigator.Select("//object");
while (nodeIterator.MoveNext())
{
    Console.WriteLine("Page ID: {0}", nodeIterator.Current.SelectSingleNode("id")?.Value);
    Console.WriteLine("Page Title: {0}", nodeIterator.Current.SelectSingleNode("property[@name='title']")?.Value);
    Console.WriteLine("Page Version: {0}", nodeIterator.Current.SelectSingleNode("property[@name='version']")?.Value);
    Console.WriteLine("Page Body Content ID: {0}", nodeIterator.Current.SelectSingleNode("collection[@name='bodyContents']//id")?.Value);
}
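Note that XPathDocument still reads the whole file into memory. For very large exports, the XmlReader idea mentioned above could look roughly like the sketch below. `PageStream` and `ReadPageObjects` are hypothetical names, and the sketch assumes that page objects carry class="Page" and that <object> elements are not nested inside one another.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;

public static class PageStream
{
    // Lazily yields each <object class="Page"> as a small XElement,
    // so only one page object is materialized at a time.
    public static IEnumerable<XElement> ReadPageObjects(XmlReader reader)
    {
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element
                && reader.Name == "object"
                && reader.GetAttribute("class") == "Page")
            {
                // ReadFrom consumes the whole element and leaves the reader
                // on the node that follows it, so Read() is not called here.
                yield return (XElement)XNode.ReadFrom(reader);
            }
            else
            {
                reader.Read();
            }
        }
    }
}
```

You would create the reader with `XmlReader.Create(FILE_PATH)` inside a `using` statement and iterate the result with `foreach`. Each yielded element is an ordinary XElement, so it can be queried with LINQ to XML in the same way as the in-memory version.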
