- Importance of Excel reporting
- Common approach to generating Excel files
- Passing large datasets in Excel
- Unexpected memory leaks: unraveling the enigma
- Final thoughts
Importance of Excel Reporting
In large enterprise companies, generating Excel reports has become an indispensable process for managing and analyzing extensive datasets efficiently. These reports are crucial for tracking performance metrics, financial records, and operational statistics, offering valuable insights that drive strategic decision-making.
In such environments, automation tools that generate these files play a pivotal role in streamlining report creation and ensuring accuracy. As we advance into 2024, generating an Excel file should be an easy and common task, right?
Common Approach to Generating Excel Files
To generate an Excel file from your own dataset, we will use the OpenXML library. First, install it into your project:
dotnet add package DocumentFormat.OpenXml
After installing the library and creating a template Excel file named “Test.xlsx,” add the following code to your application:
using System.Collections.Generic;
using System.IO;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Spreadsheet;

// this custom type is for your input data
public class DataSet
{
    public List<DataRow> Rows { get; set; }
}

// each row holds its row number and the contents of each cell
public class DataRow
{
    public int Index { get; set; }
    public Dictionary<string, string> Cells { get; set; }
}

private void SetValuesToExcel(string filePath, DataSet dataSet)
{
    if (string.IsNullOrWhiteSpace(filePath))
    {
        throw new FileNotFoundException($"File not found at this path: {filePath}");
    }

    using (SpreadsheetDocument document = SpreadsheetDocument.Open(filePath, true))
    {
        // each Excel document has an XML structure,
        // so we need to go deeper to reach our sheet
        WorkbookPart wbPart = document.WorkbookPart;

        // feel free to pass the sheet name as a parameter;
        // here we'll just use the default one
        Sheet theSheet = wbPart.Workbook
            .Descendants<Sheet>()
            .FirstOrDefault(s => s.Name.Value.Trim() == "Sheet1");

        // the next element in the hierarchy is the WorksheetPart;
        // from there we dive deeper to the SheetData object
        WorksheetPart wsPart = (WorksheetPart)wbPart.GetPartById(theSheet.Id);
        Worksheet worksheet = wsPart.Worksheet;
        SheetData sheetData = worksheet.GetFirstChild<SheetData>();

        // iterating through our data
        foreach (var dataRow in dataSet.Rows)
        {
            // getting the Row element from Excel's DOM
            var rowIndex = (uint)dataRow.Index;
            var row = sheetData
                .Elements<Row>()
                .FirstOrDefault(r => r.RowIndex == rowIndex);

            // if there is no such row, we create a new one
            if (row == null)
            {
                row = new Row { RowIndex = rowIndex };
                sheetData.Append(row);
            }

            // now we iterate through each cell in the row
            foreach (var dataCell in dataRow.Cells)
            {
                var cell = row.Elements<Cell>()
                    .FirstOrDefault(c => c.CellReference.Value == dataCell.Key);
                if (cell == null)
                {
                    cell = new Cell
                    {
                        CellReference = dataCell.Key,
                        DataType = CellValues.String
                    };
                    row.AppendChild(cell);
                }
                cell.CellValue = new CellValue(dataCell.Value);
            }
        }

        // after all changes to the Excel DOM we need to save it
        wbPart.Workbook.Save();
    }
}
And this is how to use the code above:
var filePath = "Test.xlsx";

// number of rows that we want to add to our Excel file
var testRowsCounter = 100;

// creating some data for it
var dataSet = new DataSet();
dataSet.Rows = new List<DataRow>();
string alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

for (int i = 0; i < testRowsCounter; i++)
{
    var row = new DataRow
    {
        Cells = new Dictionary<string, string>(),
        Index = i + 1
    };
    for (int j = 0; j < 10; j++)
    {
        row.Cells.Add($"{alphabet[j]}{i + 1}", Guid.NewGuid().ToString());
    }
    dataSet.Rows.Add(row);
}

// passing the path to our file and the data object
SetValuesToExcel(filePath, dataSet);
Metrics
| Count of rows | Time to process | Memory gained |
|---|---|---|
| 100 | 454 ms | 21 MB |
| 10 000 | 2.92 s | 132 MB |
| 100 000 | 10 min 47 s 270 ms | 333 MB |
In this table, we tested our function with various numbers of rows. As expected, increasing the number of rows leads to a steep drop in performance. To fix that, we can try another approach.
Passing Large Datasets in Excel
The approach demonstrated above is straightforward and sufficient for small datasets. However, as illustrated in the table, processing large datasets can be significantly slow. This method involves DOM manipulations, which are inherently slow. In such cases, the SAX (Simple API for XML) approach becomes invaluable. As the name suggests, SAX allows us to work directly with the XML of the Excel document, providing a more efficient solution for handling large datasets.
Change the code from the first example to this:
using (SpreadsheetDocument document = SpreadsheetDocument.Open(filePath, true))
{
    WorkbookPart workbookPart = document.WorkbookPart;

    // we take the original WorksheetPart of our template
    WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();

    // and add a new one
    WorksheetPart replacementPart = workbookPart.AddNewPart<WorksheetPart>();
    string originalSheetId = workbookPart.GetIdOfPart(worksheetPart);
    string replacementPartId = workbookPart.GetIdOfPart(replacementPart);

    // the main idea is to read through the XML of the original sheet
    OpenXmlReader openXmlReader = OpenXmlReader.Create(worksheetPart);
    // and write it to the new one, injecting our custom data
    OpenXmlWriter openXmlWriter = OpenXmlWriter.Create(replacementPart);

    while (openXmlReader.Read())
    {
        if (openXmlReader.ElementType == typeof(SheetData))
        {
            if (openXmlReader.IsEndElement)
                continue;

            // write the sheet element
            openXmlWriter.WriteStartElement(new SheetData());

            // write data rows
            foreach (var row in dataSet.Rows)
            {
                Row r = new Row
                {
                    RowIndex = (uint)row.Index
                };

                // start row
                openXmlWriter.WriteStartElement(r);
                foreach (var rowCell in row.Cells)
                {
                    Cell c = new Cell
                    {
                        DataType = CellValues.String,
                        CellReference = rowCell.Key,
                        CellValue = new CellValue(rowCell.Value)
                    };
                    // cell
                    openXmlWriter.WriteElement(c);
                }
                // end row
                openXmlWriter.WriteEndElement();
            }
            // end sheet
            openXmlWriter.WriteEndElement();
        }
        else
        {
            // skip the template's own rows, cells, and cell values;
            // our generated data replaces them
            if (openXmlReader.ElementType == typeof(Row)
                || openXmlReader.ElementType == typeof(Cell)
                || openXmlReader.ElementType == typeof(CellValue))
            {
                openXmlReader.ReadNextSibling();
                continue;
            }

            // this block copies all the less interesting parts
            // of the XML, which are nonetheless still necessary
            if (openXmlReader.IsStartElement)
            {
                openXmlWriter.WriteStartElement(openXmlReader);
            }
            else if (openXmlReader.IsEndElement)
            {
                openXmlWriter.WriteEndElement();
            }
        }
    }
    openXmlReader.Close();
    openXmlWriter.Close();

    // after all modifications we swap the sheets, pointing
    // the original file at the new WorksheetPart
    Sheet sheet = workbookPart.Workbook
        .Descendants<Sheet>()
        .First(c => c.Id.Value == originalSheetId);
    sheet.Id.Value = replacementPartId;

    // deleting the original worksheet
    workbookPart.DeletePart(worksheetPart);
}
Explanation: this code reads the XML elements of the source sheet one by one and copies them to a new sheet, injecting our custom rows in place of the original SheetData. It then points the workbook at the new sheet and deletes the old one.
Metrics
| Count of rows | Time to process | Memory gained |
|---|---|---|
| 100 | 414 ms | 22 MB |
| 10 000 | 961 ms | 87 MB |
| 100 000 | 3 s 488 ms | 492 MB |
| 1 000 000 | 30 s 224 ms | over 4.5 GB |
As you can see, processing speed for large numbers of rows has improved dramatically. However, we now have a memory issue that we need to address.
Unexpected Memory Leaks: Unraveling the Enigma
A discerning observer might have noticed an unexpected surge in memory consumption while processing 10 million cells in Excel. Although the weight of 10 million GUID strings is considerable, it shouldn't account for such a substantial increase. After meticulous investigation with memory profilers, the culprit was identified within the OpenXML library.
Specifically, the root cause can be traced to a flaw in the .NET package System.IO.Packaging, affecting both the .NET Standard and .NET Core versions. Interestingly, the issue seems absent in the classic .NET Framework, likely due to differences in the underlying WindowsBase code. In short, the OpenXML library uses ZipArchive internally, and ZipArchive copies data into a MemoryStream every time you update the file.
This happens only when you open the file in update mode, but there is no other way to update an existing file: that is the behavior of .NET itself.
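To make that concrete, here is an illustrative sketch of the two open modes. The buffering itself happens inside System.IO.Packaging, so this only shows the calling pattern, not the library's internals, and the file names are placeholders:

using System.IO;
using System.IO.Packaging;

// Illustration only: opening a package for update routes writes through
// the ZipArchive-based implementation, which buffers part contents in
// MemoryStream instances, so large sheets get duplicated in managed memory.
using var updatePackage = Package.Open(
    "Test.xlsx", FileMode.Open, FileAccess.ReadWrite);

// A package created for write-only access can stream straight to disk;
// this is what the workaround below takes advantage of.
using var createStream = File.Create("New.xlsx");
using var createPackage = Package.Open(
    createStream, FileMode.Create, FileAccess.Write);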
For those interested in delving deeper into this issue, further details can be found at GitHub Issue #23750.
Subsequently, after poring over the .NET source code and consulting peers facing similar challenges, I devised a workaround. If we can't use the SpreadsheetDocument object to work with our Excel file in Open mode, let's use it in Create mode with our own Package object. That path does not use the buggy ZipArchive under the hood and works as it should.
(Warning: this code currently works only with OpenXML v2.19.0 and earlier.)
Change our code to this:
using System;
using System.IO;
using System.IO.Packaging;
using System.Linq;
using System.Threading.Tasks;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Spreadsheet;

public class Builder
{
    public async Task Build(string filePath, string sheetName, DataSet dataSet)
    {
        var workbookId = await FillData(filePath, sheetName, dataSet);
        await WriteAdditionalElements(filePath, sheetName, workbookId);
    }

    public async Task<string> FillData(string filePath,
        string sheetName, DataSet excelDataRows)
    {
        // opening our file in create mode
        await using var fileStream = File.Create(filePath);
        using var package = Package.Open(fileStream, FileMode.Create, FileAccess.Write);
        using var excel = SpreadsheetDocument.Create(package, SpreadsheetDocumentType.Workbook);

        // adding a new WorkbookPart
        excel.AddWorkbookPart();
        var worksheetPart = excel.WorkbookPart.AddNewPart<WorksheetPart>();
        var workbookId = excel.WorkbookPart.GetIdOfPart(worksheetPart);

        // creating the necessary Worksheet and SheetData
        OpenXmlWriter openXmlWriter = OpenXmlWriter.Create(worksheetPart);
        openXmlWriter.WriteStartElement(new Worksheet());
        openXmlWriter.WriteStartElement(new SheetData());

        // write data rows
        foreach (var row in excelDataRows.Rows.OrderBy(r => r.Index))
        {
            Row r = new Row
            {
                RowIndex = (uint)row.Index
            };
            openXmlWriter.WriteStartElement(r);
            foreach (var rowCell in row.Cells)
            {
                Cell c = new Cell
                {
                    DataType = CellValues.String,
                    CellReference = rowCell.Key
                };
                // cell
                openXmlWriter.WriteStartElement(c);
                CellValue v = new CellValue(rowCell.Value);
                openXmlWriter.WriteElement(v);
                // cell end
                openXmlWriter.WriteEndElement();
            }
            // end row
            openXmlWriter.WriteEndElement();
        }
        // SheetData end
        openXmlWriter.WriteEndElement();
        // Worksheet end
        openXmlWriter.WriteEndElement();
        openXmlWriter.Close();
        return workbookId;
    }

    public async Task WriteAdditionalElements(string filePath, string sheetName, string worksheetPartId)
    {
        // here we add the workbook itself to the file;
        // without this, our document would be incomplete
        await using var fileStream = File.Open(filePath, FileMode.Open, FileAccess.ReadWrite, FileShare.None);
        using var package = Package.Open(fileStream, FileMode.Open, FileAccess.ReadWrite);
        using var excel = SpreadsheetDocument.Open(package);

        if (excel.WorkbookPart is null)
            throw new InvalidOperationException("Workbook part cannot be null!");

        var xmlWriter = OpenXmlWriter.Create(excel.WorkbookPart);
        xmlWriter.WriteStartElement(new Workbook());
        xmlWriter.WriteStartElement(new Sheets());
        xmlWriter.WriteElement(new Sheet { Id = worksheetPartId, Name = sheetName, SheetId = 1 });
        xmlWriter.WriteEndElement();
        xmlWriter.WriteEndElement();
        xmlWriter.Close();
    }
}
And use it like this:
var builder = new Builder();
await builder.Build(filePath, "Sheet1", dataSet);
Metrics
| Count of rows | Time to process | Memory gained |
|---|---|---|
| 100 | 291 ms | 18 MB |
| 10 000 | 940 ms | 62 MB |
| 100 000 | 3 s 767 ms | 297 MB |
| 1 000 000 | 31 s 354 ms | 2.7 GB |
Now, our measurements look satisfactory compared to the initial ones.
Final Thoughts
To be clear, the code showcased here serves purely demonstrative purposes. In practical applications, additional features such as support for various cell types or the replication of cell styles should be considered. And despite the significant optimizations demonstrated in the last example, applying it directly in real-world scenarios may not be feasible: for truly large Excel files, a chunk-based approach is typically more suitable, as sketched below.
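As a rough illustration of that idea, here is a minimal sketch. It assumes FillData is adapted to accept an IEnumerable<DataRow> instead of a fully materialized DataSet (a hypothetical change, not part of the code above), so rows are produced lazily and only the chunk currently being written lives in memory:

// Hypothetical sketch: produce rows lazily, one chunk at a time,
// instead of building the whole DataSet up front. In a real report
// each chunk would typically come from a paged database query; here
// we just synthesize GUID values as in the earlier examples.
static IEnumerable<DataRow> GenerateRows(int totalRows, int chunkSize = 10_000)
{
    string alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    for (int offset = 0; offset < totalRows; offset += chunkSize)
    {
        // build (or fetch) one chunk; earlier chunks become eligible for GC
        int upperBound = Math.Min(offset + chunkSize, totalRows);
        for (int i = offset; i < upperBound; i++)
        {
            var cells = new Dictionary<string, string>();
            for (int j = 0; j < 10; j++)
            {
                cells.Add($"{alphabet[j]}{i + 1}", Guid.NewGuid().ToString());
            }
            yield return new DataRow { Index = i + 1, Cells = cells };
        }
    }
}

Note that the OrderBy call in FillData would have to be dropped in this variant, since sorting forces the entire sequence into memory; the producer must yield rows already ordered by index.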
P.S.: If you prefer to avoid delving into the intricacies of generating office documents, you're welcome to explore my NuGet package, which simplifies and integrates all these functionalities seamlessly.