
[Openstack-operators] Openstack Ceph Backend and Performance Information Sharing

0 votes

Hello All,

For a long time we have been testing Ceph, from Firefly through Kraken, and have tried to
optimise many of the usual things: tcmalloc versions 2.1 and 2.4, jemalloc, setting debug
levels to 0/0, the op_tracker, and so on. I believe that with our hardware we have almost
reached the end of the road.

Some vendor tests confused us a lot, for example Samsung
(http://www.samsung.com/semiconductor/support/tools-utilities/All-Flash-Array-Reference-Design/downloads/Samsung_NVMe_SSDs_and_Red_Hat_Ceph_Storage_CS_20160712.pdf),
the Dell PowerEdge R730xd Performance and Sizing Guide for Red Hat Ceph Storage
(http://en.community.dell.com/techcenter/cloud/m/dellcloudresources/20442913/download),
and Intel
(http://www.flashmemorysummit.com/English/Collaterals/Proceedings/2015/20150813_S303E_Zhang.pdf).

In the end we used 3 replicas, with the config below, 4K block size, random and fully
write-only. (Most vendors actually test with 2 replicas, but I believe that is the wrong
way to do it: when a failure happens you have to wait 300 seconds (configurable) before
the OSD is marked out, and from blogs we understood that OSDs can sometimes go down and
come back up again, so it is important to set that interval carefully; still, we do not
want instances to freeze.)
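
For reference, a minimal ceph.conf sketch of the kind of settings discussed above (the
300-second figure presumably corresponds to mon_osd_down_out_interval; the exact values
here are illustrative assumptions, not our production config):

[global]
# keep three copies of every object, keep serving I/O as long as two are available
osd pool default size = 3
osd pool default min size = 2

[mon]
# how long a "down" OSD may stay down before it is marked "out" and
# recovery/backfill starts (seconds)
mon osd down out interval = 300
# require several peers to report an OSD down, to dampen flapping
mon osd min down reporters = 3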

I have read a lot about the OSD process eating huge amounts of CPU, and yes it does; we
are well aware that we cannot get the total IOPS capacity of the raw SSD drives.

My question is: can you please share results from the same or a similar config, or any
test or production results? The key is write performance, not 70% read / 30% write or
read-only workloads.

Hardware :

6 x Node
Each Node Has :
2 Socket CPUs, 1.8 GHz each, 16 cores total
3 SSD + 12 HDD (SSDs used as journals, 4 HDDs per SSD; see the sketch after this list)
RAID cards configured as RAID 0
We did not see any performance difference with the RAID card's JBOD mode, so we
continued with RAID 0.
The RAID card's write-back cache is also enabled, because it adds extra IOPS too!
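
For clarity, this is roughly how such a journal layout is created with ceph-disk in the
Jewel/Kraken era (device names below are placeholders, not our actual devices):

# /dev/sdb..sde are data HDDs, /dev/sda is the journal SSD; ceph-disk carves
# one journal partition per OSD out of the SSD
ceph-disk prepare /dev/sdb /dev/sda
ceph-disk prepare /dev/sdc /dev/sda
ceph-disk prepare /dev/sdd /dev/sda
ceph-disk prepare /dev/sde /dev/sda
ceph-disk activate /dev/sdb1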

Achieved IOPS : 35K (single client)
We tested with up to 10 clients, and Ceph shares this capacity fairly, at roughly 4K
IOPS for each client.

Test Command : fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1
--name=test --filename=test --bs=4k --iodepth=256 --size=1G --numjobs=8
--readwrite=randwrite --group_reporting
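
(Side note: if you want to take the filesystem and VM layers out of the measurement, fio
can also drive an RBD image directly through its rbd engine; the pool and image names
below are placeholders:)

fio --ioengine=rbd --clientname=admin --pool=rbd --rbdname=fio-test \
    --bs=4k --iodepth=256 --numjobs=1 --readwrite=randwrite --size=1G \
    --name=rbd-test --group_reporting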

Regards
Vahric Muhtaryan


OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
asked Feb 17, 2017 in openstack-operators by Vahric_Muhtaryan (800 points)

2 Responses

0 votes

There is quite a lot of information missing: how much RAM do the nodes have? Which SSDs?
Which kernel (there have been complaints of a performance regression on 4.4+)?

You also never state how you have configured the OSDs, their journals,
filestore or bluestore, etc...

You never specify how you're accessing the RBD device...

To achieve high IOPS you need higher-frequency CPUs. Also remember that the scale-out
architecture of Ceph means the more nodes you add, the better performance you'll get.

responded Feb 17, 2017 by Luis_Periquito (140 points)  
0 votes

@Vahric, FYI, if you use direct I/O instead of sync writes (which is what a database is
configured for by default), you will largely just be exercising the RBD cache. Look at
the latency in your numbers: it is lower than is possible for a packet to traverse the
network. You'll need to use sync=1 if you want to see what performance is like for sync
writes. You can reduce that latency with higher CPU frequencies (change the governor),
disabling C-states, a better network, the right NVMe for the journal, and other things.
In the end, we're happy to see even 500-600 IOPS for sync writes with numjobs=1,
iodepth=1 (256 is unreasonable).
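
(For illustration, a sync-write test and the CPU tuning described above might look
roughly like this; the file path is a placeholder and the cpupower invocations are
assumptions about the tooling available on the nodes:)

# single-threaded sync write test, measures per-write commit latency
fio --name=synctest --filename=/mnt/rbd/testfile --ioengine=libaio --direct=1 --sync=1 \
    --bs=4k --iodepth=1 --numjobs=1 --readwrite=randwrite --size=1G

# pin CPUs to the performance governor and disable deep C-states
cpupower frequency-set -g performance
cpupower idle-set -D 0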

@Luis, since this is an OpenStack list, I assume he is accessing it via
Cinder.

Warren

responded Feb 17, 2017 by Warren_Wang (680 points)